Building a core rule-based decision tree to explain the causes of insolvency in small and medium-sized enterprises more easily

S. Lee, K. Choi, et al.

This study introduces a novel approach, the harmonic average of support and confidence method (HSC), to derive key rules from decision trees, ultimately creating a core rule-based decision tree (CorDT) that unravels the factors contributing to insolvency in SMEs. Conducted by Sanghoon Lee, Keunho Choi, and Donghee Yoo, this research not only predicts insolvency but also provides tailored prevention strategies.

Introduction
The study addresses the challenge of predicting insolvency among SMEs, where financial statements are often unreliable, infrequently updated, or unavailable, particularly for startups and young firms. Traditional insolvency prediction relies on financial ratios, but for SMEs these data are limited and lag operational realities. The research question is whether non-financial technological feasibility assessment data can effectively predict SME insolvency and, furthermore, how to extract and communicate the most important decision rules from decision trees for actionable insights. The purpose is to build a decision tree-based insolvency prediction model using non-financial technological feasibility assessments, accommodate SME heterogeneity by company type (general, technology development, toll processing), and propose a new rule selection method (HSC) to construct a concise Core Rule-based Decision Tree (CorDT) that clearly explains insolvency causes and informs tailored prevention strategies.
Literature Review
Prior work in insolvency prediction has predominantly used financial data (e.g., Altman Z-score; Ohlson), with more recent studies incorporating machine learning methods (SVM, RF, XGBoost, neural networks). Increasingly, research blends financial and non-financial information, showing that combined datasets can improve default prediction. For SMEs, non-financial data such as technological capabilities, managerial attributes, market factors, CSR, and governance have demonstrated additional predictive value, especially given the scarcity or unreliability of SME financials. Studies using technological feasibility assessments indicate that variables such as managerial competence, financing ability, technical development capacity, product commercialization, and market-related indicators contribute to predicting insolvency and growth. For selecting rules from decision trees, traditional measures include classification accuracy, the Laplace correction (which adjusts for rules that cover few samples), and Weighted Relative Accuracy (WRA). However, these may either underweight support or overweight large-coverage rules. The paper proposes HSC, the harmonic mean of normalized support and confidence, to better balance rule generality and accuracy when selecting core rules from decision trees.
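To make the trade-off between the two traditional measures concrete, here is a minimal Python sketch. It assumes the common textbook forms of the Laplace correction and WRA, since the summary does not give the exact formulas the paper uses. The function names and all counts are hypothetical illustrations, not figures from the study.

```python
def laplace_accuracy(n_covered, n_correct, n_classes=2):
    """Laplace-corrected accuracy: pulls rules with few samples toward 1/n_classes."""
    return (n_correct + 1) / (n_covered + n_classes)


def weighted_relative_accuracy(n_covered, n_correct, n_class_total, n_total):
    """WRA = coverage * (rule precision - prior of the predicted class)."""
    coverage = n_covered / n_total
    precision = n_correct / n_covered
    prior = n_class_total / n_total
    return coverage * (precision - prior)


# Hypothetical counts (not from the paper): 1000 instances, 500 of them insolvent.
N_TOTAL, N_INSOLVENT = 1000, 500
small_rule = {"n_covered": 3, "n_correct": 3}      # tiny but pure leaf
large_rule = {"n_covered": 400, "n_correct": 300}  # broad leaf, 75% correct

for label, rule in (("small", small_rule), ("large", large_rule)):
    lap = laplace_accuracy(rule["n_covered"], rule["n_correct"])
    wra = weighted_relative_accuracy(
        rule["n_covered"], rule["n_correct"], N_INSOLVENT, N_TOTAL
    )
    print(f"{label}: Laplace={lap:.3f}  WRA={wra:.4f}")
```

With these made-up counts, the Laplace score ranks the three-instance rule first (0.800 versus about 0.749), while WRA ranks the broad rule first (0.1 versus 0.0015). This is the imbalance in either direction that motivates a measure weighing support and confidence together.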
Methodology
Research framework: (1) data acquisition and preprocessing; (2) addressing class imbalance via resampling and cost-sensitive learning; (3) feature selection; (4) decision tree modeling (train/test split 70/30); (5) rule selection using HSC and building Core Rule-based Decision Trees (CorDT) for interpretability and strategy derivation.
Data and types: Non-financial technological feasibility assessment data from KOSME for manufacturing SMEs were used, comprising three evaluation categories: management ability, business feasibility, and technical ability. SMEs were categorized into three technology types: General Type (GT: own technology and production base), Technology Development Type (TDT: technology but weak/no production base), and Toll Processing Type (TPT: specialized processing services without complete product development capability).
Target and features: Insolvency is defined as delinquency of more than 3 months post-lending. The target is binary (insolvent vs. healthy). Thirty-two assessment indicators from 2014 technological feasibility assessments served as independent variables (e.g., management stability, financing ability, credit status, sales management, competitive strength, technical application capacity; full list summarized in Table 2 of the paper).
Imbalance handling: The base dataset (no sampling) contains 4356 instances. Class balancing produced datasets of 2334 instances under under-sampling and 6378 instances under over-sampling across types (Table 3). Under-sampling methods: Random Under-Sampling, SpreadSubsample, ClusterCentroids. Over-sampling methods: Random Over-Sampling (ROS), SMOTE, ADASYN. A cost-sensitive approach was also tested, assigning heuristic class weights of 3.25, 2.52, and 2.38 for GT, TDT, and TPT, respectively.
Feature selection: Backward elimination and gain ratio were used to identify influential variables for each type and to remove variables detrimental to performance.
Modeling: The decision tree algorithm was implemented using Weka 3.8.3 and Python 3.8.5, with a 70/30 train/test split. Across the three SME types and seven sampling settings (no sampling, cost-sensitive, 3 under-sampling, 3 over-sampling), 21 datasets/models were developed and evaluated by hit ratio and AUC (Table 4).
Proposed rule selection (HSC): For each leaf node (rule) in the best-performing decision tree per type, compute support (instances in leaf/total instances) and confidence (correctly classified instances in leaf/instances in leaf), normalize each by its maximum across rules, then compute the harmonic mean: HSC = 2 * (N_support * N_confidence) / (N_support + N_confidence). Rank rules by HSC and select the top-n to build the CorDT.
Evaluation of rule selection methods: Laplace, WRA, and HSC were compared using distances between the curves of classification accuracy (CA) and cumulative ratio of correctly classified cases (CRCC): Euclidean, Manhattan, and Canberra distances (Table 6). Lower distances indicate better alignment between rule importance ranking and desirable properties (high CA and CRCC).
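The HSC computation described above can be sketched in a few lines of Python. This is an illustrative reading of the summary's description, not the authors' code: the function names, the tuple layout, and the rule counts below are all hypothetical.

```python
def hsc_scores(rules, n_total):
    """Return {rule name: HSC score}.

    rules: list of (name, n_in_leaf, n_correct) tuples, one per leaf node.
    n_total: total number of instances in the dataset.
    """
    support = {name: n_in_leaf / n_total for name, n_in_leaf, _ in rules}
    confidence = {name: n_correct / n_in_leaf for name, n_in_leaf, n_correct in rules}
    max_support = max(support.values())
    max_confidence = max(confidence.values())

    scores = {}
    for name in support:
        n_sup = support[name] / max_support           # normalized support
        n_conf = confidence[name] / max_confidence    # normalized confidence
        denom = n_sup + n_conf
        # Harmonic mean of the two normalized values (0 if both are 0).
        scores[name] = 2 * n_sup * n_conf / denom if denom else 0.0
    return scores


def top_rules(rules, n_total, n=5):
    """Names of the n rules with the highest HSC, best first."""
    scores = hsc_scores(rules, n_total)
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Hypothetical leaves (not from the paper): (name, instances in leaf, correctly classified).
rules = [("R1", 400, 300), ("R2", 3, 3), ("R3", 200, 180)]
print(hsc_scores(rules, n_total=1000))
print(top_rules(rules, n_total=1000, n=2))
```

In this made-up example the tiny but perfectly accurate leaf R2 scores lowest, because the harmonic mean stays small whenever either normalized component is small. The broad leaves R1 and R3 are therefore the ones selected as core rules.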
Key Findings
- Sampling and model performance: Over-sampling outperformed under-sampling; cost-sensitive learning did not improve performance on this dataset. SMOTE achieved the highest average hit ratio across types at 77.6% (Table 4). Per type: General Type best hit ratio 80.5% with ADASYN (AUC 0.851), Technology Development Type 77.7% with SMOTE (AUC 0.780), Toll Processing Type 75.1% with SMOTE (AUC 0.792). Average AUCs under over-sampling were superior to no sampling and under-sampling.
- Important features by type (Table 5):
  • General Type: Management stability (M2), financing ability (M6), credit status (M7), business propulsion (M8), CEO’s professionalism (M10), transaction stability (B1), production efficiency (T5), quality & process improvement (T7). Management ability indicators dominated.
  • Technology Development Type: Internal control (M3), financing ability (M6), transaction stability (B1), sales management (B2), competitive strength (B8), market growth (B9), technical application capacity (T10). Business feasibility indicators (B2, B8) and internal control (M3) were critical; only one technical ability indicator (T10) emerged.
  • Toll Processing Type: Management stability (M2), business propulsion (M8), CEO’s reliability (M9), sales management (B2), future profitability (B4), market position (B6), competitive strength (B8), market environment (B10), technical application capacity (T10). Business feasibility and management ability were more influential than technical ability.
- HSC vs. Laplace and WRA (Table 6): Averaged over types, HSC achieved the smallest distances between CA and CRCC curves, indicating better balance between rule accuracy and coverage. Euclidean: Laplace 208.2, WRA 106.6, HSC 104.6. Manhattan: Laplace 671, WRA 328, HSC 319. Canberra: Laplace 5.3, WRA 2.6, HSC 2.4.
- Core Rule-based Decision Trees (CorDT): Using top-5 HSC rules, the models explained substantial portions of insolvency cases:
  • General Type: 666/1044 cases (63.8%). Most descriptive rule (G10): low credit status (M7 ≤3) and specific range of management stability (M2) strongly associated with insolvency.
  • Technology Development Type: 175/301 cases (58.1%). Most descriptive rule (TD2): low competitive strength (B8), low sales management (B2), low financing ability (M6), with moderate transaction stability (B1) indicate high insolvency risk.
  • Toll Processing Type: 786/990 cases (79.4%). Most descriptive rule (TP5): low scores across management stability (M2), CEO’s reliability (M9), sales management (B2), competitive strength (B8), market position (B6), market environment (B10), with business propulsion (M8) >2.3 still yielding high risk if others are weak.
- Strategy implications from CorDT: For General Type, prioritize improving credit status (M7), management stability (M2), and financing ability (M6). For Technology Development Type, strengthen competitive positioning (B8) and sales management (B2), ensure internal control (M3), and monitor market growth (B9). For Toll Processing Type, emphasize business propulsion (M8), management stability (M2), market environment readiness (B10), and leadership credibility (M9).
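The three distance measures reported in Table 6 have standard definitions, sketched below in plain Python. How the paper samples the CA and CRCC curves is not stated in this summary, so the sketch simply assumes two equal-length lists of percentages, one value per ranked rule. The curve values are hypothetical.

```python
import math


def euclidean(a, b):
    """Square root of the summed squared point-wise differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute point-wise differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def canberra(a, b):
    """Sum of |x - y| / (|x| + |y|); a 0/0 term is counted as 0 by convention."""
    total = 0.0
    for x, y in zip(a, b):
        denom = abs(x) + abs(y)
        if denom:
            total += abs(x - y) / denom
    return total


# Hypothetical curves (not from the paper), one value per ranked rule, in percent.
ca_curve = [90.0, 80.0, 70.0]    # classification accuracy
crcc_curve = [30.0, 60.0, 90.0]  # cumulative ratio of correctly classified cases

for name, fn in (("Euclidean", euclidean), ("Manhattan", manhattan), ("Canberra", canberra)):
    print(f"{name}: {fn(ca_curve, crcc_curve):.3f}")
```

Canberra divides each difference by the magnitude of the two values, so it is far smaller in scale than the other two measures, which is consistent with the single-digit Canberra figures in the comparison above.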
Discussion
The findings demonstrate that non-financial technological feasibility assessments can effectively predict SME insolvency, mitigating limitations of SME financial data. Over-sampling, particularly SMOTE, improves classifier performance in imbalanced settings. Crucially, insolvency drivers differ by SME technology type, validating the need for type-specific models. The proposed HSC method better balances rule accuracy and coverage than Laplace and WRA, enabling the construction of compact CorDTs that retain high explanatory power and facilitate managerial interpretation. These interpretable rules help credit evaluators and SME managers identify critical weaknesses (e.g., credit status and management stability for General Type; competitive strength and sales management for Technology Development Type; business propulsion and management stability for Toll Processing Type) and design targeted interventions. The approach aligns with literature showing the value of non-financial indicators and offers a practical, transparent tool for lenders and policymakers managing SME credit risk.
Conclusion
The study develops decision tree-based insolvency prediction models for SMEs using non-financial technological feasibility assessment data, addressing data scarcity and timeliness issues in SME financials. By proposing the HSC rule selection method and building streamlined CorDTs, the work enhances interpretability and provides type-specific explanations and prevention strategies. Empirically, over-sampling (notably SMOTE) yielded the best average performance (77.6% hit ratio), and HSC outperformed Laplace and WRA in balancing rule accuracy and support. Future research will incorporate environmental and market variables, additional non-financial data, and anomaly detection methods (cluster-, distance-, and density-based) to further improve prediction performance and robustness.
Limitations
The models used only technological feasibility assessment (non-financial) data and did not account for environmental and macroeconomic variables over the repayment horizon. While class imbalance was addressed via resampling and cost-sensitive heuristics, these methods can introduce bias or overfitting. Some indicators (e.g., sales management scoring granularity) may require redesign for better discrimination, and external validity beyond the studied context (Korean SMEs assessed by KOSME) may be limited.