Business
Credit risk assessment using the factorization machine model with feature interactions
J. Quan and X. Sun
Credit risk—the likelihood a borrower defaults on obligations under a credit agreement—is the primary risk faced by financial institutions. Historical failures (e.g., Bank Herstatt) and recent crises underscore the need for accurate credit risk analysis. Credit risk assessment typically classifies applicants as default or non-default based on demographic, financial, and behavioral attributes. Traditional parametric methods such as logistic regression (LR) and linear discriminant analysis (LDA) can estimate class probabilities and handle mixed variable types but rely on linearity and independence assumptions that limit their ability to capture complex nonlinear relationships and interactions. Numerous machine-learning (ML) techniques (SVM, kNN, decision trees, and ANN) have been applied to improve performance over traditional statistical models. However, many approaches still inadequately model feature interactions, which can be critical in credit risk. Factorization machines (FM), introduced by Rendle, provide a universal predictor that explicitly models pairwise feature interactions with factorized parameters and can be trained in polynomial time, making them well-suited to sparse, high-dimensional data resulting from one-hot encoding. This study addresses the research gap by applying FM to credit risk assessment with an explicit focus on capturing feature interactions. We evaluate FM against LR, SVM, kNN, and ANN on four real-world datasets to determine whether modeling interactions improves predictive performance. We present preliminaries on FM, detail our methodology, report experimental results, and conclude with implications and future directions.
Prior work in credit risk assessment includes parametric methods (LR, LDA) widely used due to interpretability and probability estimation but limited by linear assumptions and independence of features. Studies have shown non-parametric and ML methods (e.g., random forests, SVM, kNN, decision trees, ANN) often outperform traditional statistical models in credit scoring. SVM, in particular, has achieved strong performance in various credit scoring applications. Research has also explored optimization enhancements and ensemble strategies for credit scoring and the integration of ML with expert rules to improve decision-making and manage risk guardrails. Despite these advances, many models do not fully capture interactions among features, especially in sparse, high-dimensional spaces caused by categorical encodings. Factorization machines, combining advantages of SVM and factorization models, explicitly parameterize pairwise interactions via low-dimensional latent factors, enabling effective learning under sparsity and offering polynomial-time training. Empirical evidence on FM in credit risk contexts has been limited; this study provides a comparative evaluation against established models across multiple public datasets.
Preliminaries and FM model: We consider supervised learning for binary classification (default vs non-default), with features often including categorical variables transformed via one-hot encoding, yielding sparse representations. Linear models (LR) and margin-based models (SVM) treat features independently, while FM augments a linear term with pairwise interaction terms parameterized by latent vectors. The second-order FM predicts y(x) = w0 + sum_i w_i x_i + sum_{i<j} (v_i^T v_j) x_i x_j, where w0 is a bias, w is a weight vector for linear effects, and V contains k-dimensional latent embeddings for features. The interaction term can be computed efficiently, enabling polynomial-time training even on sparse data. For binary classification, hinge or logit loss with regularization is used; parameters can be learned via stochastic gradient descent, alternating least squares, or MCMC.
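The efficient computation mentioned above follows from Rendle's identity, which rewrites the pairwise sum so it costs O(kn) instead of O(kn^2). A minimal NumPy sketch (function names are ours, not from the study) contrasts the factorized form with a naive double loop:

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order FM score: w0 + <w, x> + sum_{i<j} <v_i, v_j> x_i x_j.

    The pairwise term uses Rendle's O(k*n) identity:
    sum_{i<j} <v_i, v_j> x_i x_j
      = 0.5 * sum_f [ (sum_i v_{i,f} x_i)^2 - sum_i v_{i,f}^2 x_i^2 ].
    """
    linear = w0 + w @ x
    s = V.T @ x                                   # shape (k,): per-factor weighted sums
    pairwise = 0.5 * np.sum(s ** 2 - (V ** 2).T @ (x ** 2))
    return linear + pairwise

def fm_predict_naive(x, w0, w, V):
    """Reference O(k*n^2) computation of the same score."""
    out = w0 + w @ x
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            out += (V[i] @ V[j]) * x[i] * x[j]
    return out
```

Because the factorized form touches each feature only once per latent dimension, it stays cheap even on the sparse one-hot inputs typical of credit data, where most x_i are zero.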
Datasets: Four UCI datasets were used: (1) Bank marketing (45,211 instances; 16 features; 5,289 bad; 39,922 good), (2) Credit approval (300 instances; 15 features; mixed categorical/real; missing values present), (3) South German credit (1,000 instances; 21 features; 300 defaulters, 700 non-defaulters with 7 numerical and 13 categorical features), and (4) Statlog (Australian credit approval) (690 instances; 14 features; 307 bad; 383 good; mix of continuous and categorical).
Data cleaning and preprocessing: Missing values were handled via deletion or imputation (e.g., mean imputation, maximum-likelihood estimation); features with more than 90% missing values could be removed. Outliers were identified using statistical tests (z-score, modified z-score, box plots) or domain thresholds; implausible outliers were capped or replaced. Numerical features were normalized, and non-numerical attributes were one-hot encoded.
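A pipeline of this kind can be sketched with scikit-learn; the toy columns below are illustrative stand-ins for credit attributes, not the study's actual fields:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame standing in for a credit dataset (column names are hypothetical).
df = pd.DataFrame({
    "age": [25, 40, np.nan, 31],
    "income": [30_000, 52_000, 41_000, np.nan],
    "housing": ["rent", "own", "own", np.nan],
})

numeric = ["age", "income"]
categorical = ["housing"]

preprocess = ColumnTransformer([
    # Mean imputation, then z-score normalization for numeric columns.
    ("num", Pipeline([("impute", SimpleImputer(strategy="mean")),
                      ("scale", StandardScaler())]), numeric),
    # Most-frequent imputation, then one-hot encoding for categoricals.
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), categorical),
])

X = preprocess.fit_transform(df)   # 2 scaled numeric columns + 2 one-hot indicator columns
```

One-hot encoding is what produces the sparse, high-dimensional representation on which FM's factorized interactions are claimed to help.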
Compared models and settings: Baselines included LR (solvers: L-BFGS or SAG), SVM (RBF kernel; sigma set via a data-driven estimate or a grid over {1e-3,...,1e4}), kNN (k in [3,10]), and ANN (feedforward; 10–50 hidden nodes; learning rates {0.001, 0.01, 0.1, 1}). FM hyperparameters were searched over grids analogous to those used for SVM. Unspecified parameters used library defaults.
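The baseline configurations can be written down as model/grid pairs; this is a sketch assuming scikit-learn implementations, and the specific grid values are illustrative where the text gives only ranges:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Model/grid pairs mirroring the search ranges described in the text.
models = {
    "LR": (LogisticRegression(solver="lbfgs", max_iter=1000), {}),
    "SVM": (SVC(kernel="rbf"),
            # gamma plays the role of the RBF width parameter.
            {"gamma": [1e-3, 1e-2, 1e-1, 1.0, 1e1, 1e2, 1e3, 1e4]}),
    "kNN": (KNeighborsClassifier(), {"n_neighbors": list(range(3, 11))}),
    "ANN": (MLPClassifier(max_iter=1000),
            {"hidden_layer_sizes": [(10,), (30,), (50,)],
             "learning_rate_init": [0.001, 0.01, 0.1, 1.0]}),
}
```

Keeping each baseline as an (estimator, grid) pair lets one tuning loop handle all models uniformly.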
Experimental protocol: Each dataset was randomly split approximately 3:1 into training and testing sets; cross-validation was used to tune model parameters. Max iterations were set to 1000. Experiments ran in Python 3.8 on a standard desktop environment.
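The protocol above (3:1 split, cross-validated tuning on the training portion only) can be sketched as follows; the synthetic dataset is a stand-in for the real credit data, and SVM is used as the example model:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for one of the credit datasets (imbalanced binary labels).
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.7, 0.3], random_state=42)

# Roughly 3:1 train/test split, stratified to preserve the class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=42)

# Cross-validated grid search on the training portion only,
# so the held-out test set never influences hyperparameter choice.
search = GridSearchCV(SVC(kernel="rbf", max_iter=1000),
                      {"gamma": [1e-3, 1e-2, 1e-1, 1.0]}, cv=5)
search.fit(X_tr, y_tr)
test_acc = search.score(X_te, y_te)
```

Stratifying the split matters here because several of the study's datasets are imbalanced (e.g., bank marketing at roughly 12% positives).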
Evaluation metrics: Performance was assessed via ACC, MCC, precision (PRE), recall (REC), F-score, TPR, TNR, FNR, FPR, AUC, and G-mean, computed from the confusion matrix. AUC was emphasized as a comprehensive measure. G-mean combined sensitivity and specificity.
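All of these metrics derive from the four confusion-matrix counts; a small helper (our own, written from the standard definitions) makes the relationships explicit:

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Confusion-matrix metrics used in the study (positive class = 1)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    acc = (tp + tn) / (tp + tn + fp + fn)
    pre = tp / (tp + fp)
    rec = tp / (tp + fn)                  # TPR / sensitivity
    tnr = tn / (tn + fp)                  # specificity
    f1 = 2 * pre * rec / (pre + rec)
    mcc = (tp * tn - fp * fn) / np.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    gmean = np.sqrt(rec * tnr)            # balances the two class-wise rates
    return {"ACC": acc, "PRE": pre, "REC": rec, "TNR": tnr,
            "FPR": 1 - tnr, "FNR": 1 - rec, "F": f1,
            "MCC": mcc, "G-mean": gmean}
```

AUC is the exception: it is computed from ranked scores rather than a single confusion matrix, which is why it summarizes performance across all thresholds.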
Across all four datasets, FM achieved the strongest overall performance, particularly on AUC, ACC, MCC, and G-mean, with notable per-dataset results:
- Bank marketing: FM achieved ACC 0.9021, MCC 0.4922, REC/TPR 0.5736, F-score 0.5535, AUC 0.7343, G-mean 0.7318. LR attained the best PRE (0.6428), TNR (0.9754), and the lowest FPR (0.0245). SVM and ANN had AUCs of 0.7066 and 0.7022, respectively.
- Credit approval: FM led with ACC 0.9464, MCC 0.8546, PRE 0.9534, REC/TPR 0.9761, F-score 0.9647, FNR 0.0238, AUC 0.9053, G-mean 0.9147. Baseline AUCs: SVM 0.8259, LR 0.8077, kNN 0.7914, ANN 0.7226.
- South German credit: FM had ACC 0.7696, MCC 0.4725, PRE 0.9329, TPR 0.7637, TNR 0.7917, FPR 0.2083, AUC 0.8165, G-mean 0.7776. SVM achieved the highest REC/TPR (0.9463) and the lowest FNR (0.0457). kNN’s AUC was 0.7188; SVM and LR AUCs were 0.6830 and 0.6733.
- Statlog (Australian): FM achieved ACC 0.8844, MCC 0.7678, PRE 0.8852, F-score 0.8780, TNR 0.9004, FNR 0.0710, AUC 0.8928, G-mean 0.8832. SVM had the highest REC/TPR (0.9124). Baseline AUCs: LR 0.8715, SVM 0.8694, ANN 0.8130, kNN 0.7985.

Aggregate observations: FM consistently outperformed LR, SVM, kNN, and ANN on ACC, MCC, F-score, AUC, and G-mean across datasets. While LR sometimes yielded higher precision and specificity (e.g., bank marketing), and SVM sometimes yielded higher recall/TPR (e.g., South German and Australian), FM provided the best overall balance and the highest AUC on all datasets.
The study set out to evaluate whether explicitly modeling feature interactions via factorization machines improves credit risk assessment versus widely used models (LR, SVM, kNN, ANN). Results across four heterogeneous real-world datasets show FM delivers higher AUC, ACC, MCC, F-score, and G-mean, indicating better discrimination, balanced performance under class imbalance, and overall robustness. This directly addresses the research gap: FM's latent factorization captures interaction effects among features that linear or distance-based models overlook, and does so efficiently on the sparse, high-dimensional data resulting from one-hot encoding. In some situations, models optimized for specific criteria excelled (e.g., LR with higher precision and specificity on bank marketing; SVM with higher recall on the South German and Australian data), but FM typically achieved the best composite performance. The analysis supports FM as a strong default choice for credit risk scoring tasks requiring robust generalization and efficient learning under sparsity. FM excels because it integrates linear and pairwise interaction effects through factorized parameters, mitigates overfitting in high dimensions, and trains in polynomial time. Occasional metric-specific deviations are likely due to dataset imbalance and training-set size. Generalizability is supported by testing across datasets from different sources and distributions, which highlights FM's consistent advantage.
This work demonstrates that factorization machines, which explicitly model pairwise feature interactions, provide superior predictive performance for credit risk assessment compared to LR, SVM, kNN, and ANN on four public datasets. FM achieved the highest ACC, MCC, F-score, AUC, and G-mean across datasets, indicating improved discrimination and balanced error rates in sparse, high-dimensional settings common in credit scoring. The study provides empirical evidence supporting FM as a strong candidate model for credit risk, with computational efficiency benefits. Future research directions include refining loss functions and optimization strategies to further improve metric-specific performance, exploring feature engineering tailored to FM, and investigating online/streaming adaptations of FM for real-time credit risk assessment where continuous model updates are required.
While FM outperformed comparators overall, its performance was not uniformly best on all metrics for every dataset. In particular, on some datasets FM did not achieve the top precision, specificity (TNR), false positive rate (FPR), or recall/TPR; SVM and LR occasionally led specific metrics (e.g., higher TPR for SVM on some datasets; higher PRE/TNR and lower FPR for LR on bank marketing). Dataset imbalance and training set size may have influenced these outcomes. Optimization choices (SGD, ALS, MCMC) yielded varying results across datasets, suggesting sensitivity to training procedures and hyperparameters. The evaluation used four UCI datasets; broader validation on additional, larger, and more diverse real-world portfolios would further assess generalizability. The study did not implement or evaluate online/streaming FM updates, which remains an open question for real-time credit risk systems.