Permutation-based identification of important biomarkers for complex diseases via machine learning models



X. Mi, B. Zou, et al.

Discover how PermFIT, a groundbreaking feature importance test developed by Xinlei Mi, Baiming Zou, Fei Zou, and Jianhua Hu, revolutionizes the identification of key biomarkers in complex diseases. This innovative tool enhances prediction accuracy without requiring model refitting, demonstrating its practical utility through rigorous analysis of TCGA kidney tumor and HITChip atlas data.

Introduction
With the rise of high-throughput omics, large high-dimensional datasets (e.g., TCGA) enable discovery of the molecular mechanisms of disease, yet complex etiologies and feature interactions hinder the interpretability of flexible machine learning models such as deep neural networks (DNNs), random forests (RFs), and support vector machines (SVMs). The central problem addressed is how to identify and statistically validate important features (biomarkers) within black-box models, so as to aid hypothesis generation for prevention, diagnosis, treatment, and prognosis in complex diseases. The authors propose PermFIT, a permutation-based feature importance test that provides valid, model-agnostic inference on feature importance, aiming to overcome non-transparency and to improve predictive performance by selecting informative features.
Literature Review
Existing strategies for feature importance include:
- Surrogate models (e.g., linear models, decision trees) that approximate the black-box model but depend on the surrogate specification.
- Shapley value–based methods (e.g., SHAP), which give local attributions but are computationally intensive and lack valid hypothesis testing.
- Conditional randomization tests (CRT) and model-X knockoffs (including HRT as a holdout approximation, and KnockoffGAN), which can provide inference but generally assume a known or well-estimated feature distribution/covariance and can suffer when it is misestimated.
- Gaussian mirror approaches (INGM, SNGM), which provide error control but may be computationally costly or lose performance.
Permutation-based methods, widely used especially in RF and some DNN contexts, measure the change in prediction error upon shuffling a feature and avoid assumptions on feature distributions, but prior approaches often lacked valid inference or generality across model classes. Altmann et al. proposed a corrected permutation importance for RF, but generalizing it is challenging. This context motivates a general, inference-valid, and computationally efficient permutation-based approach applicable to DNN, RF, and SVM.
Methodology
PermFIT defines a population feature importance score M_j for feature X_j as the expected increase in prediction error when X_j is permuted while the other features are held fixed. For continuous outcomes with regression function μ(X) = E(Y|X), M_j is defined as E[(μ(X) − μ(X^(j)))^2], where X^(j) denotes X with its jth component replaced by an independent draw from the marginal distribution of X_j. Under a linear model, M_j equals 2β_j^2 Var(X_j), i.e., it is proportional to the squared standardized coefficient, linking it to classical importance measures. M_j can also be expressed as the difference in expected squared error before and after permutation, so once μ is approximated it can be estimated by permutation without refitting the model.

Estimation and inference: A fitted function μ̂ from a chosen ML model (DNN, RF, or SVM) approximates μ. The empirical importance M̂_j is computed on validation data as the average increase in loss (squared error for continuous outcomes; binomial deviance for binary outcomes) when X_j is permuted. To mitigate bias from overfitting, PermFIT employs cross-fitting: the data are split into K folds; for each fold, a model μ̂_k is trained on the fold's complement and M̂_j is computed on the held-out fold via permutation; the fold-specific estimates are aggregated into M̂_j^CV and its variance, enabling a one-sided Z-test of H0: M_j = 0. No model refitting is required for each permutation because predictions from μ̂ are reused across permutations of features.

Binary outcomes: For Y ∈ {0,1}, μ(X) = Pr(Y=1|X) and M_j is defined via the expected binomial-deviance difference between the original and permuted X_j; estimation parallels the continuous case using μ̂.

DNN implementation: Feedforward fully connected networks with 4 hidden layers (50, 40, 30, 20 nodes) and ReLU activations are trained using mini-batch stochastic optimization with Adam and L2 regularization (λ = 1e−4).
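For intuition, the linear-model special case quoted above follows in two lines from the definition. Writing μ(X) = β^T X and letting X'_j denote the independent copy of X_j substituted by the permutation:

```latex
M_j = \mathbb{E}\big[(\mu(X)-\mu(X^{(j)}))^2\big]
    = \mathbb{E}\big[\beta_j^2 (X_j - X_j')^2\big]
    = \beta_j^2\big(\mathbb{E}[X_j^2] - 2\,\mathbb{E}[X_j]\,\mathbb{E}[X_j'] + \mathbb{E}[X_j'^2]\big)
    = 2\beta_j^2\,\operatorname{Var}(X_j).
```

The cross term factorizes because X'_j is drawn independently of X_j from the same marginal distribution, so only β_j and the marginal variance of X_j enter the score.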
To enhance robustness and accuracy, bootstrap aggregating (bagging) of DNNs is used with bag size 100 and batch size 50; a scoring system selects a subset of well-performing networks (the "many could be better than all" strategy), as implemented in the R package deepTL.

Comparator methods: SHAP and LIME are computed via the R package iml (feature importance as mean absolute Shapley/LIME values). HRT is implemented in R. SNGM-DNN follows Xing et al. RF uses the randomForest R package with 1000 trees, which provides standard permutation importance with standard errors (Vanilla-RF). SVM uses e1071 with radial kernels; hyperparameters are tuned via fivefold cross-validation. RFE-SVM uses caret's rfe.

Evaluation design: Simulations generate correlated, block-structured features with signals in 5 blocks, within-block correlation ρ ∈ {0, 0.2, 0.5, 0.8}, sample sizes N ∈ {1000, 5000} (plus smaller-sample settings N = 300, p = 100 and N = 500, p = 200), and both continuous and binary outcomes with linear, nonlinear, and interaction effects. Each scenario is replicated 100 times. Methods are assessed by type-I error control, detection frequency of true and false features, and prediction metrics (MSPE for regression; accuracy and AUC for classification) before and after feature selection.

Real data applications: TCGA kidney cancer RPPA data (118 proteins; long-term vs short-term survivor, LTS vs STS, classification) and HITChip Atlas microbiome data (129 genera plus demographics, predicting BMI category coded 1–6). Fivefold cross-validation is repeated 100 times to assess performance and stability. Feature inclusion thresholds: p < 0.1 for HRT-DNN, Vanilla-RF, and PermFIT; top 20 by importance for SHAP-DNN, LIME-DNN, SNGM-DNN, and RFE-SVM.
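The estimation-and-inference recipe above can be sketched in a few lines. A minimal illustration, assuming a continuous outcome and squared-error loss, with an ordinary least-squares base learner standing in for the paper's DNN/RF/SVM fits; the fold count K = 5, the toy data, and the function name `permfit_importance` are illustrative assumptions, not the authors' deepTL implementation:

```python
import math
import numpy as np

def permfit_importance(X, y, K=5, seed=0):
    """Cross-fitted permutation importance with a one-sided Z-test per feature."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    folds = np.array_split(rng.permutation(n), K)
    diffs = [[] for _ in range(p)]  # per-observation loss increases, pooled over folds
    for test in folds:
        train = np.setdiff1d(np.arange(n), test)
        # Base learner: OLS with intercept (a stand-in for DNN/RF/SVM),
        # fitted only on the complement of the held-out fold (cross-fitting).
        A = np.column_stack([np.ones(train.size), X[train]])
        beta, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        Xt = X[test]
        base_loss = (y[test] - (beta[0] + Xt @ beta[1:])) ** 2
        for j in range(p):
            Xp = Xt.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])  # permute feature j on the held-out fold
            perm_loss = (y[test] - (beta[0] + Xp @ beta[1:])) ** 2
            # No refitting: the same fitted model scores the permuted copy.
            diffs[j].extend(perm_loss - base_loss)
    results = []
    for j in range(p):
        d = np.asarray(diffs[j])
        m_hat = d.mean()                      # cross-fitted importance estimate
        se = d.std(ddof=1) / math.sqrt(d.size)
        z = m_hat / se
        p_one_sided = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z > z) under H0: M_j = 0
        results.append((m_hat, p_one_sided))
    return results
```

On simulated data with y = 2·x0 + noise, the signal feature's estimate lands near its population value 2β^2 Var(X_0) with a tiny p-value, while null features hover near zero, mirroring the type-I error behavior described in the findings.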
Key Findings
Simulations (continuous outcomes):
- Type-I error: PermFIT controls type-I error near nominal levels across scenarios. At ρ = 0, PermFIT maintains ~0.05 type-I error across null features, while Vanilla-RF exhibits inflated type-I error (~0.09). HRT-DNN shows slight inflation at N = 1000 and severe inflation in small-sample, high-dimensional settings (e.g., N = 300, p = 100; N = 500, p = 200), likely due to covariance estimation challenges.
- Power and precision: PermFIT-DNN most accurately differentiates true from null features across correlation levels, handling nonlinearities and interactions. PermFIT-SVM consistently identifies interacting causal variables even under correlation; RFE-SVM often selects correlated nulls and misses interactions at higher ρ.
- RF frameworks: Vanilla-RF shows higher power for some interactions but many false positives among correlated nulls (S0), especially at high ρ (e.g., >80% false positives in S0 at ρ = 0.8, N = 1000). PermFIT-RF reduces false positives compared with Vanilla-RF.
- Prediction after selection: Refitting on selected features improves MSPE for most models. PermFIT-DNN and HRT-DNN yield the lowest MSPEs overall; PermFIT-DNN often outperforms HRT-DNN at N = 1000 and low-to-moderate correlation (ρ ≤ 0.2). LIME-DNN and SNGM-DNN fail to capture key features in some scenarios, degrading performance; RFE-SVM fails at high correlation.

Simulations (binary outcomes): Results mirror the continuous case; PermFIT maintains type-I error control with strong detection of true features (details in the supplementary materials).

TCGA kidney cancer RPPA (LTS vs STS classification):
- Correlated protein clusters exist; methods prone to selecting correlated features (Vanilla-RF, RFE-SVM, SNGM-DNN) identify a cluster (SRC, RAF1, RB1, RPS6, YWHAZ, EGFR) not selected by PermFIT, consistent with the simulation evidence of false positives under correlation.
- Prediction improvement (5-fold CV, 100 repeats): RF baseline accuracy 0.694, improved to 0.732 on average by PermFIT-RF and to 0.713 by Vanilla-RF; SVM baseline 0.690, improved to 0.744 by PermFIT-SVM and to 0.709 by RFE-SVM; DNN accuracy 0.751 with PermFIT-DNN, 0.750 with HRT-DNN, 0.731 with SHAP-DNN, 0.650 with LIME-DNN, and 0.723 with SNGM-DNN. AUC results align with accuracy.
- Biomarkers: All three PermFIT variants identify CDKN1A, EIF4EBP1, INPP4B, and SERPINE1 as significant. INPP4B is the top biomarker, with p-values 1.3E−05 (PermFIT-DNN), 9.1E−07 (PermFIT-RF), and 4.5E−05 (PermFIT-SVM). Additional findings include XRCC1 (PermFIT-DNN, PermFIT-SVM), ANXA7 (PermFIT-DNN, PermFIT-RF), MYH9 and NRG1 (PermFIT-DNN), and STK11 (PermFIT-RF).

HITChip Atlas (BMI prediction):
- Features are highly correlated; Vanilla-RF and RFE-SVM show inflated importance for correlated taxa (e.g., highly correlated Streptococcus groups frequently selected by RFE-SVM) and fail to improve, or even worsen, performance relative to the full models.
- PermFIT-based selections yield the largest improvements in MSPE and in the Pearson correlation between predictions and true BMI levels across DNN, RF, and SVM.
- Identified determinants: Age is the most significant factor; nationality is selected by PermFIT-DNN and PermFIT-SVM. Among genera, Megasphaera elsdenii is selected by all PermFIT methods, Eggerthella lenta by PermFIT-SVM, and an uncultured Clostridiales group by PermFIT-RF.
Discussion
The study addresses the challenge of interpreting black-box ML models by providing a general, model-agnostic, permutation-based feature importance test with valid statistical inference. Through extensive simulations and two real datasets, PermFIT reliably identifies causal features, controls false positives under correlation, and improves predictive performance when models are refit on selected features. The findings support that permutation importance, when coupled with cross-fitting, yields robust inference without requiring knowledge of feature distributions or covariance. In practice, PermFIT mitigates spurious selection of correlated but non-causal features observed in alternatives such as Vanilla-RF and RFE-SVM. Among model frameworks, pairing PermFIT with DNNs consistently provides strong empirical performance in both selection accuracy and predictive metrics. These results demonstrate PermFIT’s relevance for biomarker discovery and enhanced prediction in complex disease studies.
Conclusion
PermFIT is a computationally efficient, general permutation-based feature importance test that provides valid inference and is applicable across DNN, RF, and SVM models without model refitting. It consistently controls type-I error, effectively detects true features including nonlinear and interaction effects, reduces false positives under correlation, and improves predictive performance after feature selection. Applications to TCGA kidney cancer and HITChip microbiome data highlight practical utility and biologically meaningful biomarkers. The authors note broad applicability of PermFIT across outcome types and machine learning frameworks. No explicit future research directions are specified.
Limitations
Prediction improvement depends on the intrinsic modeling capacity of the underlying ML framework; for example, RF is relatively inefficient for modeling interaction terms, which may limit PermFIT-RF for traits with strong interactions. While PermFIT reduces overfitting bias via cross-fitting and avoids distributional assumptions required by CRT/knockoffs, performance still relies on the quality of the base learner’s approximation to μ(X).