Introduction
The study of complex human diseases is challenging because of intricate molecular mechanisms and convoluted etiologies at the genetic, genomic, and proteomic levels. High-throughput technologies have generated massive omics data that offer valuable insights into disease mechanisms. Machine learning methods such as support vector machines (SVMs), random forests (RFs), and deep neural networks (DNNs) provide powerful predictive models for biomedical data, but their complexity often hinders the interpretation of individual feature importance. Identifying crucial biomarkers is nonetheless paramount for generating hypotheses about disease prevention, diagnosis, and treatment. Existing feature importance methods each have limitations: surrogate models rely on potentially misspecified explanatory models; Shapley value-based methods (e.g., SHAP) are computationally intensive and do not guarantee valid statistical tests; conditional randomization tests (CRTs) and model-X knockoffs depend on assumptions about the covariance structure of the features, and their performance degrades when that structure is not accurately estimated; and permutation-based methods, while robust, often lack formal statistical inference. To address these challenges, the authors propose PermFIT.
Literature Review
The authors review existing feature importance methods for complex machine learning models. Surrogate modeling approximates a complex model with a simpler one, but is limited by the choice of surrogate. Shapley value methods, such as SHAP, offer localized feature characterization but are computationally expensive and do not ensure valid statistical testing. Conditional randomization tests (CRTs) and model-X knockoffs are also computationally expensive and rely on assumptions about the covariance structure of the features. Holdout randomization tests (HRTs) reduce the computational cost of CRTs but retain the covariance-structure assumption. KnockoffGAN, an extension of model-X knockoffs, avoids that assumption but is challenging to train. Gaussian mirror methods offer an alternative, but the individual neural Gaussian mirror (INGM) is computationally costly, while the simultaneous neural Gaussian mirror (SNGM) suffers a loss of performance. Permutation-based methods are a robust alternative but often lack formal statistical inference. The authors' proposed PermFIT addresses these limitations.
Methodology
PermFIT is a permutation-based feature importance test designed for complex machine learning models; the authors implement it for DNNs, RFs, and SVMs. It couples a permutation test with cross-fitting to obtain valid tests of importance scores, effectively controlling the Type I error, and it requires no model refitting per feature, which keeps it computationally efficient. The importance score M_j for the jth feature is defined as the expected increase in prediction loss when that feature is replaced by a random permutation of its values: for continuous outcomes the loss is the squared prediction error, and for binary outcomes it is the binomial deviance. K-fold cross-fitting, in which the model is fitted on the training folds and the score is evaluated on the held-out fold, mitigates bias from model overfitting and improves the robustness of the estimate of M_j. DNNs are implemented as feedforward fully connected networks with bagging (bootstrap aggregating) to enhance robustness and accuracy; RFs use the "randomForest" R package, and SVMs use the "e1071" R package. SHAP, LIME, SNGM, and HRT serve as comparison methods, along with recursive feature elimination (RFE) for SVMs. Simulation studies and real-data applications evaluate PermFIT's performance.
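The permutation score and cross-fitting scheme described above can be sketched for a continuous outcome as follows. This is a minimal illustration, not the authors' implementation: ordinary least squares stands in for the black-box learner (a DNN, RF, or SVM in the paper), and the function names (`fit_linear`, `permfit_scores`) and the exact form of the one-sided Z-test are assumptions for the sketch.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def fit_linear(X, y):
    # Stand-in for any black-box learner: ordinary least squares.
    Xb = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return lambda Xnew: np.column_stack([np.ones(len(Xnew)), Xnew]) @ beta

def permfit_scores(X, y, fit, K=5):
    """For each feature j, estimate M_j: the mean increase in held-out
    squared prediction error when feature j is permuted, with a one-sided
    Z-test p-value from the per-observation error differences."""
    n, p = X.shape
    folds = np.array_split(rng.permutation(n), K)
    diffs = [[] for _ in range(p)]  # per-observation error differences
    for k in range(K):
        test = folds[k]
        train = np.setdiff1d(np.arange(n), test)
        model = fit(X[train], y[train])          # fit on training folds only
        base_err = (y[test] - model(X[test])) ** 2
        for j in range(p):
            Xperm = X[test].copy()
            Xperm[:, j] = rng.permutation(Xperm[:, j])  # break X_j's link to y
            perm_err = (y[test] - model(Xperm)) ** 2    # no refitting needed
            diffs[j].extend(perm_err - base_err)
    results = []
    for j in range(p):
        d = np.asarray(diffs[j])
        m_j = d.mean()                            # importance score estimate
        se = d.std(ddof=1) / math.sqrt(len(d))
        pval = 0.5 * math.erfc((m_j / se) / math.sqrt(2))
        results.append((m_j, pval))
    return results

# Toy data: y depends on feature 0 only; feature 1 is pure noise.
X = rng.normal(size=(400, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=400)
scores = permfit_scores(X, y, fit_linear)
```

On this toy example the causal feature receives a large positive score with a near-zero p-value, while the noise feature's score stays near zero. Note the key design point mirrored from the paper: the model is fitted once per fold, and permutation happens only in the held-out predictions, so no per-feature refitting is required.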
Key Findings
Simulation studies across scenarios with different sample sizes and correlation structures showed that PermFIT controls the Type I error effectively and has high power to detect truly important features. PermFIT consistently outperformed SHAP, LIME, HRT, and SNGM in identifying causal features while maintaining a low false positive rate, and in most scenarios, refitting the models on the features PermFIT selected improved prediction accuracy. Application to TCGA kidney cancer RPPA data (KIRC, KIRP, KICH) identified four genes (CDKN1A, EIF4EBP1, INPP4B, SERPINE1) as significantly associated with survival status; all four have documented roles in cancer. Other genes flagged by PermFIT included XRCC1, ANXA7, MYH9, NRG1, and STK11. In the HITChip Atlas microbiome application, which aimed to predict BMI, age was the most significant factor, and *Megasphaera elsdenii*, *Eggerthella lenta*, and uncultured Clostridiales emerged as important microbiome features. PermFIT showed superior performance in both applications, confirmed by higher accuracy and AUC (area under the ROC curve) on the TCGA data and by lower mean squared prediction error (MSPE) and higher correlation between predicted and observed values on the HITChip data.
Discussion
PermFIT offers a computationally efficient and broadly applicable tool for identifying important features in complex machine learning models. It avoids the limitations of existing methods, such as reliance on specific model assumptions or computationally intensive procedures. The superior performance of PermFIT, particularly when coupled with DNNs, highlights its effectiveness in deciphering the complex relationships between features and outcomes in biomedical data. The ability to improve prediction accuracy after feature selection underscores PermFIT's value in building more accurate and interpretable models for complex diseases.
Conclusion
PermFIT provides a valuable tool for researchers studying complex diseases. Its efficiency, broad applicability, and superior performance make it a significant advancement in feature selection for machine learning models in biomedicine. Future research could explore the integration of PermFIT with other feature selection techniques and its application to even larger and more complex datasets.
Limitations
While PermFIT demonstrates strong performance, the improvement in prediction accuracy depends on the capabilities of the underlying machine learning model. For example, RF's limitation in modeling interaction terms might limit PermFIT-RF's performance for traits with strong gene-gene interactions. The study focused on specific types of machine learning models; future work could explore its generalizability to other models.