Interpretable machine learning prediction of all-cause mortality
W. Qiu, H. Chen, et al.
Wei Qiu and colleagues study all-cause mortality prediction with tree-based machine learning models. Their IMPACT framework pairs these models with explainable AI (XAI) techniques, outperforms traditional linear models on most tasks, and surfaces overlooked risk factors while keeping predictions interpretable.
Introduction
The study addresses the need to predict all-cause mortality more accurately while providing interpretable explanations of risk factors. Traditional epidemiological approaches often rely on univariate associations or multivariate linear models, which cannot fully capture non-linear and interaction effects among predictors. Although complex machine learning (ML) models can improve predictive accuracy, their adoption in healthcare is hindered by a lack of interpretability (“black box” concerns). This work introduces the IMPACT framework to combine accurate tree-based ML models with explainable AI (XAI) techniques to systematically study relationships among many variables and mortality. The goals are to: (1) improve prediction of all-cause mortality across multiple follow-up horizons and age groups; (2) elucidate non-linear relationships, thresholds, and interactions between features and mortality; (3) identify both well-known and overlooked risk predictors; and (4) develop accurate, efficient, and interpretable mortality risk scores suitable for clinical and public use. The study applies IMPACT to NHANES (1999–2014) and validates key findings using UK Biobank to assess generalizability.
Literature Review
Prior mortality research has often focused on associations between single predictors and mortality, with fewer studies using multivariate linear models for prediction. Multiple linear mortality risk scores have been proposed, favored historically for interpretability, but they impose assumptions (linearity, normality, independence) and require manual feature transformations to capture non-linear effects. Meanwhile, AI has achieved strong performance in various healthcare applications (e.g., ophthalmology, histopathology, dermatology), but limited interpretability impedes clinical trust and utility. Explainable AI (e.g., SHAP-based methods) offers a path to understand complex ML models’ predictions at both global and local levels. The authors position IMPACT to bridge predictive performance with interpretability in epidemiology, enabling discovery of non-linearities, thresholds, and interaction effects that linear approaches may miss, and facilitating verification or refinement of laboratory reference intervals relevant to mortality risk.
Methodology
Data cohorts: Primary dataset is NHANES (1999–2014) with demographic, laboratory, examination, and questionnaire variables that are harmonized across cycles. After preprocessing, 47,261 samples with 151 features remained. Mortality follow-up extends through December 31, 2015. Prediction tasks include: (1) all-cause mortality at 1-, 3-, 5-, and 10-year follow-up using all ages; (2) 5-year mortality within age strata: <40, 40–65, 65–80, and ≥80. Data split: 80% training, 20% testing.
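As a rough illustration of the stratification and split described above, the sketch below assigns an age to one of the four strata and draws an 80/20 split of sample indices. The half-open boundaries (for example 40 ≤ age < 65) are an assumption taken from the labels in the results table, the source does not say whether the split was purely random or stratified, and the names `age_stratum` and `split_indices` are hypothetical rather than taken from the study's code.

```python
import random
from bisect import bisect_right

# Lower bounds of the upper three strata; boundary handling is an assumption.
CUTOFFS = [40, 65, 80]
LABELS = ["<40", "40-65", "65-80", ">=80"]


def age_stratum(age):
    """Return the stratum label for an age in years (half-open intervals)."""
    return LABELS[bisect_right(CUTOFFS, age)]


def split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle indices and return (train, test); a plain random split is assumed."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(47261)  # sample count reported for NHANES
print(len(train_idx), len(test_idx))
print(age_stratum(39), age_stratum(40), age_stratum(80))
```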
External validation dataset: UK Biobank with 384,762 samples having confirmed 5-year mortality status, using 51 overlapping features with NHANES; split 80%/20% train/test.
IMPACT framework: Models are gradient boosted trees (GBTs), nonparametric ensembles that capture non-linearities and interactions. Hyperparameters are selected by grid search with 5-fold cross-validation. Performance metric: area under the receiver operating characteristic curve (AUROC).
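To make the performance metric concrete, here is a minimal sketch of AUROC computed from its pairwise definition: the fraction of (decedent, survivor) pairs in which the decedent receives the higher predicted risk, with ties counted as half. The function name `auroc` and the toy labels and scores are invented for illustration; the study itself would have relied on a library implementation.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. The double loop is quadratic, so this
    is only meant for small illustrative inputs.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data (invented): 1 = died within the horizon, 0 = survived.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(labels, scores))  # 3 of 4 pairs ranked correctly -> 0.75
```

Read this way, an AUROC of 0.5 corresponds to random ranking and 1.0 to perfect separation of decedents from survivors.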
Explainability: TreeExplainer computes exact SHAP values for GBTs, providing local explanations (per-feature contributions that, added to the model’s expected output, sum to its log-odds prediction for an individual), main effects, and pairwise interaction effects (SHAP interaction values). A higher SHAP value indicates a larger contribution to predicted mortality risk. Foreground samples and SHAP values for the 1-, 3-, 5-, and 10-year models are provided (Supplementary Data 2–9).
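The additivity property behind these local explanations can be sketched with a brute-force Shapley computation on a tiny model. This is not the TreeExplainer algorithm, which exploits tree structure to avoid enumerating feature subsets; it only illustrates the quantity being computed. The toy log-odds model, its coefficients, the single baseline individual used as the background, and all names below are assumptions made up for the example, not values from the study.

```python
import itertools
import math


def toy_log_odds(x):
    """Hypothetical log-odds model; every coefficient here is invented."""
    age, albumin, rdw = x["age"], x["albumin"], x["rdw"]
    return (-3.0
            + 0.05 * (age - 50)
            - 0.10 * (albumin - 43)
            + 0.30 * (rdw - 13)
            + 0.002 * (age - 50) * (43 - albumin))  # age-albumin interaction


def shapley_values(model, x, baseline):
    """Exact Shapley values by enumerating all feature subsets.

    Features outside a subset are set to their baseline value (assumption:
    a single reference individual stands in for the background data).
    """
    names = list(x)
    n = len(names)

    def value(subset):
        mixed = {k: (x[k] if k in subset else baseline[k]) for k in names}
        return model(mixed)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            weight = (math.factorial(size) * math.factorial(n - size - 1)
                      / math.factorial(n))
            for subset in itertools.combinations(others, size):
                chosen = set(subset)
                total += weight * (value(chosen | {name}) - value(chosen))
        phi[name] = total
    return phi


baseline = {"age": 50, "albumin": 43, "rdw": 13}  # invented reference person
person = {"age": 72, "albumin": 36, "rdw": 15}    # invented individual

phi = shapley_values(toy_log_odds, person, baseline)
base_value = toy_log_odds(baseline)
log_odds = base_value + sum(phi.values())  # additivity: base + contributions
risk = 1.0 / (1.0 + math.exp(-log_odds))   # log-odds converted to probability
print(phi, log_odds, risk)
```

Because the contributions live on the log-odds scale, they add up exactly to the model output and are converted to a probability only at the end. In this toy model the interaction term is split between age and albumin, while the RDW contribution is purely its own main effect.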
Additional analytical techniques: (1) Relative Risk Percentage (RRP) to compare mortality risk within laboratory reference intervals (RIs) against overall risk, identifying potentially sub-optimal RIs. (2) Supervised distance to quantify redundancy between features with respect to the prediction task by training univariate models to predict each feature’s contribution to the outcome; hierarchical clustering used to group redundant features. (3) Recursive feature selection based on supervised distance to select predictive, less-redundant feature sets. (4) Development of interpretable mortality risk scores optimized for different cost-vs-accuracy tradeoffs (e.g., focusing on demographics/examination/questionnaire for public users, or laboratory panels for clinicians). Temporal validation was performed by training on NHANES 1999–2008 and validating on NHANES 2009–2014; external validation tested generalizability on UK Biobank.
Comparators: Logistic regression and neural networks for model performance comparisons; linear and tree-based Cox proportional hazards models for risk score comparisons; Intermountain sex-specific risk scores for common lab panels.
Outcomes and analyses: AUROC comparisons across follow-up times and age groups; detection of non-linear relationships and thresholds (partial dependence, SHAP main effects); interaction effects (SHAP interaction values); evaluation and comparison of feature importances across datasets; supervised distance-based redundancy analysis and feature selection; risk score performance and individualized explanations.
Key Findings
Model performance: Across multiple tasks, GBTs achieved higher AUROC than linear models and neural networks in most comparisons.
- AUROC (Logistic Regression | Gradient Boosted Trees | Neural Network):
• 1-year mortality: 0.5219 | 0.8801 | —
• 3-year mortality: 0.8938 | 0.9032 | 0.8776
• 5-year mortality: 0.8855 | 0.8941 | 0.8842
• 10-year mortality: 0.9095 | 0.879 | —
• Age < 40: 0.7485 | 0.8386 | 0.8150
• 40 ≤ Age < 65: 0.7259 | 0.8230 | 0.7673
• 65 ≤ Age < 80: 0.8170 | 0.7964 | —
• Age ≥ 80: 0.7408 | 0.7334 | 0.7106
Non-linearities and thresholds: The models captured complex effects not easily detected by linear methods.
- U-shaped relationship between serum uric acid and 5-year mortality, in contrast to the J-shaped relationship reported in earlier chronic kidney disease (CKD) studies that analyzed uric acid in categories.
- Urine albumin shows an important threshold around 250 µg/mL beyond which further increases have little additional effect on 5-year mortality risk.
Interaction effects:
- Blood lead level interacts with age: for blood lead > 0.1 µmol/L, younger individuals show higher 5-year mortality risk than older individuals.
- Blood lead level interacts with gender: females have higher 5-year mortality risk than males at blood lead > 0.24 µmol/L.
Validated associations:
- Red cell distribution width (RDW) positively associated with mortality; threshold around 12.7% where risk increases.
- Serum albumin negatively associated with mortality; inflection points near 43 g/L and 35 g/L, with dramatically increased risk around 35 g/L.
New/less-studied risk factors:
- Arm circumference strongly and negatively associated with 5-year mortality, especially in older individuals; ranked 4th in importance and more predictive than BMI (BMI ranked 56th), suggesting anthropometrics like arm or waist circumference outperform BMI for mortality prediction.
- Platelet count inversely related to 5-year mortality with a key threshold around 175×10^3/µL; below this, risk increases sharply.
- Serum chloride inversely related to 5-year mortality; levels <98 mmol/L associated with sharply increased risk; stronger effects observed in younger individuals and females.
Reference interval assessment (Relative Risk Percentage, RRP): The framework identified RIs likely aligned with low mortality risk (e.g., gamma glutamyl transferase, globulin, lymphocyte percent, blood urea nitrogen by sex) and RIs potentially sub-optimal for mortality risk (e.g., albumin, blood lead, mean cell volume, ALT by sex showed high RRP values up to 100%).
External validation (UK Biobank):
- Consistency of feature importance among 51 overlapping features: 14 of the top 20 features overlapped (Fisher’s exact test p=0.0004). Spearman correlation of importance rankings across datasets: ρ=0.6654 (p<0.0001).
- Replicated relationships for RDW, serum albumin, and serum uric acid; validated that some anthropometric measures (waist circumference in UKB) are more predictive than BMI.
- Reference interval insights for gamma glutamyl transferase and lymphocyte percent (optimal), and albumin (potentially sub-optimal) were corroborated.
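The top-20 overlap statistic can be checked with a short worked calculation. If the two importance rankings were unrelated, the number of features shared by both top-20 lists (out of 51 candidates) would follow a hypergeometric distribution, and the upper tail at 14 gives a one-sided Fisher's exact p-value. The source does not say whether its test was one- or two-sided, so treat this as a sketch under that assumption; the function name is hypothetical.

```python
from math import comb


def overlap_tail_probability(n_features, top_k, observed_overlap):
    """P(overlap >= observed) when two top-k lists are drawn independently.

    Hypergeometric upper tail, equivalent to a one-sided Fisher's exact test
    on the 2x2 table of 'in top-k for dataset A' vs 'in top-k for dataset B'.
    """
    total = comb(n_features, top_k)
    tail = sum(
        comb(top_k, k) * comb(n_features - top_k, top_k - k)
        for k in range(observed_overlap, top_k + 1)
    )
    return tail / total


# Values reported above: 51 shared features, top 20 in each dataset, 14 overlap.
p_value = overlap_tail_probability(51, 20, 14)
expected_overlap = 20 * 20 / 51  # mean overlap under independence
print(p_value, expected_overlap)
```

Under independence the expected overlap is a little under 8 features, so 14 shared features is well into the tail, which agrees in magnitude with the small p-value reported.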
Follow-up time differences: Feature importances vary with the follow-up horizon. For example, platelet count rose in importance from rank 75 for 1-year mortality to rank 12 for 10-year mortality, while serum potassium fell from rank 17 to rank 42. The shape of the relationships also differs by horizon. For potassium, hyperkalemia is associated with higher short-term (1-year) risk, whereas hypokalemia is associated with similar or higher long-term (5-year) risk. For sodium, high levels (>139 mmol/L) are associated with increased 1-year risk, whereas hyponatremia (<135 mmol/L) is associated with higher 5-year risk.
Redundancy and feature selection: Supervised distance identified redundant clusters (e.g., weight-related anthropometrics). Arm circumference was the most predictive among weight-related features; serum albumin the most predictive among selected blood test features. A reduced 90-feature set achieved AUROC=0.8845 with high minimum supervised distance (0.9301), balancing performance and interpretability.
Risk scores: IMPACT 5-year risk scores showed strong performance and efficiency.
- Using top 20 features (female): AUROC=0.8971 vs. 0.9030 using all features; fewer than 20 features led to large drops.
- Laboratory-only model AUROC=0.8881; questionnaire+examination-only AUROC=0.8835.
- Models using common test panels outperformed Intermountain risk scores. Temporal validation (NHANES 2009–2014) showed only slight performance degradation, indicating generalizability.
Individualized explanations: Case examples illustrate SHAP-based explanations of risk contributions (e.g., predicted risk 0.02 for a survivor and 0.61 for a decedent), enabling actionable interpretation for individuals and clinicians.
Discussion
The IMPACT framework demonstrates that combining accurate tree-based models with local explainability yields both improved mortality prediction and deeper epidemiological insight. By leveraging SHAP values and interaction effects, IMPACT uncovers non-linear relationships, thresholds, and complex interactions that linear models may miss, validating known predictors (e.g., RDW, serum albumin) and elevating less-studied ones (e.g., arm circumference, platelet count, serum chloride). The framework also informs the suitability of laboratory reference intervals in relation to mortality risk. Supervised distance helps address feature redundancy, improving global interpretability and aiding selection of representative features without substantial loss in performance. The interpretable risk scores, adaptable to different cost and data availability constraints, enable user-specific applications and individualized explanations to support clinical decision-making and personal health awareness. External validation with UK Biobank supports generalizability of key findings and feature importance rankings. While the study focuses on associations rather than causality, it provides a foundation for hypothesis generation and suggests directions for refining reference intervals, tailoring interventions to subpopulations, and understanding horizon-specific risk relationships.
Conclusion
This work introduces IMPACT, an explainable ML framework for all-cause mortality prediction that combines gradient boosted trees with SHAP-based explanations. IMPACT improves predictive performance over linear models and neural networks, reveals non-linear and interaction effects, identifies both known and overlooked risk factors, and offers insights into laboratory reference intervals. The framework supports development of accurate, efficient, and interpretable risk scores suited for both clinicians and the general public, with temporal and external validation indicating generalizability. Future work includes: (1) broader external validation across datasets with similar distributions and mortality rates; (2) integrating causal inference perspectives to distinguish confounders, mediators, and colliders, potentially leveraging supervised distance to prioritize variables; (3) extending IMPACT to other health outcomes; and (4) comparing IMPACT-based risk with biological aging clocks within the same cohorts.
Limitations
- Observational data: relationships discovered are associative, not necessarily causal; potential confounding remains.
- Redundancy and variable relationships: despite supervised distance, disentangling colliders, mediators, and confounders requires domain expertise and causal frameworks.
- Reference interval analysis: RRP identifies potentially sub-optimal ranges but does not establish optimal RIs; dedicated study designs are required.
- External validity: differences between NHANES and UK Biobank (e.g., age ranges, mortality base rates) may affect generalizability; further validation on matched populations is warranted.
- Model comparisons: some AUROC comparisons across horizons and subgroups indicate varying performance advantages; certain tasks lacked neural network benchmarks in reported tables.