
Integrative machine learning approaches for predicting disease risk using multi-omics data from the UK Biobank
O. Aguilar, C. Chang, et al.
Oscar Aguilar, Cheng Chang, Elsa Bismuth, and Manuel A. Rivas apply machine learning to multi-omics data from the UK Biobank to predict disease risk for 22 conditions and to measure how much each type of biological data contributes to prediction.
Introduction
Chronic diseases are leading causes of mortality and morbidity in the United States. Many efforts have focused on predicting chronic disease risk from genomics, lifestyle, nutrition, and demographics, but the relative performance of these data types for risk assessment remains unclear. This study trains disease risk prediction models on four data types (demographics, genomics, metabolomics, and clinical biomarkers) to assess how much each contributes to prediction. Demographic data (e.g., age, sex) and biomarker data (e.g., glucose, cholesterol) are standard in clinical diagnosis; genetic data capture heritable disease risk; and metabolomics may add predictive value beyond clinical and demographic predictors. Prior work has often relied on a single data type, overlooking complex multi-factor interactions. Recent studies show that metabolomics, demographics, and genomics, each used on its own, can predict age-related diseases and mortality. This suggests that machine learning can leverage high-throughput multi-omics data to capture non-linear relationships and enable integrative fusion approaches.
Literature Review
The paper situates its work within prior research demonstrating: (1) the clinical utility of curated blood and urine biomarkers in the UK Biobank (Sinnott-Armstrong et al.) and the development of polygenic risk scores (PRS) for numerous traits; (2) the heritable basis of many diseases and the use of genomics for risk prediction; (3) the promise of metabolomic profiles for predicting multiple disease outcomes; and (4) multi-omics machine learning methodologies and fusion strategies. It references the Global Biobank Engine as a platform for disseminating biobank-derived summary statistics and highlights cooperative learning as a potential multiview integration approach for future fusion of omics datasets.
Methodology
Data: Four UK Biobank datasets were used: demographics (age, sex, BMI), genomics (per-disease polygenic risk scores computed without demographic features, using BASIL from prior work), clinical biomarkers (35 curated blood and urine biomarkers), and metabolomics (249 blood metabolites, available for 87,465 White-British individuals). The main modeling dataset comprised 87,465 White-British individuals with multi-omics data; an additional 29,089 non-White-British individuals were used to develop PRS incorporation strategies without touching the held-out test set. Twenty-two diseases were included for binary prediction; multiple diseases with age-of-onset data for incident cases were used for survival analysis.
Binary classification: Models were trained to predict disease onset/diagnosis within 13 years after data collection using combined feature spaces. Classifiers: AdaBoost and XGBoost (tree-based ensembles), Lasso logistic regression (L1), and a Multi-Layer Perceptron (2 hidden layers). Hyperparameter tuning used a predefined 70/10/20 train/validation/test split of White-British individuals (consistent with PRS development), so that model selection relied on a single fixed train/validation split (in effect, 1-fold cross-validation) rather than on rotating folds. Ensembles varied the number of estimators from 5 to 200 (most selected ~10). Linear models varied regularization strength logarithmically from 1e-4 to 1e4, using LogisticRegressionCV with warm-start along regularization paths; preprocessing included standard scaling. The MLP search covered 2 hidden layers with 5–50 nodes each. Feature importance was assessed using Yellowbrick for ensembles and via coefficients for linear models. Performance was evaluated by ROC-AUC on the held-out test set. To quantify the contribution of each data type, Lasso models were trained on feature subsets: demographics; +genomics; +biomarkers; +metabolomics; and all (multi-omics). Pairwise improvements in AUC were tested using permutation tests on paired ROC curves to derive p-values (Table A2).
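The tuning setup described above can be sketched as follows. This is a minimal illustration, assuming scikit-learn and synthetic stand-in data (no UK Biobank data are reproduced here). `PredefinedSplit` encodes the single fixed train/validation split, and the `Cs` grid spans 1e-4 to 1e4 on a log scale. The liblinear solver, the 9-point grid, the simulated features, and all variable names are illustrative assumptions, not details taken from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import PredefinedSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for one disease (assumption): 2000 people, 20 features,
# of which only the first three carry signal.
n, p = 2000, 20
X = rng.normal(size=(n, p))
logits = 1.5 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] - 1.0
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

# Fixed 70/10/20 train/validation/test split by row position.
n_train, n_val = int(0.7 * n), int(0.1 * n)
n_dev = n_train + n_val
X_dev, y_dev = X[:n_dev], y[:n_dev]
X_test, y_test = X[n_dev:], y[n_dev:]

# PredefinedSplit: -1 means "always train", 0 marks the single validation fold.
test_fold = np.concatenate([np.full(n_train, -1), np.zeros(n_val, dtype=int)])
cv = PredefinedSplit(test_fold)

# Note: the scaler here is fit on train + validation rows, a simplification.
model = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(
        Cs=np.logspace(-4, 4, 9),  # C is the inverse regularization strength
        cv=cv,
        penalty="l1",
        solver="liblinear",
        scoring="roc_auc",
        max_iter=1000,
    ),
)
# After choosing C on the validation fold, the default refit=True refits
# on all rows passed to fit (train + validation).
model.fit(X_dev, y_dev)

lasso = model[-1]
best_C = float(np.ravel(lasso.C_)[0])
n_nonzero = int(np.count_nonzero(lasso.coef_))
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"C={best_C:g}, non-zero coefficients={n_nonzero}, test AUC={test_auc:.3f}")
```

Repeating the same fit on nested column subsets (for example demographics only, then demographics plus biomarkers) and comparing the test AUCs would mirror the incremental data-type comparison, under the same assumptions.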
Survival analysis: Cox proportional hazards models with L1 regularization (CoxnetSurvivalAnalysis from scikit-survival, a Python re-implementation of R’s Coxnet) were trained per disease using the same 70/10/20 split. Preprocessing included feature-wise normalization and median imputation. Left-censoring removed prevalent cases (onset before data collection), leaving only incident cases; the number of incident cases per disease ranged from 222 to 3012. Models were trained on five feature groups: (1) demographics; (2) demographics+genomics; (3) demographics+biomarkers; (4) demographics+biomarkers+metabolomics; (5) all four types. The C-index on the test set was used to compare performance across groups. L1 strength (alpha) was tuned on the validation set to maximize C-index; the selected models had sparse sets of non-zero coefficients (typically 10–50 of 288 features, ranging from 2 to 95 depending on disease). Final models were trained on the combined training data and evaluated on the test set. Risk trajectories (individual hazard over time) were generated to visualize predicted disease onset risk for individuals.
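The C-index used to compare the survival models can be illustrated with a short pure-Python sketch of Harrell's concordance index for right-censored data. It is a simplified version written for this summary, not the scikit-survival implementation: pairs with tied event times are skipped, and the function name and toy data are hypothetical.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored data (simplified sketch).

    times  : observed follow-up time per individual
    events : 1 if disease onset was observed, 0 if censored
    risks  : predicted risk score (higher means earlier expected onset)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # each pair is anchored on an individual with an observed event
        for j in range(n):
            if times[i] < times[j]:  # j was still event-free when i had the event
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0  # higher risk for the earlier event
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort (hypothetical): five people, two of them censored.
times = [2, 4, 5, 7, 9]
events = [1, 0, 1, 1, 0]

perfect = concordance_index(times, events, [0.9, 0.5, 0.6, 0.4, 0.1])
mixed = concordance_index(times, events, [0.9, 0.5, 0.3, 0.4, 0.1])
constant = concordance_index(times, events, [0.5] * 5)
print(perfect, round(mixed, 3), constant)  # 1.0 0.857 0.5
```

A value of 0.5 corresponds to uninformative risk scores and 1.0 to a perfect ordering, which is the scale on which the reported C-index values should be read.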
Key Findings
- Model comparison across 22 diseases: Lasso logistic regression generally outperformed the other classifiers. Average AUC: Lasso 0.739 vs XGBoost 0.727, AdaBoost 0.726, and MLP 0.675. Lasso achieved the highest test AUC for 18/22 diseases, with a median AUC advantage of 0.012 (1.7%) over the ensemble classifiers. XGBoost trained fastest (~20 s with default settings), AdaBoost was sparsest (~5–30 non-zero features vs 50–150 for Lasso), and Lasso required the most tuning time (~208 s per disease) but performed best.
- Feature importance: AdaBoost and Lasso often agreed on the top features but differed in relative rankings and in the metabolites selected for specific diseases (e.g., diabetes: glucose vs creatinine; myocardial infarction: differing metabolite sets).
- Incremental value of data types (binary AUC):
• Genomics: Limited but notable gains for some diseases—non-melanoma skin cancer (+0.05 AUC), asthma (+0.06), psoriasis (+0.08) when added to demographics.
• Biomarkers (35 clinical): Significant improvements for several diseases—peripheral vascular disease, renal failure, chronic renal failure, and diabetes improved by +0.05 to +0.16 AUC over demographics (all p < 0.0005).
• Metabolomics (249 features): On top of demographics+biomarkers, modest additional gains (+0.01 to +0.02 AUC) for diabetes, renal failure, chronic renal failure, peripheral vascular disease, myocardial infarction, and arthritis (many p < 0.05). Full multi-omics models yielded statistically significant but small further improvements for asthma, gout, gall stones, cholelithiasis, and ulcerative colitis (delta AUC 0.005–0.03). Ulcerative colitis showed synergy: +0.09 (genomics) and +0.11 (biomarker/metabolomics) separately vs +0.14 jointly.
- Survival analysis (C-index):
• Adding genomics to demographics improved C-index for psoriasis (0.56→0.61), ulcerative colitis (0.54→0.64), and insulin-dependent diabetes mellitus (0.67→0.74); minimal or negative changes for most others.
• Adding biomarkers to demographics produced larger gains for non-insulin-dependent diabetes (0.77→0.87), gout (0.75→0.86), and chronic renal failure (0.71→0.86); minimal or negative changes for some diseases (e.g., gallbladder disorders, cholecystitis, psoriasis).
• Adding genomics+metabolomics on top of demographics+biomarkers provided marginal improvements for a few diseases (e.g., psoriasis 0.64→0.66; cholecystitis 0.53→0.55), with many unchanged or decreased.
• Final Cox models: Many disease-specific models achieved test C-index >0.6; some (e.g., diabetes, renal failure) exceeded 0.8.
- Metabolomics utility: While metabolomics rarely improved performance beyond the 35 clinical biomarkers, it served as a broadly comparable substitute when biomarkers were unavailable in classification models (except type 2 diabetes, −0.03 AUC); survival models showed larger losses when replacing biomarkers with metabolites.
- Resource: Multi-omic Risk Scores (MRS) for 22 diseases and survival model outputs were integrated into the Global Biobank Engine for public access.
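The p-values attached to the AUC gains above come from permutation tests on paired ROC curves. One common form of such a test is sketched below in pure Python: for each individual the two models' scores are randomly swapped, and the AUC difference is recomputed to build a null distribution. This is an illustrative variant that assumes both models output scores on a comparable scale (such as probabilities). It is not necessarily the exact procedure used in the paper, and the function names and simulated scores are hypothetical.

```python
import random


def auc(labels, scores):
    """ROC-AUC as the probability that a random case outscores a random control."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def paired_permutation_test(labels, scores_a, scores_b, n_perm=2000, seed=0):
    """Two-sided test of AUC(b) - AUC(a) on the same individuals."""
    rng = random.Random(seed)
    observed = auc(labels, scores_b) - auc(labels, scores_a)
    extreme = 0
    for _ in range(n_perm):
        perm_a, perm_b = [], []
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:  # swap the two models' scores for this person
                a, b = b, a
            perm_a.append(a)
            perm_b.append(b)
        diff = auc(labels, perm_b) - auc(labels, perm_a)
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (extreme + 1) / (n_perm + 1)


# Simulated scores (assumption): a weak baseline model and a stronger model.
rng = random.Random(1)
labels = [i % 2 for i in range(120)]
weak = [rng.gauss(0.3 * y, 1.0) for y in labels]
strong = [rng.gauss(2.0 * y, 1.0) for y in labels]

delta, p_value = paired_permutation_test(labels, weak, strong, n_perm=500)
print(f"delta AUC = {delta:.3f}, p = {p_value:.4f}")
```

With `n_perm` permutations the smallest attainable p-value is 1/(n_perm + 1), so resolving thresholds such as p < 0.0005 requires several thousand permutations.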
Discussion
The findings show that, despite the capacity of non-linear models, regularized linear models (Lasso) generally provide strong and often superior predictive performance for disease risk across diverse conditions when trained on multi-omics data. XGBoost offers speed and AdaBoost yields sparsity and interpretability, but Lasso achieves the best discrimination for most diseases. Incremental data-type analyses show that clinical biomarkers substantially enhance predictive performance over demographics for many diseases. Genomics adds value for specific conditions with strong heritable components (e.g., psoriasis, asthma, non-melanoma skin cancer). Metabolomics adds small but statistically significant improvements beyond biomarkers for some diseases; however, its main utility is as a substitute when standard biomarker panels are unavailable. Survival analyses corroborate the classification results, revealing similar patterns in C-index improvements and providing clinically relevant time-to-event risk trajectories. Overall, integrating multi-omics yields modest performance gains beyond strong demographic+biomarker baselines, with the largest benefits in diseases exhibiting both genetic and metabolomic components.
Conclusion
The study develops Multi-omic Risk Scores using integrated demographics, genomics, biomarkers, and metabolomics to predict risk for 22 diseases and complements these with Cox survival models to estimate time-to-event risk trajectories. Lasso logistic regression generally delivers the best discrimination, while biomarkers provide the largest incremental gains over demographics, genomics contribute to select diseases, and metabolomics offers marginal improvements beyond biomarkers but can substitute for them when unavailable. Results are disseminated via the Global Biobank Engine. Future work includes expanding analyses beyond White-British cohorts to assess portability, quantifying uncertainty in classification risk scores, exploring advanced multi-view fusion methods (e.g., cooperative learning), incorporating additional omics (e.g., proteomics), and continued longitudinal updates as more incident cases accrue.
Limitations
- Limited incident case counts for some diseases (e.g., ulcerative colitis n=916 in classification; unspecified renal failure n=233 in survival), increasing instability and variance in estimates.
- Cohort primarily restricted to White-British participants, limiting generalizability to other ancestries and populations; prior evidence indicates limited transferability across ethnic groups.
- For some diseases, inclusion of additional omics (biomarkers or metabolomics) decreased performance, suggesting potential overfitting or cohort/measurement artifacts.
- Metabolomics availability limited to a subset, potentially introducing selection effects; replacement of biomarkers with metabolites led to larger losses in survival models.
- Excluding prevalent cases via left-censoring may affect comparability across diseases with differing onset patterns.