Integrative machine learning approaches for predicting disease risk using multi-omics data from the UK Biobank

O. Aguilar, C. Chang, et al.

Explore how Oscar Aguilar, Cheng Chang, Elsa Bismuth, and Manuel A Rivas harness machine learning to analyze multi-omics data from the UK Biobank, unveiling enhanced disease risk prediction for 22 conditions. Discover the surprising impact of integrating diverse biological data.

Introduction
Chronic diseases are leading causes of mortality and morbidity, so predicting an individual's disease risk early is crucial for prevention and treatment. This research focuses on leveraging multi-omics data (demographic, genomic, metabolomic, and clinical biomarker information combined) to improve disease risk prediction. While existing studies often focus on individual data types, this study aims to build an integrated model that captures the interplay between them. The researchers hypothesize that integrating these diverse data types will significantly improve the accuracy of disease risk prediction compared to models that use only a single data type. The UK Biobank provides a rich dataset with a large sample size, making it well suited to training and validating such models. Integrating multi-omics data also offers the potential to identify novel biomarkers and improve our understanding of disease mechanisms, ultimately leading to more effective strategies for prevention and early intervention.
Literature Review
Previous research has explored the predictive power of individual data types like genomics, lifestyle factors, and demographics for chronic disease risk. Studies have shown the potential of using metabolomics, demographics, and genomics individually to predict age-related diseases and mortality. However, there's a lack of research on effectively integrating these diverse data types to create a comprehensive predictive model. The existing literature highlights the potential of machine learning to handle high-throughput data and complex non-linear relationships between predictors, motivating the application of these techniques in this study. The researchers cite several studies demonstrating the individual predictive value of different omics data types but emphasize the need for an integrative approach.
Methodology
The study used data from the UK Biobank, encompassing demographic features (age, sex, BMI), genomic data (a polygenic risk score, PRS, for each disease), clinical biomarkers (35 blood and urine biomarkers), and metabolomic data (249 blood metabolites). The analysis focused on 22 diseases with at least 1000 cases. For binary disease onset classification, four machine learning models were employed: AdaBoost, XGBoost, Lasso (L1-penalized) regression, and a multi-layer perceptron. Hyperparameter tuning was performed using cross-validation on a 70/10/20 train/validation/test split of White British individuals. Model performance was assessed using the area under the ROC curve (AUC), and a permutation test was used to assess the statistical significance of AUC differences between models. Feature importance analysis was conducted to identify key predictors. Additionally, survival analysis was performed using Cox proportional hazards models with an L1 penalty, incorporating age-of-onset data. Performance of the survival models was evaluated using the concordance index (C-index). The researchers explored different combinations of feature types to assess the contribution of each data type to model performance.
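The AUC comparison described above can be illustrated with a small, dependency-free sketch. It is not the authors' code: the summary does not say which permutation scheme was used, so the paired score-swapping test below is an assumption, and it presumes both models output scores on a comparable scale (for example, predicted probabilities). The function names and the toy data are hypothetical.

```python
import random


def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity:
    the probability that a random case scores higher than a random control,
    counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def permutation_test_auc(labels, scores_a, scores_b, n_perm=2000, seed=0):
    """Two-sided paired permutation test for AUC(A) - AUC(B).

    Assumed scheme: under the null hypothesis the two models are
    exchangeable, so each individual's pair of scores is swapped between
    the models with probability 0.5 and the AUC difference is recomputed.
    """
    rng = random.Random(seed)
    observed = auc(labels, scores_a) - auc(labels, scores_b)
    extreme = 0
    for _ in range(n_perm):
        perm_a, perm_b = [], []
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:
                a, b = b, a
            perm_a.append(a)
            perm_b.append(b)
        diff = auc(labels, perm_a) - auc(labels, perm_b)
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (extreme + 1) / (n_perm + 1)
    return observed, p_value


# Toy held-out set (made-up numbers): 1 = disease onset, 0 = no onset.
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
full_model = [0.10, 0.20, 0.15, 0.40, 0.30, 0.70, 0.80, 0.35, 0.90, 0.60]
demo_only = [0.30, 0.50, 0.20, 0.60, 0.40, 0.55, 0.70, 0.25, 0.65, 0.45]

delta, p = permutation_test_auc(labels, full_model, demo_only, n_perm=500)
print(f"AUC difference = {delta:.3f}, permutation p-value = {p:.3f}")
```

On real test sets the pairwise AUC loop becomes slow; a rank-based implementation or a library routine would be the practical choice.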
Key Findings
The Lasso regression model consistently showed the best performance among the classifiers, achieving the highest AUC for 18 out of 22 diseases. Integrating multi-omics data significantly improved risk prediction for 8 diseases. However, the added value of metabolomic data was marginal compared to demographic, genomic, and biomarker features. Metabolomics did serve as a suitable replacement for standard biomarker panels when the latter were unavailable. Survival analysis using Cox proportional hazards models largely confirmed the findings from the binary classification models. Adding genomic data significantly improved C-indices for some diseases (e.g., psoriasis, ulcerative colitis), while adding biomarker data improved C-indices for others (e.g., diabetes, renal failure). Including metabolomic data on top of demographic, genomic, and biomarker data yielded only marginal improvements in C-indices for a few diseases. Feature importance analysis revealed that age, sex, and disease-specific PRS were frequently selected as important features in the L1-regularized Cox models. The number of selected features varied across diseases, ranging from 2 to 95.
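For readers unfamiliar with the C-index reported for the Cox models, the following is a minimal sketch of Harrell's concordance index in plain Python. It is an illustration under stated assumptions, not the authors' implementation: tied event times are simply treated as non-comparable, and the function name and toy values are hypothetical.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: among comparable pairs, the fraction in which the
    individual with the earlier observed event has the higher predicted risk.

    times       -- observed time (event or censoring) per individual
    events      -- 1 if the event was observed, 0 if censored
    risk_scores -- model output where higher means higher risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            # A censored individual cannot be the earlier event in a pair.
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort (made-up numbers): age at onset or censoring, event flag, risk.
ages = [52, 60, 65, 71]
onset = [1, 1, 0, 1]
risk = [0.9, 0.7, 0.4, 0.1]
print(concordance_index(ages, onset, risk))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why small gains in C-index from adding a data type are described above as marginal.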
Discussion
The results demonstrate the potential of integrating multi-omics data to improve disease risk prediction, particularly when combined with established risk factors like age, sex, and existing biomarkers. The superior performance of Lasso regression suggests that linear relationships between predictors and disease risk are prominent. The relatively small contribution of metabolomics compared to other data types requires further investigation. However, its utility as a replacement for standard biomarker panels in situations where these panels are not available is noteworthy. The findings highlight the importance of considering multiple data types when developing disease risk prediction models. Future research should explore more sophisticated methods for integrating multi-omics data, potentially including proteomic data and other modalities.
Conclusion
This study demonstrates the potential benefits of integrating multi-omics data for disease risk prediction. While the inclusion of metabolomic data offered only marginal improvements, the combination of demographic, genomic, and biomarker data significantly enhanced model performance for several diseases. Lasso regression emerged as the best-performing classifier for most diseases. Future research should focus on developing more advanced data fusion techniques and extending the analysis to other populations and data types.
Limitations
The study's limitations include the relatively low number of disease cases for some diseases, which could affect model stability. The analysis primarily focused on a White British population from the UK Biobank, limiting the generalizability of findings to other ethnic groups. The lack of detailed uncertainty measures in the final risk scores could also impact practical application. Further research could examine the generalizability to other populations and explore more advanced data fusion methods.