Introduction
Precision medicine necessitates accurate predictive models. Integrating multi-omics data, including genomics, transcriptomics, and epigenomics, holds significant promise for improving prediction accuracy in complex diseases and traits. However, integrating diverse omics data poses challenges due to variations in data types, preprocessing requirements, and dimensionality. Traditional association studies have identified numerous genes and CpGs associated with various phenotypes, but the combined effects across different omics remain largely unexplored. Recent advances in high-throughput sequencing and array technologies have facilitated the acquisition of multi-omics datasets, creating a need for sophisticated analytical tools capable of handling high dimensionality and complex correlation structures. Neural networks, particularly interpretable neural networks informed by prior biological knowledge (visible machine learning), offer a promising solution. These networks not only provide accurate predictions but also offer insights into the underlying biological mechanisms. Existing examples like GenNet and P-net demonstrate the potential of visible machine learning in genomics, but challenges remain in terms of robustness and the reliability of interpretations. This study addresses these challenges by extending the GenNet framework to create interpretable neural networks for multiple omics inputs and applying it to a large multi-cohort dataset to assess performance, interpretability, and generalizability across different populations.
Literature Review
The literature extensively documents the individual contributions of genomics, transcriptomics, and epigenomics to understanding complex traits and diseases. However, the synergistic effects of integrating these data types are not fully understood. Several studies have explored multi-omics integration using various statistical frameworks and machine learning techniques, demonstrating the potential for improved prediction and biological insight. Challenges in multi-omics analysis include handling high dimensionality, diverse data types, and complex correlation structures within and between omics. Neural networks have emerged as powerful tools for tackling such challenges, but interpretability remains a critical concern. Visible machine learning, which integrates prior biological knowledge into neural network architectures, addresses this by providing insights into the decision-making process. Previous work, such as GenNet and P-net, has applied this approach to genomic data, showing its potential but also highlighting limitations in robustness and in the reliability of interpretations, which can depend on weight initialization. This study builds on these advances by modifying existing architectures to address these limitations and by focusing on validation across multiple cohorts.
Methodology
This study used multi-omics data from the Biobank-based Integrative Omics Study (BIOS) consortium, encompassing four cohorts: Lifelines (LL), Leiden Longevity Study (LLS), Netherlands Twin Register (NTR), and Rotterdam Study (RS). The data comprised genome-wide RNA expression and CpG methylation measured in blood samples. The researchers developed visible neural networks, extending the GenNet framework, to predict smoking status, age, and LDL levels. CpG methylation sites were annotated using GREAT and linked to the closest gene. Gene expression data were intersected with the methylation data, yielding a set of overlapping genes used in the analysis. The network architectures included a methylation-only network (ME), a gene expression-only network (GE), and a combined methylation and gene expression network (ME+GE). Deeper networks incorporating KEGG pathway information were also evaluated. A cohort-wise cross-validation strategy was employed to assess generalizability across cohorts. Hyperparameters were tuned using a validation set within each fold. The networks were trained ten times with different random seeds to assess stability. For interpretation, the contribution of each input and node was calculated using absolute weights, with L1 regularization to promote sparsity and improve interpretability. Baseline networks with densely connected layers were used for comparison. Additional analyses included omic-specific L1 penalties to assess the independent contribution of each omic type and the integration of covariates to explore covariate-gene interactions. Principal component analysis (PCA) was applied to the network's activation patterns to identify subgroups within the population with similar activations.
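The two structural ideas in this section, a gene layer wired according to annotation and cohort-wise cross-validation, can be illustrated with a small sketch. This is not the GenNet implementation: the CpG identifiers, the toy annotation, the tanh activation, and the function names `gene_layer` and `cohort_wise_folds` are all assumptions made for illustration. The sketch also assumes that each fold holds out one whole cohort as the test set; how the validation set was drawn within a fold is not stated here, so it is not modeled.

```python
import math

# Hypothetical annotation: each CpG is linked to its closest gene (toy values).
CPG_TO_GENE = {"cg_a": "AHRR", "cg_b": "AHRR", "cg_c": "GPR15"}
# Toy set of genes that also have an expression measurement.
GENES = ["AHRR", "GPR15", "LRRN3"]


def gene_layer(methylation, expression, w_me, w_ge, bias):
    """One 'visible' layer of an ME+GE-style network (illustrative only).

    Each gene node receives its own expression value plus only those CpGs
    that the annotation links to it; all other connections are absent.
    """
    activations = {}
    for gene in GENES:
        total = bias[gene] + w_ge[gene] * expression[gene]
        for cpg, linked_gene in CPG_TO_GENE.items():
            if linked_gene == gene:  # a connection exists only where annotated
                total += w_me[cpg] * methylation[cpg]
        activations[gene] = math.tanh(total)  # activation choice is an assumption
    return activations


def cohort_wise_folds(sample_cohorts):
    """Yield (held_out_cohort, train_indices, test_indices), one fold per cohort.

    Assumes a leave-one-cohort-out scheme, so the test samples always come
    from a cohort the model has never seen during training.
    """
    for held_out in sorted(set(sample_cohorts)):
        train = [i for i, c in enumerate(sample_cohorts) if c != held_out]
        test = [i for i, c in enumerate(sample_cohorts) if c == held_out]
        yield held_out, train, test
```

Because a gene node only receives the inputs annotated to it, its learned weights can be read as statements about that gene, which is what makes the network interpretable. Splitting by cohort rather than by sample means that a good test score reflects transfer to a different population, not just to unseen individuals from the same study.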
Key Findings
The ME+GE network demonstrated consistently high performance in predicting smoking status across all cohorts (mean AUC = 0.95, 95% CI: 0.93–0.98), identifying AHRR, GPR15, and LRRN3 as key genes. For age prediction, the ME+GE network achieved a mean error of 5.16 years (95% CI: 3.97–6.35), outperforming single-omics models. However, performance varied across folds, potentially due to differences in age distributions among the cohorts. The inclusion of both methylation and gene expression data improved prediction accuracy compared to single-omics approaches, and the interpretation revealed that, for age prediction, the neural networks relied on multifactorial solutions rather than on a few genes. LDL level prediction showed limited generalizability except for one cohort, where the ME+GE network achieved an R² of 0.07 (95% CI: 0.05–0.08). Deeper networks with pathway layers or fully connected layers did not improve performance. The omic-specific L1 penalty analysis showed that the network could reduce its reliance on methylation information while achieving similar prediction performance, highlighting the importance of using both omics. PCA of the activation patterns for age prediction revealed a separation between sexes, mostly explained by genes on the X chromosome, and a correlation between the second principal component and age. This effect of sex illustrates how including covariates may influence the results.
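One way to read the omic-specific L1 penalty and the weight-based gene ranking is sketched below. The function names, the penalty strengths, and the idea of normalizing absolute weights so that they sum to one are illustrative assumptions, not details taken from the study.

```python
def omic_l1_penalty(w_me, w_ge, lambda_me, lambda_ge):
    """L1 penalty with a separate strength per omic type (illustrative).

    Raising lambda_me makes methylation weights more expensive, so a network
    trained with this term in its loss is pushed towards gene expression.
    """
    me_term = lambda_me * sum(abs(w) for w in w_me.values())
    ge_term = lambda_ge * sum(abs(w) for w in w_ge.values())
    return me_term + ge_term


def relative_importance(weights):
    """Rank nodes by absolute weight, normalized to sum to 1 (assumed convention)."""
    total = sum(abs(w) for w in weights.values())
    if total == 0:
        # All weights were shrunk to zero: no node carries any importance.
        return {name: 0.0 for name in weights}
    return {name: abs(w) / total for name, w in weights.items()}
```

Under these assumptions, a gene whose weight survives a strong L1 penalty receives a large share of the normalized importance, while weights shrunk to zero drop out of the ranking. Comparing rankings across different `lambda_me` and `lambda_ge` settings would then show how much the prediction leans on each omic type.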
Discussion
This study demonstrates the effectiveness of visible neural networks for integrating multi-omics data to predict complex phenotypes. The consistent high performance in smoking status prediction validates the approach, with the identified genes aligning with existing literature. The superior performance of multi-omics networks over single-omics networks highlights the synergistic effects of data integration. The variable performance in age prediction underscores the importance of considering cohort-specific factors and data distributions. The limited generalizability of LDL prediction may reflect the weak relationship between the studied omics and LDL levels. The lack of performance improvement with deeper networks suggests that the optimal architecture may depend on the specific phenotype and data characteristics. The findings emphasize the importance of regularization and careful consideration of prior knowledge in building interpretable models. Further research could explore other omics data, investigate alternative network architectures, and focus on larger datasets to better address the challenges in generalizing the models.
Conclusion
This study successfully applied visible neural networks to integrate multi-omics data for phenotype prediction. The results highlight the value of multi-omics integration, the importance of considering cohort-specific factors, and the need for appropriate regularization. Future work should explore different network architectures, incorporate additional omics data types, and investigate the impact of larger datasets on model generalizability.
Limitations
The study's limitations include the potential influence of batch effects, the focus on blood-based omics data, and the possible limitations of the annotation databases used. The variation in age distributions among the cohorts might have influenced the age prediction results. Further investigation is needed to fully understand the generalizability and robustness of the developed models in diverse populations.