Introduction
Predicting plant phenotypes from genotypes is challenging because the genetic mechanisms underlying trait variation are intricate. Genomic prediction from genetic variation is common, but other data types, such as transcriptomic, methylomic, and metabolomic data, have also been used successfully to predict traits in several plant species. The *Arabidopsis* 1001 Genomes Project offers a unique resource with phenotypic, genomic (G), transcriptomic (T), and methylomic (M) data for hundreds of accessions. This study uses that dataset and machine learning to assess how well individual and integrated omics data types predict six *Arabidopsis* traits: flowering time, rosette leaf number (RLN), cauline leaf number (CLN), diameter of the rosette (DoR), rosette branch number (RBN), and stem length (SL). These traits span a range of developmental and morphological characteristics, making them suitable for studying the effects of multi-omics integration. By interpreting the models, the study aims to identify key genes involved in trait prediction and to gain insight into molecular mechanisms beyond the scope of traditional genome-wide association studies (GWAS).
Literature Review
Previous research has demonstrated the utility of individual omics data types in predicting plant traits. Transcriptomic data have been used to predict flowering time, yield, and pathogen resistance. Methylomic data have been applied successfully to predict flowering time and plant height in *Arabidopsis*. Metabolomic data have shown promise in predicting biomass and bioenergy-related traits in maize, and yield in rice. However, studies that integrate multiple omics data types to predict complex traits in plants are scarce. This study addresses that gap using the comprehensive dataset from the *Arabidopsis* 1001 Genomes Project, which includes genomic, transcriptomic, and methylomic data aligned with phenotypic information for hundreds of accessions.
Methodology
The study used genomic (G), transcriptomic (T), and gene-body methylation (gbM) data from the *Arabidopsis* 1001 Genomes Project for 383 accessions. Data for six traits (flowering time, RLN, CLN, DoR, RBN, and SL) were collected from published studies. Omics similarity matrices were compared with trait similarity matrices to assess the relationship between variation in the omics data and variation in the traits. Two machine learning models, ridge regression best linear unbiased prediction (rrBLUP) and Random Forest (RF), were trained for each trait on individual omics data and on integrated multi-omics data. Using both algorithms allowed the study to assess linear (rrBLUP) and non-linear (RF) relationships in the prediction of complex traits. Model performance was evaluated with the Pearson correlation coefficient (PCC) on a held-out test dataset. Feature importance was assessed using three measures: rrBLUP coefficients, RF Gini importance, and SHAP values. SHAP values allowed feature effects to be interpreted at both the global level and the local level (for individual accessions). The important genes identified for flowering time prediction were compared with benchmark genes known to regulate flowering. To explore feature interactions, integrated models using all G, T, and methylomic (M) features for the benchmark genes were built and interpreted using SHAP. The study also explored six additional methylomic features (single site-based methylation, ssM) to enhance predictive power and model interpretation. Finally, selected genes were experimentally validated using available mutant datasets to assess their impact on flowering time.
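The train-and-evaluate step described above can be sketched in a few lines. This is a minimal illustration, not the study's pipeline: it uses plain closed-form ridge regression as a stand-in for rrBLUP (the RF model is not shown), fully synthetic data, and assumed values for the feature count, penalty `lam`, and train/test split. The function names `ridge_fit`, `pcc`, and `evaluate` are hypothetical. Because the synthetic trait is constructed so that each layer carries independent signal, the scores say nothing about the real dataset.

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression on centered data; returns (coef, intercept)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    coef = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    return coef, y_mean - x_mean @ coef

def pcc(observed, predicted):
    """Pearson correlation coefficient between observed and predicted values."""
    return float(np.corrcoef(observed, predicted)[0, 1])

def evaluate(X, y, train, test, lam=1.0):
    """Fit on the training accessions, report PCC on the held-out accessions."""
    coef, intercept = ridge_fit(X[train], y[train], lam)
    return pcc(y[test], X[test] @ coef + intercept)

# Synthetic stand-ins for the three omics layers (all sizes except n are assumptions).
rng = np.random.default_rng(0)
n, p = 383, 20  # 383 accessions as in the study; 20 features per layer is arbitrary
layers = {name: rng.normal(size=(n, p)) for name in ("G", "T", "M")}

# Synthetic trait: every layer contributes signal, plus noise.
y = sum(X @ rng.normal(size=p) for X in layers.values())
y = y + rng.normal(scale=2.0, size=n)

# Hold out roughly 20% of accessions for testing (split size is an assumption).
idx = rng.permutation(n)
train, test = idx[:306], idx[306:]

# Single-omics models, then an integrated model on concatenated features.
scores = {name: evaluate(X, y, train, test) for name, X in layers.items()}
scores["G+T+M"] = evaluate(np.hstack(list(layers.values())), y, train, test)

for name, score in scores.items():
    print(f"{name}: PCC = {score:.2f}")
```

Concatenating feature matrices is only one simple way to integrate omics layers; it is used here because it keeps the comparison between single-layer and integrated models easy to read.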
Key Findings
Models built from genomic, transcriptomic, or methylomic data alone performed comparably in predicting the six *Arabidopsis* traits. However, the different omics data types identified distinct sets of important genes, indicating that different molecular mechanisms contribute to trait variation. For flowering time, models built on different omics data identified different sets of benchmark genes, and nine additional genes identified as important were experimentally validated, confirming a role for some non-benchmark genes in regulating flowering time. The contribution of genes to flowering time prediction was accession-dependent, with distinct genes contributing in different genotypes. This result shows the limits of relying solely on genes identified in one accession (Col-0) to generalize flowering time prediction across accessions, and the need to consider accession-specific effects. Multi-omics models outperformed single-omics models and revealed a substantial number of feature interactions, both known and novel, between genes at different molecular levels, suggesting a complex interplay between the omics layers and providing deeper insight into existing regulatory networks. Interactions involving the expression levels of SOC1, FT, and FLC provided evidence for accession-specific gene effects and included some previously unreported interactions. The comparison of methylation representations (gbM and ssM) showed that the way methylation data are represented significantly affects both prediction accuracy and the identification of relevant genes. Together, these findings underscore the importance of integrating multi-omics data for accurate prediction of complex traits and for revealing novel aspects of gene regulation.
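The idea of accession-dependent contributions can be illustrated with a simplified linear case. For a linear model, and assuming independent features, the SHAP value of feature j for accession i reduces to coef_j * (x_ij - mean_j). The sketch below applies that formula to synthetic data; the gene names, coefficients, and data are hypothetical, and the study's own SHAP analysis, which also covered RF models, is not reproduced here.

```python
import numpy as np

def linear_contributions(X, coef):
    """Per-accession, per-feature contributions: coef_j * (x_ij - mean_j)."""
    return (X - X.mean(axis=0)) * coef

rng = np.random.default_rng(1)
genes = ["gene_a", "gene_b", "gene_c"]   # hypothetical feature names
X = rng.normal(size=(6, 3))              # 6 synthetic accessions x 3 features
coef = np.array([1.5, -1.0, 0.5])        # hypothetical fitted coefficients

contrib = linear_contributions(X, coef)

# Global view: mean absolute contribution of each feature across accessions.
global_importance = np.abs(contrib).mean(axis=0)

# Local view: the feature with the largest absolute contribution per accession.
top_gene = [genes[j] for j in np.abs(contrib).argmax(axis=1)]

for i, gene in enumerate(top_gene):
    print(f"accession {i}: largest contribution from {gene}")
```

Each row of `contrib` sums to that accession's prediction minus the mean prediction, so the local values decompose the model output exactly, and the top-ranked feature can differ from one accession to the next even though the coefficients are shared.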
Discussion
The findings demonstrate that integrating multi-omics data is a feasible route to a deeper understanding of the molecular mechanisms underlying complex traits. The comparable performance of single-omics models suggests that the different omics layers offer valuable but distinct information about trait variation. The identification of novel genes involved in flowering time regulation extends current knowledge of the regulatory networks. The accession-dependent contributions highlight the complexity of gene interactions and the limits of generalizing from a single accession. The improved performance of the integrated multi-omics models underscores the synergy gained by combining data types, and the interactions they revealed provide hypotheses for future experimental work. Some factors may nonetheless limit how far the results generalize. The use of mixed rosette leaves for omics data collection might introduce cell-type-specific noise, and the temperature difference between the flowering time measurements and the omics data collection could affect the estimated gene contributions.
Conclusion
This study demonstrates the power of integrating multi-omics data to improve the prediction of complex plant traits and to elucidate the underlying molecular mechanisms. The findings highlight the importance of considering accession-specific effects and of comparing different representations of methylation data. Future research should incorporate data from more accessions and environments and explore additional omics layers, such as chromatin accessibility and histone modifications, to further enhance predictive accuracy and biological insight. Single-cell omics data could provide a higher-resolution view of gene regulation and might improve prediction accuracy. Further experimental validation of the newly identified gene interactions is needed to solidify these findings and expand our understanding of plant development.
Limitations
The study's limitations include the use of a single time point for transcriptomic data, which might not fully capture the dynamic nature of gene expression. The temperature difference between flowering time measurement and omics data collection could influence the identified gene contributions. The relatively small number of accessions, although sufficient for the analysis, could limit the statistical power to detect genes with small effects on flowering time. The reliance on existing mutant datasets for validation restricts the range of genes that can be experimentally assessed. The use of mixed leaf samples for omics data collection may introduce noise due to cellular heterogeneity. Finally, the study primarily focuses on *Arabidopsis*, and the generalizability of findings to other plant species requires further investigation.