Biology
Prediction of plant complex traits via integration of multi-omics data
P. Wang, M. D. Lehti-shiu, et al.
Translating genotypes to phenotypes is difficult because complex traits arise from interacting genetic and molecular processes. Although genomic prediction using SNPs is common, other omics layers can also be predictive. Leveraging the Arabidopsis 1001 Genomes resources, the study asks whether integrating genomic (G), transcriptomic (T), and methylomic (M) data improves prediction of six Arabidopsis traits (flowering time, rosette leaf number, cauline leaf number, diameter of rosette, rosette branch number, stem length). The purpose is twofold: assess predictive power of each omics and their combinations, and interpret models to identify genes and interactions underlying trait variation, particularly for flowering time. The study also evaluates whether similarity in omics profiles corresponds to similarity in traits, and whether model interpretation can reveal known and novel regulators and their interactions.
Prior work has shown success of non-genomic omics for trait prediction in plants: transcriptomics for flowering time, yield, and disease resistance; methylomics for flowering time and height; and metabolomics for biomass and yield. Multi-omics datasets aligned with phenotypes are uncommon in plant systems, but the Arabidopsis 1001 Genomes project provides coordinated G, T, and M data. Previous benchmarking indicated rrBLUP and Random Forest often perform well for genomic prediction, with RF enabling non-linear interaction discovery and interpretability. Known flowering regulators such as FLC, FT, SOC1, and FRI are central to flowering pathways, providing benchmarks to validate model-derived importance. DNA methylation dynamics are heritable yet can be confounded by genetic background, warranting careful consideration when using methylomic data for prediction and interpretation.
Data and traits: Six traits for 383 Arabidopsis accessions were compiled: flowering time (10 °C; also examined at 16 °C), rosette leaf number, cauline leaf number, diameter of rosette, rosette branch number, and stem length. Omics collected from mixed rosette leaves just before bolting at 22 °C: genomic biallelic SNPs (G), RNA-seq transcript abundances (T, TPM transformed as ln(TPM+1)), and gene-body methylation (gbM: CG, CHG, CHH) with imputation for missing values.
Similarity matrices: For each trait, a phenotypic similarity matrix (pCor) was computed as 1 − normalized Euclidean distance across accessions. For omics, kinship from G (centered IBS via TASSEL), eCor from T (pairwise PCC normalized 0–1), and mCor from gbM (pairwise PCC normalized 0–1) were generated.
Single-omics predictive modeling: For each omics type and corresponding similarity matrix (G, T, gbM, K, eCor, mCor), models were trained using rrBLUP and Random Forest regressors. Data splits: random 80% train and 20% hold-out test. Within train, repeated 5-fold cross-validation (CV) 10 times was used for model selection and performance estimation. RF hyperparameters tuned by grid search over max_depth ∈ {3,5,10} and max_features ∈ {0.1,0.25,0.5,0.75, sqrt, log2, None}; n_estimators=100. Performance metric: PCC between true and predicted trait values on the held-out test set (PCC_test). Baseline models using population structure (first five PCs of G) were also built.
Methylation feature engineering (ssM): To increase methylation resolution beyond gbM, six single-site-based methylation (ssM) representations were created: presence/absence (P/A) of methylation per site; methylation proportion per site; binned gene-region profiles (upstream, gene body, downstream split into 30 bins) using mean P/A or median proportion; and cluster memberships of per-gene methylation profile vectors (K-means, k=30) computed separately for CG, CHG, CHH. Missing single-site methylation calls were handled via rules considering SNPs and read coverage, with KNN imputation; features with >95% invariant values were removed. Thresholding on site missingness was explored; 50th percentile threshold matrices were used for downstream analyses. ssM models were evaluated for flowering time.
Confounding control: To assess G confounding in gbM similarity, linear regression of mCor on kinship in training accessions was used to obtain mCor residuals; RF models were trained on residuals.
Feature importance and interpretation: Three measures were used: rrBLUP coefficients (absolute), RF Gini importances, and SHAP values from RF (global: average absolute SHAP across instances; local: per-accession SHAP). Genic mapping of G variants enabled gene-level importance comparisons across omics. Genes with importance ≥95th or ≥99th percentile were considered important. Enrichment of known flowering genes was assessed via Fisher’s exact tests.
Benchmark gene models and validation: RF models restricted to features from 426 benchmark flowering time genes (from FLOR-ID and TAIR) were compared with full models and with models built from the top 426 non-benchmark important genes. Experimental validation measured flowering time in loss-of-function mutants (single and double mutants of paralogs) for 21 non-benchmark genes predicted important and for 37 predicted non-important genes in the Col-0 background under long days at ~21 °C; flowering time was normalized within blocks and tested by Wilcoxon rank sum.
Multi-omics integration and interaction analysis: Integrated RF models used concatenated G+T+gbM features. For interaction discovery focused on benchmark genes, one top feature per omics per gene was retained, yielding 1,227 features. SHAP interaction values (TreeExplainer) identified pairwise feature interactions across all omics combinations (G-G, T-T, gbM-gbM, G-T, G-gbM, T-gbM). Clustering of accessions by SHAP profiles of top features examined accession-specific contributions.
- Weak global correspondence between omics similarity and trait similarity: Pearson’s r between pCor and omics similarities ranged from −0.02 to 0.17 across traits and omics; kinship and mCor were more correlated with each other (r=0.43).
- Predictive accuracy using single omics: G-, T-, and gbM-based models achieved comparable PCC_test across traits with both rrBLUP and RF, and models using omics similarity matrices (K, eCor, mCor) performed similarly to those using raw omics features. Population structure models underperformed G/T/gbM models for flowering time, RLN, and CLN but not for DoR, RBN, and SL.
- Methylation representations: Models using combined gbM (CG+CHG+CHH) generally outperformed single-context gbM; CHH alone tended to perform worst. For flowering time, ssM features improved rrBLUP accuracy over gbM (RF did not improve), indicating added value of site-level methylation patterns.
- Feature importance across omics: Importance scores showed weak or no correlation between omics (Spearman’s ρ −0.07 to 0.11) with little overlap of top genes, implying distinct contributing features across G, T, and gbM. Removing kinship confounding from mCor reduced model performance (PCC decreases 0.01–0.18; p<0.05), but residual mCor still outperformed population structure for flowering time and RLN, indicating gbM contains predictive information beyond genetic structure.
- Benchmark flowering genes: Of 426 benchmarks, 169 were identified as important by at least one omics/importance measure. Only FLC and MAF2 were important across all three omics. RF Gini from G and T models significantly enriched benchmarks at 95th percentile (8.78% and 8.62%; p=1.22e-03 and 1.76e-03). Increasing threshold to 99th percentile improved SHAP-based enrichment for G and T models. gbM alone did not show significant enrichment using gbM gene-body features, but ssM models recovered 144 additional benchmarks and single ssM models identified up to 44 benchmarks (vs up to 24 for gbM).
- Example ssM-specific discovery: CSTF77 associated with flowering time via CG methylation at two sites; no significant association via SNPs, expression, or gbM, explaining detection only by ssM.
- Environmental/context effects: Temperature at scoring affected gene importance; FRI had SHAP=0 at 10 °C but SHAP=0.075 (rank 14) at 16 °C. RLN and CLN models identified overlapping and distinct benchmark genes relative to flowering time, consistent with trait proxy differences (e.g., TFL1 specific to CLN).
- Experimental validation: Of 21 predicted important non-benchmark genes tested, 9 affected flowering time (6 single mutants; 3 only in double mutants with paralogs). Among 37 predicted non-important genes, 16 (43.2%) showed flowering phenotypes, suggesting accession/environment dependence and limited power contribute to false negatives/positives.
- Accession-dependent contributions: Clustering accessions by local SHAP values of top T features revealed eight clusters with distinct contributions of SOC1, FT, FLC, SPL15, and others; SOC1/FT/FLC often showed coupled or decoupled effects across accessions, explaining variation in flowering times.
- Multi-omics integration: RF models improved with G+T and G+T+gbM integration (significant vs best single-omics RF; p=4.87e-04 and p=1.08e-05), whereas rrBLUP did not improve, implicating non-linear cross-omics interactions.
- SHAP interaction analysis (benchmark-focused model): Identified 7,186 feature interactions spanning all omics pairings: T-gbM (1,906), G-T (1,654), T-T (1,343), G-gbM (1,174), gbM-gbM (623), G-G (486). SOC1, FT, and FLC T-features had the most interactions (707, 638, 279). Top interactions included known (SOC1–FT, SOC1–MIR172B, SOC1–SPL5, SOC1–FLC) and putative novel interactions (e.g., SOC1 with PIF3, SOC1 with ZTL [T–gbM], SOC1 with NUA [T–G], and various G–G, gbM–gbM, G–gbM pairs), generating hypotheses about regulatory cross-talk.
Comparable predictive performance of G, T, and M indicates that each omics layer contains complementary information about complex traits. Distinct important genes across omics suggest that relying on a single data type can miss relevant biology, while integrating data enables discovery of non-linear interactions that improve predictions. Interpretable models (RF with SHAP) recovered core flowering regulators (FLC, FT, SOC1, SPL15) and revealed accession-specific contributions and coupling/decoupling of key regulators, underscoring genetic background effects and complex network behavior. Enhanced methylation representations (ssM) uncovered additional benchmark genes not captured by gene-body summaries, illustrating the value of richer feature engineering. Environmental context (e.g., temperature) alters gene importance, consistent with known G×E effects on flowering pathways (e.g., FRI vernalization dependence). Multi-omics integration improved RF performance and exposed cross-omics interactions consistent with known regulatory relationships and novel hypotheses, providing a route to refine flowering time regulatory networks beyond GWAS-identified QTLs.
This study demonstrates that integrating genomic, transcriptomic, and methylomic data can accurately predict plant complex traits and, via interpretable machine learning, reveal both known and previously unrecognized genes and interactions underlying flowering time. Single-omics models perform similarly but highlight different contributors; integrated RF models leveraging non-linear interactions provide the best performance and mechanistic insights. Enriched methylation representations (ssM) substantially increase discovery of relevant genes. Experimental validation confirmed nine non-benchmark regulators of flowering time. Future work should expand accession numbers, incorporate multi-environment measurements, and add further omics (chromatin architecture, accessibility, histone marks), as well as single-cell and spatial omics to resolve tissue and cell-type specificity. Improved feature selection and representation, and modeling frameworks explicitly accounting for G×E and accession-specific effects, will further enhance prediction and biological inference.
- Tissue and developmental context mismatch: Omics measured in rosette leaves before bolting at 22 °C, while flowering time was scored at 10 °C (and 16 °C in a secondary analysis), likely affecting gene importance and reducing enrichment of benchmarks.
- Cell-type heterogeneity: Bulk T and M data obscure cell-specific signals, adding noise and potentially diluting predictive features.
- Sample size and feature dimensionality: Many more features than instances (e.g., >20,000 expression features vs 383 accessions) reduce power to detect moderate contributors and increase false negatives; importance ranks may be unstable for weaker effects.
- Methylation confounding: gbM correlates with kinship; although residualization reduced confounding, it also reduced predictive performance, complicating interpretation of methylation effects.
- Trait predictability heterogeneity: DoR showed heterogeneity between train/test; RBN and SL had low predictability, likely driven by environment or G×E interactions.
- Generalizability: Accession-dependent effects imply that discoveries from one genetic background (e.g., Col-0) may not transfer to others; validation in diverse backgrounds is needed.
Related Publications
Explore these studies to deepen your understanding of the subject.

