A data-driven simulation platform to predict cultivars' performances under uncertain weather conditions

G. de los Campos, P. Pérez-Rodríguez, et al.

This study by Gustavo de los Campos, Paulino Pérez-Rodríguez, Matthieu Bogard, David Gouache, and José Crossa presents a computer simulation platform that forecasts the performance of crop cultivars under uncertain weather conditions by combining field trial data, DNA sequences, and historical weather records. Monte Carlo methods are used to integrate over uncertainty in both future weather and model parameters.

Introduction

The study addresses the long-standing challenge of predicting crop cultivar performance under largely uncertain future weather conditions, a problem driven by substantial genotype-by-environment (G×E) interactions. Traditional multi-year, multi-location trial networks are limited in their ability to expose genotypes to the full range of weather variability, making selection and agronomic recommendations difficult. The authors propose a data-driven simulation framework that leverages field trial phenotypes, genomic data (SNPs), and environmental covariates derived from weather and crop models to learn G×E patterns and to simulate cultivar performance distributions for target locations while explicitly integrating uncertainty about future weather and model parameters via Monte Carlo methods. The goal is to provide more precise forecasts of cultivar means and stability across locations than those obtained from limited-year trials alone.

Literature Review

The work builds on genomic prediction models that integrate DNA sequence data with environmental covariates, including reaction norm models such as Jarquín et al. (2014), and related approaches integrating crop growth models with whole-genome prediction. Classical methods for G×E include fixed- and mixed-effects linear-bilinear and factor-analytic models, AMMI/SREG, and reaction norms. Prior simulation studies (e.g., Chenu et al., 2009) used crop models with assumed parameters and selected scenarios; here, the authors advance a heavily data-driven approach using real genotypes, historical weather, and learned G×E patterns, accounting for parameter uncertainty. The concept of breeding for target environments is noted, but prediction is hampered by large year and year-by-location variance components, motivating methods that incorporate weather uncertainty via historical records.

Methodology

Data: 25,841 wheat grain yield records (standardized to 15% moisture) from France (1998–2014) across 242 locations and 752 year-location trials, with 481 registered cultivars. Trials were unreplicated within year-location and managed for optimal yield.
Genotyping: Axiom HD array; after QC (MAF > 0.05, call rate > 0.8, < 10% heterozygosity, mean imputation), 167,440 SNPs remained.
Environmental covariates: 106 ECs were derived using an Arvalis crop model based on predicted phenology and daily weather (temperature, radiation, rainfall), summarizing critical temperature, radiation, and water availability across eight phenological phases. Validation of ECs: regression of trial means on ECs in cross-validation yielded an average testing correlation of 0.600 (±0.04).
Models: Bayesian mixed models fitted with BGLR. The baseline TL model has random effects of year (Y), location (L), year×location (YL), and cultivar (V). The GW model replaces V with genomic random effects V ~ N(0, Gσ_V²) and replaces Y, L, YL with environmental random effects w ~ N(0, Ωσ_w²), where G is the SNP-derived relationship matrix and Ω is the EC-derived similarity matrix. GW-G×W adds SNP×EC interactions via a Gaussian random effect with covariance proportional to the Hadamard product G∘Ω. To better capture environmental variance, combined models reintroduced Y, L, and YL: TGW (Y, L, YL + SNPs + ECs) and TGW-G×W (the full model, adding SNP×EC interactions).
Cross-validation: 10-fold CV leaving entire trials (year-locations) out; within-trial correlations between observed and predicted yield were assessed. An additional CV assigned cultivars to folds (prediction of untested cultivars within year-location).
Simulation platform: Using the fitted full TGW-G×W model, posterior sampling via MCMC (100,000 samples thinned to 20,000) provided parameter uncertainty. For 28 well-represented cultivars and 16 target French wheat locations, 16 years (2000–2015) of historical weather were used to derive ECs.
For each genotype–location–year–posterior sample combination, the prediction function f(i,j,k,θ_s) = μ + L_j + V_k + w_ij + VW_ijk was evaluated, using only terms learnable from past data or predictable from SNPs and ECs (year and year-location effects not predictable from ECs were excluded). This generated ~143.4 million simulated data points (28 cultivars × 16 locations × 16 years × 20,000 posterior samples = 143,360,000). Downstream analyses included heatmaps and SREG biplots of predicted means, and Finlay–Wilkinson (FW) regressions on raw means, BLUEs, and simulated means (averaged over 16 years) to assess mean performance and stability, with two-fold cross-validation used to assess the stability of the FW slope estimates.
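
The covariance structures described under Models can be illustrated with a minimal sketch. It assumes that G and Ω are built as standardized cross-product kernels, which is one common choice and not necessarily the exact construction used in the study. All data below are random toy values, and the names (standardized_kernel, cultivar_of, trial_of) are hypothetical.

```python
import numpy as np

def standardized_kernel(X):
    """Similarity matrix Z Z' / p from column-standardized features X (rows = items)."""
    X = X[:, X.std(axis=0) > 0]               # drop constant columns (e.g. monomorphic SNPs)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)  # center and scale each column
    return Z @ Z.T / Z.shape[1]               # average diagonal equals 1

rng = np.random.default_rng(0)
n_cultivars, n_snps = 6, 300   # toy sizes; the study had 481 cultivars and 167,440 SNPs
n_trials, n_ecs = 4, 106       # toy trial count; 106 ECs as in the study

snps = rng.integers(0, 3, size=(n_cultivars, n_snps)).astype(float)  # fake 0/1/2 genotypes
ecs = rng.normal(size=(n_trials, n_ecs))                             # fake covariates

G_cultivar = standardized_kernel(snps)   # cultivar-by-cultivar relationship matrix
Omega_trial = standardized_kernel(ecs)   # trial-by-trial environmental similarity

# One record per cultivar-trial pair; expand both kernels to record level.
cultivar_of = np.repeat(np.arange(n_cultivars), n_trials)
trial_of = np.tile(np.arange(n_trials), n_cultivars)
G = G_cultivar[np.ix_(cultivar_of, cultivar_of)]
Omega = Omega_trial[np.ix_(trial_of, trial_of)]

# Hadamard (element-wise) product: covariance structure of the SNP×EC interaction term.
GxW = G * Omega
```

Under this construction, two records covary through the interaction term only to the extent that their cultivars are genetically similar and their trials are environmentally similar at the same time.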

Key Findings

Data characteristics: Average grain yield 9.49 ± 1.50 t/ha; ECs predicted trial means with average testing correlation 0.600 (±0.04).
Variance decomposition (TL baseline): Environmental differences (Y, L, Y×L) explained ~83% of total variance; about half of between-trial variance was Y×L. Cultivar main effects explained ~8% of total variance and ~50% of within-trial variance.
GW model: Genomic term captured ~83% of cultivar variance (0.158/0.191). ECs captured ~44% of between-trial variance [0.892/(0.289+0.780+0.944)], indicating ~56% missing “environmentability.”
GW-G×W: Adding SNP×EC interactions reduced error variance by ~5% relative to GW (0.074/1.300) but still under-captured environmental variance compared to TL.
TGW and TGW-G×W: Full model (TGW-G×W) recovered environmental variance similar to baseline while retaining SNP capture of nearly all genetic variance and a sizable fraction of environmental variance via ECs.
Predictive performance (leave-trial-out CV): Within-trial correlation highest for TGW-G×W at 0.58, exceeding simpler models (significant differences at α=0.01). Predicting untested cultivars (CV by cultivar) yielded lower accuracy; best models (GW-G×W, TGW-G×W) achieved average correlation ~0.25, baseline slightly negative.
Simulation outputs: Generated ~143.4 million simulated yield values (28 cultivars × 16 locations × 16 years × 20,000 posterior samples = 143,360,000). Predicted mean yields identified clusters of high-performing cultivars (e.g., Cellule, Rubisko, Barok, Pakito). Simulated distributions quantified uncertainty across locations; Thizay and Montans showed greater yield uncertainty than Crenay and Estrées-Mons.
Finlay–Wilkinson stability: Raw means suggested a wide slope range (~0.8–1.4), BLUEs reduced variability, and simulated means further reduced slope variability with more precise estimates, indicating small genotype-by-location variance at location-level means after averaging over years. High, stable performers across locations included Cellule (intercept ≈ 10.40, slope ≈ 1.03) and Rubisko (intercept ≈ 10.40, slope ≈ 1.02). FW slope CV correlations: raw 0.342 (±0.015), BLUEs 0.447 (±0.012), simulated 0.502 (±0.016); all pairwise differences significant (Holm-adjusted p<0.001).
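
For readers unfamiliar with Finlay–Wilkinson regression, the following is a minimal two-step sketch: each cultivar's location means are regressed on a centered environmental index (the location mean across cultivars). The function name and the toy numbers are hypothetical and are not taken from the study; the authors' estimation procedure may differ.

```python
import numpy as np

def finlay_wilkinson(means):
    """Two-step FW regression.

    means: array of shape (n_cultivars, n_locations) holding mean yields.
    Returns (intercepts, slopes), one value per cultivar.
    """
    env_index = means.mean(axis=0)      # location mean across cultivars
    h = env_index - env_index.mean()    # centered environmental index
    intercepts = means.mean(axis=1)     # with centered h, the OLS intercept is the row mean
    slopes = (means @ h) / (h @ h)      # OLS slope; valid because h sums to zero
    return intercepts, slopes

# Toy, noise-free example with known intercepts and slopes (hypothetical values).
true_intercepts = np.array([10.4, 9.8, 9.2])
true_slopes = np.array([1.0, 1.2, 0.8])            # average slope is 1 by construction
location_effect = np.array([-1.0, -0.5, 0.5, 1.0]) # centered location effects
toy_means = true_intercepts[:, None] + true_slopes[:, None] * location_effect[None, :]

intercepts, slopes = finlay_wilkinson(toy_means)
```

A slope near 1 indicates average responsiveness to environmental quality; combined with a high intercept, it describes a cultivar that yields well and behaves predictably across locations.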

Discussion

The findings demonstrate that integrating genomic data and environmental covariates with historical weather, within a Bayesian framework, can learn cultivar-specific reaction norms and generate predictive distributions of performance that account for uncertainty in both future weather and model parameters. Because a large portion of environmental variance is due to year and year-by-location effects, forecasts based on limited years are inherently noisy. By averaging over many likely weather realizations and parameter samples, the simulation platform produces smoother, more precise estimates of cultivar means and stability at target locations than raw trial means or BLUEs. The approach enables targeted variety recommendations and identification of broadly adapted, high-yielding genotypes. Nevertheless, predictive accuracy for entirely untested cultivars remains limited unless close relatives are present, reflecting reliance on genomic relatedness. The platform should be applied within the genotype–environment space represented in training data to avoid extrapolation. Extensions include regional aggregation of simulations and sensitivity analyses to climate change by over-sampling adverse weather years.
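
The averaging over weather years and posterior samples described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the predictor keeps only an intercept, a cultivar effect, an EC-driven environmental effect, and a cultivar-by-environment interaction, and every "posterior draw" is a random stand-in rather than output from a fitted model.

```python
import numpy as np

rng = np.random.default_rng(1)
K, J, I, S = 28, 16, 16, 200   # cultivars, locations, weather years, posterior draws (toy S)

print(28 * 16 * 16 * 20_000)   # grid size with 20,000 draws: 143360000

# Stand-in posterior draws (random toy values, not fitted to any data).
mu = 9.5 + rng.normal(0, 0.05, size=(S, 1, 1, 1))   # intercept
V = rng.normal(0, 0.3, size=(1, K, 1, 1)) + rng.normal(0, 0.1, size=(S, K, 1, 1))  # cultivar
w = rng.normal(0, 0.6, size=(1, 1, J, I)) + rng.normal(0, 0.1, size=(S, 1, J, I))  # EC effect
VW = rng.normal(0, 0.2, size=(S, K, J, I))          # cultivar-by-environment interaction

# Evaluate the predictor on the full grid: shape (S, K, J, I).
sim = mu + V + w + VW

# Integrate over weather years and posterior draws for each cultivar-location pair.
pred_mean = sim.mean(axis=(0, 3))                            # predictive mean, shape (K, J)
lower, upper = np.quantile(sim, [0.05, 0.95], axis=(0, 3))   # 90% predictive interval
```

The spread between lower and upper reflects both year-to-year weather variation and parameter uncertainty, which is what allows locations to be compared on yield uncertainty as well as on mean yield.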

Conclusion

The study introduces a data-driven Monte Carlo simulation platform that leverages field trial phenotypes, dense SNP genotypes, and environmental covariates derived from historical weather to predict cultivars’ future performance distributions under uncertain conditions. The full TGW-G×W model achieved the best leave-trial-out predictive accuracy and, when coupled with historical weather, delivered more precise cultivar-level mean and stability assessments across locations than conventional analyses. The simulated outputs facilitate robust selection and recommendation decisions by smoothing variability attributable to year and year-location effects. Future work can apply the framework at regional scales, explore climate change scenarios by reweighting weather histories, and consider alternative modeling choices (e.g., crop models) within the same Bayesian simulation paradigm.

Limitations

Environmental covariates captured only ~44% of between-trial variance in GW models, indicating missing environmentability due to unmodeled factors (e.g., soils, management) and model limitations (e.g., non-linearities, EC interactions).
Predictions for untested cultivars are of limited accuracy unless close genetic relatives are in training; at least one year of testing per cultivar was suggested for accurate within-year-location prediction.
The simulation framework should not be used to extrapolate beyond the genotype/environment space observed during training; simulations excluded year and year-by-location effects not predictable from ECs.
Data availability constraints (proprietary datasets) may limit external validation and generalizability to other contexts without similar data richness.
