Agriculture
A data-driven crop model for maize yield prediction
Y. Chang, J. Latham, et al.
Yanbin Chang, Jeremy Latham, Mark Licht, and Lizhi Wang present a data-driven crop model that combines a process-based structure with data-driven calibration to predict maize yield. Calibrated on county-level US Corn Belt data, the model is also used to examine genotype-by-environment effects and to estimate the yield gains available from better seed selection.
Introduction
The study addresses the central challenge of accurately predicting crop yield—critical for food security under climate change—by modeling how phenotype (yield) is determined by genotype, environment, and their interactions. Existing approaches include process-based crop models (e.g., APSIM, DSSAT), which are explainable but require extensive experimental calibration and detailed inputs, and data-driven machine learning models, which can achieve high predictive accuracy but are often black boxes and sensitive to data quality and quantity. The research question is whether a hybrid, data-driven crop model can combine the explanatory strengths of process-based models with the predictive power and flexibility of data-driven methods, enabling accurate, transferable, and interpretable maize yield predictions across space and time without experimental calibration.
Literature Review
Process-based models (APSIM, DSSAT) simulate crop growth as a function of weather, soil, and management. They yield mechanistic insights and experimentally verifiable outputs, but they carry a heavy calibration burden, generalize poorly, and depend on spatially detailed inputs that are often sparse. Data-driven methods (linear/PLS regression, random forests, CNNs, deep neural networks) and remote sensing indicators (GPP, NDVI, EVI) have delivered strong predictive performance and can capture complex genotype–environment–management interactions, but they are sensitive to data quality and quantity, hard to interpret, and difficult to transfer to new settings. Recent efforts to integrate process-based and data-driven models (e.g., Bayesian model averaging ensembles, APSIM with regression/ML hybrids) have improved predictions. However, many conventional models conflate genotype and environment in their parameters, require fixed and often unavailable inputs, and rely on experimental calibration. Against this background, the paper proposes a data-driven crop model that defines genotypic parameters independently of environment, uses a modular design adaptable to the available data, and calibrates parameters from historical observations.
Methodology
Study region and data: County-level maize yield data for the US Corn Belt (1981–2020) from USDA-NASS were compiled (47,710 county-year records after excluding missing yields). County shapefiles (NWS) were used for geographic aggregation and mapping to weather stations and soil units. Daily weather (1-km) came from Daymet. Management data (planting/harvest dates; plant population density) were from USDA; missing population densities were imputed using state-level means over non-missing years. Soil data were from gSSURGO (July 2020 release). Assumptions: (1) Genotype proxy—each county-year’s seed portfolio is represented by a unique genotype with a set of genotypic parameters; (2) No irrigation and adequate fertilizer availability (due to lack of high-resolution irrigation/fertilization data), though the framework can incorporate these when available; (3) Some factors (e.g., waterlogging) assumed uniform within a county.
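The imputation of missing plant population densities can be illustrated with a short sketch. This is not the authors' code: the record layout, the field names (`state`, `year`, `density`), and the sample values are hypothetical, and the only thing taken from the text is the rule itself (fill a gap with the state-level mean over non-missing years).

```python
from collections import defaultdict

def impute_state_mean(records):
    """Fill missing densities with the mean of the non-missing values for the same state."""
    totals = defaultdict(lambda: [0.0, 0])  # state -> [sum, count]
    for r in records:
        if r["density"] is not None:
            totals[r["state"]][0] += r["density"]
            totals[r["state"]][1] += 1
    filled = []
    for r in records:
        density = r["density"]
        if density is None:
            total, count = totals[r["state"]]
            # Leave the gap in place if the state has no observed value at all.
            density = total / count if count else None
        filled.append({**r, "density": density})
    return filled

# Hypothetical records (plants per hectare); None marks a missing value.
records = [
    {"state": "IA", "year": 2018, "density": 78000.0},
    {"state": "IA", "year": 2019, "density": None},
    {"state": "IA", "year": 2020, "density": 80000.0},
    {"state": "NE", "year": 2020, "density": None},
]
filled = impute_state_mean(records)
```

The function returns new records rather than modifying the input, so the original gaps remain visible for auditing.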
Descriptive crop model framework: The model mechanistically describes daily growth and biomass allocation across modules: (i) Soil water balance (precipitation, runoff, crop uptake, evaporation); (ii) Water uptake proportional to root mass and vapor pressure deficit; (iii) Radiation interception proportional to LAI; (iv) Phenology clock with vegetative and reproductive stages, transitioning at a genotype-specific growing degree day threshold; (v) Daily biomass and metabolism determined by water uptake, solar radiation, leaf weight, and stress; (vi) Stress module for heat, drought, flooding—reducing available soil water, radiation use efficiency, and growth; (vii) Crop organ allocation—during vegetative stage biomass goes to leaves, roots, and other organs; during reproductive stage grain filling begins and leaf/root growth ceases.
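Modules (iv) and (vii) can be sketched together as a phenology clock that switches the allocation rule once accumulated growing degree days reach a genotype-specific threshold. The sketch below is illustrative only. The growing degree day formula (base 10 °C, cap 30 °C) is a common maize convention and is not stated in the summary; the allocation fractions, the threshold, the daily biomass gains, and the choice to send all reproductive-stage gain to grain are hypothetical simplifications.

```python
def growing_degree_days(t_min, t_max, t_base=10.0, t_cap=30.0):
    """Daily GDD with both temperatures clipped to [t_base, t_cap] (assumed convention)."""
    t_min = min(max(t_min, t_base), t_cap)
    t_max = min(max(t_max, t_base), t_cap)
    return (t_min + t_max) / 2.0 - t_base

def simulate(daily_temps, daily_biomass_gain, gdd_threshold):
    """Accumulate GDD and switch from vegetative to reproductive allocation at the threshold."""
    organs = {"leaf": 0.0, "root": 0.0, "other": 0.0, "grain": 0.0}
    gdd = 0.0
    for (t_min, t_max), gain in zip(daily_temps, daily_biomass_gain):
        gdd += growing_degree_days(t_min, t_max)
        if gdd < gdd_threshold:
            # Vegetative stage: hypothetical fixed split among leaves, roots, and other organs.
            organs["leaf"] += 0.4 * gain
            organs["root"] += 0.3 * gain
            organs["other"] += 0.3 * gain
        else:
            # Reproductive stage: leaf and root growth stops and grain filling begins.
            organs["grain"] += gain
    return gdd, organs

temps = [(15.0, 25.0)] * 4      # each day contributes 10 GDD
gains = [1.0, 1.0, 1.0, 1.0]    # arbitrary biomass units per day
total_gdd, organs = simulate(temps, gains, gdd_threshold=30.0)
```

In the model described above, the daily gain would itself depend on water uptake, radiation, leaf weight, and stress; here it is supplied as an input to keep the phenology logic in view.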
Predictive modeling framework: The model predicts county-year yield via y_{c,t} = f(W_{c,t}, M_{c,t}, S_c, g_{c,t}, s_c), where W is weather, M is management, S is soil, g is the genotypic parameter vector for county c and year t (independent of environment), and s denotes soil parameters. Function f encodes the physiology-based relationships detailed in Supplementary materials.
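Parameter calibration as optimization: Genotypic parameters g are calibrated from historical data by minimizing the planting-area-weighted RMSE between predicted yields y_{c,t} and observed yields θ_{c,t}. The objective below is the equivalent weighted mean squared error (the square root does not change the minimizer), with M_{c,t} here serving as the planting-area weight: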
min_{g} [ Σ_{t} Σ_{c} (M_{c,t})(y_{c,t} − θ_{c,t})^2 ] / [ Σ_{t} Σ_{c} (M_{c,t}) ]
subject to y = f(W, M, S, g), and smoothness/gradual-change constraints on genotypes: g_{i,t} ≤ 1.25 g_{i,t−1} (temporal) and g_{i,t} ≤ 1.25 g_{i−1,t} (spatial), reflecting gradual genetic change and reducing overfitting. A heuristic algorithm solves the calibration (details in Supplementary Note 3). The approach differs from neural networks by leveraging a physiology-guided structure with a small number of biologically meaningful parameters, enhancing interpretability and reducing data sensitivity while still learning quantitative rates from data.
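The calibration objective and the gradual-change constraint can be written down in a few lines. This is a sketch under assumptions, not the heuristic from Supplementary Note 3: the dictionary layout keyed by (county, year), the function names, and the sample numbers are hypothetical. The objective as written is a weighted mean squared error, and its square root gives the weighted RMSE.

```python
def weighted_mse(pred, obs, area):
    """Planting-area-weighted mean squared error over (county, year) keys."""
    num = sum(area[k] * (pred[k] - obs[k]) ** 2 for k in obs)
    den = sum(area[k] for k in obs)
    return num / den

def satisfies_temporal_constraint(g_by_year, ratio=1.25):
    """Check g_t <= ratio * g_{t-1} for consecutive years of one genotypic parameter."""
    return all(later <= ratio * earlier
               for earlier, later in zip(g_by_year, g_by_year[1:]))

# Hypothetical predicted and observed yields (Mg/ha) and planted areas.
pred = {("A", 2019): 10.0, ("B", 2019): 9.0}
obs = {("A", 2019): 11.0, ("B", 2019): 9.0}
area = {("A", 2019): 3.0, ("B", 2019): 1.0}

objective = weighted_mse(pred, obs, area)   # (3 * 1 + 1 * 0) / 4
weighted_rmse = objective ** 0.5
```

A calibration routine would evaluate candidate parameter vectors with `weighted_mse` and reject those for which the constraint check fails. The spatial constraint has the same form, applied across neighboring indices instead of consecutive years.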
Computational setup: Python implementation run on a laptop (Intel i7-10750H, 2.60 GHz, 16 GB RAM).
Key Findings
Training accuracy: Calibrated genotypic parameters achieved training RMSE 0.74 Mg/ha. Relative RMSE (RRMSE) with respect to average 2020 yield (10.34 Mg/ha) was 7.16%. Accuracy in the last decade exceeded earlier decades; low-accuracy years coincided with extreme weather (e.g., 1993 flood, 2012 drought). Benchmarking: Comparable to state-of-the-art models—deep learning models reported RMSEs of 0.67 and 0.72 Mg/ha; an APSIM+ML hybrid achieved 0.69 Mg/ha.
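The relative RMSE figures reported here follow from dividing each RMSE by the 2020 average yield of 10.34 Mg/ha. The short sketch below reproduces that arithmetic; the function names are illustrative, and the input values are the ones quoted in the text.

```python
import math

def rmse(predicted, observed):
    """Unweighted root mean squared error between two equal-length sequences."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))

def rrmse_percent(rmse_value, reference_yield):
    """RMSE expressed as a percentage of a reference yield."""
    return 100.0 * rmse_value / reference_yield

REFERENCE_YIELD_2020 = 10.34  # Mg/ha, as stated in the text

training_rrmse = round(rrmse_percent(0.74, REFERENCE_YIELD_2020), 2)
spatial_rrmse = round(rrmse_percent(1.17, REFERENCE_YIELD_2020), 2)
temporal_rrmse = round(rrmse_percent(1.15, REFERENCE_YIELD_2020), 2)
```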
Spatial extrapolation: Leave-one-county-out within test states. Average test RMSE 1.17 Mg/ha; RRMSE 11.32%. Benchmark nearest-county approach RMSE 1.44 Mg/ha; RRMSE 13.93%. Training RMSE for same splits 0.83 Mg/ha; RRMSE 8.03%. Highest errors in Nebraska and Kansas likely due to unmodeled irrigation (model assumes zero irrigation).
Temporal extrapolation: Leave-one-year-out tests (1981–2020). Average test RMSE 1.15 Mg/ha; RRMSE 11.12%. Benchmark nearest-year approach RMSE 1.55 Mg/ha; RRMSE 14.99%. Training RMSE 0.71 Mg/ha; RRMSE 6.87%. Benchmark struggled in drought/flood years (1983, 1988, 1993, 2012); model improved performance in these years though 1993 remained challenging.
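A leave-one-year-out evaluation of a nearest-year benchmark can be sketched for a single county as follows. The summary does not define the benchmark precisely, so this assumes it predicts a held-out year with the observed yield of the closest remaining year, breaking ties in favor of the earlier year. The yield values are made up.

```python
import math

def nearest_year_prediction(yields_by_year, held_out):
    """Predict the held-out year with the yield of the closest other year (earlier year wins ties)."""
    others = [y for y in yields_by_year if y != held_out]
    nearest = min(others, key=lambda y: (abs(y - held_out), y))
    return yields_by_year[nearest]

def leave_one_year_out_rmse(yields_by_year):
    """Hold out each year in turn and score the benchmark prediction against the observation."""
    errors = [nearest_year_prediction(yields_by_year, year) - observed
              for year, observed in yields_by_year.items()]
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Hypothetical single-county yields in Mg/ha, with a drought-like dip in 2012.
yields = {2010: 9.0, 2011: 9.5, 2012: 7.0, 2013: 10.0}
benchmark_rmse = leave_one_year_out_rmse(yields)
```

The dip year dominates the error both when it is held out and when it serves as the nearest neighbor for 2013, which is consistent with the observation that the benchmark struggles in extreme years.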
Genotype-by-environment (G×E) what-if analysis: Heatmap across all historical genotypes and environments showed: (i) 2012 drought severely limited yields regardless of genotype; (ii) Post-2009 genotypes substantially outperformed earlier genotypes when applied to earlier-year environments, indicating genetic improvement.
Prescriptive analysis—optimal seed selection: Across 2011–2020, observed average yield was 9.72 Mg/ha. Optimizing genotype selection with unknown future weather (historical-average expectation) increases yield by 0.38 Mg/ha (3.91%). With perfect knowledge of weather, gain is 1.73 Mg/ha (17.59%). Benefits trend upward over time due to expanding historical genotype pool.
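The two selection scenarios differ only in what is maximized: the average yield over historical weather when the coming season is unknown, or the yield under the realized weather when it is known. A minimal sketch follows, assuming a precomputed table of simulated yields. In the model these values would come from evaluating f for each genotype and environment; the genotype labels, weather labels, and numbers here are hypothetical.

```python
def best_genotype_expected(yield_table):
    """Pick the genotype with the highest mean simulated yield over historical weather years."""
    def mean_yield(genotype):
        values = yield_table[genotype].values()
        return sum(values) / len(values)
    return max(yield_table, key=mean_yield)

def best_genotype_known_weather(yield_table, weather_year):
    """Pick the genotype with the highest simulated yield under one known weather year."""
    return max(yield_table, key=lambda genotype: yield_table[genotype][weather_year])

# Hypothetical simulated yields (Mg/ha): genotype -> weather scenario -> yield.
yield_table = {
    "g_2005": {"wet": 9.0, "dry": 7.0},
    "g_2015": {"wet": 11.0, "dry": 6.0},
}

choice_unknown = best_genotype_expected(yield_table)           # mean 8.5 beats mean 8.0
choice_dry = best_genotype_known_weather(yield_table, "dry")   # 7.0 beats 6.0
```

The gap between the two choices in the dry scenario shows why perfect weather knowledge yields a larger gain than planning against the historical average.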
Discussion
The findings demonstrate that a physiology-informed, data-calibrated crop model can deliver accurate and explainable maize yield predictions while generalizing across space and time. By defining genotypic parameters independent of environmental effects and learning them from historical data, the model captures G×E interactions and supports counterfactual analyses (e.g., swapping genotypes across environments). Compared with purely data-driven baselines, it offers interpretability and reduced sensitivity to data scarcity; compared with traditional process-based models, it avoids resource-intensive experimental calibration and rigid input requirements. The model outperformed simple benchmarks for spatial and temporal extrapolation and revealed actionable insights, such as the limited mitigability of the 2012 drought by available genetics and the sizable potential gains from genotype optimization. These results support the central hypothesis that integrating process knowledge with data-driven calibration yields competitive accuracy with scientific explainability and prescriptive utility.
Conclusion
This work introduces a data-driven crop model for maize that blends process-based structure with data-driven calibration of biologically meaningful, environment-independent genotypic parameters. On Corn Belt data (1981–2020), the model achieved competitive training accuracy (RRMSE 7.16%) and strong spatial and temporal extrapolation performance (RRMSEs 11.32% and 11.12%). It uncovered G×E insights and enabled prescriptive analyses, indicating potential yield gains of 3.91% without, and 17.59% with, perfect weather knowledge through optimal genotype selection. Future research directions include: (1) extending and validating the framework for other crops; (2) conducting comprehensive case studies with higher-resolution, more complete inputs (e.g., irrigation, fertilization, detailed management); (3) experimental validation of optimal seed selection recommendations; and (4) leveraging learned genotypic parameters to inform plant breeding.
Limitations
- Prediction errors increase during extreme weather years (1983, 1988, 1993, 2012), suggesting the need to refine the stress module.
- Assumptions include no irrigation and adequate fertilization due to data limitations; lack of irrigation data particularly affected predictions in heavily irrigated states (e.g., Nebraska, Kansas).
- Some factors (e.g., waterlogging) and management variables are assumed uniform within counties; several inputs (plant population density, planting/harvest dates) are available only at reduced spatial/temporal resolution with imputation used for missing values.
- Transferability of the specific modeling structure to other crops is limited; each species requires tailored physiological modules.
- Reliance on public datasets with missing values and varying quality may affect generalizability; performance could improve with richer, higher-resolution genotype, irrigation, and fertilization data.