Introduction
Predicting crop yields accurately is essential for ensuring food security, especially given the challenges posed by climate change. Accurate predictions enable informed decision-making by farmers, contributing to famine prevention and global food security. Existing approaches primarily fall into two categories: process-based models and data-driven machine learning models. Process-based models, like APSIM and DSSAT, leverage physiological understanding of crop growth to model yield as a function of weather, soil, and management practices. While explainable, these models are challenging to calibrate and generalize across diverse varieties and environments due to the large number of parameters and the variability in environmental conditions. Data-driven models, conversely, use machine learning techniques to directly approximate the relationship between inputs (genotype and environment) and output (crop yield), often achieving high predictive accuracy. However, these models lack the interpretability of process-based approaches and are sensitive to data quality and quantity. This research proposes a novel data-driven crop model that integrates the strengths of both approaches, aiming to provide both accurate predictions and scientific insights.
Literature Review
The literature on crop yield prediction is extensive, with approaches ranging from simple linear regression to sophisticated deep learning techniques. Process-based models such as APSIM and DSSAT are widely used and offer mechanistic explanations for yield variations, but calibrating their parameters is resource-intensive and the results often generalize poorly. Data-driven models, including multiple linear regression, partial least squares regression, random forests, and deep neural networks, have demonstrated strong predictive power, yet they frequently lack transparency and are sensitive to data limitations. Hybrid approaches that integrate process-based and data-driven modeling have recently emerged as a promising direction: studies have combined Bayesian averaging with process-based ensembles, integrated APSIM with statistical regression models, and coupled APSIM with machine learning to improve yield prediction accuracy. This study builds on these advances, aiming to overcome the limitations of both purely process-based and purely data-driven approaches.
Methodology
The proposed data-driven crop model for maize integrates a process-based framework with data-driven parameter estimation. The model comprises modules representing key aspects of maize growth: soil water dynamics, water uptake, radiation interception, phenology, daily biomass and metabolism, and stress factors (heat, drought, flooding). The structure is modular, so the number of modules can be adjusted to the data available. Genotypic parameters for each variety are defined independently of environmental effects, which improves transferability to different environments. Instead of experimental calibration, a machine learning approach calibrates these parameters from historical data. Calibration minimizes the weighted root-mean-square error (RMSE) between predicted and observed yields, subject to constraints that limit the temporal and spatial variability of genotypic parameters, and a heuristic algorithm is used for parameter estimation. The model uses county-level data from the US Corn Belt (1981-2020): corn yield, geographic location, genotype, daily weather data from Daymet, soil data from the gSSURGO database, and management data (planting and harvest dates, plant population density) from USDA. Missing data were imputed using available means from similar locations or years. Performance was evaluated through training accuracy, spatial extrapolation (testing on unseen counties), temporal extrapolation (testing on unseen years), and analysis of genotype-by-environment interactions. Optimal seed selection scenarios were also explored, considering both known and unknown future weather conditions.
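The calibration objective and the two extrapolation tests described above can be illustrated with a minimal sketch. It is not the study's implementation: the weighting scheme is not specified in the text, so the sketch assumes the squared errors are normalized by the sum of the weights; the constraints on genotypic parameters are omitted; and the record fields (`county`, `year`, `observed`, `predicted`) and the toy values are hypothetical.

```python
import math


def weighted_rmse(predicted, observed, weights):
    """Weighted RMSE, assuming normalization by the sum of the weights:
    sqrt(sum(w * (p - o)^2) / sum(w))."""
    if not (len(predicted) == len(observed) == len(weights)):
        raise ValueError("inputs must have equal length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    squared = sum(w * (p - o) ** 2
                  for p, o, w in zip(predicted, observed, weights))
    return math.sqrt(squared / total_weight)


def split_by_group(records, key, held_out):
    """Hold out whole groups (counties or years) so the test set
    contains only groups that are absent from the training set."""
    train = [r for r in records if r[key] not in held_out]
    test = [r for r in records if r[key] in held_out]
    return train, test


def score(subset):
    """Equal-weight RMSE over a list of records (weights are an assumption)."""
    return weighted_rmse(
        [r["predicted"] for r in subset],
        [r["observed"] for r in subset],
        [1.0] * len(subset),
    )


# Hypothetical toy records (yields in Mg/ha); not data from the study.
records = [
    {"county": "A", "year": 2019, "observed": 10.0, "predicted": 10.5},
    {"county": "A", "year": 2020, "observed": 11.0, "predicted": 10.0},
    {"county": "B", "year": 2019, "observed": 9.0, "predicted": 9.5},
    {"county": "B", "year": 2020, "observed": 10.0, "predicted": 11.0},
]

# Spatial extrapolation: evaluate on a county excluded from training.
spatial_train, spatial_test = split_by_group(records, "county", {"B"})
# Temporal extrapolation: evaluate on a year excluded from training.
temporal_train, temporal_test = split_by_group(records, "year", {2020})

print(f"spatial test RMSE:  {score(spatial_test):.3f} Mg/ha")
print(f"temporal test RMSE: {score(temporal_test):.3f} Mg/ha")
```

In a real evaluation the parameters would be fitted on the training split only and predictions then generated for the held-out split; the toy records carry precomputed predictions purely to keep the example short.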
Key Findings
The training RMSE was 0.74 Mg/ha, corresponding to a relative RMSE (RRMSE) of 7.16% of the 2020 average yield, which is comparable to state-of-the-art models. Spatial extrapolation yielded an average RMSE of 1.17 Mg/ha (11.32% RRMSE), outperforming a nearest-county benchmark (1.44 Mg/ha, 13.93% RRMSE). Temporal extrapolation yielded an average RMSE of 1.15 Mg/ha (11.12% RRMSE), outperforming a nearest-year benchmark (1.55 Mg/ha, 14.99% RRMSE). Analysis of genotype-by-environment interactions revealed the impact of both genetic improvements and extreme weather events (e.g., the 2012 drought) on yields. Optimal seed selection simulations suggested potential yield improvements of 3.91% (unknown weather) and 17.59% (known weather) between 2011 and 2020, demonstrating the model’s prescriptive value.
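The RMSE and RRMSE figures reported above can be cross-checked with a short calculation. The sketch assumes that RRMSE is the RMSE divided by a reference mean yield and expressed as a percentage. The reference value of 10.34 Mg/ha is not stated in the text; it is back-computed here from the reported pair of 0.74 Mg/ha and 7.16%, so it should be read as an assumption rather than a reported number.

```python
# ASSUMPTION: reference mean yield in Mg/ha, back-computed as 0.74 / 0.0716.
# This value does not appear in the text.
ASSUMED_MEAN_YIELD = 10.34


def rrmse_percent(rmse, mean_yield=ASSUMED_MEAN_YIELD):
    """Relative RMSE as a percentage of the reference mean yield."""
    return 100.0 * rmse / mean_yield


def error_reduction_percent(model_rmse, benchmark_rmse):
    """Percentage by which the model's RMSE is below the benchmark's."""
    return 100.0 * (benchmark_rmse - model_rmse) / benchmark_rmse


# RMSE values (Mg/ha) as reported in the text.
reported_rmse = {
    "training": 0.74,
    "spatial extrapolation": 1.17,
    "nearest-county benchmark": 1.44,
    "temporal extrapolation": 1.15,
    "nearest-year benchmark": 1.55,
}

for name, rmse in reported_rmse.items():
    print(f"{name}: {rmse:.2f} Mg/ha -> {rrmse_percent(rmse):.2f}% RRMSE")

print(f"spatial RMSE reduction vs benchmark:  "
      f"{error_reduction_percent(1.17, 1.44):.2f}%")
print(f"temporal RMSE reduction vs benchmark: "
      f"{error_reduction_percent(1.15, 1.55):.2f}%")
```

Under this assumed reference yield, the calculation reproduces the five reported RRMSE values to two decimal places. The model's RMSE is roughly 19% lower than the nearest-county benchmark and roughly 26% lower than the nearest-year benchmark.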
Discussion
The data-driven crop model combines the strengths of process-based and data-driven approaches, providing both accurate predictions and scientifically explainable results. Its lower error than the nearest-county and nearest-year benchmarks in spatial and temporal extrapolation indicates that it generalizes to unseen counties and years. The analysis of genotype-by-environment interactions offers insights for breeders and farmers, and the findings on optimal seed selection highlight the potential for yield improvement through informed decision-making. The model's competitive performance relative to existing state-of-the-art models suggests potential for application in agricultural management and breeding programs.
Conclusion
This study presents a novel data-driven crop model for maize yield prediction that combines the explanatory power of process-based models with the predictive accuracy of data-driven methods. The model’s ability to accurately predict yields, analyze genotype-by-environment interactions, and guide optimal seed selection makes it a valuable tool for enhancing agricultural productivity. Future research should focus on validating the model with experimental data, extending it to other crop species, and incorporating higher-resolution data to further refine its accuracy and insights. The model’s genotypic parameters also hold potential for informing plant breeding strategies.
Limitations
The model's accuracy can be affected by extreme weather events and data limitations. Missing data (e.g., irrigation, fertilization) were imputed, which may reduce the model's precision. Transferability to other crop species may be limited by species-specific physiological differences. The assumption that factors such as waterlogging are uniform within a county may introduce inaccuracies. The optimal seed selection results assume that every historical seed remains available in all subsequent years, which is a simplification.