Environmental Studies and Forestry
Knowledge-guided machine learning can improve carbon cycle quantification in agroecosystems
L. Liu, W. Zhou, et al.
Agroecosystems cover around one-third of Earth’s land and play a central role in the terrestrial carbon cycle because soils are the largest terrestrial carbon reservoir. Agriculture is both a significant source of greenhouse gases and a sink via crop CO2 uptake, with potential to increase soil organic carbon (SOC) in carbon-unsaturated, intensively cultivated soils. Accurate, scalable, and cost-effective quantification of carbon fluxes and SOC change at field-level scales is needed to support climate mitigation programs, ensure fair compensation in carbon markets, and guide sustainable land management. Traditional approaches—soil sampling, emission factors, and process-based (PB) models—face challenges from spatial heterogeneity, seasonality, parameter uncertainty, incomplete process representation, and high computational costs at high resolution. Pure machine learning (ML) models can be efficient and accurate but often fail out-of-sample due to limited training data and lack interpretability. The study addresses these limitations by proposing a knowledge-guided machine learning (KGML) framework that integrates process understanding with ML and multi-source observations to reliably quantify carbon budgets, crop yields, and SOC changes at high spatial (250 m) and temporal (daily) resolution in the U.S. Midwest.
Knowledge-guided machine learning (KGML) has shown success in Earth system domains where governing equations are well established (e.g., hydrology and atmospheric sciences). However, biogeochemical cycles in ecosystems are highly nonlinear and complex, and many crucial belowground processes are not directly observable by remote sensing. Existing PB models can be uncertain without local calibration and are computationally expensive at fine spatiotemporal scales, while black-box ML suffers from limited data and poor interpretability. Prior work demonstrates the value of integrating scientific knowledge into ML via physics-informed constraints, hybrid modeling, and multi-source data assimilation. This paper builds on these insights, highlighting the need for new KGML approaches that incorporate biogeochemical knowledge, assimilate indirect observations (e.g., remote sensing, surveys), and capture less-observable terrestrial processes to improve carbon cycle quantification in agroecosystems.
The authors develop KGML-ag-Carbon, a hierarchical, knowledge-guided deep learning framework grounded in the ecosys process-based model and augmented by multi-source data.
Architecture: A GRU-based hierarchical model with submodules for daily autotrophic respiration (Ra), heterotrophic respiration (Rh), net ecosystem exchange (NEE), and an attention-based annual crop yield estimator. A shared GRU_Basis provides latent states to submodules. Crop residue is computed via a mass-balance relation (annual residue = ΣGPP − ΣRa − yield) and fed back to the Rh submodule to represent plant–soil coupling. NEE is predicted using predicted Ra, Rh, and inputs, enforcing flux relationships.
Inputs: Daily climate (NLDAS-2: radiation, temperature metrics, humidity metrics, wind, precipitation), daily GPP (SLOPE SANIRV-based), year, crop type (corn/soybean), and 0–30 cm soil properties (bulk density, sand, silt, field capacity, wilting point, saturated hydraulic conductivity, SOC, pH, CEC). Variables are Z-normalized using statistics from synthetic ecosys data.
Data sources and assimilation:
- Synthetic pretraining data: Over 10k ecosys simulations across 293 counties (IL, IA, IN) from 2001–2018, covering diverse soils, crop calendars, and rotations; generating Ra, Rh, NEE, GPP, and yield.
- Observational data for fine-tuning: County-level corn/soybean yields (USDA NASS) for 637 counties; eddy covariance (EC) flux tower data from 11 cropland sites (Reco and NEE, with GPP derived and corrected to preserve mass balance); remotely sensed daily GPP at 250 m (SLOPE).
Outputs and derived variables: Daily Ra, Rh, Reco (Ra + Rh), NEE; annual yield; annual ΔSOC via mass balance: ΔSOC = NEE − yield.
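The input scaling and the derived quantities named above (Z-normalization, annual residue = ΣGPP − ΣRa − yield, and Reco = Ra + Rh) can be illustrated with a minimal Python sketch. The function names, the plain-list data layout, and the example numbers are assumptions made for illustration; they are not taken from the KGML-ag-Carbon code, and all quantities are assumed to share the same carbon units.

```python
from typing import List, Sequence


def z_normalize(values: Sequence[float], mean: float, std: float) -> List[float]:
    """Scale values with a fixed mean and standard deviation.

    In the study the statistics come from synthetic ecosys data; here they
    are simply passed in by the caller (hypothetical interface).
    """
    if std <= 0:
        raise ValueError("std must be positive")
    return [(v - mean) / std for v in values]


def annual_residue(daily_gpp: Sequence[float],
                   daily_ra: Sequence[float],
                   annual_yield: float) -> float:
    """Annual crop residue = sum(GPP) - sum(Ra) - yield."""
    if len(daily_gpp) != len(daily_ra):
        raise ValueError("GPP and Ra series must cover the same days")
    return sum(daily_gpp) - sum(daily_ra) - annual_yield


def daily_reco(daily_ra: Sequence[float], daily_rh: Sequence[float]) -> List[float]:
    """Ecosystem respiration per day: Reco = Ra + Rh."""
    if len(daily_ra) != len(daily_rh):
        raise ValueError("Ra and Rh series must cover the same days")
    return [ra + rh for ra, rh in zip(daily_ra, daily_rh)]


# Toy three-day "year" with made-up numbers.
gpp = [10.0, 12.0, 8.0]
ra = [4.0, 5.0, 3.0]
rh = [1.0, 1.5, 0.5]

residue = annual_residue(gpp, ra, annual_yield=6.0)  # 30 - 12 - 6 = 12.0
reco = daily_reco(ra, rh)                            # [5.0, 6.5, 3.5]
scaled = z_normalize([2.0, 4.0], mean=3.0, std=1.0)  # [-1.0, 1.0]
```

In the framework described above, the residue term is what couples the plant side (GPP, Ra, yield) to the soil side, since it is fed back to the Rh submodule.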
Knowledge-guided constraints (KG losses): Mass balance (Ra + Rh − GPP = NEE); biophysical thresholds (e.g., 0 < yield < fraction of annual GPP); monotonic/response constraints (e.g., Rh increases with SOC under fixed conditions); constraints to retain learned responses during fine-tuning (e.g., changes in Ra responses within 10%).
Training strategy (5 steps):
(1) Pre-train yield and Ra with synthetic data using self-paced MSE.
(2) Pre-train Ra, Rh, NEE with MSE + mass balance + response controls.
(3) Fine-tune yield with county-scale yield data (and synthetic) with threshold/response constraints, partially freezing GRU_Basis.
(4) Re-train Ra, Rh, NEE on synthetic data to preserve prior knowledge.
(5) Fine-tune Ra, Rh, NEE with EC data (and synthetic) using mass balance and response constraints; GRU_Basis and attention frozen; adjusted learning rates to retain pretraining knowledge; winter separation assumption (Reco ≈ Rh) to decouple Ra and Rh.
Regional application: After fine-tuning, the model predicts daily fluxes and annual yield at 250 m across the U.S. Midwest (2000–2020). Products are compared to EC observations and the Trendy-v9 ensemble (regridded to 0.5°). High-resolution ΔSOC maps at 0.0025° are produced using regridded 250 m NEE and yield; coarse (0.5°) ΔSOC maps are generated for comparison using aggregated inputs and rotation-weighted simulations.
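The mass-balance and threshold constraints listed under the KG losses can be read as penalty terms added to the ordinary data-fit loss. The sketch below shows one plausible form in plain Python; the squared and hinge-style penalties, the weights `w_mb` and `w_th`, and the `max_fraction` parameter are illustrative assumptions, not the loss functions actually used by the authors.

```python
from typing import Sequence


def mass_balance_penalty(ra: Sequence[float], rh: Sequence[float],
                         gpp: Sequence[float], nee: Sequence[float]) -> float:
    """Mean squared residual of Ra + Rh - GPP - NEE over all days.

    The penalty is zero only when the stated relation Ra + Rh - GPP = NEE
    holds on every day.
    """
    residuals = [a + h - g - n for a, h, g, n in zip(ra, rh, gpp, nee)]
    if not residuals:
        raise ValueError("at least one day of data is required")
    return sum(r * r for r in residuals) / len(residuals)


def yield_threshold_penalty(annual_yield: float, annual_gpp: float,
                            max_fraction: float) -> float:
    """Zero inside [0, max_fraction * annual_gpp], linear outside.

    max_fraction is a hypothetical parameter; the source only says the upper
    bound is a fraction of annual GPP.
    """
    upper = max_fraction * annual_gpp
    below = max(0.0, -annual_yield)        # negative yields are penalized
    above = max(0.0, annual_yield - upper)  # yields above the cap are penalized
    return below + above


def kg_loss(data_mse: float, mb_penalty: float, th_penalty: float,
            w_mb: float = 1.0, w_th: float = 1.0) -> float:
    """Data-fit term plus weighted knowledge-guided penalties (assumed weights)."""
    return data_mse + w_mb * mb_penalty + w_th * th_penalty


# Toy two-day example: the first NEE set satisfies mass balance, the second does not.
ra, rh, gpp = [2.0, 3.0], [1.0, 1.0], [5.0, 6.0]
balanced = mass_balance_penalty(ra, rh, gpp, nee=[-2.0, -2.0])    # 0.0
unbalanced = mass_balance_penalty(ra, rh, gpp, nee=[-1.0, -2.0])  # 0.5
too_high = yield_threshold_penalty(6.0, annual_gpp=10.0, max_fraction=0.5)  # 1.0
total = kg_loss(data_mse=0.25, mb_penalty=unbalanced, th_penalty=too_high)  # 1.75
```

In a real training loop these terms would be computed on tensors inside an automatic-differentiation framework so that gradients flow through the penalties; the plain-Python version only shows the arithmetic.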
- Performance vs process-based and pure ML models:
  • Pretrained model reproduces ecosys synthetic test data with R² = 0.99 (yield, Ra), 0.97 (Rh, NEE).
  • Against observations, ecosys benchmark R²: yields 0.49 (corn) and 0.42 (soybean); daily Reco 0.67–0.89; daily NEE 0.59–0.88.
  • KGML-ag-Carbon after fine-tuning achieves R²: yields 0.91 (corn) and 0.88 (soybean) on 210-county test; daily Reco 0.94 and daily NEE 0.96 at 11 EC sites (out-of-sample years).
- Robustness and sample efficiency: KGML-ag-Carbon consistently outperforms a GRU-based pure ML model using the same inputs and shows much lower sensitivity to training sample size; even with zero fine-tuning samples for fluxes, assimilating GPP allows KGML-ag-Carbon to outperform ecosys and pure ML trained with small samples. The advantage of KGML is largest in small-sample regimes and in extreme yield years.
- Component contributions: Using GPP inputs and pretraining on ecosys synthetic data provide the largest gains; hierarchical structure and KG loss further reduce mass balance residuals to near zero and improve temporal dynamics (especially winter/summer NEE), enhancing interpretability.
- Regional flux estimates vs Trendy-v9: Distributions of Reco and NEE from KGML-ag-Carbon align more closely with EC tower observations than Trendy models, which tend to overestimate Reco and NEE and show large inter-model spread.
- High-resolution ΔSOC mapping: 250 m products (aggregated to 0.0025°) reveal that 86% of ΔSOC fractions lie within −0.5% to 0.5% C/year, consistent with experimental ranges, and capture spatial patterns (e.g., SOC declines in southern MN, northern IA, NE IL; increases in southern Midwest). Soil factors explain ~43% of spatial variance and climate ~11%. Coarse (0.5°) estimates show substantial discrepancies vs fine-scale (NRMSE ≈ 86%), indicating 86% more spatial detail captured at high resolution and highlighting mixed-pixel effects at coarse scales.
- Computational efficiency: KGML-ag-Carbon is >1,000,000× faster than ecosys for a 21-year, field-scale Midwest simulation (1.6 days on one GPU vs 5.9 years on 1000 CPUs for ecosys).
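The speed-up figure can be sanity-checked with a few lines of arithmetic. The sketch below assumes the comparison is made in processor-days (wall time multiplied by the number of processors, with one GPU counted as one processor) and uses 365 days per year; both are assumptions about how the ratio was derived, not statements from the source.

```python
DAYS_PER_YEAR = 365.0  # assumption: leap days ignored


def processor_days(wall_days: float, n_processors: int) -> float:
    """Total compute expressed as wall-clock days times processor count."""
    return wall_days * n_processors


# Figures quoted in the summary.
ecosys_cost = processor_days(5.9 * DAYS_PER_YEAR, n_processors=1000)
kgml_cost = processor_days(1.6, n_processors=1)

speedup = ecosys_cost / kgml_cost
print(f"ecosys: {ecosys_cost:,.0f} CPU-days")
print(f"KGML-ag-Carbon: {kgml_cost:.1f} GPU-days")
print(f"ratio: about {speedup / 1e6:.2f} million")
```

Under these assumptions the ratio comes out at roughly 1.3 million, which is consistent with the ">1,000,000×" claim.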
The study demonstrates that integrating process-based knowledge, synthetic pretraining, and remote sensing constraints within a hierarchical KGML framework markedly improves carbon cycle quantification in agroecosystems. KGML-ag-Carbon addresses the core challenge—accurate, interpretable, and scalable field-level estimates—by combining the strengths of PB models (mechanistic knowledge) and ML (efficiency and pattern learning) while mitigating their individual weaknesses (parameter/structure uncertainty of PB; data hunger and opaqueness of ML). The framework delivers high predictive accuracy for yield and carbon fluxes, stable performance under limited training data, and strong out-of-sample generalization, including extreme years. Knowledge-guided structure and losses enforce physical plausibility (mass balance) and process-consistent responses, improving interpretability and temporal dynamics (e.g., winter partitioning of respiration). High-resolution mapping uncovers fine-scale ΔSOC patterns and reduces aggregation errors, essential for fair carbon accounting, targeted management, and policy implementation. The alignment of regional flux distributions with EC observations, and improvements over state-of-the-art PB ensembles (Trendy), indicate the framework’s potential as a reliable operational tool for regional-to-field carbon budgeting. Remaining uncertainties in intermediate variables (Ra, Rh, residues) underscore the importance of improved measurements to further constrain mechanisms. The approach illustrates how multi-source data fusion (synthetic PB outputs, in situ EC data, and remote sensing GPP) can effectively leverage sparse observations to support accurate, scalable carbon accounting.
This work introduces KGML-ag-Carbon, a knowledge-guided, hierarchical deep learning framework that integrates process-based ecosys knowledge, synthetic pretraining, and remote sensing constraints to accurately and efficiently quantify field-level carbon budgets and yields across the U.S. Midwest. The model outperforms both a pure ML baseline and a process-based benchmark, adheres to physical constraints, captures complex temporal dynamics, and scales to high-resolution regional predictions. High-resolution ΔSOC estimates reveal substantially more spatial detail than coarse-resolution approaches, with strong implications for precise, fair, and cost-effective carbon accounting and land management. Future directions include: incorporating nitrogen (and phosphorus) cycle processes to capture C–N interactions and N2O emissions; explicitly modeling management practices (fertilization, irrigation, tillage, cover crops) potentially via additional remote sensing and inverse/self-supervised learning; enriching intermediate state simulations (e.g., soil moisture, canopy temperature) to assess extreme weather impacts; extending to larger regions and other ecosystems; and exploring reinforcement learning for management optimization once reliable GHG and productivity predictions under varied practices are available.
- Data and scale mismatches: Field-/plot-level SOC measurements (~10 m) vs model inputs at 250 m; county-level yields used to fine-tune a fine-scale model may induce scaling challenges; GPP at EC sites required correction to preserve mass balance.
- Limited management information: Lack of location-specific management (fertilization, irrigation, tillage) leads to relying on GPP assimilation to capture management-induced variability.
- Intermediate variable uncertainties: Ra, Rh, and residue lack direct observational constraints, leading to higher uncertainties despite accurate NEE/Reco/yield.
- SOC measurement uncertainty: Lab, spatial sampling, and resampling errors (up to ~12%, 50%, and 45%, respectively) constrain ΔSOC validation; measurements often focus on humus while the model’s ΔSOC includes crop residue plus humus.
- Model assumptions and choices: Winter partitioning assumption (Reco ≈ Rh); reliance on ecosys for synthetic pretraining introduces dependence on PB model selection; coarse-to-fine aggregation choices affect 0.5° comparisons.
- Extreme events and generalization: While KGML-ag-Carbon performs well in tested extreme years, broader impacts of heat waves/flash droughts on processes remain to be fully elucidated; additional intermediate states may improve attribution.

