Neural general circulation models for weather and climate

Earth Sciences


D. Kochkov, J. Yuval, et al.

Discover how NeuralGCM, developed by a team from Google Research and MIT, merges machine learning with traditional atmospheric modeling to enhance weather and climate forecasting. This innovative approach not only matches leading methods in accuracy but also offers remarkable computational efficiency, promising significant advancements in our understanding of the Earth's climate system.
Introduction

The study addresses limitations of traditional physics-based general circulation models (GCMs) and recent machine-learning (ML) approaches in weather and climate prediction. While GCMs have improved over decades, they rely on tuned parameterizations for unresolved processes and retain biases, complicating uncertainty estimation for long-term projections and extremes. Pure ML weather models trained on ERA5 have achieved state-of-the-art deterministic forecasts at far lower computational cost, but they struggle with calibrated uncertainty, produce blurry multi-day forecasts under MSE training, misrepresent derived diagnostic variables, and have not shown stability for long-term simulations. Prior hybrid efforts typically trained ML parameterizations offline, leading to instability and climate drift when coupled with dynamics. The research hypothesis is that a fully differentiable, end-to-end trained hybrid GCM—NeuralGCM—can couple a dynamical core with learned physics and achieve accurate, physically consistent deterministic and ensemble weather forecasts, while supporting stable climate simulations over seasonal to decadal scales.

Literature Review

The paper situates NeuralGCM within prior work on numerical weather prediction and ML-based forecasting. Recent ML models (e.g., GraphCast, Pangu) trained on ERA5 deliver strong deterministic skill for 1–10 day forecasts at reduced code complexity and cost, but lack calibrated ensembles and degrade at longer leads. Studies have highlighted issues with diagnostic consistency (e.g., geostrophic balance) and increasing blurriness with lead time. Hybrid models that replace or correct parameterizations showed promise in idealized settings (aquaplanets) and at coarse resolutions, but often suffered from instability and climate drift due to offline training. Benchmarks such as WeatherBench2 provide standardized evaluation metrics: root-mean-square error (RMSE), root-mean-square bias (RMSB), and the continuous ranked probability score (CRPS). Comparisons to operational ECMWF systems (HRES, ENS), global cloud-resolving models (e.g., X-SHIELD), and CMIP6 AMIP experiments frame the assessment across deterministic, ensemble, and climate contexts.

Methodology

NeuralGCM is a fully differentiable hybrid atmospheric model combining: (1) a dynamical core that solves the discretized equations of large-scale fluid motion and thermodynamics under gravity and Coriolis forces, using global spectral methods and an implicit–explicit ODE solver to advance tendencies; and (2) a learned physics module, a neural network that operates locally on vertical atmospheric columns to parameterize unresolved processes (e.g., cloud formation, radiation, precipitation, subgrid turbulence). Inputs (forcings, optional stochastic noise, and state variables) are encoded into a model state, passed through the dynamical core and learned physics to produce tendencies, and advanced in time; outputs are decoded to forecast fields. End-to-end differentiability enables online training via stochastic gradient descent through hundreds of time steps (backpropagation through time), explicitly coupling the learned physics with the resolved dynamics.

Training protocol: models are trained on ERA5 weather rollouts, with the rollout length gradually increased from 6 hours to up to 5 days, which is critical for stability and multi-day accuracy. Deterministic and stochastic variants are trained with distinct protocols; the stochastic version supports ensemble generation via injected noise/perturbations and produces probabilistic forecasts. Models are trained at 2.8°, 1.4°, and 0.7° horizontal resolutions.

Evaluation: for medium-range weather (the held-out year 2020), all forecasts are conservatively regridded to 1.5°, with 732 forecasts initialized at 00/12 UTC. Metrics include RMSE, RMSB, CRPS, and the spread–skill ratio; comparisons are made against ECMWF-HRES and ECMWF-ENS (referenced to ECMWF operational analysis) and the ML baselines GraphCast and Pangu (referenced to ERA5). Physical consistency is assessed via geostrophic wind balance, power spectra, and column water budget diagnostics.

For climate, deterministic 2.8° and 1.4° models are run with prescribed historical SST and sea ice (AMIP-like setups). Stability is quantified as the fraction of initial conditions completing long simulations without instability. Seasonal-cycle fidelity, precipitable water and temperature biases, and emergent phenomena (e.g., tropical cyclone counts and tracks, identified with TempestExtremes) are evaluated. Decadal skill is assessed using 40-year AMIP-like simulations (22 stable runs out of 37 initializations) and compared to 22 CMIP6 AMIP experiments, including the vertical structure of tropical temperature trends. Computational performance is benchmarked against ECMWF systems and a global cloud-resolving model (X-SHIELD).
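The encode → (dynamical core + learned physics) → advance → decode loop can be sketched in a few lines. The following is a toy illustration only: a crude diffusion term stands in for the spectral dynamical core, a fixed linear map over the vertical profile stands in for the neural network, and an explicit Euler step replaces the implicit–explicit solver. All names, shapes, and numbers are hypothetical, not NeuralGCM's actual API:

```python
import numpy as np

def dynamics_tendency(state):
    # state shape: (lev, lat, lon). Crude horizontal diffusion as a
    # stand-in for the resolved large-scale dynamics.
    lap = (np.roll(state, 1, axis=-1) + np.roll(state, -1, axis=-1)
           + np.roll(state, 1, axis=-2) + np.roll(state, -1, axis=-2)
           - 4.0 * state)
    return 0.1 * lap

def learned_physics_tendency(state, weights):
    # Acts on each vertical column independently (mixes levels only),
    # mirroring the column-local structure of the learned physics module.
    return np.einsum('ij,jkl->ikl', weights, np.tanh(state))

def hybrid_step(state, weights, dt=0.01):
    # Dynamics and learned physics contribute tendencies that are summed
    # and integrated together (here with explicit Euler for simplicity).
    return state + dt * (dynamics_tendency(state)
                         + learned_physics_tendency(state, weights))

def rollout(state, weights, n_steps):
    # Training unrolls many such steps and backpropagates through the
    # whole trajectory; the rollout length is grown from hours to days.
    for _ in range(n_steps):
        state = hybrid_step(state, weights)
    return state
```

In the real model, a loss on such a rollout is differentiated with respect to the network weights (backpropagation through time), which is precisely what couples the learned physics to the resolved dynamics during training.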

Key Findings
  • Deterministic weather forecasts: NeuralGCM-0.7° achieves state-of-the-art short-range RMSE (1–3 days depending on variable), competitive with best ML models (GraphCast) for 1–10 days. It exhibits notably reduced persistent bias (RMSB), including much less bias for specific humidity in the tropics.
  • Ensemble weather forecasts: The stochastic NeuralGCM-ENS (1.4°) outperforms ECMWF-ENS across most variables, lead times (up to ~15 days), and levels on ensemble-mean RMSE, RMSB, and CRPS. Both NeuralGCM-ENS and ECMWF-ENS achieve spread–skill ratios near 1, indicating calibrated ensembles. Individual ensemble members remain sharp even at long leads.
  • Physical consistency: NeuralGCM power spectra are closer to ERA5 than other ML models and remain roughly constant over lead time; higher resolution improves spectra. NeuralGCM better preserves geostrophic and ageostrophic wind vertical structure than GraphCast across rollouts, though ECMWF-HRES remains slightly closer to ERA5. The column-based learned physics enables interpretable water budget diagnostics; precipitation-minus-evaporation spatial patterns match ERA5 on short leads, with underestimation of tropical extremes.
  • Generalization: When trained to 2017 and tested on 2018–2022, NeuralGCM shows no clear trend of worsening error with recency, unlike GraphCast. A model trained only on data before 2000 maintains skill over 21+ unseen years.
  • Seasonal to annual climate: NeuralGCM-1.4° accurately reproduces seasonal cycle metrics (global precipitable water, total kinetic energy), Hadley circulation, zonal-mean winds, eddy kinetic energy patterns, and monsoon seasonality. For 2020, ensemble-mean global 850-hPa temperature RMSE is 0.16 K vs climatology 0.45 K.
  • Hydrological cycle and biases: Annual-mean precipitable water bias (RMSE) is 1.09 mm for NeuralGCM-1.4°, versus 1.74 mm for X-SHIELD and 1.36 mm for climatology (Jan 2020–Jan 2021). NeuralGCM shows lower temperature bias in the upper and lower troposphere than X-SHIELD; its precipitation-minus-evaporation bias is slightly larger, with some grid-scale artefacts.
  • Tropical cyclones: At 1.4° resolution, NeuralGCM produces realistic cyclone frequencies and tracks (83 vs 86 in ERA5 over the evaluation year), while X-SHIELD regridded to 1.4° underestimates counts (40).
  • Decadal AMIP simulations: In 22 stable 40-year NeuralGCM-2.8° runs, global warming trends and interannual variability correlate well with ERA5 and SST forcing. Spatial 850-hPa temperature biases (1981–2014) are smaller in all NeuralGCM runs than in CMIP6 AMIP runs, even after debiasing the latter. Vertical tropical warming trends are closer to ERA5 with reduced upper-troposphere bias, though NeuralGCM exhibits a wider spread.
  • Computational efficiency: NeuralGCM achieves orders-of-magnitude savings. Example: NeuralGCM-1.4° simulates 70,000 days in 24 hours of wall-clock time on a single TPU, versus 19 simulated days on 13,824 CPU cores for X-SHIELD, enabling large ensembles at low cost.
  • Limitation noted: The model does not extrapolate reliably to substantially different future climates; responses to +4 K SST diverge from expectations, with signs of climate drift.
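The ensemble scores cited above (CRPS, spread–skill ratio) have standard textbook definitions; the following minimal numpy sketch shows those definitions, and is not code from the paper or from WeatherBench2:

```python
import numpy as np

def crps_ensemble(members, obs):
    # Empirical CRPS for one scalar forecast. members: shape (m,); obs: scalar.
    # CRPS = E|X - y| - 0.5 * E|X - X'|, estimated over ensemble members.
    term1 = np.abs(members - obs).mean()
    term2 = np.abs(members[:, None] - members[None, :]).mean()
    return term1 - 0.5 * term2

def spread_skill_ratio(members, obs):
    # members: (n_cases, m); obs: (n_cases,).
    # Ratio of mean ensemble spread (std across members) to the RMSE of the
    # ensemble mean; values near 1 indicate a statistically calibrated ensemble.
    spread = np.sqrt(members.var(axis=1, ddof=1).mean())
    rmse = np.sqrt(((members.mean(axis=1) - obs) ** 2).mean())
    return spread / rmse
```

CRPS reduces to absolute error for a one-member "ensemble", which is why it allows a fair probabilistic comparison between deterministic and ensemble forecasts.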

Discussion

The findings demonstrate that end-to-end differentiable hybrid modeling can reconcile strengths of physics-based GCMs with data-driven learning. By training learned physics online within a dynamical core, NeuralGCM attains physically consistent, sharp forecasts and calibrated ensembles that rival or surpass operational systems on CRPS, while maintaining interpretability through column-process parameterizations and diagnostics. Its ability to sustain multi-year AMIP-like simulations that reproduce seasonal cycles, emergent phenomena (tropical cyclones), and historical warming trends supports the hypothesis that optimizing for short-term weather forecasts can yield parameterizations suitable for climate timescales. The large computational efficiency unlocks extensive ensembles and experimentation. Remaining gaps include full parity with high-resolution physical models on spectral fidelity and extremes, generalization to substantially warmer climates, and stability across all initializations. Addressing these will likely require integrating more explicit physics (e.g., radiation), coupling to ocean/land/chemistry components, and alternative training strategies to capture processes with subtle weather-scale signals (e.g., cloud feedbacks).

Conclusion

NeuralGCM introduces a differentiable hybrid GCM that unifies a spectral dynamical core with neural parameterizations trained end-to-end. It delivers competitive deterministic weather forecasts, state-of-the-art ensemble probabilistic skill (CRPS), realistic seasonal-to-decadal climate behavior with emergent tropical cyclones, and markedly reduced computational cost. The results substantiate that learning from short-term weather can effectively tune parameterizations for climate simulation. Future work should extend NeuralGCM with additional physics and coupling to other Earth-system components, improve handling of extremes and spectral fidelity via higher resolutions and architectures, and develop training strategies to generalize to unprecedented climates while mitigating instability and drift.

Limitations
  • Extrapolation to different climates is limited: AMIP simulations with large SST increases (+4 K) show divergent responses and climate drift.
  • Numerical stability: Not all long integrations remain stable (e.g., 35/37 2-year runs and 22/37 40-year runs complete), indicating residual instability risks.
  • Resolution and sharpness: Forecasts remain blurrier than high-resolution physics-based models; extreme events in the tropics are underestimated in precipitation-minus-evaporation diagnostics.
  • Hydrology diagnostics: Current learned physics predicts column tendencies and cannot separately diagnose precipitation versus evaporation.
  • Dependence on prescribed boundary conditions: Climate simulations rely on prescribed SST/sea ice and do not include coupling to ocean/land/chemistry.
  • Biases remain in some variables and regions despite improvements; ECMWF-HRES still aligns slightly better with ERA5 for geostrophic balance.
  • Generalization limitations highlighted by warmer-climate tests and signs of climate drift under strong forcing perturbations.