
Earth Sciences
Training machine learning models on climate model output yields skillful interpretable seasonal precipitation forecasts
P. B. Gibson, W. E. Chapman, et al.
Discover how machine learning models trained on extensive climate simulations can enhance seasonal precipitation forecasting for the western United States. This research, led by Peter B. Gibson and colleagues, shows that such models not only compete with traditional dynamical methods but also provide insight into the underlying physical processes.
Introduction
The study addresses the challenge of skillful seasonal precipitation forecasting over the western United States, where precipitation variability is dominated by a limited number of atmospheric river events and significant teleconnections. Decision-making during events such as the 2012–2016 California drought highlights the societal and economic stakes. While ENSO is the primary driver of North American seasonal predictability, its signal-to-noise ratio and the roles of other tropical and stratospheric processes complicate forecasts. Traditional empirical approaches are constrained by limited observational records (~40–100 years), making it difficult to learn nonlinear, multivariate relationships without overfitting. The research question is whether training machine learning models on long climate model simulations can overcome sample-size limitations, yield competitive or superior seasonal precipitation predictions relative to dynamical models (NMME), and provide interpretable links to known physical teleconnections. The authors focus on predicting large-scale precipitation patterns (clusters) for NDJ and JFM seasons by leveraging a pool of atmospheric and oceanic predictors and interpretable ML techniques.
Literature Review
Seasonal forecasting has used dynamical, empirical, and hybrid approaches. Dynamical model ensembles such as NMME provide probabilistic forecasts but often show limited skill for western US precipitation beyond two weeks, with modest improvements over successive model upgrades for temperature but minimal for precipitation. Classical statistical methods (e.g., CCA) capture linear relationships between lagged SSTs and precipitation but struggle with multiple predictors, interactions, temporal structure, and nonlinearity. Recent studies have explored training statistical/ML models on climate model simulations to extend training samples and improve skill, including regularized regression and CNN-based approaches trained on historical simulations. Teleconnection literature notes ENSO as a dominant driver but also emphasizes roles for tropical diabatic heating in the western Pacific, Indian Ocean influences, subseasonal windows of opportunity, and impacts from tropical and polar stratospheric variability. Prior work also shows that aggregating over larger spatial domains increases forecast skill by easing spatial precision demands.
Methodology
Framework: Train machine learning models on long climate model simulations, then test on observations (1980–2020) using predictors from observed/reanalysis data for NDJ and JFM forecasts.
Training data: CESM Large Ensemble (CESM-LENS) single-model large ensemble with 40 members spanning 1920–2000 under CMIP5 historical forcing, using CESM1 with CAM5 and coupled ocean, land, and sea-ice components. Each member provides a physically plausible realization of internal climate variability. CESM is documented to simulate ENSO variability and teleconnections reasonably well, though with some amplitude biases and PDO-related differences.
Predictor variables and dimension reduction: Monthly atmospheric and oceanic fields from key regions were summarized via EOFs to reduce dimensionality and capture dominant modes. SST predictor regions include the tropical Pacific, western tropical Pacific (WP), Indian Ocean (IO), and North Pacific (NP). Additional predictors include velocity potential at 200 hPa (VP200), geopotential height at 500 hPa (Z500), and precipitation-related fields. For each predictor, the first four EOFs were computed, generally explaining at least 50% of variance. Monthly lags were included based on each variable’s memory; for example, tropical Pacific SST EOF1 from July–December (lags 0–5) for JFM. Predictors were standardized consistently across CESM-LENS and observations.
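The EOF-based dimension reduction described above can be sketched with principal component analysis on the flattened anomaly fields. This is a minimal illustration using scikit-learn and synthetic data, not the authors' exact pipeline; the array shapes and variable names are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical monthly anomaly field (e.g. tropical Pacific SST):
# (n_months, n_lat, n_lon), standardized beforehand
rng = np.random.default_rng(0)
sst_anom = rng.standard_normal((500, 20, 60))

# Flatten the spatial dimensions so each month is one sample
X = sst_anom.reshape(sst_anom.shape[0], -1)

# Retain the first four EOFs, as in the study
pca = PCA(n_components=4)
pcs = pca.fit_transform(X)                  # (n_months, 4) PC time series used as predictors
eofs = pca.components_.reshape(4, 20, 60)   # corresponding EOF spatial patterns
var_explained = pca.explained_variance_ratio_.sum()
```

The PC time series (`pcs`), at the appropriate monthly lags, become the predictor columns fed to the classifiers; the EOF patterns can be inspected to confirm they correspond to known modes such as ENSO.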
Predicted variable (target): K-means clustering was applied to standardized seasonal (3-month) precipitation anomalies across the western US to define four large-scale precipitation patterns for NDJ and JFM, aligning with known leading modes: (1) Dry North, Wet South; (2) Widespread Wet; (3) Widespread Dry; (4) Wet North, Dry South. Cluster counts in CESM-LENS training set were N=1648, 1112, 2049, and 2965, respectively. Forecasts are formulated as multi-class classification of the seasonal cluster.
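The clustered target can be built with a standard K-means step over standardized seasonal anomalies. The sketch below uses scikit-learn and synthetic data; grid dimensions are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Hypothetical standardized seasonal (3-month) precipitation anomalies
# over the western US, flattened to (n_seasons, n_gridpoints)
precip_anom = rng.standard_normal((3000, 400))

# Four large-scale precipitation patterns, as in the study
km = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = km.fit_predict(precip_anom)               # cluster label per season: the forecast target
patterns = km.cluster_centers_.reshape(4, 20, 20)  # the four large-scale anomaly patterns
```

Each season's label (e.g. "Widespread Dry") then serves as the class in the multi-class classification problem.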
Models:
- Random Forest (RF): Ensemble of decision trees (5000 trees; 10 variables randomly sampled per split). Hyperparameters tuned on CESM-LENS training data; predictors lagged per memory characteristics.
- XGBoost: Gradient-boosted decision trees with default parameters except selected tuning (e.g., gamma) via validation splits; trained separately for NDJ and JFM using the same variable set and lags as RF with 80/20 validation.
- Neural Network (feed-forward): Tuned via hyperparameter search on CESM-LENS (20% validation). Final settings include Adam optimizer, ReLU activations, learning rate 0.001, batch size 120, dropout regularization, categorical cross-entropy loss, up to 250 epochs with early stopping. Predictors included up to 12-month lags; class imbalance addressed by class-weighted loss.
- LSTM: Single LSTM layer followed by a 20-neuron dense layer; tuned sequence length, number of neurons, epochs, and batch size on validation; Adam with categorical cross-entropy and dropout regularization. Same predictor set used.
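The Random Forest configuration described above (many trees, 10 predictors sampled per split, class-weighted handling of imbalance) can be sketched as follows. This is a simplified illustration with synthetic data, a smaller forest than the paper's 5000 trees, and hypothetical predictor/label arrays.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X_train = rng.standard_normal((2000, 40))   # lagged EOF predictors from CESM-LENS (hypothetical)
y_train = rng.integers(0, 4, size=2000)     # precipitation-cluster labels

# The paper uses 5000 trees with 10 variables sampled per split;
# a smaller forest is used here so the sketch runs quickly
rf = RandomForestClassifier(n_estimators=200, max_features=10,
                            class_weight="balanced", random_state=0)
rf.fit(X_train, y_train)

# At test time, the trained model is driven by observed/reanalysis predictors
X_test = rng.standard_normal((41, 40))      # one row per test year, 1980-2020
probs = rf.predict_proba(X_test)            # class probabilities per season
```

Note the train/test asymmetry central to the framework: fitting happens entirely on climate-model output, while evaluation uses observed predictors.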
Observational and reanalysis data for testing (1980–2020): SST from ERSSTv5, reanalysis fields from ERA5, and precipitation from CPC-Unified. Trained ML models were driven by observed/reanalysis predictors for out-of-sample forecasts.
Evaluation: Accuracy defined as proportion of correct cluster predictions. Baselines included (1) frequency of the most common cluster and (2) random model predictions (distribution from 1000 repetitions with 5th/95th percentile). Accuracy was computed over overlapping test years; additional analyses aggregated clusters into two groups (wet southwest: clusters 1–2; dry southwest: clusters 3–4) to assess skill with relaxed spatial precision. Comparisons were made against NMME Phase 2 dynamical model forecasts projected onto the same clustered target.
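The two baselines above can be sketched directly: a majority-class baseline and a random-guess distribution from repeated draws, summarized by its 5th/95th percentiles. Synthetic labels stand in for the real test-year clusters.

```python
import numpy as np

rng = np.random.default_rng(3)
y_true = rng.integers(0, 4, size=41)        # observed cluster per test year (hypothetical)
y_pred = rng.integers(0, 4, size=41)        # a model's predictions

accuracy = np.mean(y_pred == y_true)

# Baseline 1: always predict the most common cluster
counts = np.bincount(y_true, minlength=4)
majority_acc = counts.max() / len(y_true)

# Baseline 2: random-guess accuracy distribution over 1000 repetitions
rand_acc = np.array([np.mean(rng.integers(0, 4, size=41) == y_true)
                     for _ in range(1000)])
p5, p95 = np.percentile(rand_acc, [5, 95])
```

A model is deemed skillful only when its accuracy exceeds both the majority-class rate and the upper (95th percentile) bound of the random-guess distribution.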
Interpretability: Global interpretability via RF variable importance (permutation importance, mean minimum tree depth, frequency at tree root), pairwise interaction importance using mean conditional depth across 5000 trees, and response characterization via partial dependence plots and ALE plots. Local interpretability via LIME to explain individual forecasts by fitting simple local surrogate models around RF decisions.
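Permutation importance, one of the global interpretability metrics listed above, can be sketched with scikit-learn: each predictor is shuffled in turn and the resulting drop in accuracy is recorded. The toy data below plant one genuinely informative predictor so the metric has something to find; it is an illustration, not the study's analysis.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(4)
# Ten predictors; only column 0 actually determines the class
X = rng.standard_normal((1000, 10))
y = (X[:, 0] > 0).astype(int)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Shuffle each predictor and measure the accuracy drop (10 repeats each)
result = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
top = int(np.argmax(result.importances_mean))
```

In the study, the analogous ranking consistently placed tropical Pacific SST EOF1 at short lags on top; partial dependence, ALE, and LIME then characterize how and where those top predictors act.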
Comparisons to NMME: Hindcasts initiated in appropriate months (e.g., October for NDJ; December for JFM), analyzed as ensemble means per model and evaluated on the same cluster framework.
Key Findings
- Overall accuracy and comparison to NMME: ML models trained on CESM-LENS achieved competitive or superior accuracy to NMME for forecasting western US seasonal precipitation clusters. Accuracy was generally higher for JFM than NDJ.
- Cluster grouping improves apparent skill: Grouping clusters into broader wet-southwest (1–2) and dry-southwest (3–4) categories increased accuracy. In JFM, several individual models and their ensembles achieved 70–80% accuracy, outperforming both baselines. In NDJ, accuracies of 60–70% were achieved but did not exceed the random-guess baseline in the grouped setting, so they were not considered skillful.
- Cluster-dependent predictability: Clusters 4 (wet north/dry south) and 3 (widespread dry) were most predictable; cluster 2 (widespread wet) was hardest to predict. The RF model rarely predicted cluster 2 (0% of predictions in NDJ; 2.6% in JFM) despite its occurrence in the test set (15% NDJ; 17.5% JFM) and training set (~15.7% NDJ; 16.4% JFM). Attempts to rebalance training (skew adjustment, stratified sampling) did not consistently improve cluster 2 skill.
- Error characteristics: Misclassifications often reflected small spatial displacements of the north–south dipole; such errors may be less consequential if the anomaly sign over key subregions is retained (e.g., dry southwest).
- Physical interpretability and key predictors: RF importance metrics consistently identified tropical Pacific SST EOF1 (lags 0–5 from July–December) as top predictors, along with tropical velocity potential anomalies (VP200 EOF1, lags 0–5; VP200 EOF2 at certain lags) and western tropical Pacific SST EOF1 (lags 0–3). Lower-numbered EOFs and shorter lags were generally more important, aligning with physical intuition.
- Pairwise interactions: The shallowest/highest-importance interaction involved VP200 EOF1 (e.g., August) and tropical Pacific SST EOF1 (e.g., November), with partial dependence showing that La Niña-like conditions and enhanced Indian Ocean/Maritime Continent convection increase the probability of cluster 4. The most frequent interaction involved tropical Pacific SST in December and October, also favoring cluster 4 under La Niña-like states.
- Local explanations: LIME case studies showed that the 2005 JFM correct prediction of cluster 1 was driven primarily by negative SST_TP_EOF values (weak El Niño), while an incorrect 2016 JFM forecast (predicted cluster 1 vs observed cluster 4) was dominated by weak El Niño signals, with WP and IO SST patterns exerting smaller opposing influences; this error was common across ML and NMME models, consistent with dominance of less predictable atmospheric variability.
- Alignment with known teleconnections: Findings reinforce ENSO as the dominant source of predictability, with WP SST and tropical convection (velocity potential) modulating impacts and aiding discrimination between clusters 3 and 4.
Discussion
Training on large climate model ensembles effectively addresses the limited observational sample in seasonal forecasting, enabling machine learning models to learn robust, nonlinear, and multivariate teleconnections. By predicting clustered, large-scale precipitation patterns rather than grid-point anomalies, the approach aligns the predictand with the spatial scale of dominant sources of predictability (e.g., Rossby wave train positioning), improving practical skill and interpretability. The ML models, especially Random Forests, not only match or exceed NMME performance for JFM but also provide transparent insights into physical drivers via importance and interaction analyses. The reduced skill for NDJ and difficulties with the widespread-wet cluster reflect known forecast challenges and class imbalance in both training and nature. Interpretability tools (PDP, ALE, LIME) demonstrate that model decisions are consistent with established physical understanding—ENSO teleconnections modulated by tropical convection and western tropical Pacific SST—building confidence in ML-based seasonal forecasting. The approach suggests a pathway to explore teleconnection non-stationarity and compare influences across climate modes, and it offers operational value by enabling rapid, case-specific explanations of forecasts at negligible computational cost.
Conclusion
The study demonstrates that machine learning models trained on long climate-model simulations can produce skillful and interpretable seasonal precipitation forecasts for the western US, particularly in JFM, and can compete with or outperform NMME dynamical models. Predicting clustered precipitation patterns enhances skill by targeting more predictable large-scale structures. ENSO-related tropical Pacific SST anomalies are the primary predictors, with additional modulation by tropical convection and western tropical Pacific SST. Interpretability analyses confirm physically plausible teleconnections and provide actionable, local explanations for individual forecasts. Future work should: (1) leverage multiple large-ensemble datasets to quantify uncertainty and potentially improve skill; (2) train on simulations with better-resolved stratospheric variability (e.g., QBO); (3) explore deeper and bidirectional LSTM architectures; and (4) apply transfer learning by pretraining on model simulations and fine-tuning on observations to further enhance performance.
Limitations
- Cluster imbalance and rarity: The widespread-wet cluster (cluster 2) is less frequent and was rarely predicted by ML models, limiting skill for this regime despite attempts at rebalancing during training.
- Seasonal and spatial precision constraints: Skill gains depend on spatial aggregation; increased accuracy when grouping clusters comes at the cost of reduced spatial specificity. NDJ skill remained limited and in some cases not superior to random baselines under grouped categories.
- Model-training dependence on CESM-LENS: Training on a single model ensemble may propagate model-specific teleconnection biases and amplitude differences (e.g., ENSO/PDO), potentially affecting generalizability.
- Predictor/lag limitations and stratospheric processes: Some processes (e.g., QBO, sudden stratospheric warming) are not well resolved in the low-top CESM1 version used for training, possibly omitting a source of predictability.
- Interpretability caveats: Partial dependence plots can be confounded by correlated predictors; ALE plots mitigate but do not eliminate interpretability challenges. LIME provides only local, approximate explanations and may not fully capture complex decision boundaries.
- Event-specific unpredictability: Certain seasons (e.g., 2016 JFM) were dominated by less predictable atmospheric variability, leading to widespread forecast errors across all methods.