Earth Sciences
FuXi: a cascade machine learning forecasting system for 15-day global weather forecast
L. Chen, X. Zhong, et al.
Accurate weather forecasts underpin critical societal activities but are limited by forecast uncertainty that grows with lead time. ECMWF’s Integrated Forecasting System (IFS) sets a high bar (HRES deterministic; EPS ensemble up to 15 days) but is computationally expensive. Recent ML models achieve fast and often more accurate medium-range forecasts by learning from reanalysis data. However, iterative ML forecasts accumulate errors, degrading long-lead performance. The study targets two goals: (1) reduce iterative accumulation errors to extend skillful forecasts beyond 10 days, and (2) achieve performance comparable to the ECMWF ensemble mean at 15 days. The authors propose FuXi, a cascaded ML system with models specialized for successive time windows (0–5, 5–10, 10–15 days) to optimize performance across lead times while maintaining 0.25° resolution and 6-hourly cadence.
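The cascaded hand-off between lead-time windows can be sketched as a simple rollout loop. This is an illustrative reconstruction, not code from the paper: `models` stands in for the three stage models (FuXi-Short, FuXi-Medium, FuXi-Long), each assumed to map two consecutive 6-hourly states to the next state.

```python
def cascade_forecast(x_prev, x_curr, models, total_steps=60):
    """Roll out a 15-day forecast (60 x 6-hour steps), handing off
    between lead-time-specialized models at steps 20 and 40.

    models: (short, medium, long) callables, each taking the two most
    recent states and returning the next one (hypothetical interface).
    """
    short, medium, long_ = models
    states = []
    for step in range(1, total_steps + 1):
        if step <= 20:        # 0-5 days: FuXi-Short
            model = short
        elif step <= 40:      # 5-10 days: FuXi-Medium
            model = medium
        else:                 # 10-15 days: FuXi-Long
            model = long_
        x_next = model(x_prev, x_curr)
        states.append(x_next)
        # shift the two-step input window forward
        x_prev, x_curr = x_curr, x_next
    return states
```

FuXi-Medium starts from FuXi-Short's step-20 state and FuXi-Long from FuXi-Medium's step-40 state, so each model only runs within the window it was fine-tuned for.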
Prior benchmarks (WeatherBench) standardized ML evaluation at coarse resolutions. FourCastNet delivered 0.25° 7-day forecasts via AFNO + ViT but underperformed HRES. SwinRDM surpassed HRES for 5-day 0.25° forecasts using SwinRNN+ and diffusion-based super-resolution. Pangu-Weather outperformed HRES at 7 days using hierarchical temporal aggregation with multiple lead-time-specific models. GraphCast, an autoregressive GNN, outperformed HRES across most variable/lead-time combinations up to 10 days after curriculum training, though short-lead performance could degrade as autoregressive steps increase. Efforts to mitigate accumulation error include multi-step loss functions, direct lead-time prediction (requiring multiple models), hierarchical temporal aggregation, curriculum schedules, and replay-buffer training, yet a single model generally cannot excel across both short and long ranges. The study positions FuXi as a cascaded approach to bridge this gap and extend skill further to 15 days, approaching ECMWF EM performance.
Data: ERA5 reanalysis at 0.25° and 6-hourly resolution over 39 years is used as ground truth. Targets include 5 upper-air variables (Z, T, U, V, R) at 13 pressure levels (50–1000 hPa) and 5 surface variables (T2M, U10, V10, MSL, TP), totaling 70 variables. Data are split into training (1979–2015; 54,020 samples), validation (2016–2017; 2,920), and testing (2018; 1,460). ECMWF references HRES-fc0 and ENS-fc0 provide verification baselines for HRES and EM.

Model and forecasting: FuXi is an autoregressive model that takes two prior time steps (t−1, t) to predict t+1 at 6-hour intervals. A single-model 15-day forecast would require 60 iterations, leading to error accumulation; thus, a cascaded design is used: three pre-trained and fine-tuned models specialized for 0–5 days (FuXi-Short), 5–10 days (FuXi-Medium), and 10–15 days (FuXi-Long). Outputs at 5 days (step 20) from FuXi-Short feed FuXi-Medium; outputs at 10 days (step 40) feed FuXi-Long. To enable efficient fine-tuning, outputs of preceding stages for 2012–2017 are cached.

Architecture: Inputs (2 × 70 × 721 × 1440) are embedded via space-time cube embedding (3D convolution with kernel/stride 2×4×4, output channels C=1536) followed by LayerNorm, yielding C × 180 × 360. The core backbone is a U-Transformer: 48 Swin Transformer V2 blocks using scaled cosine attention with relative position bias and a learnable scale, plus a U-Net-like downsampling block (to C × 90 × 180) and upsampling block (back to C × 180 × 360) built from residual conv blocks (3×3 convs, GroupNorm, SiLU) with skip connections. A fully connected layer maps features to 70 channels; outputs are reshaped to 70 × 720 × 1440 and bilinearly interpolated to 70 × 721 × 1440.

Training: Implemented in PyTorch with FSDP, bfloat16, and gradient checkpointing. Pre-training uses supervised single-step prediction with a latitude-weighted L1 loss, AdamW (β1=0.9, β2=0.95), initial LR 2.5e−4, weight decay 0.1, and Scheduled DropPath 0.2, for 40,000 iterations on 8×A100 GPUs (batch size 1 per GPU), taking ~30 hours.
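As one concrete piece of the training setup, a latitude-weighted L1 loss can be sketched in NumPy. The cos-latitude weighting and its normalization to unit mean are the standard convention for regular lat-lon grids (polar rows are oversampled relative to their area); the exact normalization is an assumption here, not taken from the paper.

```python
import numpy as np

def lat_weighted_l1(pred, target, lats_deg):
    """Latitude-weighted L1 loss on a regular lat-lon grid (a sketch).

    pred, target: arrays of shape (..., n_lat, n_lon).
    lats_deg: latitudes in degrees, shape (n_lat,).
    """
    w = np.cos(np.deg2rad(lats_deg))
    w = w / w.mean()  # normalize so the weights average to 1
    # broadcast the per-latitude weight over the longitude (last) axis
    err = np.abs(pred - target) * w[..., :, None]
    return err.mean()
```

With unit-mean weights, a uniform absolute error of 1 everywhere yields a loss of exactly 1, which makes losses comparable across grids.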
Fine-tuning: an autoregressive curriculum increases the number of steps from 2 to 12; FuXi-Short is optimized for 0–5 days, while FuXi-Medium and FuXi-Long are initialized from the preceding stage's weights and fine-tuned for 5–10 and 10–15 days, respectively; constant LR 1e−7; ~2 days per stage on 8×A100 GPUs.

Ensemble: A 50-member ensemble is generated by perturbing initial conditions with random Perlin noise (49 perturbations with 4 octaves, scaling 0.5; noise periods: channel=1, lat=6, lon=6) and by perturbing model parameters via Monte Carlo dropout at rate 0.2 during inference.

Evaluation: Deterministic skill uses latitude-weighted RMSE and anomaly correlation coefficient (ACC) against ERA5; some comparisons report differences normalized with ECMWF EM as the reference. ECMWF HRES/EM are verified against HRES-fc0/ENS-fc0, following ECMWF practice. Probabilistic evaluation uses the continuous ranked probability score (CRPS, computed under a Gaussian assumption from the ensemble mean and variance), ensemble spread, and the spread–skill ratio (SSR = ensemble spread divided by the RMSE of the ensemble mean; SSR ≈ 1 indicates reliable spread). All results are reported on the 2018 test set, with 00 and 12 UTC initializations producing 6-hourly forecasts out to 15 days.
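The Gaussian-assumption CRPS mentioned above has a well-known closed form given the ensemble mean and standard deviation. A minimal sketch of that standard formula (not code from the paper):

```python
import math

def crps_gaussian(y, mu, sigma):
    """Closed-form CRPS for a Gaussian forecast N(mu, sigma^2) and
    observation y: sigma * (z*(2*Phi(z)-1) + 2*phi(z) - 1/sqrt(pi)),
    with z = (y - mu) / sigma. Lower is better; 0 is a perfect forecast."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)           # phi(z)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))                  # Phi(z)
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))
```

For a standard normal forecast with the observation at the mean, the CRPS is about 0.234·sigma; it grows as the observation moves into the tails or as the spread becomes miscalibrated.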
- Against ECMWF HRES and GraphCast: FuXi and GraphCast both outperform HRES across variables; FuXi matches GraphCast up to ~7 days and surpasses it thereafter, with higher ACC and lower RMSE across surface (MSL, T2M, U10, V10) and upper-air (Z500, T500, U500, V500) fields. Using ACC≥0.6 as a skill threshold, FuXi extends skillful lead time versus HRES: Z500 from 9.25 to 10.5 days; T2M from 10 to 14.5 days.
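The ACC ≥ 0.6 skillful-lead-time readout can be computed from a 6-hourly ACC curve by locating the first threshold crossing. This is an illustrative helper; the 6-hourly grid matches the paper's cadence, but the linear interpolation of the crossing is an assumption.

```python
import numpy as np

def skillful_lead_days(acc, step_hours=6, threshold=0.6):
    """Lead time (in days) at which an ACC curve first drops below the
    skill threshold, linearly interpolated between verification times.

    acc: ACC values at lead times step_hours, 2*step_hours, ...
    """
    acc = np.asarray(acc, dtype=float)
    t = np.arange(1, acc.size + 1) * step_hours / 24.0  # lead times in days
    below = np.nonzero(acc < threshold)[0]
    if below.size == 0:
        return t[-1]            # skillful over the whole evaluated range
    i = below[0]
    if i == 0:
        return t[0]
    # interpolate the crossing between t[i-1] (skillful) and t[i] (not)
    frac = (acc[i - 1] - threshold) / (acc[i - 1] - acc[i])
    return t[i - 1] + frac * (t[i] - t[i - 1])
```

Applying this readout to the Z500 and T2M ACC curves is what yields lead-time comparisons such as 9.25 vs. 10.5 days.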
- Against ECMWF EM: FuXi shows superior deterministic performance for 0–9 days (positive normalized ACC difference; negative normalized RMSE difference). Beyond 9 days it is slightly inferior to EM but overall comparable over 15 days: FuXi attains higher ACC in 67.92% and lower RMSE in 53.75% of 240 variable–level–lead combinations. FuXi ensemble mean (FuXi EM) is slightly worse than deterministic FuXi at short leads but better after ~3 days, consistent with prior ML models.
- Spatial error structure: RMSE is largest at high latitudes and over land for Z500 and T2M. FuXi outperforms HRES at most grid points (predominantly positive HRES−FuXi RMSE differences), while EM and FuXi are largely comparable (differences near zero) across regions.
- Probabilistic performance: FuXi ensemble CRPS is comparable to ECMWF ensemble and slightly smaller for leads up to ~9 days; beyond 9 days, ECMWF ensemble has lower CRPS. FuXi ensemble SSR indicates early overdispersion (>1) for Z500, T850, MSL, transitioning to underdispersion (<1) after ~9 days; T2M is underdispersive throughout for both systems. FuXi ensemble spread increases initially then decreases after ~9 days, likely due to decay of flow-independent Perlin-noise perturbations.
The cascaded architecture addresses the central challenge of error accumulation in autoregressive ML weather forecasting by specializing models for distinct lead-time windows and passing intermediate states between them. This design yields strong long-lead deterministic skill, extending HRES’s skillful lead times for key variables and achieving overall performance comparable to ECMWF EM at 15 days. The ensemble implementation demonstrates potential for competitive probabilistic skill up to 9 days and highlights the need for better perturbations to maintain spread at longer leads. Findings support the hypothesis that a single model cannot be optimal across all lead times and that targeted fine-tuning within a cascade can balance short- and long-range performance. The spatial error analysis confirms FuXi’s broad superiority to HRES and parity with EM, while ensemble diagnostics indicate improvements are needed in maintaining reliable spread beyond the medium range.
FuXi introduces a cascaded ML architecture for 15-day global forecasts at 0.25° and 6-hourly resolution, trained on ERA5. By fine-tuning three models for successive 5-day windows and cascading them, FuXi reduces autoregressive error accumulation, extends skillful lead time compared to ECMWF HRES (Z500 to 10.5 days; T2M to 14.5 days), and delivers deterministic performance comparable to ECMWF EM at 15 days. A 50-member ML ensemble using initial-condition and parameter perturbations provides competitive probabilistic skill up to 9 days. Future work will (1) develop flow-dependent initial perturbations to sustain ensemble spread at longer leads, (2) extend the cascade framework to sub-seasonal (14–28 days) prediction, and (3) pursue data-driven data assimilation for end-to-end ML forecasting that directly ingests observations without reliance on NWP analyses.
- Ensemble spread reliability: The FuXi ensemble exhibits overdispersion at short leads for several variables and underdispersion after ~9 days, with spread decreasing at longer leads; flow-independent Perlin-noise perturbations likely decay with integration time, limiting long-range spread.
- Dependence on reanalysis: The system relies on ERA5 analyses as ground truth and for initial conditions; it is not yet an end-to-end system that assimilates observations directly.
- Variable availability for comparisons: Some ECMWF variables/levels were unavailable from the server, limiting comprehensive direct comparisons for all fields.
- Memory and computational constraints: Autoregressive fine-tuning over many steps is memory-intensive, requiring cached intermediate outputs and potentially limiting model size and step count.
- Range limitation: Generalization beyond 15 days is not demonstrated; sub-seasonal performance remains an open challenge.