Earth Sciences

Evaluation of five global AI models for predicting weather in Eastern Asia and Western Pacific

C. Liu, K. Hsu, et al.

This study evaluates five machine learning-based weather prediction models over Eastern Asia and the Western Pacific. FengWu is the most accurate overall, particularly for typhoon track prediction. The work is by Cheng-Chin Liu, Kathryn Hsu, and colleagues.

Introduction

The study addresses how recent machine learning-based weather prediction (MLWP) models perform for regional forecasting over Eastern Asia and the Western Pacific, with emphasis on synoptic-scale skill and tropical cyclone (TC) prediction. It sets this against the rapid evolution of AI/ML, its prior applications in Earth sciences, and the 70-year development of numerical weather prediction (NWP). Several global MLWP systems (FourCastNet/FCN2, Pangu-Weather, GraphCast, FuXi, FengWu) have matched or surpassed ECMWF’s IFS in some cases. However, a homogeneous, comprehensive comparison using identical initial conditions is lacking, particularly for extreme weather such as typhoons. Because global models provide the initial and boundary conditions for the high-resolution regional models that are critical to East Asia and Western Pacific forecasting (e.g., Taiwan’s CWA operations), the study aims to systematically compare these MLWP models under consistent conditions and assess implications for regional forecasting.

Literature Review

Prior work established the feasibility of ML for weather prediction, including early global ML models whose performance was limited by coarse (~5°) training data. Key advances include FourCastNet (Vision Transformer with FNO/AFNO) producing 0.25° forecasts from ERA5; Pangu-Weather (3D Earth-Specific Transformer based on Swin) surpassing IFS-HRES in the medium range; GraphCast (multi-mesh GNN on icosahedral grids with autoregressive fine-tuning) outperforming IFS-HRES on many variables; FuXi (Swin v2 U-transformer with cascade models for different lead times); and FengWu (multi-modal, multi-task transformer with replay buffer to improve long-lead forecasts). Comparative studies highlighted strengths and weaknesses: Ben Bouallègue et al. found Pangu-Weather comparable to IFS on many metrics but with smooth forecasts, bias that increases with lead time, and poor TC intensity prediction; Charlton-Perez et al. showed that multiple MLWP models captured the synoptic structure of Storm Ciarán yet underestimated peak winds, suggesting smoothing tendencies arising from ML training and objectives. The ability of multi-model ensembles to outperform single models is well established in seasonal prediction (e.g., NMME). Gaps remain in homogeneous regional evaluations and in systematic assessment of TC track and intensity skill across MLWP models.

Methodology

Models: Five global MLWP systems—Pangu-Weather, FCN2, GraphCast, FuXi, FengWu—each run at 0.25° horizontal resolution, 13 vertical levels, 6-hour time step. Model codes from developers’ repositories were used to generate forecasts.
Initial conditions and verification: ERA5 reanalysis used for initial conditions (identical for all MLWP models) and as verification at valid time. ECMWF IFS (9 km) is included as a reference, with the caveat that it starts from its own operational analysis, which does not include post-analysis data.
Domain and period: East Asia and Western Pacific region (as in Fig. 1), June–November 2023.
Forecast length: 168 hours (7 days).
Metrics: Latitude-weighted RMSE and Anomaly Correlation Coefficient (ACC) for representative variables (e.g., 500 hPa geopotential height, 850 hPa temperature, 2 m temperature, 10 m zonal wind). Bias assessment via the Western Pacific Subtropical High (WPSH) position/intensity using 500 hPa geopotential height contours (e.g., 5880 m and 5760 m lines) for JJA and SON.
Tropical cyclone evaluation: 11 typhoons in 2023 (three short-lived storms excluded). TC center identified as the minimum sea-level pressure in forecasts; verification best tracks from CWA (similar to IBTrACS). Track errors (km) and absolute intensity errors (hPa) computed by forecast lead time; additional summary comparison at 96 h per storm (Table 1). Track types categorized: straight, curving, recurve, irregular.
Ensemble strategy: Simple multi-model ensemble by arithmetic mean of the five MLWP forecasts; no perturbation of initial conditions; assessed via RMSE/ACC and TC track errors.
Statistical tests: Mann–Whitney U tests for non-normal error distributions. Comparisons conducted between Pangu-Weather or FengWu and the other MLWP models. Track error differences reported at 95% confidence and intensity error differences at 90% confidence (FengWu vs. others).
Case study: Typhoon Haikui (Aug 28–Sep 4, 2023). Examined model tracks versus WPSH position (5880 m contour at 500 hPa) for two 96 h forecasts 6 hours apart, linking WPSH westward migration to track adjustments.
Precipitation case (Haikui, 48 h from 12 UTC 2 Sep): Evaluated rainfall from GraphCast and FuXi (only MLWP models with precipitation outputs), plus IFS and regional TWRF (15 km), and ERA5; verification from QPESUMS. Also assessed TC intensity evolution (10 m max wind, MSLP) and earliest TC genesis prediction lead times per model.
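
The latitude-weighted RMSE and ACC listed under Metrics can be sketched as follows. This is a minimal illustration, assuming the cosine-of-latitude weighting (normalised to a mean of 1) that is common in MLWP benchmarks; the summary does not give the study's exact formulas, and the function names and the list-of-rows field layout are hypothetical.

```python
import math


def latitude_weights(lats_deg):
    """cos(latitude) weights, normalised so that they average to 1."""
    cosines = [math.cos(math.radians(lat)) for lat in lats_deg]
    mean_cos = sum(cosines) / len(cosines)
    return [c / mean_cos for c in cosines]


def weighted_rmse(forecast, truth, lats_deg):
    """Latitude-weighted RMSE; fields are 2-D lists indexed [lat][lon]."""
    weights = latitude_weights(lats_deg)
    total, count = 0.0, 0
    for w, f_row, t_row in zip(weights, forecast, truth):
        for f, t in zip(f_row, t_row):
            total += w * (f - t) ** 2
            count += 1
    return math.sqrt(total / count)


def weighted_acc(forecast, truth, climatology, lats_deg):
    """Latitude-weighted anomaly correlation against a given climatology."""
    weights = latitude_weights(lats_deg)
    cross = f_var = t_var = 0.0
    for w, f_row, t_row, c_row in zip(weights, forecast, truth, climatology):
        for f, t, c in zip(f_row, t_row, c_row):
            f_anom, t_anom = f - c, t - c  # anomalies relative to climatology
            cross += w * f_anom * t_anom
            f_var += w * f_anom ** 2
            t_var += w * t_anom ** 2
    return cross / math.sqrt(f_var * t_var)


# Toy 500 hPa geopotential height values (m), not data from the study.
lats = [10.0, 20.0, 30.0]
era5 = [[5880.0, 5870.0], [5860.0, 5850.0], [5800.0, 5790.0]]
clim = [[5870.0, 5870.0], [5850.0, 5850.0], [5800.0, 5800.0]]
model = [[5878.0, 5869.0], [5857.0, 5851.0], [5803.0, 5792.0]]
print(weighted_rmse(model, era5, lats), weighted_acc(model, era5, clim, lats))
```

The weighting compensates for the shrinking area of grid cells toward the poles on a regular 0.25° grid, so that high-latitude points do not dominate the regional score.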

Key Findings

Synoptic-scale skill (RMSE/ACC): FengWu consistently best; FuXi and GraphCast next; FCN2 behind; Pangu-Weather worst among MLWP; IFS (9 km) generally in the middle. IFS exhibits larger initial RMSE due to analysis differences (no post-analysis data), but its RMSE growth trends align with MLWP models; Pangu-Weather shows larger error growth.
Multi-model ensemble: Simple average of the five MLWP models improves performance over individual weaker models; as more models are added, gains diminish. The full ensemble approaches FengWu’s performance and reduces error spread, particularly for TC tracks, mitigating outliers.
Bias/WPSH: All MLWP models show a weak WPSH bias (ERA5’s 5880 m contour encloses all model contours). Pangu-Weather has the largest weak bias; FengWu the smallest. Higher-latitude 5760 m contours cluster more tightly than tropical contours, indicating greater tropical challenges likely inherited from training data.
TC track prediction: Overall ranking mirrors RMSE/ACC—FengWu best track errors up to 144 h; at 168 h, FengWu slightly worse than FuXi and near GraphCast. IFS errors lie mid-pack despite best initial position. Ensemble closely tracks FengWu and stays within error ranges, avoiding outliers. Table 1 shows FengWu leads in four individual typhoons; Pangu-Weather, despite highest average track error, is best for three storms; IFS leads in two recurving storms; GraphCast and FuXi each lead in one.
TC intensity prediction: Using absolute errors, all models show weak biases. FengWu has the largest intensity errors despite best tracks. GraphCast and Pangu-Weather have the lowest intensity errors. IFS has much lower initial intensity bias and, together with higher resolution, shows superior intensity evolution relative to MLWP models. Intensity errors generally decrease from 120–168 h as storms weaken late in life cycles.
Statistical significance: Mann–Whitney U tests indicate 95% confidence in track error differences between selected model (Pangu-Weather or FengWu) and other MLWP models; 90% confidence for intensity error differences between FengWu and other MLWP models.
Haikui case study: IFS had largest average track error at 96 h (>1000 km) in early forecasts with strong poleward bias; FCN2 and GraphCast showed similar early biases. FengWu best (≈41 km at 96 h), FuXi second, Pangu-Weather next. Westward migration of WPSH in MLWP runs aligned with corrected westward tracks; IFS lagged in WPSH adjustment and retained a poleward track initially.
Precipitation and intensity during Haikui (48 h from 12 UTC 2 Sep): TWRF (15 km) reproduced observed extreme 48 h rainfall (>700 mm). IFS achieved ~600 mm and good spatial distribution. ERA5 reached ~300–400 mm. Among MLWP, GraphCast ≈200 mm, FuXi ≈130–150 mm; FuXi’s weaker intensity in first 24 h contributed to lower rainfall. Circular rainfall over Taiwan Strait was weaker in FuXi. IFS and TWRF intensity evolutions were closest to best track due to higher resolution.
TC genesis prediction (Haikui): Earliest formation prediction leads—FCN2: 6 days; Pangu-Weather: 5 days; IFS and GraphCast: 4.5 days; FuXi and FengWu: 4 days.
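
The track and intensity errors reported above rest on two simple steps: locating the forecast TC center as the minimum sea-level pressure, and measuring its distance from the best-track position. The sketch below illustrates both. It assumes a small grid window already cut out around the storm and uses the haversine formula with a mean Earth radius; the summary does not describe the study's actual tracker, and all names here are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumption for this sketch)


def find_tc_center(mslp, lats_deg, lons_deg):
    """Return (lat, lon, pressure) at the minimum MSLP in a [lat][lon] window."""
    best = None
    for lat, row in zip(lats_deg, mslp):
        for lon, pressure in zip(lons_deg, row):
            if best is None or pressure < best[2]:
                best = (lat, lon, pressure)
    return best


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance between two points, in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def tc_errors(forecast_center, best_track):
    """Track error (km) and absolute intensity error (hPa) for one valid time."""
    f_lat, f_lon, f_pressure = forecast_center
    b_lat, b_lon, b_pressure = best_track
    track_km = great_circle_km(f_lat, f_lon, b_lat, b_lon)
    return track_km, abs(f_pressure - b_pressure)


# Toy MSLP window (hPa) and best-track point, not data from the study.
lats = [22.0, 22.25, 22.5]
lons = [121.0, 121.25, 121.5]
mslp = [[1008.0, 1004.0, 1006.0],
        [1002.0, 985.0, 1001.0],
        [1007.0, 1003.0, 1009.0]]
center = find_tc_center(mslp, lats, lons)
print(center, tc_errors(center, (22.5, 121.0, 960.0)))
```

On a 0.25° grid the pressure minimum is smoothed relative to the best track, which is consistent with the weak intensity bias noted for the MLWP models; an operational tracker would also restrict the search to the neighborhood of the previous position.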

Discussion

The study provides a homogeneous evaluation of five leading MLWP models in a region of operational significance (East Asia/Western Pacific), isolating differences by using identical ERA5 initial conditions and uniform configurations. Findings show clear stratification in synoptic-scale skill (FengWu > FuXi ≈ GraphCast > FCN2 > Pangu-Weather), with IFS generally in the middle. Track skill correlates with synoptic RMSE/ACC, underscoring the importance of large-scale flow fidelity for TC steering. A simple multi-model ensemble nearly matches the best single model while constraining outliers, supporting ensemble use in operations. Persistent challenges include tropical biases, smoothing tendencies, and weaker TC intensity prediction, likely linked to MLWP training data resolution and objectives. The Haikui case highlights sensitivity to WPSH representation; timely westward shifts improved MLWP track predictions while delayed adjustment degraded IFS early forecasts. For high-impact local phenomena, MLWP precipitation underpredicts extremes relative to high-resolution NWP (TWRF) and IFS, reinforcing the need for regional high-resolution modeling or advanced downscaling. Emerging AI developments (e.g., higher-resolution AI, causal multi-modal approaches for TC intensity, AIFS) suggest paths to improve intensity and fine-scale precipitation. Operational adoption should consider ensemble strategies, high-resolution downscaling, and potential benefits of integrating additional data modalities beyond ERA5.
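
The simple multi-model ensemble discussed here is an equal-weight arithmetic mean of the member forecasts. A minimal sketch, assuming each member field is a 2-D list on the same grid (the function name and toy values are illustrative, not taken from the study):

```python
def ensemble_mean(fields):
    """Equal-weight, grid-point mean of same-shaped 2-D fields ([lat][lon])."""
    n_members = len(fields)
    n_rows, n_cols = len(fields[0]), len(fields[0][0])
    return [
        [sum(field[i][j] for field in fields) / n_members for j in range(n_cols)]
        for i in range(n_rows)
    ]


# Toy 500 hPa heights (m) from three hypothetical members at two grid points.
members = [
    [[5870.0, 5861.0]],
    [[5886.0, 5858.0]],
    [[5881.0, 5866.0]],
]
truth = [[5880.0, 5860.0]]

mean_field = ensemble_mean(members)
mean_error = [abs(m - t) for m, t in zip(mean_field[0], truth[0])]
print(mean_field, mean_error)
```

Because member errors of opposite sign partly cancel in the mean, the ensemble tends to avoid the worst outliers. The same averaging also smooths sharp features, so it does not by itself address the weak intensity and rainfall extremes noted above.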

Conclusion

A consistent, regional evaluation of five global MLWP models shows FengWu as the top performer on synoptic-scale metrics and TC tracks, with FuXi and GraphCast close behind, FCN2 trailing, and Pangu-Weather last. A simple multi-model ensemble rivals the best model and reduces outlier risk, especially for TC tracks. All MLWP models exhibit a weak WPSH bias, with FengWu smallest and Pangu-Weather largest. Despite best track skill, FengWu has the poorest TC intensity performance; GraphCast and Pangu-Weather have the lowest intensity errors, while IFS benefits from low initial intensity bias and higher resolution. A Haikui case study links WPSH shifts to track outcomes and shows MLWP underestimation of extreme rainfall compared to IFS and regional TWRF. These results support using MLWP ensembles for robust synoptic guidance while maintaining high-resolution regional models for extremes. Future work includes testing with other reanalyses (e.g., NCEP), exploring higher-resolution training and downscaling, integrating additional data (e.g., satellite for TC intensity), and conducting more rigorous, broader verification across seasons and regions.

Limitations

Initial conditions: All MLWP runs initialized from ERA5; IFS uses operational analyses without post-analysis data. This non-operational setup may alter relative performance compared to real-time operations.
Resolution and variables: MLWP models at 0.25° with 13 levels; only FuXi and GraphCast provided precipitation outputs for rainfall evaluation. Coarse training resolution and limited variables can smooth extremes and weaken intensity forecasts.
Regional and temporal scope: Evaluation confined to East Asia/Western Pacific and June–November 2023; results may not generalize to other regions/seasons.
Ensemble design: Simple equal-weight multi-model mean; no consideration of initial condition uncertainty or optimal weighting.
TC evaluation sample: 11 typhoons, with stage-dependent sampling by lead time; single-case analysis for genesis prediction offers limited generality.
Bias diagnostics: WPSH analysis based on select contours (5880/5760 m), providing a first-order view; deeper dynamics not fully explored.
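
With a sample of only 11 typhoons and error distributions that are not normal, the study relies on the Mann–Whitney U test to compare models. The sketch below shows the test in its textbook form, using average ranks for ties and a normal approximation without continuity correction. It is an illustration under those assumptions, not the study's code; library implementations may apply a continuity correction or an exact method and give slightly different p-values.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sizes = []
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        tie_sizes.append(end - start + 1)
        start = end + 1
    return ranks, tie_sizes


def mann_whitney_u(sample_a, sample_b):
    """Return (U for sample_a, two-sided p-value from the normal approximation)."""
    n_a, n_b = len(sample_a), len(sample_b)
    n = n_a + n_b
    ranks, tie_sizes = average_ranks(list(sample_a) + list(sample_b))
    # U_a counts the (a, b) pairs in which a exceeds b, with ties counting half.
    u_a = sum(ranks[:n_a]) - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    tie_term = sum(t ** 3 - t for t in tie_sizes) / (n * (n - 1))
    sigma = math.sqrt(n_a * n_b / 12 * ((n + 1) - tie_term))
    if sigma == 0.0:  # every value identical: no evidence of a difference
        return u_a, 1.0
    z = (u_a - mean_u) / sigma
    return u_a, math.erfc(abs(z) / math.sqrt(2))


# Toy 96 h track errors (km) for two hypothetical models, not study results.
model_a = [41.0, 95.0, 120.0, 150.0, 88.0, 132.0]
model_b = [210.0, 180.0, 260.0, 145.0, 305.0, 190.0]
u_stat, p_value = mann_whitney_u(model_a, model_b)
print(u_stat, p_value)
```

A difference would be called significant at 95% confidence when the two-sided p-value falls below 0.05, and at 90% confidence when it falls below 0.10.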
