Skilful precipitation nowcasting using deep generative models of radar
S. Ravuri, K. Lenc, et al.
A team from DeepMind and the Met Office presents a deep generative model for precipitation nowcasting that improves forecast accuracy and operational usefulness. The model produces realistic, probabilistic predictions over large areas at lead times of up to 90 minutes.
Introduction
The study addresses precipitation nowcasting (forecasts 0–2 hours ahead), which supports real-time, weather-dependent decisions in sectors such as emergency response, energy, retail, aviation and flood warning. Traditional ensemble numerical weather prediction (NWP) methods are less effective at these lead times because of model spin-up and data assimilation challenges. This motivates radar-based alternatives that advect precipitation along motion fields derived from optical flow and add stochastic perturbations to produce probabilistic ensembles. Recent deep learning approaches predict future rain rates directly from radar. They improve skill for low-intensity events but often produce blurred, less physically plausible fields at longer lead times and lack consistent probabilistic predictions at the field level. The research objective is to develop and evaluate a deep generative, probabilistic nowcasting approach that maintains spatiotemporal realism, improves location accuracy, preserves variability across scales, and delivers higher decision-making value, particularly for medium-to-heavy rain events.
Literature Review
Operational probabilistic nowcasting commonly relies on advection-based ensemble methods such as STEPS and PySTEPS, which estimate motion via optical flow and add stochastic perturbations to yield ensembles consistent across scales. These systems can provide useful uncertainty estimates but struggle with non-linear phenomena like convective initiation. Deep learning methods (e.g., CNNs, ConvLSTMs, UNet variants, attention-based models like MetNet/axial attention) trained on large radar archives have shown improved per-gridcell metrics (e.g., CSI at low thresholds) yet often express uncertainty via spatial blurring at longer lead times, degrading effective resolution and small-scale structures. Prior work emphasizes deterministic or location-specific predictions rather than coherent probabilistic fields, limiting operational utility. The need remains for methods that provide ensemble-like probabilistic forecasts, capture non-linear precipitation evolution, and verify well across spatial-temporal scales while retaining high effective resolution.
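To make the advection idea concrete, here is a minimal toy sketch in Python. It is not the STEPS or PySTEPS implementation. It assumes a single uniform motion vector in whole grid cells, whereas operational systems estimate a spatially varying field with optical flow, and it perturbs only that vector, whereas operational systems also perturb the precipitation field across scales. The function names (`advect`, `advection_ensemble`, `exceedance_probability`), the `jitter` parameter, and the toy radar field are all hypothetical.

```python
import random


def advect(field, dx, dy):
    """Shift a 2D field by whole grid cells, filling vacated cells with 0.

    Positive dx moves rain toward higher column indices,
    positive dy toward higher row indices.
    """
    rows, cols = len(field), len(field[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            si, sj = i - dy, j - dx  # source cell that moves into (i, j)
            if 0 <= si < rows and 0 <= sj < cols:
                out[i][j] = field[si][sj]
    return out


def advection_ensemble(field, motion, steps, members, jitter=1, seed=0):
    """Build an ensemble by perturbing the motion vector for each member."""
    rng = random.Random(seed)
    ensemble = []
    for _ in range(members):
        dx = motion[0] + rng.randint(-jitter, jitter)
        dy = motion[1] + rng.randint(-jitter, jitter)
        state = field
        for _ in range(steps):
            state = advect(state, dx, dy)
        ensemble.append(state)
    return ensemble


def exceedance_probability(ensemble, threshold):
    """Per-cell fraction of members with rain rate >= threshold."""
    rows, cols = len(ensemble[0]), len(ensemble[0][0])
    return [
        [sum(m[i][j] >= threshold for m in ensemble) / len(ensemble)
         for j in range(cols)]
        for i in range(rows)
    ]


# Toy radar field in mm/h with one rainy cell (made-up values).
radar = [[0.0, 0.0, 0.0, 0.0],
         [0.0, 5.0, 0.0, 0.0],
         [0.0, 0.0, 0.0, 0.0]]
ensemble = advection_ensemble(radar, motion=(1, 0), steps=1, members=8)
prob_heavy = exceedance_probability(ensemble, threshold=2.0)
```

Each member moves the rainy cell to a slightly different place, so `prob_heavy` spreads the probability of exceeding 2 mm/h over neighbouring cells. Because the field itself is only translated, a scheme like this cannot create new rain, which is one way to see why advection struggles with convective initiation.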
Methodology
The proposed approach is a conditional deep generative model of radar precipitation fields (DGMR) that predicts N future radar frames conditioned on M past frames. It is implemented as a conditional GAN tailored to spatiotemporal radar sequences. The generator takes four past radar observations (M = 4, 20 minutes of context at 5-minute intervals) and produces multiple stochastic realizations of 18 future frames (N = 18, 90 minutes). Two discriminators guide learning: (1) a spatial discriminator (2D CNN) enforces realistic spatial structure and discourages blurring; (2) a temporal discriminator (3D CNN) enforces temporal coherence and penalizes jumpy sequences. A regularization term aligns the predictive mean (over multiple samples) with observed radar at the grid-cell level to improve location accuracy. The generator includes a fully convolutional latent module, which allows inference over larger spatial extents than those used in training while maintaining spatiotemporal consistency. An ablation study (Supplementary Information) supports the necessity of each loss term.

Training data comprise a large corpus of UK radar observations (2016–2018), extracted as 256×256 crops of 22 frames (110 minutes, that is, 4 context frames plus 18 target frames). An importance sampling scheme biases training toward heavier precipitation events. Models are evaluated on a 2019 UK test set; generalization to a US dataset and to weekly splits is reported in Extended Data and Supplementary Information. Inference is efficient: a single full-resolution prediction on an NVIDIA V100 takes just over a second.

Baselines include PySTEPS (state-of-the-art ensemble advection-based nowcasting), a deterministic UNet deep learning model, and an axial attention radar-only model inspired by MetNet. Verification uses a suite of metrics: the critical success index (CSI) at multiple rain-rate thresholds for location accuracy; radially averaged power spectral density (PSD) for spatial-scale variability; and the continuous ranked probability score (CRPS) for probabilistic accuracy, with pooled/aggregated versions to assess multi-scale consistency.
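As a rough illustration of the location-accuracy metric named above, the sketch below computes CSI as hits / (hits + misses + false alarms) after thresholding both fields. The function name `csi`, the nested-list field representation, and the toy rain rates are assumptions for illustration and are not taken from the study's evaluation code.

```python
def csi(forecast, observed, threshold):
    """Critical success index: hits / (hits + misses + false alarms)."""
    hits = misses = false_alarms = 0
    for f_row, o_row in zip(forecast, observed):
        for f, o in zip(f_row, o_row):
            f_event, o_event = f >= threshold, o >= threshold
            if f_event and o_event:
                hits += 1
            elif o_event:
                misses += 1
            elif f_event:
                false_alarms += 1
    total = hits + misses + false_alarms
    # With no forecast or observed events the score is undefined.
    return hits / total if total else float("nan")


# Toy 2x3 rain-rate fields in mm/h (made-up numbers, not from the study).
observed = [[0.0, 3.0, 9.0], [2.5, 0.0, 1.0]]
forecast = [[0.0, 2.0, 4.0], [0.5, 2.2, 1.0]]
csi_2 = csi(forecast, observed, threshold=2.0)
csi_8 = csi(forecast, observed, threshold=8.0)
```

With these toy fields, `csi_2` is 0.5 (two hits, one miss, one false alarm) and `csi_8` is 0.0 (the single 9 mm/h cell is missed). Correct negatives do not enter the score, which is why CSI suits rare events such as heavy rain.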
Post-processing calibration (“recal”) is applied to DGMR to improve reliability, which is assessed with reliability diagrams and rank histograms. Forecast value is assessed in two ways: with a standard relative economic value framework applied to ensemble outputs at multiple accumulation thresholds, and with a two-phase cognitive evaluation in which 56 Met Office operational meteorologists ranked anonymized nowcasts by accuracy and usefulness for medium (≥5 mm h⁻¹) and heavy (≥10 mm h⁻¹) rain cases.
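The summary does not spell out the relative economic value framework. The sketch below assumes the common static cost-loss formulation, in which a user pays a cost C to protect against an event and otherwise suffers a loss L when it occurs, and value compares forecast-based expense with climatological and perfect-forecast expense. The function names, the `members` argument, and the toy probabilities are hypothetical and are not the study's code.

```python
def relative_value(hit_rate, false_alarm_rate, base_rate, cost_loss_ratio):
    """Relative economic value in the static cost-loss model.

    Expenses are expressed per unit loss. Requires 0 < base_rate < 1 and
    0 < cost_loss_ratio < 1 so that the denominator is non-zero.
    """
    s, a = base_rate, cost_loss_ratio
    expense_climate = min(a, s)   # cheaper of always or never protecting
    expense_perfect = s * a       # protect only when the event occurs
    expense_forecast = (false_alarm_rate * (1 - s) * a
                        - hit_rate * s * (1 - a) + s)
    return (expense_climate - expense_forecast) / (expense_climate - expense_perfect)


def rates(probabilities, outcomes, p_threshold):
    """Hit rate and false alarm rate when acting if probability >= p_threshold."""
    pairs = list(zip(probabilities, outcomes))
    hits = sum(1 for p, o in pairs if o and p >= p_threshold)
    false_alarms = sum(1 for p, o in pairs if not o and p >= p_threshold)
    events = sum(1 for o in outcomes if o)
    return hits / events, false_alarms / (len(outcomes) - events)


def ensemble_value(probabilities, outcomes, cost_loss_ratio, members):
    """Best value over the probability thresholds an ensemble can express."""
    base_rate = sum(1 for o in outcomes if o) / len(outcomes)
    thresholds = [k / members for k in range(1, members + 1)]
    return max(
        relative_value(*rates(probabilities, outcomes, t), base_rate, cost_loss_ratio)
        for t in thresholds
    )


# Toy ensemble exceedance probabilities and observed outcomes (made up).
probabilities = [0.9, 0.8, 0.3, 0.1, 0.6, 0.2]
outcomes = [True, True, False, False, False, True]
best_value = ensemble_value(probabilities, outcomes, cost_loss_ratio=0.4, members=20)
```

A value of 1 corresponds to a perfect forecast and 0 to doing no better than climatology. Repeating the calculation over a range of cost-loss ratios traces out a value curve for each accumulation threshold.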
Key Findings
- Case study performance (24 June 2019, eastern Scotland convection): Meteorologists significantly preferred DGMR, with 93% selecting it as their first choice. DGMR better preserved spatial coverage, convection, and heavy rainfall early in the forecast than PySTEPS (which overestimated intensity) and the deep learning baselines (which were blurred and missed small-scale structure). Example scores for this event at T+30, T+60 and T+90 min, where CSI2/8 denotes CSI at the 2 and 8 mm h⁻¹ thresholds: DGMR CSI2/8 = 0.54/0.14, 0.50/0.04, 0.48/0.02 with CRPS = 0.52, 0.62, 0.53; PySTEPS CSI2/8 = 0.30/0.02, 0.19/0.02, 0.13/0.01 with CRPS = 0.61, 0.69, 0.64; UNet CSI2/8 = 0.57/0.13, 0.52/0.02, 0.50/0.00 with CRPS = 0.78, 0.90, 0.80; axial attention CSI2/8 = 0.58/0.11, 0.54/0.02, 0.50/0.00 with CRPS = 0.70, 0.62, 0.72.
- CSI (2019 UK, aggregated): All deep learning systems outperform PySTEPS on CSI across thresholds, with DGMR having significant skill improvements over PySTEPS at all precipitation thresholds (paired permutation tests on alternating weeks).
- PSD: DGMR and PySTEPS match observed spectral characteristics, preserving medium- and small-scale variability. UNet and axial attention forecasts blur with lead time, reducing effective resolution to 32 km (UNet) and 16 km (axial attention) by T+90 min, versus the 1 km native grid.
- Probabilistic verification (CRPS): At grid resolution, DGMR, PySTEPS, and axial attention perform similarly (axial attention improves with temperature scaling). With increasing spatial aggregation, DGMR and PySTEPS maintain strong performance, and DGMR scores better on maximum precipitation, whereas axial attention degrades at larger scales. Differences between DGMR and temperature-optimized axial attention are statistically significant.
- Post-processing recalibration (DGMR-recal) improves reliability and skill (CRPS), matching or exceeding baselines on reliability diagrams and rank histograms.
- Economic value: Using 20-member ensembles and 4-km aggregations for 90-min accumulations (≥5, ≥10, ≥15 mm), DGMR achieves the highest relative economic value (higher peaks and larger area under value curves) compared to baselines.
- Cognitive evaluation (n = 56 meteorologists): DGMR was most preferred in 89% (95% CI [0.86, 0.92], P < 10⁻⁴) of medium rain cases (≥5 mm h⁻¹) and 90% (95% CI [0.87, 0.92], P < 10⁻⁴) of heavy rain cases (≥10 mm h⁻¹), indicating superior perceived accuracy and usefulness.
- Operational characteristics: DGMR produces realistic, spatiotemporally consistent predictions over large domains (up to 1,536 km × 1,280 km) with 5–90 min lead times, generating full-resolution nowcasts rapidly (just over 1 s per sample on a V100). Generalization to US data and alternative train/test splits is demonstrated in extended analyses.
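The pooled CRPS comparison in the findings above can be sketched as follows. This is a minimal illustration, assuming the standard ensemble estimator CRPS = mean|x_i − y| − ½·mean|x_i − x_j| and non-overlapping k × k pooling windows. The exact pooling scheme used in the study is not given in this summary, and the names `ensemble_crps`, `pool`, and `pooled_crps` are hypothetical.

```python
def ensemble_crps(members, obs):
    """CRPS of one observation against ensemble members (lower is better)."""
    m = len(members)
    accuracy = sum(abs(x - obs) for x in members) / m
    spread = sum(abs(a - b) for a in members for b in members) / (2 * m * m)
    return accuracy - spread


def mean(values):
    return sum(values) / len(values)


def pool(field, k, reducer):
    """Non-overlapping k x k pooling of a 2D list; reducer is e.g. mean or max."""
    rows, cols = len(field), len(field[0])
    return [
        [reducer([field[i + di][j + dj] for di in range(k) for dj in range(k)])
         for j in range(0, cols - k + 1, k)]
        for i in range(0, rows - k + 1, k)
    ]


def pooled_crps(ensemble, observed, k, reducer):
    """Average CRPS over cells after pooling members and observation alike."""
    pooled_obs = pool(observed, k, reducer)
    pooled_members = [pool(member, k, reducer) for member in ensemble]
    scores = [
        ensemble_crps([m[i][j] for m in pooled_members], pooled_obs[i][j])
        for i in range(len(pooled_obs))
        for j in range(len(pooled_obs[0]))
    ]
    return mean(scores)


# Toy 2x2 observation and two-member ensemble in mm/h (made-up values).
observed = [[0.0, 4.0], [2.0, 6.0]]
ensemble = [
    [[0.0, 2.0], [2.0, 4.0]],
    [[0.0, 6.0], [2.0, 8.0]],
]
crps_grid = pooled_crps(ensemble, observed, k=1, reducer=mean)
crps_avg = pooled_crps(ensemble, observed, k=2, reducer=mean)
crps_max = pooled_crps(ensemble, observed, k=2, reducer=max)
```

For these toy fields the scores are 0.5 at grid scale, 0.5 with 2 × 2 average pooling, and 1.0 with 2 × 2 max pooling. Max pooling isolates how well the ensemble captures the peak rain rate within each window, which is the sense in which a model can do well "on maximum precipitation".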
Discussion
The findings demonstrate that a deep generative, probabilistic approach to radar-based nowcasting can overcome key limitations of both advection-based ensemble systems and prior deep learning models. By learning the conditional distribution of future precipitation fields and sampling coherent ensembles, DGMR preserves spatial and temporal structure without resorting to blur when expressing uncertainty. This leads to improved location accuracy (CSI), maintenance of observed spectral characteristics across scales (PSD), and competitive to superior probabilistic accuracy (CRPS), especially at larger aggregations and for extremes (maximum precipitation). These verification gains translate into higher decision-making value: DGMR’s ensembles yield higher relative economic value, and expert meteorologists overwhelmingly prefer DGMR nowcasts for medium and heavy rain events, underscoring operational utility beyond standard metrics. The model’s calibration can be further enhanced via simple post-processing, aligning reliability with established methods. Results generalize across regions (UK, US) and data splits, indicating robustness. Overall, the approach addresses the research goal of delivering skilful, probabilistic nowcasts that are consistent across scales and valuable for real-world decisions, particularly where traditional methods struggle with non-linear convective behavior.
Conclusion
The study presents DGMR, a deep generative model for probabilistic precipitation nowcasting from radar that delivers fast, accurate, and spatiotemporally consistent forecasts up to 90 minutes ahead over large domains. Across statistical (CSI, PSD, CRPS), economic, and cognitive evaluations, DGMR improves forecast quality, consistency, and value relative to strong baselines, without blurring away small-scale features. Expert meteorologists strongly prefer DGMR for medium and heavy precipitation scenarios, supporting its operational relevance. Future work should address the remaining challenges in predicting heavy precipitation at longer lead times, develop verification metrics better aligned with operational utility, and further integrate machine learning with environmental science to cover broader environmental variables and use cases.
Limitations
- Predicting heavy precipitation at long lead times remains challenging for all approaches, including DGMR.
- Standard verification metrics (e.g., grid-cell CSI/CRPS) may not fully capture operational value, necessitating new metrics aligned with practitioner needs.
- The method relies on radar observations; performance may depend on radar data quality, coverage, and preprocessing. Calibration/post-processing is beneficial for reliability.
- While generalization to a US dataset is shown, further validation across diverse climates and radar systems would strengthen generalizability.