Earth Sciences
Deep learning for twelve hour precipitation forecasts
L. Espeholt, S. Agrawal, et al.
The paper addresses short-term (up to 12 hours) probabilistic precipitation forecasting at high spatial and temporal resolution, a domain traditionally dominated by physics-based Numerical Weather Prediction (NWP) models. It outlines limitations of NWP—computational bottlenecks tied to grid resolution and challenges in improving physical parameterizations—and motivates deep neural networks as efficient, parallelizable alternatives that learn mappings from observations to forecast distributions. The study introduces MetNet-2, a neural model producing 1 km resolution forecasts at 2-minute frequency across the CONUS, conditioned on a large spatiotemporal context. The goal is to assess whether neural models can emulate aspects of atmospheric physics to deliver skillful high-resolution forecasts up to 12 hours ahead and compare against state-of-the-art NWP ensembles (HREF) and deterministic models (HRRR).
The work situates itself within the probabilistic weather forecasting and nowcasting literature, contrasting physics-based NWP (HRRR, HREF ensembles), which relies on data assimilation and numerical solutions of the Navier–Stokes equations, with emerging deep learning approaches to weather prediction. It references ensemble post-processing for probabilistic forecasts, the scores used to evaluate them (e.g., CRPS, Brier Score), and prior neural models such as MetNet. It highlights the known dependency of NWP forecast skill on grid resolution and the computational barriers to higher resolution, along with the scientific difficulty of improving physical parameterizations. The paper also notes that extrapolation and short-term nowcasting methods lose skill beyond a few hours of lead time, which establishes the 12-hour horizon as a challenging benchmark for learning-based models.
Data and inputs: The model ingests multi-source observations over CONUS including MRMS radar reflectivity-derived precipitation (instantaneous at 2-min intervals; hourly cumulative gauge-corrected), HRRR data assimilation state variables (pressure, temperature, winds, vorticity, etc.) representing a detailed analysis of the atmospheric state, geospatial and temporal inputs (longitude, latitude, elevation, forecast time), and optical satellite imagery. The target variable is precipitation rate (0–102.4 mm/h), represented as both instantaneous and hourly cumulative precipitation.
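As a rough illustration of how a 0–102.4 mm/h target range could be turned into class labels, the sketch below assumes 512 uniform bins (matching the 512 output categories mentioned in the architecture description), which gives 0.2 mm/h per bin. The uniform spacing and the names `rate_to_bin` and `bin_lower_edge` are assumptions made for this example, not details confirmed by the summary.

```python
# Minimal sketch: discretizing precipitation rate into class labels.
# ASSUMPTION: 512 uniform bins over 0-102.4 mm/h (0.2 mm/h each).
NUM_BINS = 512
MAX_RATE_MM_H = 102.4
BIN_WIDTH = MAX_RATE_MM_H / NUM_BINS  # 0.2 mm/h under the uniform-bin assumption


def rate_to_bin(rate_mm_h: float) -> int:
    """Map a precipitation rate (mm/h) to a bin index in [0, NUM_BINS - 1]."""
    # Clip to the representable range so out-of-range values land in the edge bins.
    clipped = min(max(rate_mm_h, 0.0), MAX_RATE_MM_H)
    return min(int(clipped / BIN_WIDTH), NUM_BINS - 1)


def bin_lower_edge(index: int) -> float:
    """Lower edge (mm/h) of the bin with the given index."""
    return index * BIN_WIDTH


if __name__ == "__main__":
    for rate in (0.0, 0.3, 5.05, 500.0):
        print(rate, "->", rate_to_bin(rate))
```

Rates that fall exactly on a bin edge can land on either side because of floating-point rounding, so a production pipeline would likely bin integer-scaled values instead.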
Spatial-temporal context: For each target 512 km × 512 km patch at 1 km resolution, MetNet-2 processes a 2048 km × 2048 km input context to capture mesoscale to synoptic-scale influences, corresponding to roughly 64–85 km of context per forecast hour up to 12 hours. Inputs are downsampled by 4× per spatial dimension to 512×512 for computational efficiency, and multiple recent time slices per feature are stacked to capture temporal dynamics.
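The 64–85 km per hour figure can be reproduced with simple arithmetic. The sketch below reflects one plausible reading, stated here as an assumption: the lower value measures context from the edge of the target patch to the edge of the input, and the upper value measures it from the center of the target patch.

```python
# Minimal sketch of the context arithmetic; the interpretation of the
# 64-85 km range (edge vs. center of the target patch) is an ASSUMPTION.
INPUT_KM = 2048     # side length of the input context
TARGET_KM = 512     # side length of the target patch
MAX_LEAD_H = 12     # longest forecast lead time in hours
DOWNSAMPLE = 4      # spatial downsampling factor per dimension

# Distance from the target patch edge to the input edge.
margin_km = (INPUT_KM - TARGET_KM) / 2
# Distance from the target patch center to the input edge.
center_km = INPUT_KM / 2

per_hour_from_edge = margin_km / MAX_LEAD_H
per_hour_from_center = center_km / MAX_LEAD_H

# Grid size after downsampling the 1 km input by 4x in each dimension.
grid_size = INPUT_KM // DOWNSAMPLE

if __name__ == "__main__":
    print(f"context per hour: {per_hour_from_edge:.1f} to {per_hour_from_center:.1f} km")
    print(f"downsampled grid: {grid_size} x {grid_size}")
```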
Architecture: An input encoder uses a convolutional recurrent network along the time dimension to embed sequences of radar, assimilation, satellite, and geospatial features. A stack of 2D convolutional residual blocks with exponentially increasing dilation factors (doubling per layer) then aggregates large spatial context efficiently, because the receptive field grows exponentially with depth. A strong lead-time conditioning mechanism embeds the forecast lead time and conditions every convolutional layer on it. After context aggregation, a center crop selects the target region and a tiling/upsampling step restores 1 km resolution. A final convolutional head outputs, for each pixel, a discrete probability distribution over 512 precipitation-rate categories, which is what makes the forecast probabilistic. Training uses model parallelism across multiple TPU cores (e.g., a 4×4 mesh) to accommodate the large dilation factors and memory footprint; dilated convolutions can span TPU boundaries.
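To see why doubling the dilation matters, the sketch below compares the receptive field of a stack of dilated convolutions with that of an undilated stack. The kernel size (3), the number of layers (8), and the stride of 1 are illustrative assumptions; the summary does not give the actual values, and the real residual blocks may contain more than one convolution each.

```python
# Minimal sketch: receptive field of stacked stride-1 convolutions.
# ASSUMPTIONS: 3x3 kernels, one convolution per layer, 8 layers.
from typing import Sequence


def receptive_field(dilations: Sequence[int], kernel_size: int = 3) -> int:
    """Receptive field (in pixels, along one axis) of a stride-1 conv stack."""
    field = 1
    for dilation in dilations:
        # Each layer widens the field by (kernel_size - 1) * dilation pixels.
        field += (kernel_size - 1) * dilation
    return field


# Dilation doubles at every layer: 1, 2, 4, ..., 128.
doubling = [2 ** i for i in range(8)]
# Same depth without dilation, for comparison.
undilated = [1] * 8

if __name__ == "__main__":
    print("doubling dilations:", receptive_field(doubling), "pixels")
    print("no dilation:", receptive_field(undilated), "pixels")
```

Under these assumptions, eight layers cover 511 pixels per axis with doubling dilations versus 17 pixels without, which is roughly the full 512×512 downsampled input.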
Training data and splits: Data span July 2017–August 2020. Non-overlapping temporal blocks generate train/validation/test splits using sequences of 400 h (train), 12 h (validation), 40 h (test), 40 h (test), and 12 h hiatuses, with spatial sampling over a 0.5° grid across CONUS. Two test sets are constructed: Dataset A (cumulative precipitation only; overlaps HREF availability at 953 timestamps) and Dataset B (both cumulative and instantaneous precipitation; covers all timestamps). Each test set has 39,841 patches.
Variants for hybrid experiments: MetNet-2 (default) uses observations and HRRR assimilation inputs but no NWP forecast fields. MetNet-2 Postprocess takes HRRR deterministic forecasts and learns a probabilistic mapping, correcting biases. MetNet-2 Hybrid takes both the default inputs and HRRR forecast fields to produce probabilistic outputs.
Evaluation: Probabilistic metrics include Continuous Ranked Probability Score (CRPS) and CRPS Skill Score (CRPSS); categorical metrics include Critical Success Index (CSI) at thresholds (e.g., ≥0.2, 2, 8, 20 mm/h). Brier Score maps quantify probabilistic error at given thresholds. Comparisons are made against HREF (probabilistic ensemble of multiple NWP models) and HRRR (deterministic). Ablation studies vary input context size (2048 vs 1536/1024/512 km), inclusion of assimilation features, lead-time conditioning, and maximum dilation factor.
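The metrics named above have standard definitions, sketched below for a single pixel or a small list of pixels. The function names, the 0.2 mm/h default bin width, and the discrete form of CRPS over ordered bins are choices made for this example; the paper's exact evaluation code is not described in this summary.

```python
# Minimal sketch of the evaluation metrics, using standard textbook definitions.
from typing import Sequence


def crps_discrete(probs: Sequence[float], obs_bin: int, bin_width: float = 0.2) -> float:
    """CRPS for a categorical forecast over ordered bins (lower is better)."""
    forecast_cdf = 0.0
    total = 0.0
    for k, p in enumerate(probs):
        forecast_cdf += p
        # The observation's CDF is a step function at the observed bin.
        observed_cdf = 1.0 if k >= obs_bin else 0.0
        total += (forecast_cdf - observed_cdf) ** 2 * bin_width
    return total


def skill_score(score: float, reference_score: float) -> float:
    """Skill score such as CRPSS: 1 is perfect, 0 matches the reference."""
    return 1.0 - score / reference_score


def brier_score(prob_exceed: Sequence[float], occurred: Sequence[bool]) -> float:
    """Mean squared error of threshold-exceedance probabilities."""
    errors = [(p - float(o)) ** 2 for p, o in zip(prob_exceed, occurred)]
    return sum(errors) / len(errors)


def csi(forecast_yes: Sequence[bool], observed_yes: Sequence[bool]) -> float:
    """Critical Success Index: hits / (hits + misses + false alarms)."""
    hits = sum(f and o for f, o in zip(forecast_yes, observed_yes))
    misses = sum((not f) and o for f, o in zip(forecast_yes, observed_yes))
    false_alarms = sum(f and (not o) for f, o in zip(forecast_yes, observed_yes))
    denom = hits + misses + false_alarms
    return hits / denom if denom else float("nan")


if __name__ == "__main__":
    print(crps_discrete([0.1, 0.6, 0.3], obs_bin=1))
    print(csi([True, True, False, False], [True, False, True, False]))
```

CSI ignores correct negatives, which is why it is commonly used for rare events such as heavy precipitation.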
Interpretability: Integrated Gradients attribute input features’ contributions to predictions. Analysis focuses on absolute vorticity at various pressure levels and its changing importance with lead time, relating findings to quasi-geostrophic theory.
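For readers unfamiliar with the technique, the sketch below applies Integrated Gradients to a toy three-input function rather than to a neural network. It approximates the path integral with a midpoint Riemann sum and estimates gradients by central finite differences; `toy_model`, the all-zero baseline, and the step count are hypothetical choices for this example and are unrelated to the paper's actual setup.

```python
# Minimal sketch of Integrated Gradients on a toy function (not MetNet-2).
from typing import Callable, List, Sequence


def integrated_gradients(
    f: Callable[[List[float]], float],
    x: Sequence[float],
    baseline: Sequence[float],
    steps: int = 50,
    eps: float = 1e-5,
) -> List[float]:
    """Attribute f(x) - f(baseline) to each input feature."""
    n = len(x)
    avg_grad = [0.0] * n
    for s in range(steps):
        # Midpoint of the s-th segment on the straight path baseline -> x.
        alpha = (s + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i in range(n):
            up = list(point)
            down = list(point)
            up[i] += eps
            down[i] -= eps
            # Central finite-difference estimate of the partial derivative.
            avg_grad[i] += (f(up) - f(down)) / (2.0 * eps) / steps
    # Scale the path-averaged gradient by the input's distance from the baseline.
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


def toy_model(v: List[float]) -> float:
    """Hypothetical stand-in for a network output."""
    return v[0] * v[1] + 2.0 * v[2]


if __name__ == "__main__":
    attributions = integrated_gradients(toy_model, [1.0, 2.0, 3.0], [0.0, 0.0, 0.0])
    print(attributions)
    # Completeness: attributions sum to f(x) - f(baseline).
    print(sum(attributions), toy_model([1.0, 2.0, 3.0]) - toy_model([0.0, 0.0, 0.0]))
```

The completeness property checked at the end is what makes per-feature attributions, such as those for vorticity at different pressure levels, comparable across lead times.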
- MetNet-2 vs HREF (Dataset A, cumulative precipitation): MetNet-2 achieves better probabilistic performance (lower CRPS) than HREF across the entire 1–12 h lead time range. When thresholded to categorical forecasts, MetNet-2 attains higher CSI than HREF for at least the first 9 hours at both low and high precipitation rates (up to 20 mm/h). Both MetNet-2 and HREF outperform HRRR across 12 hours.
- Hybrid results (Dataset B): MetNet-2 Hybrid outperforms MetNet-2 Postprocess across 1–12 h in both CRPS/CRPSS and CSI for instantaneous and cumulative precipitation, showing that combining HRRR forecasts with observational inputs adds value. The Hybrid variant keeps this advantage through the full 12 h lead time and improves on the HRRR baseline.
- Case studies show MetNet-2’s probabilistic contours expand and uncertainty increases with lead time, tracking observed precipitation structures, including challenging events (e.g., Hurricane Isaias).
- Ablations: Larger spatial context (2048 km) improves skill over smaller contexts (1536, 1024, 512 km). Inclusion of HRRR assimilation features notably improves performance, especially at later hours. Removing the lead-time conditioning or capping dilation (≤16) degrades performance.
- Interpretability: The importance of absolute vorticity increases with lead time, with upper-level (e.g., 250 hPa) vorticity becoming more influential by 12 h, consistent with quasi-geostrophic theory linking upper-level positive vorticity to lower-tropospheric ascent that can precondition convection.
Findings demonstrate that a physics-free, data-driven neural model can deliver skillful 1 km resolution precipitation forecasts up to 12 hours, rivaling or surpassing state-of-the-art ensemble NWP methods in probabilistic and early-horizon categorical metrics. MetNet-2’s efficiency (seconds vs hours), probabilistic nature, and design simplicity enable frequent, high-resolution forecasting and potential personalization. The hybrid results indicate neural models can extract complementary information to NWP, extending utility at longer lead times. Attribution analyses suggest the network has learned relationships aligned with atmospheric dynamics (e.g., upper-level vorticity), supporting the premise that neural networks can emulate aspects of large-scale physics from data. These outcomes point toward a shift from heavy reliance on forward atmospheric simulations to learning-based frameworks, with opportunities to integrate more direct observations and broaden variable coverage.
The paper introduces MetNet-2, a deep learning model producing probabilistic 1 km precipitation forecasts up to 12 hours ahead, outperforming an operational NWP ensemble (HREF) in CRPS across all lead times and in CSI through 9 hours, and showing further gains in a hybrid configuration with HRRR inputs. Architectural advances (large-context processing via exponentially dilated convolutions, strong lead-time conditioning, model-parallel training) and rich observational inputs underpin its performance. The results validate neural weather models as viable, efficient alternatives for short-term forecasting and indicate that they can learn physically consistent patterns. Future directions include incorporating direct sensor data to reduce dependence on NWP assimilation states, extending to multiple weather variables jointly, leveraging transfer across variables, and expanding beyond CONUS.
- Dependence on HRRR’s data assimilation state for many atmospheric variables; direct sensor data are not yet readily available, contributing to remaining latency and reliance on NWP-derived initial states.
- Evaluation and training focused on CONUS; generalization to other regions and observing systems is not established here.
- Extreme precipitation rates are rare in the training distribution, potentially limiting performance at the highest intensities.
- The HREF comparison covers only cumulative precipitation, because HREF does not provide instantaneous precipitation; this constrains some comparative analyses.