Lightning nowcasting with aerosol-informed machine learning and satellite-enriched dataset

G. Song, S. Li, et al.

This study by Ge Song, Siwei Li, and Jia Xing uses machine learning to improve lightning nowcasting accuracy with aerosol features and satellite observations. The model reaches 94.3% accuracy, and the analysis reveals unexpected influences of different aerosol types on lightning occurrence.

Introduction

Lightning causes significant fatalities and economic losses globally and strongly affects atmospheric chemistry via NOx production. Accurate nowcasting is essential for preparedness in sectors where timeliness is critical. Conventional numerical models simulate lightning via parameterized microphysics but face trade-offs between detection and false alarms and are computationally expensive for nowcasting. Data-driven machine learning models offer efficiency and can capture nonlinear relationships with practical feature inputs, yet prior work has struggled with high false alarm ratios at high POD, likely due to limited training data and incomplete feature sets. Previous studies relied on ground-based networks and polar-orbiting sensors with limited spatial/temporal coverage, restricting model performance. The advent of the Geostationary Lightning Mapper (GLM) on GOES enables continuous, high-coverage lightning observations, improving data quality for observation-based models. Another gap is the omission of aerosol influences despite evidence that aerosols modulate convection, microphysics, and radiative effects, with different components exerting distinct impacts. This study aims to enhance lightning nowcasting by integrating aerosol optical depth and composition with meteorological variables and leveraging GLM observations for labeling, evaluating performance with standard nowcasting metrics, and interpreting how aerosol characteristics influence lightning formation.

Literature Review

Prior machine learning approaches for lightning occurrence have included artificial neural networks, decision trees, LightGBM, SVMs, random forests, and LSTM networks using meteorological inputs from stations, reanalyses/forecasts, and radar. While efficient, they often exhibit elevated FAR at high POD, attributed to limited datasets and missing predictors. Ground-based lightning networks and polar-orbiting satellites suffer from variable detection efficiency and sparse coverage per overpass, constraining model accuracy. GLM provides continuous, hemispheric lightning observations with high temporal resolution, enabling improved training labels and evaluation. Numerous observational and modeling studies demonstrate aerosol impacts on lightning through microphysical invigoration and radiative effects; component-specific influences have been reported (e.g., sulfate and organic compounds often enhancing, black carbon inhibiting). These insights motivate incorporating aerosol composition and loading into nowcasting models.

Methodology

Data sources and preprocessing: Lightning occurrence labels were derived from GOES-16 GLM L2+ flash products, providing continuous, full-coverage detection of total lightning (cloud-to-ground and intracloud). GLM detections were gridded to 0.25°×0.25° and hourly resolution over CONUS. Because lightning tends to occur in continuous sequences, instances with fewer than five flash records within 5 minutes were treated as suspicious noise or outliers and removed. Aerosol predictors included aerosol optical depths for five components (black carbon, dust, organic carbon, sulfate, sea salt) from Copernicus Atmosphere Monitoring Service (CAMS) global atmospheric composition forecasts, and hourly surface PM2.5 fields estimated via a published machine learning approach trained on EPA AQS observations. Meteorological predictors were also taken from CAMS forecasts to keep the dataset consistent, and were interpolated to 0.25° hourly resolution; they included surface pressure, temperature at 500 hPa, relative humidity at 500 hPa, and U and V wind components (as specified in the paper). Temporal features included day of year and local hour. Historical flash density was used as a strong predictor.

Model: A LightGBM gradient-boosted decision tree classifier was trained to predict next-hour lightning occurrence (binary) at 0.25° hourly resolution. Hyperparameters were optimized via grid search. To address class imbalance (few lightning-active cases), a focal loss variant was implemented within LightGBM with weighting parameters α=0.75 and γ=0 to balance precision and recall. Two configurations were compared: a full model including aerosol features and a reduced model without aerosol predictors.

Training and evaluation: The primary study period was summer 2020 (June–August), yielding 37,415,530 samples. Performance was assessed using 10-fold day-based cross-validation (each fold uses a block of consecutive days as the test set and the remaining days as the training set). Transferability was evaluated via out-of-sample testing on summer 2021 after training only on summer 2020. Metrics included accuracy, POD, FAR, CSI, HSS, and PRC-AUC.

Model interpretability: SHAP analysis was applied to quantify feature contributions and determine whether aerosol components positively or negatively influenced predicted lightning occurrence.
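
The class-imbalance handling described above can be sketched in a few lines. With γ=0 the focal-loss modulating factor equals 1, so the loss reduces to an α-weighted cross-entropy. The sketch below is a minimal illustration, not the authors' implementation: the function names are hypothetical, and it assumes that α=0.75 weights the lightning-active class (the summary does not say which class α applies to). It returns the per-sample gradient and Hessian with respect to the raw score, which is the information a custom objective for a gradient-boosting library typically requires.

```python
import numpy as np

ALPHA = 0.75  # weighting parameter quoted in the text; ASSUMED here to weight the lightning class
GAMMA = 0.0   # focusing parameter quoted in the text


def sigmoid(z):
    """Map raw scores to probabilities."""
    return 1.0 / (1.0 + np.exp(-z))


def focal_loss(z, y, alpha=ALPHA, gamma=GAMMA):
    """Per-sample binary focal loss for raw scores z and labels y in {0, 1}."""
    p = np.clip(sigmoid(z), 1e-12, 1.0 - 1e-12)  # avoid log(0)
    pos = -alpha * y * (1.0 - p) ** gamma * np.log(p)
    neg = -(1.0 - alpha) * (1.0 - y) * p ** gamma * np.log(1.0 - p)
    return pos + neg


def grad_hess_gamma0(z, y, alpha=ALPHA):
    """Gradient and Hessian of the loss w.r.t. z for the special case gamma = 0.

    With gamma = 0 the loss is an alpha-weighted cross-entropy, so the usual
    (p - y) and p * (1 - p) terms are simply scaled by a per-class weight.
    """
    p = sigmoid(z)
    w = alpha * y + (1.0 - alpha) * (1.0 - y)  # alpha for positives, 1 - alpha for negatives
    grad = w * (p - y)
    hess = w * p * (1.0 - p)
    return grad, hess


if __name__ == "__main__":
    z = np.array([0.0, 0.0])
    y = np.array([1.0, 0.0])
    print(grad_hess_gamma0(z, y))
```

Under these assumptions, at a predicted probability of 0.5 a missed lightning case contributes a gradient of magnitude 0.375 versus 0.125 for a false alarm, so misses are penalized three times as heavily.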

Key Findings
  • Overall performance: In summer 2020 cross-validation, the aerosol-informed LightGBM achieved accuracy 94.3%, POD 75%, FAR 38% (38.1% in the abstract), and PRC-AUC 0.727. On the precision-recall curve, the operating point at POD 75% corresponds to FAR ≈38%.
  • Transferability: Out-of-sample validation on summer 2021 yielded PRC-AUC 0.699, indicating strong temporal transferability.
  • Baseline comparisons (Fig. 2): Proposed model outperformed Persistence and CAPE baselines. POD: 0.75 (model) vs 0.53 (Persistence) vs 0.47 (CAPE). FAR: 0.38 vs 0.45 vs 0.85. CSI: 0.53 vs 0.37 vs 0.13. HSS: 0.66 vs 0.55 vs 0.20.
  • Enriched vs data-deplete models: Using GLM (data-enriched) markedly improved performance over LMA-based (data-deplete). POD was similar (75% vs 72%), but FAR was much lower for enriched (36% vs 56%); CSI and HSS were higher for enriched.
  • Spatial performance: Higher POD and lower FAR in southeastern CONUS where lightning density is high; performance correlates with lightning density. Sparse-lightning regions (<0.05 flashes km−2) had lower POD (~30%).
  • Aerosol temporal linkage: Diurnal anomalies of AOD strongly correlate with lightning occurrence anomalies (Pearson r=0.897) compared with temperature (r=0.772). Time-lagged cross-correlation shows AOD is synchronous (0 h lag) and retains high correlation at +1 h, indicating predictive value; RH and temperature show weaker lead–lag indication.
  • Aerosol contribution to skill: At high POD requirements (>75%), including aerosol features increases correct rejection rate (1−FAR) by >10%. After 2 pm local time, aerosol information reduces FAR by 0.10–0.15 and increases CSI by 0.05–0.10.
  • Dependence on AOD: For low-to-moderate AOD (≤0.2), both models improve, but the aerosol-informed model consistently achieves lower FAR (≈30–50%). At higher AOD (e.g., ≥0.4, potential pollution events), aerosol inclusion can reduce false warnings by ~40%.
  • Regional aerosol benefits: Largest improvements in POD, FAR, and CSI in southeastern and Midwestern CONUS; some west-coast areas may see degraded performance, likely related to different aerosol regimes (e.g., wildfire black carbon).
  • SHAP feature interpretation: Flash density is the strongest predictor. Among aerosols, sulfate is most influential and positively associated with lightning, followed by sea salt and organic carbon (positive). Black carbon shows a negative association. Meteorological variables (notably relative humidity) have moderate importance, consistent with lightning favoring moist conditions.
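
The skill scores quoted in the findings above all derive from a 2×2 contingency table of hits, misses, false alarms, and correct negatives. The sketch below uses the standard definitions of these scores; the function names are hypothetical, and the counts in the usage example are made up for illustration rather than taken from the study. FAR here is the false alarm ratio, false alarms divided by all predicted events.

```python
def contingency_metrics(hits, misses, false_alarms, correct_negatives):
    """Standard categorical verification scores from a 2x2 contingency table.

    No guard against empty categories: a table with no observed or no
    predicted events would raise ZeroDivisionError.
    """
    a, b, c, d = hits, false_alarms, misses, correct_negatives
    n = a + b + c + d
    return {
        "accuracy": (a + d) / n,
        "pod": a / (a + c),       # probability of detection
        "far": b / (a + b),       # false alarm ratio
        "csi": a / (a + b + c),   # critical success index
        # Heidke skill score: accuracy relative to random chance
        "hss": 2 * (a * d - b * c) / ((a + c) * (c + d) + (a + b) * (b + d)),
    }


def csi_from_pod_far(pod, far):
    """CSI implied by a (POD, FAR) pair: 1/CSI = 1/POD + 1/(1 - FAR) - 1."""
    return 1.0 / (1.0 / pod + 1.0 / (1.0 - far) - 1.0)


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the calculation.
    scores = contingency_metrics(hits=75, misses=25, false_alarms=46, correct_negatives=854)
    for name, value in scores.items():
        print(f"{name}: {value:.3f}")

    # Consistency check on the baseline values quoted in the text.
    print(f"Persistence CSI: {csi_from_pod_far(0.53, 0.45):.2f}")
    print(f"CAPE CSI: {csi_from_pod_far(0.47, 0.85):.2f}")
```

Because CSI is fully determined by POD and FAR, the identity in `csi_from_pod_far` gives a quick consistency check on reported scores: the Persistence values (POD 0.53, FAR 0.45) imply CSI ≈ 0.37 and the CAPE values (POD 0.47, FAR 0.85) imply CSI ≈ 0.13, matching the figures listed above.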

Discussion

Integrating aerosol information with meteorology and GLM labels substantially improves next-hour lightning nowcasting relative to baselines and to prior ML models driven by meteorological inputs alone. The model’s high accuracy, favorable POD–FAR trade-offs, and strong transferability indicate robustness and potential for incorporation into parameterizations or operational systems. Interpretable ML confirms physically plausible aerosol influences: sulfate and organic aerosols (and sea salt under continental loading) tend to enhance lightning via microphysical pathways, whereas black carbon suppresses lightning via radiative heating and stabilization effects. Spatial analyses show the model performs best where lightning is frequent and aerosol loading is moderate to high, which coincides with the regions where protection needs are greatest. Lower skill in lightning-sparse or low-aerosol regions reflects class imbalance and weaker signals. The reliance on CAMS forecast-based aerosol products highlights the importance of high-quality, real-time aerosol observations; improvements in satellite aerosol retrievals should further enhance predictive performance. Overall, the findings address the research goal by demonstrating that aerosol-informed, satellite-enriched ML can both improve nowcasting skill and elucidate aerosol–lightning relationships.

Conclusion

This work presents an aerosol-informed LightGBM model trained on GLM-enriched datasets to nowcast lightning occurrence hourly at 0.25° resolution across CONUS. The model achieves strong performance (accuracy 94.3%, POD 75%, FAR 38%, PRC-AUC 0.727), outperforms persistence and CAPE baselines, and generalizes to a subsequent summer. Aerosol features materially improve skill at high POD thresholds, reduce false alarms especially in afternoon/evening hours, and reveal component-specific effects: sulfate and organic aerosols (and sea salt under continental conditions) enhance lightning likelihood, while black carbon suppresses it. These insights align with established physical understanding and highlight the value of combining aerosol characterization with meteorology for operational nowcasting and for informing lightning parameterizations in numerical models. Future work should: (1) mitigate class imbalance to boost performance in lightning-sparse regions; (2) incorporate improved, real-time satellite aerosol retrievals and additional microphysical/thermodynamic features; (3) deepen analysis of aerosol type–loading regimes (e.g., wildfire smoke) to refine regional performance; and (4) explore integration within NWP workflows for coupled prediction.

Limitations
  • Class imbalance: Sparse-lightning regions (e.g., western CONUS) exhibit reduced POD and higher uncertainty, reflecting limitations in handling highly imbalanced datasets despite the focal loss.
  • Aerosol data constraints: Aerosol predictors derived from CAMS forecast products and ML-estimated PM2.5 may carry biases and lack full fidelity of direct satellite observations; incomplete satellite aerosol coverage currently limits fully observation-based inputs.
  • Regional aerosol regimes: Performance may degrade in specific regimes (e.g., west coast wildfire black carbon), indicating that aerosol-type heterogeneity can challenge generalization.
  • Extremely high POD targets: Achieving POD >80% with low FAR remains difficult without additional features capturing the full chain of lightning formation processes.
  • Data-deplete baseline limitations: LMA-based models show distance-dependent detection efficiency, restricting comparability and applicability where ground networks are sparse.