Heatstroke predictions by machine learning, weather information, and an all-population registry for 12-hour heatstroke alerts

S. Ogata, M. Takegami, et al.

This study by Soshiro Ogata and colleagues develops prediction models for heatstroke cases across 16 Japanese cities. Using machine learning with weather, calendar, and city-level data, the models predict heatstroke counts more accurately than WBGT-only baselines, supporting public health measures against heat-related illness.

Introduction

The study addresses the urgent need for accurate, short-horizon heatstroke alerts as extreme heat events become more frequent in Japan. Heatstroke risk is closely tied to weather, especially high temperature and humidity, and exceptional summers (e.g., 2018 in Japan, 2019 in parts of Europe) underscored the public health impact. Wet Bulb Globe Temperature (WBGT) is widely used to assess heat risk, but it has limitations: it discriminates risk poorly in midsummer, when many days fall into the same broad warning categories, and it inadequately reflects heat strain when sweat evaporation is constrained. Few prior models used multiple weather variables, and those that did generally ignored severity stratification and the critical task of predicting spike days. The objective is to develop and validate prediction models at 12-hour resolution for (1) all heatstroke cases and (2) hospital admission and death cases, per city, using multi-source weather, calendar, and city demographic data linked with an all-population registry of ambulance-transported heatstroke patients across 16 cities (~10 million population) in Japan.

Literature Review

WBGT has a long history in assessing heat-related risk but has known limitations in high humidity/low airflow contexts and in discriminating risk during extended hot periods. Prior predictive studies in Japan and China used multiple weather variables to model daily heat-related illnesses but were limited by daily (not sub-daily) prediction units, lacked severity-specific predictions, and did not focus on predicting spike days. Heat waves (consecutive hot days) and timing around the rainy season end have been implicated in increased heat-related morbidity. The present work builds on these insights by integrating high-resolution weather data, population and city-level characteristics, and machine learning methods to improve granularity (12-hour windows) and to capture spikes and severity.

Methodology

Study design: Data from 16 cities in Japan’s Kinki region (Osaka, Toyonaka, Mino, Ikeda, Suita, Sakai, Kobe, Ashiya, Nishinomiya, Amagasaki, Akashi, Himeji, Kyoto, Uji, Muko, Nagaokakyo) were used. Heatstroke cases transported by ambulance between June 1 and September 30, 2015–2018 were drawn from the Fire and Disaster Management Agency registry. High-resolution weather data (2015–2018) were obtained from The Weather Company (IBM). Models were trained on 2015–2017 and tested on 2018.

Outcomes: Primary outcome was the number of all heatstrokes per city per 12-hour period (daytime 6:00–17:59, nighttime 18:00–5:59). Secondary outcome was the number of hospital admission (moderate/severe) and death cases per city per 12 hours.

Predictors: Weather variables included ambient temperature, relative humidity, precipitation (previous 12 h), wind speed, downward solar radiation, WBGT (for classic models), and derived differences between 12-hour frames and previous 24 hours for mean/max/min temperatures (capturing consecutive hot periods). Calendar variables included time of day, month, rainy season, days since end of rainy season, and holidays (weekends, national and Obon holidays). City characteristics included median age, total population (used as offset in GLM/GAM), population aged 65+, sex ratio, mean annual taxable income, and green area proportion.

Weather data processing: Weather data were provided hourly at 4-km grids (30-km for 2015 and for downward solar radiation 2015–2018 when 4-km unavailable), aggregated to city-hour means, then summarized to 12-hour means/max/min and differences vs. previous 24 h. WBGT was estimated via a validated formula using temperature, humidity, solar radiation, and wind speed.

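The 12-hour framing and the summary features described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' pipeline. The function names (`frame_key`, `summarize_frames`, `max_temp_diff_vs_prev_24h`) are hypothetical, and two details are assumptions that the summary does not state: hours 00:00–05:59 are assigned to the night that began the previous evening, and the "difference vs. previous 24 h" is taken as the current frame's maximum minus the maximum over the two preceding frames.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean


def frame_key(ts):
    """Map an hourly timestamp to (date, 'day' | 'night').

    Daytime is 06:00-17:59 and nighttime is 18:00-05:59, as in the text.
    Assumption: hours 00:00-05:59 belong to the night that started on the
    previous calendar date.
    """
    if 6 <= ts.hour < 18:
        return (ts.date(), "day")
    if ts.hour >= 18:
        return (ts.date(), "night")
    return ((ts - timedelta(days=1)).date(), "night")


def summarize_frames(hourly):
    """Aggregate (timestamp, temperature) pairs to per-frame mean/max/min."""
    buckets = defaultdict(list)
    for ts, temp in hourly:
        buckets[frame_key(ts)].append(temp)
    # Sort chronologically: within a date, 'day' precedes 'night'.
    ordered = sorted(buckets, key=lambda k: (k[0], k[1] == "night"))
    return [
        {
            "frame": k,
            "mean": mean(buckets[k]),
            "max": max(buckets[k]),
            "min": min(buckets[k]),
        }
        for k in ordered
    ]


def max_temp_diff_vs_prev_24h(frames):
    """Current-frame max minus the max over the two preceding frames.

    Assumed definition; None where fewer than two earlier frames exist.
    """
    diffs = []
    for i, frame in enumerate(frames):
        if i < 2:
            diffs.append(None)
        else:
            prev_max = max(frames[i - 1]["max"], frames[i - 2]["max"])
            diffs.append(frame["max"] - prev_max)
    return diffs


# Synthetic hourly temperatures (illustrative values, not study data).
start = datetime(2018, 7, 1, 6)
hourly = [(start + timedelta(hours=h), 25.0 + (h % 12)) for h in range(48)]
frames = summarize_frames(hourly)
diffs = max_temp_diff_vs_prev_24h(frames)
```

A difference near zero in this sketch means the current frame is about as hot as the preceding 24 hours, which is how such a feature could flag consecutive hot periods.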
Model development: Step 1 compared five approaches for each outcome: (1) GLM (Poisson) with WBGT only and log(population) offset, (2) GLM (Poisson) with multivariable predictors and offset, (3) GAM (Poisson) with multivariable predictors and offset (spline terms), (4) Random Forest, (5) XGBoost. Models were selected by lowest RMSE on cross-validation and evaluated on the 2018 test set.

Step 2 developed city-specific models (using the best approach from Step 1) and compared their RMSE to the single model across all cities to assess heterogeneity.

Step 3 (spike optimization): If spike prediction was inadequate, under-sampling and bagging were applied with XGBoost focused on learning spike patterns. Training data from large cities (>500,000 population) were partitioned into spike and no-spike clusters using a classifier trained on weather, calendar, and city features. Under-sampling of the no-spike cluster and bagging (10 repeats) produced an under-sampling XGBoost predictor; its outputs were combined with the GAM into a hybrid model based on thresholds of the daily sum of predicted values across cities (GAM used for <150, mean of GAM and under-sampling XGBoost for 150–<300, under-sampling XGBoost for ≥300).

Feature selection and hyperparameter tuning: Recursive feature elimination (RFE) with 5-fold cross-validation and grid search was used to select predictors and tune hyperparameters (minimizing RMSE). For city-specific models, features were selected on the pooled training set; hyperparameters were tuned per city. Population size served as offset in GLM/GAM.

Evaluation: Primary metric was RMSE per city per 12 h. Secondary metrics evaluated spikes: MAPE per day (across cities) and total absolute percentage error on days at/above the 80th percentile of daily totals per year. Predicted negatives were truncated at 0; extreme predicted values were capped at training maxima (104 for all cases, 48 for severe/death). SHAP values were computed to interpret predictor contributions.
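The hybrid thresholding rule, the truncation and capping of predictions, and the primary RMSE metric can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' code: the names `clip_prediction`, `hybrid_predict`, and `rmse` are hypothetical, and because the summary does not say which model's predictions feed the daily sum across cities, that sum is simply passed in by the caller.

```python
import math


def clip_prediction(value, training_max):
    """Truncate negatives at 0 and cap at the training-set maximum."""
    return min(max(value, 0.0), training_max)


def hybrid_predict(gam_pred, us_xgb_pred, daily_total, training_max=104.0):
    """Combine GAM and under-sampling XGBoost predictions for one city-frame.

    daily_total is the daily sum of predicted values across cities; which
    model's predictions feed that sum is an assumption left to the caller.
    The default cap of 104 is the training maximum reported for all cases.
    """
    if daily_total < 150:
        combined = gam_pred
    elif daily_total < 300:
        combined = (gam_pred + us_xgb_pred) / 2
    else:
        combined = us_xgb_pred
    return clip_prediction(combined, training_max)


def rmse(observed, predicted):
    """Root mean squared error over paired city-by-12-hour counts."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Illustrative inputs only (not study data).
low_day = hybrid_predict(4.0, 10.0, daily_total=120)     # GAM only
mid_day = hybrid_predict(4.0, 10.0, daily_total=200)     # mean of both
spike_day = hybrid_predict(4.0, 10.0, daily_total=350)   # XGBoost only
```

For the hospital admission and death outcome, the same clipping would use the reported training maximum of 48 via `training_max=48.0`.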

Key Findings

Dataset characteristics: Between June–September, incidence per 100,000 was 37.5 (95% CI 36.8–38.2) in training and 74.4 (72.7–76.1) in testing for all cases; 11.0 (10.6–11.3) and 19.6 (18.8–20.5) for hospital admission/death. Median ambient temperature per 12 h: 25.52 °C (training) vs 26.27 °C (testing). August WBGT severe warning (28–31 °C) days: 64.2% (training) vs 62.7% (testing); threat (>31 °C): 24.5% vs 30.2%.

All heatstroke cases (per city per 12 h):
- Test RMSE: GLM (WBGT only) 3.73; GLM (multivariable) 2.92; GAM (multivariable) 2.47 (best among base models); RF 3.51; XGBoost 3.28.
- Spike-day accuracy (MAPE test): GLM (WBGT only) 43.0%; GLM 27.1%; GAM 19.7%; RF 32.0%; XGBoost 28.5%. Total absolute percentage error on spike days (test): 48.8% (GLM WBGT), 30.5% (GLM), 21.9% (GAM), 37.2% (RF), 31.9% (XGBoost).
- Under-sampling XGBoost vs hybrid: Under-sampling XGBoost had better spike prediction (test MAPE 13.38%, total abs error 6.94%) but worse overall RMSE (3.48). The hybrid model (GAM + under-sampling XGBoost) achieved test RMSE 2.97, spike-day MAPE 14.8%, and total abs error 14.2%.
- City-specific GAMs performed worse overall than a single pooled GAM (test RMSE 5.25 consolidated vs 2.47 pooled); no serious heterogeneity detected in the pooled hybrid.
- SHAP for all cases (hybrid): predictors increasing counts included high temperature, small 12 h vs prior 24 h max-temperature difference (consecutive hot days), high solar radiation, larger total population, and larger elderly population. Predictors associated with lower counts included low relative humidity, higher male-to-female ratio, and higher mean annual taxable income.

Hospital admission and death cases (per city per 12 h):
- Test RMSE: GLM (WBGT only) 1.14; GLM (multivariable) 0.92; GAM (multivariable) 0.83 (best); RF 1.09; XGBoost 1.08.
- Spike-day accuracy (test): MAPE 10.6% and total abs error 7.5% for GAM (training: 23.3% and 11.7%). Under-sampling/bagging not needed.
- City-specific GAMs performed worse than pooled GAM (test RMSE 1.42 vs 0.83).
- SHAP for severe/death: higher counts with high temperature, small 12 h vs prior 24 h max-temperature difference, high solar radiation, and larger elderly population; lower counts with low relative humidity, higher male-to-female ratio, and higher mean annual taxable income.
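To put the reported test RMSE values on a common scale, the short sketch below computes each model's relative error reduction against the WBGT-only GLM baseline. The RMSE inputs are the figures listed above; the percentages are derived here by simple arithmetic and are not reported in the summary, and the helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(baseline, model):
    """Fractional reduction in error relative to a baseline."""
    return (baseline - model) / baseline


# Test-set RMSE per city per 12 h, as listed in the findings above.
all_cases = {
    "GLM (multivariable)": 2.92,
    "GAM (multivariable)": 2.47,
    "RF": 3.51,
    "XGBoost": 3.28,
    "Hybrid": 2.97,
}
severe_cases = {
    "GLM (multivariable)": 0.92,
    "GAM (multivariable)": 0.83,
    "RF": 1.09,
    "XGBoost": 1.08,
}

# WBGT-only GLM baselines for the two outcomes.
BASELINE_ALL = 3.73
BASELINE_SEVERE = 1.14

reduction_all = {
    name: relative_reduction(BASELINE_ALL, value)
    for name, value in all_cases.items()
}
reduction_severe = {
    name: relative_reduction(BASELINE_SEVERE, value)
    for name, value in severe_cases.items()
}

for name, frac in reduction_all.items():
    print(f"All cases, {name}: {frac:.1%} lower RMSE than WBGT-only GLM")
for name, frac in reduction_severe.items():
    print(f"Admission/death, {name}: {frac:.1%} lower RMSE than WBGT-only GLM")
```

On these figures the pooled GAM lowers test RMSE by roughly a third for all cases and by roughly a quarter for hospital admission and death cases, relative to the WBGT-only baseline.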

Discussion

The models demonstrate that incorporating multiple weather variables, calendar effects, and city demographics substantially improves prediction of heatstroke incidence over WBGT-only models. GAMs with nonlinear terms captured overall dynamics effectively, while a hybrid approach with under-sampling XGBoost successfully targeted spike days for all cases, a key requirement for public health alerting and resource planning. For hospital admission and death cases, the GAM alone provided strong spike-day accuracy, likely due to smaller variability in maximum daily counts compared with all cases. Differences in performance between the training years and 2018 reflect the unusually hot 2018 summer, with a higher proportion of WBGT >31 °C days and higher case incidence; despite this, the models generalized well to 2018. Compared with prior daily-unit models that did not handle severity or spike prediction, the 12-hour resolution and severity stratification here offer operational advantages for issuing alerts and allocating medical resources. SHAP analyses highlighted known drivers (temperature, consecutive hot days, solar radiation) and demographic factors (population, elderly) influencing risk; socioeconomic indicators (income) and sex ratio showed associations in the multivariable context, though causal interpretation is not the aim. Overall, the findings address the research question by delivering accurate, operationally relevant predictions and clarifying key predictive factors.

Conclusion

This study developed and validated 12-hour, city-level prediction models for all heatstroke cases and for hospital admission/death cases in 16 Japanese cities using multi-source weather, calendar, and demographic data. Multivariable machine learning markedly outperformed WBGT-only baselines. A hybrid model (GAM plus under-sampling XGBoost) effectively identified spike days for all cases, while GAM alone performed best for hospital admission/death spikes. These models are suitable for implementation in public health alert systems and for optimizing medical resource allocation. Future research should prospectively evaluate the impact of model-informed alerts on clinical outcomes and explore adaptation to other regions, years, and differing heatstroke definitions.

Limitations

- Geographic scope: data from 16 cities in western Japan may limit generalizability; model parameters may need adjustment elsewhere.
- Dependence on weather forecast/data accuracy: predictive performance for future periods hinges on meteorological input quality.
- Patient-level characteristics (e.g., age, sex, indoor/outdoor location) were not modeled as outcomes were aggregate counts by severity.
- Cross-country generalizability is limited by differing heatstroke definitions; models require updating for other health systems and definitions.
