
Nova Scotia COVID-19 Health
D. L. Silver and R. Digamarthi
This report by Daniel L. Silver and Rinda Digamarthi examines whether machine learning models can predict daily COVID-19 case counts from weather and mobility data. Using a daily dataset covering four Ontario counties (Toronto, Peel, Durham, York), the study reports that LSTM models achieved the best predictions among the approaches compared.
Introduction
This report is part of a joint Saint Mary’s–Dalhousie–Acadia project, funded by the Nova Scotia COVID-19 Health Research Coalition, to study how environmental determinants and social mobility relate to COVID-19 transmission and to build predictive models of daily case counts. Due to delays in accessing Nova Scotia’s 811 triage and testing data, the team used Ontario county-level data (Toronto, Peel, Durham, York) combined with weather and mobility datasets. The research questions were: (1) Can machine learning models accurately predict daily COVID-19 case counts using recent demographic, temporal, environmental, and mobility variables? (2) Which factors most influence those predictions, and what do they imply for disease control? The study aimed for a forecasting performance threshold of 15% MAPE, motivated by the need for timely, accurate projections for public health decision-making.
Literature Review
The study situates itself within several strands of literature: (a) Time-series and spatiotemporal machine learning for forecasting using decision trees, CNNs, and RNN/LSTM architectures, with prior applications in traffic, video, language, and sales prediction. (b) Epidemiological evidence on environmental influences, particularly temperature and relative humidity, on respiratory virus transmission and aerosol dynamics. Indoor environmental standards (ASHRAE) and guidance suggest an optimal indoor RH band (~30–60%) to reduce airborne transmission risk. Studies reported that low indoor RH facilitates longer aerosol suspension, increases virus survivability on surfaces, and may adversely affect host mucosal defenses. (c) Mobility and non-pharmaceutical interventions: mobility indices from large platforms (Apple, Google, Facebook) have been linked to transmission, but causality is confounded by policy changes and behavioral responses. The report references work showing seasonality, humidity’s role (including studies from Buenos Aires and international cohorts), and the complex cause–effect chain between case counts, public health restrictions, and mobility.
Methodology
Process: Adopted a standard data analytics lifecycle (Business Understanding, Data Understanding/Collection, Data Preparation, Model Development, Model Evaluation, Iteration), akin to CRISP-DM, executed iteratively due to evolving data availability.

Data sources and scope:
- COVID-19 case counts, demographics (age, sex), and calendar variables (day of week, day of year) from Ontario’s open data portal.
- Weather from Environment and Climate Change Canada via weatherstats.ca (temperature, RH, visibility, wind, pressure, precipitation, dew point, indices).
- Mobility from Facebook “Data for Good” movement range maps (movement relative to baseline, proportion staying put).
- Public health restriction stages compiled from Ontario’s COVID-19 site (five levels from Normal to Lockdown).
- Indoor relative humidity (IRH) from Mississauga-area measurements obtained in May 2021.

Study region/time: Four Ontario counties (Toronto, Peel, Durham, York) representing ~40% of Ontario’s population, with daily records from March 1 to Dec 31, 2020 (pre-vaccine period), extended into early 2021 for data continuity. Final 2020 dataset: 306 daily records.

Variable selection and lags: Initial broad set across five categories (case dynamics, mobility/policy, demographics, time, indoor/outdoor environment). Pearson correlations were computed across lags up to 14 days; variables with |r| below roughly 0.1 or high missingness (>50%) were dropped. Key retained predictors (16 total) included avg_temperature (outdoor), IRH, avg_visibility, DOW, precipitation, avg_health_index, DOY, Mobility; autoregressive (AR) terms for case counts were considered in some models.

Data preparation:
- Aggregated daily by county group; handled missing weather values; removed variables with >50% missingness (solar radiation, cloud cover, snow_on_ground, max_humidex, min_windchill, etc.).
- Stationarity/normalization as needed for neural networks.
- For IDTs/CNNs, created fixed windows of 7 or 14 prior days across 16 variables (yielding 112 or 224 columns) to predict D+1 (or D+1..D+7).
- For LSTMs, structured as 3D tensors (samples × timesteps × variables) with 14-day sequences.

Modeling:
- Inductive Decision Trees (IDT; WEKA M5P) with linear models at leaves; hyperparameter M=4.
- CNN: 1 convolutional block (stride 1) + max pooling (filter size 2, stride 1), ReLU activations, linear output; Adam optimizer (lr=5e-5), MAE loss, 1000 epochs, batch size 1.
- LSTM: Single-task (STL) D+1 model with two LSTM layers (64 units each), dense layer (8 ReLU), linear output; multi-task (MTL) model for D+1..D+7 with two LSTM layers (448, 384 units) and dense layers (128, 64) feeding 7 linear outputs.
- Evaluation metrics: MAE and MAPE; persistence baseline (ŷ_{t+1} = y_t); repeated runs with different random initializations; hypothesis tests as needed.
- Chronological 5-fold cross-validation for LSTMs (train/val windows advancing through 2020), with early stopping on validation loss.

Hardware/software: NVIDIA GTX 1660 Ti (6 GB), Intel i7, 16 GB RAM; Python 3.7, TensorFlow 2.4/Keras 2.3.1, scikit-learn; CUDA/cuDNN.
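The windowing described under data preparation can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `make_windows` and the random stand-in data are assumptions, and only the shapes (306 days, 16 variables, 7- or 14-day windows) come from the report.

```python
import numpy as np

def make_windows(features, target, window=14):
    """Build flat and sequence inputs from a (days x variables) array.

    Row i uses days i .. i+window-1 to predict the case count on
    day i+window (the D+1 target). `make_windows` is a hypothetical helper.
    """
    features = np.asarray(features, dtype=float)
    target = np.asarray(target, dtype=float)
    n_days, n_vars = features.shape
    n_samples = n_days - window
    if n_samples <= 0:
        raise ValueError("need more days than the window length")
    # 3D tensor for LSTMs: samples x timesteps x variables
    X_seq = np.stack([features[i:i + window] for i in range(n_samples)])
    # Flat table for tree/CNN input: one column per (day, variable) pair
    X_flat = X_seq.reshape(n_samples, window * n_vars)
    y = target[window:]
    return X_flat, X_seq, y

# Random stand-in for the 306-day, 16-variable dataset (not real data)
rng = np.random.default_rng(0)
features = rng.normal(size=(306, 16))
cases = rng.poisson(lam=200, size=306)

X_flat, X_seq, y = make_windows(features, cases, window=14)
print(X_flat.shape, X_seq.shape, y.shape)  # (292, 224) (292, 14, 16) (292,)
```

With `window=7` the flat table has 7 × 16 = 112 columns, matching the column counts quoted above.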
Key Findings
Correlations and lags:
- Outdoor average temperature had the strongest negative correlation with case counts: r = −0.468 at a 10-day lag.
- Indoor relative humidity (IRH): r = −0.301 at a 6-day lag; outdoor RH showed little direct correlation.
- Visibility: r = −0.248 at a 14-day lag; Health Index: r ≈ −0.164 at a 6-day lag; Precipitation: r ≈ −0.066 at a 12-day lag.
- Mobility (movement relative to baseline) had its strongest negative correlation at zero lag: r = −0.289.
- Restrictions vs Mobility correlation: r ≈ 0.90 at zero lag, decreasing with lag, suggesting a chain from case counts → restrictions → mobility.
- Significant predictors generally acted within a 14-day window (avg lag ~10.5 days).

Demographic/temporal patterns:
- Early 2020 (winter/spring): cases concentrated among ≥70-year-olds; fall 2020: strongest rise among 20–39-year-olds; 40–69 affected in both waves.
- Weekend/weekday effects: higher recorded cases on Sundays/Mondays; highest mobility on weekends (Saturday/Sunday).

Modeling performance (selected):
- Persistence baseline (common test periods): MAE ~74–88; MAPE ~14–17.6%.
- IDT (7-day window, with AR): MAE 51.32; MAPE 10.90%. Without AR: MAE 61.48; MAPE 13.02%.
- IDT (14-day window, with AR): MAE 42.65; MAPE 9.30%. Without AR: MAE 53.23; MAPE 11.28%. Models without DOY performed poorly even with Mobility, indicating DOY’s importance; DOY-only models outperformed Mobility-only models.
- CNN (14-day window): MAE 98.9; MAPE 11.1% (better than persistence; worse than best IDT/LSTM).
- LSTM STL (D+1, 14-day window): MAE 84.6; MAPE 10.75% on fixed train/val/test split.
- LSTM MTL (D+1..D+7, 14-day window): average over 7 days MAE 75.12; MAPE 10.24%; for D+1: MAE 57.10; MAPE 8.7%.
- LSTM STL (D+1), 5-fold chronological CV (Oct 15–Dec 24 test windows): fold MAEs 24.89–64.38; MAPEs 5.80–12.62%; average MAE 43.51; MAPE 8.76% (better than IDT avg MAPE 10.48% and persistence 14.02%).
- LSTM STL (7-day average D+1), 5-fold CV: average MAE 8.47; MAPE 1.87% (persistence MAPE 1.74% on same target due to smoothing).

Summary model claim: Best Daily D+1 LSTM achieved MAE 44.70; MAPE 9.23% with 95% CI 2.03%. Best 7-day average model: MAE 8.47; MAPE 1.87% with 95% CI 0.98%.

Interpretability/feature roles:
- Decision trees emphasized IRH, DOY, Mobility, and outdoor temperature. Low IRH (<~30–40%) and colder periods were associated with higher future case counts, consistent with aerosol physics and indoor behavior (less ventilation, heating).
- Mobility alone can confound modeling unless DOY/seasonality is included; DOY captures broader seasonal/behavioral shifts.
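The lagged correlations reported above can be reproduced in spirit with a short scan over lags. The sketch below is an illustration under stated assumptions: `lagged_pearson` is a hypothetical helper, and the series are synthetic, constructed so that the predictor leads the target by 10 days. It does not use the study's data.

```python
import numpy as np

def lagged_pearson(x, y, max_lag=14):
    """Pearson r between x[t - lag] and y[t] for lag = 0..max_lag."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    results = {}
    for lag in range(max_lag + 1):
        # Pair the predictor on day t - lag with the target on day t
        x_past = x[:len(x) - lag]
        y_now = y[lag:]
        results[lag] = float(np.corrcoef(x_past, y_now)[0, 1])
    return results

# Synthetic series: the target falls when the predictor 10 days earlier was high
rng = np.random.default_rng(1)
temperature = rng.normal(size=306)
cases = np.empty(306)
cases[:10] = rng.normal(size=10)
cases[10:] = -temperature[:-10] + 0.5 * rng.normal(size=296)

r_by_lag = lagged_pearson(temperature, cases, max_lag=14)
best_lag = max(r_by_lag, key=lambda k: abs(r_by_lag[k]))
print(best_lag, round(r_by_lag[best_lag], 3))
```

On this synthetic input the scan recovers the 10-day lag with a strongly negative r. A screening rule like the one in the methodology would then keep only variables whose best |r| is at least about 0.1.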
Discussion
The study shows that accurate short-term forecasts of COVID-19 daily cases are feasible using recent sequences of environmental, mobility, demographic, and temporal variables, meeting or surpassing the 15% MAPE target. LSTM models with chronological cross-validation generalized best, reflecting their capacity to learn temporal dependencies without manual window engineering. The analysis clarifies that environmental factors—especially outdoor temperature and indoor relative humidity—have significant lagged associations with case counts within plausible incubation/testing windows. This supports the hypothesis that indoor conditions during colder months facilitate transmission via aerosol persistence and reduced ventilation, aligning with building science and epidemiological literature. Mobility’s relationship with transmission appears mediated by public health restrictions: case surges precede stricter restrictions, which then reduce mobility; thus, naïvely attributing causality from mobility to cases can be misleading without accounting for policy timing and seasonality. Demographic shifts across waves suggest that seasonal environment plus targeted protections (or lack thereof) for vulnerable groups influenced which age cohorts were most affected. Overall, the findings indicate that incorporating seasonality (DOY), IRH, and temperature improves forecasts and provides actionable insights: raising indoor RH within recommended bands and enacting timely, targeted restrictions can modulate transmission risk during high-risk seasons.
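For readers who want to see how the 15% MAPE target and the persistence baseline are computed, here is a minimal sketch. The function names and the six case counts are made up for illustration and are not taken from the study.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))

def mape(y_true, y_pred):
    """Mean absolute percentage error; assumes no zero counts in y_true."""
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0)

def persistence_forecast(series):
    """Predict tomorrow's value as today's value; returns (actual, predicted)."""
    series = np.asarray(series, dtype=float)
    return series[1:], series[:-1]

# Made-up daily counts, for illustration only
cases = np.array([400, 440, 420, 500, 480, 520], dtype=float)
actual, predicted = persistence_forecast(cases)

baseline_mae = mae(actual, predicted)
baseline_mape = mape(actual, predicted)
print(baseline_mae, round(baseline_mape, 2))  # 40.0 8.34
print("meets 15% target:", baseline_mape < 15.0)
```

Any candidate model can be scored with the same two functions on the same test days, which is what makes the comparison against persistence meaningful.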
Conclusion
This project contributed: (1) a curated multi-source daily dataset for four major Ontario counties across 2020; (2) a comparative evaluation of IDT, CNN, and LSTM models for D+1 and D+1..D+7 forecasting with chronological cross-validation; (3) identification of key predictors and lags, notably outdoor temperature and indoor relative humidity, alongside DOY and mobility; and (4) practical recommendations for public health and building operations. The best-performing models achieved approximately 9% MAPE for daily forecasts and under 2% for 7-day average forecasts. On daily forecasts they outperformed persistence and matched or exceeded tree-based baselines; on the smoothed 7-day average target, persistence was comparable (1.74% MAPE versus 1.87%). Recommendations include: communicating the role of indoor environment in winter; maintaining indoor RH ~30–50% in high-mixing public spaces; prioritizing protective measures for seniors early in outbreaks; and focusing modeling on smoothed targets (7-day averages) and multi-task outputs for improved stability. Future work: apply the framework to Nova Scotia data when available; broaden indoor environment measurements beyond a single region; integrate explicit policy-timing variables; explore causal inference to disentangle policy–mobility–transmission pathways; and examine ensemble and probabilistic forecasts with calibrated uncertainty.
Limitations
- Data substitution: Nova Scotia 811 and testing data were unavailable during the main analysis; Ontario data were used instead, limiting direct generalizability to Nova Scotia.
- Indoor RH measurements were sourced from the Mississauga area and may not represent all buildings or microclimates; IRH coverage was limited.
- Some environmental variables had substantial missingness and were excluded; quality of weather station coverage may affect estimates.
- Potential confounders (testing capacity changes, reporting delays, localized outbreaks) may influence case counts despite lag modeling and smoothing.
- Model interpretability varies; neural networks provide limited direct causal insight.
- Sample size and timeframe (pre-vaccine 2020) may not capture dynamics under vaccination, variants, or changing public behavior.
- Mobility metrics reflect aggregated device movement and may not capture all relevant contact patterns; policy compliance heterogeneity is unobserved.