Short-term local predictions of COVID-19 in the United Kingdom using dynamic supervised machine learning algorithms



X. Wang, Y. Dong, et al.

This study, by Xin Wang, Yijia Dong, William David Thompson, Harish Nair, and You Li, develops dynamic supervised machine-learning algorithms that use digital metrics (symptom searches, mobility, and vaccination coverage) to predict local-level COVID-19 growth rates in the UK. With real-time visualizations available through the COVIDPredLTLA web application, the work demonstrates the potential of data-driven decision-making in public health.

Introduction
The COVID-19 pandemic has imposed substantial health and economic burdens in the UK and worldwide. Given the continued transmission of SARS-CoV-2, accurate short-term local forecasts are critical for timely public health decision-making and healthcare planning. Population-aggregated digital data streams, such as Google mobility metrics and symptom search trends, can capture near-real-time changes in population behavior without identifying individuals and have been widely used in infectious disease modeling. Prior work indicates certain mobility categories (retail and recreation, workplaces, transit stations) are key drivers of SARS-CoV-2 spread in the UK, and symptom search trends can serve as early indicators of infections. However, earlier approaches often relied on fixed predictor sets and preceded widespread vaccination and the emergence of variants of concern. Symptom profiles and transmission dynamics have evolved (e.g., with vaccination and new variants), necessitating flexible, dynamically updating models. This study aims to develop and validate dynamic supervised machine-learning algorithms that integrate symptom searches, mobility, and vaccination coverage to predict short-term changes in COVID-19 case growth at fine geographic resolution, the lower-tier local authority (LTLA) level, across the UK.
Literature Review
The study builds on a growing literature using digital traces for epidemic prediction. Population mobility data from platforms such as Google, Apple, and Facebook have been leveraged to understand contact patterns and their association with COVID-19 transmission. Previous work in the UK identified specific mobility categories (retail/recreation, workplaces, transit stations) as important correlates of SARS-CoV-2 transmission. Internet search data have historically anticipated influenza trends and other infectious diseases, and during COVID-19, symptom-related search volumes have correlated with incidence and been proposed for early warning. An early warning framework integrating multiple digital data sources improved prediction of COVID-19 activity in the US, though it used fixed predictors and predated variants and mass vaccination. Observational evidence (e.g., ZOE COVID study) suggests the spectrum of reported symptoms has shifted over time, potentially due to vaccination and variant emergence. Vaccination campaigns have reduced transmission and disease burden, implying vaccination coverage is a relevant predictor. These findings motivate a dynamic, data-driven approach that can adapt predictor sets over time to reflect evolving epidemiology.
Methodology
Design and objectives: The authors developed dynamic supervised machine-learning models (a log-linear regression framework) to predict 1-, 2-, and 3-week-ahead COVID-19 growth rates at the LTLA level in the UK. Predictions were defined for case counts both by publication date and by specimen collection date; the primary focus was publication date because of its more reliable real-time reporting.

Data sources and preprocessing: The models integrated weekly aggregated series from (i) the Google Search Trends symptoms dataset (relative symptom search popularity versus a 2017–2019 baseline), (ii) Google Community Mobility Reports (six categories: workplaces, residential, parks, retail and recreation, grocery and pharmacy, transit stations), (iii) COVID-19 vaccination coverage (cumulative first- and second-dose percentages), and (iv) confirmed COVID-19 cases (by publication date and by specimen date).

Time frame and handling: Weeks ran from 1 June 2020 (week 1) to 14 November 2021; only data versions available in real time were used when fitting models. Symptom searches were available at the upper-tier local authority level and were extrapolated to LTLAs. Weekly averages were computed for mobility metrics, and weekly maxima of cumulative vaccination coverage were used. Missing values for symptoms, mobility, and vaccination were imputed via linear interpolation within each LTLA; 13 LTLAs with insufficient data for interpolation were excluded, leaving 367 LTLAs.

Predictors: The base set comprised eight COVID-19-relevant symptom search terms (cough, fever, fatigue, diarrhoea, vomiting, shortness of breath, confusion, chest pain). Dysgeusia and anosmia were not base predictors in the main models owing to incomplete coverage; they were considered in sensitivity analyses along with headache, nasal congestion, and sore throat. The six mobility categories and vaccination coverage (first and second doses) were included, together with LTLA fixed effects (dummy variables).
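As an illustration of the preprocessing described above, the sketch below shows per-LTLA linear interpolation of a missing weekly value and a weekly growth-rate calculation. The study was implemented in R; this pandas version is an illustrative assumption, as are the column names, the toy values, and the log-ratio definition of the weekly growth rate (a common choice, not necessarily the authors' exact formula).

```python
import numpy as np
import pandas as pd

# Hypothetical weekly case counts for one LTLA; values are illustrative only.
cases = pd.DataFrame({
    "ltla": ["E06000001"] * 5,
    "week": [1, 2, 3, 4, 5],
    "cases": [100.0, 150.0, 120.0, np.nan, 200.0],
})

# Impute missing values by linear interpolation within each LTLA,
# mirroring the handling of symptom, mobility, and vaccination series.
cases["cases"] = cases.groupby("ltla")["cases"].transform(
    lambda s: s.interpolate(method="linear")
)

# Weekly growth rate as the log-ratio of consecutive weekly counts,
# the kind of outcome a log-linear model would predict.
cases["growth"] = cases.groupby("ltla")["cases"].transform(
    lambda s: np.log(s / s.shift(1))
)
print(cases)
```

Grouping before interpolating matters: it prevents values from one LTLA's series from leaking into another's gaps when many areas are stacked in one table.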
Dynamic model selection algorithm: Starting from a baseline log-linear model with LTLA fixed effects, six mobility metrics, two vaccination variables, and the eight base symptoms, the algorithm iteratively selected:
- optimal lags (0–3 weeks) for each predictor block (mobility, vaccination, symptoms), evaluating all 64 lag combinations;
- additional symptoms (beyond the base eight) via forward selection from the pool of 173 Google symptom terms considered;
- predictor block combinations (including tests to drop entire datasets) to maximise predictive accuracy.

Model selection criterion: For each candidate model, a retrospective 4-week mean squared error (MSE) was computed by averaging one-step-ahead prediction MSEs over the previous four weeks; the model with the minimum 4-week MSE advanced at each step. Naïve models (carrying forward the last observed growth rate, with a possible 0–3-week lag) were included as candidates.

Prospective evaluation and checkpoints: After selecting the optimal model at a time point, prospective performance was assessed by forecasting the 1-, 2-, and 3-week-ahead growth rates and computing the prospective MSE. Eight prespecified checkpoints, five weeks apart, were used: week 1/40 (1 Mar 2021), 1/45 (5 Apr), 1/50 (10 May), 2/3 (14 Jun), 2/8 (19 Jul), 2/13 (30 Aug), 2/18 (4 Oct), and 2/23 (14 Nov).

Reference models: (1) a naïve model with the optimal lag (chosen by retrospective 4-week MSE), and (2) a fixed-predictors model that froze the optimal predictor set from the first checkpoint (1/40) and re-estimated coefficients at later checkpoints.

Sensitivity analyses: The base symptom set was expanded to include dysgeusia, anosmia, headache, nasal congestion, and sore throat to test the impact on predictive accuracy.

Implementation and reporting: A web application (COVIDPredLTLA), built with R Shiny, presents nowcasts and 1–2-week forecasts by LTLA, updating twice weekly. Analyses were conducted in R 4.1.1 following the STROBE and EPIFORGE guidelines.
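The lag-selection step above can be sketched as follows: score each of the 64 lag combinations by its retrospective 4-week one-step-ahead MSE and keep the minimiser. The study used R; this Python fragment is a minimal sketch in which `fit_and_predict` is a hypothetical stand-in for fitting the log-linear model on data up to a given week and returning a one-step-ahead growth-rate prediction.

```python
import numpy as np
from itertools import product

def retrospective_4wk_mse(fit_and_predict, data, current_week, lags):
    """Average the squared one-step-ahead errors over the previous four weeks."""
    errors = []
    for week in range(current_week - 4, current_week):
        pred = fit_and_predict(data, train_until=week, lags=lags)
        obs = data["growth"][week + 1]          # observed growth the following week
        errors.append((pred - obs) ** 2)
    return float(np.mean(errors))

def select_lags(fit_and_predict, data, current_week):
    """Search all 4**3 = 64 lag combinations (mobility, vaccination, symptoms)."""
    candidates = product(range(4), repeat=3)    # lags of 0-3 weeks per block
    return min(
        candidates,
        key=lambda lags: retrospective_4wk_mse(
            fit_and_predict, data, current_week, lags
        ),
    )
```

The same criterion extends naturally to the other selection steps (forward selection of symptoms, dropping predictor blocks): every candidate, including the naïve carry-forward models, is scored by the same retrospective 4-week MSE, and the minimiser advances.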
All data were publicly available aggregated datasets; ethics approval was not required.
Key Findings
- Data coverage and inputs: 367 LTLAs were included. Time series showed multiple national peaks in growth rates; mobility and symptom trends varied by category; vaccination coverage reached approximately 82% for first doses and 75% for second doses (age 12+) by week 2/23.
- Model specification: Applying time lags and adding Google symptoms beyond the base set of eight improved retrospective 4-week MSEs. Optimal lag combinations and selected symptoms varied by checkpoint and prediction horizon; commonly selected additional symptoms included headache, ear pain, otitis, and tonsillitis.
- Prospective accuracy (publication-date outcomes): Median prospective MSEs across checkpoints were 0.12 (IQR 0.08–0.22) for 1-week-ahead, 0.29 (0.19–0.38) for 2-week-ahead, and 0.37 (0.25–0.47) for 3-week-ahead predictions, with accuracy decreasing as the horizon lengthened.
- Geographic variation: Prospective MSEs tended to be smaller in central England and larger in Scottish LTLAs and south-west England, with patterns varying across checkpoints.
- Comparison to reference models: Versus naïve models, the optimal models reduced prospective MSE by 21–35% across horizons, including during the rapid growth around week 2/3 (Delta spread). Versus the fixed-predictors model, the advantages of the dynamic approach became evident at later checkpoints; at week 2/23, the dynamic optimal models reduced MSE by 28–58% across the three horizons.
- Retrospective vs. prospective: Prospective MSEs were inflated by 31–73% relative to retrospective 4-week MSEs, but both declined over time and their differences narrowed after the first 3–4 checkpoints.
- Sensitivity analyses: Including dysgeusia, anosmia, and additional COVID-19-related symptoms did not meaningfully improve prospective predictive accuracy.
Discussion
The study demonstrates that a dynamic supervised machine-learning framework that updates predictor sets and lags using current data can improve short-term prediction of local COVID-19 case growth rates. By integrating mobility, symptom search trends, and vaccination coverage, the models consistently outperformed naïve persistence baselines and, over time, a fixed-predictors approach. Importantly, the models maintained improved accuracy during periods of rapid epidemiological change (e.g., Delta-driven growth), indicating utility for early detection of surges and for informing timely public health responses at fine geographic scales (LTLA level). The adaptive selection of symptom predictors enables responsiveness to evolving symptomatology associated with vaccination and emerging variants. As expected, predictive accuracy diminishes with longer horizons due to the growing influence of the most recent, not-yet-observed behaviors; nonetheless, the framework provides actionable near-term forecasts and nowcasts that can guide local interventions and healthcare planning.
Conclusion
A dynamic, data-driven modelling framework integrating Google mobility metrics, symptom search trends, and vaccination coverage can predict short-term changes in local COVID-19 case growth across the UK. The approach outperformed naïve and fixed-predictor reference models, including during rapid growth associated with the Delta variant, and is operationalized via a publicly accessible web application (COVIDPredLTLA) that delivers regularly updated local forecasts. Continued monitoring and updating can support decision-making for control measures and healthcare capacity planning during ongoing and future epidemic waves, with adaptability to evolving symptom profiles and epidemiological conditions.
Limitations
- Real-time data delays and reporting issues: Publication date was chosen as the primary outcome because real-time reporting by specimen date was incomplete; mobility and symptom search data lag cases by 4–5 days.
- Testing practices: Changes over time may have obscured true incidence trends, and national differences in test inclusion (e.g., PCR vs. lateral flow confirmations) complicate comparability.
- Potential demographic biases: Older adults and other subgroups may be underrepresented in Google mobility and search data owing to differential internet access.
- Symptom search idiosyncrasies: Early spikes in fever searches may reflect heightened concern; later peaks were lower, potentially due to vaccination or changing public behavior.
- Predictor availability: Dysgeusia and anosmia were excluded from the main models due to incomplete coverage; including them where available did not substantially change MSEs.
- Uncertainty quantification: Intervals in the web application primarily reflect parameter uncertainty and may not capture variability in the input data streams.
- Unmeasured factors: Non-pharmaceutical interventions (e.g., mask use, hand hygiene) and climatic influences were not explicitly modeled, and the lack of variant-proportion data at the LTLA level prevented variant-specific performance assessment.