
Medicine and Health
Real-time tracking and prediction of COVID-19 infection using digital proxies of population mobility and mixing
K. Leung, J. T. Wu, et al.
This study by Kathy Leung, Joseph T. Wu, and Gabriel M. Leung introduces a framework that combines digital proxies of human mobility with established epidemic models. The integration allows near real-time tracking of COVID-19 transmissibility and supports nowcasts and short-term forecasts that can inform intervention strategies.
Introduction
Timely monitoring of COVID-19 transmissibility is hampered by an inherent delay of about 9 days between infection and case reporting (incubation ~6 days, onset-to-diagnosis ~3 days, plus reporting delay). A substantial fraction of transmission is pre-symptomatic, and COVID-19 exhibits clustering and superspreading, so rapid assessment is crucial. The study addresses the need for real-time analytics by integrating digital proxies of human mobility and mixing into epidemic models, which allows the effective reproduction number (Rt) to be estimated without waiting for case reports and epidemic dynamics to be nowcast and forecast over the short term. Hong Kong’s epidemic and interventions provide the case study, with imported and local transmission considered separately because transmissibility differs under travel-related NPIs.
Literature Review
Prior work demonstrates the utility of digital mobility and mixing data to monitor interventions and transmission (e.g., mobile and transaction data, aggregated mobility indices). Conventional social contact surveys inform contact matrices but are challenging to update in real time, particularly during fast-changing epidemics. Methods for estimating time-varying Rt (e.g., the EpiEstim/Thompson framework) and deconvolution approaches have been established. Studies highlight the role of changing contact patterns, the impact of NPIs, and the contribution of superspreading. The authors note that age-stratified, category-specific digital proxies (e.g., transport transactions) correlate more strongly with transmissibility than broader mobility indicators (e.g., CityMapper or Google reports), motivating their selection.
Methodology
Setting and data: The Centre for Health Protection (CHP) in Hong Kong provided daily confirmed COVID-19 cases by date of symptom onset. Cases were categorized as imported, local, possibly local, and epidemiologically linked to imported or local cases. Given strong travel NPIs, transmissibility for imported vs local cases was estimated separately.
Digital proxies: Octopus card transactions are near-ubiquitous in Hong Kong (used by ~99% of the population aged 16–65; >14 million daily transactions). Daily numbers of transactions stratified by card type (child 3–11, student <26, adult <65, elder ≥65) and category (transport, retail) were obtained and normalized so that the Jan 1–15, 2020 baseline level equals 1 (i.e., 100%).
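The baseline normalization can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the use of the baseline-window mean as the reference level, and the transaction counts are all assumptions made for the example.

```python
import numpy as np

def normalize_to_baseline(daily_counts, baseline_slice):
    """Scale a daily series so that its mean over the baseline window is 1.0 (100%).

    Using the window mean as the reference level is an assumption of this sketch.
    """
    counts = np.asarray(daily_counts, dtype=float)
    baseline_mean = counts[baseline_slice].mean()
    if baseline_mean <= 0:
        raise ValueError("baseline mean must be positive")
    return counts / baseline_mean

# Hypothetical adult transport transactions: days 0-14 stand in for Jan 1-15, 2020,
# followed by three days of reduced travel.
adult_transport = np.array([100.0] * 15 + [80.0, 60.0, 50.0])
g_adult_tran = normalize_to_baseline(adult_transport, slice(0, 15))
```

Each normalized series then plays the role of one proxy g_{a,c}(t), with values below 1 indicating activity below the early-January baseline.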
Analytic framework comprised four steps:
1) Reconstruct infection curves and estimate empirical Rt: Epidemic curves by date of infection were reconstructed from onset curves via deconvolution using a Gamma incubation distribution (mean 6.5 days, SD 2.6). Imported and local-related curves were separated. Instantaneous Rt was estimated using EpiEstim with a 7-day window and Gamma generation time (base mean 5.2 days, CV 0.33; sensitivity with means 4.2 and 6.2). The resulting Rt for local cases is termed empirical Rt.
2) Select digital proxies: Pearson correlations were computed between empirical Rt (posterior means) and each normalized Octopus proxy g_{a,c}(t). Transport transactions across age groups had r ≥ 0.5; retail proxies showed low correlation except adult fast-food retail. Hence, only age-specific transport transactions were retained as proxies for population mixing.
3) Age-structured SIR model parameterized by proxies: Population age groups were 0–11, 12–18, 19–64, ≥65 years. The contact/mixing between age groups a and b outside households at time t was modeled as β_ab(t) = γ_a γ_b g_{a,tran}(t) g_{b,tran}(t), where γ_a are scaling factors inferred from data. An age-structured SIR with time-since-infection formulation tracked S_a(t), I_a(t,r), R_a(t). The force of infection η_a(t) depended on β_ab(t) and susceptible proportions. The time-varying next-generation matrix NGM(t) yielded Rt as its dominant eigenvalue. The model produced infection incidence and, via reporting proportion P_report, onset incidence. The epidemic was seeded by M local infections on Jan 22, 2020.
Inference: Parameters θ = {M, γ_a (for each age group), P_report} were estimated by fitting to daily onset counts (Jan 22–May 31, 2020) using a Poisson likelihood and Bayesian MCMC with flat priors. Posterior P(θ) was obtained for each fit date. Sensitivity analyses varied generation time and added household contacts using a pre-pandemic household contact matrix weighted by a parameter ρ.
4) Nowcasting and forecasting: On each day t, 5000 posterior samples of θ were used to parameterize simulations that nowcast cases on day t (accounting for the infection-to-reporting delay) and forecast days t+1 to t+6 under assumptions about future mixing (typically that the proxies stay at their current level). Predictive performance was evaluated by sharpness, bias, ranked probability score (RPS), Dawid–Sebastiani score (DSS), and absolute error (AE).
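The link between the transport proxies and Rt in step 3 can be illustrated with a small sketch. It builds β_ab(t) = γ_a γ_b g_{a,tran}(t) g_{b,tran}(t) for one day and takes the dominant eigenvalue of a next-generation matrix. The NGM form used here (β_ab weighted by the susceptible fraction of the receiving group) and all numeric values are assumptions for illustration; the summary does not reproduce the paper's exact expression.

```python
import numpy as np

def mixing_matrix(gamma, g_tran):
    """beta_ab(t) = gamma_a * gamma_b * g_a(t) * g_b(t) for a single day t."""
    scaled = np.asarray(gamma, dtype=float) * np.asarray(g_tran, dtype=float)
    return np.outer(scaled, scaled)

def reproduction_number(gamma, g_tran, susceptible_fraction):
    """Dominant eigenvalue of an assumed next-generation matrix K_ab = s_a * beta_ab."""
    beta = mixing_matrix(gamma, g_tran)
    s = np.asarray(susceptible_fraction, dtype=float)
    ngm = s[:, None] * beta  # row a is scaled by the susceptible fraction of group a
    return float(np.max(np.abs(np.linalg.eigvals(ngm))))

# Hypothetical scaling factors for the four age groups (0-11, 12-18, 19-64, >=65).
gamma = [0.5, 0.8, 1.2, 0.7]
fully_susceptible = [1.0, 1.0, 1.0, 1.0]

rt_baseline = reproduction_number(gamma, [1.0, 1.0, 1.0, 1.0], fully_susceptible)
rt_half_mobility = reproduction_number(gamma, [0.5, 0.5, 0.5, 0.5], fully_susceptible)
```

Because β_ab is an outer product in this sketch, a uniform change in all proxies changes Rt by the square of that factor, so halving every proxy reduces Rt to a quarter of its baseline value.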
Additional analyses: Simulations showed that the empirical Rt was accurate except when daily cases were <10. The sensitivity of Rt and the scaling factors to the generation time was assessed; effects on nowcast/forecast precision were minimal. Alternative mobility indicators (CityMapper, Google) were compared and showed lower correlations with empirical Rt. Incorporating household contacts did not improve the fit, suggesting that community transmission predominated during the study period.
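The empirical Rt of step 1 rests on the renewal equation, which can be sketched as below. This is a simplified stand-in for EpiEstim, not a reimplementation of it: the generation time is discretized crudely by evaluating the Gamma density at whole days, the Gamma prior on Rt (shape 1, scale 5) is an illustrative choice, and the incidence series is synthetic and noise-free.

```python
import math
import numpy as np

def discretized_gamma(mean, cv, max_days=30):
    """Crude daily discretization: Gamma density at days 1..max_days, renormalized."""
    shape = 1.0 / cv**2
    scale = mean * cv**2
    days = np.arange(1, max_days + 1, dtype=float)
    log_pdf = ((shape - 1.0) * np.log(days) - days / scale
               - math.lgamma(shape) - shape * math.log(scale))
    w = np.exp(log_pdf)
    return w / w.sum()

def total_infectiousness(incidence, w, t):
    """Lambda_t = sum_k I_{t-k} * w_k over the available history."""
    k = min(t, len(w))
    return float(np.dot(incidence[t - k:t][::-1], w[:k]))

def instantaneous_rt(incidence, w, window=7, prior_shape=1.0, prior_scale=5.0):
    """Posterior mean of Rt over a sliding window (Gamma prior, Poisson renewal model)."""
    incidence = np.asarray(incidence, dtype=float)
    n = len(incidence)
    lam = np.array([total_infectiousness(incidence, w, t) if t > 0 else 0.0
                    for t in range(n)])
    rt = np.full(n, np.nan)  # the first `window` days have no estimate
    for t in range(window, n):
        days = slice(t - window + 1, t + 1)
        post_shape = prior_shape + incidence[days].sum()
        post_rate = 1.0 / prior_scale + lam[days].sum()
        rt[t] = post_shape / post_rate
    return rt

# Synthetic check: generate incidence from the renewal equation with a known R.
w = discretized_gamma(mean=5.2, cv=0.33)
true_r = 1.5
cases = np.zeros(60)
cases[0] = 100.0
for t in range(1, 60):
    cases[t] = true_r * total_infectiousness(cases, w, t)
rt_hat = instantaneous_rt(cases, w)
```

With large counts the prior has little influence and the estimate recovers the R used to generate the data. When the windowed case total is small, the prior and random fluctuations dominate, which is consistent with the unreliability noted above for daily cases <10.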
Key Findings
- Empirical Rt trajectory in Hong Kong: ~2.5 in mid-January at onset of community transmission; fell to ~1 by late January following Wuhan lockdown and Chinese NPIs; hovered around 1 in February during work-from-home and distancing; rebounded to ~2.5 in early March with relaxation and >75,000 returnees; declined below 1 in April after renewed NPIs starting March 21.
- Digital proxy validity: Strong correlations between empirical Rt and age-specific Octopus transport transactions: r = 0.62 (children), 0.68 (students), 0.80 (adults), 0.76 (elderly). Retail transactions correlated poorly except adult fast-food retail (r = 0.71), still lower than adult transport; retail proxies were not used.
- Model performance: Rt from the fitted age-structured model parameterized by transport proxies correlated extremely well with empirical Rt (r = 0.98). Deviations in late Feb/early Mar likely due to low case counts causing oscillatory empirical Rt.
- Ascertainment: Estimated that 23% (95% CrI: 13–47%) of local infections were ascertained by surveillance.
- Stability of scaling: Posterior distributions of scaling factors γ_a were stable when refitting at multiple times (Mar 2, 14, 17, 22; Apr 4), supporting temporal robustness of proxy-parameterization.
- Nowcast/forecast accuracy: Predictive metrics (sharpness, RPS, DSS, AE) improved from Jan 30 to Feb 28 as data accrued. Forecast errors coincided with superspreading events (e.g., Feb 29 religious group cluster and Diamond Princess returnees; Mar 22–23 bar cluster), where incidence was underestimated, and with very low prevalence (Mar 15–17), where the deterministic model overestimated growth because stochastic effects were ignored. Despite these errors, nowcasts during Feb–Apr were largely robust, with tight 95% prediction intervals adequate for practical inference of cases whose onset had not yet been reported because of the delay.
- Sensitivity to generation time: While Rt estimates and γ_a values varied with generation time assumptions, nowcast/forecast accuracy and precision were largely unaffected.
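Three of the predictive metrics named above have standard sample-based definitions, sketched here for a single forecast day. The function names are hypothetical, and the Poisson draws merely stand in for 5000 posterior predictive samples; sharpness and bias are omitted because the summary does not state how they were computed.

```python
import numpy as np

def dawid_sebastiani(samples, observed):
    """DSS = ((y - mu) / sigma)^2 + 2 * log(sigma), from predictive mean and SD."""
    mu = np.mean(samples)
    sigma = np.std(samples, ddof=1)
    return float(((observed - mu) / sigma) ** 2 + 2.0 * np.log(sigma))

def absolute_error(samples, observed):
    """Absolute difference between the observation and the predictive median."""
    return float(abs(observed - np.median(samples)))

def ranked_probability_score(samples, observed):
    """RPS = sum over counts k of (F(k) - 1{k >= y})^2, with F the empirical CDF."""
    samples = np.asarray(samples)
    upper = int(max(samples.max(), observed))
    grid = np.arange(0, upper + 1)
    cdf = (samples[None, :] <= grid[:, None]).mean(axis=1)
    step = (grid >= observed).astype(float)
    return float(np.sum((cdf - step) ** 2))

# Hypothetical forecast: 5000 predictive draws of daily cases, then the observed count.
rng = np.random.default_rng(0)
predictive_samples = rng.poisson(lam=12, size=5000)
observed_cases = 15

scores = {
    "DSS": dawid_sebastiani(predictive_samples, observed_cases),
    "AE": absolute_error(predictive_samples, observed_cases),
    "RPS": ranked_probability_score(predictive_samples, observed_cases),
}
```

Lower values are better for all three scores. For a forecast concentrated on a single value, the RPS reduces to the absolute distance between that value and the observation.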
Discussion
The study demonstrates that age-specific digital mobility proxies can be integrated into mechanistic, age-structured epidemic models to obtain near real-time estimates of transmissibility and to nowcast/short-term forecast COVID-19 dynamics, overcoming the typical ~9-day reporting delay. The strong proxy–Rt correlations and the model’s high concordance with empirical Rt indicate that transport transaction volumes capture relevant changes in population mixing by age. Stable scaling factors over time support epidemiologic validity. Forecast performance was generally strong but degraded during superspreading events or at very low prevalence, highlighting the need to account for overdispersion and stochasticity. Compared with traditional contact surveys, digital proxies offer higher temporal resolution and practicality for surveillance. The framework is adaptable to other platforms (e.g., WeChat/Alipay, transit cards, Google/Facebook data), enabling jurisdictions to leverage real-time big data for timely epidemic intelligence and to evaluate intervention impacts in near real time.
Conclusion
By parameterizing an age-structured transmission model with age-stratified transport transaction data, the authors accurately tracked Rt in near real time and produced robust nowcasts and short-term forecasts of COVID-19 in Hong Kong. Key contributions include validating digital proxies of mixing, quantifying strong proxy–Rt relationships, demonstrating high agreement between model-based and empirical Rt, and estimating under-ascertainment. The approach enables rapid assessment of NPIs and behavioral changes. Future work should incorporate stochastic transmission and superspreading heterogeneity, explore integration of additional data streams (e.g., location services, wearable data), refine spatial granularity, and evaluate generalizability to other settings and diseases while ensuring privacy-preserving data access.
Limitations
- Dependence on proxy validity: Results hinge on transport transactions accurately reflecting age-specific mixing relevant to transmission; proxies with weaker correlation (e.g., retail or broader mobility indices) underperform.
- Underrepresentation of household contacts: Incorporating household contact matrices did not improve fit, possibly due to predominant community transmission; household contributions could be underestimated in other contexts.
- Superspreading and stochasticity: Framework did not explicitly model overdispersion; forecasts underestimated incidence during SSEs and overestimated during very low prevalence due to deterministic assumptions.
- Low counts: Empirical Rt estimation is less reliable when daily cases <10, causing oscillations and discrepancies.
- Parameter sensitivity: Rt estimates and scaling factors are sensitive to assumed generation time, though nowcast/forecast accuracy remained robust.
- Ascertainment assumptions: Constant reporting proportion and consistent testing strategy are assumed; changes could bias inference.
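To make the constant-reporting assumption concrete, the sketch below shows one way a single reporting proportion P_report could enter a Poisson likelihood for daily onset counts. The function and variable names are hypothetical and the numbers are invented; it illustrates the general construction, not the authors' implementation.

```python
import math
import numpy as np

def poisson_log_likelihood(observed_onsets, model_onsets, p_report):
    """Log-likelihood of reported onsets given model incidence and one constant P_report."""
    observed = np.asarray(observed_onsets, dtype=float)
    expected = p_report * np.asarray(model_onsets, dtype=float)
    if np.any(expected <= 0):
        raise ValueError("expected counts must be positive")
    log_factorial = np.array([math.lgamma(k + 1.0) for k in observed])
    return float(np.sum(observed * np.log(expected) - expected - log_factorial))

# Hypothetical model output (all infections, by onset day) and reported onsets.
model_onsets = [100.0, 200.0]
reported_onsets = [23, 46]

ll_at_023 = poisson_log_likelihood(reported_onsets, model_onsets, 0.23)
ll_at_050 = poisson_log_likelihood(reported_onsets, model_onsets, 0.50)
```

Because P_report multiplies every day's expected count by the same factor, a change in testing or surveillance intensity during the fitting period would be absorbed into the other parameters instead, which is the bias this limitation describes.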