Digital Disease Surveillance for Emerging Infectious Diseases: An Early Warning System Using the Internet and Social Media Data for COVID-19 Forecasting in Canada

Y. Yang, S. Tsao, et al.

Dive into this study by Yang Yang, Shu-Feng Tsao, Mohammad A Basri, Helen H Chen, and Zahid A Butt, which explores how internet search trends and social media data can act as early warning signals for COVID-19 surveillance in Canada. Discover how symptom searches and social media hashtags correlated with case incidence, and which forecasting techniques the authors used.

Introduction

Emerging infectious diseases (EIDs) such as COVID-19, Ebola, and SARS pose major threats to global health and economies, with COVID-19 causing hundreds of millions of cases worldwide and substantial health expenditures in Canada. During epidemics, the public increasingly relies on internet searches and social media for health information, creating opportunities for real-time digital disease surveillance and epidemic prediction. Prior work, including the authors' scoping review, identified a gap in real-time surveillance systems for COVID-19, while platforms like HealthMap and other initiatives demonstrate the promise of leveraging online data. This study aims to evaluate the association between Google Trends (GT) and Twitter data related to COVID-19 in Canada and to investigate whether these signals have predictive power as early warning indicators for forecasting COVID-19 cases.

Methodology

Data sources and period: Google Trends (GT) and Twitter data were collected for Canada from January 1 to March 31, 2020. GT symptom keywords included: cough, runny nose, anosmia, sore throat, shortness of breath, fever, headache, body ache, dyspnea, and fatigue. Twitter data included COVID-19-related hashtags and symptom-related terms (e.g., pneumonia, cough, fever, running nose, breath). COVID-19 daily case counts were obtained from the COVID-19 Canada Open Data Working Group. All signals were normalized to a common scale.

Signal processing: To reduce noise, the study evaluated Fast Fourier Transform (FFT), Moving Average (MA), Savitzky-Golay (SG), and Lowess smoothing. Denoising was applied using adjacent averaging on both GT and Twitter signals. MA, SG, and Lowess were preferred because they fit the data better, while FFT was less effective.

Analytical approach: Time-lagged cross-correlation analyses were performed between each denoised signal and daily COVID-19 cases to identify lead–lag relationships and maximum correlation coefficients.

Forecasting model: Long short-term memory (LSTM) neural networks were developed in Python using TensorFlow and Keras. Models used the past 20 days of data as input. Training used the Adam optimizer; the number of hidden units was set to 148 based on experiments. Hyperparameters for the initial learning rate and learning rate decay were tuned in the range 0.0001–0.001. Model performance was evaluated using mean squared error (MSE), R², and adjusted R². Different feature subsets were tested: GT-only (with filters by correlation threshold), Twitter-only, and combined GT+Twitter.
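The smoothing and time-lagged cross-correlation steps described above can be illustrated with a minimal sketch. This is not the authors' code: it assumes a centered moving average with an odd window no longer than the series, and a Pearson correlation computed at each shift. The function names, the 7-day window, and the 14-day maximum lag are hypothetical choices made for illustration, and the demo data are synthetic.

```python
import numpy as np


def moving_average(signal, window=7):
    """Centered moving average (assumes an odd window <= len(signal)).

    Near the edges the average is taken over the samples that exist,
    so the ends are not pulled toward zero.
    """
    x = np.asarray(signal, dtype=float)
    kernel = np.ones(window)
    sums = np.convolve(x, kernel, mode="same")
    counts = np.convolve(np.ones_like(x), kernel, mode="same")
    return sums / counts


def lagged_correlation(signal, cases, max_lag=14):
    """Pearson r between signal[t] and cases[t + lag] for lag = 0..max_lag.

    A positive lag means the online signal leads the case counts
    by that many days.
    """
    signal = np.asarray(signal, dtype=float)
    cases = np.asarray(cases, dtype=float)
    correlations = {}
    for lag in range(max_lag + 1):
        lead = signal[: len(signal) - lag]   # earlier online activity
        follow = cases[lag:]                 # later reported cases
        correlations[lag] = float(np.corrcoef(lead, follow)[0, 1])
    return correlations


def best_lag(correlations):
    """Return (lag, r) for the lag with the largest absolute correlation."""
    lag = max(correlations, key=lambda k: abs(correlations[k]))
    return lag, correlations[lag]


if __name__ == "__main__":
    # Synthetic example only: a search curve that peaks 9 days before cases.
    days = np.arange(91)
    rng = np.random.default_rng(0)
    cases = np.exp(-((days - 60) ** 2) / 200.0)
    searches = np.exp(-((days - 51) ** 2) / 200.0) + rng.normal(0, 0.05, days.size)

    smoothed = moving_average(searches, window=7)
    lag, r = best_lag(lagged_correlation(smoothed, cases))
    print(f"best lag: {lag} days, r = {r:.3f}")
```

Scanning only non-negative lags reflects the early-warning question being asked, namely whether online activity precedes cases. A fuller analysis would likely scan negative lags as well to check whether a signal instead trails the case counts.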

Key Findings

Time-lagged cross-correlation revealed that GT symptom searches correlated with daily cases with lags ranging from 1 to 13 days. The strongest GT signals were cough (max r = 0.825, 9 days earlier), runny nose (r = 0.816, 11 days earlier), and anosmia (r = 0.812, 3 days earlier). Other GT terms were sore throat (r = 0.790, 6 days earlier), shortness of breath (r = 0.762, 9 days earlier), fever (r = 0.752, 10 days earlier), headache (r = 0.723, 8 days earlier), body ache (r = 0.612, 5 days earlier), dyspnea (r = 0.501, 13 days earlier), and fatigue (r = -0.275, 1 day earlier).

Twitter signals also correlated with cases: symptom-related tweets (r = 0.868, 11 days earlier) and COVID-19-related tweets (r = 0.840, 10 days earlier).

LSTM forecasting using GT signals with a correlation coefficient >0.75 achieved the best performance (MSE = 124.78, R² = 0.88, adjusted R² = 0.87). Twitter-only models performed poorly (e.g., MSE ≈ 23,500–26,612; negative R²), and combining GT and Twitter did not outperform the best GT-only model.
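As a rough illustration of the feature filter and the evaluation metrics mentioned above, the sketch below applies a 0.75 threshold to the reported GT coefficients and computes MSE, R², and adjusted R² with the standard textbook formulas. It is an assumption that the filter compares the signed coefficient (rather than its absolute value) against the threshold, and the helper names `select_features` and `regression_metrics` are hypothetical. The example predictions are made-up numbers, not results from the study.

```python
import numpy as np

# Maximum cross-correlation coefficients for GT terms, as reported above.
GT_MAX_R = {
    "cough": 0.825,
    "runny nose": 0.816,
    "anosmia": 0.812,
    "sore throat": 0.790,
    "shortness of breath": 0.762,
    "fever": 0.752,
    "headache": 0.723,
    "body ache": 0.612,
    "dyspnea": 0.501,
    "fatigue": -0.275,
}


def select_features(max_r, threshold=0.75):
    """Keep terms whose coefficient exceeds the threshold (signed comparison assumed)."""
    return [term for term, r in max_r.items() if r > threshold]


def regression_metrics(y_true, y_pred, n_features):
    """Return MSE, R^2, and adjusted R^2 for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    return {"mse": ss_res / n, "r2": r2, "adj_r2": adj_r2}


if __name__ == "__main__":
    print(select_features(GT_MAX_R))
    # Made-up numbers purely to exercise the formulas.
    print(regression_metrics([1, 2, 3, 4, 5], [1, 2, 3, 4, 6], n_features=1))
```

With this threshold, six of the ten GT terms pass the filter: cough, runny nose, anosmia, sore throat, shortness of breath, and fever. Note also that R² becomes negative whenever a model predicts worse than simply using the mean of the observed values, which is how a poorly fitting model can report a negative R².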

Discussion

The study demonstrates that denoised internet search queries and social media activity can serve as early warning signals for EIDs, with GT and Twitter signals leading reported COVID-19 cases by roughly 1–13 days. This lead time supports the feasibility of real-time digital disease surveillance to inform public health responses. Among denoising methods, MA, SG, and Lowess were more effective than FFT for these data. GT data exhibited more noise than Twitter, likely due to normalization, yet GT-based LSTM models outperformed Twitter-only and combined models, suggesting GT’s stronger predictive utility while Twitter may be more reactive. The findings align with prior evidence that online signals can precede incidence peaks by roughly 10–14 days and indicate the potential to extend surveillance to other platforms (e.g., Facebook, Reddit). Nonetheless, modelling challenges, data limitations early in the pandemic, and platform-specific biases warrant careful interpretation and integration with traditional surveillance.

Conclusion

Internet search engine queries and social media data can underpin a real-time digital disease surveillance system that functions as an early warning mechanism for forecasting EIDs such as COVID-19. Such systems can support epidemiological control, monitor public perceptions, and forecast outbreak trends. However, modelling challenges persist due to noisy self-generated data and the difficulty of identifying relevant symptom keywords early in an outbreak. Future work should incorporate multiple data sources, robust multimodal modelling, and ontology-based methods to dynamically identify and organize symptom keywords as knowledge evolves.

Limitations

Challenges include limited data availability at the beginning of the COVID-19 pandemic, changing conditions and unforeseen events affecting accuracy, and biases in social media data (e.g., under-representation of certain groups). GT provides relative search volumes with lower resolution, and only English tweets were analyzed, potentially omitting information from other languages. Twitter geolocation data may be inaccurate. Retrospective time-series analyses risk overfitting, particularly with complex, high-dimensional data. Noisy self-generated data complicate modelling, and identifying relevant EID symptom keywords early requires a dynamic, adaptive approach. A multifaceted strategy integrating multiple data sources and infectious disease models is recommended.
