Introduction
Emerging infectious diseases (EIDs) such as COVID-19 pose significant threats to global health and economies. The high costs associated with EIDs, exemplified by the substantial healthcare expenditures in Canada during the COVID-19 pandemic, underscore the need for effective early warning systems. Because people increasingly turn to the internet and social media for health information during outbreaks, these platforms offer an opportunity for real-time disease surveillance and prediction. Previous research has demonstrated the potential of platforms such as Google Trends and Twitter for tracking disease outbreaks; however, a gap remains in the development and implementation of real-time surveillance systems specifically for COVID-19 in Canada. This study aimed to address this gap by investigating the predictive power of Google Trends and Twitter data as early warning signals for COVID-19 cases in Canada. The researchers hypothesized that trends in online searches for COVID-19-related symptoms and in social media discussions about the virus would correlate with, and potentially predict, the incidence of COVID-19 cases, enabling earlier and more effective public health interventions.
Literature Review
The paper's introduction cites several studies supporting the use of online data for disease surveillance. A scoping review by the same authors identified a gap in real-time COVID-19 surveillance systems. Existing systems such as HealthMap and the Pandemic Response Platform are mentioned as examples that use machine learning algorithms and real-time data to track outbreaks. Additional cited sources demonstrate the potential of social media platforms (Twitter, Facebook, Reddit) and internet search data for tracking and forecasting disease outbreaks, including a study showing a correlation between online data and COVID-19 incidence with a 10-14 day lead time. Together, these sources establish the context and rationale for the study and identify the research gap it intends to address.
Methodology
Data were collected from Google Trends (GT) and Twitter from January 1, 2020, to March 31, 2020, focusing on symptom keywords (cough, runny nose, anosmia, etc.) from GT and relevant hashtags from Twitter. COVID-19 case data were obtained from the COVID-19 Canada Open Data Working Group. Data preprocessing involved normalization and the application of several noise reduction techniques: Fast Fourier Transform (FFT), Moving Average (MA), Savitzky-Golay (SG), and LOWESS smoothing. Denoising was performed using adjacent averaging. Time-lagged cross-correlation analyses were conducted to determine the relationship between the online data and daily COVID-19 cases. A long short-term memory (LSTM) model, implemented using TensorFlow and Keras in Python, was developed to forecast daily COVID-19 cases. The model was trained on the past 20 days' data using the Adam optimizer and had 148 hidden units (a number determined through experimentation). Model performance was evaluated using mean squared error (MSE), R-squared (R²), and adjusted R². Hyperparameters, such as the initial learning rate and learning rate drop, were fine-tuned.
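The smoothing and time-lagged cross-correlation steps can be illustrated with a minimal sketch. This is not the authors' code: the function names (`moving_average`, `pearson`, `lagged_correlations`, `best_lag`), the trailing-window form of the moving average, the 14-day maximum lag, and the synthetic series are all assumptions made for illustration. The idea is to correlate the online signal on day t with case counts on day t + lag and report the lag with the highest Pearson correlation.

```python
from math import sqrt


def moving_average(values, window=3):
    """Trailing moving average; early points average over the shorter history available."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant series; report 0
    return cov / sqrt(var_x * var_y)


def lagged_correlations(signal, cases, max_lag=14):
    """Correlate signal[t] with cases[t + lag] for every lag from 0 to max_lag."""
    result = {}
    for lag in range(max_lag + 1):
        x = signal[: len(signal) - lag]  # signal days that have a matching future case day
        y = cases[lag:]                  # case counts `lag` days later
        result[lag] = pearson(x, y)
    return result


def best_lag(signal, cases, max_lag=14):
    """Return (lag, correlation) for the lag with the highest correlation."""
    corr = lagged_correlations(signal, cases, max_lag)
    lag = max(corr, key=corr.get)
    return lag, corr[lag]


# Synthetic illustration (not real data): cases follow the search signal by 5 days.
search_signal = [max(0, 10 - abs(t - 15)) for t in range(40)]
daily_cases = [0] * 5 + search_signal[:-5]

lead_days, correlation = best_lag(
    moving_average(search_signal), moving_average(daily_cases)
)
print(f"signal leads cases by {lead_days} days (r = {correlation:.2f})")
```

On real data the series would first be normalized, and the smoothing method would be one of those named above; the trailing average here is only a stand-in.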
Key Findings
Time-lagged cross-correlation analysis revealed strong correlations between certain symptom keywords and daily COVID-19 cases. Specifically, Google Trends searches for "cough," "runny nose," and "anosmia" showed high correlation coefficients (>0.8) and peaked 9, 11, and 3 days before the incidence peak, respectively. Similarly, symptom-related and COVID-19-related tweets showed high correlations with daily cases, peaking approximately 10-11 days earlier. The LSTM forecasting model performed best (MSE = 124.78, R² = 0.88, adjusted R² = 0.87) when using Google Trends signals with correlation coefficients >0.75. Combining Google Trends and Twitter data did not improve model performance, and Google Trends data proved to be a more effective predictor than Twitter data alone. Table 1 and Table 2 in the original paper detail the cross-correlation analysis results and LSTM modelling results, respectively. Figure 1 shows examples of the denoising process and Figure 2 displays the prediction accuracy of the best-performing model.
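For readers unfamiliar with the three reported metrics, the sketch below shows their standard textbook definitions. The function names and the toy numbers are illustrative assumptions; the summary does not state the sample size or the number of predictors behind the paper's adjusted R², so `n_samples` and `n_predictors` are placeholders here.

```python
def mse(actual, predicted):
    """Mean squared error: the average squared difference between truth and forecast."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_predictors):
    """Adjusted R²: penalizes R² for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Toy values for illustration only (not taken from the paper).
actual_cases = [10, 20, 30, 40, 50]
forecast_cases = [12, 18, 33, 39, 48]

r2 = r_squared(actual_cases, forecast_cases)
print("MSE:", mse(actual_cases, forecast_cases))
print("R²:", round(r2, 3))
print("Adjusted R²:", round(adjusted_r_squared(r2, n_samples=5, n_predictors=1), 3))
```

Because the adjustment shrinks R² as predictors are added, an adjusted R² slightly below R², as in the reported 0.87 versus 0.88, is the expected pattern.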
Discussion
The study demonstrates the potential of internet search engine queries and social media data for digital disease surveillance and early warning of EIDs. The strong correlations found between online data and COVID-19 incidence, particularly the lead time observed in searches, support the use of these data sources for real-time forecasting. The superior performance of the LSTM model using Google Trends data suggests that the structured, quantitative nature of Google Trends data may suit predictive modeling better than the more varied and potentially noisy Twitter data. The findings align with previous research showing a correlation between online data and disease incidence. However, limitations in the data and the methods, discussed in the Limitations section, must be considered when interpreting the results. Future research could explore other social media platforms and improve the modeling techniques to enhance accuracy and address the identified limitations.
Conclusion
This study establishes the feasibility of using internet search engine queries and social media data for real-time digital disease surveillance and early warning of EIDs. The high accuracy of the LSTM model using Google Trends data highlights the potential for improved outbreak response. However, challenges remain in handling data noise and in dynamically identifying relevant symptom keywords. Future research should focus on developing more robust and adaptable models, incorporating multiple data sources, and employing advanced techniques for keyword identification to overcome these limitations and improve the accuracy and generalizability of the prediction models.
Limitations
The study acknowledged several limitations. The limited data available at the beginning of the pandemic could affect the model's generalizability. Changing conditions and unforeseen events during the pandemic could also influence the accuracy of predictions. Potential biases in social media data, such as the under-representation of certain groups, were also noted. Google Trends data has lower resolution and provides only relative search volume. The study used only English-language tweets, and geolocation issues may be present. Retrospective time-series analyses are susceptible to overfitting. A multifaceted strategy that incorporates diverse data sources and advanced modeling techniques is needed to improve the reliability and accuracy of such surveillance systems.