COVID-19 predictability in the United States using Google Trends time series

A. Mavragani and K. Gkillas

This paper explores how Google Trends data can predict COVID-19 cases and deaths in the United States, highlighting its significant implications for public health policy. The research was conducted by Amaryllis Mavragani and Konstantinos Gkillas.

Introduction
The COVID-19 pandemic presented an unprecedented global challenge and called for novel approaches to monitoring and forecasting regional outbreaks. This study investigates the potential of Google Trends data as a real-time indicator for predicting COVID-19 trends in the United States. The virus, which originated in Wuhan, China in December 2019, spread rapidly into a global pandemic that affected many countries, including the United States. The high number of cases and deaths in the US, particularly in New York state, underscored the need for effective monitoring and forecasting tools. Infodemiology, which uses internet-based data for public health purposes, showed promise in previous outbreaks (Ebola, Zika, MERS, influenza, measles). This research builds on existing infodemiology studies that use Google Trends, a valuable tool for analyzing and forecasting diseases, by focusing specifically on the United States and applying robust statistical methods suited to small samples and potential outliers.
Literature Review
Several studies have explored the use of web-based data, particularly Google Trends and Twitter, for monitoring and forecasting COVID-19. By April 2020, seven studies had used Google Trends data to analyze COVID-19 in various regions, including Taiwan, China, Europe, the US, and Iran. Eight studies had employed Twitter data for similar purposes. This study distinguishes itself by being, to the authors' knowledge, the first to investigate COVID-19 predictability in the United States at both national and state levels using Google Trends data, focusing on the "coronavirus (virus)" search term.
Methodology
Normalized data were retrieved from Google Trends in .csv format; the values reflect relative, not absolute, search popularity. COVID-19 data for the US were obtained from the COVID Tracking Project, which provides national and state-level figures. To assess the relationship between Google Trends and COVID-19 data, both Pearson and Kendall rank correlations were calculated, with bias correction via bootstrap simulation to handle small sample sizes and potential outliers. The predictability of COVID-19 from Google Trends time series was then modeled with a bias-corrected quantile regression, a robust method suitable for small samples with outliers. The dependent variable was the ratio of COVID-19 deaths to cases, and the explanatory variable was the lagged Google Trends data. A linear trend was also included in the model. The analysis was performed in the R programming environment, using the "quantreg" and "boot" packages for quantile regression and bootstrapping, respectively. The timeframe for data analysis varied by state, starting from March 4th or the date of the first confirmed case, to match available COVID-19 data.
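As a rough illustration of the correlation step, the sketch below computes Pearson and Kendall (tau-b) correlations and applies a simple bootstrap bias correction (twice the sample estimate minus the mean of the bootstrap replicates). It is written in plain Python, not with the R "boot" package the authors used, and it resamples (x, y) pairs independently, so it ignores serial dependence. The function names, the number of replicates, and the example numbers are all assumptions made for illustration, not data or settings from the study.

```python
import math
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation; returns 0.0 when a variance is zero (sketch convention)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def kendall_tau_b(x, y):
    """Kendall tau-b, which adjusts the denominator for tied pairs."""
    n = len(x)
    concordant = discordant = ties_x = ties_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0:
                ties_x += 1
            if dy == 0:
                ties_y += 1
            if dx != 0 and dy != 0:
                if dx * dy > 0:
                    concordant += 1
                else:
                    discordant += 1
    n0 = n * (n - 1) // 2
    denom = math.sqrt((n0 - ties_x) * (n0 - ties_y))
    if denom == 0:
        return 0.0
    return (concordant - discordant) / denom


def bias_corrected(stat, x, y, n_boot=2000, seed=0):
    """Bootstrap bias correction: estimate minus (mean of replicates minus estimate)."""
    rng = random.Random(seed)
    n = len(x)
    estimate = stat(x, y)
    replicates = []
    for _ in range(n_boot):
        # Resample observation pairs with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        replicates.append(stat([x[i] for i in idx], [y[i] for i in idx]))
    bias = mean(replicates) - estimate
    return estimate - bias


if __name__ == "__main__":
    # Made-up numbers for illustration only (not data from the study).
    trends = [12, 25, 48, 100, 83, 70, 61, 55, 47, 40]
    ratio = [0.010, 0.012, 0.015, 0.019, 0.022, 0.026, 0.029, 0.031, 0.034, 0.036]
    print("Pearson (bias-corrected):", bias_corrected(pearson, trends, ratio))
    print("Kendall (bias-corrected):", bias_corrected(kendall_tau_b, trends, ratio))
```

Because the Kendall statistic depends only on the ordering of pairs, a single extreme value moves it far less than it moves the Pearson statistic, which is one reason to report both.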
Key Findings
Statistically significant correlations were observed between Google Trends data and COVID-19 data for the US overall and many individual states. Both Pearson and Kendall correlation analyses yielded consistent results, although the Pearson correlation was more sensitive to outliers due to its reliance on deviations from the sample mean. The quantile regression model demonstrated strong COVID-19 predictability using Google Trends data. The heat map of the regression coefficient (β₁) visually represents the predictability across states. The analysis revealed an inverse relationship between the progression of COVID-19 and online interest: search interest rose initially and then declined, possibly due to information overload. This inverse relationship, however, does not invalidate the predictive power of the model. While some states showed a positive correlation (an increase in Google Trends searches along with an increase in the deaths/cases ratio), many exhibited a negative correlation (an increase in Google Trends searches associated with a decrease in the deaths/cases ratio). This mix of signs across states points to non-linear dynamics in the relationship.
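The coefficient β₁ discussed above comes from a quantile regression of the deaths-to-cases ratio on lagged Google Trends values and a linear trend. A minimal sketch of such a fit follows, assuming a model of the form Q_τ(y_t) = β₀ + β₁·x_{t−lag} + β₂·t. The lag, the quantile τ, and all function names are assumptions, since they are not stated here. The sketch finds the fit by brute force over all triples of observations, which is practical only for small samples and relies on the fact that a check-loss minimizer can be found among fits passing exactly through three points. It omits the bootstrap bias correction and does not use the R "quantreg" package from the study.

```python
from itertools import combinations


def check_loss(residual, tau):
    """Quantile (check) loss: tau * r for r >= 0, (tau - 1) * r for r < 0."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def _det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def _solve3(a, b):
    """Solve a 3x3 linear system by Cramer's rule; None if (near-)singular."""
    d = _det3(a)
    if abs(d) < 1e-12:
        return None
    solution = []
    for k in range(3):
        ak = [row[:] for row in a]
        for r in range(3):
            ak[r][k] = b[r]
        solution.append(_det3(ak) / d)
    return solution


def quantile_regression(y, x, tau=0.5, lag=1):
    """Fit y_t = b0 + b1 * x_{t-lag} + b2 * t at quantile tau.

    Returns [b0, b1, b2], or None if no non-singular triple of rows exists.
    """
    rows = [[1.0, float(x[t - lag]), float(t)] for t in range(lag, len(y))]
    targets = [float(y[t]) for t in range(lag, len(y))]
    best_beta, best_loss = None, float("inf")
    for trio in combinations(range(len(rows)), 3):
        # Candidate fit: the plane passing exactly through three observations.
        beta = _solve3([rows[i] for i in trio], [targets[i] for i in trio])
        if beta is None:
            continue
        loss = sum(
            check_loss(yt - sum(b * r for b, r in zip(beta, row)), tau)
            for row, yt in zip(rows, targets)
        )
        if loss < best_loss:
            best_beta, best_loss = beta, loss
    return best_beta


if __name__ == "__main__":
    # Made-up numbers for illustration only (not data from the study).
    trends = [12, 25, 48, 100, 83, 70, 61, 55, 47, 40]
    ratio = [0.010, 0.012, 0.015, 0.019, 0.022, 0.026, 0.029, 0.031, 0.034, 0.036]
    b0, b1, b2 = quantile_regression(ratio, trends, tau=0.5, lag=1)
    print("beta1 (lagged search interest):", b1)
```

In this setup a negative fitted β₁ would mean that higher search interest at time t−lag goes with a lower deaths-to-cases ratio at time t, and a positive β₁ the reverse, which is how the sign pattern across states is read.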
Discussion
The findings support the use of Google Trends data as a valuable tool for predicting COVID-19 trends in the United States. The robust statistical methods employed addressed the challenges posed by limited data and potential outliers. The strong predictability demonstrated by the quantile regression model indicates that Google Trends could inform public health policy during outbreaks and help local authorities make informed decisions about resource allocation and public health measures. The inverse relationship between the progression of COVID-19 and online search interest warrants further investigation, considering potential factors such as information saturation and reporting delays.
Conclusion
This study demonstrates the significant potential of Google Trends data for predicting COVID-19 trends in the United States at both national and state levels. The robust quantile regression analysis revealed strong predictability despite the complexities of the virus's spread and public response to the information. Future research could explore the integration of Google Trends with other infodemiology sources and traditional data to enhance disease surveillance capabilities. Further investigation into the non-linear dynamics observed in the relationship between COVID-19 progression and online search interest is also warranted.
Limitations
This study has limitations. Only Google Trends data were used; incorporating data from other search engines could provide a more comprehensive picture. The relatively small number of observations, particularly for individual states, could limit the generalizability of the results. Because confirmed cases and deaths differ widely across states, conclusions are specific to each state. The dynamic nature of the pandemic and the potential for data anomalies due to interventions also affect the interpretation of the findings.