COVID-19 predictability in the United States using Google Trends time series
A. Mavragani and K. Gkillas
The study investigates whether real-time online search activity can help monitor and predict COVID-19 dynamics in the United States. Motivated by the rapid global spread of COVID-19 from late 2019 through early 2020 and the shift of the epidemic’s center to the US, the authors position infodemiology—using internet data to inform public health—as a timely approach to complement traditional surveillance. The paper focuses on the Google Trends topic “coronavirus (virus)” and examines its relationship with COVID-19 cases and deaths at national and state levels. The research questions are: (1) Is there a statistically significant association between Google search interest and COVID-19 epidemiological indicators? (2) Do Google Trends time series improve predictability of COVID-19 outcomes, specifically the ratio of cumulative deaths to cumulative cases, across the US and individual states? The purpose is to provide evidence that online search behavior can serve as an early, robust indicator to aid public health decision-making, resource allocation, and preparedness.
The paper situates its contribution within the emerging infodemiology literature leveraging web data for outbreak monitoring (Ebola, Zika, MERS, influenza, measles). During COVID-19, Google Trends has been widely used, with at least seven early studies (as of April 20, 2020) addressing monitoring, tracking, and forecasting across Taiwan/China, Europe, the US, and Iran. Additional work has used Twitter for surveillance, public sentiment, and misinformation tracking. Table 1 systematically lists Google Trends–based COVID-19 studies by author, date, region, objective, and journal, highlighting rapid growth in this research area and supporting the plausibility of using online search data for pandemic surveillance.
Data sources: Google Trends data for the topic “coronavirus (virus)” were retrieved as CSV exports; Google normalizes these values over the selected geography and time period and scales them from 0 to 100. COVID-19 epidemiological data (cumulative cases and deaths) were obtained from the COVID Tracking Project for the US and each state (plus DC). Worldwide distributions used Worldometer data. Visualization tools included Pixelmap and Chartsbin; analysis and plotting used Microsoft Excel.
Timeframes: Google Trends and COVID-19 data were aligned per state to match data availability and the onset of the first confirmed case. The time windows varied by state and fell between March 4 and April 15, 2020 (detailed in Table 2).
Preprocessing: Each variable was standardized by its full-sample standard deviation so that all variables had SD = 1, which makes effect magnitudes comparable. A nonparametric unit root test indicated stationarity, so the variables were used without further transformation.
Correlation analyses: The primary analysis estimated Pearson correlation coefficients between Google Trends and the ratio (cumulative COVID-19 deaths)/(cumulative COVID-19 cases), using a minimum-variance bias-corrected bootstrap estimator with B = 999 replications to mitigate small-sample bias. A secondary Kendall rank correlation (nonparametric, distribution-free, robust for ratios) was also computed with bootstrap bias correction.
Predictability analysis: A quantile regression framework (median regression, q = 0.5) modeled the dependent variable Y_t = (cumulative deaths)/(cumulative cases) using lagged Google Trends X_{t-1} as the explanatory variable, plus a linear trend. Quantile regression was chosen for its robustness to outliers and non-normal errors. Estimates were bias-corrected via balanced bootstrapping: the bias-corrected parameter for each quantile was obtained by subtracting the bootstrap-estimated bias from the original estimate. Analyses were conducted in R (quantreg and boot packages). Code was provided as Supplementary Online Material.
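The correlation step described above can be illustrated with a minimal Python sketch. It is not the authors' R code: it assumes a plain pairs bootstrap (which ignores serial dependence and is not the minimum-variance estimator the paper names), assumes that "standardized by its standard deviation" means dividing by the sample SD without removing the mean, and runs on made-up numbers. All function and variable names are hypothetical.

```python
import math
import random


def standardize_by_sd(values):
    """Divide a series by its sample SD so the result has SD = 1 (assumed reading)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [v / sd for v in values]


def pearson(x, y):
    """Pearson product-moment correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def kendall_tau_b(x, y):
    """Kendall rank correlation (tau-b, which adjusts for ties)."""
    conc = disc = only_x_tied = only_y_tied = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied on both variables: contributes to neither term
            if dx == 0:
                only_x_tied += 1
            elif dy == 0:
                only_y_tied += 1
            elif dx * dy > 0:
                conc += 1
            else:
                disc += 1
    denom = math.sqrt((conc + disc + only_y_tied) * (conc + disc + only_x_tied))
    return (conc - disc) / denom


def bias_corrected_bootstrap(x, y, stat=pearson, B=999, seed=0):
    """Return (bias-corrected estimate, bootstrap SE) from B pairs resamples."""
    rng = random.Random(seed)
    n = len(x)
    theta_hat = stat(x, y)
    reps = []
    while len(reps) < B:
        idx = [rng.randrange(n) for _ in range(n)]
        xs, ys = [x[i] for i in idx], [y[i] for i in idx]
        if len(set(xs)) < 2 or len(set(ys)) < 2:
            continue  # skip degenerate resamples with no variation
        reps.append(stat(xs, ys))
    boot_mean = sum(reps) / B
    bias = boot_mean - theta_hat  # bootstrap estimate of the bias
    se = math.sqrt(sum((r - boot_mean) ** 2 for r in reps) / (B - 1))
    return theta_hat - bias, se


# Made-up illustration: search interest falls while the deaths/cases ratio rises.
interest = [100, 96, 91, 88, 80, 79, 73, 70, 64, 61, 58, 52]
ratio = [0.010, 0.012, 0.011, 0.015, 0.018, 0.017,
         0.022, 0.025, 0.024, 0.030, 0.033, 0.035]
x_std, y_std = standardize_by_sd(interest), standardize_by_sd(ratio)

r_bc, r_se = bias_corrected_bootstrap(x_std, y_std, stat=pearson)
tau_bc, tau_se = bias_corrected_bootstrap(x_std, y_std, stat=kendall_tau_b)
print(f"Pearson (bias-corrected): {r_bc:.4f} (SE {r_se:.4f})")
print(f"Kendall (bias-corrected): {tau_bc:.4f} (SE {tau_se:.4f})")
```

Because both correlation measures are unaffected by positive rescaling, the standardization step does not change them here; it matters for comparing regression coefficients across states.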
Key findings:
- Google interest in the “coronavirus (virus)” topic was high worldwide and across all US states during Jan 22–Apr 15, 2020, with US interest generally above 70 on the 0–100 scale (Fig. 3).
- Correlations:
  - USA-level Pearson correlation between Google Trends and the deaths/cases ratio: r = -0.7054 (SE 0.0536), p < 0.0001; Kendall τ = -0.6230 (SE 0.0780), p ≈ 1.36E-15, indicating a strong inverse association.
  - Many states showed significant negative Pearson correlations (e.g., New York r = -0.5925***, Washington r = -0.5805***, Massachusetts r = -0.4285***, Wisconsin r = -0.3972***, California r = -0.4123***), while a few exhibited positive or nonsignificant correlations (e.g., Arkansas r = 0.4282***; Oregon r = 0.4605***; Illinois r = 0.2501, ns). Kendall results were broadly consistent: numerous states were significant and generally negative, though sign and significance varied across states (Tables 3–4, Figs. 4–5).
- Predictability (quantile regression, median q = 0.5): Models exhibited strong COVID-19 predictability nationally and for most states (Table 5, Fig. 6). Example coefficients (β1 is the effect of lagged Google Trends):
  - USA: β1 = -0.7506***, indicating that higher search interest predicts a lower subsequent deaths/cases ratio.
  - Many states had significant negative coefficients (e.g., VA β1 = -0.3147***; MI β1 = -0.1884***; TN β1 = -0.2157***; WI β1 = -0.2143***; NV β1 = -0.2192***; KY β1 = -0.2415***).
  - Some states showed significant positive β1 (e.g., OR β1 = 0.4395***; IL β1 = 0.3943***; SD β1 = 0.2815**), highlighting heterogeneity.
- Overall pattern: As COVID-19 progressed, online interest tended to decrease, producing an inverse relationship between search interest and the deaths/cases ratio. The relationship is non-linear and time-varying; interest peaks early and declines as the epidemic advances (Fig. 7).
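To make the β1 coefficients above concrete, the following Python sketch fits a median regression of the form Y_t = β0 + β1·X_{t-1} + β2·t by least absolute deviations. It is an illustration under stated assumptions, not the authors' quantreg code: the fit uses an exhaustive search over three-point subsets (workable only for short series), the balanced-bootstrap bias correction is omitted, and the data are synthetic. The helper names `solve3` and `median_regression` are hypothetical.

```python
from itertools import combinations


def solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination; return None if singular."""
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    n = 3
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))  # partial pivoting
        if abs(M[p][c]) < 1e-12:
            return None
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    beta = [0.0] * n
    for r in reversed(range(n)):
        s = M[r][n] - sum(M[r][k] * beta[k] for k in range(r + 1, n))
        beta[r] = s / M[r][r]
    return beta


def median_regression(y, x):
    """Fit y[t] = b0 + b1 * x[t-1] + b2 * t by minimizing the sum of absolute residuals.

    A least-absolute-deviations optimum always exists among fits that pass
    exactly through three observations, so every such fit is tried.
    """
    rows = [[1.0, float(x[t - 1]), float(t)] for t in range(1, len(y))]
    target = [y[t] for t in range(1, len(y))]
    best_loss, best_beta = float("inf"), None
    for subset in combinations(range(len(rows)), 3):
        beta = solve3([rows[i] for i in subset], [target[i] for i in subset])
        if beta is None:
            continue  # the three observations do not determine a unique fit
        loss = sum(abs(yi - sum(c * v for c, v in zip(beta, row)))
                   for row, yi in zip(rows, target))
        if loss < best_loss:
            best_loss, best_beta = loss, beta
    return best_beta, best_loss


# Synthetic series: search interest x and a ratio y driven by lagged interest.
x = [90, 100, 95, 85, 88, 76, 80, 70, 72, 60, 65, 55]
true_b0, true_b1, true_b2 = 2.0, -0.02, 0.01
y = [0.2] + [true_b0 + true_b1 * x[t - 1] + true_b2 * t for t in range(1, len(x))]
# y[0] is a placeholder: it has no lagged predictor and is not used in the fit.
y[6] += 0.5  # one outlier, to show the robustness of the median fit

beta, loss = median_regression(y, x)
print("b0, b1, b2 =", [round(b, 4) for b in beta])
```

In this synthetic case the single outlier should leave the fitted coefficients at the generating values, which is the robustness property that motivated the choice of quantile regression; an ordinary least-squares fit would be pulled toward the outlier.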
The findings demonstrate that Google Trends data are significantly associated with COVID-19 outcomes and can enhance predictability of the deaths-to-cases ratio at national and state levels. This addresses the research question by showing that online search activity serves as a useful, timely indicator for monitoring and forecasting COVID-19 dynamics. The inverse association likely reflects behavioral responses: initial surges in public attention during rapid case growth, followed by declining interest due to information overload or habituation even as the epidemic continues. Methodologically, quantile regression offered robustness to outliers and small samples, and Kendall correlations corroborated Pearson results while being less sensitive to distributional assumptions. The significant predictive power across many states suggests practical value for public health, including earlier detection of regional shifts, support for resource allocation, and informing interventions. Heterogeneity in sign and magnitude across states emphasizes regional differences in epidemic timing, reporting, and population behavior, and cautions against uniform interpretations.
This study contributes evidence that Google Trends time series can monitor and predict COVID-19 at national and state levels in the US. Significant correlations and robust quantile regression models indicate search interest is informative for forecasting the deaths/cases ratio. These infodemiology tools can augment traditional surveillance to help flatten the curve, optimize resource allocation, and improve preparedness. Future research should: (1) integrate multiple infodemiology sources (e.g., additional search engines, social media); (2) combine online indicators with traditional epidemiological data for hybrid models; (3) investigate temporal dynamics and lag structures as outbreaks evolve; and (4) examine regional heterogeneity to tailor public health responses.
Limitations:
- Only one search engine (Google) was analyzed; relevant data from other engines were not included.
- Limited time span and small sample sizes may affect stability and generalizability of estimates.
- Substantial heterogeneity exists across states in epidemic progression and reporting; conclusions are state-specific and should be interpreted cautiously.
- Online search traffic data carry known limitations (e.g., changing public interest, media effects), and formal statistical correlations should be interpreted with care, especially amid dynamic non-pharmaceutical interventions and potential data anomalies.