Leveraging social media data for pandemic detection and prediction

B. Shi, W. Huang, et al.

This study by Boyang Shi, Weixiang Huang, Yuanyuan Dang, and Wenhui Zhou examines how social media data can support pandemic detection and prediction, focusing on the COVID-19 outbreak in China. Analyzing user-generated content on Sina Weibo, the authors find that posts from the general public, especially those with negative sentiment, are significantly associated with outbreak severity, and that this predictive power stems largely from individuals self-reporting illness symptoms.

Introduction

The study investigates which types of social media information best reflect ground-truth pandemic dynamics and why. Motivated by delays and limitations in traditional surveillance systems, the authors examine the predictive value of social media signals across multiple COVID-19 waves in China. They identify gaps in prior work: focus on single waves, lack of differentiation by user type (public vs. opinion leaders vs. organizations), limited analysis of sentiment types, and black-box feature engineering with little causal explanation. The research question asks which social media users and sentiments most closely track and predict outbreak severity, and seeks explanatory evidence for why these signals work, aiming to inform timely monitoring and policymaking.

Literature Review

Prior studies show social media can predict disease prevalence and broader societal impacts during COVID-19 (e.g., petroleum markets, stock markets, tourism, transit). Twitter/Weibo signals correlate with cases and can improve forecasts; some integrate epidemiological models (SIR/SEIR) with social media. However, most works analyze single waves, aggregate all users without distinguishing user types, treat sentiment coarsely, and emphasize prediction accuracy over interpretability. Agenda-setting and crisis communication literature suggest organizational accounts (government, media) may amplify official narratives and lag case trends, whereas the public can act as social sensors, especially during emergent crises. Sentiment research indicates emotional signals can precede case trends, but not all sentiments are equally predictive; symptom- and diagnosis-related content is particularly informative.

Methodology

Data sources: (i) Sina Weibo microblogs collected via the advanced search API using COVID-19-related hashtags (#Epidemic#, #COVID-19#, #COVID-19 Pneumonia#, #Novel Coronavirus#, #Positive#, #Infection#, #Cough#) from 2020-01-10 to 2023-01-03; 545,814 original posts were collected and 486,166 remained after cleaning (removing links, emails, IDs, hashtags, and extra spaces). Fields: content, date, user IP (geo), and user type (public, opinion leader, government, media). (ii) Baidu Search Index for COVID-19 and symptom keywords. (iii) External data: daily provincial new cases, deaths, and cures (NHC), hospital visits (monthly), registered urban unemployment, per-capita disposable income, passenger traffic, weather (dew point, temperature), and government response and economic support indices (OxCGRT).

Bot/spam/dedup handling: The authors trained classifiers to detect bot accounts and spam content, applied two text deduplication methods, and repeated all analyses on the resulting datasets; the findings were robust (details in Supplementary Appendix C1–C3).

Sentiment analysis: Polarity (negative, neutral, positive) was classified with chinese-roberta-wwm-ext-large fine-tuned on CCIR 2020; performance: accuracy 0.7613, F1 0.7301. Emotion recognition used SMP 2020 (neutral, happy, angry, sad, fear, surprise); performance: accuracy 0.9744, F1 0.7646. Emotions were mapped to sentiments: neutral and surprise → neutral; happy → positive; angry, sad, and fear → negative.

Empirical models: Panel regressions relate daily provincial new cases Case_{t,p} to lagged log counts of posts by user type (public, opinion leader, government, media) and to lagged log counts of public posts by sentiment (neutral, negative, positive), with control variables (deaths, cures, weather, policy/response indices, economic/activity indicators) and fixed effects. Predictors enter with a one-day lag, following prior work. Cross-correlation analyses between social media activity or search indices and new cases assess temporal alignment.
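The cross-correlation step described above can be illustrated with a minimal sketch. It is not the authors' code: the function names (`pearson`, `lagged_cross_correlation`, `best_lag`) and the synthetic series are assumptions made for illustration, and the study may have computed its cross-correlations differently (for example, with different normalization or differencing). The sketch correlates a signal at day t with cases at day t + lag and reports the lag with the highest correlation.

```python
from math import exp, sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def lagged_cross_correlation(signal, cases, max_lag):
    """Correlate signal[t] with cases[t + lag] for lag = 0..max_lag.

    A peak at a positive lag suggests the signal tends to move before cases.
    """
    result = {}
    for lag in range(max_lag + 1):
        n = len(signal) - lag
        result[lag] = pearson(signal[:n], cases[lag:lag + n])
    return result


def best_lag(signal, cases, max_lag):
    """Return (lag, correlation) for the lag with the highest correlation."""
    ccf = lagged_cross_correlation(signal, cases, max_lag)
    lag = max(ccf, key=ccf.get)
    return lag, ccf[lag]


# Synthetic illustration only (not study data): a bump-shaped "post volume"
# series and a "cases" series that repeats it three days later.
signal = [exp(-(((t - 20) / 5.0) ** 2)) for t in range(60)]
cases = [0.0] * 3 + signal[:-3]

lag, r = best_lag(signal, cases, max_lag=10)
print(lag, round(r, 4))
```

On real data the peak would be lower than in this toy example and should be read alongside the full correlation curve, since a single best lag can be unstable across waves.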

Prediction models: The authors built univariate predictor models for national daily new cases (Case_t) using Weibo-based predictors (public activity by sentiment), Baidu Search Index keywords (e.g., Coronavirus, Cough), and a baseline using Case_t itself (autoregressive). Models: Linear Regression (LR), K-Neighbors Regressor (KNR), Random Forest Regressor (RFR), and Extra Trees Regressor (ETR). Training strategy: expanding-window time series cross-validation (EXPANDING_WINDOW=56). Feature selection uses a lag window parameter d with bounds MIN_d and MAX_d empirically set to 15 days (aiming for early warning and reflecting observed cross-correlations). Evaluation metric: mean absolute percentage error (MAPE).

Content analysis: Topic modeling on public users' posts with LDA and the Biterm Topic Model (BTM) identifies themes; coherence scores (UMass, CV) and manual interpretability were used to choose 3 topics. Semantic analysis applies Integrated Gradients to the fine-tuned BERT model to extract salient sentiment-indicative tokens, linking negative-sentiment terms to symptom self-reports.
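The expanding-window evaluation with MAPE can be sketched as below. This is a simplified illustration under stated assumptions, not the study's pipeline: it uses a one-feature least-squares line in place of the four models listed, treats the window value of 56 as the initial training size (the source does not spell out its meaning), searches hypothetical lag bounds of 1 to 15 days, and runs on synthetic data. All function names are made up for the example.

```python
from math import sin


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return 0.0, my  # constant predictor: fall back to the mean
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def mape(actual, predicted):
    """Mean absolute percentage error, skipping days with zero actual cases."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    return 100.0 * sum(abs((a - p) / a) for a, p in pairs) / len(pairs)


def expanding_window_mape(predictor, target, lag, initial_window):
    """Walk forward one day at a time, refitting on all earlier aligned pairs.

    The feature for target[t] is predictor[t - lag].
    """
    xs = predictor[:len(predictor) - lag]
    ys = target[lag:]
    actual, predicted = [], []
    for end in range(initial_window, len(ys)):
        slope, intercept = fit_line(xs[:end], ys[:end])
        predicted.append(slope * xs[end] + intercept)
        actual.append(ys[end])
    return mape(actual, predicted)


def select_lag(predictor, target, min_d, max_d, initial_window):
    """Pick the lag d in [min_d, max_d] with the lowest expanding-window MAPE."""
    return min(
        range(min_d, max_d + 1),
        key=lambda d: expanding_window_mape(predictor, target, d, initial_window),
    )


# Synthetic illustration only: the target follows the predictor 7 days later.
predictor = [50 + 30 * sin(t / 6.0) for t in range(100)]
target = [100 + 2 * predictor[max(t - 7, 0)] for t in range(100)]

chosen = select_lag(predictor, target, min_d=1, max_d=15, initial_window=56)
print(chosen, expanding_window_mape(predictor, target, chosen, 56))
```

Refitting only on past observations at each step keeps future case counts out of the training data, which is the main reason to prefer this scheme over shuffled cross-validation for time series.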

Key Findings

- Public users are the most informative social sensors: In panel regressions, log(public posts) is positively and significantly associated with next-day new cases (e.g., OLS coef ≈ 8.230 (SE 2.768), Poisson ≈ 0.903*, NB ≈ 1.066**), while media and government signals are weaker or negatively associated; opinion leader effects are not consistently significant.
- Negative sentiment from public users is the strongest sentiment indicator: log(negative) is positively and significantly associated with cases across models (e.g., OLS ≈ 17.50*, Poisson ≈ 0.402*, NB ≈ 0.687*), whereas neutral and positive sentiment are not consistently significant.
- Temporal lead of social/web signals: Daily Weibo post volume shows a local maximum cross-correlation with cases of 0.553 at a lag of 12 days. Baidu Search Index keyword cross-correlations with cases: Coronavirus 0.5569 (lag 14), Cough 0.6734 (lag 19); symptom-related terms show stronger correlations and consistently positive lags compared with general COVID-19 terms.
- Prediction performance: Among Weibo predictors, negative sentiment from public users yields the lowest MAPE (e.g., ETR ≈ 24.80 at optimal lag 7), outperforming neutral and positive sentiment. Across all predictors, the Baidu Search Index symptom keyword Cough achieves very low MAPE (e.g., ≈ 5–6 with tree-based models) and outperforms the disease term Coronavirus; the autoregressive Case_t baseline also performs well, but not as well as the best symptom search signal.
- Content explains mechanism: Topic modeling reveals three themes: (1) COVID-19 infection/management (local max cross-corr ≈ 0.6248, lag 12), (2) epidemiological trends/diagnosis (≈ 0.3405, lag 7), and (3) physical symptoms of illness (≈ 0.7469, lag 15). The strongest association with cases comes from symptom-related content. Semantic analysis shows public negative-sentiment keywords are predominantly symptom terms; sentiment keyword sets are distinct (Spearman: neutral vs negative −0.08; neutral vs positive 0.17; negative vs positive 0.14; all p<0.01).

Together, these findings support the view that individual self-reports of illness (often negative in sentiment) drive the predictive power.

Discussion

Findings address the research question by showing that public users—particularly their negative-sentiment posts—most closely track and anticipate pandemic severity. Organizational accounts (government, media) tend to amplify official narratives or follow case trends, limiting their utility as early sensors. The predictive advantage of negative sentiment arises because symptom self-reports and illness experiences are often expressed negatively and precede official confirmations, providing lead time. Leveraging these signals can enable earlier and more localized monitoring and forecasting than traditional systems, informing resource allocation and policy responses. Comparison with search indices further indicates symptom-focused web behaviors (e.g., queries about cough) are superior early indicators compared to general disease terms.

Conclusion

The study demonstrates that social media activity from public users, especially negative-sentiment posts, effectively reflects and predicts pandemic dynamics across multiple COVID-19 waves in China. Empirical regressions, time-series prediction with multiple models, topic modeling, and semantic analyses consistently indicate that individual symptom self-reports are the core mechanism underlying predictive power. Practical implications include prioritizing public-origin, symptom-focused, and negative-sentiment content for real-time monitoring and forecasting, enabling earlier warnings and targeted resource allocation. Future work should extend to additional platforms and modalities, enhance sentiment/emotion modeling accuracy, and further examine the role of automated/bot accounts.

Limitations

- Single-platform scope (Sina Weibo); generalizability to other platforms (e.g., Twitter, Facebook, Instagram) requires validation.
- Text-only analysis; multimodal signals (images, audio, video) were not incorporated.
- Bot accounts were identified but not deeply analyzed for their specific roles or impacts.
- Sentiment classification accuracy can be improved; better models may refine estimates and signals.
