
Using big data to understand the online ecology of COVID-19 vaccination hesitancy
S. Teng, N. Jiang, et al.
This study examines the reasons behind COVID-19 vaccine hesitancy through an analysis of more than 43,000 YouTube comments. Conducted by Shasha Teng, Nan Jiang, and Kok Wei Khong, the research shows how safety concerns, distrust, and political influences shape public perception of vaccines, and suggests that evidence-based messaging addressing these issues could reduce hesitancy.
Introduction
The study is situated in the context of late 2020 COVID-19 vaccine efficacy announcements (Pfizer-BioNTech, Moderna, AstraZeneca-Oxford) and early vaccination rollouts with varying uptake across countries. Despite high reported efficacies, low uptake threatens herd immunity targets. Prior surveys indicated varying willingness to vaccinate (e.g., USA ~69% willing pre-authorization; substantial hesitant/unsure shares in Europe and the UK) and highlighted concerns about safety, side-effects, and distrust. Social media became a primary arena for vaccine discourse during the pandemic, yet limited research has examined individual-level social media data to link discourse themes to vaccination intention over time. This study seeks to: (1) identify major themes in YouTube audience discussions of COVID-19 vaccines and (2) assess how these themes (mapped to Health Belief Model constructs and contextual factors) relate to vaccination intention.
Literature Review
Vaccine hesitancy is defined by WHO SAGE as delay or refusal of vaccination despite availability, influenced by confidence, complacency, and convenience. Traditional determinants include trust, perceived risks/benefits, access, cost, and religion. For COVID-19, prior reasons for hesitancy may not generalize due to novelty and evolving safety evidence. Surveys across the US, UK, and Europe documented sizable hesitant or unwilling proportions, with concerns focused on safety and side-effects. Social media shapes vaccine decisions: studies show anti-vaccine clusters entangling undecided users, and randomized trials show misinformation reducing vaccination intent (~6 percentage-point declines in the UK/USA). Prior social media research largely performed sentiment/topic analyses (Twitter, YouTube, TikTok) but rarely linked identified themes to behavioral intentions. Compared with social media analytics, surveys are slower, costlier, and less able to track nuanced, real-time changes; social media analytics provide large-scale, real-time insights into beliefs and discourse dynamics, potentially reducing response biases and capturing sensitive attitudes.
Methodology
Design: Mixed-method approach combining qualitative thematic analysis of YouTube comments with quantitative predictive modeling. Themes from text mining were mapped to Health Belief Model (HBM) constructs and contextual factors, then used in multiple regression to predict vaccination intention.
Data collection: Focused on YouTube videos concerning vaccine efficacy announcements in November 2020 (Pfizer-BioNTech, Moderna, AstraZeneca-Oxford). Search keyword: 'COVID-19 vaccine efficacy'. Video sources limited to accredited mainstream news outlets (e.g., ABC, BBC, CBC, CBS, CNBC, CNN, FOX, Los Angeles Times, MSNBC, NBC, SkyNews, The Sun, TODAY). Videos were excluded if comments were disabled or numbered fewer than 500. Comments were scraped using Botster (Seobot). Total collected: 43,775 comments. Exclusions: non-English (n=41), focused on the virus rather than the vaccine (n=206), and hyperlinks/ads (n=325), for 572 removed in total and 43,203 comments retained. Identifiers (nicknames, URLs, dates) were removed to ensure anonymity.
Text analytics: Conducted in SAS Text Miner 9.4. Pipeline: text parsing (tokenization, POS tagging, stopword removal), text filtering (rare term removal, synonym grouping, TF-IDF weighting), and text clustering. Singular Value Decomposition (SVD) reduced dimensionality (k=50, default). A hierarchical clustering solution with a target of 40 clusters was produced; clusters were evaluated by RMSSTD (low values indicated good cohesion). Researchers inductively interpreted and labeled clusters via independent reading and consensus, aligning cluster content with measurable constructs (HBM and contextual factors). Eleven salient clusters were identified and labeled: Vaccination intention (Cluster 20), Political ideologies (21), Perceived trust in pharma (19), Perceived trust in government (10), Perceived trust in media (14), Perceived barriers (18), Perceived severity (24), Perceived benefit (9), Perceived susceptibility (13), and Vaccine misinformation (clusters 23 and 15).
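To make the cohesion criterion concrete, here is a minimal sketch of RMSSTD (root-mean-square standard deviation) for a single cluster, using the common pooled definition: the within-cluster sum of squares divided by p(n-1), then square-rooted. The summary does not say exactly how SAS Text Miner computes the statistic, so treat this as an illustration rather than the authors' procedure. The function name `rmsstd` and both toy clusters are hypothetical.

```python
import math


def rmsstd(points):
    """Root-mean-square standard deviation of one cluster.

    points: list of equal-length numeric vectors (for example, a
    document's coordinates after SVD). Assumes the pooled definition
    sqrt(within-cluster sum of squares / (p * (n - 1))).
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points to estimate spread")
    p = len(points[0])
    # Per-dimension mean of the cluster (its centroid).
    means = [sum(pt[j] for pt in points) / n for j in range(p)]
    # Squared deviations from the centroid, summed over all dimensions.
    ss = sum((pt[j] - means[j]) ** 2 for pt in points for j in range(p))
    return math.sqrt(ss / (p * (n - 1)))


# Hypothetical toy clusters in two dimensions (not data from the study).
tight = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1]]
loose = [[0.0, 0.0], [2.0, 2.0], [4.0, 0.0]]

print(rmsstd(tight), rmsstd(loose))
```

The tightly packed cluster yields the smaller value, which is the sense in which a low RMSSTD signals good cohesion.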
Quantitative modeling: The text cluster algorithm produced distances to centroids across iterations (46 samples). Observed mean values per cluster served as variables. Dependent variable: Vaccination intention (Cluster 20). Independent variables initially included political ideologies; perceived trust in pharma, government, and media; perceived barriers, susceptibility, benefit, and severity; and two misinformation clusters. SPSS 22.0 multiple regression was used. Diagnostics: outliers were checked via boxplots (case 1 was kept because removing it reduced R^2 by more than 40% without meaningfully changing the coefficients); collinearity was assessed via VIF (threshold 5.0), and variables with high VIF were removed sequentially (misinformation1, then perceived trust in government). Linearity, homoscedasticity, and normality of residuals were assessed via residual plots, LOESS trend, P–P and Q–Q plots, and were found acceptable.
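The sequential VIF screening can be sketched as follows. This is an illustrative, standard-library-only version, not the SPSS procedure used in the study. It relies on the identity that each predictor's VIF equals the corresponding diagonal entry of the inverse correlation matrix. It also assumes the predictor with the highest VIF is dropped at each step, a selection rule the summary does not state explicitly. All function names and the toy data are hypothetical.

```python
import math


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length columns."""
    n = len(columns[0])
    centered = [[v - sum(col) / n for v in col] for col in columns]
    norms = [math.sqrt(sum(v * v for v in c)) for c in centered]
    k = len(columns)
    return [[sum(a * b for a, b in zip(centered[i], centered[j]))
             / (norms[i] * norms[j]) for j in range(k)] for i in range(k)]


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with pivoting."""
    k = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(matrix)]
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(aug[r][c]))
        if abs(aug[pivot][c]) < 1e-12:
            raise ValueError("singular matrix (perfect collinearity)")
        aug[c], aug[pivot] = aug[pivot], aug[c]
        scale = aug[c][c]
        aug[c] = [v / scale for v in aug[c]]
        for r in range(k):
            if r != c:
                factor = aug[r][c]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[c])]
    return [row[k:] for row in aug]


def vifs(data):
    """VIF per predictor: diagonal of the inverse correlation matrix."""
    names = list(data)
    inv = invert(correlation_matrix([data[name] for name in names]))
    return {name: inv[i][i] for i, name in enumerate(names)}


def prune_by_vif(data, threshold=5.0):
    """Drop the highest-VIF predictor until every VIF is <= threshold."""
    kept, dropped = dict(data), []
    while len(kept) > 1:
        scores = vifs(kept)
        worst = max(scores, key=scores.get)
        if scores[worst] <= threshold:
            break
        dropped.append(worst)
        del kept[worst]
    return list(kept), dropped


# Hypothetical toy predictors: "a" and "b" are nearly collinear.
toy = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [1.1, 1.9, 3.2, 3.9, 5.1, 6.0],
    "c": [3.0, 1.0, 2.0, 3.0, 1.0, 2.0],
}
kept, dropped = prune_by_vif(toy)
print("kept:", kept, "dropped:", dropped)
```

Recomputing the VIFs after every removal matters because dropping one collinear variable usually lowers the VIFs of those that remain, so a single-pass filter could remove more predictors than necessary.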
Key Findings
- The social media discourse exhibited polarization, with a majority of commenters expressing unwillingness to receive a COVID-19 vaccine.
- Reasons for hesitancy included concerns about safety and side-effects, questions about effectiveness, lack of knowledge, and distrust of government, pharmaceutical companies, and mainstream media.
- Political partisanship was evident in discussions and associated with vaccination intention.
- Anti-vaccine movements and misinformation (conspiracy narratives such as 5G, microchips, New World Order) were prevalent and perceived to undermine confidence.
- Text clustering yielded 11 labeled themes mapped to HBM constructs and contextual factors; prominent cluster frequencies included Vaccination intention (13%), Political ideologies (11%), Perceived trust in pharma (10%), and several others at 8–9%.
- Multiple regression predicting vaccination intention showed strong model fit: R=0.864, R^2=0.747, Adjusted R^2=0.692, F(8,37)=13.628, p<0.001.
- Significant predictors: Perceived susceptibility (β=0.379, t=2.981, p=0.005) and Political ideologies (β=0.371, t=2.663, p=0.011).
- Non-significant in the final model: Perceived trust in pharma (β=0.137, p=0.230), Perceived trust in media (β=0.174, p=0.238), Perceived barriers (β=0.182, p=0.114), Perceived benefit (β=-0.129, p=0.442), Perceived severity (β=-0.119, p=0.474), Misinformation2 (β=0.060, p=0.631).
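The reported fit statistics can be cross-checked from R^2 alone, given the 46 samples noted in the methodology and the 8 predictors in the final model. The sketch below applies the standard formulas for adjusted R^2 and the overall F statistic. The function name `fit_statistics` is hypothetical, and because the input R^2 is rounded to three decimals, F is reproduced only approximately.

```python
import math


def fit_statistics(r_squared, n, k):
    """Derive R, adjusted R^2, F, and degrees of freedom from R^2.

    n: number of observations; k: number of predictors.
    """
    df_model = k
    df_resid = n - k - 1
    r = math.sqrt(r_squared)
    adj_r_squared = 1 - (1 - r_squared) * (n - 1) / df_resid
    f_stat = (r_squared / df_model) / ((1 - r_squared) / df_resid)
    return r, adj_r_squared, f_stat, (df_model, df_resid)


# Values taken from the summary: R^2 = 0.747, 46 samples, 8 predictors.
r, adj, f_stat, df = fit_statistics(0.747, n=46, k=8)
print(f"R={r:.3f}  adj R^2={adj:.3f}  F{df}={f_stat:.2f}")
```

Under these inputs the degrees of freedom come out as (8, 37), R rounds to 0.864, and adjusted R^2 rounds to 0.692, all matching the reported values. F comes out near 13.66 rather than 13.628, a gap consistent with rounding in the published R^2.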
Discussion
Findings indicate a highly polarized online ecology around COVID-19 vaccines, where hesitancy is driven by perceived risks (safety/side-effects), questions of efficacy, and distrust toward institutions (government, pharma, mainstream media). Mapping discourse to HBM constructs suggests that perceived susceptibility plays a central role in shaping intention, aligning with prior health behavior research. The significant association between political ideologies and intention underscores the politicization of vaccination; conservative-leaning users were less likely to intend vaccination, consistent with partisan echo chambers, lower trust in authorities, and greater exposure to or acceptance of misinformation. While perceived trust in media and pharma featured prominently in discourse, their direct statistical associations with intention were not significant in the final regression, though they may influence attitudes indirectly. Practically, results support the value of social media big data analytics for near real-time surveillance of vaccine sentiment, enabling targeted communication strategies that address key belief structures (enhancing perceived susceptibility and benefits, addressing barriers) and acknowledge the political information environment. Restoring trust through transparent, depoliticized communication and engaging communities on-platform may mitigate hesitancy.
Conclusion
The study demonstrates a big data approach to extract and prioritize vaccine hesitancy factors from YouTube discourse and to link them to vaccination intention via HBM-informed modeling. Eleven themes were identified, with perceived susceptibility and political ideologies emerging as the strongest predictors of intention. These insights can guide public health practitioners to craft evidence-based, platform-adapted messages that address core beliefs and the politicized context of vaccine information, helping improve uptake toward herd immunity thresholds.
Limitations
- Temporal scope: Data limited to November 2020 around vaccine efficacy announcements and coinciding with the 2020 US election, potentially inflating political content.
- Platform/sample bias: Only YouTube comments from mainstream media channels were analyzed; results may not generalize across platforms or broader populations.
- Lack of demographics: No user-level demographic data were available, preventing subgroup analyses (age, gender, ethnicity).
- Novel measurement approach: Text-derived constructs and single observed variables per cluster limit psychometric validation; future work should develop survey scales to validate and extend the predictive model with larger samples and effect size estimation.
- NLP method constraints: Unigram-based parsing and stemming may lead to cluster overlap; n-gram approaches could improve linguistic precision but require more computational resources.
- HBM scope: The cues to action construct was not observed in the data and thus not modeled.