Not cool, calm or collected: Using emotional language to detect COVID-19 misinformation

Computer Science

G. Asher, P. Bohlman, et al.

COVID-19 misinformation is a major hurdle to effective pandemic management, and a team of researchers from Dartmouth College has tackled this issue head-on. Their model combines a tweet-emotion encoder with a misinformation encoder to better detect false information on Twitter, outperforming traditional detection methods.

Introduction
COVID-19 has caused widespread mortality and significant social disruption. Social media platforms (Twitter, Facebook, YouTube, Instagram) amplify misinformation related to vaccine hesitancy, virus origins and transmission, and unsubstantiated medical advice. This misinformation can undermine public health interventions and reduce vaccine uptake, especially among high-risk populations.

Prior research has explored machine learning methods for misinformation detection (e.g., dEFEND; BERT-based classifiers) and produced several benchmark datasets (e.g., Patwa et al., CoAID, COVID-HeRA, COVIDLies). Emotions play a critical role in the spread of misinformation: anxiety, skepticism, and other strong affective states can increase belief in and sharing of false claims. Twitter in particular is known for toxic and emotive language, which is especially pronounced in COVID-19 discussions.

Inspired by successes incorporating external knowledge into NLP models in other domains (e.g., mental health classifiers using external encoders), the authors pose the primary research question: to what extent does encoding tweet emotions into COVID-19 misinformation detection models improve model accuracy? Secondary questions include: What is the relationship between tweet emotion and COVID-19 misinformation? What are the trends in correctly and incorrectly predicted misinformation labels? The paper proposes a dual-encoder architecture leveraging both emotion and misinformation encoders to address these questions.
Literature Review
Multiple studies have applied machine learning to COVID-19 misinformation detection, including generalized misinformation systems like dEFEND (with word, sentence, and comment encoders) and various BERT fine-tuning approaches. Several datasets have facilitated this research: Patwa et al.'s 10,700-post dataset, CoAID (over 200,000 annotated items), Kim et al.'s 722-claim/150,000-tweet dataset, and COVID-HeRA (61,286 tweets labeled by misinformation severity). Beyond COVID-19, prior NLP work shows benefits from encoding external information alongside task-specific features; for instance, Lokala et al. improved cardiovascular disease symptom classification with separate external and task-specific encoders. Given the prevalence of emotional and toxic language on Twitter and the emotional salience of pandemic discussions, incorporating emotion signals may aid misinformation detection.
Methodology
Datasets and acquisition: The primary dataset is a subset of COVID-HeRA (Dharawat et al.), which labels tweets across five classes: Real news/claims, Possibly severe, Highly severe, Other, and Refutes/rebuts. The original dataset contains 61,286 tweets with an inter-annotator Cohen's kappa of 0.7037; due to computational constraints and class imbalance, the authors filtered it to 6,355 tweets.

Emotion dataset: From the COVID-HeRA subset, 100 unique training tweets were randomly sampled and manually annotated by three native English speakers with one of seven emotions: anger, disgust, fear, joy, sadness, surprise, or neutral. Annotator agreement was moderate (kappa 0.54). A Sentence-BERT (SBERT) base model was fine-tuned on a 90/10 train/test split of these annotations, achieving 72% agreement with the labels.

Proposed architecture: Two task-adaptive SBERT encoders are fine-tuned: (1) an emotion encoder trained on the hand-annotated seven-emotion dataset, and (2) a COVID-19 misinformation encoder trained on the COVID-HeRA subset for severity classification. For a given input tweet, embeddings are generated by both encoders; the [CLS] tokens from each embedding are extracted and concatenated, then fed into a multi-layer perceptron (MLP) to predict the misinformation severity label.

Preprocessing: Tweets unavailable due to deletion or removal were filtered out, mentions and links were removed via regex, and the data were split 80% train / 20% test.

Baselines: (a) TF-IDF feature vectors passed to a Random Forest classifier; (b) averaged GloVe embeddings (pre-trained on 6B tokens with a 400k vocabulary from Wikipedia + Gigaword 5) passed to logistic regression; (c) SBERT (all-MiniLM-L6-v2, pre-trained on 1B sentence pairs) embeddings passed to logistic regression or an MLP.

Evaluation: Multi-class classification across the five COVID-HeRA severity labels, reporting per-class precision, recall, and F1, along with macro- and weighted-average scores and accuracy.
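The preprocessing step can be sketched as follows. This is a minimal illustration of stripping mentions and links via regex; the exact patterns the authors used are not given in the paper, so the regexes here are assumptions.

```python
import re

def preprocess_tweet(text: str) -> str:
    """Strip @mentions and URLs from a tweet, then tidy whitespace.

    A minimal sketch of the described preprocessing; the authors' exact
    regexes are not specified.
    """
    text = re.sub(r"@\w+", "", text)           # remove @mentions
    text = re.sub(r"https?://\S+", "", text)   # remove links
    return re.sub(r"\s+", " ", text).strip()   # collapse leftover whitespace

print(preprocess_tweet("@user Check this https://t.co/abc out!"))
# → "Check this out!"
```

In practice this would be applied to every tweet before encoding, alongside the filtering of deleted or removed tweets.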
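The dual-encoder fusion described above can be sketched in a few lines of NumPy. The random projections below are hypothetical stand-ins for the two fine-tuned SBERT encoders (in the real model each would be a transformer forward pass yielding a [CLS] embedding), and the layer sizes other than MiniLM's 384-dimensional output are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM = 384   # all-MiniLM-L6-v2 embedding size
N_CLASSES = 5   # the five COVID-HeRA severity labels

# Hypothetical stand-ins for the fine-tuned emotion and misinformation
# encoders: random projections so the sketch runs without model weights.
W_emotion = rng.normal(size=(EMB_DIM, EMB_DIM))
W_misinfo = rng.normal(size=(EMB_DIM, EMB_DIM))

def encode(tweet_vec: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Placeholder for an SBERT forward pass returning a [CLS] embedding."""
    return np.tanh(tweet_vec @ W)

# One-hidden-layer MLP head over the concatenated embeddings
# (hidden size 128 is an illustrative choice).
W1 = rng.normal(size=(2 * EMB_DIM, 128)) * 0.05
b1 = np.zeros(128)
W2 = rng.normal(size=(128, N_CLASSES)) * 0.05
b2 = np.zeros(N_CLASSES)

def predict_proba(tweet_vec: np.ndarray) -> np.ndarray:
    cls_emotion = encode(tweet_vec, W_emotion)
    cls_misinfo = encode(tweet_vec, W_misinfo)
    x = np.concatenate([cls_emotion, cls_misinfo])  # dual-encoder fusion
    h = np.maximum(0.0, x @ W1 + b1)                # ReLU hidden layer
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max())
    return e / e.sum()                              # softmax over 5 labels

probs = predict_proba(rng.normal(size=EMB_DIM))
```

The key design choice this captures is that emotion and misinformation signals are learned by separate encoders and only fused at the classifier head, rather than asking a single encoder to represent both.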
Key Findings
- The dual-encoder Emotion model outperforms all baselines (TF-IDF, GloVe, SBERT) on aggregate metrics (macro and weighted precision, recall, F1, and accuracy).
- Per-class improvements are strongest for the misinformation-containing categories: Other, Possibly severe, and Highly severe. Performance is marginally better on Real news/claims and worse on Refutes/rebuts relative to SBERT.
- Reported metrics (examples from Table 4): precision (weighted avg) Emotion 0.64 vs SBERT 0.60, (macro avg) Emotion 0.54 vs SBERT 0.48; recall (weighted avg) Emotion 0.60 vs SBERT 0.56, (macro avg) Emotion 0.37 vs SBERT 0.32; F1-score (weighted avg) Emotion 0.54 vs SBERT 0.48, (macro avg) Emotion 0.39 vs SBERT 0.32; accuracy Emotion 0.60 vs SBERT 0.56.
- Recall gains are emphasized as most important for real-world moderation pipelines aiming to minimize false negatives.
- Interpretability analyses (Captum) show the emotion model focuses on charged words (e.g., "quack", "sheep", "indoctrination") for anger/disgust/fear, with the inverse attributions for neutral, aligning with intuition.
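The gap between the macro and weighted averages above reflects class imbalance. To make the distinction concrete, here is a small sketch using illustrative per-class F1 scores and supports (hypothetical numbers, not the paper's per-class Table 4 values):

```python
import numpy as np

# Hypothetical per-class F1 scores and test-set supports for five
# severity classes (illustrative only, not the paper's actual values).
f1 = np.array([0.75, 0.40, 0.30, 0.35, 0.15])
support = np.array([700, 150, 100, 250, 70])

macro_f1 = f1.mean()                                # every class counts equally
weighted_f1 = (f1 * support).sum() / support.sum()  # majority classes dominate
```

With a large, easy Real news class, the weighted average sits well above the macro average; the macro score is the harsher test of performance on the rarer misinformation classes.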
Discussion
The findings support the hypothesis that emotional features add complementary signal to contextual/factual encodings for COVID-19 misinformation detection. The dual-encoder setup consistently outperforms a single SBERT encoder, indicating that emotion-aware representations help identify tweets asserting misinformation, especially in the Other, Possibly severe, and Highly severe classes. This is particularly valuable when optimizing for recall in applications like content moderation, where surfacing potentially harmful content for human review is critical. However, the model struggles with tweets labeled Refutes/Rebuts and Highly severe—classes that often share similar content and rely on tonal cues (e.g., sarcasm) for differentiation. UMAP projections of embeddings show clustering overlaps between these categories and a tendency for test embeddings to cluster as Real News, suggesting distributional and labeling challenges. Overall, the results demonstrate promise for emotion-augmented architectures but also highlight the need for better data quality, label definitions, and handling of nuanced language phenomena like sarcasm.
Conclusion
Encoding emotion alongside content improves transformer-based detection of COVID-19 misinformation severity on Twitter. The proposed dual SBERT encoder architecture achieved higher accuracy, precision, recall, and F1 scores than TFIDF, GloVe, and single-encoder SBERT baselines. The team also created a hand-annotated emotion dataset to fine-tune the emotion encoder and showed that correctly distinguishing emotionally charged rebuttals from misinformation remains challenging. Future research should refine emotion representations (including sarcasm and nuanced affect), improve label quality and distribution alignment, and investigate generalization to other domains (e.g., elections) and platforms (e.g., news headlines), where emotional language is prevalent.
Limitations
- Train–test label distribution mismatch (e.g., a higher proportion of Real news in training; more Other and Refutes/rebuts in test) likely biased learning and reduced generalization.
- Low-quality annotations in COVID-HeRA: (1) insufficient context (e.g., retweets, bare links) impeding reliable labeling; (2) irrelevant content within the Real news class; (3) label ambiguity (sarcasm, tonal cues) complicating class separation and inducing feature confounds.
- Emotion dataset limitations: small size (100 tweets), moderate inter-annotator agreement (kappa 0.54), and a coarse emotion taxonomy (Ekman's six plus neutral) that misses nuance (e.g., sarcasm, sub-emotions), potentially limiting the emotion encoder's effectiveness.
- Domain mismatch: SBERT was pre-trained on formal, sentence-level corpora rather than informal, noisy, short tweets with emojis and non-standard tokens.
- Evolving ground truth: some topics (e.g., the COVID-19 lab-leak hypothesis) have shifted from being considered misinformation to plausible or likely, potentially invalidating labels and affecting model evaluation.
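Both agreement figures cited above (0.7037 for COVID-HeRA, 0.54 for the emotion labels) are Cohen's kappa values, which discount agreement expected by chance. A minimal two-annotator implementation, with a toy emotion-label example (the labels shown are invented for illustration):

```python
from collections import Counter

def cohens_kappa(a: list, b: list) -> float:
    """Cohen's kappa for two annotators' label lists of equal length."""
    assert len(a) == len(b)
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n       # observed agreement
    ca, cb = Counter(a), Counter(b)
    labels = set(ca) | set(cb)
    p_e = sum(ca[l] * cb[l] for l in labels) / n**2   # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Toy example: two annotators label four tweets with Ekman-style emotions.
k = cohens_kappa(["anger", "fear", "joy", "joy"],
                 ["anger", "joy", "joy", "joy"])
```

A kappa of 0.54, as obtained for the emotion annotations, is conventionally read as moderate agreement, which is why the authors flag it as a limitation.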