
Computer Science

Behavioral Forensics in Social Networks: Identifying Misinformation, Disinformation and Refutation Spreaders Using Machine Learning

E. M. Khan, A. Ram, et al.

Dive into the intriguing world of user behavior on social networks with groundbreaking research by Euna Mehnaz Khan, Ayush Ram, Bhavtosh Rath, Emily Vraga, and Jaideep Srivastava. This innovative study tackles misinformation head-on with a new behavioral forensics model that classifies users based on their reactions, providing vital insights into user intent and cybersecurity.

Introduction
The study addresses the growing problem of misinformation on social networks by moving beyond binary classification of spreaders (false vs. true information) to infer intent. The key research question is how to identify and categorize users, by observing their sequences of actions after exposure to both misinformation and its refutation, into nuanced groups: malicious (intentional spreaders of misinformation), maybe_malicious, naive_self_corrector, informed_sharer, and disengaged. The context is the pervasive impact of misinformation (e.g., during COVID-19), and the purpose is to support fairer and more effective interventions: educating naive spreaders, stopping malicious actors, and leveraging informed sharers to amplify corrections. The key contribution is incorporating behavior after exposure to refutations to better capture intent, which prior work typically ignored, often labeling users as misinformation spreaders on the basis of a single sharing event without considering subsequent corrective behavior.
Literature Review
Prior work on misinformation detection has focused on content features (linguistic/stylistic, deep learning models), propagation structures, user profile features, and past sharing histories. Some works identify spreaders using profile, linguistic/personality, sentiment, and topic features, or analyze network interactions and structures. However, existing approaches largely perform binary classification (misinformation vs. refutation/true spreaders) and label users as misinformation spreaders upon a single share of false content, ignoring whether they later encountered and shared refutations. No prior work incorporates a sequence-of-behavior framework that conditions labels on exposure to both misinformation and its refutation to infer intent. This paper fills that gap by defining multiple behavior-based categories and leveraging network embeddings for classification.
Methodology
Overview: The approach, termed behavioral forensics, consists of (1) behavior-based labeling after exposure to both misinformation and its refutation, (2) learning network-embedding features from the follower–followee graph, and (3) a two-step machine learning classification.

Labeling mechanism and classes: Users are labeled only if they are exposed to at least one pair of a misinformation message (m) and its corresponding refutation (r). Exposure is operationalized as follows (a minimal code sketch of this rule follows below): when a user tweets or retweets m or r, all of their followers are assumed exposed, with exposure time set to the (re)tweet time; if multiple followees share the same message, the earliest share time is used; and if a user retweets a message before any followee has shared it, the exposure time is the user's first retweet time. Each user's sequence of actions (share / not share) upon exposure is mapped to one of five classes via a state diagram:
- malicious: shared misinformation after exposure to both m and r (e.g., G→I, O→P, or C→D→F), indicating intentionality.
- maybe_malicious: (i) shared m, then saw r but did not share it (stayed at D), or (ii) shared r and later shared m (e.g., A→B→G→H→S; A→J→K→L→M; A→J→O→Q→U). Intent is ambiguous, but the user still contributes to misinformation spread.
- naive_self_corrector: shared m and later shared r after exposure to both (e.g., A→B→C→D→E; A→B→G→I→T; A→J→O→P→R).
- informed_sharer: (i) shared r after exposure to both without ever sharing m (e.g., A→B→G→H; A→J→O→Q), or (ii) shared r, then after receiving m (K→L) either did not share m (L) or shared r again (L→N).
- disengaged: exposed to both m and r but shared neither (e.g., states G or O). Users can transition out of disengaged if they later share.

Multi-event label consolidation: Users exposed to multiple m–r pairs may receive multiple labels. Disengaged labels are dropped whenever any non-disengaged label exists. The remaining labels are mapped to integers (malicious=1, maybe_malicious=2, naive_self_corrector=3, informed_sharer=4), and the median is taken, choosing the larger of the two middle values in ties to avoid false positives with malicious as the positive class (see the consolidation sketch below). Users exposed to only one m–r pair simply retain that label.

Graph embeddings: A follower–followee network is built from the labeled users and their connections, and low-dimensional node embeddings that preserve topology and homophily are learned with LINE (second-order proximity, for directed graphs) and PyTorch-BigGraph (PBG). Embedding dimensions of 4, 8, 16, 32, 64, and 128 are tried, and the embeddings are normalized. LINE captures shared-neighborhood similarity even between nodes without direct links, while PBG scales to very large graphs.

Features: Step 1 (disengaged vs. others) uses only the normalized embeddings; profile features are unnecessary given the strong separability. Step 2 (four-class classification among the remaining users) concatenates the normalized embeddings with user profile features: follower count, followee (friend) count, statuses count, listed count, verified (1/0), protected (1/0), and normalized account age in days.

Classification strategy: Because of extreme class imbalance (disengaged ≈ 99.75%), a two-step classification is used. Step 1 separates disengaged users from all others, with random undersampling of the disengaged class to balance the training data (e.g., 4,059 disengaged vs. 3,419 others). Step 2 performs multiclass classification among malicious, maybe_malicious, naive_self_corrector, and informed_sharer. Classifiers evaluated include Logistic Regression (one-vs-rest for multiclass), k-NN (k=5), SVM, Naive Bayes, Decision Tree, Random Forest (100 trees), and Bagged Decision Tree (with SVM as the base estimator). For the step-2 imbalance, class_weight='balanced' is used where available, or SMOTE for oversampling.
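The exposure-time rule above can be made concrete with a short sketch. This is an illustrative reconstruction under the stated assumptions, not the authors' released code; the function name and input layout (per-message lists of followee share times and the user's own (re)tweet times) are hypothetical.

    # Hypothetical sketch of the exposure-time rule described in the Methodology.
    # Inputs: for a single message, the (re)tweet times of the user's followees
    # and the user's own (re)tweet times (e.g., as Unix timestamps).
    def exposure_time(followee_share_times, user_share_times):
        """Return the time the user is considered exposed to the message, or None.

        A user is exposed when any followee (re)tweets the message; with several
        sharing followees, the earliest share time counts. If the user (re)tweeted
        the message before any followee shared it, the exposure time falls back
        to the user's first (re)tweet time.
        """
        earliest_followee = min(followee_share_times) if followee_share_times else None
        first_own = min(user_share_times) if user_share_times else None

        if earliest_followee is None:
            return first_own                  # only the user's own retweet, if any
        if first_own is not None and first_own < earliest_followee:
            return first_own                  # user retweeted before followee exposure
        return earliest_followee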
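For the multi-event consolidation step, a minimal sketch is given below, assuming each user's per-event labels are available as strings. The integer mapping and the tie-breaking rule come from the summary above; the function name and data layout are assumptions for illustration.

    # Minimal sketch of multi-event label consolidation (assumed helper, not the
    # authors' code). Labels are mapped to integers and the upper median is taken
    # so that ties never resolve toward 'malicious' (the positive class).
    LABEL_TO_INT = {
        "malicious": 1,
        "maybe_malicious": 2,
        "naive_self_corrector": 3,
        "informed_sharer": 4,
    }
    INT_TO_LABEL = {v: k for k, v in LABEL_TO_INT.items()}

    def consolidate_labels(event_labels):
        """Collapse a user's per-event labels into a single label."""
        # Drop 'disengaged' labels whenever any non-disengaged label exists.
        engaged = [lbl for lbl in event_labels if lbl != "disengaged"]
        if not engaged:
            return "disengaged"
        values = sorted(LABEL_TO_INT[lbl] for lbl in engaged)
        upper_median = values[len(values) // 2]   # larger middle value on even counts
        return INT_TO_LABEL[upper_median]

    # Example: labels from three m-r pairs.
    print(consolidate_labels(["disengaged", "malicious", "naive_self_corrector"]))
    # -> 'naive_self_corrector' (upper median of {1, 3})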
Evaluation: Step 1 uses 5-fold cross-validation and Step 2 uses 10-fold cross-validation. Baselines: for Step 1, a random classifier (50% accuracy), since no directly comparable prior method exists; for Step 2, two baselines, predicting the majority class (naive_self_corrector) for all users and predicting a random class. t-SNE visualizations illustrate how class clusters emerge as the embedding dimensionality increases. Code is available at https://github.com/eunakhan/behavioralforensics.
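The two-step strategy can be sketched with standard scikit-learn and imbalanced-learn components. This is a minimal, hedged reconstruction, not the released implementation: the helper names, the choice of resamplers, and hyperparameters such as n_estimators are illustrative assumptions (the paper reports its own classifier settings).

    # Hedged sketch of the two-step classification pipeline (assumed helpers, not
    # the authors' code). Resampling happens inside each CV fold via imblearn
    # pipelines to avoid leakage.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.ensemble import BaggingClassifier
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from imblearn.pipeline import Pipeline
    from imblearn.under_sampling import RandomUnderSampler
    from imblearn.over_sampling import SMOTE

    def evaluate_step1(X_embed, y_disengaged_vs_others):
        """5-fold CV accuracy for disengaged vs. others, on embeddings only."""
        pipe = Pipeline([
            ("undersample", RandomUnderSampler(random_state=0)),  # shrink 'disengaged'
            ("clf", LogisticRegression(max_iter=1000)),
        ])
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
        return cross_val_score(pipe, X_embed, y_disengaged_vs_others,
                               cv=cv, scoring="accuracy")

    def evaluate_step2(X_embed_others, X_profile_others, y_four_class):
        """10-fold CV weighted F1 for the four-class problem.

        Embeddings are concatenated with profile features; SMOTE oversamples the
        minority classes inside each fold. A bagged SVM stands in for the paper's
        best-performing bagged classifier ('estimator' is called 'base_estimator'
        in scikit-learn versions before 1.2).
        """
        X = np.hstack([X_embed_others, X_profile_others])
        pipe = Pipeline([
            ("oversample", SMOTE(random_state=0)),
            ("clf", BaggingClassifier(estimator=SVC(), n_estimators=10,
                                      random_state=0)),
        ])
        cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
        return cross_val_score(pipe, X, y_four_class, cv=cv, scoring="f1_weighted")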
Key Findings
Data and labels: Using a Twitter dataset of 8 political news events from 2019 (sourced from altnews.in), 1,365,929 users were labeled. Class distribution: 99.75% disengaged (1,362,510 users); among the 3,419 non-disengaged users: malicious=926, maybe_malicious=222, naive_self_corrector=1,452, informed_sharer=819.

Step 1 (two-class, disengaged vs. others) with 128-dimensional LINE embeddings: Logistic Regression achieved 99.251% accuracy (Bagged Decision Tree 99.157%, SVM 98.957%, Random Forest 97.794%, Decision Tree 86.936%, k-NN 95.318%, Naive Bayes 81.236%), far surpassing the 50% random baseline. ROC curves show excellent separability.

Step 2 (multiclass among non-disengaged users) with 128-dimensional LINE embeddings plus profile features: the Bagged Decision Tree performed best, with 73.637% accuracy and 72.215% weighted F1, outperforming both the majority-class and random baselines. Class-wise performance (Bagged Decision Tree):
- malicious: precision 77.446%, recall 75.80% (k-NN reached 75.812% precision with similar recall).
- maybe_malicious: precision 92.078%.
- naive_self_corrector: precision 64.246%.
- informed_sharer: precision 69.073%.

Embedding dimensionality: Increasing the embedding dimension improved performance, with noticeable cluster formation from 16 dimensions onward; gains plateaued around 64 dimensions. t-SNE visualizations show emergent clustering, e.g., malicious and informed_sharer forming distinct regions at higher dimensions (a visualization sketch follows below).

Headline result: Detecting malicious (intentional) spreaders achieved 77.45% precision and 75.80% recall, enabling targeted interventions against disinformation agents.
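As an illustration of how such cluster plots can be produced, the sketch below projects the learned node embeddings to two dimensions with t-SNE and colors points by consolidated label. The function name and inputs are assumptions for illustration, not the paper's plotting code.

    # Illustrative t-SNE inspection of user embeddings (assumed helper, not the
    # authors' code): project embeddings to 2-D and scatter-plot them by class.
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    def plot_embedding_clusters(X_embed, labels, title="t-SNE of user embeddings"):
        """Project node embeddings to 2-D with t-SNE and color points by class."""
        labels = np.asarray(labels)
        coords = TSNE(n_components=2, init="pca", random_state=0).fit_transform(X_embed)
        for cls in np.unique(labels):
            mask = labels == cls
            plt.scatter(coords[mask, 0], coords[mask, 1], s=5, label=str(cls))
        plt.legend(markerscale=3)
        plt.title(title)
        plt.show()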
Discussion
The findings demonstrate that intent-aware, behavior-based labeling—requiring exposure to both misinformation and refutation—combined with network embeddings can effectively identify nuanced categories of spreaders. Network structure alone provides strong signal to distinguish disengaged from others, and, when combined with profile features, supports meaningful multiclass separation across malicious, maybe_malicious, naive_self_corrector, and informed_sharer. The approach addresses shortcomings of prior binary labeling that ignored corrective behavior and thus conflated naivety with malice. The precision–recall trade-offs suggest policy-aligned model selection: prioritize precision if banning malicious accounts to avoid false positives, or prioritize recall if the goal is to preemptively treat followers of malicious users with refutations. Similarly, maximizing recall for naive_self_corrector and informed_sharer can facilitate peer correction and amplification of refutations to mitigate misinformation spread.
Conclusion
The work introduces a novel behavior-based, multi-class labeling framework for identifying misinformation, disinformation, and refutation spreaders, requiring exposure to both misinformation and refutation to better infer user intent. Leveraging follower–followee network embeddings and user profile features within a two-step classification pipeline, the model can categorize users even without direct behavioral histories. Results on a large Twitter dataset show high accuracy in separating disengaged users and strong performance in multiclass prediction, particularly for identifying malicious actors with 77.45% precision and 75.80% recall. The approach is general and applicable across social platforms. Future work includes dynamically updating user labels as new behaviors emerge to reflect behavior changes over time, and studying how user categories vary across topics, enabling topic-sensitive and adaptive interventions.
Limitations
- Exposure approximation: all followers are assumed exposed when a followee (re)tweets, with exposure time set to the earliest (re)tweet; actual view data and timestamps are unavailable via the API, which may introduce exposure misclassification.
- Data scope: Twitter-only, 8 political news events from 2019; retweet times were missing for two events (excluded). Findings may not generalize across platforms, languages, or domains.
- Class imbalance: extreme skew (≈99.75% disengaged) required undersampling and synthetic oversampling, which can affect classifier stability and generalizability.
- Behavioral sparsity: very few users exhibit non-disengaged actions; labeling across multiple events is limited, prompting consolidation heuristics (median of ordinal labels with tie-breaking) that may introduce bias.
- Intent inference: intent is inferred from observed sequences and may be imperfect; users' motivations cannot be directly observed.
- Feature limits: behavior is defined via retweets only; content semantics and temporal dynamics beyond exposure/retweet timing are not fully leveraged.
- Embedding dependence: performance depends on embedding dimension/quality and on the assumption that homophily/topology reflect behavioral categories.