Decoding violence against women: analysing harassment in Middle Eastern literature with machine learning and sentiment analysis

Humanities

H. Q. Low, P. Keikhosrokiani, et al.

This study by Hui Qi Low, Pantea Keikhosrokiani, and Moussa Pourya Asl applies natural language processing and machine learning to examine depictions of sexual harassment in twelve Anglophone Middle Eastern novels. A logistic regression classifier distinguished physical from non-physical harassment with 75.8% accuracy, and sentiment analysis revealed a predominance of negative sentiment in harassment passages, with fear especially prominent in instances of physical harassment.

Introduction
The study addresses the challenge of extracting reliable, structured insights on sexual harassment from Middle Eastern literary texts, a task hindered by the scale of data, cognitive limits, and interpretive biases in manual analysis. It aims to (1) identify and classify instances of sexual harassment in Anglophone Middle Eastern literature into physical and non-physical offenses, and (2) characterize the sentiments and emotions associated with these instances. The work situates itself in the context of rising reported harassment in the region and the increasing use of computational methods to complement traditional interpretive scholarship. By leveraging NLP, rule-based detection, machine learning, and deep learning, the authors seek a scalable, less biased approach for typology detection and sentiment/emotion characterization to inform broader understanding and intervention efforts.
Literature Review
Background outlines three sexual harassment categories—gender harassment, unwanted sexual attention, and sexual coercion—and discusses how patriarchal norms, honor cultures, and modesty expectations in Middle Eastern societies shape and perpetuate these behaviors, often silencing victims and rationalizing harassment. The review surveys computational literary studies and text classification techniques (e.g., Rocchio, boosting/bagging, logistic regression, naïve Bayes, k-NN, SVM, decision trees, random forests, CRF, semi-supervised methods; and deep learning families such as RNNs, attention, memory-augmented, GNNs, Siamese networks, and transformer-based models). Prior harassment-related NLP works include Rezvan et al. (2020) on harassment classification on Twitter (GBM outperforming others), Wright et al. (2017) on street harassment narratives with POS-based analysis, and Yin et al. (2009) on online harassment detection with hybrid models. Sentiment/emotion analysis techniques are contrasted between supervised approaches and lexicon-based methods (SentWordNet, MPQA, LIWC), noting trade-offs (domain specificity vs. semantic coverage). Related works include Alwaneh/Alavi et al. (2021) on harassment detection with multiple ML classifiers (SGD/Linear SVC ~80% accuracy) and Aslam et al. (2022) showing LSTM-GRU outperforming for sentiment (0.99) and emotion (0.91) on Twitter cryptocurrency data. The review emphasizes both the promise and biases of computational approaches in social and literary analysis.
Methodology
Data: Twelve Anglophone novels set in Middle Eastern contexts (e.g., Barakat’s Balcony on the Moon, Mikhail/Weiss’s The Beekeeper, Navai’s City of Lies). Texts (EPUB/PDF) were converted to plain text using EbookLib and PyPDF2.

Preprocessing: sentence tokenization (NLTK), contraction expansion, POS tagging, word tokenization (alphabetic-only regex), lowercasing, stopword removal (extended with frequent verbs and character names), and lemmatization. The corpus totals 58,458 sentences, with per-book statistics reported.

Rule-based detection and manual labeling: a published sexual harassment lexicon (Rezvan et al., 2020) flagged an initial 570 sentences (~1% of the corpus). Human review assessed conceptual relevance and confirmed 108 sentences as true instances of sexual harassment. Each was labeled with (a) harassment type (gender harassment, unwanted sexual attention, sexual coercion) and (b) offense type (physical vs. non-physical), yielding 65 physical and 43 non-physical instances. The examples show that lexical hits alone do not imply harassment without context, which necessitated manual interpretation.

Model 1 (offense-type classification, physical vs. non-physical): TF-IDF features with dimensionality reduction via PCA; a 70/30 train/test split on the 108 labeled instances. Algorithms: k-NN, logistic regression (LR), random forest (RF), multinomial naïve Bayes (MNB), stochastic gradient descent (SGD), and support vector classification (SVC). Baseline models were built and then tuned with GridSearchCV (hyperparameter grids provided), and evaluated on accuracy, precision, recall, and F1.

Sentiment labeling (for analysis and deep learning training): lexicon-based sentiment with NLTK VADER produced positive/negative/neutral/compound scores; the compound score was mapped to three labels (negative < 0, positive > 0, neutral = 0).
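The Model 1 pipeline described above can be sketched with scikit-learn as follows. This is a minimal illustration, not the authors' code: the toy sentences, their labels, and the small `C` grid are placeholders, and a densifying step is added because scikit-learn's PCA requires a dense matrix:

```python
# Sketch of a TF-IDF -> PCA -> logistic regression pipeline tuned with
# GridSearchCV, mirroring the paper's Model 1 setup (70/30 split).
# Sentences and labels below are illustrative placeholders only.
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Toy data standing in for the 108 curated instances (1 = physical, 0 = non-physical).
sentences = [
    "he grabbed her arm in the street",
    "he pushed her against the wall",
    "he touched her shoulder without consent",
    "he seized her wrist and blocked her path",
    "he shoved her near the market",
    "he pulled her scarf forcefully",
    "he shouted lewd comments at her",
    "he sent her threatening messages",
    "he stared and whistled at her",
    "he made degrading remarks about her",
    "he spread rumours about her honour",
    "he demanded that she smile at him",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    # PCA needs a dense array, so convert the sparse TF-IDF output.
    ("densify", FunctionTransformer(lambda X: X.toarray(), accept_sparse=True)),
    ("pca", PCA(n_components=2)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# 70/30 train/test split, as in the paper.
X_train, X_test, y_train, y_test = train_test_split(
    sentences, labels, test_size=0.3, random_state=42, stratify=labels
)

# GridSearchCV tunes the classifier; the grid here is a stand-in.
grid = GridSearchCV(pipeline, {"clf__C": [0.1, 1.0, 10.0]}, cv=2)
grid.fit(X_train, y_train)
acc = grid.score(X_test, y_test)
print(f"test accuracy: {acc:.2f}")
```

On the real 108-sentence dataset the same structure would be evaluated with accuracy, precision, recall, and F1, as the paper reports.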
Emotion labeling: emotions were extracted with Text2Emotion (happy, angry, surprise, sad, fear); the highest-scoring emotion per sentence was assigned as its label.

Model 2 (deep learning sentiment and emotion classification): using the full sentence set (~58k sentences) labeled via the lexicons, an LSTM-GRU sequential model with GloVe embeddings was built. Architecture: Embedding, LSTM(128, return_sequences=True) + Dropout(0.5), GRU(64) + Dropout(0.5), Dense output with softmax. Optimization: Adam with categorical cross-entropy, trained for up to ~20 epochs with the best epoch selected (e.g., ~6 epochs for sentiment). Outputs: a 3-way sentiment classifier and a 5-way emotion classifier, evaluated by accuracy (with sample distributions and plots). Word cloud visualizations were produced for nouns and verbs in harassment sentences.
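The two lexicon-to-label mappings used to generate training labels can be sketched as below. The score dictionaries only mimic the output shapes of NLTK VADER (a compound score in [-1, 1]) and Text2Emotion (five emotion scores); the function names are illustrative, not from the paper's code:

```python
# Sketch of the label-mapping step: VADER compound -> 3 sentiment labels,
# and Text2Emotion scores -> the single highest-scoring emotion.

def sentiment_label(compound: float) -> str:
    """Map a VADER compound score to the paper's three sentiment labels."""
    if compound > 0:
        return "positive"
    if compound < 0:
        return "negative"
    return "neutral"  # compound == 0

def emotion_label(scores: dict) -> str:
    """Assign the highest-scoring Text2Emotion category as the sentence label."""
    return max(scores, key=scores.get)

print(sentiment_label(-0.62))  # -> negative
print(emotion_label({"Happy": 0.0, "Angry": 0.1, "Surprise": 0.2,
                     "Sad": 0.1, "Fear": 0.6}))  # -> Fear
```

Applying these two functions across all ~58k sentences yields the 3-way sentiment labels and 5-way emotion labels used to train the LSTM-GRU classifiers.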
Key Findings
- Corpus scale and detection: 58,458 sentences processed across 12 novels. The rule-based lexicon flagged 570 sentences (~1% of the corpus); manual review confirmed 108 conceptually harassing sentences, comprising 65 physical and 43 non-physical offense instances.
- Offense-type classification (physical vs. non-physical): test-set accuracies of roughly LR 0.758, RF 0.760, k-NN 0.636, MNB 0.606, SGD 0.667, and SVC 0.606. The paper highlights LR (75.8% accuracy) as the selected final model.
- Sentiment distribution (lexicon-based) over the full corpus: positive 18,653; neutral 24,451; negative 15,354.
- Sentiment classification (LSTM-GRU): the 3-label classifier achieved 84.5% accuracy, with the epoch selected to avoid overfitting.
- Emotion analysis and classification: lexicon-based analysis finds fear and surprise prevalent in harassment contexts, with physical harassment showing a higher range and intensity of fear than non-physical. The 5-label LSTM-GRU emotion classifier achieved 80.8% accuracy.
- Qualitative lexical patterns: word clouds and n-grams show frequent nouns and verbs related to women, rape, family, and fear, reflecting victims' experiences and contexts in the narratives.
Discussion
The framework successfully addresses the research aims by detecting and classifying sexual harassment instances in Middle Eastern Anglophone literature and characterizing associated sentiment/emotion. The combination of rule-based detection with manual validation overcame the pitfalls of purely lexical matching, improving precision in identifying true harassment content. Supervised ML could then operate on curated labels to distinguish physical vs. non-physical offenses with competitive performance (LR ~75.8% accuracy). Lexicon-based sentiment labeling at scale enabled training an LSTM-GRU sentiment classifier achieving strong accuracy (84.5%), while emotion detection and classification indicated that harassment-related passages skew negatively, with fear and surprise particularly prominent—especially in physical harassment. These findings reinforce theoretical understandings of harassment’s psychological and social impacts and demonstrate the feasibility and value of computational literary analysis to derive structured insights from extensive texts. The results provide an empirical basis for future, larger-scale analyses and potentially inform policy, education, and advocacy by elucidating patterns and emotional contexts of harassment in cultural narratives.
Conclusion
The study presents a computational framework for mining sexual harassment in Anglophone Middle Eastern literature, integrating rule-based detection, manual annotation, classical ML for offense-type classification, and LSTM-GRU deep learning for sentiment and emotion classification. The ML classifier for physical vs. non-physical harassment attained approximately 75.8% accuracy (logistic regression), and the LSTM-GRU sentiment model achieved 84.5% accuracy; the emotion classifier reached 80.8%. Analyses indicate that harassment-related sentences predominantly exhibit negative sentiment, with physical harassment associated with stronger fear signals than non-physical harassment. The work demonstrates a scalable approach for typology identification and affective profiling in literary corpora. Future directions include: expanding and diversifying datasets; involving multiple lexicons and expert annotators to reduce labeling bias; exploring ensemble and transformer-based models; adapting the framework to non-English texts; and tailoring models specifically to harassment-related sentences to improve domain performance.
Limitations
- Lexicon-based detection and labeling can introduce bias and false positives/negatives; many lexical hits do not indicate harassment without context, necessitating manual review.
- The small labeled dataset for offense-type classification (108 instances) limits model generalizability and stability.
- Sentiment/emotion labels for deep learning were derived from lexicons across heterogeneous sentences (not solely harassment-related), which may weaken topic specificity.
- Cultural and domain specificity: findings and models are tailored to Anglophone Middle Eastern literature and may not generalize to other languages or regions without adaptation.
- Training time and potential class imbalance issues; limited interpretability for some models.
- Subjectivity remains in manual annotation; relying on few annotators can introduce systematic biases.