BESSTIE: A Benchmark for Sentiment and Sarcasm Classification for Varieties of English

D. Srirag, A. Joshi, et al.

BESSTIE introduces the first labelled benchmark for sentiment and sarcasm across three English varieties (en-AU, en-IN, en-UK), built from Google Places reviews and Reddit comments with manual and automatic validation. Nine large language models were fine-tuned and evaluated, revealing consistent advantages on inner-circle varieties (en-AU, en-UK) and challenges in cross-variety generalisation, especially for sarcasm. Research conducted by Dipankar Srirag, Aditya Joshi, Jordan Painter, and Diptesh Kanojia. The dataset is available on Hugging Face.
Introduction

Benchmark-based evaluation is the norm in NLP, but existing benchmarks rarely include non-standard language varieties (national varieties, dialects, sociolects, creoles). Prior work shows LLMs can be biased against certain English varieties (e.g., African-American English, Indian English), yet empirical evidence across other varieties is limited. Synthetic approaches such as Multi-VALUE degrade performance on GLUE but do not fully capture real-world variety features beyond syntax (e.g., orthography, vocabulary, pragmatics). To address these gaps, the paper introduces BESSTIE, a benchmark for sentiment and sarcasm classification for Australian (en-AU), Indian (en-IN), and British (en-UK) English. BESSTIE provides manually labelled datasets of natural text sourced from Google Places reviews and Reddit comments, using location-based and topic-based filtering, respectively. The authors validate variety representation via manual annotation and automated variety identification. Both tasks are modelled as binary classification, and nine LLMs (encoders and decoders; monolingual and multilingual) are evaluated. The key research questions are: how well current LLMs perform on sentiment and sarcasm across varieties; how model properties and domain affect performance; and whether models trained on one variety generalize to others. The contributions are: (a) a manually annotated dataset with sentiment and sarcasm labels across three English varieties; and (b) an evaluation and analysis of nine LLMs on BESSTIE that identifies challenges arising from language varieties.

Literature Review

The paper situates BESSTIE among major NLP benchmarks such as GLUE, SuperGLUE, and DynaBench, which lack coverage of non-standard language varieties. It surveys dialect- and variety-focused resources (e.g., AfriSenti for African languages, the Arabic dialect benchmarks ORCA and ArSarcasm-v2, CreoleVal for creole languages, and DialectBench for 281 dialectal varieties), noting that sentiment and sarcasm datasets for English varieties are missing. Existing English sarcasm-detection datasets (Reddit, Amazon reviews, Twitter) focus on standard English and do not account for non-standard varieties. Prior work demonstrates LLM bias against African-American English and Indian English, while synthetic transformations (Multi-VALUE) fall short of representing real usage. BESSTIE is thus positioned as the first benchmark targeting sentiment and sarcasm classification across English varieties.

Methodology

Dataset creation follows three stages: collection, quality assessment, and annotation.

Data Collection: Two domains are used: Google Places reviews (GOOGLE) and Reddit comments (REDDIT). For GOOGLE, location-based filtering selects reviews from cities in Australia, India, and the UK, with city-selection population thresholds of ≥ 20K (en-AU), ≥ 100K (en-IN), and ≥ 50K (en-UK). Reviews are collected via the Google Places API along with their star ratings (1–5) and filtered for English using fastText language probabilities at a threshold of 0.98; reviews of 'tourist attractions' are excluded to reduce tourist-authored content. For REDDIT, topic-based filtering selects up to four subreddits per variety, chosen by native speakers: en-AU ('melbourne', 'AustralianPolitics', 'AskAnAustralian'); en-IN ('India', 'IndiaSpeaks', 'BollyBlindsNGossip'); en-UK ('England', 'Britain', 'UnitedKingdom', 'GreatBritishMemes'). For each variety, 12,000 comments are scraped from recent posts (at most 20 comments per post), then randomly sampled down to a standard 3,000 comments per variety before annotation; identifiers are discarded to preserve anonymity.

Label Semantics Pilot: Using DistilBERT-Base (DISTIL), sentiment classification on GOOGLE reviews is tested under two labelling schemes: NAIVE (extreme ratings, 1/5 stars) and OURS (moderate ratings, 2/4 stars). Macro F-scores (Table 2), NAIVE vs OURS: en-AU 0.97 vs 0.79; en-IN 0.94 vs 0.76; en-UK 0.96 vs 0.81; average µ 0.96 vs 0.79. The OURS labels increase task difficulty and nuance, and the final dataset retains the intermediate ratings (2 and 4) to avoid overfitting to polarised examples. (A sketch of the language filter and both labelling schemes appears at the end of this section.)

Quality Assessment: Manual variety identification is performed by en-IN and en-UK annotators on 300 texts (150 reviews, 150 comments), each labelled en-AU, en-IN, en-UK, or 'Cannot say'. Cohen's κ against the true labels is 0.41 for the en-IN annotator and 0.34 for the en-UK annotator; inter-annotator agreement is 0.26. Confusion analysis shows the highest agreement on identifying en-IN and difficulty differentiating the two inner-circle varieties. Automated validation uses two predictors: a language predictor (fastText probability of English) and a variety predictor (DISTIL fine-tuned on the S1A and S2A sections of ICE-Australia and ICE-India), modelling inner- vs outer-circle classification; ICE-GB was unavailable. Average probabilities and F-scores (Table 3), reported as P(eng) / P(v) / F:
  • en-AU: GOOGLE 0.99 / 0.99 / 0.99; REDDIT 0.98 / 0.95 / 0.93
  • en-IN: GOOGLE 0.99 / 0.94 / 0.91; REDDIT 0.87 / 0.78 / 0.69
  • en-UK: GOOGLE 0.99 / 0.99 / 0.99; REDDIT 0.98 / 0.93 / 0.90
The lower en-IN REDDIT scores reflect code-mixed text.

Annotation: One annotator per variety (two are paper authors involved in the evaluation) labels sentiment and sarcasm as positive (1), negative (0), or discard (2); discarded items, such as uninformative or generated text, are removed. Reliability is assessed by an additional independent annotator on 50 instances per variety, with Cohen's κ (Table 4) of 0.61 (sentiment) and 0.47 (sarcasm) for en-AU, 0.65 and 0.51 for en-IN, and 0.79 and 0.63 for en-UK. Annotators are compensated at approximately 22 USD/hour; annotation guidelines are provided in Appendix A. (A sketch of the agreement computation also appears at the end of this section.)

Dataset Statistics: Table 5 summarises splits and label distributions, reported below as Train/Valid/Test, % positive sentiment, % positive sarcasm, and average words; the REDDIT subset is from 2024.
  • en-AU: GOOGLE 946/130/270, 73%, 7%, 63.97; REDDIT 1763/241/501, 32%, 42%, 51.72
  • en-IN: GOOGLE 1648/225/469, 75%, 1%, 44.34; REDDIT 1686/230/479, 25%, 13%, 26.92
  • en-UK: GOOGLE 1817/248/517, 75%, 0%, 72.21; REDDIT 1007/138/287, 12%, 22%, 38.04

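To make the English filtering and the two labelling schemes concrete, here is a minimal Python sketch. It assumes the publicly available fastText language-identification model (lid.176.bin) and a list of (text, rating) pairs; the function names and example reviews are illustrative, not the authors' code.

```python
import fasttext

# Assumes the public fastText language-ID model, downloaded from:
# https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
lid_model = fasttext.load_model("lid.176.bin")

def is_english(text: str, threshold: float = 0.98) -> bool:
    """Keep a text only if fastText assigns English a probability >= threshold."""
    labels, probs = lid_model.predict(text.replace("\n", " "))
    return labels[0] == "__label__en" and probs[0] >= threshold

def naive_label(rating: int):
    """NAIVE scheme: extreme star ratings only (1 -> negative, 5 -> positive)."""
    return {1: 0, 5: 1}.get(rating)  # None means the review is not used

def ours_label(rating: int):
    """OURS scheme: moderate star ratings only (2 -> negative, 4 -> positive)."""
    return {2: 0, 4: 1}.get(rating)

# Hypothetical (text, star-rating) pairs standing in for scraped reviews.
reviews = [("Lovely little cafe, friendly staff.", 4),
           ("Queue was slow and the food arrived cold.", 2)]
dataset = [(text, ours_label(rating)) for text, rating in reviews
           if is_english(text) and ours_label(rating) is not None]
```
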
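The reliability figures above can be reproduced with scikit-learn's cohen_kappa_score; the short label lists below are hypothetical stand-ins for the 50 doubly annotated instances per variety.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical sentiment labels (1 = positive, 0 = negative) from the
# primary annotator and the independent second annotator for one variety.
primary = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
second  = [1, 0, 0, 1, 0, 1, 1, 1, 0, 1]

print(f"Cohen's kappa: {cohen_kappa_score(primary, second):.2f}")
```
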
Experimental Setup: Nine LLMs are evaluated: six encoder models (monolingual ALBERT-XXL-v2, BERT-Large, and RoBERTa-Large; multilingual MBERT, MDISTIL, and XLM-R-Large) and three decoder models (Gemma2-27B-Instruct, Mistral-Small-Instruct-2409, and Qwen2.5-72B-Instruct). Encoders are fine-tuned in full precision with weighted cross-entropy; decoders are quantized and instruction-fine-tuned with QLoRA adapters on their linear layers using maximum-likelihood estimation. All models (including the variety predictor) are fine-tuned for 30 epochs with batch size 8 and the Adam optimizer, using a learning-rate grid search over {1e-5, 2e-5, 3e-5}, on 2× NVIDIA A100 80GB GPUs. Decoder task prompts are: sentiment ("Generate the sentiment... 1 for positive, 0 for negative..."); sarcasm ("Predict if the given text is sarcastic... 1 if sarcastic, 0 if not..."). Evaluation uses the macro-averaged F-score. (Minimal sketches of the encoder loss weighting and the decoder QLoRA setup appear at the end of this section.)

Error Analysis: Up to 30 misclassified examples per variety and domain-task pair are analyzed for dialect features (per eWAVE), locale-specific colloquialisms, context requirements, and code-mixing. Feature counts (Table 8; a single example can exhibit multiple features, so counts may exceed the sample size):
  • en-AU (sample 70): DIAL 9, COLL 28, CONT 6
  • en-IN (sample 90): DIAL 97, COLL 33, CONT 3, CODE 8
  • en-UK (sample 53): DIAL 7, COLL 15, CONT 5
Detailed feature examples are provided in the appendices.

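Since some labels are heavily imbalanced (e.g., 1% positive sarcasm in en-IN GOOGLE), the encoders' weighted cross-entropy can be set up as below. This is a minimal PyTorch sketch assuming inverse-frequency class weights; the paper does not specify its exact weighting scheme.

```python
import torch
import torch.nn as nn

def class_weights(labels: torch.Tensor) -> torch.Tensor:
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = torch.bincount(labels, minlength=2).float()
    return counts.sum() / (2 * counts)

# Hypothetical training labels for one variety's sarcasm task (mostly 0s).
train_labels = torch.tensor([0] * 93 + [1] * 7)
loss_fn = nn.CrossEntropyLoss(weight=class_weights(train_labels))

logits = torch.randn(8, 2)              # encoder classification-head outputs
batch_labels = torch.randint(0, 2, (8,))
loss = loss_fn(logits, batch_labels)    # minority-class errors are up-weighted
```
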
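For the decoder setup, here is a minimal QLoRA fine-tuning sketch using the Hugging Face transformers, peft, and bitsandbytes libraries. The LoRA hyperparameters and exact prompt template are assumptions; only the model names, the quantisation-plus-adapter recipe, and the prompt wording come from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

MODEL = "mistralai/Mistral-Small-Instruct-2409"  # one of the three decoders

# Load the decoder in 4-bit so it fits on the available GPUs.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, quantization_config=bnb_config, device_map="auto")
model = prepare_model_for_kbit_training(model)

# LoRA adapters on the linear projection layers (r/alpha/dropout illustrative).
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Instruction prompt in the spirit of the paper's sarcasm prompt.
prompt = ("Predict if the given text is sarcastic. "
          "Respond with 1 if sarcastic, 0 if not.\nText: {text}\nLabel:")
# From here, instruction fine-tuning proceeds as a standard causal-LM loop
# (30 epochs, batch size 8, Adam, learning rate from {1e-5, 2e-5, 3e-5}).
```
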

Key Findings

Representation and Reliability: Manual and automated validation confirm that the collected texts represent the en-AU, en-IN, and en-UK varieties; the lower en-IN probabilities on REDDIT reflect code-mixing. Inter-annotator reliability is moderate for sentiment, and lower for sarcasm but comparable to prior work (e.g., κ ≈ 0.44 reported by Joshi et al., 2016).

Label Semantics: Using moderate ratings (2/4 stars) substantially lowers DistilBERT sentiment performance relative to extreme ratings (1/5 stars), with an average F-score of 0.79 vs 0.96, indicating increased task difficulty and nuance.

Overall Task Performance: Averaged across all models (Table 6):

  • GOOGLE-Sentiment: en-AU 0.94; en-IN 0.64; en-UK 0.86.
  • REDDIT-Sentiment: en-AU 0.78; en-IN 0.69; en-UK 0.78.
  • REDDIT-Sarcasm: en-AU 0.62; en-IN 0.56; en-UK 0.58.
  • Average µ across tasks: en-AU 0.78; en-IN 0.63; en-UK 0.74.

Model Comparisons (Table 7):

  • Sentiment: MISTRAL achieves highest average F-score across varieties (GOOGLE 0.91; REDDIT 0.84). QWEN is lowest on GOOGLE (0.45); GEMMA is lowest on REDDIT (0.60).
  • Sarcasm: MISTRAL and MBERT are highest (0.68); GEMMA is lowest (0.45).
  • Encoders outperform decoders overall; monolingual models marginally outperform multilingual ones, with multilingual models relatively stronger on REDDIT.

Domain Effects: Models perform better on GOOGLE than on REDDIT (average F-score 0.81 vs 0.75), reflecting the more formal, informative style of reviews versus colloquial, variety-rich social-media language.

Cross-Variety Generalization: For sentiment, pre-trained MISTRAL already achieves high performance on en-AU and en-UK; fine-tuning improves in-variety performance and yields relatively stable cross-variety results, suggesting that variety has limited influence on sentiment. For sarcasm, fine-tuning improves in-variety performance but degrades cross-variety generalization, underscoring the need for variety-specific contextual and cultural understanding. (A sketch of the evaluation grid follows below.)

Error Analysis: en-IN exhibits the highest count of dialect features among misclassified samples, indicating that regional variation and code-mixing challenge models; locale-specific colloquialisms are pervasive across all varieties.

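To make the cross-variety protocol explicit, here is a minimal sketch of a train-variety × test-variety grid of macro F-scores. It assumes per-variety fine-tuned classifiers exposing a hypothetical predict method and per-variety test sets; only the macro-averaged F-score metric comes from the paper.

```python
from sklearn.metrics import f1_score

VARIETIES = ["en-AU", "en-IN", "en-UK"]

def cross_variety_grid(models, test_sets):
    """models: variety -> fine-tuned classifier (hypothetical .predict(text)).
    test_sets: variety -> (texts, labels). Returns macro F per (train, test)
    pair; off-diagonal cells measure cross-variety generalization."""
    grid = {}
    for train_v in VARIETIES:
        for test_v in VARIETIES:
            texts, labels = test_sets[test_v]
            preds = [models[train_v].predict(t) for t in texts]
            grid[(train_v, test_v)] = f1_score(labels, preds, average="macro")
    return grid
```
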
Discussion

The findings demonstrate consistent performance disparities between inner-circle (en-AU, en-UK) and outer-circle (en-IN) English varieties, supporting the hypothesis that contemporary LLMs are biased toward varieties with greater representation in pre-training data and more standardized forms. Higher performance on GOOGLE suggests that formal, structured text with clearer sentiment cues benefits models, while conversational, culturally nuanced Reddit language exposes limitations, especially for sarcasm detection. The encoders' superiority indicates that architectures optimized for classification remain advantageous over generative decoders for these tasks. The marginal advantage of monolingual models suggests that multilingual pre-training does not automatically confer robustness to within-language varieties. Cross-variety analyses reveal that sarcasm classification is particularly sensitive to variety-specific linguistic and cultural context, with fine-tuning on one variety reducing generalization to others. This highlights the need for variety-aware training strategies, richer context modeling, and the inclusion of colloquialisms and dialect features during training. The BESSTIE benchmark provides a foundation for quantifying and mitigating such biases, encouraging the development of equitable, variety-robust models.

Conclusion

BESSTIE introduces a novel benchmark for sentiment and sarcasm classification across Australian, Indian, and British English, comprising manually annotated datasets from Google Places and Reddit. Validation confirms variety representation and annotation reliability. Across nine LLMs, models consistently perform better on inner-circle varieties and struggle with en-IN and with sarcasm classification overall. Encoders outperform decoders; monolingual models marginally outperform multilingual ones; and formal review text yields higher scores than social media comments. Cross-variety experiments show limited generalization for sarcasm, indicating the importance of cultural and contextual signals. BESSTIE serves as a resource to evaluate and address bias in LLMs toward non-standard English varieties and to guide future work in variety-aware modeling, improved sarcasm detection, and domain adaptation.

Limitations

The benchmark treats national varieties as proxies for dialect, which simplifies substantial intra-national regional, sociolectal, and generational variation. Because language on social media evolves rapidly, a static snapshot cannot fully reflect dynamic changes in usage. Despite English filtering, code-mixing may persist, especially in en-IN, affecting performance. Annotation relied on a single annotator per variety, potentially introducing individual bias; the additional validation step mitigates but does not eliminate this concern. Reddit's user demographics and discourse style differ from those of other platforms (e.g., Twitter/X), influencing the topic-based discussions collected. Reliance on Reddit is partly due to the cost of Twitter/X API access.
