Computer Science
Sentiment Analysis in the Era of Large Language Models: A Reality Check
W. Zhang, Y. Deng, et al.
This paper offers a comprehensive investigation of large language models (LLMs) across 13 sentiment-analysis tasks on 26 datasets, comparing them to small, domain-tuned models. Findings show LLMs excel in few-shot settings and simpler tasks but struggle with complex, structured sentiment phenomena; the authors also introduce the SENTIEVAL benchmark and release data and code.
Introduction
Sentiment analysis (SA) is a long-established area of research in natural language processing (NLP) that studies people's opinions, sentiments, emotions, and related phenomena. In this work, the authors conduct a reality check on the current state of sentiment analysis in the era of large language models (LLMs). They pose three research questions: (1) How mature are the various sentiment analysis problems today? (2) How do LLMs fare, in both zero-shot and few-shot settings, compared to small specialized models trained on domain-specific data? (3) Are current SA evaluation practices still suitable for assessing models in the era of LLMs? To address these questions, the paper systematically reviews SA tasks spanning conventional sentiment classification (SC), aspect-based sentiment analysis (ABSA), and multifaceted analysis of subjective texts (MAST), covering 13 tasks across 26 datasets. The study evaluates both open-source LLMs (Flan-T5 and Flan-UL2) and GPT-3.5 family models (ChatGPT and text-davinci-003), and compares them with smaller language models (SLMs) such as T5 trained on in-domain labeled data. The intent is to provide a holistic understanding of how well models comprehend human subjective information and to reassess evaluation practices for LLMs.
Literature Review
The paper’s background section surveys sentiment analysis from its inception (Turney, 2002; Hu and Liu, 2004; Yu and Hatzivassiloglou, 2003) to its modern developments (Liu, 2015; Poria et al., 2020; Yadav and Vishwakarma, 2020), noting applications such as product reviews and social media analysis. It outlines the broader SA landscape: document-, sentence-, and aspect-level sentiment classification, aspect-based sentiment analysis (ABSA) involving aspects, opinions, and polarities, and multifaceted analysis of subjective texts (MAST) tasks like hate speech, irony, offensive language, stance, comparative opinions, and emotion recognition. The paper also reviews advances in LLMs (GPT-3, PaLM, Flan-UL2, LLaMA, ChatGPT) and initial efforts applying LLMs to SA: zero-shot SA performance comparable to fine-tuned BERT (Zhong et al., 2023), ChatGPT studies on polarity shifts and sentiment inference (Wang et al., 2023), emotional conversation capabilities (Zhao et al., 2023), and distilling LLM-generated weak labels to smaller models (Deng et al., 2023). It highlights the limitations of prior work, which often targets specific tasks with varying datasets and experimental settings, leaving the true breadth of LLM capabilities in SA unclear.
Methodology
The study investigates 13 SA tasks across 26 datasets, grouped into three categories. (1) Sentiment Classification (SC): document-level (IMDb, Yelp-2, Yelp-5), sentence-level (MR, SST2, SST5, Twitter), and aspect-level (Lap14, Rest14) classification, evaluated with accuracy. (2) Aspect-Based Sentiment Analysis (ABSA): Unified ABSA (UABSA), extracting (aspect, sentiment) pairs, evaluated on the SemEval 2014–2016 laptop and restaurant datasets; Aspect Sentiment Triplet Extraction (ASTE), extracting (aspect, opinion, sentiment) triplets, on the datasets of Xu et al. (2020) built upon the UABSA datasets; and Aspect Sentiment Quadruple Prediction (ASQP), extracting (category, aspect, opinion, sentiment) quadruples, on two restaurant datasets (Zhang et al., 2021; Cai et al., 2021). All ABSA tasks are evaluated with Micro-F1 requiring an exact match on all elements (a sketch of this scoring appears below). (3) MAST tasks: implicit sentiment analysis (Li et al., 2021) on the combined SemEval 2014 Laptop and Restaurant implicit reviews; hate speech detection (SemEval2019 HatEval); irony detection (SemEval2018 Task 3A, Irony18); offensive language identification (SemEval2019 OffensEval); stance detection (SemEval2016 Task 6, aggregated over domains, with macro-F1 over favor/against, ignoring none); comparative opinion mining (CS19); and emotion recognition (TweetEval Emotion20, macro-F1 across anger, joy, sadness, optimism). For balance, up to 500 examples are sampled from each original test set.

Models: the LLMs are Flan-T5 XXL (13B) and Flan-UL2 (20B), run from their Hugging Face checkpoints, plus the OpenAI GPT-3.5 series, ChatGPT (gpt-3.5-turbo, May 12 version) and text-davinci-003 (text-003, 175B). The SLM baseline is T5-large (770M) fine-tuned on each dataset in a unified text-to-text format. T5 training uses the Adam optimizer (lr=1e-4) with batch size 4, for 3 epochs in the full-data setting and 100 epochs in the few-shot setting; each configuration is run three times with different seeds and the average is reported.

Prompting strategy: prompts are kept consistent across datasets and models and contain three essential components: the task name, a task definition including the label space, and the output format; few-shot prompts additionally include k demonstration examples per class (a minimal sketch of this prompt format follows at the end of this section). To study prompt sensitivity, five additional prompts per task were generated with GPT-4 to reduce manual bias, and performance variance was analyzed via boxplots.

Few-shot settings: K-shot experiments use K ∈ {1, 5, 10}, sampling K examples per sentiment type (for ASQP, per aspect category); the sampled examples serve as in-context demonstrations for LLMs and as training data for SLMs. A cost analysis compares average costs per task category for ChatGPT and T5-large (Appendix A.3).

Finally, the SENTIEVAL benchmark is constructed to unify evaluation across SA tasks: each task has ten candidate prompts (five GPT-4-generated plus five manually written), one prompt is selected at random per sample, and few-shot examples are included with a 50% chance, yielding 12,224 samples in total with natural language instructions and optional demonstrations.
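To make the prompting setup concrete, here is a minimal Python sketch of how a prompt built from the three components described above (task name, task definition with label space, and output format), plus optional per-class demonstrations, could be assembled. The build_prompt helper, its wording, and the demonstration texts are illustrative assumptions, not the authors' released prompts.

```python
import random

def build_prompt(task_name, task_definition, label_space, output_format,
                 text, demos_by_label=None, k=0):
    """Assemble a prompt from the three components used in the paper's
    prompting strategy (task name, task definition with label space, output
    format), optionally prepending k demonstration examples per class."""
    lines = [
        f"Task: {task_name}",
        f"Definition: {task_definition} Possible labels: {', '.join(label_space)}.",
        f"Output format: {output_format}",
    ]
    if demos_by_label and k > 0:
        lines.append("Examples:")
        for label, examples in demos_by_label.items():
            for demo in random.sample(examples, k):  # k demonstrations per class
                lines.append(f"Text: {demo}\nLabel: {label}")
    lines.append(f"Text: {text}\nLabel:")
    return "\n".join(lines)

# Hypothetical usage for sentence-level sentiment classification (SST2-style).
print(build_prompt(
    task_name="Sentence-level sentiment classification",
    task_definition="Decide whether the sentence expresses positive or negative sentiment.",
    label_space=["positive", "negative"],
    output_format="Answer with a single label and nothing else.",
    text="The film is a charming, if slight, diversion.",
    demos_by_label={"positive": ["A gorgeous, witty ride."],
                    "negative": ["A tedious mess from start to finish."]},
    k=1,
))
```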
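The ABSA metric, Micro-F1 with exact match on every element of each predicted tuple, can be sketched as follows. The function name, the tuple representation, and the toy example are assumptions for illustration rather than the paper's evaluation script.

```python
def micro_f1_exact_match(gold, pred):
    """Micro-F1 over extracted sentiment tuples, counting a predicted tuple as
    correct only if every element (e.g., aspect, opinion, sentiment) exactly
    matches a gold tuple. `gold` and `pred` hold one list of tuples per sentence."""
    tp = fp = fn = 0
    for gold_tuples, pred_tuples in zip(gold, pred):
        gold_set, pred_set = set(gold_tuples), set(pred_tuples)
        tp += len(gold_set & pred_set)
        fp += len(pred_set - gold_set)
        fn += len(gold_set - pred_set)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Hypothetical ASTE-style example with (aspect, opinion, sentiment) triplets.
gold = [[("battery life", "great", "positive"), ("screen", "dim", "negative")]]
pred = [[("battery life", "great", "positive"), ("screen", "too dim", "negative")]]
print(micro_f1_exact_match(gold, pred))  # 0.5: one of two triplets matches exactly
```

In this toy example the near-miss opinion span ("too dim" instead of "dim") counts as both a false positive and a false negative, illustrating why exact-match Micro-F1 penalizes outputs that are semantically close but not literally identical to the gold annotation.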
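The SENTIEVAL sample-construction procedure (random choice among a task's candidate prompts and a 50% chance of attaching demonstrations) might look roughly like the sketch below; the helper name, the number of attached demonstrations, and the prompt layout are assumptions, and the released benchmark should be consulted for the exact format.

```python
import random

def make_sentieval_sample(text, candidate_prompts, demo_pool, rng=random):
    """Build one SENTIEVAL-style instance: pick one of the candidate prompts at
    random and, with 50% probability, attach a few demonstration examples."""
    instruction = rng.choice(candidate_prompts)  # one of the task's candidate prompts
    demos = rng.sample(demo_pool, k=min(2, len(demo_pool))) if rng.random() < 0.5 else []
    demo_text = "".join(f"Text: {t}\nLabel: {y}\n" for t, y in demos)
    return f"{instruction}\n{demo_text}Text: {text}\nLabel:"

# Hypothetical usage with two candidate prompts and a tiny demonstration pool.
prompts = ["Classify the sentiment of the text as positive or negative.",
           "What is the sentiment polarity of the following sentence?"]
pool = [("A gorgeous, witty ride.", "positive"),
        ("A tedious mess from start to finish.", "negative")]
print(make_sentieval_sample("The plot never quite comes together.", prompts, pool))
```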
Key Findings
Zero-shot performance (Table 1) shows that LLMs, especially ChatGPT, are strong on simpler tasks: the SC average is close to fine-tuned T5 (ChatGPT reaches roughly 97% of T5 on SC), and ChatGPT is also strong on several MAST tasks (about 85% of T5). For example, text-003 achieves 97.40 accuracy on IMDb and 98.20 on Yelp-2, while T5 achieves 93.93 and 96.33, respectively; ChatGPT attains 90.60 on IMDb and 97.80 on Yelp-2. However, LLMs struggle on ABSA tasks requiring structured outputs: Flan-T5 and Flan-UL2 effectively fail (0.00 Micro-F1), and text-003/ChatGPT remain well below fine-tuned T5 (ABSA average Micro-F1: text-003 33.16, ChatGPT 34.47, T5 61.06).

Prompt sensitivity analysis (Figure 2) indicates that SC is relatively robust to prompt variations, while ABSA exhibits large variance; certain words (e.g., "analyze") can trigger undesired verbose outputs despite explicit formatting instructions.

Few-shot experiments (Table 2) show LLMs consistently outperform SLMs with limited data across task types. In the 1-shot setting, ChatGPT vs. T5 averages are: Doc-SC 81.47 vs 66.76, Sent-SC 76.20 vs 46.80, Aspect-SC 81.57 vs 58.97, UABSA 52.57 vs 15.70, ASTE 44.45 vs 6.81, ASQP 31.07 vs 5.61, and MAST 68.46 vs 34.09. Increasing the number of shots benefits SLMs more steadily, while LLM improvements vary: ABSA gains notably with more shots, whereas MAST can decline due to demonstration bias.

On the SENTIEVAL benchmark (Table 3), under diverse instruction styles and formatting requirements, overall exact-match scores are ChatGPT 47.55, Flan-UL2 38.82, text-003 36.64, and Flan-T5 29.07. By task type: SC, ChatGPT 72.73 and Flan-UL2 63.13; ABSA, ChatGPT 14.77 and text-003 11.66; MAST, ChatGPT 57.71 and Flan-UL2 58.35.

Taken together, these findings indicate that (1) some SA tasks (binary SC, basic emotion recognition) are approaching maturity, (2) LLMs excel in few-shot regimes, (3) structured sentiment extraction remains challenging, and (4) evaluation practices and prompt design significantly affect measured performance.
Discussion
The study’s findings address the research questions by showing that simple SA tasks like binary document- or sentence-level classification are largely mature: LLMs in zero-shot can match or surpass fine-tuned SLMs. In contrast, complex tasks requiring structured extraction (ABSA) or nuanced understanding (some MAST tasks) reveal clear gaps where fine-tuned SLMs with sufficient data still dominate. LLMs’ superior few-shot performance suggests strong utility when annotations are scarce, but their sensitivity to prompts and format requirements underscores the need for standardized, instruction-based evaluations. SENTIEVAL demonstrates that robustness to diverse, natural instructions is a differentiator among LLMs (e.g., ChatGPT outperforming others), reflecting more realistic user interactions. The results encourage shifting research focus toward improving LLMs’ structured information extraction, mitigating prompt sensitivity, and developing benchmarks like SENTIEVAL that holistically test SA capability across tasks and instruction styles.
Conclusion
The paper provides a systematic evaluation of LLMs on 13 sentiment analysis tasks across 26 datasets, revealing that LLMs perform well on simpler tasks and consistently outperform SLMs in few-shot settings, but lag on complex, structured ABSA tasks. It highlights the limitations of existing evaluation practices and proposes the SENTIEVAL benchmark for comprehensive and realistic assessment using diverse natural language instructions. Future work should explore improving LLMs’ ability to extract fine-grained structured sentiment information, reduce prompt sensitivity, and extend evaluations across languages and cultural contexts.
Limitations
The set of 13 tasks and 26 datasets, while broad, is not exhaustive of sentiment analysis problems; including more tasks and formats could better reveal strengths and weaknesses. All datasets are in English; sentiment phenomena are language- and culture-dependent, so multilingual and cross-cultural evaluations are necessary for a comprehensive understanding of LLM performance.