Evaluating the capacity of large language models to interpret emotions in images

H. Alrasheed, A. Alghihab, et al.

Discover how GPT-4 can streamline emotional stimulus selection by rating visual images on valence and arousal, closely approximating human judgments under zero-shot conditions, though subtler cues remain challenging. This research was conducted by Hend Alrasheed, Adwa Alghihab, Alex Pentland, and Sharifa Alghowinem.

Introduction
The study investigates whether a large language model (GPT-4) can recognize and rate emotions elicited by non-facial images, focusing on valence (negative–neutral–positive) and arousal (calm–neutral–stimulated). Traditional stimulus selection and validation are time- and labor-intensive and can be biased; standardized image datasets like GAPED and IAPS help but are costly to build and maintain. The research asks if GPT-4 can approximate human emotional ratings to automate and scale stimulus selection/validation. The work targets general, non-facial imagery (objects, environments, animals, abstract scenes), an underexplored area for LLMs despite its importance in affective computing and psychological research. The authors compare GPT-4’s outputs to human ratings from GAPED using numeric (0–100) and 3-point Likert scales under zero-shot and few-shot conditions, and also evaluate performance from textual descriptions of the images.
Literature Review
The related work reviews image- and text-based emotion elicitation.

Image-based: Early visual sentiment approaches used handcrafted features (color, texture, composition) and emotional semantic retrieval. Deep learning (CNNs) and transfer/reinforcement learning improved automatic feature extraction and task performance. Recent multimodal LLMs can process images and text; LLMs have shown competitive zero-shot performance on facial expression datasets without fine-tuning, though specialized CNNs can still lead. Studies report GPT-4's strong visual emotion understanding (e.g., Reading the Mind in the Eyes) and solid multimodal capabilities, yet challenges remain for micro-expressions and specialized tasks. In-context learning and chain-of-thought prompting can improve interpretability and performance on emotion recognition tasks, sometimes surpassing traditional baselines even without fine-tuning. Non-facial image emotion research frequently uses GAPED to elicit discrete emotions or to rate valence/arousal across cultures.

Text-based: Early methods relied on lexicons and classic ML (n-grams, polarity). Deep learning (RNNs, attention, transformers like BERT) advanced the state of the art via contextual embeddings and supervised fine-tuning on emotion datasets. Unsupervised/semi-supervised and transfer learning address low-label scenarios. LLMs achieve strong sentiment/emotion performance, with studies showing high alignment with human emotional and value assessments and reliable estimation of valence/arousal for multi-word expressions. Zero-shot LLM annotation of emotions has sometimes been preferred over human annotations across datasets.
Methodology
Dataset: The Geneva Affective Picture Database (GAPED) contains 730 images across Positive, Neutral, and Negative categories, with the Negative category subdivided into Animal Mistreatment, Human Concerns, Snakes, and Spiders.

Human ratings: 60 participants (average age 24), primarily French-speaking, rated valence and arousal on 0–100 scales (valence: 0 very negative to 100 very positive; arousal: 0 calm to 100 stimulated; 50 neutral). Typical category profiles: Positive images show high valence (>70) and low arousal (<22); Negative images show lower valence (<50) and moderately higher arousal (~53–61); Neutral images sit near valence ~55 and arousal ~25.

Dataset updates: (1) GPT-4 generated a textual description for each image (prompt: "What's in this image?"). (2) Continuous ratings were mapped to 3-point Likert scales: valence Negative [0, 40), Neutral [40, 71), Positive [71, 100]; arousal Calm [0, 23), Neutral [23, 45), Stimulated [45, 100]. A sketch of this binning appears at the end of this section.

Image-based ratings: The model was GPT-4 Turbo; two rating types and two learning conditions were used.
- Numeric (0–100) ratings. The prompt instructs the model to rate valence and arousal on 0–100 and to output "Valence: [value], Arousal: [value]". Zero-shot and few-shot settings were used; few-shot prompts included example images with human ratings (3 Positive, 3 Neutral, and 8 Negative, 2 per negative subcategory). For each image, 10 responses were collected and averaged (see the query sketch after this section).
- Likert (3-point) ratings. The prompt instructs the model to choose among Negative/Neutral/Positive (valence) and Calm/Neutral/Stimulated (arousal), using the same output format. Zero-shot and few-shot settings were used. For each image, 9 prompts were issued and the modal class taken (9 chosen to avoid ties); see the voting sketch after this section.

Text-description-based ratings: For each image, the previously generated GPT-4 description was used as input instead of the image, with the same two rating types and zero-/few-shot conditions. Few-shot text prompts included example descriptions with human labels, analogous to the image-based setup.

Evaluation: Comparisons against the human GAPED ratings used Pearson correlations and mean absolute error (MAE) for numeric ratings, along with descriptive statistics. For Likert ratings, confusion matrices and performance metrics (accuracy, precision, recall, F1) were computed overall and per category. Two-class evaluation was used within individual negative subcategories (since only Negative and Neutral appear there), and three-class evaluation across all images; the metrics sketch after this section illustrates these computations.
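To make the Likert mapping concrete, here is a minimal sketch of the binning described above, using the paper's thresholds; the function name to_likert is ours, not the authors'.

```python
def to_likert(valence: float, arousal: float) -> tuple[str, str]:
    """Map 0-100 GAPED ratings onto the paper's 3-point Likert bins.
    Valence: Negative [0, 40), Neutral [40, 71), Positive [71, 100].
    Arousal: Calm [0, 23), Neutral [23, 45), Stimulated [45, 100]."""
    if valence < 40:
        v = "Negative"
    elif valence < 71:
        v = "Neutral"
    else:
        v = "Positive"
    if arousal < 23:
        a = "Calm"
    elif arousal < 45:
        a = "Neutral"
    else:
        a = "Stimulated"
    return v, a
```

For example, to_likert(85, 18) returns ("Positive", "Calm"), matching the typical GAPED Positive-category profile.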
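The numeric-rating protocol (10 responses per image, averaged) could be reproduced roughly as follows. This is a sketch assuming the OpenAI Python SDK; the prompt text is a paraphrase of the paper's rather than its exact wording, and rate_image is a hypothetical helper, not the authors' code.

```python
import re
from statistics import mean

from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Paraphrase of the paper's zero-shot numeric prompt, not its exact wording.
PROMPT = (
    "Rate the emotion this image elicits in a viewer. "
    "Valence: 0 (very negative) to 100 (very positive); "
    "arousal: 0 (calm) to 100 (stimulated). "
    "Reply exactly as 'Valence: [value], Arousal: [value]'."
)

def rate_image(image_url: str, n: int = 10) -> tuple[float, float]:
    """Collect n numeric ratings for one image and average them,
    mirroring the paper's 10-response protocol."""
    valences, arousals = [], []
    for _ in range(n):
        resp = client.chat.completions.create(
            model="gpt-4-turbo",
            messages=[{"role": "user", "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url", "image_url": {"url": image_url}},
            ]}],
        )
        m = re.search(r"Valence:\s*([\d.]+).*?Arousal:\s*([\d.]+)",
                      resp.choices[0].message.content, re.S)
        if not m:
            continue  # skip malformed replies in this sketch
        valences.append(float(m.group(1)))
        arousals.append(float(m.group(2)))
    return mean(valences), mean(arousals)
```

The few-shot variant would prepend exemplar image/rating pairs (3 Positive, 3 Neutral, and 8 Negative in the paper) to the same message list before the target image.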
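For the Likert protocol, taking the modal class over the 9 responses per image is a one-liner; modal_label is a hypothetical helper.

```python
from collections import Counter

def modal_label(labels: list[str]) -> str:
    """Return the most frequent class among repeated Likert responses
    (the paper takes the mode over 9 prompts per image)."""
    return Counter(labels).most_common(1)[0][0]

# e.g., modal_label(["Negative"] * 5 + ["Neutral"] * 4) -> "Negative"
```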
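The evaluation metrics map directly onto standard SciPy/scikit-learn calls. This sketch assumes ratings are held in NumPy arrays (numeric) and label lists (Likert); the function names are ours.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             mean_absolute_error,
                             precision_recall_fscore_support)

def evaluate_numeric(human: np.ndarray, model: np.ndarray) -> dict:
    """Numeric agreement: Pearson r (with p-value) and MAE, as used to
    compare GPT-4's 0-100 ratings with the human GAPED ratings."""
    r, p = pearsonr(human, model)
    return {"pearson_r": r, "p_value": p,
            "mae": mean_absolute_error(human, model)}

def evaluate_likert(human: list[str], model: list[str]) -> dict:
    """Categorical agreement: accuracy, macro precision/recall/F1,
    and the confusion matrix, computed overall or per category."""
    prec, rec, f1, _ = precision_recall_fscore_support(
        human, model, average="macro", zero_division=0)
    return {"accuracy": accuracy_score(human, model),
            "precision": prec, "recall": rec, "f1": f1,
            "confusion": confusion_matrix(human, model)}
```

Restricting human/model to one negative subcategory reproduces the two-class evaluation; passing all images gives the three-class version.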
Key Findings
Image-based numeric ratings: Strong alignment with human ratings. Zero-shot Pearson r = 0.87 (valence, p < 0.001) and r = 0.72 (arousal, p < 0.001); few-shot r = 0.86 (valence) and r = 0.80 (arousal), both p < 0.001. Across all images, MAE was 10.5 (zero-shot) vs 10.2 (few-shot) for valence, and 11.1 (zero-shot) vs 9.5 (few-shot) for arousal. Category examples: with few-shot, Positive valence MAE decreased from 13.9 to 8.5, Neutral valence MAE from 6.7 to 5.5, and Positive arousal MAE from 13.3 to 7.8; some negative categories showed little improvement or got worse. Distributional differences: for Animal Mistreatment and Human Concerns, humans assigned stronger negative valence than GPT-4 (GPT-4 minus human differences peaking near +5) and GPT-4 gave slightly lower arousal (peak near −5); for Snakes and Spiders, GPT-4 tended to give more negative valence and higher arousal than humans.

Image-based Likert ratings: Overall three-class valence accuracy was 0.77 (zero-shot) and 0.73 (few-shot); overall arousal accuracy was 0.57 (zero-shot) and 0.55 (few-shot). Positive and Negative valence classes were highly accurate in zero-shot (Positive 0.98, Negative 0.77), while Neutral improved with few-shot (from 0.67 to 0.77) at a slight cost to Positive/Negative. For arousal, Calm and Stimulated were predicted well in zero-shot (100% and 69%, respectively), with few-shot mainly improving Neutral arousal. Category metrics: valence accuracy was ≥0.75 for most categories in zero-shot, except Snakes (0.57) and Spiders (0.65). Arousal precision was high for Animal Mistreatment and Human Concerns but recall was lower; Snakes/Spiders had relatively higher F1 for high arousal.

Text-description numeric ratings: Zero-shot Pearson r = 0.79 (valence, p < 0.001) and r = 0.65 (arousal, p < 0.001); few-shot r = 0.78 (valence) and r = 0.71 (arousal), both p < 0.001. All-images MAE was 11.8 (zero-shot) vs 12.4 (few-shot) for valence, and 12.1 (zero-shot) vs 11.2 (few-shot) for arousal. Few-shot reduced arousal MAE in most categories (e.g., Positive arousal MAE 16.3 → 11.7), while valence MAE sometimes worsened for negative categories (Human Concerns 19.8 → 22.4).

Text-description Likert ratings: Overall three-class valence accuracy was 0.67 (zero-shot) and 0.66 (few-shot); arousal accuracy was 0.48 (zero-shot) and 0.50 (few-shot). Category-level valence accuracy exceeded 0.70 for most categories except Snakes/Spiders. Arousal classification from descriptions remained challenging, though Positive arousal recall was high.

Additional observations: In the combined Negative set, humans labeled 42% of images Negative on valence versus 41% for GPT-4, with the remainder mostly Neutral; 3% of Negative images were mistakenly rated Positive by GPT-4 because it missed critical details (e.g., misinterpreting harm or context). Overall, few-shot examples did not consistently boost performance, likely due to high intra-category variance in human ratings.
Discussion
Findings indicate GPT-4 effectively approximates human emotional ratings for non-facial images, supporting its use to automate selection and validation of emotion-elicitation stimuli. High correlations and modest MAEs suggest reliable numeric alignment, and Likert-scale results show strong performance for clearly positive/negative valence and for calm/stimulated arousal states. This addresses the research goal by demonstrating that a general-purpose LLM, without task-specific training, can closely mirror human assessments across diverse image content. However, discrepancies emerge for subtle or context-dependent cues, and categories like Snakes/Spiders or complex human social scenes exhibit variability. GPT-4 sometimes misses crucial visual details, as reflected in a few misinterpretations in generated descriptions (e.g., failing to recognize injury or coercion), which can invert or dampen emotional judgments. A notable misalignment appears for arousal in Neutral images, where GPT-4 often selects Calm, suggesting a different internal mapping of arousal intensity compared to human raters. GPT-4 generally performed slightly better on direct images than on textual descriptions, indicating potential loss of nuanced context in text-only inputs. Few-shot prompting provided inconsistent benefits, particularly for negative categories, likely due to heterogeneous human responses within category labels. Despite these challenges, the model’s overall performance and error distributions (differences peaking near zero) reinforce its practical utility in scalable, standardized emotion stimulus workflows.
Conclusion
GPT-4 can automate aspects of emotion elicitation from visual stimuli, closely approximating human ratings for valence and arousal under zero-shot and few-shot conditions. This can reduce the time, cost, and subjectivity of traditional stimulus validation. Nonetheless, the model struggles with nuanced, context-heavy imagery and shows arousal-scale misalignments, and few-shot prompting does not consistently improve outcomes. Future work should: (1) expand to additional affective dimensions and datasets, (2) compare multiple LLMs and open-source models, (3) investigate richer, context-aware prompting strategies, (4) integrate multimodal information and more diverse emotional corpora, and (5) develop methods (e.g., justification prompts) to detect and mitigate hallucinations and misinterpretations.
Limitations
Limitations include: (1) use of specific 0–100 valence/arousal scales and 3-point mappings that are not common in everyday emotion discourse and may not align with GPT-4’s training distributions; (2) focus primarily on GPT-4, limiting generalizability across models; (3) closed-source nature of GPT-4 restricts transparency into training data and architecture, limiting interpretability; (4) occasional misinterpretations of image content (rare but impactful) can lead to erroneous emotional ratings; (5) arousal scale interpretation appears misaligned for Neutral images; and (6) lack of justification-based auditing—future work could pair ratings with model rationales to detect inconsistencies and reduce hallucinations.