
Computer Science

Affect Recognition in Conversations Using Large Language Models

S. Feng, G. Sun, et al.

This study probes whether large language models can recognise human affect—emotions, moods, and feelings—in open-domain chit-chat, task-oriented, and clinical dialogues, using IEMOCAP, EmoWOZ, and DAIC-WOZ. It evaluates zero-shot and few-shot in-context learning as well as task-specific fine-tuning, and quantifies the impact of ASR errors. Authors: Shutong Feng, Guangzhi Sun, Nurul Lubis, Wen Wu, Chao Zhang, and Milica Gašić.

Introduction
The paper examines whether large language models can recognise human affect in conversations comparably to humans, across chit-chat/open-domain and task-oriented dialogues. Affect comprises emotions, moods, and feelings and is essential for empathetic, effective conversational AI. The authors frame affect recognition targets as: (1) categorical emotion classes from generic models (e.g., basic emotions), (2) custom categorical emotion classes tailored to task-oriented dialogue contexts (as in EmoWOZ), and (3) depression detection at the dialogue level (DAIC-WOZ). They use IEMOCAP, EmoWOZ, and DAIC-WOZ to assess LLMs (GPT-2, LLaMA-7B, Alpaca-7B, LLaMA-3-8B, GPT-3.5, GPT-4) under zero-shot and few-shot prompting via in-context learning (ICL), and under task-specific fine-tuning. They also evaluate robustness to automatic speech recognition (ASR) errors when LLMs are used in spoken dialogue pipelines. Contributions: evaluating LLMs on diverse dialogue types and labels (emotion classes; binary depression), probing zero-shot/few-shot ICL and fine-tuning capacities, and quantifying the impact of ASR errors on LLM predictions.
Literature Review
Related work summarises the evolution and capabilities of LLMs (GPT family, LLaMA series), including emergent abilities like few-shot prompting and chain-of-thought. While multimodal LLMs exist, text-based LLMs are frequently used as backends in pipelines with ASR. Prior affect-related studies assessed LLM empathy, emotional intelligence, zero/few-shot sentiment analysis, and emotional dialogue generation and understanding. This paper focuses specifically on affect recognition across dialogue types and modeling setups, adding an ASR-noise perspective relevant to real-world deployment.
Methodology
Affect recognition is operationalised as an LLM's ability to verbalise the expressed emotion given the dialogue context. The pipeline optionally ingests speech, transcribes it with Whisper-medium (beam size 3, English-only, punctuation removed), and then constructs a prompt comprising a task definition, in-context learning (ICL) samples, and the query, tailored to each dataset. For open-source LLMs, the probabilities of the class tokens are examined and the argmax is taken as the prediction; for commercial models, which expose no logits and may deviate from the requested output format, regular expressions are used to extract the predicted label.

Task-specific fine-tuning employs LoRA, which efficiently adapts the attention projection matrices by learning low-rank updates (ΔW = AB) while freezing the pretrained weights, reducing trainable parameters and memory. LoRA is applied to LLaMA-7B, Alpaca-7B, and LLaMA-3-8B; GPT-2 is fine-tuned without LoRA. Models are trained with next-token prediction over the prompt-plus-label tokens using a cross-entropy loss; one EmoWOZ epoch required approximately 30 GB of memory and 4 hours on an Nvidia A100 40GB. Training configuration: learning rate 3e-5, batch size 2, gradient accumulation 4, cosine scheduler, weight decay 0.01 (excluding bias and layernorm parameters), FP16 for open-source LLMs, early stopping on development-set perplexity, and stratified sampling for the IEMOCAP development sets.

Datasets and evaluation: IEMOCAP (utterance-level emotion recognition in conversation (ERC) with 9 original label classes, evaluated in 4-way and 5-way settings; leave-one-session-out cross-validation; weighted and unweighted accuracy reported), EmoWOZ (user-utterance ERC with seven custom labels designed to encode task performance; macro and weighted F1 reported excluding Neutral, with Neutral metrics given separately), and DAIC-WOZ (dialogue-level binary depression detection with a PHQ-8 threshold of 10, aligned with AVEC2016; F1 for the Depressed class reported). Prompts include label explanations for EmoWOZ and DAIC-WOZ (mapping PHQ-8 severities), while IEMOCAP uses generic emotion labels without additional definitions.

Models evaluated: GPT-2 (124M), GPT-3.5 (ChatGPT), GPT-4, LLaMA-7B, Alpaca-7B (an instruction-following fine-tune of LLaMA-7B), and LLaMA-3-8B. Supervised SOTA references are dataset-specific multimodal or textual architectures included for contextual comparison. ASR specifications: Whisper-medium yields WERs of 12.0% (IEMOCAP) and 16.5% (DAIC-WOZ); EmoWOZ uses an ASR simulator trained to generate Whisper-like hypotheses, with a WER of 17.1%. Hedged code sketches of the ASR and prompting front end, the label-extraction step, the LoRA training configuration, and the evaluation metrics follow below.
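As a rough illustration of the speech front end and prompt construction described above, the sketch below uses the openai-whisper package with the reported decoding settings; the prompt template, label set, and helper names are illustrative assumptions rather than the authors' exact implementation.

```python
import string
import whisper  # pip install openai-whisper

# Whisper-medium, English-only decoding with beam size 3, as reported in the paper.
asr_model = whisper.load_model("medium")

def transcribe(audio_path: str) -> str:
    result = asr_model.transcribe(audio_path, language="en", beam_size=3)
    text = result["text"].strip()
    # Remove punctuation, mirroring the paper's preprocessing.
    return text.translate(str.maketrans("", "", string.punctuation))

# Hypothetical prompt builder: task definition + ICL samples + query.
TASK_DEFINITION = (
    "Classify the emotion of the last utterance as one of: "
    "neutral, happy, sad, angry."
)

def build_prompt(icl_samples: list[tuple[str, str]], query_utterance: str) -> str:
    parts = [TASK_DEFINITION]
    for utterance, label in icl_samples:  # N few-shot examples; N may be 0
        parts.append(f"Utterance: {utterance}\nEmotion: {label}")
    parts.append(f"Utterance: {query_utterance}\nEmotion:")
    return "\n\n".join(parts)
```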
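The label-extraction step can be sketched as follows: for an open-weight model, compare the next-token probabilities of the candidate label words and take the argmax; for an API model that returns only text, fall back to a regular expression. The GPT-2 checkpoint, label list, and single-token simplification are illustrative.

```python
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

LABELS = ["neutral", "happy", "sad", "angry"]  # example 4-way label set

tok = AutoTokenizer.from_pretrained("gpt2")    # smallest model from the paper, for illustration
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def predict_open_source(prompt: str) -> str:
    """Argmax over next-token probabilities of (the first token of) each label word."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = lm(**inputs).logits[0, -1]
    label_ids = [tok(" " + lab, add_special_tokens=False).input_ids[0] for lab in LABELS]
    scores = next_token_logits[label_ids]
    return LABELS[int(scores.argmax())]

def predict_commercial(model_output: str) -> str | None:
    """Regex fallback for API models whose logits are unavailable."""
    match = re.search(r"\b(" + "|".join(LABELS) + r")\b", model_output.lower())
    return match.group(1) if match else None
```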
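A minimal sketch of the LoRA fine-tuning setup with the Hugging Face peft and transformers libraries, using the hyperparameters listed above (learning rate 3e-5, batch size 2, gradient accumulation 4, cosine schedule, weight decay 0.01, FP16). The checkpoint name, LoRA rank and scaling, target module names, and dataset objects are assumptions, not values stated in the paper.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

base = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")  # placeholder checkpoint

# Low-rank adapters on the attention projections; the pretrained weights stay frozen.
lora_cfg = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,   # assumed rank/scaling values
    target_modules=["q_proj", "v_proj"],     # attention projection matrices
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()           # only the low-rank updates are trainable

args = TrainingArguments(
    output_dir="lora-affect",
    learning_rate=3e-5,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    lr_scheduler_type="cosine",
    weight_decay=0.01,                       # Trainer excludes bias/LayerNorm from decay by default
    fp16=True,
)

# Standard causal-LM training: cross-entropy over prompt-plus-label tokens.
# trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=dev_ds)
# trainer.train()   # early stopping on dev perplexity handled in the evaluation loop
```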
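The reported metrics can be computed with scikit-learn along the following lines; the label names and toy arrays are illustrative. Here weighted accuracy (WA) is taken as overall accuracy and unweighted accuracy (UA) as the mean per-class recall, the usual IEMOCAP convention.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Toy gold and predicted labels for an EmoWOZ-style setup.
y_true = ["neutral", "satisfied", "dissatisfied", "excited", "neutral", "fearful"]
y_pred = ["neutral", "satisfied", "neutral", "excited", "satisfied", "fearful"]

# IEMOCAP-style metrics: WA = overall accuracy, UA = mean per-class recall.
wa = accuracy_score(y_true, y_pred)
ua = balanced_accuracy_score(y_true, y_pred)

# EmoWOZ-style metrics: macro / weighted F1 over the emotion classes, excluding Neutral.
emotion_labels = ["satisfied", "dissatisfied", "excited", "fearful", "apologetic", "abusive"]
mf1 = f1_score(y_true, y_pred, labels=emotion_labels, average="macro", zero_division=0)
wf1 = f1_score(y_true, y_pred, labels=emotion_labels, average="weighted", zero_division=0)

# DAIC-WOZ-style metric: F1 for the Depressed class (1 = Depressed).
dep_true = [0, 1, 0, 1, 0]
dep_pred = [0, 1, 0, 0, 0]
dep_f1 = f1_score(dep_true, dep_pred, pos_label=1)
```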
Key Findings
Zero-shot performance across datasets: (a) IEMOCAP (4-way), Alpaca-7B is best with WA 48.8 and UA 51.4; IEMOCAP (5-way), Alpaca-7B is best with WA 40.5 and UA 36.2; (b) EmoWOZ, GPT-4 is best with MF1 52.4 and WF1 62.3; (c) DAIC-WOZ, GPT-4 reaches dev F1 63.6 and GPT-3.5 test F1 64.3. Supervised SOTAs remain substantially higher: IEMOCAP (4-way) WA 77.6, UA 78.4; IEMOCAP (5-way) WA 73.3, UA 74.4; EmoWOZ MF1 65.9, WF1 83.9; DAIC-WOZ dev F1 88.6, test F1 85.7.

Larger models do not universally outperform smaller ones in zero-shot ERC (e.g., Alpaca-7B surpasses GPT-3.5/4 on IEMOCAP), but for tasks requiring more reasoning or longer-range context (EmoWOZ, DAIC-WOZ) performance tends to increase with size (GPT-4 strongest on EmoWOZ; GPT-3.5/4 strongest on DAIC-WOZ). Instruction-following tuning (Alpaca-7B) consistently improves zero-shot ERC over the base LLaMA models, indicating better prompt utilisation.

ASR noise: emotion recognition is generally robust, with metric changes small and comparable to the sensitivity of supervised SOTA systems. The notable exception is LLaMA-7B in the IEMOCAP 5-way setting, where UA drops by 5.0 points and misclassifications shift towards Other (e.g., recall for Happy falls 12.3→7.3 and for Angry 50.0→11.0, precision for Sad drops 65.5→0.0, while Other recall rises 47.3→78.2). Depression detection is more sensitive to ASR errors, especially for larger commercial models with their longer prompts: GPT-4 loses 19.2 points of DAIC-WOZ dev F1 and 17.6 of test F1, GPT-3.5 loses 17.0 (dev) and 8.3 (test), whereas smaller open-source models degrade less.

In-context learning: larger models benefit more from additional ICL samples. GPT-4 improves from IEMOCAP 4-way WA 42.4 (N=0) to 63.8 (N=3), reaches IEMOCAP 5-way WA 49.4 (N=3), and EmoWOZ WF1 66.5 (N=3). GPT-3.5 similarly rises to IEMOCAP 4-way WA 62.1 (N=3), 5-way WA 48.3 (N=1 or 3), and EmoWOZ WF1 46.7 (N=3). Smaller open-source models often peak at N=0 or N=1 and can become biased at N=1 on DAIC-WOZ (predicting all samples as Not Depressed).

Task-specific fine-tuning yields substantial gains with modest data. IEMOCAP 4-way: LLaMA-7B reaches WA/UA of about 76.0/76.4 with 50% of the training data, Alpaca-7B about 75.6/75.8 at 50%, and LLaMA-3-8B about 79.8/80.0 at 100%, approaching the supervised SOTA. EmoWOZ: LLaMA-7B WF1 83.1 and MF1 66.4 (50%); Alpaca-7B WF1 83.2 and MF1 64.5 (50%); LLaMA-3-8B WF1 82.9 and MF1 68.7 (100%), close to the SOTA WF1 83.9 and MF1 65.9. IEMOCAP 5-way remains below SOTA even after full fine-tuning (e.g., LLaMA-3-8B WA 69.5 and UA 68.7 vs SOTA WA 73.3 and UA 74.4), likely because of the semantically uninformative Other class. Fine-tuning on DAIC-WOZ tends to bias models towards Not Depressed due to class imbalance (>70% Not Depressed), highlighting the limitations of a pure language-modelling objective for this task.
Discussion
The study shows LLMs can recognise affect to a useful degree across dialogue types, but zero-shot performance lags behind specialised supervised systems, particularly for depression detection. For emotion recognition, larger LLMs leverage label definitions and context better (notably GPT-4 in EmoWOZ), and are relatively robust to moderate ASR errors, supporting their integration into spoken dialogue pipelines for emotion-aware interactions. Depression detection demands longer-range reasoning over nuanced cues and is more vulnerable to ASR noise; long prompts exacerbate error accumulation for larger models. Few-shot ICL improves performance markedly for larger models, indicating synergy between model capacity and example-based guidance, whereas smaller open-source LLMs struggle to effectively use lengthy context and may be biased by minimal ICL. Task-specific fine-tuning with efficient adapters (LoRA) rapidly elevates ERC performance toward SOTA with as little as 25–50% data, suggesting practical pathways to deploy LLMs as modular emotion recognisers in dialogue systems. Persistent gaps in IEMOCAP 5-way imply that semantically vague labels (Other) hinder LLM utilisation of language priors; more informative label taxonomies may further close gaps. Overall, the findings answer the core question: LLMs can recognise human affect in conversations to a significant extent, yet matching expert supervised systems requires fine-tuning, careful prompt design, and attention to task specifics and input noise.
Conclusion
The paper benchmarks LLMs for affect recognition in chit-chat, task-oriented, and clinical dialogue settings under zero-shot, few-shot ICL, and task-specific fine-tuning, with attention to ASR-induced noise. Key contributions include demonstrating that: (1) zero-shot LLMs remain behind supervised SOTAs, but instruction-following tuning and clear label definitions help; (2) emotion recognition is relatively robust to ASR errors, whereas depression detection is not; (3) larger models benefit more from ICL; and (4) LoRA-based fine-tuning yields near-SOTA ERC with limited data. Future research directions include improving long-context effectiveness and scalability, designing task-aligned learning objectives for depression detection, developing semantically meaningful label sets to better exploit language priors, addressing class imbalance robustly, exploring multi-dataset and multi-domain evaluations, and enhancing inference efficiency for real-time applications.
Limitations
Inference with LLMs is relatively slow and scales with prompt length and the number of ICL samples, which challenges real-time use. Limited attention spans hinder effective use of long dialogue contexts, especially for depression detection. The study uses one dataset per dialogue domain, leaving broader affect types and settings to future work. Class imbalance in the datasets (notably DAIC-WOZ) affects results; supervised reference systems use augmentation to mitigate imbalance, but such augmentation and the issues around synthetic data are out of scope here. Fine-tuning GPT-2 did not improve performance and sometimes hurt it, owing to a bias towards predicting Neutral. Practical considerations include computational resource demands and, for server-based models, data transfer and privacy policies.