Affect Recognition in Conversations Using Large Language Models

S. Feng, G. Sun, et al.

This study, by Shutong Feng, Guangzhi Sun, Nurul Lubis, Wen Wu, Chao Zhang, and Milica Gašić, evaluates large language models' ability to recognize emotions, moods, and feelings across open-domain and task-oriented dialogues using IEMOCAP, EmoWOZ, and DAIC-WOZ, exploring zero-shot, few-shot, and fine-tuning setups as well as the impact of ASR errors on LLM predictions.

Introduction
The paper examines whether large language models can recognize human affect (emotions, moods, feelings) in conversational settings at a level comparable to humans and supervised systems. Affect recognition is crucial for conversational AI to move beyond information exchange toward empathetic, contextually sensitive interactions. The study focuses on two dialogue types—open-domain chit-chat and task-oriented dialogues (ToDs)—and on different affect targets: (1) categorical basic emotions, (2) domain-specific emotion labels tailored to ToDs, and (3) binary depression detection aligned with PHQ-8 thresholds. The authors evaluate LLMs under zero-shot, few-shot in-context learning (ICL), and task-specific fine-tuning setups, and also assess robustness to ASR errors to reflect spoken-dialogue pipelines. The work aims to clarify the extent and conditions under which LLMs exhibit reliable affect recognition across heterogeneous dialogue contexts and label schemes, and to identify pathways (prompting, ICL, fine-tuning) to bridge performance gaps with supervised state of the art (SOTA).
Literature Review
Related work covers (i) LLMs: modern transformer-based foundation models (e.g., GPT family, LLaMA series) exhibit emergent abilities like few-shot prompting and chain-of-thought, and can serve as back-ends in dialogue systems; (ii) Affective capabilities of LLMs: prior studies probed empathy, emotional intelligence (EQ), and sentiment analysis abilities of LLMs, and evaluated emotional dialogue understanding/generation in models like ChatGPT. This paper differs by systematically evaluating text-based LLMs for affect recognition across multiple dialogue types (open-domain, ToD, clinical) and tasks (emotion classification, depression detection), considering zero-/few-shot prompting, fine-tuning, and ASR-induced input noise.
Methodology
Problem framing: The authors proxy affect recognition ability by asking LLMs to verbalize an emotion or depression label given the dialogue context. For open-source LLMs, the output distribution is constrained to a predefined label set and the argmax class token is taken as the prediction; for commercial APIs, which expose no logits, predictions are extracted with regular expressions guided by the prompt format. (A Python sketch of both routines follows at the end of this section.)

Pipeline: Optional speech input is first transcribed by Whisper-medium (beam size 3; punctuation removed; see the transcription sketch below). The prompt then comprises (1) a task definition with the label list and definitions (the EmoWOZ and DAIC-WOZ prompts include label explanations and PHQ-8 severity criteria, while IEMOCAP uses generic emotion labels without extra definition), (2) optional ICL examples, and (3) a query tailored to utterance-level (IEMOCAP, EmoWOZ) or dialogue-level (DAIC-WOZ) classification. The model outputs a label token, which is post-processed into the final prediction.

Task-specific fine-tuning: Parameter-efficient fine-tuning via LoRA is applied to the open-source LLMs (LLaMA-7B, Alpaca-7B, LLaMA-3-8B); a configuration sketch follows below. LoRA factorizes the weight update as ΔW = AB with low rank r, freezing the backbone weights to reduce trainable parameters and memory. Models are trained with next-token prediction over prompt-plus-label sequences using cross-entropy loss. Approximate compute: ~30 GB of memory and ~4 hours per epoch on EmoWOZ with an Nvidia A100 40GB.

Datasets and evaluation:
- IEMOCAP: 151 dialogues, 10,039 utterances across 5 sessions, with multi-annotator labels drawn from 9 emotions. Evaluated with 5-fold leave-one-session-out cross-validation in two setups: 4-way (Sad, Neutral, Angry, Happy with Excited merged in) and 5-way (Sad, Neutral, Angry, Happy, Other). Metrics: weighted accuracy (WA) and unweighted accuracy (UA) at the utterance level.
- EmoWOZ: 10,438 human-human MultiWOZ dialogues plus 1,000 human-machine ToD dialogues. User emotion labels (Neutral, Fearful, Dissatisfied, Apologetic, Abusive, Excited, Satisfied) also encode task performance. Utterance-level evaluation with macro-F1 (MF1) and weighted-F1 (WF1) excluding Neutral; per-class results including Neutral are reported in the appendix.
- DAIC-WOZ: 193 clinical interviews with PHQ-8 scores; dev: 35 interviews (12 depressed), test: 47 (14 depressed). Binary classification at the dialogue level, with PHQ-8 ≥ 10 labeled Depressed; prompts include the PHQ-8 severity definitions. Metric: F1 for the Depressed class (dev/test); Not Depressed F1 is in the appendix.

Models: GPT-2 (124M), LLaMA-7B, Alpaca-7B, LLaMA-3-8B, and the June 13, 2023 snapshots of GPT-3.5 and GPT-4. Supervised SOTA references for comparison: Wu et al. (2020) for IEMOCAP, Feng et al. (2023) for EmoWOZ, and Wu et al. (2023) for DAIC-WOZ.

Training/decoding specifics: LoRA is applied to the open-source models but not to GPT-2, and for open-source decoding the output space is constrained to the label tokens. ASR: Whisper-medium yields WERs of 12.0% (IEMOCAP) and 16.5% (DAIC-WOZ). Since EmoWOZ is text-only, ASR errors are simulated with a text-to-text generation model trained on IEMOCAP transcript→Whisper-hypothesis pairs, giving a simulated WER of 17.1%.
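For concreteness, the two prediction routes could look like the following minimal Python sketch using Hugging Face transformers; the checkpoint name, the single-token-label assumption, and the helper names are illustrative, not the paper's exact implementation.

```python
# Minimal sketch of label-constrained prediction, assuming a Hugging Face causal LM
# and labels whose first token is discriminative; checkpoint choice is a stand-in.
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

LABELS = ["sad", "neutral", "angry", "happy"]  # IEMOCAP 4-way label set

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")  # stand-in checkpoint
model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")
model.eval()

def predict_label(prompt: str) -> str:
    """Constrain the next-token distribution to the label set and take the argmax."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    # Score only the first token of each candidate label.
    label_ids = [tokenizer.encode(label, add_special_tokens=False)[0] for label in LABELS]
    return LABELS[int(next_token_logits[label_ids].argmax())]

def extract_label(api_response: str) -> str | None:
    """For API models without logits: pull the first label mentioned in the response."""
    pattern = re.compile("|".join(re.escape(label) for label in LABELS), re.IGNORECASE)
    match = pattern.search(api_response)
    return match.group(0).lower() if match else None
```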
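The transcription front-end could be reproduced along these lines with the openai-whisper package; Whisper-medium and beam size 3 follow the setup described above, while the function name and the punctuation-stripping details are assumptions.

```python
# Sketch of the ASR front-end: Whisper-medium, beam size 3, punctuation removed.
import string
import whisper

asr_model = whisper.load_model("medium")

def transcribe_clean(audio_path: str) -> str:
    # beam_size is forwarded to Whisper's decoding options.
    result = asr_model.transcribe(audio_path, beam_size=3)
    # Strip punctuation to match the normalization described above.
    return result["text"].translate(str.maketrans("", "", string.punctuation)).strip()
```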
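Likewise, a minimal LoRA setup with the Hugging Face peft library might look as follows; the rank, scaling factor, and target modules are assumed values, not the paper's reported hyperparameters.

```python
# Sketch of the LoRA fine-tuning setup: the backbone is frozen and only the low-rank
# factors A and B of the update ΔW = AB are trained with the usual LM cross-entropy.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")  # stand-in
lora_config = LoraConfig(
    r=8,                                  # low rank r (assumed value)
    lora_alpha=16,                        # scaling factor (assumed value)
    target_modules=["q_proj", "v_proj"],  # attention projections, a common choice
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # confirms only the LoRA factors are trainable
# Training then minimizes next-token cross-entropy over prompt-plus-label sequences.
```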
Key Findings
Zero-shot performance (Table 2):
- IEMOCAP: the best zero-shot results among the LLMs come from Alpaca-7B, with WA/UA of 48.8/51.4 (4-way) and 40.5/36.2 (5-way), far below the supervised SOTA (4-way WA/UA 77.6/78.4; 5-way 73.3/74.4).
- EmoWOZ: GPT-4 leads with MF1 52.4 and WF1 62.3, below the supervised SOTA (MF1 65.9, WF1 83.9) but comparable to several supervised baselines reported in prior work.
- DAIC-WOZ: GPT-3.5 attains the best test F1 for Depressed at 64.3 (dev 54.5), while GPT-4 reaches dev 63.6/test 59.3; both remain well below the supervised SOTA (dev 88.6, test 85.7).
- Model size trends: larger does not always mean better (e.g., Alpaca-7B tops IEMOCAP). On EmoWOZ and DAIC-WOZ, the larger GPT-3.5/GPT-4 generally perform better, likely owing to stronger reasoning over nuanced labels and long contexts.
- Instruction tuning (Alpaca-7B) improves prompt compliance and emotion recognition in conversations (ERC) relative to the base LLaMA-7B across datasets.

Robustness to ASR errors (Table 3):
- Emotion recognition (IEMOCAP, EmoWOZ): metric changes are generally small; IEMOCAP 4-way WA shifts within roughly ±2 points for most models, and EmoWOZ MF1/WF1 changes are minor (e.g., GPT-4 MF1 +0.9, WF1 −1.5).
- Depression detection (DAIC-WOZ): ASR noise hurts more, especially for the larger commercial models with their long, noisy prompts. GPT-3.5 drops 17.0 (dev) and 8.3 (test) F1; GPT-4 drops 19.2 (dev) and 17.6 (test). Smaller open-source models show minor changes (e.g., Alpaca-7B's test F1 is unchanged).

In-context learning (Table 4):
- Larger models benefit more from additional ICL examples: GPT-4's IEMOCAP 4-way WA improves from 42.4 (N=0) to 63.8 (N=3), EmoWOZ WF1 from 62.3 to 66.5 (N=3), and DAIC-WOZ dev/test F1 improves at N=1 or N=3, though with some instability.
- Smaller open-source models do not gain consistently from more ICL examples; their best results often come at N=0 or N=1, indicating difficulty in leveraging long contexts effectively.
- For DAIC-WOZ, N=1 often biases predictions toward Not Depressed, yielding zero F1 for Depressed; N=3 can mitigate this but remains limited by the extreme input length.

Task-specific fine-tuning (Figure 2; Appendix):
- With LoRA fine-tuning, LLaMA-7B/Alpaca-7B/LLaMA-3-8B achieve substantial gains using only 25–50% of the training data. IEMOCAP 4-way reaches WA/UA of roughly 76/76 with 50–100% of the data, approaching the supervised SOTA; 5-way remains below SOTA, likely due to the semantically weak "Other" label.
- EmoWOZ WF1 approaches 80–83 with 50–100% of the data for the open-source LLMs, close to the SOTA WF1 of 83.9. GPT-2 shows limited or even degraded gains (e.g., predominantly predicting Neutral).
- DAIC-WOZ fine-tuning tends to overfit to the majority class (Not Depressed), yielding near-zero F1 for Depressed, reflecting the label imbalance and the task's difficulty under the LM objective.

Additional qualitative observations:
- LLaMA-7B over-predicts Neutral on EmoWOZ in the zero-shot setting (Neutral recall of 100%), harming the metrics that exclude Neutral.
- For LLaMA-7B on IEMOCAP 5-way, ASR errors shift predictions toward Other, with a large recall increase for Other (47.3→78.2), recall drops for Happy and Angry, and a collapse in Sad precision (65.5→0).
Discussion
Findings show that while LLMs demonstrate emerging ability to recognize affect, zero-shot performance lags behind specialized supervised models across chit-chat ERC, ToD ERC, and clinical depression detection. The gap narrows when prompts include explicit label definitions and when models are instruction-tuned (e.g., Alpaca-7B), highlighting the importance of prompt-following capabilities. Larger models better exploit ICL and longer contexts for nuanced tasks like EmoWOZ and DAIC-WOZ, but are also more sensitive to ASR noise in long prompts, particularly for depression detection. Fine-tuning open-source LLMs with parameter-efficient LoRA yields near-SOTA performance on IEMOCAP (4-way) and EmoWOZ with as little as 50% of the training data, suggesting LLMs can serve as practical ERC front-ends in dialogue systems. Persistent gaps in IEMOCAP 5-way (due to the semantically underspecified Other class) and DAIC-WOZ (label imbalance, long-context nuance) indicate that label schema design, learning objectives beyond standard LM loss, and long-context handling are crucial for further progress.
Conclusion
The study benchmarks LLMs (LLaMA-7B, Alpaca-7B, LLaMA-3-8B, GPT-3.5, GPT-4) for affect recognition across open-domain, task-oriented, and clinical dialogues under zero-shot, few-shot ICL, and task-specific fine-tuning. Zero-shot LLMs remain below supervised SOTAs, but explicit label definitions and instruction tuning help. LLMs are generally robust to ASR noise for emotion recognition but not for depression detection over long noisy contexts. Larger models gain more from additional ICL examples. With LoRA fine-tuning and only 50% of the training data, open-source LLMs approach SOTA on IEMOCAP (4-way) and EmoWOZ, underscoring their utility as ERC modules. Future research should address long-context utilization, balanced objectives for clinical detection tasks, semantically meaningful label design (reducing reliance on Other), and efficient inference for real-time systems.
Limitations
Key limitations include:
1. Inference latency and resource demands: LLM inference is slow and scales with prompt length and the number of ICL examples, challenging real-time deployment.
2. Long-context effectiveness: the effective span of attention mechanisms limits the utility of very long prompts, especially in dialogue-level depression detection.
3. Dataset scope: only one dataset per domain was used; broader affect types and dialogue settings remain unexplored.
4. Class imbalance: the datasets are imbalanced; while the supervised SOTAs used augmentation, addressing imbalance and the potential divergence of LLM-generated synthetic data from real data were beyond scope.
5. DAIC-WOZ fine-tuning: the LM objective and the small, imbalanced training set led to majority-class bias, suggesting a need for task-specific objectives.
6. External APIs: using GPT-3.5/GPT-4 involves data transfer and potential privacy considerations; although compliant with provider policies, this may limit some applications.