On Multimodal Emotion Recognition for Human-Chatbot Interaction in the Wild

Computer Science

N. Kovačević, M. Gross, et al.

A study by Nikola Kovačević, Markus Gross, Christian Holz, and Rafael Wampfler collected multimodal text, audio, and video from 99 participants interacting with GPT-3-based chatbots over three weeks. It uncovered a strong domain gap between human–human and human–chatbot emotion signals and showed that user-specific personalization can boost recognition performance by up to 38% for user emotions and 41% for perceived chatbot emotions.

Introduction
The study investigates emotion recognition in conversation (ERC) for real-world human–chatbot interactions, a setting increasingly prevalent across applications such as personal assistance, customer service, education, healthcare, and entertainment. Unlike human–human ERC models and datasets, which are often based on acted or controlled lab scenarios with external annotations, human–chatbot interactions involve subtle, subjective emotions and missing modalities on the chatbot side (no audio/video), raising questions about the transferability of existing methods. The research aims to (1) assess the domain gap between human–human and human–chatbot ERC; (2) build multimodal ERC models that operate under the human–chatbot paradigm; and (3) evaluate the effectiveness of user-specific personalization in mitigating performance challenges. The importance lies in enabling conversational agents to perceive and adapt to users’ emotional states, thereby improving personalization, engagement, and applicability in real-world contexts.
Literature Review
ERC has been explored across modalities: text (semantic cues captured effectively by transformer encoders like BERT/RoBERTa), audio (vocal features such as pitch, power, chroma encoded via spectrograms and transformer-based audio models), and video (facial expressions and features via CNNs and affective datasets). Emotion taxonomies include dimensional models such as Valence–Arousal–Dominance (VAD), typically assessed via Self-Assessment Manikin (SAM) on 9-point scales, and categorical models (e.g., Ekman’s six basic emotions). Prior datasets (DailyDialog, MELD, IEMOCAP) predominantly comprise acted/scripted human–human dialogues, collected in lab settings with external annotations, potentially misaligned with intrinsic emotions and lacking the dynamics of human–chatbot interactions. Modeling approaches span text-only and multimodal architectures: context/speaker-sensitive transformers (HiTrans), graph convolutional networks capturing inter/intra-speaker dependencies (DialogueGCN, hierarchical stacked GCN), hierarchical transformers, and transfer learning via pre-trained dialogue models. Multimodal fusion strategies include concatenation, inter-modality attention (IMA), triplet networks, CNN-based fusion, and dynamic memory networks with speaker state modeling (DialogueRNN, A-DMN). However, these methods often assume full modalities for all speakers and human–human conversational dynamics, which do not hold in real-world chatbot interactions where only chatbot text is available and interaction styles differ.
Methodology
Data Collection: A web-based framework recorded text, audio, and webcam video from English-speaking participants during speech-based interactions with GPT-3 chatbots over three weeks. Three chatbot personae (Sarah, Vincent, Albert) varied in attributes and emotional tones. Self-reports assessing both users' own emotions and perceived chatbot emotions were prompted every 90 seconds, using SAM to rate VAD on 9-point scales, with a binary neutral flag for calibration. Face detection (SSD MobileNet V1 via face-api.js) validated video input; conversations paused if no face was detected.

Participants: 108 participants were initially recruited (56 female, 52 male), aged 18–52 (mean 25.1, SD 4.6), and asked to interact on at least 10 days over the 3 weeks (average 11 days, SD 2). Incentives included compensation tied to completed self-reports (CHF 60 for ≥24, CHF 110 for ≥48), gamification, and a CHF 1,000 lottery.

Preprocessing: From 2,734 conversations and 9,292 self-reports, data were cleaned by excluding (1) constant VAD ratings, (2) self-reports completed in under 10 seconds, and (3) recordings with functional issues (e.g., conversations paused because no face was detected). The final dataset comprises 1,725 conversations and 8,003 self-reports from 99 participants. Per-user label normalization centered ratings on each participant's self-reported neutral level and used a ±1 SD band to map continuous VAD ratings to three classes (low/medium/high).

Validation & Descriptives: Average engagement time per user was 4 h 18 min (SD 2 h 48 min). Average conversation length was 12.7 minutes (SD 10.2), with 4.7 self-reports (SD 2.6) per conversation. Chatbot selection was balanced (Sarah 37.1%, Albert 35.8%, Vincent 27.1%). Users' and perceived chatbot emotions correlated significantly: valence r=0.59 (p<0.01), arousal r=0.59 (p<0.01), dominance r=-0.18 (p<0.01).

Model Architecture: Two transformer-based multimodal classification networks predict self-reported user emotions and perceived chatbot emotions across VAD classes (low/medium/high), using context windows of W=4 (8 utterances). Modalities are user text, audio, and video, plus chatbot text only (no chatbot audio/video). Feature extraction uses pre-trained encoders: RoBERTa for text (768-d per utterance); DistilHuBERT for audio (20 ms windows at 16 kHz yielding frame sequences, reduced via a transformer encoder and CLS token to 768-d per utterance); and EfficientNet (pre-trained on AffectNet) for video (face-cropped, normalized 226×226 frames sampled at 6 fps; 1,280-d encoder outputs, likewise reduced via a transformer encoder and CLS token). Speaker-specific embeddings are added to the text features.

Unimodal Models: Text is modeled with three parallel Bi-GRUs capturing user state, chatbot state, and global context (MultiGRU). Audio and video are each modeled with a single Bi-GRU (SingleGRU), since chatbot audio/video are absent. Classification is performed by a feed-forward network.

Multimodal Models: Two fusion strategies are compared: (1) Multimodal-Concat, which concatenates unimodal features; and (2) Multimodal-IMA (inter-modality attention), in which each modality pair (X, Y) passes through attention blocks (X as query, Y as key/value), followed by Hadamard products over pairs sharing the query modality and a final concatenation. For multimodal training, branches are initialized from the best unimodal weights.

Training & Evaluation: Data were split into train/validation/test sets with disjoint users and similar class distributions: train 79 users / 1,364 conversations / 25,478 utterances / 6,383 self-reports; validation 10 / 166 / 3,305 / 755; test 10 / 185 / 3,568 / 865. Models were trained on an Nvidia RTX 3090 (24 GB VRAM). The evaluation metric is macro F1 across classes.
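As a concrete illustration of the per-user label normalization described in the preprocessing step above, the sketch below maps a participant's continuous 9-point SAM ratings for one VAD dimension to low/medium/high classes using their self-reported neutral level and a ±1 SD band. Function names and the exact thresholding rule are assumptions, not the authors' code.

```python
import numpy as np

def normalize_vad_labels(ratings, neutral_level):
    """Map one participant's 9-point SAM ratings for a single VAD dimension
    to three classes, centered on that participant's neutral calibration.

    ratings       : continuous self-reports on the 1-9 scale
    neutral_level : the participant's self-reported neutral rating
    Returns an array of class labels: 'low', 'medium', or 'high'.
    """
    ratings = np.asarray(ratings, dtype=float)
    centered = ratings - neutral_level   # shift so the user's neutral sits at 0
    band = np.std(centered)              # +/- 1 SD defines the 'medium' band (assumed rule)

    labels = np.full(ratings.shape, "medium", dtype=object)
    labels[centered < -band] = "low"
    labels[centered > band] = "high"
    return labels

# Example: a user whose neutral valence is 5 and who mostly reports 4-7.
print(normalize_vad_labels([4, 5, 6, 7, 9, 2, 5], neutral_level=5))
```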
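To make the feature-extraction step concrete, here is a hedged sketch using publicly available Hugging Face checkpoints (roberta-base, ntu-spml/distilhubert) as stand-ins for the text and audio encoders described above; the exact checkpoints and pooling are assumptions, and the video branch (EfficientNet pre-trained on AffectNet) is omitted here.

```python
import torch
from transformers import AutoTokenizer, AutoModel, AutoFeatureExtractor

# Public checkpoints standing in for the encoders named in the summary;
# not necessarily the ones used by the authors.
text_tok = AutoTokenizer.from_pretrained("roberta-base")
text_enc = AutoModel.from_pretrained("roberta-base")
audio_fe = AutoFeatureExtractor.from_pretrained("ntu-spml/distilhubert")
audio_enc = AutoModel.from_pretrained("ntu-spml/distilhubert")

@torch.no_grad()
def encode_text(utterance: str) -> torch.Tensor:
    inputs = text_tok(utterance, return_tensors="pt", truncation=True)
    # First-token embedding as a 768-d utterance representation.
    return text_enc(**inputs).last_hidden_state[:, 0]        # (1, 768)

@torch.no_grad()
def encode_audio(waveform, sample_rate: int = 16_000) -> torch.Tensor:
    # waveform: 1-D float array of raw audio samples at 16 kHz.
    inputs = audio_fe(waveform, sampling_rate=sample_rate, return_tensors="pt")
    frames = audio_enc(**inputs).last_hidden_state           # (1, num_frames, 768)
    # The paper reduces the frame sequence with a transformer encoder and a
    # CLS token; mean pooling is used here as a simple stand-in.
    return frames.mean(dim=1)                                 # (1, 768)

print(encode_text("I really enjoyed talking to Sarah today.").shape)
```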
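The IMA fusion described above admits more than one concrete implementation; the following is a minimal PyTorch sketch of one plausible reading, not the authors' code: every ordered modality pair (X, Y) gets a cross-attention block with X as query and Y as key/value, outputs sharing the same query modality are combined with a Hadamard product, and the per-modality results are concatenated. Dimensions, head counts, and the module structure are assumptions.

```python
import torch
import torch.nn as nn

class IMAFusion(nn.Module):
    """One reading of inter-modality attention (IMA) fusion: for every ordered
    modality pair (X, Y), X attends to Y; attended outputs that share the same
    query modality are multiplied element-wise (Hadamard product), and the
    per-modality results are concatenated."""

    def __init__(self, modalities, dim=768, heads=8):
        super().__init__()
        self.modalities = modalities
        self.attn = nn.ModuleDict({
            f"{x}->{y}": nn.MultiheadAttention(dim, heads, batch_first=True)
            for x in modalities for y in modalities if x != y
        })

    def forward(self, feats):
        # feats: dict mapping modality name -> (batch, seq_len, dim) features
        fused = []
        for x in self.modalities:
            attended = [
                self.attn[f"{x}->{y}"](feats[x], feats[y], feats[y])[0]
                for y in self.modalities if y != x
            ]
            combined = attended[0]
            for a in attended[1:]:
                combined = combined * a          # Hadamard product over shared-query pairs
            fused.append(combined)
        return torch.cat(fused, dim=-1)          # (batch, seq_len, dim * num_modalities)

# Usage: fuse user text, audio, and video features (batch=2, window=8, dim=768).
fusion = IMAFusion(["text", "audio", "video"])
feats = {m: torch.randn(2, 8, 768) for m in ["text", "audio", "video"]}
print(fusion(feats).shape)   # torch.Size([2, 8, 2304])
```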
Personalization: User-specific fine-tuning was applied to the best unimodal (text) and multimodal (text+audio) models using n self-reports per user (n ∈ {0, 10, 30, 50}) and evaluated on each user's remaining self-reports.

Runtime: For W=4, per-branch timings are: text 0.01 s (±0.001 s), audio 0.16 s (±1.45 s), video 0.34 s (±1.07 s), Bi-GRU+classifier 0.01 s (±0.003 s); in total, 0.52 s (±1.83 s) per forward pass on an RTX 3090, suitable for real-time ERC.
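As a rough illustration of the personalization protocol just described, the sketch below fine-tunes a pre-trained ERC model on a user's first n labeled self-reports and evaluates macro F1 on the remainder. The model interface, optimizer, and hyperparameters are assumptions, not the authors' settings.

```python
import torch
from sklearn.metrics import f1_score

def personalize(model, user_feats, user_labels, n, epochs=5, lr=1e-4):
    """Fine-tune a pre-trained ERC model on one user's first n labeled
    self-reports, then report macro F1 on that user's remaining self-reports.

    user_feats  : tensor of precomputed utterance/context features, shape (N, ...)
    user_labels : tensor of class indices (0=low, 1=medium, 2=high), shape (N,)
    """
    train_x, train_y = user_feats[:n], user_labels[:n]
    test_x, test_y = user_feats[n:], user_labels[n:]

    if n > 0:  # n = 0 corresponds to the non-personalized baseline
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = torch.nn.CrossEntropyLoss()
        model.train()
        for _ in range(epochs):
            optimizer.zero_grad()
            loss = loss_fn(model(train_x), train_y)
            loss.backward()
            optimizer.step()

    model.eval()
    with torch.no_grad():
        preds = model(test_x).argmax(dim=-1)
    return f1_score(test_y.numpy(), preds.numpy(), average="macro")
```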
Key Findings
Evidence of domain gap: Models trained and evaluated under the human–chatbot paradigm reveal lower utility of audio/video and challenges due to subjective labels, reduced facial expressivity, and subtle emotions.

Performance (macro F1):
- Baseline (majority class): User VAD 24.00; Chatbot VAD 22.18.
- Unimodal:
  - Text (MultiGRU): User VAD 36.68 (Valence 46.54; Arousal 31.29; Dominance 32.21); Chatbot VAD 41.86 (Valence 52.36; Arousal 36.41; Dominance 36.79).
  - Audio (SingleGRU): User VAD 29.62 (Valence 28.67; Arousal 27.95; Dominance 32.25).
  - Video (SingleGRU): User VAD 28.41 (Valence 29.38; Arousal 26.25; Dominance 29.60).
- Multimodal (best highlights):
  - Text+Audio (Concat): User VAD 41.72 (Valence 49.13; Arousal 38.26; Dominance 37.79); Chatbot VAD 41.86 (Valence 52.42; Arousal 36.91; Dominance 36.26).
  - Adding video generally did not improve performance and often degraded it.

Fusion: Inter-modality attention (IMA) underperforms or is comparable to simpler concatenation, particularly due to the missing chatbot modalities and the low predictive value of audio/video.

Correlations: Significant positive correlations between user and perceived chatbot valence and arousal (r=0.59 each, p<0.01); significant negative correlation for dominance (r=-0.18, p<0.01).

Action unit analysis: Lower variability in key facial action units (e.g., cheeks, nose, mouth-closed) compared to acted datasets (MELD, IEMOCAP), indicating reduced facial expressivity during human–chatbot interactions.

Personalization: User-specific fine-tuning yields substantial, roughly linear gains as the number of self-reports grows, with improvements of up to 38% for user emotions and up to 41% for perceived chatbot emotions.

Runtime: End-to-end inference takes ~0.52 s per context window without branch parallelization, suitable for real-time applications.
Discussion
Feature fusion: IMA is less effective than simple concatenation in human–chatbot settings due to missing modalities on the chatbot side and the dominance of text features; the Hadamard product in IMA can suppress salient text signals when audio/video carry limited predictive value.

User vs. chatbot emotion prediction: Chatbot emotion labels (externally annotated by users from text) are easier to predict than intrinsic user emotions, which may be partially unexpressed in the observed modalities; this aligns with prior findings that externally obtained labels are more predictable than self-reported intrinsic emotions.

Modality ablation: Text is the most informative modality, followed by audio; video contributes little and can degrade performance, especially for chatbot emotion prediction, where chatbot audio/video are unavailable. Nonetheless, chatbot emotion prediction remains above chance, aided by its correlation with user emotion.

Video modality: Reduced facial activation in human–chatbot interactions likely reflects diminished use of facial expressions when users believe their expressions do not affect the chatbot; extracting micro-expressions may require user-specific personalization or more specialized video encoders.

Personalization: Consistently boosts performance (up to 41%) across models and labels, with practical personalization achievable within roughly 15–90 minutes of self-reports. More efficient self-report collection could further improve viability and gains.

Implications: Current ERC architectures trained on human–human datasets may not generalize to real-world chatbot interactions; personalization offers a pathway to mitigate the domain gap. Applications span educational gaming, e-learning, and healthcare companionship, enabling adaptive, emotion-aware agents. Ethical considerations are paramount, including risks of manipulation and the need for consent, transparency, privacy, and security.
Conclusion
The work systematically examines multimodal ERC in real-world human–chatbot interactions by collecting a large in-the-wild dataset and building transformer-based unimodal and multimodal models under practical constraints (missing chatbot audio/video, real-time operation). It identifies a domain gap between human–human and human–chatbot ERC driven by subjective labels, reduced facial expressivity, and subtle emotions, limiting the utility of audio/video. Text is the most predictive modality, and simple concatenation outperforms IMA fusion. User-specific personalization substantially improves performance (up to 38% for user emotions; up to 41% for perceived chatbot emotions), highlighting personalization as a promising direction for robust ERC in practical deployments. Future research should explore specialized encoders tailored to human–chatbot interactions, scalable personalization strategies (e.g., clustering, partial personalization), improved label collection protocols, and broader cohorts to assess generalizability.
Limitations
The participant cohort (university students) may not represent broader populations; conversational dynamics could differ across social strata. Personalization gains might partly reflect mismatches between pre-trained encoder training data and human–chatbot interaction styles, warranting specialized encoders. End-user willingness to engage in personalization (time cost, interaction burden) requires validation. Where individual personalization is infeasible, alternatives like user clustering or partial personalization should be explored. A dedicated warm-up phase for collecting predictive labels may reduce personalization time.