
Psychology

Naturalistic multimodal emotion data with deep learning can advance the theoretical understanding of emotion

T. Angkasirisan

Can AI finally settle what emotions are? This paper explores how big data and deep learning can integrate subjective experience, context, brain–body physiological signals, and expressive behaviour to map emotions in multidimensional spaces, offering fresh insights into debates about innate versus learned categories and emotional coherence. Research conducted by Thanakorn Angkasirisan.

Introduction
The paper addresses the long-standing question of what emotions are and why theoretical consensus remains elusive. It situates the problem historically, beginning with Darwin’s observations of expressive behaviours and the evolutionary view, and highlights the proliferation of competing perspectives—evolutionary, constructivist, appraisal, and dynamical systems. The author proposes that fragmentation in the field is partly due to methodological constraints: theories have relied on narrow, modality-specific evidence (e.g., facial expressions or self-reports). In light of recent advances in artificial intelligence and the availability of large, naturalistic datasets, the review argues for a data-driven, multimodal approach that integrates subjective experience, contextual factors, physiological signals, and expressive behaviours. Using deep learning to map these components within multidimensional spaces could simultaneously test assumptions across theories (e.g., universality vs. cultural construction, coherence vs. degeneracy) and refine the theoretical understanding of emotion.
Literature Review
The review synthesizes four major perspectives on emotion:
(1) Evolutionary (basic emotions theory, BET): posits discrete, universal emotions (e.g., happiness, sadness, surprise, disgust, fear, anger) with distinct elicitors, physiological signatures, and expressive behaviours. Cross-cultural recognition studies (e.g., Ekman's work) reported above-chance accuracy across diverse cultures and have been widely replicated. Critiques include evidence of cultural variation when facial expressions are sorted without labels, and meta-analytic findings that do not support distinct, dedicated neural circuits for each emotion.
(2) Constructivist: argues that emotions are constructed from core affect (valence, arousal), interoception, and learned interpretations of context and language. It explains distributed neural findings and the role of semantics and culture, though direct tests of the brain computations that construct emotions remain challenging.
(3) Appraisal: emphasizes that emotions depend on cognitive evaluations of events (e.g., goal congruence, certainty, control). Empirical work links appraisal profiles to distinct emotional responses and coping strategies, showing within- and between-person variability tied to subjective interpretations.
(4) Dynamical systems: conceptualizes emotions as emergent, self-organizing patterns in a multidimensional space, stabilized through experience and shared ecological pressures. It predicts probabilistic associations and degeneracy (multiple physiological patterns leading to similar emotional states) and complements evolutionary and constructivist views.
The review also notes that each perspective's evidence arises from methodologies aligned with its own assumptions, contributing to fragmented findings and underscoring the need for integrative, multimodal approaches.
Methodology
As a conceptual review, the paper outlines a research framework rather than conducting new empirical studies. It proposes leveraging AI and deep learning to integrate naturalistic multimodal emotion data across four interacting systems:
(1) Subjective experience: momentary self-reports via ambulatory sensing (e.g., smartphone prompts) that capture dynamic experiences influenced by language, culture, appraisal, and regulation.
(2) Contextual factors: multilevel environmental and internal influences (e.g., social context, goals, values) measured via computer vision (objects, race, gender), audio sensing (prosody), natural language processing (text), and experimental manipulation with virtual reality to parse effects on perception and behaviour.
(3) Physiological responses: central (brain activity) and autonomic signals (heart rate, respiration, skin temperature, electrodermal activity) measured via laboratory equipment and wearables; the paper acknowledges the difficulty of collecting large-scale, naturalistic neural data and calls for innovations in wearable brain–body tracking.
(4) Expressive behaviours: facial expressions, body posture, and vocal bursts measured in the wild using public media and emerging multimodal recording tools (e.g., AR/VR devices) for synchronous capture of temporal dynamics.
Analytically, the framework advocates: (a) supervised learning to map associations among modalities (contexts→experience; expressions→experience; contexts→expressions; neurophysiology→experience) and to place multimodal patterns into multidimensional spaces; (b) unsupervised clustering and self-organizing approaches to discover emergent emotional structures independent of labels; (c) comparisons between supervised and unsupervised models to assess whether taxonomies reflect natural clusters or label-driven structures (a minimal sketch of this comparison follows below); and (d) within-person, longitudinal modelling to investigate coherence versus degeneracy across systems. The paper highlights existing multimodal datasets (e.g., HEU Emotion, Emognition), which focus on non-brain physiology, expressions, experiences, and context, as practical starting points.
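To make the proposed comparison between supervised and unsupervised models concrete, here is a minimal Python sketch, assuming synthetic data: a label-guided classifier and label-free clustering are fit to the same fused multimodal feature matrix, and their agreement is scored. The feature layout, labels, and model choices (an MLP classifier, k-means, the adjusted Rand index) are illustrative placeholders, not the implementation proposed in the paper.

```python
# A minimal sketch, assuming synthetic data, of the framework's point (c):
# comparing a label-guided (supervised) mapping with an emergent (unsupervised)
# clustering of the same fused multimodal features.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Hypothetical fused features: e.g., facial action units + prosody +
# electrodermal activity + context embeddings, one row per sampled moment.
n_samples, n_features, n_categories = 2000, 64, 8
X = rng.normal(size=(n_samples, n_features))
y = rng.integers(0, n_categories, size=n_samples)  # self-reported emotion labels

# (a) Supervised mapping: multimodal features -> reported emotion category.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(128, 32), max_iter=300, random_state=0)
clf.fit(X_train, y_train)
print("supervised accuracy:", clf.score(X_test, y_test))

# (b) Unsupervised structure: clusters discovered without any labels.
clusters = KMeans(n_clusters=n_categories, n_init=10, random_state=0).fit_predict(X)

# (c) Do the label-driven and emergent structures agree? Low agreement would
# suggest the taxonomy is partly a product of the labels themselves.
print("label vs. cluster agreement (ARI):", adjusted_rand_score(y, clusters))
```

Under this framing, low agreement between reported labels and emergent clusters would be consistent with a partly label-driven taxonomy, whereas high agreement would support the existence of natural clusters.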
Key Findings
From prior work synthesized in the review:
(1) Large-scale, AI-driven analyses of self-reports from nearly a thousand participants rating thousands of short videos mapped 27 distinct emotion categories organized along continuous gradients.
(2) Expressive behaviours convey many more categories than traditionally assumed: brief vocalizations and facial–bodily expressions convey at least 24 distinct categories, and naturalistic facial expressions map to 28 emotions.
(3) Cross-cultural analyses using deep learning trained on 186,744 facially expressive YouTube clips and applied to six million videos across 144 nations identified 16 facial expressions reliably associated with specific real-world contexts (e.g., weddings, sports, fireworks).
(4) Neural representations of visually evoked emotions are high-dimensional, categorical, and distributed across transmodal brain regions; fMRI patterns can predict dozens of emotions.
(5) Emotion coherence across systems is variable: coherence between experience and expression is relatively high, coherence with physiology is lower, and coherence is moderated by top-down processes (e.g., regulation, body awareness); a sketch of how such coherence could be quantified follows below.
(6) Supervised and unsupervised models can yield different emotion structures, raising questions about the objectivity of label-based taxonomies.
Collectively, these findings indicate that emotions are multifaceted, high-dimensional, and arranged along gradients, with probabilistic mappings across modalities rather than simple one-to-one correspondences.
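To illustrate how the coherence question in point (5) could be operationalized, the short sketch below computes within-person correlations between time-aligned streams of self-reported experience, expressive intensity, and a physiological signal, then averages them across people. All signals and effect sizes are simulated assumptions chosen only to echo the qualitative pattern described above; they are not data from the reviewed studies.

```python
# A hedged sketch of quantifying within-person coherence across time-aligned
# modality streams. All signals below are simulated; the effect sizes are
# assumptions chosen only to echo the qualitative pattern reported above.
import numpy as np

rng = np.random.default_rng(1)
n_persons, n_timepoints = 30, 200

exp_coherence, phys_coherence = [], []
for _ in range(n_persons):
    experience = rng.normal(size=n_timepoints)                           # momentary self-reports
    expression = 0.7 * experience + 0.7 * rng.normal(size=n_timepoints)  # tracks experience closely
    physiology = 0.2 * experience + 1.0 * rng.normal(size=n_timepoints)  # tracks it only loosely
    exp_coherence.append(np.corrcoef(experience, expression)[0, 1])
    phys_coherence.append(np.corrcoef(experience, physiology)[0, 1])

print("mean experience-expression coherence:", round(float(np.mean(exp_coherence)), 2))
print("mean experience-physiology coherence:", round(float(np.mean(phys_coherence)), 2))
```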
Discussion
The review argues that integrating naturalistic multimodal emotion data with deep learning can illuminate foundational questions about the nature of emotion. By jointly modelling subjective experiences, context, physiology, and expression within multidimensional spaces, researchers can test core assumptions across theories—whether emotions are discrete and universal or constructed and culturally variable; whether emotional systems exhibit coherence with one-to-one mappings or degeneracy where multiple patterns yield the same emotional state. AI-driven findings already suggest high dimensionality and graded organization, reconciling parts of evolutionary, constructivist, appraisal, and dynamical systems views. Comparing supervised models (label-guided mappings) with unsupervised models (emergent structure) can help determine whether commonly recognized categories reflect natural clusters or learned semantic frameworks. Incorporating contextual appraisal processes and within-person dynamics further refines our understanding of how emotions arise and vary across individuals and cultures. Overall, the proposed multimodal, data-driven approach provides a path toward unifying and refining emotion theories.
Conclusion
The paper concludes that AI-driven methods using deep learning on large, naturalistic multimodal datasets can move emotion science beyond siloed methodological traditions. Integrating physiological, experiential, contextual, and expressive data within multidimensional spaces offers an integrative model capable of testing multiple theories simultaneously and probing coherence versus degeneracy. Advances in computational power, multimodal wearables, AR/VR recording, and interdisciplinary collaborations are making such projects increasingly feasible. Future research should prioritize: obtaining comprehensive synchronized multimodal data in the wild; comparing supervised and unsupervised models to assess the naturalness of emotion categories; modelling appraisal-driven internal contexts (e.g., goals, values) with agent-based or reinforcement learning approaches; and conducting within-person, longitudinal analyses to characterize consistency and variability across cultures. This approach promises a transformative trajectory for the theoretical understanding of emotion.
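As a purely illustrative sketch of the proposed "appraisal-driven internal contexts" direction, the toy loop below derives two appraisal variables (goal congruence and certainty) from a delta-rule learner's prediction errors and maps them, via an admittedly crude heuristic, onto coarse emotion-like states. Every parameter, threshold, and label here is a hypothetical placeholder rather than a method from the reviewed paper.

```python
# A toy sketch, not the paper's method, of modelling appraisal-driven internal
# context with a simple reinforcement-learning (delta-rule) agent. The two
# appraisal variables and the mapping to emotion-like states are heuristic
# assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(2)
goal_success_prob = 0.7   # hypothetical chance that an event is goal-congruent
value_estimate, learning_rate = 0.5, 0.1

for t in range(12):
    outcome = float(rng.random() < goal_success_prob)   # did the event meet the goal?
    goal_congruence = outcome - value_estimate           # appraisal 1: prediction error
    certainty = abs(value_estimate - 0.5) * 2            # appraisal 2: confidence in expectation
    value_estimate += learning_rate * goal_congruence    # standard delta-rule update

    # Heuristic appraisal-profile -> emotion-like state mapping (illustrative only).
    if goal_congruence > 0:
        state = "pleasant surprise" if certainty < 0.5 else "contentment"
    else:
        state = "disappointment" if certainty < 0.5 else "frustration"
    print(f"t={t:02d} congruence={goal_congruence:+.2f} certainty={certainty:.2f} -> {state}")
```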
Limitations
Major limitations highlighted include the difficulty of collecting comprehensive, synchronized, naturalistic multimodal emotion data, especially large-scale neural measurements outside laboratory settings. Contextual influences on emotion are complex and multilevel (environmental, social, cultural, goals, values), making them challenging to quantify and integrate coherently. Reliance on supervised learning and preassigned labels may reflect subjective taxonomies rather than objective categories, and supervised and unsupervised models can yield discrepant structures. Deep learning models yield probabilistic rather than deterministic mappings, which complicates interpretations of coherence versus degeneracy. Available datasets often focus on non-brain physiology and expressions, limiting fully integrative modelling. As a review, the paper presents conceptual proposals without generating new empirical data.