Comparing supervised and unsupervised approaches to emotion categorization in the human brain, body, and subjective experience

B. Azari, C. Westlin, et al.

This study by Bahar Azari and colleagues examines whether machine learning can uncover biologically grounded emotion categories in measurements of the brain, body, and subjective experience. By pitting supervised classification against unsupervised clustering, the authors reveal striking inconsistencies between the two approaches, prompting a reevaluation of how emotion labels are used to interpret such data.

Introduction
The paper examines whether common folk emotion category labels (e.g., anger, fear, sadness, happiness) correspond to objective, biologically grounded categories that can be discovered in measurements of brain activity, autonomic physiology, and behavior. The context is more than a century of attempts to map psychological categories onto physical measurements, despite long-standing warnings that folk categories may not map cleanly onto biology. Recent machine learning studies often assume the labels reflect ground truth and use supervised classifiers to search for emotion-specific biomarkers, yet the reported patterns are inconsistent across studies. The purpose here is to compare supervised (label-guided) and unsupervised (label-free) approaches across three datasets (fMRI BOLD signals, ambulatory autonomic nervous system (ANS) physiology with free labeling, and self-reports of emotion), asking whether the intrinsic structure of the data aligns with folk emotion labels. The study's importance lies in assessing the validity of the labels used to organize psychological science and in guiding more data-driven discovery of mental categories.
Literature Review
Prior MVPA studies have claimed distinct neural or physiological patterns for specific emotions, but reported patterns are inconsistent across studies, even when using similar methods and stimuli. Table 1 summarizes representative fMRI MVPA studies (e.g., Kassam et al., Kragel & LaBar, Saarimäki et al., Wager et al.), noting varying sample sizes, induction methods (scenario immersion, movies, music), preprocessing choices, feature selection strategies, and classification algorithms (e.g., Gaussian Naive Bayes, PLS-DA, linear neural nets, Bayesian spatial models). Cross-study variability may reflect methodological differences (e.g., small samples, alignment, preprocessing, classifier choice) or limitations of current measures, but persistent within-category variability and between-category similarity across modalities (ANS, facial movements, BOLD magnitude and connectivity, single-unit recordings) suggest emotion categories may be populations of variable instances rather than fixed biological kinds. This motivates testing whether unsupervised methods recover label structure or reveal alternative organization.
Methodology
The authors reanalyzed three datasets using both supervised classification and unsupervised clustering, choosing methods appropriate to each dataset and using statistical model order selection for the clustering analyses. Illustrative code sketches of the main analysis steps appear after this section.
- Dataset 1 (fMRI BOLD; Wilson-Mendenhall et al.): N=16. Participants immersed themselves in 60 auditory scenarios per folk category (fear, happiness, sadness) over six runs (180 trials in total). Preprocessing (AFNI) included slice-timing correction, motion correction, 6 mm spatial smoothing, and percent signal change normalization. For each trial, a whole-brain beta map was derived from the 9 s immersion window and, separately, from the 3 s post-stimulus window. Supervised analysis: a within-subject 3D convolutional neural network (CNN) with leave-one-run-out cross-validation; chance = 33.3%; significance assessed with a one-sample t-test against chance. Unsupervised analysis: principal component analysis (PCA) for dimensionality reduction followed by a Gaussian mixture model (GMM) with shared diagonal covariance; the Bayesian Information Criterion (BIC) jointly selected the number of principal components and clusters per participant. A sensitivity validation used synthetic BOLD-like data with known categories across a range of signal-to-noise ratios to verify that the GMM could recover clusters when signal exists.
- Dataset 2 (ambulatory ANS physiology; Hoemann et al.): N=46. Electrocardiography (ECG) and impedance cardiography (ICG) were collected for 14 days, with physiologically triggered experience sampling based on sustained interbeat interval (IBI) change in the absence of movement. For each event, six cardiovascular features were computed as change scores: respiratory sinus arrhythmia (RSA), IBI, pre-ejection period (PEP), left ventricular ejection time (LVET), stroke volume (SV), and cardiac output (CO). Participants freely labeled their emotions and rated valence and arousal. Supervised analysis: a within-subject fully connected neural network classifying only each participant's top three most frequent emotion words, with fivefold cross-validation and class imbalance preserved; statistical significance was assessed via a permutation test with 1000 label shuffles to generate a null distribution of group mean accuracy. Unsupervised analysis: a Dirichlet process Gaussian mixture model (DP-GMM) per participant (scikit-learn; full covariance; Dirichlet process priors), run 100 times with different random states, selecting the solution with the highest evidence lower bound; the number of clusters was inferred from the data.
- Dataset 3 (self-reports; Cowen & Keltner): 853 participants rated 2185 video clips. One subsample provided yes/no endorsements of 34 emotion words per clip; another rated 14 affective dimensions (Likert 1–9). Unsupervised topic modeling (latent Dirichlet allocation, LDA) was applied to the average yes/no emotion category endorsements across clips to test for a lower-dimensional topic structure, with model selection via validation perplexity. A second analysis used the 14 affective dimensions as features. Supervised analysis: a neural network trained to predict the highest-consensus emotion label per clip, restricted to the nine categories that labeled at least 2.9% of clips (adoration, aesthetic appreciation, anxiety, awe, disgust, fear, nostalgia, romance, sexual desire), with eightfold cross-validation (reported as sixfold in the Methods) and class balancing as described; chance = 11.11%; significance by one-sample t-test against chance. Unsupervised analysis: a GMM with BIC selection on the 14-dimensional continuous features to discover the number of clusters, followed by an examination of the correspondence between clusters and the nine labels.
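To make the Dataset 1 supervised pipeline concrete, here is a minimal sketch of a within-subject 3D CNN evaluated with leave-one-run-out cross-validation. The architecture, layer sizes, optimizer, and training schedule are illustrative assumptions rather than the authors' implementation; it assumes per-trial beta maps stacked into a tensor of shape (n_trials, 1, X, Y, Z), integer class labels, and a run index per trial.

```python
# Hypothetical sketch: within-subject 3D CNN with leave-one-run-out CV.
# Architecture and hyperparameters are assumptions, not the paper's exact model.
import torch
import torch.nn as nn

class Small3DCNN(nn.Module):
    def __init__(self, n_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),  # global average pooling over the volume
        )
        self.classifier = nn.Linear(16, n_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

def leave_one_run_out(betas, labels, runs, n_epochs=20):
    """betas: (n_trials, 1, X, Y, Z); labels: (n_trials,); runs: (n_trials,)."""
    accuracies = []
    for held_out in torch.unique(runs):
        train, test = runs != held_out, runs == held_out
        model, loss_fn = Small3DCNN(), nn.CrossEntropyLoss()
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
        for _ in range(n_epochs):  # full-batch training, for brevity
            optimizer.zero_grad()
            loss = loss_fn(model(betas[train]), labels[train])
            loss.backward()
            optimizer.step()
        with torch.no_grad():
            predictions = model(betas[test]).argmax(dim=1)
            accuracies.append((predictions == labels[test]).float().mean().item())
    return sum(accuracies) / len(accuracies)  # compare against chance = 1/3
```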
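The Dataset 1 unsupervised step (PCA followed by a GMM, with BIC jointly selecting the number of components and clusters) could be sketched roughly as follows with scikit-learn. Two caveats: scikit-learn offers per-component diagonal ("diag") or shared full ("tied") covariance but not the paper's shared diagonal covariance, so "diag" is used as an approximation; and BIC values computed after projections of different dimensionality are not strictly comparable, so the paper's exact selection procedure may differ.

```python
# Hedged sketch: joint BIC selection of PCA dimensionality and GMM cluster count.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def select_pcs_and_clusters(X, max_pcs=20, max_clusters=10, seed=0):
    """X: (n_trials, n_voxels) matrix of per-trial beta values for one subject."""
    best_bic, best_n_pcs, best_k = np.inf, None, None
    for n_pcs in range(2, max_pcs + 1):
        Z = PCA(n_components=n_pcs, random_state=seed).fit_transform(X)
        for k in range(1, max_clusters + 1):
            gmm = GaussianMixture(n_components=k, covariance_type="diag",
                                  random_state=seed).fit(Z)
            bic = gmm.bic(Z)  # lower BIC = better penalized fit
            if bic < best_bic:
                best_bic, best_n_pcs, best_k = bic, n_pcs, k
    return best_n_pcs, best_k
```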
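For the Dataset 2 supervised significance test, a group-level permutation scheme along the lines described (shuffling labels within each participant, which preserves class imbalance, and rebuilding the group mean accuracy 1000 times) might look like this; train_and_score stands in for any within-subject cross-validated classifier and is an assumed placeholder.

```python
# Sketch of a group-level permutation test for mean classification accuracy.
import numpy as np

def permutation_test(datasets, train_and_score, n_permutations=1000, seed=0):
    """datasets: list of (X, y) pairs, one per participant.
    train_and_score(X, y): returns cross-validated accuracy for one participant.
    """
    rng = np.random.default_rng(seed)
    observed = np.mean([train_and_score(X, y) for X, y in datasets])
    null = np.empty(n_permutations)
    for i in range(n_permutations):
        # Shuffling labels within participants keeps each class's frequency intact.
        null[i] = np.mean([train_and_score(X, rng.permutation(y))
                           for X, y in datasets])
    p_value = (np.sum(null >= observed) + 1) / (n_permutations + 1)
    return observed, p_value
```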
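The Dataset 2 unsupervised analysis maps naturally onto scikit-learn's BayesianGaussianMixture with a Dirichlet process prior: fit per participant with full covariance, restart 100 times, and keep the fit with the highest evidence lower bound. The truncation level and the way effective clusters are counted below are assumptions.

```python
# Sketch: per-participant DP-GMM, best of 100 random restarts by ELBO.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

def fit_dpgmm(X, max_components=10, n_restarts=100):
    """X: (n_events, 6) matrix of cardiovascular change scores."""
    best_model, best_elbo = None, -np.inf
    for seed in range(n_restarts):
        model = BayesianGaussianMixture(
            n_components=max_components,  # truncation level (an assumption)
            covariance_type="full",
            weight_concentration_prior_type="dirichlet_process",
            random_state=seed,
        ).fit(X)
        if model.lower_bound_ > best_elbo:  # evidence lower bound of this fit
            best_model, best_elbo = model, model.lower_bound_
    # Count only components that actually claim data points.
    n_effective_clusters = np.unique(best_model.predict(X)).size
    return best_model, n_effective_clusters
```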
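Finally, the Dataset 3 topic-model selection (LDA over the 34 emotion word endorsements, scored by validation perplexity) can be sketched as below, assuming a non-negative clips-by-words matrix of endorsement counts; the split ratio and topic range are illustrative. The paper's reported result, no clear minimum in validation perplexity, would correspond to a curve with no pronounced dip.

```python
# Sketch: choosing the number of LDA topics by validation perplexity.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split

def perplexity_curve(endorsements, topic_range=range(2, 21), seed=0):
    """endorsements: (n_clips, 34) non-negative matrix of word endorsement counts."""
    train, val = train_test_split(endorsements, test_size=0.2, random_state=seed)
    curve = []
    for n_topics in topic_range:
        lda = LatentDirichletAllocation(n_components=n_topics,
                                        random_state=seed).fit(train)
        curve.append((n_topics, lda.perplexity(val)))  # lower = better held-out fit
    return curve  # a clear minimum would indicate a robust topic structure
```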
Key Findings
Across all three datasets, supervised classifiers achieved statistically significant above-chance accuracy, but unsupervised clustering did not align with folk emotion labels.
- Dataset 1 (fMRI BOLD): Supervised within-subject 3D CNN accuracy during the 9 s immersion window was 46.06% (chance 33.3%; t(15)=9.07, p<0.01); during the 3 s post-stimulus window it was 47.78% (t(15)=25.01, p<0.01). Unsupervised GMM with BIC yielded participant-specific cluster counts (eight participants: 1 cluster; six: 2 clusters; two: 3 clusters), with clusters mixing trials across fear, happiness, and sadness and showing no clear correspondence to the labels or to scenario valence/arousal. Synthetic-data validation showed that the PCA+GMM pipeline could recover label-consistent clusters even at low SNR, suggesting the lack of correspondence is not due to insensitivity.
- Dataset 2 (ANS physiology): Supervised within-subject neural network classification yielded a mean accuracy of 47.1%, significant relative to a permutation-based null distribution that preserved class imbalance. Unsupervised DP-GMM revealed variable numbers of clusters per participant and many-to-many mappings between clusters and participants' emotion words; valence/arousal ratings varied both within and across clusters.
- Dataset 3 (Self-reports): LDA on the 34 emotion category endorsements showed no clear minimum in validation perplexity, indicating no robust lower-dimensional topic structure. Using the 14 affective dimensions to classify the nine most frequent emotion categories, the supervised neural network achieved a mean accuracy of 47.04% (chance 11.11%; t(8)=3.79, p<0.01). The unsupervised GMM on the 14-dimensional features selected 3 clusters by BIC; clusters contained mixed labels (e.g., Cluster 1 included adoration, aesthetic appreciation, awe, nostalgia, romance, and sexual desire), indicating poor alignment with emotion categories.
Discussion
The findings address whether folk emotion labels reflect intrinsic biological categories detectable in data. Supervised classifiers consistently achieved above-chance performance, implying that some label-related information exists in the measurements. However, unsupervised clustering identified cluster structures that did not map cleanly onto emotion labels across brain, physiology, or self-report features. Two interpretations are discussed: (1) Emotion categories may be biologically real but unsupervised methods failed to detect them due to measurement limitations (sparse sampling, noise, insufficient sensitivity, data reduction), with clusters reflecting other latent factors. (2) Folk emotion categories may not be biological kinds stable across individuals and contexts; instead, emotion categories may be populations of variable, context-specific instances, consistent with extensive within-category variability across modalities and cultures. The results emphasize the need to validate labels, broaden measurements (including internal/external context, appraisal, function), and routinely compare supervised and unsupervised models to assess whether label structures are supported by intrinsic data organization.
Conclusion
This proof-of-concept study demonstrates that supervised and unsupervised approaches can yield incongruent solutions in emotion research: classifiers perform above chance using folk labels, yet unsupervised clustering does not recover those categories across brain, physiology, or self-report data. The work cautions against assuming labels are ground truth and highlights possible population-like organization of emotions. Recommendations include: validate and scrutinize labels; design studies to allow discovery of new categories; collect richer, temporally sensitive multimodal data with internal/external context, appraisal, and functional features; increase within-subject and stimulus sampling power; and routinely report both supervised and unsupervised analyses or multiple modeling approaches. These steps can guide more reliable, biologically meaningful categorization in emotion science and beyond.
Limitations
The analyses cannot discriminate between measurement limitations and conceptual mismatches as causes of discordance between supervised and unsupervised results. Datasets were not designed to identify the specific features driving discovered clusters, limiting interpretability. Existing measures may be sparse or noisy relative to relevant signals; small sample sizes and limited unique stimuli per category may reduce power. In Dataset 3, only a subset of categories had sufficient samples for supervised analysis, and labeling relied on highest-rated categories, potentially obscuring mixed emotions. Overall, the study is a proof-of-concept rather than a definitive test of competing theories.