Self supervised learning based emotion recognition using physiological signals

M. Zhang and Y. Cui

Emotion recognition is vital for human–machine interaction, yet labeled EEG datasets are scarce. This study, conducted by Min Zhang and YanLi Cui, applies self‑supervised learning to EEG using three pretext tasks—Contrastive Predictive Coding, Relative Position, and Temporal Shuffling—to extract features from unlabeled data. Experiments show SSL can learn effective representations for downstream emotion recognition without manual labels.

Introduction
The study addresses the challenge of emotion recognition from EEG signals when labeled data are scarce. While non-physiological cues (speech, facial expressions, gestures) can be deliberately masked, EEG provides objective, reliable indicators of emotional states. However, existing labeled EEG emotion datasets are small, making supervised approaches difficult to scale. The research question is whether self-supervised learning can leverage the structure of unlabeled EEG to learn transferable representations that improve downstream emotion recognition with minimal labels. The purpose is to design and evaluate SSL pretext tasks tailored to EEG temporal structure, thereby improving classification performance and generalization without relying on costly, noisy manual annotations. The study is important for advancing affective brain–computer interfaces and practical emotion-aware systems where labels are expensive or unreliable.
Literature Review
Early emotion recognition relied on non-physiological signals (voice, facial expressions, gestures), progressing from shallow handcrafted features to deep learning (CNNs, LSTMs, end-to-end systems). EEG-based emotion recognition then emerged, commonly using supervised pipelines: feature extraction (e.g., differential entropy) plus classifiers such as SVM, KNN, HMMs, or deep models. Cross-subject variability remains a critical challenge, prompting research in adaptive, transfer, and domain generalization methods. Label scarcity, annotation cost, and potential label noise in physiological data motivate unsupervised approaches. Self-supervised learning reframes unsupervised problems into supervised pretext tasks that exploit temporal/spatial structure to generate pseudo-labels, achieving strong representation learning in vision and NLP and showing promise for EEG, where labels are limited.
Methodology
Framework: A self-supervised learning pipeline is proposed with three pretext tasks—Contrastive Predictive Coding (CPC), Relative Position (RP), and Temporal Shuffling (TS)—to learn EEG representations without manual labels. The pretrained encoder is transferred and fine-tuned for downstream emotion classification.
- CPC: Recasts a generative modeling problem as a classification objective that maximizes the mutual information between the current context and future latent representations. An encoder g_enc maps inputs x_t to latents z_t, and an autoregressive model g_ar produces a context c_t. A log-bilinear score f_k(x_{t+k}, c_t) = exp(z_{t+k}^T W_k c_t) is used with a contrastive loss over positive and negative samples to maximize the mutual information (a code sketch is given at the end of this section).
- Relative Position (RP): From a multivariate time series S ∈ R^{C×M}, pairs of time windows (x_t, x_{t′}) of length T are sampled. Using hyperparameters τ_pos and τ_neg, pairs closer than τ_pos are labeled positive and pairs farther apart than τ_neg are labeled negative, based on the assumption that the underlying temporal features vary slowly. A linear discriminant model on aggregated feature differences predicts y ∈ {−1, 1}.
- Temporal Shuffling (TS): Samples two anchor windows from the positive context and a third window either between them or from the negative context, creating ordered (t < t′ < t″) or shuffled triplets. A contrastive model combines features via element-wise absolute differences and is trained with a logistic loss to predict whether a triplet is temporally ordered.
Data and preprocessing: Three public datasets are used.
- SEED: 15 participants (7 male, 8 female; about 23 years old). Each of three sessions, separated by two weeks, includes 15 video clips (5 per emotion: positive, neutral, negative). EEG was recorded with a 62-channel ESI NeuroScan system at 1,000 Hz and downsampled to 200 Hz. Labels are three-class.
- SEED-IV: 15 participants; four emotions (happiness, sadness, fear, neutral). Each of three sessions includes 24 video clips (≈120 s each), with preparation and self-assessment intervals.
- DEAP: 32 participants (32-channel EEG at 512 Hz) watched 60-s music videos and gave self-assessments on Valence, Arousal, Dominance, Liking, and Familiarity. For this study, DEAP is converted to a 4-class task by binarizing Valence and Arousal at a threshold of 5.
Feature extraction: Time–frequency features are obtained via the short-time Fourier transform, and the EEG is divided into δ (1–4 Hz), θ (4–8 Hz), α (8–13 Hz), β (13–30 Hz), and γ (30–100 Hz) bands. Differential entropy (DE) is computed per band; assuming Gaussianity within each band, DE is h(X) = 0.5 log(2πeσ^2) (see the second sketch at the end of this section). Signals are sampled at 200 Hz and segmented into 1-s epochs. Feature selection aims to reduce redundancy and improve generalization.
Training and evaluation: A CNN serves as the feature extractor across methods. SSL pretraining uses CPC, RP, and TS on unlabeled data; the pretrained model is then transferred to downstream emotion classification and fine-tuned. Evaluation uses 10-fold cross-validation with Accuracy and F1 Macro as metrics. Experiments run on an NVIDIA RTX 3090 (24 GB) using Python with NumPy 1.22.3 and MNE-Python for EEG processing. Statistical testing uses Wilcoxon pairwise tests with multiple-testing correction to compare RP vs. CPC and TS vs. CPC.
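The CPC objective above can be made concrete with a minimal sketch of the log-bilinear score f_k(x_{t+k}, c_t) = exp(z_{t+k}^T W_k c_t) paired with an InfoNCE-style contrastive loss. This is a sketch under stated assumptions, not the authors' implementation: it assumes PyTorch (the paper specifies Python and a CNN encoder but not the deep learning framework) and uses the other sequences in a mini-batch as negative samples; all names are illustrative.

import torch
import torch.nn.functional as F

def cpc_infonce_loss(c_t, z_future, W_k):
    # c_t:      (batch, dim) context vectors from the autoregressive model g_ar
    # z_future: (batch, dim) encoder latents z_{t+k} for the same mini-batch
    # W_k:      (dim, dim)   log-bilinear projection for prediction step k
    # scores[i, j] = z_future[j]^T W_k c_t[i]; for context i the true future is
    # column i (the positive), and the remaining columns act as negatives.
    scores = (c_t @ W_k.T) @ z_future.T
    targets = torch.arange(scores.size(0), device=scores.device)
    # Softmax cross-entropy over the scores is the standard InfoNCE objective,
    # which lower-bounds the mutual information between c_t and z_{t+k}.
    return F.cross_entropy(scores, targets)

The differential entropy feature can be sketched in the same spirit, using the Gaussian form h(X) = 0.5 log(2πeσ^2) stated above. The paper describes STFT-based time-frequency features; the version below instead band-pass filters the continuous signal with MNE-Python (which the paper lists in its toolchain) and takes the per-epoch variance in each band as one common approximation. The gamma cutoff is placed just below the 100 Hz Nyquist limit of the 200 Hz signals; band edges and function names are otherwise assumptions for illustration.

import numpy as np
from mne.filter import filter_data

BANDS = {"delta": (1.0, 4.0), "theta": (4.0, 8.0), "alpha": (8.0, 13.0),
         "beta": (13.0, 30.0), "gamma": (30.0, 99.0)}  # gamma capped below Nyquist

def de_features(eeg, sfreq=200.0, epoch_sec=1.0):
    # eeg: (n_channels, n_samples) continuous EEG at sfreq Hz.
    # Returns an array of shape (n_bands, n_channels, n_epochs) of DE values.
    n_per_epoch = int(sfreq * epoch_sec)
    n_epochs = eeg.shape[1] // n_per_epoch
    out = []
    for low, high in BANDS.values():
        band = filter_data(eeg.astype(np.float64), sfreq, low, high, verbose=False)
        segs = band[:, : n_epochs * n_per_epoch].reshape(-1, n_epochs, n_per_epoch)
        sigma2 = segs.var(axis=2)  # per-channel, per-epoch variance in this band
        out.append(0.5 * np.log(2 * np.pi * np.e * sigma2))
    return np.stack(out)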
Key Findings
- Across datasets, the SSL pretext tasks yielded distinct downstream performance; CPC generally outperformed RP and TS.
- SEED (3-class; baseline 33.33%):
  Accuracy (%): RP 35.49 ± 0.43; TS 35.91 ± 1.46; CPC 51.90 ± 4.63
  F1_macro (%): RP 34.41 ± 1.27; TS 35.11 ± 1.69; CPC 50.12 ± 5.75
- SEED-IV (4-class; baseline 25.00%):
  Accuracy (%): RP 28.51 ± 1.02; TS 29.64 ± 1.05; CPC 33.21 ± 4.34
  F1_macro (%): RP 27.89 ± 1.06; TS 28.59 ± 0.08; CPC 30.89 ± 4.51
- DEAP (converted to 4-class; baseline 25.00%):
  Accuracy (%): RP 52.76 ± 5.08; TS 50.11 ± 4.65; CPC 55.21 ± 6.62
  F1_macro (%): RP 48.89 ± 2.82; TS 48.27 ± 5.66; CPC 44.83 ± 4.86
- Training time per epoch (s): SEED: RP 1.59, TS 1.60, CPC 1.58; SEED-IV: RP 1.32, TS 1.31, CPC 1.30; DEAP: RP 1.49, TS 1.35, CPC 1.31.
- Statistical significance: Wilcoxon pairwise tests (multiple-testing corrected) showed CPC significantly improved results over RP (p = 0.001) and TS (p = 0.001).
- Overall, SSL—especially CPC—learned effective EEG representations without manual labels, substantially outperforming baseline chance levels and showing competitive performance across datasets.
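The significance comparison reported above can be illustrated with SciPy's paired Wilcoxon signed-rank test. The sketch below is an assumption-laden illustration, not the authors' analysis script: the paper states that a multiple-testing correction was applied but does not name the procedure, so Holm's method from statsmodels is used here as one common choice, and the function and argument names are hypothetical.

from scipy.stats import wilcoxon
from statsmodels.stats.multitest import multipletests

def compare_pretext_tasks(cpc_scores, rp_scores, ts_scores):
    # Each argument: a 1-D array of per-fold downstream scores (e.g., the
    # 10-fold cross-validation accuracies) for the corresponding pretext task.
    p_rp = wilcoxon(cpc_scores, rp_scores).pvalue   # paired test: CPC vs. RP
    p_ts = wilcoxon(cpc_scores, ts_scores).pvalue   # paired test: CPC vs. TS
    # Correct the two p-values for multiple comparisons (Holm's step-down method).
    reject, p_corrected, _, _ = multipletests([p_rp, p_ts], method="holm")
    return p_corrected, reject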
Discussion
The results support the hypothesis that self-supervised learning can leverage the temporal structure of unlabeled EEG to learn representations beneficial for downstream emotion recognition. CPC, which directly maximizes mutual information between context and future representations, consistently delivered the strongest or most stable performance across datasets, indicating its suitability for complex, high-dimensional EEG dynamics. RP and TS, while exploiting temporal locality and order, were less effective on some datasets (e.g., SEED), suggesting that simple proximity/order cues may capture less discriminative information than CPC’s predictive coding. The improvements over baseline across all datasets demonstrate SSL’s practicality in low-label regimes common in EEG emotion research. The variation in performance across datasets highlights dataset-specific characteristics (number of classes, recording setup, elicitation protocol), underscoring the need for adaptable pretext designs. The modest training times per epoch indicate computational feasibility. Together, these findings validate SSL—particularly CPC—as a promising pathway to reduce reliance on costly and noisy labels while maintaining quantifiable performance for EEG-based emotion recognition.
Conclusion
This work applies self-supervised learning to EEG-based emotion recognition, defining three pretext tasks—Relative Position, Temporal Shuffling, and Contrastive Predictive Coding—to generate pseudo-labels from data structure and learn transferable features. Experiments on SEED, SEED-IV, and DEAP show that SSL, especially CPC, can learn useful representations without manual labels and improve downstream classification beyond chance. The study contributes an SSL framework tailored to EEG temporal dynamics, demonstrates its effectiveness across multiple datasets, and provides computational benchmarks. Future work should enhance generalization and stability across subjects and datasets, integrate interpretability and explainability techniques into the framework, and refine pretext tasks that better capture neurophysiologically meaningful temporal–spatial patterns.
Limitations
- Interpretability: The current framework does not integrate model interpretability/explainability. Incorporating interpretable network components and post hoc techniques (e.g., saliency maps, LRP, SHAP) is important for neuroscience applications to discern neurophysiological features from artifacts.
- Variability across datasets/tasks: Pretext task effectiveness varies by dataset (e.g., CPC outperforming RP/TS on SEED; mixed F1 results on DEAP), indicating room to improve robustness and generalization, especially for cross-subject scenarios.
- Labeling and protocol differences: Differences in elicitation protocols, channel montages, and class definitions (e.g., the DEAP 4-class conversion via Valence/Arousal binarization) may limit direct comparability and generalizability.
- Baseline breadth: No manual-label comparisons beyond a single CNN baseline are detailed; broader comparisons with alternative supervised and semi-supervised baselines could further contextualize the gains.