Psychology
Implementing machine learning techniques for continuous emotion prediction from uniformly segmented voice recordings
H. Diemerling, L. Stresemann, et al.
A method for predicting emotions from short audio samples: researchers Hannes Diemerling, Leonie Stresemann, Tina Braun, and Timo von Oertzen apply machine learning techniques to 1.5 s voice clips and reach accuracy approaching human benchmarks, pointing toward real-time emotion detection.
~3 min • Beginner • English
Introduction
The study asks whether technical tools can accurately predict emotions from vocal parameters independent of semantic content, using very short audio segments suitable for continuous assessment. It proposes a classifier operating on fixed 1.5 s segments to emulate real-world conditions where emotional boundaries are unclear and to minimize mixed emotions within a segment. The choice of 1.5 s is motivated by evidence that brief nonverbal vocalizations suffice for emotion recognition and by practical dataset considerations. The goal is to develop and compare machine learning models that can classify emotions from 1.5 s clips across languages (German and English) and to benchmark performance against human listeners. If model accuracy approaches human performance, the tool could support practical applications and offer insights into human emotion recognition.
Literature Review
Prior work commonly used semantically constrained recordings of roughly 1.5–5 s to develop classifiers, without explicit focus on uniform segmentation for continuous prediction. Numerous studies using Emo-DB and RAVDESS reported strong results with CNN-based methods and related architectures (e.g., Chen 2018; Jiang 2019; Mustaqeem & Kwon 2019, 2021; Mustaqeem et al. 2020), with reported accuracies/F1 scores ranging from roughly 77% to 95% depending on database and protocol. However, methods and train/test protocols vary substantially (e.g., 5-fold, 10-fold, LOSO), limiting direct comparability. Continuous segmentation remains challenging; Atmaja and Akagi (2020) demonstrated beyond-chance emotion recognition using 4 s windows on audiovisual data. Stresemann (2021) standardized all recordings to 1.5 s, analyzing them as independent units to focus on emotion independent of semantics, supported by evidence that emotion-specific acoustic patterns can be present even without contiguous linguistic sequences. Foundational literature indicates cross-cultural recognition of basic emotions in voice and distinct emotion-specific acoustic patterns (e.g., Ekman, Scherer, Juslin & Laukka), motivating a focus on a limited set of basic emotion categories.
Methodology
Data: Audio was sourced from two public emotion databases representing different languages and cultures. From RAVDESS (English; 24 actors; 8 emotion categories, of which calm and surprise were excluded for this study), 1,056 audio clips were selected. From Emo-DB (German; 10 actors), 454 recordings were selected. The emotions used were joy, anger, sadness, fear, disgust, and neutral, for a total of 1,510 unique samples.
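As an illustration of this selection step, the sketch below filters RAVDESS files down to the six target emotions using the database's filename convention, in which the third hyphen-separated field encodes the emotion. The directory layout and code-to-label mapping are assumptions for illustration, not taken from the paper.

```python
# Minimal sketch (not the authors' code): select the six target emotions
# from RAVDESS via its filename convention, excluding calm and surprise.
from pathlib import Path

RAVDESS_EMOTION_CODES = {
    "01": "neutral", "02": "calm", "03": "joy", "04": "sadness",
    "05": "anger", "06": "fear", "07": "disgust", "08": "surprise",
}
TARGET_EMOTIONS = {"joy", "anger", "sadness", "fear", "disgust", "neutral"}

def collect_ravdess_clips(root: str) -> list[tuple[Path, str]]:
    """Return (path, emotion) pairs for the six emotions used in the study."""
    clips = []
    for wav in Path(root).rglob("*.wav"):
        code = wav.stem.split("-")[2]   # e.g. "03-01-05-01-02-01-12" -> "05"
        emotion = RAVDESS_EMOTION_CODES.get(code)
        if emotion in TARGET_EMOTIONS:
            clips.append((wav, emotion))
    return clips
```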
Segmentation and preprocessing: All audio recordings were standardized to 1.5 s. Longer files were trimmed symmetrically around the center to 1.5 s; shorter files were padded symmetrically with silence to 1.5 s, preserving the emotional content. To assess the impact of segment length, variants with 3 s and 5 s segments were also created with the same method for parts of the DNN analyses.
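A minimal sketch of this standardization, assuming librosa for loading and a 22,050 Hz sampling rate (the paper does not specify the exact rate used here):

```python
# Standardize every clip to exactly 1.5 s by symmetric trimming or padding.
import numpy as np
import librosa

def standardize_length(path: str, target_s: float = 1.5, sr: int = 22050) -> np.ndarray:
    y, _ = librosa.load(path, sr=sr, mono=True)
    target_len = int(round(target_s * sr))
    if len(y) > target_len:
        # Trim symmetrically around the center of the recording.
        start = (len(y) - target_len) // 2
        y = y[start:start + target_len]
    elif len(y) < target_len:
        # Pad symmetrically with silence.
        pad = target_len - len(y)
        y = np.pad(y, (pad // 2, pad - pad // 2))
    return y
```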
Human comparison dataset: Following Stresemann (2021), 61 participants (36 male, 25 female; ages 20–71) performed forced-choice emotion recognition on 82 English RAVDESS and 88 German Emo-DB 1.5 s clips (emotions: fear, anger, joy, sadness, disgust, neutral). Demographics were collected; responses were recorded after a single playback, with an optional "no statement" choice. Stresemann's results indicated that recognition performance correlated across the two databases, suggesting that individual empathic ability outweighs language-specific bias.
Feature engineering: For each 1.5 s segment, a broad set of features was computed to minimize information loss, totaling 14,244 entries per recording. Features included: the unmodified audio signal (variance), harmonic and percussive components from HPSS (variance), spectral flatness, spectral centroid, fundamental frequency (F0), spectral rolloff (0.85), spectral bandwidth, zero-crossing rate, RMS, spectral contrast, tonnetz, chroma, pitch tracking (variance and mean per 2,048-sample window), pitch magnitudes (variance and mean), magnitude (variance and mean), phase (variance and mean), and MFCCs. Magnitude and phase were separated from the complex spectrogram via magphase decomposition, and the presence/probability of voiced segments was computed in 2,048-sample windows. Spectrograms were generated per recording as 320×240 RGB PNG images without axes or borders.
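The sketch below illustrates a subset of this feature extraction with librosa; parameter choices (n_fft, MFCC count, pYIN frequency range) and the mean/variance summaries of the frame-wise descriptors are simplifying assumptions, not the authors' exact settings, and the result covers only part of the full 14,244-entry vector.

```python
# Illustrative feature extraction for one standardized 1.5 s segment.
import numpy as np
import librosa

def extract_features(y: np.ndarray, sr: int = 22050) -> np.ndarray:
    y_harm, y_perc = librosa.effects.hpss(y)                        # HPSS components
    magnitude, phase = librosa.magphase(librosa.stft(y, n_fft=2048))  # magphase decomposition
    pitches, pitch_mags = librosa.piptrack(y=y, sr=sr, n_fft=2048)  # pitch tracking
    f0, voiced_flag, voiced_prob = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)

    feats = [
        np.var(y), np.var(y_harm), np.var(y_perc),
        np.var(magnitude), np.mean(magnitude),
        np.var(np.angle(phase)), np.mean(np.angle(phase)),
        np.var(pitches), np.mean(pitches),
        np.var(pitch_mags), np.mean(pitch_mags),
        np.nanvar(f0), np.nanmean(f0),
        np.mean(voiced_flag), np.mean(voiced_prob),
    ]
    # Frame-wise descriptors, summarized here with mean and variance for brevity.
    framewise = [
        librosa.feature.spectral_flatness(y=y),
        librosa.feature.spectral_centroid(y=y, sr=sr),
        librosa.feature.spectral_rolloff(y=y, sr=sr, roll_percent=0.85),
        librosa.feature.spectral_bandwidth(y=y, sr=sr),
        librosa.feature.zero_crossing_rate(y),
        librosa.feature.rms(y=y),
        librosa.feature.spectral_contrast(y=y, sr=sr),
        librosa.feature.tonnetz(y=y, sr=sr),
        librosa.feature.chroma_stft(y=y, sr=sr),
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13),
    ]
    for f in framewise:
        feats.extend([np.mean(f), np.var(f)])
    return np.asarray(feats, dtype=np.float32)
```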
Models: Three designs were implemented in Python using TensorFlow and Scikit-learn: (1) a DNN operating on the engineered numerical features; (2) a CNN operating on the spectrogram images; (3) a hybrid C-DNN combining the DNN feature branch and the CNN spectrogram branch via concatenation and a final dense softmax layer. Hyperparameters (e.g., 2–8 layers, 80–400 neurons, relu/elu/sigmoid/tanh activations, SGD/RMSprop/Adam optimizers, learning rates from 1e-7 to 0.01) were tuned via Bayesian optimization with a Gaussian process on the training set. Training used up to 4 epochs per candidate model, with validation accuracy monitored to avoid overfitting.
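A minimal Keras sketch of the hybrid C-DNN design follows; layer sizes, activations, and the optimizer settings are illustrative assumptions rather than the tuned hyperparameters reported by the authors.

```python
# Hybrid C-DNN sketch: dense branch over engineered features plus a small
# convolutional branch over the 320x240 RGB spectrogram, fused by concatenation.
import tensorflow as tf
from tensorflow.keras import layers, Model

NUM_FEATURES = 14244
NUM_CLASSES = 6   # joy, anger, sadness, fear, disgust, neutral

def build_cdnn() -> Model:
    # DNN branch: engineered acoustic features.
    feat_in = layers.Input(shape=(NUM_FEATURES,), name="features")
    x = layers.Dense(400, activation="relu")(feat_in)
    x = layers.Dense(200, activation="relu")(x)

    # CNN branch: spectrogram image.
    img_in = layers.Input(shape=(240, 320, 3), name="spectrogram")
    y = layers.Conv2D(16, 3, activation="relu")(img_in)
    y = layers.MaxPooling2D()(y)
    y = layers.Conv2D(32, 3, activation="relu")(y)
    y = layers.MaxPooling2D()(y)
    y = layers.GlobalAveragePooling2D()(y)

    # Fusion and softmax classification head over the six emotions.
    z = layers.Concatenate()([x, y])
    out = layers.Dense(NUM_CLASSES, activation="softmax")(z)

    model = Model(inputs=[feat_in, img_in], outputs=out)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Dropping either branch recovers the standalone DNN or CNN variants described above.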
Evaluation: For comparability with the literature, 10-fold cross-validation was conducted for each model and dataset variant (combined, Emo-DB-only, RAVDESS-only), with Balanced Accuracy (BAC) as the primary metric. In addition, an Independent Validation with Bayesian Updating scheme was used: models were trained initially on 10% of the data with another 10% held out for validation, then sequentially exposed to new data in batches of 16; before each integration, predictions were made and the posterior over BAC was updated under a Beta(1,1) prior with a binomial likelihood. Overlap among posterior distributions was used to compare models against random guessing and against each other. Human performance was modeled analogously with Beta distributions over the proportion of correct recognitions per emotion. SHAP-based saliency analyses over 48 time segments of the spectrograms were used to interpret the temporal importance of cues for classification.
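The conjugate Beta-Binomial update behind this Bayesian scheme can be sketched as follows; the helper class and the per-batch counts are illustrative, not values from the paper.

```python
# Beta-Binomial updating of the posterior over recognition accuracy:
# starting from a uniform Beta(1,1) prior, each batch of 16 clips is predicted
# before being added to training, and the counts of correct/incorrect
# predictions update the Beta parameters.
from scipy.stats import beta

class BetaAccuracyPosterior:
    def __init__(self, a: float = 1.0, b: float = 1.0):
        self.a, self.b = a, b                      # Beta(1,1) = uniform prior

    def update(self, n_correct: int, n_total: int) -> None:
        self.a += n_correct                        # conjugate binomial update
        self.b += n_total - n_correct

    def map_estimate(self) -> float:
        return (self.a - 1) / (self.a + self.b - 2)

    def prob_above_chance(self, chance: float = 1 / 6) -> float:
        # Posterior probability that accuracy exceeds 6-class chance (16.67%).
        return 1.0 - beta.cdf(chance, self.a, self.b)

# Usage sketch with hypothetical per-batch correct counts.
posterior = BetaAccuracyPosterior()
for n_correct in [9, 7, 10, 8]:
    posterior.update(n_correct, n_total=16)
print(posterior.map_estimate(), posterior.prob_above_chance())
```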
Key Findings
- Cross-validation mean Balanced Accuracy (Table 3):
• Combined dataset: DNN 54.49%, CNN 41.56%, C-DNN 56.24%.
• Emo-DB: DNN 64.69%, CNN 30.68%, C-DNN 54.85%.
• RAVDESS: DNN 53.55%, CNN 28.39%, C-DNN 48.09%.
- Segment length variants (DNN): Combined dataset 3 s 62.36%, 5 s 61.79%; Emo-DB 3 s 72.91%, 5 s 69.21%; RAVDESS 3 s 60.01%, 5 s 61.00%.
- Bayesian posterior analyses showed >99% probability that trained classifiers outperform random guessing (random baseline for 6 classes = 16.67%). For the combined dataset, MAP posterior estimates were ~0.436 (DNN) and ~0.433 (C-DNN) with SE ~0.013; CNN ~0.27 with SE ~0.012. Emo-DB MAP ~0.58 (DNN) and ~0.48 (C-DNN) with SE ~0.024; CNN ~0.29 (SE ~0.022). RAVDESS MAP ~0.42 (DNN and C-DNN) with SE ~0.016; CNN ~0.26 (SE ~0.014).
- Human comparison: DNN and C-DNN achieved performance comparable to humans across emotions (fear, joy, anger, disgust, sadness, neutral), while CNN underperformed and showed instability.
- Saliency/SHAP analyses revealed that certain temporal intervals within the 1.5 s segments carry disproportionate importance for emotion prediction, suggesting time-localized cues are informative.
- Overall, models trained on 1.5 s uniformly segmented clips achieved well-above-chance performance, approaching human-level accuracy in several settings, with hybrid C-DNN slightly outperforming DNN on the combined dataset.
Discussion
The findings demonstrate that emotions can be recognized from uniformly segmented 1.5 s speech clips independent of semantic content, addressing the core research question. Both DNN and C-DNN achieved consistent, above-chance BAC and human-comparable performance, indicating that engineered acoustic features capture key emotion-related patterns and that combining with spectrograms yields modest gains in some cases. The CNN alone underperformed, likely due to overfitting and reduced information density in short, truncated spectrograms.
Across datasets, Emo-DB yielded the highest BAC for DNN and typically for C-DNN (apart from one outlier), though its smaller, less diverse content cautions against assuming better generalization. Combining RAVDESS and Emo-DB maintained consistent model performance (except CNN), indicating cross-lingual robustness despite potential dialectal influences (e.g., Canadian Raising affecting vowel formants in RAVDESS) that may subtly impact acoustic patterns. Time-segment saliency patterns suggest certain sub-intervals within 1.5 s carry strong emotional cues, supporting the feasibility of continuous assessment by sliding windows.
Methodologically, the hybrid model’s limited improvement suggests spectrogram information may be redundant with engineered features or that 1.5 s cropping diminishes spectrogram utility, contributing to CNN overfitting. Future approaches incorporating temporal modeling (e.g., overlapping windows, pooling across time, HMMs, sequence models) may aggregate weakly informative segments to boost accuracy. The results underscore the promise of 1.5 s segmentation for real-time emotion tracking, while highlighting the need for better temporal dynamics capture and broader, more naturalistic datasets.
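As a sketch of how such sliding-window tracking could look for a feature-based model (e.g., the DNN), assuming the `extract_features` helper and a trained model as sketched above and an illustrative 0.5 s hop not taken from the paper:

```python
# Continuous emotion tracking over a longer recording via sliding 1.5 s windows.
import numpy as np
import librosa

def predict_over_recording(path: str, model, extract_features,
                           sr: int = 22050, win_s: float = 1.5, hop_s: float = 0.5):
    y, _ = librosa.load(path, sr=sr, mono=True)
    win, hop = int(win_s * sr), int(hop_s * sr)
    predictions = []
    # Recordings shorter than one window would first need padding,
    # as in the preprocessing step above.
    for start in range(0, max(1, len(y) - win + 1), hop):
        segment = y[start:start + win]
        feats = extract_features(segment, sr)[np.newaxis, :]   # shape (1, n_features)
        probs = model.predict(feats, verbose=0)[0]             # per-class probabilities
        predictions.append((start / sr, probs))
    return predictions   # list of (window start time in s, class probabilities)
```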
Conclusion
This work proposes and evaluates an approach for continuous emotion prediction from speech using fixed 1.5 s segments. By extracting a rich set of acoustic features and training DNN, CNN, and hybrid C-DNN models on Emo-DB, RAVDESS, and their combination, we obtained balanced accuracies that reliably exceed chance and, for DNN and C-DNN, approximate human performance across emotions. The method supports continuous, language-agnostic emotion assessment without heavy preprocessing, demonstrating feasibility for real-time applications.
Future research should mitigate overfitting (especially in CNNs), improve temporal modeling (e.g., overlapping windows, time-pooling, sequence models), and expand datasets to include spontaneous, culturally diverse, and varied-intensity emotions. Exploring interpretable and multimodal architectures (e.g., SincNet, 1D/2D CNNs, fusion with facial cues) and optimizing segment lengths (possibly as a learned parameter) may further improve accuracy. Developing a deployable software tool with user-friendly interfaces could facilitate adoption in practical settings.
Limitations
- Use of actor-performed emotions may limit ecological validity and fail to capture subtle, authentic expressions encountered in real life.
- Heterogeneous datasets with varying intensity and acoustic qualities complicate reliable labeling and may constrain generalizability.
- Linguistic and dialectal differences (e.g., Canadian Raising) can influence acoustic patterns, potentially confounding emotion-specific cues.
- Fixed 1.5 s segmentation may disrupt temporal dynamics and contribute to CNN overfitting; the chosen length is not necessarily optimal.
- Spectrogram-based CNNs showed instability and overfitting, likely due to limited informative content in short, clipped segments.
- Reusing validation data during hyperparameter optimization and combining multiple evaluation protocols add complexity; careful separation of validation and test data is necessary in future work.