A deep learning framework for gender sensitive speech emotion recognition based on MFCC feature selection and SHAP analysis

Computer Science


Q. Hu, Y. Peng, et al.

A new deep-learning approach substantially improves speech emotion recognition, raising accuracy by up to 15% over prior methods and enabling real-time analysis for applications such as live TV audience monitoring. Research by Qingqing Hu, Yiran Peng, and Zhong Zheng showcases CNN- and LSTM-driven models that decode emotions such as happiness, sadness, anger, fear, surprise, and neutrality.

Introduction
The study addresses the challenge of accurately recognizing emotions from speech, a critical capability for natural human-computer interaction and affective computing. Prior NLP and speech modeling approaches have ranged from statistical language models to neural networks, yet emotion recognition often treats gender as a confounding factor. The research proposes a gender-sensitive framework that explicitly models physiological and behavioral differences in emotional speech production. The purpose is to improve robustness and accuracy in classifying emotions across arousal levels and among emotions with similar arousal profiles, using multimodal features centered on speech (prosodic and spectral) and deep learning. The importance lies in enabling real-time, reliable emotion-aware systems for applications such as media analysis, customer feedback, accessibility tools, and human-machine interaction.
Literature Review
Classical approaches to language and speech modeling utilized rule-based and statistical methods, with statistical language models decomposing language into conditional probabilities. Neural architectures, especially RNNs, have successfully inferred grammatical structures and semantics from large datasets. Recent hybrid models, such as dual-stream CNN-Transformer fusion networks and Contextualized Convolutional Transformer-GRU (CCTG-NET), have shown promise for speech emotion recognition by capturing local and global patterns. However, most prior works do not explicitly incorporate gender differences, often treating gender as a nuisance variable. The study situates itself within this gap, building on MFCC-based feature extraction, filter and wrapper feature selection strategies, and deep architectures (CNNs, LSTMs) while introducing gender-aware normalization, selection thresholds, and kernel optimization, as well as interpretability via SHAP.
Methodology
Dataset and evaluation: Berlin Emo-DB was used with a 10-fold cross-validation protocol. The database was partitioned into ten non-overlapping folds; in each fold one subset served as test data and the remaining nine as training data, and performance was averaged over the ten evaluations.

Preprocessing: Speech signals were pre-emphasized with the high-pass filter H(z) = 1 − θz⁻¹, θ ≈ 0.95, to attenuate low-frequency energy. MFCCs were then extracted from the FFT spectrum after Mel-scale warping, f_mel = 2595·log10(1 + f/700), with wider Mel filters at high frequencies to reflect reduced human sensitivity there (a minimal extraction sketch appears below).

Feature selection: A hybrid strategy combined a filter criterion (Fisher score) with wrapper-based Bat Algorithm optimization. The Fisher score of feature u was computed as Fu = [Σc=1..C nc (μc,u − μu)²] / [Σc=1..C nc σ²c,u], favoring features with high between-class variance and low within-class variance. Gender-dependent thresholds were applied (Fisher threshold 1.8 for male speakers, 2.3 for female speakers); see the score-computation sketch below. Wrapper optimization via the Bat Algorithm then refined the subset for classification performance, reducing training time relative to a pure wrapper while improving accuracy over a filter-only approach.

Gender-sensitive pipeline: Data were processed separately for male and female speakers. Gender-dependent normalization was applied to fundamental frequency and spectral features, and the SVM kernel was optimized independently per gender (RBF with γ = 0.5 for males, γ = 0.3 for females) in the stages that use SVMs. Classification proceeded in stages: an initial grouping by arousal-related categories (anger + happiness, normal + fatigue, disgust, fear, discomfort), followed by a secondary classification to disambiguate emotions with similar arousal.

Deep classification architecture: A CNN-LSTM hybrid classified seven emotions from MFCC sequences. Architecture: input MFCCs (e.g., 40 coefficients × N frames); CNN1: 32 filters, 3×3, ReLU; MaxPool 2×2; CNN2: 64 filters, 3×3, ReLU; MaxPool 2×2; Flatten; LSTM (128 units, return_sequences=False); Dropout 0.3; Dense (64, ReLU); Softmax output with 7 units (a Keras sketch appears below). Training: Adam optimizer, learning rate 0.001, batch size 32, 100 epochs, early stopping (patience 10), validation split 0.1 within each fold, 10-fold CV. Hyperparameters were selected via grid search over learning rates (0.0001–0.01), batch sizes (16, 32, 64), and epochs (50–150). Reproducibility: requirements, preprocessing scripts, and the CV setup were documented in a GitHub repository.

Probabilistic modeling (comparative stages): A GMM p(x) = Σi=1..K ci·N(x|μi, Σi), with Gaussian densities N(x|μi, Σi) = (2π)^{−d/2}|Σi|^{−1/2}·exp(−½(x−μi)ᵀΣi⁻¹(x−μi)), served as a baseline; the model order (number of mixtures) affected both accuracy and runtime. A fitting sketch appears below.

Interpretability: SHAP analysis was applied to the MFCC features to quantify each coefficient's contribution to the emotion predictions (see the attribution sketch below). MFCCs 3, 5, 7, and 11 were most important for high-arousal emotions (anger, happiness), with MFCC-5 contributing 28% of the variance in SHAP scores for anger; low-order MFCCs (1–3) mattered more for subdued emotions (sadness, tiredness).

Comparative baselines: Fisher-only, Bat-only wrapper, and other metaheuristics (GA, GSA, TLBO variants, Dempster-Shafer) were evaluated for feature selection. Deep baselines included VGGNet, InceptionV3, ResNet-50, and a Transformer model, all trained under the same CV protocol.
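To make the preprocessing step concrete, the sketch below applies the pre-emphasis filter H(z) = 1 − 0.95z⁻¹ and extracts 40 MFCCs per frame with librosa. The sampling rate, FFT length, and hop size are illustrative assumptions not stated in the summary; this is a minimal sketch, not the authors' exact pipeline.

```python
import librosa

def extract_mfcc(path, sr=16000, n_mfcc=40, n_fft=400, hop_length=160):
    """Load a speech file, apply pre-emphasis H(z) = 1 - 0.95 z^-1,
    and return an (n_mfcc, n_frames) MFCC matrix.
    sr / n_fft / hop_length are illustrative assumptions."""
    y, sr = librosa.load(path, sr=sr)
    y = librosa.effects.preemphasis(y, coef=0.95)  # high-pass pre-emphasis
    # librosa computes the FFT spectrum, warps it onto the Mel scale
    # (f_mel = 2595 * log10(1 + f/700)), and applies a DCT to obtain MFCCs.
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop_length)
```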
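The Fisher-score filter stage follows directly from the formula above. The sketch below computes Fu per feature and keeps features above the gender-dependent threshold (1.8 for male, 2.3 for female speakers); the array layout and function names are assumptions made for illustration.

```python
import numpy as np

def fisher_scores(X, y):
    """Fisher score per feature u:
    F_u = sum_c n_c (mu_{c,u} - mu_u)^2 / sum_c n_c sigma^2_{c,u}.
    X: (n_samples, n_features); y: (n_samples,) class labels."""
    overall_mean = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        n_c = Xc.shape[0]
        num += n_c * (Xc.mean(axis=0) - overall_mean) ** 2
        den += n_c * Xc.var(axis=0)
    return num / np.maximum(den, 1e-12)  # guard against zero variance

def select_by_gender(X, y, gender):
    """Apply the gender-dependent Fisher threshold (male 1.8, female 2.3)."""
    threshold = 1.8 if gender == "male" else 2.3
    keep = fisher_scores(X, y) >= threshold
    return X[:, keep], keep
```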
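A minimal Keras realization of the described CNN-LSTM classifier is sketched below. The summary lists a Flatten step before the LSTM; here the CNN output is permuted and reshaped so the LSTM still receives a time-ordered sequence, and the fixed frame count of 128 is an assumption made only so that the reshape is well defined.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

N_MFCC, N_FRAMES, N_CLASSES = 40, 128, 7  # frame count is an assumption

def build_cnn_lstm():
    model = models.Sequential([
        layers.Input(shape=(N_MFCC, N_FRAMES, 1)),            # MFCCs as a 2-D map
        layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
        layers.MaxPooling2D((2, 2)),                           # -> (10, 32, 64)
        layers.Permute((2, 1, 3)),                             # put time axis first
        layers.Reshape((N_FRAMES // 4, (N_MFCC // 4) * 64)),   # flatten per time step
        layers.LSTM(128, return_sequences=False),
        layers.Dropout(0.3),
        layers.Dense(64, activation="relu"),
        layers.Dense(N_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_cnn_lstm()
early_stop = tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)
# Inside each CV fold, training would follow the reported settings:
# model.fit(X_train, y_train, validation_split=0.1,
#           epochs=100, batch_size=32, callbacks=[early_stop])
```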
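For the GMM baseline p(x) = Σi ci·N(x|μi, Σi), one full-covariance mixture per emotion class can be fitted and scored as below, assuming scikit-learn is acceptable as a stand-in; the number of mixtures K is the tunable model order the summary notes affects accuracy and runtime.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_class_gmms(X, y, n_components=8):
    """Fit one full-covariance GMM per emotion class.
    n_components (the model order K) is an assumption to be tuned."""
    return {c: GaussianMixture(n_components=n_components,
                               covariance_type="full").fit(X[y == c])
            for c in np.unique(y)}

def gmm_predict(gmms, X):
    """Assign each sample to the class whose GMM gives the highest log-likelihood."""
    classes = sorted(gmms)
    log_likes = np.column_stack([gmms[c].score_samples(X) for c in classes])
    return np.array(classes)[log_likes.argmax(axis=1)]
```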
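SHAP attribution over the MFCC inputs could be computed along the lines below. GradientExplainer is used here because it accepts Keras models, but the exact explainer, return shapes, and class ordering depend on the installed shap and TensorFlow versions, so treat this as an assumption-laden sketch rather than the authors' analysis script.

```python
import numpy as np
import shap

# model: trained Keras CNN-LSTM; X_train / X_test: MFCC tensors of shape (n, 40, 128, 1)
background = X_train[np.random.choice(len(X_train), 100, replace=False)]
explainer = shap.GradientExplainer(model, background)
shap_values = explainer.shap_values(X_test[:50])   # typically one array per emotion class

# Average |SHAP| over samples, frames, and channels to rank MFCC coefficients per class.
anger_idx = 0  # index of the "anger" output unit (assumed label ordering)
importance = np.abs(shap_values[anger_idx]).mean(axis=(0, 2, 3))  # -> (40,)
print("Most influential MFCC indices for anger:", np.argsort(importance)[::-1][:5])
```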
Key Findings
Arousal-level classification (prosodic/spectral features): male total accuracy 96.68% (Table 1); female total accuracy 88.03% (Table 2). Optimal feature counts were roughly 900 for males and 800 for females.

Emotion pairs with similar arousal (nonlinear dynamic features): anger vs. happiness, females 99.1% (300 features), males 98.85% (750 features) (Fig. 3); fatigue vs. normal, females 100% (1000 features), males 98.65% (300 features) (Tables 3–4, Fig. 4).

Final 7-emotion classification: male total accuracy 96.35% (Table 5); female total accuracy 80.31% (Table 6). Class-wise male recognition rates included 100% for neutral, 98.51% for anger, and 97.83% for fatigue, with fear lowest at 84.38%.

Overall proposed CNN-LSTM performance on Berlin Emo-DB: accuracy 89.5%, precision 90.4%, recall 87.3%, F1-score 88.7% (Table 11). Compared deep baselines: VGGNet 85.4%, InceptionV3 84.7%, ResNet-50 82.9%, Transformer 83.6% (Fig. 10; Table 11). Ablations showed that removing the LSTM reduced accuracy by about 5.4%, and a CNN-only model dropped by about 7.1%.

Feature selection comparisons: Fisher+Bat achieved 89.5% accuracy with 12 h of training; Fisher-only 84.7% (8 h); Bat-only 88.2% (18 h) (Table 8). A broader algorithm comparison (Table 9) indicated that Bat variants offered the best accuracy/time trade-off.

Real-time feasibility: on an NVIDIA Jetson Nano, average inference time was 72 ms per sample (~13.9 FPS), with end-to-end latency below 120 ms using ONNX Runtime and 8-bit quantization.

Statistical validation: a paired t-test against the best baseline showed a significant improvement (t(9) = 4.32, p < 0.01), and a one-way ANOVA confirmed a significant gender effect (F(1,18) = 8.67, p = 0.008). 95% confidence intervals: overall 89.5% ± 1.2%; female 96.35% ± 0.9%; male 87.18% ± 1.5%. A sketch of how such fold-level tests and intervals are computed appears below.

SHAP insights: MFCCs 3, 5, 7, and 11 were consistently discriminative for high-arousal emotions, with MFCC-5 contributing about 28% of the variance in SHAP scores for anger; low MFCC indices (1–3) were more salient for sadness and tiredness.

Additional baselines: a simple CNN (78.2%), SVM (74.1%), and random forest (76.3%) on the same features and protocol all fell below the proposed model's 89.5%.
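The reported paired t-test and confidence intervals can be reproduced from per-fold accuracies with scipy, as in the hedged sketch below. The two inputs stand for the ten fold-level scores of the proposed model and the best baseline; no actual values from the paper are supplied here, and the function name is illustrative.

```python
from scipy import stats

def compare_models(proposed_fold_acc, baseline_fold_acc, confidence=0.95):
    """Paired t-test over matched CV folds plus a t-based confidence interval.
    Both inputs are length-10 arrays of fold accuracies (not supplied here)."""
    t_stat, p_value = stats.ttest_rel(proposed_fold_acc, baseline_fold_acc)

    mean = proposed_fold_acc.mean()
    sem = stats.sem(proposed_fold_acc)
    half_width = sem * stats.t.ppf((1 + confidence) / 2,
                                   df=len(proposed_fold_acc) - 1)

    return {"t": t_stat, "p": p_value,
            "ci": (mean - half_width, mean + half_width)}
```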
Discussion
The findings demonstrate that explicitly modeling gender differences, combined with a hybrid Fisher+Bat feature selection and a CNN-LSTM architecture, substantially improves speech emotion recognition accuracy and reduces misclassification among emotions sharing similar arousal (e.g., anger vs. happiness). Prosodic and spectral features effectively separate arousal categories, while nonlinear dynamic features resolve within-arousal confusions. SHAP analysis provides interpretable insights into MFCC contributions, enhancing trust in the model’s decisions. Real-time inference results indicate suitability for edge deployments. Compared to state-of-the-art CNNs and Transformers, the proposed framework achieves higher accuracy and better efficiency, validating the hypothesis that gender-aware, hybrid temporal-spectral modeling improves SER performance and generalizability within the evaluated setting.
Conclusion
The study presents a gender-sensitive deep learning framework for speech emotion recognition that integrates MFCC-based feature extraction, hybrid feature selection (Fisher + Bat), and a CNN-LSTM classifier, augmented with SHAP-driven interpretability. Experiments on Berlin Emo-DB show 89.5% overall accuracy and strong performance across arousal-level and fine-grained emotion classifications. The approach offers improved accuracy over several deep baselines, near real-time inference, and interpretable feature importance. Future work will focus on generalization to diverse, multilingual datasets (RAVDESS, SAVEE, CREMA-D), multimodal fusion with facial expressions, fairness analyses across demographics, and efficiency improvements to reduce computational costs associated with wrapper-based selection.
Limitations
Generalizability is limited by the Berlin Emo-DB’s size, language (German), and controlled emotional expressions, which may not reflect real-world variability in accents, cultures, and recording conditions. The wrapper-based feature selection (Bat) introduces higher computational cost compared to end-to-end models. Potential demographic bias due to dataset homogeneity remains a concern. Overfitting risk exists given limited data, mitigated via cross-validation and regularization but warranting broader dataset evaluation and augmentation.