Study on emotion recognition bias in different regional groups

Computer Science

M. Lukac, G. Zhambulova, et al.

Real-time emotion recognition across cultures is improved by a meta-model, the Multi-Cues Emotion Model (MCAM), that fuses image features, action units, and micro- and macro-expressions, revealing that regional biases persist and that learning some regional expressions may require forgetting others. This research was conducted by Martin Lukac, Gulnaz Zhambulova, Kamila Abdiyeva, and Michael Lewis.
Introduction

Non-verbal communication, including facial expressions, constitutes a major portion of human interaction. Automated facial expression recognition (FER) typically operates on still images extracted from video. Emotional understanding is strongly shaped by cultural and regional context, and models trained on region-specific datasets (e.g., FER2013, largely Caucasian/African-American, versus JAFFE, posed expressions from Japanese women) often perform poorly across regions. Beyond differences between posed and unposed expressions, evidence suggests that facial expressions of emotion are not universal across cultures, and FER models exploit varied cues (macro- and micro-expressions, action units, and image features). The research question is whether adaptively combining diverse emotional cues from different region-specific detectors can reduce inter-cultural bias and improve cross-regional emotion recognition without fully retraining for each region.

Literature Review

Prior FER work includes ensemble CNNs, probability-based fusion, and multimodal fusion (audio, video), but these approaches typically merge similar cues rather than diverse ones. Macro-expression approaches map image-level features to discrete emotion labels using deep CNNs and their variants (e.g., inception-residual blocks, SVM losses). Micro-expression recognition relies on spatial-temporal features and 3D CNNs, labeling expressions as positive, negative, or surprised (and sometimes neutral). Action unit (FACS) methods detect facial muscle movements using appearance/geometry features and deep learning. Image features extracted via pre-trained CNNs (e.g., VGG) can be robust but may carry dataset biases when fed into region-specific classifiers. Dataset bias across FER corpora (e.g., FER2013, JAFFE, CK+) is recognized in the literature; ensemble and multimodal methods improve performance but may not adequately address cross-regional cultural differences in facial expressions.

Methodology

The authors propose MCAM, a model that fuses four emotional cues via an attention mechanism: (1) context-independent image features (F) extracted with a VGG11 pre-trained on ImageNet, yielding a 2752-dimensional feature vector; (2) action units (U) detected with OpenFace 2.0, whose 11 AU intensities are mapped to emotions by an SVM; (3) micro-expressions (I) from a 3D CNN trained on SMIC, together with a fine-tuned variant (I+) trained on SMIC and CK+, producing the label sets {positive, negative, surprised} and {positive, negative, surprised, neutral}, aggregated from the macro categories; and (4) macro-expressions (A) from a VGG-based CNN trained on FER2013 that outputs {angry, scared, happy, sad, surprised, neutral}. These detectors produce the cue outputs M_f, M_u, M_i, M_i+, and M_a.

An attention-based multi-layer perceptron (three hidden layers of 10 neurons each, ReLU activations, learning rate 0.001) serves as the fusion classifier that predicts the final macro-emotion label. In the MCAM notation, A denotes the model using the full cue set S = {M_a, M_i, M_i+, M_u, M_f}, while subscripted variants (e.g., A_{−i+,−u}) denote models with the listed cues excluded.

Training follows two regimes: Global Adaptation Mode (GAM), in which the detectors themselves are trained from scratch or fine-tuned on the target dataset, and Local Adaptation Mode (LAM), in which the pre-trained detectors (macro from FER2013; micro from SMIC; micro+ fine-tuned on CK+; AU and image features used off-the-shelf) are kept frozen and only the attention network is trained on the target dataset. Experiments include: (i) LAM-trained attention over all cues, (ii) single-cue attention models, and (iii) GAM fine-tuning of individual detectors followed by attention training. FER2013 and SMIC are used to train the base detectors; evaluation spans JAFFE (Japanese), TFED (Chinese), TFEID (Taiwanese), and CK+ (Caucasian/African-American), with an 80/20 train/test split on each evaluation dataset.
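To make the fusion step concrete, the following is a minimal PyTorch sketch, not the authors' released code, of a LAM-style fusion head: the frozen detectors' outputs are concatenated, a learned softmax produces one attention weight per cue, and an MLP with three hidden layers of 10 ReLU units maps the re-weighted concatenation to the six macro-emotion labels. The cue dimensionalities, the scalar-per-cue attention formulation, and the names FusionMLP and EMOTIONS are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of a LAM-style fusion classifier:
# the cue detectors are frozen, and only this small attention/MLP head is
# trained to re-weight and combine their outputs. Input sizes and the
# attention formulation are illustrative assumptions.
import torch
import torch.nn as nn

EMOTIONS = ["angry", "scared", "happy", "sad", "surprised", "neutral"]

class FusionMLP(nn.Module):
    def __init__(self, cue_dims, num_classes=len(EMOTIONS), hidden=10):
        super().__init__()
        self.cue_dims = cue_dims            # e.g. {"M_a": 6, "M_i": 3, "M_i+": 4, "M_u": 6, "M_f": 2752}
        total = sum(cue_dims.values())
        # One scalar attention weight per cue, predicted from the concatenated inputs.
        self.attn = nn.Sequential(nn.Linear(total, len(cue_dims)), nn.Softmax(dim=-1))
        # Three hidden layers of 10 neurons each with ReLU, as described in the paper.
        self.mlp = nn.Sequential(
            nn.Linear(total, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, cues):                # cues: dict of name -> (batch, dim) tensors
        names = list(self.cue_dims)
        x = torch.cat([cues[n] for n in names], dim=-1)
        w = self.attn(x)                    # (batch, num_cues) soft cue weights
        weighted = [cues[n] * w[:, i:i + 1] for i, n in enumerate(names)]
        return self.mlp(torch.cat(weighted, dim=-1))   # emotion logits

# LAM-style training: detectors stay frozen, only this head is optimized.
model = FusionMLP({"M_a": 6, "M_i": 3, "M_i+": 4, "M_u": 6, "M_f": 2752})
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
```

Under GAM, the cue detectors themselves would additionally be fine-tuned on the target dataset before training this head, whereas under LAM only the parameters shown above are optimized.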

Key Findings

Baseline SOTA accuracies are JAFFE 96%, TFEID 96%, TFED 79%, and CK+ 99%. MCAM trained in LAM improved on or matched SOTA on multiple datasets: A_{−f} (image features excluded) achieved JAFFE 95%, TFEID 98%, TFED 92.44%, and CK+ 99%, while the full A (image features included) achieved JAFFE 95%, TFEID 97.22%, TFED 86.36%, and CK+ 98.63%.

Individual detectors performed substantially worse across regions: D_a (macro) achieved JAFFE 54%, TFEID 72%, TFED 65%, CK+ 85%; D_i (micro) achieved JAFFE 47%, TFEID 44%, TFED 33%, CK+ 59%; and D_i+ (micro+, CK+ fine-tuned) achieved JAFFE 49%, TFEID 15%, TFED 30%, CK+ 90%.

Single-cue attention models (LAM) showed striking behavior: A_f (image features only) perfectly classified JAFFE and CK+ (100%, indicating overfitting) and achieved 98.19% on TFEID, but reduced TFED accuracy to 75.19%. Among single cues, LAM A_a (macro only) reached 87.12% on TFED, and GAM training of the macro/micro detectors further improved results.

GAM cross-evaluation without image features (A_{−f}): fine-tuning on TFEID yielded perfect classification on TFEID (100%) and strong cross-region performance (JAFFE 91.30%, TFED 95.65%, CK+ 91.30%); fine-tuning on CK+ achieved 98.80% on CK+ and high accuracies elsewhere. GAM with the full A (image features included) produced 100% accuracy on CK+ and TFEID regardless of the training dataset; JAFFE reached roughly 94.44% to 97.50% depending on the source dataset, and TFED improved over SOTA but did not reach the maxima seen in other configurations (about 81% to 83%).

Qualitatively, image-level features can either strongly improve performance (CK+, TFEID, and JAFFE when used alone) or interfere and reduce accuracy (TFED). The attention-based fusion effectively re-weights cues to extract salient information; however, cues learned from one region can confound recognition in another, suggesting a need for selective forgetting or re-learning. Overall, MCAM bridged biases for several datasets, outperforming SOTA on TFED and TFEID and matching it on CK+, but it did not surpass the JAFFE SOTA in all configurations.
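The GAM cross-evaluation reported above amounts to an adapt/evaluate grid over the four regional datasets. Below is a minimal sketch of that protocol under stated assumptions; cross_evaluation, adapt_fn, and evaluate_fn are hypothetical helpers standing in for adaptation (LAM or GAM) on a dataset's 80% train split and accuracy measurement on a dataset's 20% test split, and are not functions from the paper.

```python
# Hypothetical sketch of the cross-dataset evaluation grid: adapt MCAM on one
# regional dataset, then measure accuracy on every dataset's held-out split.
DATASETS = ["JAFFE", "TFEID", "TFED", "CK+"]

def cross_evaluation(adapt_fn, evaluate_fn, datasets=DATASETS):
    """Return {(source, target): accuracy} for every source/target pair.

    adapt_fn(source): MCAM adapted (LAM or GAM) on source's 80% train split.
    evaluate_fn(model, target): accuracy on target's 20% test split.
    """
    results = {}
    for source in datasets:
        model = adapt_fn(source)
        for target in datasets:
            results[(source, target)] = evaluate_fn(model, target)
    return results
```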

Discussion

The study asks whether diverse emotional cues can be adaptively combined to reduce cross-regional FER bias. The findings show that re-weighting pre-trained region-specific detectors via attention improves performance across multiple datasets, confirming that useful information exists within each cue but requires dataset-specific fusion. However, image-level features exhibit dataset-dependent interference: they benefit CK+ and TFEID, and JAFFE when used alone, but are detrimental for TFED when combined with other cues. This supports the hypothesis that regional differences in expression render some cues mutually interfering and hinder direct transfer. GAM fine-tuning can mitigate these biases and, when combined with image features, achieves perfect accuracy on certain datasets (indicative of overfitting and strong separability). The results imply that effective cross-cultural FER may require selective forgetting of previously learned cues and adaptive fusion rather than naive transfer learning, and that attention-based soft selection of cue confidences offers a practical path toward better generalization.

Conclusion

The paper demonstrates that cross-regional bias in FER persists across widely used datasets and that a fusion approach (MCAM) combining macro-expressions, micro-expressions, action units, and image features via attention can reduce this bias and outperform SOTA on several datasets without exhaustive retraining. Nonetheless, certain cues are mutually exclusive across regions, and adding image-level features can either help or harm depending on the dataset, implying that perfect unbiased classification is obstructed by intrinsic differences in regional expression patterns. The authors conclude that to learn some regional expressions, others may need to be forgotten and relearned from scratch, and that fusion-scoring across multiple cue categories can partially overcome these obstacles. Future work includes quantifying cue weights across different ethnic datasets and developing a generative model for synthesizing emotional expressions.

Limitations

The observed perfect accuracies indicate potential overfitting, especially when VGG11 image features are used. The datasets differ in size, quality, and elicitation (posed versus unposed expressions) and cover only six output categories, which may skew results. The study did not mix datasets or analyze each dataset's composition in depth, so the sources of bias (regional composition, collection methods, resolution, noise, diversity) remain unresolved. Image-feature interference varies by dataset, complicating any universal fusion strategy, and generalizability beyond the evaluated datasets requires further validation.
