A hybrid deep learning model with feature engineering technique to enhance teacher emotional support on students' engagement for sustainable education

R. G. Al-anazi, N. M. Alhammad, et al.

Using AI and deep learning, this study introduces HDLMFE-ETESSE, a hybrid model that combines an AdaptSepCX attention network with a C-BiG classifier to detect student emotions from facial expressions and enhance engagement for sustainable education. The approach achieves 98.58% accuracy on a student-engagement dataset.

Introduction
The study addresses the challenge of reliably recognizing students’ emotions—such as curiosity, pleasure, worry, frustration, confusion, and boredom—during classroom and remote learning activities. Emotions are strongly tied to academic performance and personal development, yet contactless, illumination-independent, and generalizable emotion detection remains difficult. Advances in AI and DL, especially CNNs, enable automated facial emotion recognition by learning spatial visual patterns. The research aims to develop an effective student emotion recognition model to enhance teacher emotional support, improve student engagement, and contribute to sustainable education, including digital and remote contexts where traditional cues are limited.
Literature Review
The paper surveys recent work on multimodal and DL-based emotion recognition and engagement detection:
(1) Hierarchical Cross-modal Spatial Fusion Network (HCSFNet), fusing EEG and video features with attention and spatial pyramid pooling, reporting accuracies up to 97.78% (DEAP) and 60.59% (MAHNOB-HCI).
(2) Survey-based studies of humble teacher leadership and student creative engagement.
(3) Temporal multimodal fusion for engagement detection (MSC-Trans), integrating CNNs and encoder-decoder architectures.
(4) Knowledge distillation approaches (TelME, HCIFN-SD) that strengthen weaker modalities (e.g., text vs. non-verbal) in conversational emotion recognition, evaluated on IEMOCAP and MELD.
(5) Ensemble CNNs for facial emotion recognition (HoE-CNN), along with EfficientNet-based and other CNN architectures for improved classification.
(6) Classroom behavior analysis in arts education using emotion recognition (ERAM).
(7) Cross-modal distillation from GSR/EEG to unimodal models (EmotionKD).
(8) DL/ML models on BCI data for concentration detection (DNN, 1D CNN, BiLSTM).
(9) Activity-based learning and ChatGPT for personalized feedback.
(10) Hybrid predictive analytics (CNN, RF, XGBoost) on large academic records.
(11) Integration of VR with educational theories to improve engagement and retention.
(12) DL-based multi-attribute evaluation for holistic assessment, and automated learning style identification using topic modeling, FSLSM, and DL.
(13) Cross-modal adversarial learning for classroom engagement with Transformer-CNN-LSTM.
(14) Satisfaction and emotion prediction in online learning (ANN, RF, BFS).
(15) Hybrid DL for cognitive engagement classification (CNN-BiLSTM, BERT-BiLSTM), and EfficientNetV2-L with RNNs for student engagement detection on DAiSEE.
The review highlights limitations such as limited generalizability across settings, reliance on specific datasets and modalities (EEG/GSR), incomplete handling of temporal dynamics and cross-modal interactions, and insufficient exploration of lightweight models for real-time deployment.
Methodology
The proposed HDLMFE-ETESSE framework consists of three stages.
(1) Image pre-processing: Facial alignment and normalization improve image quality and consistency. Keypoint detection and a regression model estimate landmark coordinates, and images are aligned via an affine transformation T (p' = T·p). Pixel intensities are normalized to [−1, 1] using I_norm = 2((I − I_min)/(I_max − I_min)) − 1 for training stability, and illumination standardization adjusts pixels by (I − μ)/σ to reduce lighting bias. Images are resized uniformly (e.g., 1024×1024) with bicubic interpolation.
(2) Feature extraction: The AdaptSepCX attention network isolates salient features with adaptive separable convolutions to balance accuracy and efficiency, using GELU activations. The network begins with two Conv2D layers (e.g., 32 and 64 filters, 3×3 kernels, stride 2). It then employs four Progressive Feature Refinement (PFR) blocks, each comprising two SeparableConv2D layers (3×3), dropout, batch normalization, GELU, residual/skip connections, and an optional 1×1 SeparableConv for channel matching on the residual tensor. A Global Average Pooling (GAP) layer reduces the spatial dimensions to 1×1. A scalar dot-product attention mechanism computes attention scores across channels, applies softmax to obtain attention weights, and reweights the GAP features element-wise to emphasize important channels. The output stage uses dense layers (e.g., 64 units with GELU, followed by softmax over classes).
(3) Hybrid classification (C-BiG): A CNN extracts spatial features, and a Bidirectional GRU (Bi-GRU) then models temporal dependencies by processing feature sequences forward and backward, mitigating vanishing gradients and capturing long-term dynamics more efficiently than an LSTM. The final concatenated hidden states are fed to fully connected layers with softmax for classification. Algorithmic equations for Conv2D and the GRU gates (update z_t, reset r_t, candidate state h_t*) define the temporal modeling, and the overall pipeline integrates spatial-temporal learning for robust emotion recognition.
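To make stage (1) concrete, the following is a minimal Python/OpenCV sketch of the pre-processing operations described above: affine alignment from landmark correspondences, illumination standardization, min-max scaling to [−1, 1], and bicubic resizing. The landmark arrays, the ordering of the normalization steps, and the helper name preprocess_face are assumptions for illustration; the paper's own keypoint model is not reproduced here.

```python
import cv2
import numpy as np

def preprocess_face(image, src_landmarks, dst_landmarks, size=1024):
    """Align, standardize, normalize, and resize a face image.

    src_landmarks / dst_landmarks: (N, 2) arrays of detected and canonical
    keypoints (hypothetical placeholders; the paper's landmark model is
    not published in this summary).
    """
    # Affine alignment p' = T * p, with T estimated from landmark pairs
    T, _ = cv2.estimateAffinePartial2D(src_landmarks.astype(np.float32),
                                       dst_landmarks.astype(np.float32))
    aligned = cv2.warpAffine(image, T, (image.shape[1], image.shape[0]))

    # Uniform resize with bicubic interpolation (e.g., 1024x1024)
    resized = cv2.resize(aligned, (size, size),
                         interpolation=cv2.INTER_CUBIC).astype(np.float32)

    # Illumination standardization: (I - mu) / sigma
    standardized = (resized - resized.mean()) / (resized.std() + 1e-8)

    # Min-max scaling to [-1, 1]: I_norm = 2 * (I - I_min) / (I_max - I_min) - 1
    # (the exact ordering of the two normalization steps is an assumption)
    i_min, i_max = standardized.min(), standardized.max()
    return 2.0 * (standardized - i_min) / (i_max - i_min + 1e-8) - 1.0
```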
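The AdaptSepCX extractor in stage (2) can be sketched in Keras under the layer choices quoted above: a two-layer Conv2D stem, four PFR blocks built from 3×3 SeparableConv2D with batch normalization, GELU, dropout, and residual connections, GAP, channel-wise attention, and a dense output head. Filter widths beyond the stem, the input resolution, and the exact attention formulation are assumptions where the summary does not specify them.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def pfr_block(x, filters, dropout=0.2):
    """Progressive Feature Refinement block: two SeparableConv2D layers with
    batch norm, GELU, dropout, and a residual connection (1x1 SeparableConv
    on the shortcut when channel counts differ)."""
    shortcut = x
    y = layers.SeparableConv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("gelu")(y)
    y = layers.SeparableConv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Dropout(dropout)(y)
    if shortcut.shape[-1] != filters:
        shortcut = layers.SeparableConv2D(filters, 1, padding="same")(shortcut)
    y = layers.Add()([y, shortcut])
    return layers.Activation("gelu")(y)

def build_adaptsepcx(input_shape=(224, 224, 3), num_classes=6):
    inputs = layers.Input(shape=input_shape)
    # Stem: two strided Conv2D layers (32 and 64 filters, 3x3, stride 2)
    x = layers.Conv2D(32, 3, strides=2, padding="same", activation="gelu")(inputs)
    x = layers.Conv2D(64, 3, strides=2, padding="same", activation="gelu")(x)
    # Four PFR blocks with progressively wider channels (widths assumed)
    for filters in (64, 128, 256, 512):
        x = pfr_block(x, filters)
    # Global Average Pooling reduces each feature map to a channel descriptor
    gap = layers.GlobalAveragePooling2D()(x)
    # Channel attention: scores -> softmax weights -> element-wise reweighting
    scores = layers.Dense(gap.shape[-1])(gap)
    weights = layers.Softmax()(scores)
    attended = layers.Multiply()([gap, weights])
    # Output head: 64-unit dense layer with GELU, then softmax over classes
    x = layers.Dense(64, activation="gelu")(attended)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return Model(inputs, outputs, name="AdaptSepCX_sketch")
```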
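The C-BiG classifier in stage (3) pairs per-frame spatial feature extraction with a Bidirectional GRU over the frame sequence. A minimal Keras sketch follows; sequence length, frame size, and layer widths are assumptions, and the GRU layer applies the update gate z_t, reset gate r_t, and candidate state internally.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_cbig(seq_len=16, frame_size=64, num_classes=6):
    """Sketch of the C-BiG classifier: a small CNN applied per frame, followed
    by a Bi-GRU over the frame sequence and a softmax classification head."""
    inputs = layers.Input(shape=(seq_len, frame_size, frame_size, 3))
    # CNN extracts spatial features from each frame in the sequence
    x = layers.TimeDistributed(
        layers.Conv2D(32, 3, padding="same", activation="relu"))(inputs)
    x = layers.TimeDistributed(layers.MaxPooling2D())(x)
    x = layers.TimeDistributed(
        layers.Conv2D(64, 3, padding="same", activation="relu"))(x)
    x = layers.TimeDistributed(layers.GlobalAveragePooling2D())(x)
    # Bi-GRU reads the feature sequence forward and backward; the final
    # hidden states of both directions are concatenated
    x = layers.Bidirectional(layers.GRU(128))(x)
    # Fully connected head with softmax over emotion/engagement classes
    x = layers.Dense(64, activation="relu")(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return Model(inputs, outputs, name="C_BiG_sketch")

model = build_cbig()
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```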
Key Findings
Dataset: Student-engagement dataset (Kaggle) with two main classes: Engaged (1076 images; subclasses: confused 369, engaged 347, frustrated 360) and Not engaged (1044 images; subclasses: looking away 423, bored 358, drowsy 263).
80:20 split results (Table 2):
• Training phase (80% TRPHE) averages: Accuracy 98.39%, Precision 95.19%, Recall 95.09%, F1-score 95.12%, MCC 94.16%, Kappa 94.23%.
• Testing phase (20% TSPHE) averages: Accuracy 98.58%, Precision 95.66%, Recall 95.70%, F1-score 95.64%, MCC 94.82%, Kappa 94.88%.
• Class-wise examples (20% TSPHE): Engaged Acc 98.82%, Prec 95.65%, Rec 97.06%, F1 96.35%; Looking Away Acc 98.82%, Prec 97.78%, Rec 96.70%, F1 97.24%; Bored Acc 98.58%, Prec 98.21%, Rec 91.67%, F1 94.83%.
70:30 split results (Table 3):
• Training phase (70% TRPHE) averages: Accuracy 97.30%, Precision 91.96%, Recall 91.50%, F1-score 91.67%, MCC 90.09%, Kappa 90.16%.
• Testing phase (30% TSPHE) averages: Accuracy 97.33%, Precision 91.56%, Recall 91.23%, F1-score 91.33%, MCC 89.78%, Kappa 89.84%.
Learning curves: Training and validation accuracy increase quickly and align closely, with validation occasionally slightly exceeding training, indicating good generalization. Loss curves decrease steadily without divergence, suggesting stable training and minimal overfitting.
Comparative performance (Table 4): HDLMFE-ETESSE achieves the highest metrics (Accuracy 98.58%, Precision 95.66%, Recall 95.70%, F1-score 95.64%), outperforming Ensemble-DL (Acc 75.80%), CNN-DALBPFE (Acc 94.82%), TCN (Acc 91.20%), Bag-of-State (Acc 66.58%), EfficientNetB0 (Acc 95.30%), InceptionV3 (Acc 94.19%), and VGG-19 (Acc 86.15%).
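For reference, the reported metrics (accuracy, precision, recall, F1-score, MCC, and Cohen's kappa) can be computed from predicted and true labels with scikit-learn as sketched below; the label arrays are hypothetical placeholders, not the paper's data.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef, cohen_kappa_score)

# Hypothetical class labels for the six engagement subclasses (0-5)
y_true = [0, 1, 2, 3, 4, 5, 0, 1]
y_pred = [0, 1, 2, 3, 4, 5, 0, 2]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, average="macro"))
print("Recall   :", recall_score(y_true, y_pred, average="macro"))
print("F1-score :", f1_score(y_true, y_pred, average="macro"))
print("MCC      :", matthews_corrcoef(y_true, y_pred))
print("Kappa    :", cohen_kappa_score(y_true, y_pred))
```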
Discussion
The HDLMFE-ETESSE framework addresses the research goal of accurate, robust student emotion recognition to support teacher interventions and enhance engagement. Pre-processing (alignment, normalization, illumination correction) reduces dataset variability and lighting bias, enabling more consistent feature learning. AdaptSepCX attention selectively emphasizes salient facial cues while suppressing noise, balancing speed and accuracy for potential real-time use. The C-BiG hybrid leverages spatial features and temporal dependencies in facial sequences, improving recognition reliability over static or single-modality approaches. Empirical results show high accuracy (up to 98.58%) and well-aligned training/validation curves across splits, indicating good generalization. Comparative analyses confirm superiority over several established CNN and temporal models, suggesting the approach’s relevance for sustainable education settings, including digital classrooms where teacher-student interactions are constrained.
Conclusion
The study introduces HDLMFE-ETESSE, combining rigorous image pre-processing, AdaptSepCX attention-based feature extraction, and a CNN–BiGRU (C-BiG) hybrid classifier for student emotion recognition. On the student-engagement dataset, the model achieved state-of-the-art performance, with test accuracy up to 98.58% and balanced precision/recall/F1, outperforming multiple baselines. This framework can support teachers by providing accurate, contactless insights into students’ emotional engagement, contributing to sustainable education practices. Future work includes integrating multi-source data (contextual variables, physiological signals), exploring lightweight architectures for real-time deployment, addressing fairness and class imbalance, and validating across diverse cultural and educational settings and platforms.
Limitations
The study relies on a single dataset, which may limit generalizability across cultural and institutional contexts. It excludes contextual variables and physiological signals (e.g., EEG/GSR), potentially constraining engagement inference. Sample size and class imbalance may affect learning fairness and robustness. There is no real-time validation or deployment assessment, and user-specific factors and subjective annotation of engagement levels may reduce reliability.