Driving STEM learning effectiveness: dropout prediction and intervention in MOOCs based on one novel behavioral data analysis approach

Education


X. Xia and W. Qi

This study by Xiaona Xia and Wanxue Qi tackles the pressing issue of high dropout rates in online STEM education through an innovative dropout prediction model. By analyzing MOOC learning behavior data, the model effectively predicts dropouts and reveals useful intervention strategies to enhance STEM learning success.

Introduction
The paper addresses high dropout rates in STEM MOOCs that hinder learning effectiveness despite expanded access. It motivates the need to model learners’ behavior over complete learning periods, capturing both explicit (e.g., demographics, prior credits, final results) and implicit (e.g., participation intensity, interaction patterns) features. The research aims to predict dropout by mining massive behavioral logs, identify key temporal windows (especially early stages), and uncover influential factors and topological paths among interactive learning activities. The goal is to enable timely intervention and early warning to improve STEM learning outcomes in MOOCs.
Literature Review
The related work outlines MOOCs’ benefits and constraints for diverse learners, emphasizing that resource- and provider-centric designs can fail to meet individual backgrounds and interests, contributing to negative experiences and dropout. It reviews two dominant prediction approaches: (1) single-course, unsupervised clustering of ongoing cohorts—limited by ignoring inter-course relations; and (2) platform-level modeling transferred across similar courses—limited by MOOC-defined features and insufficient latent relationship modeling. Prior studies highlight the need to incorporate temporal sequences and address negative propagation of dropout tendencies. The paper positions its contribution as using large-scale, multi-course behavioral data with richer explicit/implicit features and temporal modeling to enhance dropout prediction and actionable interventions.
Methodology
Data: The study uses the anonymized Open University UK dataset (https://analyse.kmi.open.ac.uk/open_dataset), focusing on three STEM courses (DDD, EEE, FFF) across four presentations (2013B, 2013J, 2014B, 2014J). It models complete learning behavior instances: learners, courses, interactive learning activities, and clicks across periods. Activities vary by course and period; their distributions are summarized (e.g., Content, Resource, Forum, Subpage, and URL show high participation in various periods). Learner descriptors include demographics (Gender, Region, IMD band, Age band, Disability), learning accumulation (Number of previous attempts, Studied credits, Highest education), and assessment (Final result, Assessment type). Learning behavior is dichotomized into dropout (Withdrawn) and non-dropout.

Problem framing: Four tested problems (P1–P4) assess whether (P1) demographic information, (P2) topological paths among interactive learning activities, (P3) learning accumulation, and (P4) assessment results affect dropout trends.

Feature definitions: Explicit features are directly observable (e.g., Age band, IMD band, prior credits, final results). Implicit features are derived from participation and interaction frequencies indicating positive or negative engagement patterns. Latent variables (Demographic Information, Learning Accumulation, Assessment, Learning Behavior) are described via multiple independent variables; Learning Behavior includes all activity types per course.

Model (STEM_DP): A fusion of convolutional neural networks (CNN) and recurrent neural networks (RNN) with long short-term memory (LSTM) that analyzes temporal sequences and feature topologies.
- Step 1 (Explicit features): Mutual information, random forest importance, and recursive feature elimination score and select explicit features; a cross-entropy loss drives classification. The losses for explicit (L_E) and implicit (L_I) features combine as L_T = L_E + L_I.
- Step 2 (Implicit features): A CNN mines implicit feature patterns end-to-end from behavior logs.
- Step 3 (Topology and temporal modeling): Topological structures (feature correlations and learning paths) are constructed and fused with an improved LSTM-based RNN for temporal dynamics; gradients over hidden states guide weight updates to learn the importance of feature routes.
- Step 4 (Temporal sequence analysis): Daily sequences are tracked iteratively to determine optimal predictive windows and to define early-window boundaries for effective intervention.

Dropout labeling: Final result Withdrawn = 1 (dropout); Distinction, Pass, or Fail = 0 (non-dropout).

Training protocol: Mini-batch SGD with learning rate 0.001, batch size 256, and 20,000 iterations; 80%/20% train/test split.

Performance metrics: Precision, Recall, F1, and AUC. For temporal analysis, 30 consecutive days are sampled per period, repeated 10 times; metrics are averaged per period and course, then across periods.

Analytical focus: Identify strongly associated activities and construct topological paths by course and period; evaluate the effects of demographics, learning accumulation, and assessment on dropout from day 20 to the course's end; and determine an optimal early temporal boundary for prediction and intervention.
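The explicit-feature scoring in Step 1 combines three standard selection techniques. A minimal sketch of that combination on synthetic stand-in data follows; the feature matrix, label, and selection thresholds here are illustrative assumptions, not the authors' implementation or the actual dataset fields:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, mutual_info_classif
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for explicit features (e.g., age band, IMD band,
# studied credits, previous attempts) and a binary dropout label.
X = rng.normal(size=(500, 6))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# View 1: mutual information between each feature and the dropout label.
mi_scores = mutual_info_classif(X, y, random_state=0)

# View 2: random-forest impurity-based feature importances.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
rf_scores = rf.feature_importances_

# View 3: recursive feature elimination retains the top-k features.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3).fit(X, y)
selected = [i for i in range(X.shape[1]) if rfe.support_[i]]

print("MI scores:", np.round(mi_scores, 3))
print("RF importances:", np.round(rf_scores, 3))
print("RFE-selected feature indices:", selected)
```

In practice the three rankings would be cross-checked (e.g., keeping only features that all three views rate highly) before the selected explicit features feed the classification loss L_E.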
Key Findings
- Predictive performance: Across the three courses (DDD, EEE, FFF), average Precision, Recall, F1, and AUC all exceed 0.90, indicating high reliability and accuracy.
- Temporal window: Daily prediction over 30-day sequences shows Precision above 89% each day with an overall increasing trend; Recall, F1, and AUC trend slowly upward. Predictive performance stabilizes around the first 20 days, establishing day 20 as a practical left boundary for early dropout prediction and intervention.
- Class imbalance: The dataset is strongly imbalanced, with roughly 75% dropout in the analyzed scenarios; the chosen metrics accommodate this imbalance.
- Demographics (P1): Age and IMD band correlate significantly and negatively with dropout (younger learners and those from lower IMD bands are more likely to drop out). Gender and Disability show no significant effects; regional differences in dropout exist.
- Learning accumulation (P3): Fewer studied credits, lower highest education, and fewer previous attempts are significantly associated with higher dropout. Learners with weaker prior accumulation show group dropout trends, especially near day 20 as content difficulty and dependency increase.
- Assessment effects (P4): Current-course assessment grades (Distinction, Pass, Fail) do not directly cause dropout within the same course (completing an assessment implies non-dropout). However, prior failures correlate positively with dropout in subsequent courses, and taking a different assessment type than previously used is also positively associated with dropout in some courses (notably DDD and EEE), while FFF learners adapt better. After passing an assessment, 92.22% of learners enroll again and achieve a 95.47% pass rate; after failing, 65.43% re-enroll, and 78.19% of those later drop out.
- Activity topological paths (P2): In the first 20 days, participation nodes such as Forum are critical starting points across courses; later, activities such as Quiz, Content, Wiki, Resource, and Collaborate (course-dependent) sustain engagement. For FFF, DataPlus and, later, Questionnaire play important roles in propagating materials and promoting engagement. The constructed effective paths differ by course and period but share common enabling nodes; early routing is pivotal to preventing dropout.
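The day-20 left boundary above can be operationalized as a stabilization test on the daily metric curve: find the first day whose trailing window of values has flattened. The sketch below illustrates the idea on a synthetic precision curve; the window size, tolerance, and curve shape are illustrative assumptions, not values from the study:

```python
import numpy as np

def stabilization_day(daily_metric, window=5, tol=0.003):
    """Return the first (1-indexed) day at which the trailing `window`
    of metric values varies by less than `tol`, i.e. the curve has
    flattened enough to fix an early-prediction boundary."""
    for day in range(window, len(daily_metric) + 1):
        recent = daily_metric[day - window:day]
        if max(recent) - min(recent) < tol:
            return day  # day closing the first stable window
    return None  # never stabilized within the observed period

# Synthetic precision curve: above 0.89 every day, rising and then
# plateauing, mimicking the 30-day trend the study reports.
days = np.arange(1, 31)
precision = 0.89 + 0.05 * (1.0 - np.exp(-days / 8.0))

boundary = stabilization_day(list(precision))
print("suggested left boundary (day):", boundary)
```

On this synthetic curve the test fires near day 20, matching the paper's empirically observed boundary; on real data the boundary would be read off the measured daily metrics rather than a fitted curve.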
Discussion
The findings demonstrate that integrating explicit and implicit features with temporal modeling enables accurate early prediction of STEM MOOC dropout and reveals actionable levers. The STEM_DP model addresses the research question by: (1) identifying a critical early window (≈ first 20 days) when guidance and support are most impactful; (2) exposing how demographics (Age, IMD) and learning accumulation (credits, prior attempts, education level) shape risk; (3) showing how prior assessment outcomes and alignment of assessment types influence persistence in subsequent courses; and (4) uncovering course- and period-specific topological paths of interactive activities that can be leveraged to sustain engagement. These insights inform targeted interventions: early forum engagement, timely quizzes and content scaffolding, personalized routing through key activities, and adaptive support based on learners’ profiles and histories. Overall, the results substantiate the importance of temporal sequence analysis and feature fusion for effective dropout mitigation in STEM MOOCs.
Conclusion
The study proposes STEM_DP, a CNN–RNN (LSTM) fusion model leveraging explicit and implicit features and temporal sequences to predict and explain STEM MOOC dropout, identify key early windows (≈20 days), and construct effective activity topologies. Using large-scale Open University UK data across three STEM courses and multiple periods, the model achieves high predictive performance (Precision/Recall/F1/AUC > 0.90) and uncovers influential factors: demographics (Age, IMD), learning accumulation (credits, prior attempts, education level), prior assessment outcomes and assessment-type consistency, and critical activity routes. The work offers data-driven strategies for early warning and intervention, guiding resource recommendations, behavioral routing, and teacher support to enhance STEM learning effectiveness in MOOCs. Future work will further optimize behavioral factor definitions, enrich learning-path topologies, and improve robustness and accuracy of STEM_DP for broader, more reliable deployment.