Introduction
Mood disorders (MDs), such as Major Depressive Disorder (MDD) and Bipolar Disorder (BD), are a leading cause of disability worldwide and impose substantial economic burdens. The limited availability of specialized care creates a significant bottleneck that hinders timely intervention. MDs manifest through changes in mood, sleep, and motor activity, all of which can potentially be captured by wearable sensors. Near-continuous physiological data collection, analyzed with machine learning (ML), could transform MD monitoring by providing objective data outside traditional clinical settings. Existing studies often predict a single label (disease state or total psychometric score), which oversimplifies the complex clinical reality. This study addresses that limitation by predicting individual items of the Hamilton Depression Rating Scale (HDRS) and Young Mania Rating Scale (YMRS), two widely used standardized scales for assessing MD symptoms. Item-level predictions give a more nuanced picture of a patient's symptom profile than overall scores alone and can support tailored treatment strategies. The study's large dataset (75 subjects, 149 recording sessions), its focus on acute episodes, and its use of expert clinician assessments and research-grade wearables make it a significant contribution to the field.
Literature Review
Previous research in personal sensing for MDs has primarily focused on predicting a single label, such as disease state or total psychometric scale scores. Studies have used various sensor data and ML models but have not attempted to predict individual HDRS and YMRS items. For example, Côté-Allard et al. (2022) focused on binary classification of manic and euthymic states in BD patients, while Pedrelli et al. (2020) and Tazawa et al. (2020) predicted HDRS total scores using traditional ML models and handcrafted features. These earlier efforts lack the granularity and clinical relevance of predicting individual symptom items, which is the focus of this study.
Methodology
This prospective, observational study used data from the TIMEBASE/INTREPIBD cohort of patients with MDD or BD. Participants wore an Empatica E4 wristband for approximately 48 hours during assessment periods. Clinicians collected demographic data and HDRS/YMRS scores at each assessment. The E4 recorded several physiological signals (3D acceleration, blood volume pulse, electrodermal activity, heart rate, skin temperature). Data pre-processing involved quality control to remove implausible values and segmentation into non-overlapping windows. The primary analysis used data from the first five hours of each recording (close-to-interview samples), with separate training, validation, and test sets. A deep learning pipeline was developed comprising a classifier (CF) and a critic (CR). The CF predicted HDRS/YMRS item scores using a multi-task learning approach that accounted for the ordinal nature of the target variables. The CR encouraged subject-invariant representations to improve generalization. Several strategies were employed to handle class imbalance (focal loss, probability thresholding, resampling, and loss re-weighting). Hyperparameter tuning was performed using Hyperband Bayesian optimization. A baseline model using a random forest classifier with handcrafted features was also developed for comparison. Model performance was evaluated using quadratic Cohen's κ (QCK) and macro-average F1 score, which suit the ordinal and imbalanced nature of the data. To assess generalizability, the model was also tested on data from 30-minute intervals after the first five hours (far-from-interview samples). Post-hoc analyses examined residuals, correlation structures between item residuals, and individual subject performance. An ablation study investigated the contribution of each sensory modality.
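To make the evaluation metric concrete, the following is a minimal sketch of quadratic Cohen's κ (QCK) for one ordinal item. It is not the study's implementation: the function name `quadratic_cohen_kappa`, the encoding of item scores as integers from 0 to `n_classes - 1`, and the toy ratings are assumptions made for illustration only.

```python
def quadratic_cohen_kappa(y_true, y_pred, n_classes):
    """Quadratic-weighted Cohen's kappa for integer labels in 0..n_classes-1.

    Disagreements are penalized by the squared distance between the two
    ratings, so predicting 3 when the clinician scored 0 costs more than
    predicting 1.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    if n_classes < 2:
        raise ValueError("n_classes must be at least 2")

    n = len(y_true)

    # Observed confusion matrix: rows are reference scores, columns predictions.
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1

    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(n_classes))
                  for j in range(n_classes)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            # Expected count under chance agreement (independent marginals).
            expected = row_totals[i] * col_totals[j] / n
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0.0:
        # Both raters used a single identical category: kappa is undefined.
        return float("nan")
    return 1.0 - weighted_observed / weighted_expected


if __name__ == "__main__":
    # Hypothetical scores for one item rated on a 0-4 scale.
    clinician = [0, 1, 2, 3, 4, 2]
    model_exact = [0, 1, 2, 3, 4, 2]
    model_near = [0, 2, 2, 3, 3, 1]
    print(quadratic_cohen_kappa(clinician, model_exact, n_classes=5))
    print(quadratic_cohen_kappa(clinician, model_near, n_classes=5))
```

Because the weights grow with the squared distance between ratings, near misses on an ordinal item are penalized far less than large errors, which is why this kind of metric is a natural fit for ordinal targets. A value of 1 means perfect agreement, 0 means chance-level agreement, and negative values mean agreement worse than chance.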
Key Findings
The best-performing artificial neural network (ANN) model achieved an average QCK of 0.609 on close-to-interview samples, significantly outperforming the random forest baseline (QCK = 0.214). Individual item performance varied, with QCK ranging from 0.492 to 0.775. Item-level QCK showed a fair correlation with item Shannon entropy (H), suggesting that items with more uncertainty are harder to predict. HDRS and YMRS total scores were predicted with RMSEs of 4.592 and 5.854, respectively. When tested on far-from-interview samples, performance dropped significantly, with average QCK declining to 0.182 in the third 30-minute interval. Items mapping more volatile symptoms showed a greater decline in performance than those mapping stable symptoms. Residual analysis revealed a mostly symmetric distribution around zero, indicating no systematic over- or under-prediction. The network of item residuals reflected the correlation structure of the HDRS and YMRS scales. The ablation study showed that acceleration data from the wristband was the most important modality for predicting symptoms, particularly those related to motor activity and anxiety.
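The item-level Shannon entropy (H) mentioned above can be illustrated with a short sketch. The function name `shannon_entropy`, the choice of bits (base 2) as the unit, and the toy score lists are assumptions for illustration; the summary does not state how the study computed H.

```python
import math
from collections import Counter


def shannon_entropy(scores):
    """Shannon entropy, in bits, of the empirical distribution of item scores.

    Entropy is 0 when every assessment has the same score and grows as the
    scores spread more evenly across the item's possible values.
    """
    if len(scores) == 0:
        raise ValueError("scores must be non-empty")
    n = len(scores)
    counts = Counter(scores)
    # Sum p * log2(1/p) over observed score values; each term is non-negative.
    return sum((c / n) * math.log2(n / c) for c in counts.values())


if __name__ == "__main__":
    # Hypothetical score distributions for two items across eight assessments.
    mostly_zero_item = [0, 0, 0, 0, 0, 0, 0, 1]
    evenly_spread_item = [0, 0, 1, 1, 2, 2, 3, 3]
    print(shannon_entropy(mostly_zero_item))
    print(shannon_entropy(evenly_spread_item))
```

In this toy example the evenly spread item has the higher entropy. Pairing each item's entropy with its QCK and computing a correlation across items would be one way to examine the relationship described in the findings.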
Discussion
This study demonstrates the feasibility of predicting individual HDRS/YMRS items from wearable sensor data using deep learning. This approach offers a more nuanced and clinically actionable assessment of MDs than predicting overall scores alone. The agreement with expert clinician assessments on close-to-interview samples (average QCK of 0.609) supports the model's accuracy. The significant drop in performance on far-from-interview samples highlights the challenge of generalizing to out-of-distribution data and underscores the need for future research on temporal dynamics and domain adaptation. The identification of acceleration as the most important sensor modality provides valuable insights into the underlying physiological mechanisms of MD symptoms.
Conclusion
This work introduces a novel approach to MD monitoring, moving beyond single-label predictions to capture the complexity of symptom profiles. The deep learning pipeline shows substantial agreement with clinician assessments, offering a potential tool for real-world clinical applications. Future work should address limitations related to temporal generalization, inter-individual variability, and data efficiency by exploring techniques such as domain adaptation and self-supervised learning. Further research into model explainability and uncertainty quantification is also crucial.
Limitations
The study has several limitations. All patients were assessed by a single clinician, limiting the assessment of inter-rater reliability. The lack of multiple HDRS/YMRS scores within the same session prevents a complete evaluation of within-session symptom variability. The influence of medication and other confounding factors cannot be fully ruled out. Generalizability across different clinics and populations needs further investigation.