Introduction
Mood disorders (MDs), including major depressive disorder (MDD) and bipolar disorder (BD), represent a significant global health burden. The limited availability of specialized care creates a bottleneck, hindering timely interventions. MDs manifest with changes in mood, sleep, and motor activity, which can be captured through ecological physiological recordings using wearable technology. Near-continuous, passive data collection and machine learning (ML) analysis offer a potential solution for MD monitoring outside clinical settings. Existing work typically focuses on predicting a single label (disease state or total psychometric scale score), oversimplifying the clinical picture. This study addresses this limitation by proposing a novel task: predicting all individual items in the Hamilton Depression Rating Scale (HDRS) and Young Mania Rating Scale (YMRS), the two most widely used standardized scales for assessing MD symptoms. This approach aligns better with clinical practice, where treatment decisions consider specific symptom profiles, rather than a single summary score. The study highlights the importance of this approach by showing that patients with the same overall severity score can exhibit different symptom profiles, requiring tailored treatments. Previous studies utilizing personal sensing for MDs have focused on simpler tasks like binary classification (mania vs. euthymia) or predicting total HDRS/YMRS scores. This paper expands on these previous efforts by tackling a more complex and clinically relevant task with a significantly larger dataset of patients in acute phases of illness.
Literature Review
Several previous studies have explored the use of wearable sensors and machine learning for mood disorder detection and monitoring. Côté-Allard et al. (2022) focused on binary classification, distinguishing between manic and euthymic states in bipolar disorder patients using Empatica E4 data. Pedrelli et al. (2020) and Ghandeharioun et al. (2017) used wearable and smartphone data to predict HDRS total scores using traditional machine learning models. Tazawa et al. (2020) and Jacobson et al. (2019) used gradient boosting to detect depression and to predict case-control status in MDD. Nguyen et al. (2022) applied artificial neural networks (ANNs) to actigraphy data for case-control detection in MDD and schizophrenia. Lee et al. (2023) performed mood episode prediction using random forest and data from wearables and smartphones. However, all these studies simplified the clinical picture by predicting a single label, ignoring the rich information contained in individual symptom scores. This study differs by predicting individual symptom scores, providing a more nuanced and clinically useful approach.
Methodology
This study used data from the TIMEBASE/INTREPIBD cohort, a prospective, observational study of patients with MDD or BD. Participants wore an Empatica E4 wristband for approximately 48 hours during each assessment. The E4 collected data on 3D acceleration (ACC), blood volume pulse (BVP), electrodermal activity (EDA), heart rate (HR), and skin temperature (TEMP). Data were pre-processed to remove physiologically implausible values and segmented using a sliding window. The study used a deep learning pipeline consisting of a classifier (CF) and a critic (CR). The CF comprised a channel encoder, a representation module, and 28 item predictors (one for each HDRS and YMRS item). The CR penalized the CF for learning subject-specific features, encouraging subject-invariant representations. The model was trained using various strategies to address multi-task learning, ordinal data, class imbalance, and subject-invariant feature extraction. Different imbalance learning approaches were tested: focal loss, probability thresholding, and resampling with loss re-weighting. Hyperparameter tuning was performed using Hyperband Bayesian optimization. A random forest model with hand-crafted features was used as a baseline. The model was evaluated using quadratic Cohen's κ (QCK) and macro-averaged F1 score. The study also investigated the model's performance on out-of-distribution samples (far-from-interview samples) to assess its generalizability over time, and ran an ablation analysis to assess the contribution of each sensor modality.
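The pre-processing step described above (plausibility filtering followed by sliding-window segmentation) can be illustrated with a minimal sketch. This is not the authors' pipeline: the window length, step size, and plausible range are hypothetical values chosen for illustration, and discarding whole windows that contain an implausible sample is only one possible way to handle such values.

```python
from typing import List, Sequence


def sliding_windows(samples: Sequence[float], window: int, step: int) -> List[List[float]]:
    """Return every full-length window, advancing `step` samples each time."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    last_start = len(samples) - window
    return [list(samples[i:i + window]) for i in range(0, last_start + 1, step)]


def is_plausible(segment: Sequence[float], low: float, high: float) -> bool:
    """True when every sample in the segment lies inside [low, high]."""
    return all(low <= x <= high for x in segment)


def segment_signal(samples: Sequence[float], window: int, step: int,
                   low: float, high: float) -> List[List[float]]:
    """Segment a signal and keep only windows free of implausible values.

    Assumption: a window is dropped entirely if any sample falls outside
    the plausible range. The source does not specify this rule.
    """
    return [w for w in sliding_windows(samples, window, step)
            if is_plausible(w, low, high)]


# Hypothetical heart-rate trace (beats per minute) with one artefact (300).
hr = [72, 75, 300, 74, 73, 71, 70, 69]

# Hypothetical settings: 3-sample windows, step of 1, plausible range 30-220 bpm.
clean_segments = segment_signal(hr, window=3, step=1, low=30, high=220)
print(len(clean_segments), "windows kept")
```

In practice each E4 channel has its own sampling rate, so the window length in samples would differ per channel even when the window duration in seconds is shared.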
Key Findings
The best-performing ANN model achieved an average QCK of 0.609 on close-to-interview samples, significantly outperforming the random forest baseline (QCK = 0.214). This indicates moderate agreement between the model's predictions and clinician assessments. Individual item performance varied, with QCK ranging from 0.492 to 0.775. Item-level QCK showed a fair positive correlation with item Shannon entropy (R = 0.42): items whose scores were spread more evenly across ranks (higher entropy) tended to be predicted better, whereas items dominated by a single rank were harder to predict. The model's performance degraded significantly when tested on far-from-interview samples, highlighting the challenge of generalizing to out-of-distribution data over time. Residual analysis showed that the model errors were consistent with the correlation structure of the HDRS and YMRS scales. The ablation study showed that acceleration (ACC) was the most important sensor modality, particularly for items related to anxiety and motor behavior. The model's prediction of total HDRS and YMRS scores resulted in RMSE values of 4.592 and 5.854, respectively.
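For readers unfamiliar with the two quantities reported above, the following minimal sketch computes quadratic-weighted Cohen's κ and the Shannon entropy of an item's score distribution. It uses the standard textbook definitions and toy ratings invented for illustration. It is not the authors' evaluation code, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence


def quadratic_kappa(y_true: Sequence[int], y_pred: Sequence[int], n_classes: int) -> float:
    """Quadratic-weighted Cohen's kappa for ordinal labels in 0..n_classes-1."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    # Observed confusion matrix: rows are true ranks, columns are predicted ranks.
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)]
    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(n_classes):
        for j in range(n_classes):
            # The usual 1/(K-1)^2 normalisation cancels in the ratio below.
            w = (i - j) ** 2
            num += w * observed[i][j]
            den += w * row[i] * col[j] / n
    if den == 0:
        return math.nan  # undefined when both raters use a single rank
    return 1.0 - num / den


def shannon_entropy(labels: Sequence[int]) -> float:
    """Shannon entropy (in bits) of the empirical label distribution."""
    n = len(labels)
    if n == 0:
        raise ValueError("labels must be non-empty")
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


# Toy ratings for one item scored on ranks 0..3 (invented for illustration).
clinician = [0, 1, 2, 3, 2, 1]
model = [0, 1, 2, 3, 1, 1]
print("QCK:", quadratic_kappa(clinician, model, n_classes=4))
print("Entropy (bits):", shannon_entropy(clinician))
```

Because the weights grow with the squared distance between ranks, a prediction that is off by two ranks is penalised four times as much as one that is off by a single rank, which is why this metric suits ordinal scale items.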
Discussion
This study demonstrates the feasibility of using wearable sensor data and deep learning to predict individual symptom scores for mood disorders. This approach provides a more nuanced and clinically relevant assessment than predicting single summary scores. The moderate agreement with clinician ratings achieved on close-to-interview samples suggests the potential for this approach to augment clinical decision-making. However, the significant performance drop on far-from-interview samples indicates a need for further research on improving the model's robustness to changes in context and distribution shift. The correlation between item-level performance and Shannon entropy suggests that the distribution of scores within an item affects how well that item can be predicted, which future work should address. The importance of acceleration data highlights the role of motor activity in distinguishing between different mood states. The findings of this study support the development of personalized, continuous MD monitoring systems using wearable technology.
Conclusion
This research introduced a novel task in personal sensing for MD monitoring: inferring individual HDRS and YMRS item scores. A deep learning pipeline achieved moderate agreement with an expert clinician (average QCK of 0.609 on close-to-interview samples). The study highlighted the importance of addressing challenges like distribution shift and class imbalance. Future work should focus on improving model generalizability over time, exploring personalized models, investigating self-supervised learning, and improving model explainability and uncertainty quantification.
Limitations
The study has several limitations. First, all patients were scored by the same clinician, limiting the assessment of model performance against inter-rater reliability. Second, the lack of repeated HDRS/YMRS scores within sessions prevented an accurate estimation of target variable shifts. Third, the effects of medication could not be fully accounted for. Fourth, although out-of-distribution samples were evaluated, further generalization testing (inter-individual, intra-individual, cross-clinic) is needed.