Psychology
Automated mood disorder symptoms monitoring from multivariate time-series sensory data: getting the full picture beyond a single number
F. Corponi, B. M. Li, et al.
Discover new research by Filippo Corponi and his team, who present a method for monitoring mood disorders from wearable sensor data with machine learning. Rather than collapsing assessment into a single number, their approach predicts each individual item on the HDRS and YMRS scales and substantially outperforms a classical feature-based machine-learning baseline. Dive into the future of mental health monitoring!
~3 min • Beginner • English
Introduction
The study addresses the challenge of monitoring mood disorder (MD) symptoms in real-world settings using wearable sensor data and machine learning. Clinician-administered scales such as the HDRS-17 and YMRS are the standard for assessing depressive and manic symptoms, but access to specialized care is limited, and existing ML approaches often reduce assessment to a single label (a diagnosis or total score), losing actionable clinical detail. The authors propose predicting each individual item of the HDRS and YMRS from multivariate wearable signals, aligning with clinical practice, where treatment is tailored to symptom profiles rather than to a single summary score. They motivate the need by highlighting the heterogeneity of MD symptomatology and the potential of personal sensing to provide near-continuous, objective physiological monitoring outside clinical settings. The work also targets methodological issues, namely multi-task learning across items, modeling of ordinal outcomes, handling of class imbalance, and learning of subject-invariant representations, and it emphasizes evaluating out-of-distribution generalization over time within recordings.
Literature Review
Prior personal sensing studies for MDs typically collapsed outcomes to binary or single-score predictions. Côté-Allard et al. performed binary classification of bipolar manic vs euthymic states using wristband features; Pedrelli and Ghandeharioun predicted HDRS residualized total score using engineered features and traditional ML; Tazawa and Jacobson used gradient boosting/actigraphy features for case-control detection and severity; Nguyen et al. applied neural networks to minimally processed actigraphy for differentiating schizophrenia and MDD; Lee et al. predicted mood episode recurrence using random forests on wearable/smartphone features. These works demonstrated feasibility but did not predict full symptom profiles. The present study distinguishes itself by predicting all HDRS (17 items) and YMRS (11 items) items, using a larger prospective cohort with clinician-rated acute-phase data and by systematically addressing class imbalance, ordinal targets, multi-task learning, and subject-invariant representation learning.
Methodology
Design and cohort: Prospective, exploratory, observational, single-center, longitudinal study (TIMEBASE/INTREPIBD) embedded in routine practice. Participants had DSM-5 diagnoses of Major Depressive Disorder (MDD) or Bipolar Disorder (BD). Acute-episode participants were assessed up to four times: T0 acute, T1 response onset (≥50% reduction in HDRS/YMRS), T2 remission (HDRS/YMRS ≤7), T3 recovery (HDRS/YMRS ≤7 for ≥8 weeks). Euthymic participants were assessed once. At the start of each assessment, a clinician collected demographics and HDRS/YMRS scores and provided an Empatica E4 wristband to wear on the non-dominant wrist for ~48 h. The dataset comprises 75 subjects, 149 sessions, and >7000 h of recordings. Medication use was common, as per clinical practice. Median time since illness onset was 6 years (IQR 14).
Sensors: E4 modalities and sampling rates were ACC 32 Hz (tri-axial), BVP 64 Hz, EDA 4 Hz, HR 1 Hz, IBI 1 Hz (not used due to missingness), and temperature 1 Hz.
Pre-processing: Applied automated QC rules (Kleckner et al.) plus an HR plausibility check (25–250 bpm). A median of 8.05% of data per session was discarded (range 1.95–32.10%). Sessions were segmented into non-overlapping windows of length τ (a hyperparameter; best τ=16 s). Each segment inherits the 28 item scores from the corresponding clinician assessment. To reduce temporal label drift, primary analyses used only the first 5 hours ("close-to-interview" samples), split 70/15/15 into train/validation/test. Distribution shift was analyzed by testing the trained model on successive 30-minute intervals beyond the first 5 hours ("far-from-interview" samples).
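A minimal sketch of the segmentation step, assuming each modality is handled at its own sampling rate after QC; the function below is illustrative, not the authors' code:

```python
import numpy as np

def segment_session(signal: np.ndarray, fs: int, tau_s: int = 16) -> np.ndarray:
    """Split one modality's (time, channels) array into non-overlapping tau-second windows."""
    win = tau_s * fs                      # samples per window at sampling rate fs
    n = signal.shape[0] // win            # number of complete windows; the tail is dropped
    return signal[: n * win].reshape(n, win, signal.shape[1])

# Every window inherits the 28 item scores of the session's clinician assessment,
# e.g. y = np.tile(item_scores, (n, 1)) for an (n, 28) label matrix.
```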
Evaluation metrics: Items are ordinal with imbalanced distributions. Primary metric was quadratic-weighted Cohen’s κ (QCK), appropriate for ordinal, imbalanced data and clinically interpretable. Item-level QCK correlations with item Shannon entropy (H) and imbalance ratio ρ were examined. Macro-averaged F1 also reported (supplement).
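Quadratic-weighted Cohen's κ is available off the shelf; a minimal per-item scoring sketch with scikit-learn (the function name is illustrative):

```python
from sklearn.metrics import cohen_kappa_score

def item_qck(y_true, y_pred):
    # weights="quadratic" penalizes disagreements by squared rank distance,
    # which suits ordinal, imbalanced item scores.
    return cohen_kappa_score(y_true, y_pred, weights="quadratic")

# Average over the 28 items for the headline metric:
# avg_qck = sum(item_qck(Y[:, i], Y_hat[:, i]) for i in range(28)) / 28
```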
Model architecture: Supervised deep learning pipeline mapping segments to 28 item scores via multi-task learning. Classifier (CF) comprises (1) channel Encoder (EN) projecting modalities to a common dimensionality, (2) Representation Module (RM) extracting a shared latent representation, and (3) 28 parallel Item Predictors (IP) outputting per-item rank probabilities. A Patient Critic (CR) is trained adversarially on the RM representation to predict subject identity, encouraging subject-invariant features (penalty weight λ; best λ=0.07). Loss incorporated ordinal-aware objectives; best used quadratic κ loss.
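A schematic PyTorch sketch of the CF/CR layout described above. Layer types and sizes, a uniform number of ranks per item, equal-length channel inputs, and the gradient-reversal formulation of the adversarial penalty are all simplifying assumptions, not the authors' exact architecture:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; scales gradients by -lambda on the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None

class Classifier(nn.Module):
    def __init__(self, n_channels=5, d=64, n_items=28, n_ranks=5, n_subjects=75, lam=0.07):
        super().__init__()
        self.lam = lam
        # EN: one 1D-conv encoder per channel, projecting to a common dimensionality d
        self.encoders = nn.ModuleList(
            [nn.Conv1d(1, d, kernel_size=5, padding=2) for _ in range(n_channels)])
        # RM: shared representation module over the concatenated channel features
        self.rm = nn.GRU(d * n_channels, d, batch_first=True)
        # IP: 28 parallel item heads, each emitting per-rank logits
        self.items = nn.ModuleList([nn.Linear(d, n_ranks) for _ in range(n_items)])
        # CR: patient critic, trained adversarially to predict subject identity
        self.critic = nn.Linear(d, n_subjects)

    def forward(self, xs):                       # xs: list of (batch, 1, time) tensors
        feats = torch.cat([enc(x) for enc, x in zip(self.encoders, xs)], dim=1)
        _, h = self.rm(feats.transpose(1, 2))    # shared latent representation
        z = h[-1]                                # (batch, d)
        item_logits = [head(z) for head in self.items]
        subj_logits = self.critic(GradReverse.apply(z, self.lam))
        return item_logits, subj_logits
```

In training, the item losses (e.g. the quadratic κ loss) and the critic's cross-entropy are combined; the reversed gradient pushes the shared representation to carry less subject-identity information.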
Imbalance handling: Explored (i) focal loss variants, (ii) probability thresholding at inference by dividing per-rank probabilities by training rank frequency and renormalizing, and (iii) class re-sampling (random under/over-sampling by HDRS/YMRS severity bins) with loss re-weighting proportional to sampling ratios. The preferred strategy was re-sampling with loss re-weighting.
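A sketch of option (ii), inference-time thresholding, under the assumption that per-rank probabilities come from a softmax head; names are illustrative:

```python
import numpy as np

def rebalance_probs(probs: np.ndarray, train_rank_freq: np.ndarray) -> np.ndarray:
    """probs: (n_samples, n_ranks) predicted probabilities; train_rank_freq: (n_ranks,)."""
    adj = probs / np.clip(train_rank_freq, 1e-8, None)   # down-weight frequent ranks
    return adj / adj.sum(axis=1, keepdims=True)          # renormalize rows to sum to 1

# Predicted rank per sample after rebalancing:
# y_hat = rebalance_probs(probs, freq).argmax(axis=1)
```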
Hyperparameter tuning: Search via Bayesian optimization with Hyperband early stopping (300 iterations). A random forest regressor was then fit on the logged hyperparameter configurations to predict validation QCK; its feature importances indicated that loss type was the most predictive hyperparameter, while τ had relatively low importance (best τ=16 s; other values of τ yielded worse QCK). Training details are in the supplementary methods.
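A sketch of that post-hoc importance analysis, assuming the 300 runs were logged to a table with one column per hyperparameter plus a "val_qck" column (names are hypothetical):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def hyperparam_importance(trials: pd.DataFrame) -> pd.Series:
    """trials: one row per tuning run; returns importances for predicting validation QCK."""
    X = pd.get_dummies(trials.drop(columns="val_qck"))   # one-hot categoricals, e.g. loss type
    rf = RandomForestRegressor(n_estimators=500, random_state=0)
    rf.fit(X, trials["val_qck"])
    return pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
```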
Baselines: Classical ML baseline built by extracting engineered features from the same segments using FLIRT, then training 28 random forests (one per item) with random oversampling. Hyperparameters tuned by random search (300 iterations per classifier).
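A sketch of the per-item baseline, assuming the engineered features (e.g. from FLIRT) are already assembled into a feature matrix; hyperparameters and names are illustrative:

```python
import numpy as np
from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestClassifier

def fit_item_baselines(X_feats: np.ndarray, Y_items: np.ndarray) -> list:
    """Train one random forest per item (Y_items: (n_samples, 28)) with oversampling."""
    models = []
    for i in range(Y_items.shape[1]):
        Xi, yi = RandomOverSampler(random_state=0).fit_resample(X_feats, Y_items[:, i])
        models.append(RandomForestClassifier(n_estimators=300, random_state=0).fit(Xi, yi))
    return models
```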
Post-hoc analyses: Residual distribution analysis on close samples; summation of item predictions to estimate total scores with RMSE; partial correlation network (Gaussian graphical lasso) over item residuals to probe error structure; subject-level macro-F1 distribution and associations with age, sex, total scores (Pearson R, t-tests with Bonferroni).
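A sketch of the residual network step, using scikit-learn's cross-validated graphical lasso and the standard precision-to-partial-correlation transform (the authors' exact estimator settings are not assumed):

```python
import numpy as np
from sklearn.covariance import GraphicalLassoCV

def residual_partial_corr(residuals: np.ndarray) -> np.ndarray:
    """residuals: (n_samples, 28) item-wise prediction errors on close samples."""
    prec = GraphicalLassoCV().fit(residuals).precision_   # sparse precision matrix
    d = np.sqrt(np.diag(prec))
    pcorr = -prec / np.outer(d, d)                        # partial correlations
    np.fill_diagonal(pcorr, 1.0)
    return pcorr
```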
Channel ablation: Model-agnostic modality importance: retrained best ANN excluding one channel at a time to quantify ΔQCK per item relative to the all-channel baseline.
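A sketch of that ablation loop; `train_and_eval` is a hypothetical callable standing in for the full training pipeline (channel list in, per-item QCK out):

```python
from typing import Callable, Dict, List

CHANNELS = ["ACC", "BVP", "EDA", "HR", "TEMP"]   # IBI excluded, as in the study

def ablate_channels(train_and_eval: Callable[[List[str]], Dict[str, float]]):
    baseline = train_and_eval(CHANNELS)           # all-channel reference QCK per item
    deltas = {}
    for dropped in CHANNELS:
        kept = [c for c in CHANNELS if c != dropped]
        scores = train_and_eval(kept)
        deltas[dropped] = {item: scores[item] - baseline[item] for item in scores}
    return deltas                                 # negative ΔQCK = the channel mattered
```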
Key Findings
- Performance (close-to-interview): Best ANN achieved average QCK=0.609 across all 28 HDRS/YMRS items on held-out data, indicating moderate to very good agreement. Baseline random forest achieved QCK=0.214.
- Item-wise results: Highest QCK for H12 somatic symptoms gastrointestinal (0.775) and Y9 disruptive–aggressive behavior (0.755); lowest for H10 anxiety psychic (0.492). Item-level QCK correlated weakly with class imbalance ρ (R=0.08) and moderately with item entropy H (R=0.42).
- Temporal shift: On far-from-interview samples, average QCK declined: first 30-min interval ~0.498–0.499 (HDRS/YMRS), second ~0.301–0.307, third ~0.182. Subsequent intervals fluctuated and did not recover (e.g., as low as 0.061 at ~15 h). Volatile items (e.g., H9 agitation, H10 anxiety somatic, Y4 sleep, Y9 disruptive-aggressive) showed larger drops; stable items (H1 depressed mood, H17 insight, Y11 insight) retained performance better.
- Total score estimation: Summed item predictions yielded RMSE=4.592 (HDRS total) and 5.854 (YMRS total).
- Error characteristics: Residuals centered near zero and typically within ±1 rank. Partial correlation network of residuals showed stronger connections within common symptom domains and between analogous items across scales (e.g., insight), average node predictability ~48.4%.
- Modality importance: Removing ACC caused the largest performance deterioration across many items (e.g., ΔQCK ≈ −0.276 for H1, −0.321 for H11, −0.371 for Y4, −0.281 for Y9), underscoring the value of actigraphy. BVP removal had relatively modest impact.
- Model/optimization insights: Best used quadratic κ loss; adversarial critic with small penalty (λ=0.07) encouraged subject-invariant features; class re-sampling with loss re-weighting preferred for imbalance; best segment length τ=16 s (longer/shorter windows reduced QCK).
Discussion
Predicting the full set of HDRS and YMRS items from wearable sensor time-series captures clinically actionable heterogeneity that single-label predictions miss. The ANN achieved clinically meaningful agreement with expert ratings on close-in-time segments and clearly outperformed a classical feature-based baseline, supporting the value of learned representations and multi-task ordinal modeling. Item-level performance related to distributional entropy and aligned with clinical intuition about symptom variability. The performance degradation on segments farther from the assessment highlights domain shift over time (due to symptom evolution, circadian/activity changes, and medication effects) as a major hurdle for real-world deployment; volatile symptoms were most affected, while stable traits were less so. Residual correlation patterns mirrored known symptom-domain structures, suggesting model errors are not arbitrary but track scale interdependencies. Channel ablation emphasized actigraphy's central role in detecting mood-related psychomotor and behavioral changes, while blood volume pulse contributed less in this setting. Overall, the findings demonstrate the feasibility and clinical relevance of item-level symptom inference while underscoring the importance of handling temporal distribution shift and subject invariance for robust generalization.
Conclusion
This work introduces item-level inference of HDRS and YMRS from wearable multivariate time-series, providing a detailed symptom profile that preserves clinically actionable information beyond single-number predictions. A deep learning pipeline with adversarial subject-invariant representation learning achieved moderate agreement with clinician ratings (QCK=0.609) and outperformed a classical ML baseline. The approach can also recover total scores by summation. However, performance deteriorated with increasing temporal distance from the clinical assessment, indicating a key challenge for translation. Future research should prioritize domain adaptation over time, modeling multiple temporal scales (capturing slow and fast dynamics), improving generalization to unseen patients or developing individualized models, leveraging self-supervised learning to reduce labeling bottlenecks, and enhancing explainability and uncertainty quantification for clinical trust and actionability.
Limitations
- Single rater: All HDRS/YMRS scores were from one clinician; multi-rater data would better contextualize model agreement relative to inter-rater reliability.
- Within-session labels: No repeated clinician ratings within a recording session, limiting assessment of true target drift vs. model generalization issues; the 5-hour cutoff, while informed, is a design choice that could be varied.
- Medications and naturalistic setting: Concomitant treatments may confound physiological signals and symptom expression.
- Generalizability: Evaluated on a single-center cohort; although out-of-distribution testing over time was performed, broader generalization (cross-site, inter-/intra-individual robustness) remains to be established.