Automated mood disorder symptoms monitoring from multivariate time-series sensory data: getting the full picture beyond a single number

Medicine and Health

F. Corponi, B. M. Li, et al.

This research, conducted by a team including Filippo Corponi and Eduard Vieta, presents a method for monitoring mood disorders that uses wearable sensor data to infer the individual items of the HDRS and YMRS scales, achieving moderate agreement with expert assessments.

Introduction
The study addresses the challenge of monitoring mood disorders (major depressive disorder and bipolar disorder) outside clinical settings by leveraging wearable sensor data and machine learning to predict detailed symptom profiles. Traditional assessments rely on clinician-administered scales (HDRS and YMRS), but limited specialist availability and the complexity of symptom presentations impede timely, individualized interventions. Prior work often reduces prediction to a single label (diagnosis or total score), which obscures clinically actionable heterogeneity. This work proposes predicting all individual items of HDRS (17) and YMRS (11) from wearable-derived physiological signals, aiming to preserve detailed symptom information that can guide tailored treatments. The purpose is to evaluate whether a deep learning framework can infer item-level symptom severity in agreement with clinician ratings and to explore methodological solutions for multi-task, ordinal, imbalanced, and subject-invariant learning, as well as robustness to distribution shift over time.
Literature Review
Existing personal sensing studies in mood disorders have largely focused on single-label outcomes: binary case-control detection, episode prediction, or total HDRS score regression. Examples include Côté-Allard et al. distinguishing manic from euthymic states in BD using Empatica E4 features; Ghandeharioun and Pedrelli predicting residualized or total HDRS from wristband plus smartphone features with traditional ML; Tazawa and Jacobson performing depression detection with hand-crafted features; Nguyen applying neural networks to minimally processed actigraphy to differentiate schizophrenia from MDD; and Lee et al. predicting impending mood episodes with wearable and smartphone data in a large, retrospectively evaluated cohort. These approaches, while informative, collapse complex symptomatology into a single outcome and lose item-level clinical granularity. The present work instead targets prediction of all HDRS and YMRS items, uses minimally processed multivariate wearable data with deep learning, and evaluates methodological aspects such as multi-task learning, ordinal losses, class-imbalance handling, subject-invariant representation learning, and out-of-distribution testing. The study also includes a larger, non-retrospectively assessed acute-phase sample than most prior work.
Methodology
Design and cohort: Prospective, observational, single-center, longitudinal study (TIMEBASE/INTREPIBD) embedded in routine clinical practice. Participants had DSM-5 diagnoses of MDD or BD. Acute-episode recruits could have up to four assessments: T0 (acute/euthymia), T1 (response onset: 50% reduction in total HDRS/YMRS), T2 (remission: total HDRS/YMRS ≤ 7), T3 (recovery: total HDRS/YMRS ≤ 7 for ≥ 8 weeks). Euthymic recruits were assessed once. A clinician collected demographics and HDRS/YMRS ratings at each assessment and provided the wearable.
Sample and data collection: N = 75 subjects, 149 recording sessions (>7,000 h). Mean age 44.66 years (SD 14.42); 56% female. Mood states included MDE-MDD, MDE-BD, ME, MX, EU-MDD, and EU-BD. Medications were as per routine care (antidepressants 37.83%, lithium 70.94%, anticonvulsants 34.45%, antipsychotics 12.16%). Median time since illness onset: 6 years (IQR 14).
Wearable and sensors: Empatica E4 worn on the non-dominant wrist for ~48 h per assessment. Modalities: tri-axial acceleration (ACC, 32 Hz), blood volume pulse (BVP, 64 Hz), electrodermal activity (EDA, 4 Hz), heart rate (HR, 1 Hz), and skin temperature (TEMP, 1 Hz). Inter-beat interval (IBI) was excluded because of extensive missingness and motion artifacts.
Preprocessing: Quality control removed physiologically implausible values (rules per Kleckner et al., plus HR limited to 25–250 bpm); a median of 8.05% of data was removed per session (range 1.95–32.10%). Sessions were segmented with non-overlapping sliding windows of length t seconds (a hyperparameter), and each segment inherited the session's 28 HDRS/YMRS item labels. To limit temporal variability and distribution shift, segments from the first 5 h post-interview (close-to-interview) were used for model development, split 70% train / 15% validation / 15% test. An out-of-distribution analysis evaluated performance on subsequent 30-min intervals across the remainder of the recording (far-from-interview).
Evaluation metrics: The primary metric was quadratic-weighted Cohen's kappa (QCK), which suits ordinal, imbalanced data and is clinically interpretable; macro-averaged F1 was also used. Item-level performance was examined against class imbalance ratio and Shannon entropy. Total-score RMSE was computed from summed item predictions for descriptive purposes.
Model: A deep neural network comprising (1) channel encoders that map each modality to a common latent dimensionality, (2) a shared representation module, and (3) 28 parallel item-specific predictors outputting rank probabilities (sketched below). To encourage subject-invariant features, an adversarial critic predicts subject identity from the shared representation; its loss is adversarially combined with the main objective. Ordinal structure was modeled with a weighted kappa loss. Class-imbalance strategies evaluated: focal loss; probability thresholding at inference by rank frequency; and re-sampling (random under/over-sampling by severity bins) with loss re-weighting.
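For concreteness, the following minimal PyTorch sketch mirrors that layout under several assumptions: per-modality 1D-convolutional encoders, a uniform five-rank output per item, a gradient-reversal critic with the reported 0.07 penalty reused as its scaling factor, and invented names such as ItemLevelNet and GradReverse. It is an illustration, not the authors' implementation.

```python
# Hypothetical sketch (not the authors' code): per-modality channel encoders,
# a shared representation, 28 item-specific ordinal heads, and an adversarial
# critic trained to recover subject identity from the shared features.
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; scales and flips gradients in the backward
    pass, pushing the encoder to discard subject-identifying information."""

    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None


class ItemLevelNet(nn.Module):
    def __init__(self, modality_channels, latent_dim=64, n_items=28,
                 n_ranks=5, n_subjects=75, critic_weight=0.07):
        super().__init__()
        self.critic_weight = critic_weight  # 0.07 mirrors the reported critic penalty
        # One encoder per modality, mapping (batch, channels, samples) to latent_dim.
        self.encoders = nn.ModuleDict({
            name: nn.Sequential(
                nn.Conv1d(ch, latent_dim, kernel_size=5, padding=2),
                nn.ReLU(),
                nn.AdaptiveAvgPool1d(1),
                nn.Flatten(),
            )
            for name, ch in modality_channels.items()
        })
        self.shared = nn.Sequential(
            nn.Linear(latent_dim * len(modality_channels), latent_dim),
            nn.ReLU(),
        )
        # 28 parallel heads, one per HDRS/YMRS item; five ranks per item is a
        # simplification, as the real items have differing rank ranges.
        self.item_heads = nn.ModuleList(
            [nn.Linear(latent_dim, n_ranks) for _ in range(n_items)]
        )
        # Adversarial critic predicting which subject produced the segment.
        self.critic = nn.Linear(latent_dim, n_subjects)

    def forward(self, segments):
        z = torch.cat([enc(segments[name]) for name, enc in self.encoders.items()], dim=1)
        z = self.shared(z)
        item_logits = [head(z) for head in self.item_heads]
        subject_logits = self.critic(GradReverse.apply(z, self.critic_weight))
        return item_logits, subject_logits


# A batch of eight 16-second segments at each modality's sampling rate.
batch = {
    "ACC": torch.randn(8, 3, 16 * 32),   # tri-axial, 32 Hz
    "BVP": torch.randn(8, 1, 16 * 64),   # 64 Hz
    "EDA": torch.randn(8, 1, 16 * 4),    # 4 Hz
    "HR": torch.randn(8, 1, 16),         # 1 Hz
    "TEMP": torch.randn(8, 1, 16),       # 1 Hz
}
model = ItemLevelNet({"ACC": 3, "BVP": 1, "EDA": 1, "HR": 1, "TEMP": 1})
item_logits, subject_logits = model(batch)
print(len(item_logits), item_logits[0].shape, subject_logits.shape)  # 28, (8, 5), (8, 75)
```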
Hyperparameter tuning: Hyperband Bayesian optimization over 300 trials; random forest feature importance was used post hoc to assess each hyperparameter's impact on validation QCK. The best segment length was 16 s; other lengths are reported as ΔQCK relative to it.
Baseline: A classical ML baseline used FLIRT-extracted hand-crafted features from the same segments and 28 random forest classifiers (with random oversampling), each tuned with a 300-iteration random search.
Error and channel analyses: Residual distributions were inspected, and a Gaussian graphical lasso estimated the partial-correlation network among item residuals (a sketch of this step follows). Subject-level macro-F1 was examined against age, sex, and total scores. A channel ablation retrained the best model with each modality removed to assess its contribution.
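As a hedged illustration of that residual-network step, the sketch below fits scikit-learn's GraphicalLassoCV to a placeholder residual matrix and converts the estimated precision matrix to partial correlations; the data and the estimator settings are assumptions, not the authors' exact pipeline.

```python
# Hedged sketch: fit a Gaussian graphical lasso to a (segments x 28 items)
# matrix of prediction residuals and read off partial correlations between
# item errors. The residuals below are random placeholders, not study data.
import numpy as np
from sklearn.covariance import GraphicalLassoCV

rng = np.random.default_rng(0)
residuals = rng.normal(size=(500, 28))  # predicted rank minus clinician rank, per item

glasso = GraphicalLassoCV().fit(residuals)
precision = glasso.precision_

# Partial correlation between items i and j: -P_ij / sqrt(P_ii * P_jj).
d = np.sqrt(np.diag(precision))
partial_corr = -precision / np.outer(d, d)
np.fill_diagonal(partial_corr, 1.0)
print(partial_corr.shape)  # (28, 28) network over HDRS/YMRS item residuals
```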
Key Findings
- Performance on close-to-interview samples: The best ANN achieved a mean QCK of 0.609 across HDRS and YMRS items, indicating moderate agreement with clinician ratings, and outperformed the random forest baseline (mean QCK = 0.214). Macro-averaged F1 was likewise 0.609. (A computation sketch follows this list.)
- Item-level results: Highest QCK for H12 Somatic symptoms, gastrointestinal (0.775); lowest for H10 Anxiety, psychic (0.492). Item-level QCK correlated weakly with class imbalance (R = 0.08) and moderately with item Shannon entropy (R = 0.42).
- Out-of-distribution over time: Mean QCK on far-from-interview segments dropped to 0.498 (5:01–6:00 h), 0.303 (6:01–6:30 h), and 0.182 (6:31–7:00 h), with further fluctuations and a minimum of 0.061 at around 15 h. The largest early drops were seen for H9 Agitation, H10 Anxiety (the paper's text indicates the somatic anxiety item), Y4 Sleep, and Y9 Disruptive-aggressive behavior; the most stable items were H1 Depressed mood, H2 Feelings of guilt, H17 Insight, and Y11 Insight.
- Residuals: Errors were centered around zero; when incorrect, predictions were usually off by one rank. Total-score RMSE from summed item predictions: HDRS 4.592; YMRS 5.854.
- Channel importance: ACC contributed most; removing ACC yielded notable QCK decreases, e.g., H11 Anxiety somatic ΔQCK ≈ −0.321, Y4 Sleep ≈ −0.371, Y9 Disruptive-aggressive behavior ≈ −0.281, H1 Depressed mood ≈ −0.276. Removing BVP had a relatively minor impact.
- Hyperparameters: The weighted kappa loss was the hyperparameter most predictive of validation QCK; a small critic penalty (0.07) was preferred; re-sampling with loss re-weighting was favored for class imbalance; the optimal segment length was t = 16 s.
- Subject-level variability: Mean subject macro-F1 ≈ 0.605 (SD 0.015); no significant associations with age, sex, or total scores.
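For concreteness, the sketch below shows how the mean QCK and the summed-item total-score RMSE cited above can be computed with scikit-learn and NumPy; the label arrays are invented placeholders, and the uniform five-rank assumption is an illustration rather than the actual item structure.

```python
# Hedged illustration of the two summary statistics reported above, computed
# on invented placeholder labels (ordinal ranks 0-4), not study data.
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(1)
n_segments, n_items, n_ranks = 200, 28, 5
clinician = rng.integers(0, n_ranks, size=(n_segments, n_items))
predicted = np.clip(clinician + rng.integers(-1, 2, size=clinician.shape), 0, n_ranks - 1)

# Quadratic-weighted Cohen's kappa per item, then averaged over the 28 items;
# quadratic weights penalize disagreements by the squared rank distance.
qck_per_item = [
    cohen_kappa_score(clinician[:, i], predicted[:, i], weights="quadratic")
    for i in range(n_items)
]
mean_qck = float(np.mean(qck_per_item))

# Descriptive total-score RMSE from summed item predictions (the paper sums
# HDRS and YMRS items separately; all 28 are summed here for brevity).
diff = clinician.sum(axis=1) - predicted.sum(axis=1)
rmse_total = float(np.sqrt(np.mean(diff ** 2)))
print(f"mean QCK = {mean_qck:.3f}, total-score RMSE = {rmse_total:.2f}")
```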
Discussion
Predicting the full HDRS and YMRS symptom profile from wearable data preserves clinically actionable information that single-label predictions cannot capture. The deep learning approach achieved moderate agreement with clinician ratings, demonstrating the feasibility of item-level inference from multivariate physiological signals. The positive correlation of item performance with entropy suggests that items whose severity ratings concentrate in a few ranks (low entropy) are harder to model than items with more evenly spread ratings. Out-of-distribution degradation across time highlights sensitivity to changing symptom expression and physiological context (e.g., activity patterns, circadian influences, medications), underscoring the need for temporal robustness and domain adaptation. Residual correlation structures aligned with known symptom domains and cross-scale correspondences, indicating that model errors follow clinically meaningful relationships. Channel ablation emphasized actigraphy's centrality in capturing mood-related motor and activity changes, while BVP contributed less in this setting. Overall, the findings support moving beyond single-number monitoring to item-level symptom tracking that could inform individualized interventions, while identifying generalization over time as a key challenge for clinical deployment.
Conclusion
This work introduces and validates item-level prediction of HDRS and YMRS from wearable sensor data, offering a richer, clinically actionable alternative to single-label outcomes. A deep learning pipeline with multi-task, ordinal-aware learning and adversarial subject-invariance achieved moderate agreement with clinicians (mean QCK 0.609) and surpassed a classical ML baseline. The study delineates methodological considerations for imbalanced, ordinal, multi-task targets and highlights the importance of actigraphy. Future directions include improving robustness to temporal and distributional shifts via domain adaptation and multi-timescale representations; assessing generalization to unseen individuals and across sites versus personalized fine-tuning; leveraging self-supervised learning to reduce labeling demands; and advancing explainability and uncertainty quantification for clinical trust.
Limitations
- Single rater: All HDRS/YMRS assessments were made by one clinician; multi-rater data would clarify inter-rater variability and the benchmarking of model agreement.
- Temporal labels: No repeated clinician ratings were collected within the same session, limiting assessment of target drift over time; the 5-hour close-to-interview cutoff, while pragmatic, may not be optimal.
- Medication effects: Naturalistic prescribing allows potential confounding by treatment changes during recordings.
- Generalizability: Evaluations used a single-center cohort; broader inter- and intra-individual generalization, and cross-clinic performance, remain untested. Although out-of-distribution shift over time was examined, other generalization dimensions require study.