Personalized prediction of negative affect in individuals with serious mental illness followed using long-term multimodal mobile phenotyping

C. A. Webb, B. Ren, et al.

Smartphones and wearables can detect real-time spikes in negative emotions by passively tracking behavior. In a year-long study of 68 adults with mood or psychotic disorders, a personalized ensemble machine learning model predicted states like irritability and loneliness (AUCs 0.72–0.79), with GPS location features most predictive and substantial individual variability observed. These results point to smartphone-triggered, timely emotional interventions. Research conducted by Christian A. Webb, Boyu Ren, Habiballah Rahimi-Eichi, Bryce W. Gillis, Yoonho Chung, and Justin T. Baker.
Introduction

Serious mental illness (SMI), including mood and psychotic disorders, is characterized by elevated negative affect (e.g., sadness, anxiety, anger) that often persists for hours and disrupts daily functioning. Smartphones, owned by most adults, offer passive, continuous sensing (accelerometer, GPS, usage metadata) that may signal shifts in affect in real-world contexts. The study investigates whether passively collected smartphone (and wearable) data can reliably predict within-person states of heightened negative affect (HNA) in individuals with SMI. It compares machine learning methods (random forest, SVM, elastic net) with a conventional mixed-effects model and evaluates a personalized ensemble modeling strategy that can leverage patterns across individuals. It also extends the observation period to roughly one year per participant to capture clinically meaningful variability, examines the added value of wrist-worn accelerometry, and addresses heterogeneity in person-specific associations between sensor features and negative emotions. The overarching goal is to enable just-in-time, smartphone-delivered interventions triggered by predicted elevations in negative affect.

Literature Review

Prior work suggests that passively collected smartphone data can correlate with or predict elevated negative emotions and depressive symptoms, though most studies used short monitoring windows (days to weeks), potentially limiting within-person variance and power. Some studies found that random forests outperformed other algorithms for predicting negative emotional states in youth with depressive symptoms. Wrist-worn devices can provide continuous measures of activity, sleep, and physiology, but their psychometric properties vary by device and compliance; research-grade devices with raw data access (e.g., GENEActiv) improve transparency. Evidence also indicates substantial between-person variability in how behaviors relate to emotions, motivating idiographic modeling, while some shared cross-person patterns suggest potential gains from borrowing information across individuals via ensemble strategies.

Methodology

Design and participants: Seventy adults with primary affective (bipolar I/II, major depressive disorder) or psychotic disorders (schizophrenia, schizoaffective) were enrolled (2015–2019). Passive smartphone signals were available for 68 participants and wrist-worn actigraphy for 31. Participants were recruited through McLean Hospital divisions and the Mass General Brigham Rally platform. Planned observation was one year per participant, with the option to extend (M = 465 days, SD = 426, range 3–2044). In total, 12,959 daily emotion surveys were collected. Ethics approval was obtained (MGB IRB 2015P002189) with informed consent.

Data collection: Participants installed the Beiwe app (Android/iOS) for semi-continuous passive sensing (accelerometer, GPS, screen usage) and daily ecological surveys. Optional wrist-worn actigraphy (GENEActiv) was worn continuously; devices were swapped at monthly visits.

Measures: Outcomes were daily self-reported negative affect items (anxious, irritable, upset, lonely), each rated on a 1–4 scale (not at all to extremely). Predictors included engineered daily features from smartphone accelerometry and screen usage (sleep epoch phone use, sleep onset/offset, sleep duration, phone use during wake, difference between wake/sleep use, missing data) and GPS-derived mobility and location features processed by the DPLocate pipeline (distance from home, radius of mobility, percent time at home, number of locations visited, GPS missing time). Additional wrist-based features from DPSleep (accepted watch days, sleep onset/offset, sleep duration, activity during sleep/wake, wake–sleep activity difference, sleep fragmentation) were computed for the subset with actigraphy (n = 31), but high missingness limited primary analyses to smartphone predictors (wrist results in supplement).

Data preprocessing and imputation: Missing predictor data were imputed using multiple imputation by chained equations (MICE; 5 imputations, 5 iterations), excluding outcomes. For prediction, mean values across imputations were used. Outcomes were not imputed. Missingness was assumed to be missing at random (MAR).

Outcome derivation (HNA states): For each emotion, a high negative affect (HNA) event was defined when a day’s rating exceeded the participant’s person-specific mean by at least 0.5 points. Participants were included per emotion if they had ≥10 total observations and at least max(4, 10%) HNA events. Final Ns per emotion: anxious (35), irritable (36), upset (40), lonely (42).

Modeling approach: HNA states were predicted from same-day passive features using two approaches. (1) Generalized linear mixed-effects regression (GLMER; logit link) with a subject-specific random intercept, included for comparison. (2) Personalized Ensemble Model (PEM): for each individual, idiosyncratic models (IMs) were trained using multiple learning algorithms (elastic net GLM, SVM, random forest). A personalized weighted ensemble for each person was computed by optimizing nonnegative weights summing to 1 via cross-validated log-loss. A combined ensemble across the three algorithms (“PDEM”) was also evaluated. Ten-fold cross-validation within subject time points estimated the weights; a time-series CV yielded similar results (supplement). Feature engineering included parallel IMs using raw features and principal components (PCA) of features, allowing the ensemble to combine both.

Feature importance and clustering: Feature importance for the best-performing PEMs was derived by linearly combining IM importances weighted by the learned ensemble weights and normalized per subject. K-medoids clustering (Euclidean, up to five clusters, selected by silhouette width) was applied to feature-importance profiles to visualize subgroups with similar signatures.

Code availability: https://github.com/bcj2024/BLS_ensemble
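
Two steps of the methodology, the person-specific HNA labeling rule and the fitting of nonnegative ensemble weights that sum to 1 by minimizing log-loss, can be illustrated with a minimal Python sketch. This is not taken from the authors' repository: the function names, the toy ratings and predictions, and the exponentiated-gradient optimizer are all assumptions chosen for illustration, and the predictions passed in stand in for the cross-validated (out-of-fold) predictions described in the text.

```python
import math

def label_hna(ratings, threshold=0.5):
    """Flag days whose rating is at least `threshold` above the person's own mean."""
    mean = sum(ratings) / len(ratings)
    return [1 if r - mean >= threshold else 0 for r in ratings]

def blend(preds, weights):
    """Weighted average of K probability vectors (one per candidate model)."""
    return [sum(w * p[i] for w, p in zip(weights, preds))
            for i in range(len(preds[0]))]

def log_loss(y, p, eps=1e-12):
    """Mean binary cross-entropy, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for yi, pi in zip(y, p):
        pi = min(max(pi, eps), 1 - eps)
        total -= yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
    return total / len(y)

def fit_simplex_weights(y, preds, steps=2000, lr=0.1, eps=1e-12):
    """Exponentiated-gradient search (an assumed optimizer) over the simplex."""
    k, n = len(preds), len(y)
    w = [1.0 / k] * k
    best_w, best_loss = w, log_loss(y, blend(preds, w))
    for _ in range(steps):
        p = [min(max(v, eps), 1 - eps) for v in blend(preds, w)]
        # Gradient of the mean log-loss with respect to each model's weight.
        grad = [
            -sum((yi / pi - (1 - yi) / (1 - pi)) * pk[i]
                 for i, (yi, pi) in enumerate(zip(y, p))) / n
            for pk in preds
        ]
        # Multiplicative update keeps weights positive; renormalizing keeps the sum at 1.
        w = [wi * math.exp(-lr * g) for wi, g in zip(w, grad)]
        total = sum(w)
        w = [wi / total for wi in w]
        loss = log_loss(y, blend(preds, w))
        if loss < best_loss:  # keep the best weights seen so far
            best_w, best_loss = w, loss
    return best_w

# Hypothetical data for one participant: ten daily ratings on the 1-4 scale.
ratings = [1, 1, 2, 1, 3, 1, 1, 4, 1, 2]
labels = label_hna(ratings)

# Hypothetical predicted HNA probabilities from three candidate models.
own_model = [0.1, 0.2, 0.1, 0.2, 0.8, 0.1, 0.2, 0.9, 0.1, 0.2]
peer_model = [0.3, 0.2, 0.4, 0.3, 0.6, 0.3, 0.2, 0.7, 0.4, 0.3]
constant = [0.5] * 10
preds = [own_model, peer_model, constant]

weights = fit_simplex_weights(labels, preds)
print(labels, [round(w, 3) for w in weights])
```

In this toy setup the uninformative constant model should receive less weight than the participant's own model. Including models trained on other participants among the candidates is how such an ensemble can borrow information across people.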

Key Findings
  • Data volume and HNA prevalence: On average, >200 daily ratings per participant. HNA states occurred in ~25–30% of observations and were large elevations above person-specific means (Cohen’s d ~1.33–1.45 across emotions).
  • Predictive performance: All machine learning PEMs performed above chance (AUC > 0.50) and outperformed GLMER. PDEM and PEM-RF had the best performance across emotions. AUCs (PEM-Enet / PEM-SVM / PEM-RF / PDEM / GLMER):
    • Anxious: 0.749 / 0.722 / 0.764 / 0.765 / 0.687
    • Irritable: 0.711 / 0.683 / 0.716 / 0.724 / 0.648
    • Upset: 0.716 / 0.688 / 0.717 / 0.728 / 0.643
    • Lonely: 0.774 / 0.739 / 0.787 / 0.791 / 0.690
    Accuracy, sensitivity, and specificity at optimal cutoffs were highest for loneliness (e.g., PEM-RF: accuracy 0.710, sensitivity 0.740, specificity 0.699).
  • Feature importance: GPS-derived mobility/location features (distance from home, radius of mobility, percent time at home) were the most predictive on average across emotions. There was marked heterogeneity between individuals in both feature ranking and directionality. For example, 80% of participants showed an inverse association between GPS radius of mobility and loneliness, whereas activity–anxiety associations were split in direction across participants.
  • Benefit of personalized ensembles: PEMs improved over the corresponding idiosyncratic models, with a median AUC increase of ~0.05 and a median Brier score decrease of ~0.013, indicating consistent gains for ≥75% of participants. Weight patterns showed that most individuals benefited from borrowing information from models trained on other participants rather than relying solely on their own data.
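
The comparison in the last bullet rests on two metrics computed per participant and then summarized by the median. As a rough sketch of how such a summary could be computed (the participant IDs, labels, and probabilities below are invented toy values, not study data, and the helper names are hypothetical):

```python
from statistics import median

def auc(y, p):
    """Chance that a random HNA day is scored above a random non-HNA day (ties count 0.5)."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def brier(y, p):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)

# Toy data: participant -> (HNA labels, idiosyncratic-model probs, ensemble probs).
participants = {
    "P1": ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8], [0.1, 0.3, 0.6, 0.8]),
    "P2": ([0, 1, 0, 1], [0.2, 0.6, 0.5, 0.4], [0.2, 0.7, 0.4, 0.6]),
    "P3": ([1, 0, 0, 1], [0.9, 0.2, 0.3, 0.7], [0.8, 0.2, 0.3, 0.7]),
}

# Per-participant change when moving from the idiosyncratic model to the ensemble.
auc_gains = [auc(y, ens) - auc(y, idio) for y, idio, ens in participants.values()]
brier_changes = [brier(y, ens) - brier(y, idio) for y, idio, ens in participants.values()]

median_auc_gain = median(auc_gains)
median_brier_change = median(brier_changes)
print(median_auc_gain, median_brier_change)
```

A positive median AUC gain together with a negative median Brier change corresponds to the pattern reported above: better discrimination and better-calibrated probabilities for the typical participant.
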
Discussion

Findings demonstrate that passive smartphone sensing combined with machine learning can predict within-person elevations in negative affect among individuals with SMI better than chance, without requiring burdensome self-report beyond a brief daily survey. Random forest and the personalized ensemble achieved the strongest performance, particularly for loneliness, suggesting that non-linearities and interactions among passive features are informative. GPS-based mobility/location signals were the most influential predictors overall, though person-specific heterogeneity in predictive patterns was substantial, reinforcing the value of individualized models. At the same time, personalized ensembles that borrow information across participants further improved accuracy, implying shared cross-individual structure that can be leveraged when individual data are limited. These capabilities could enable just-in-time smartphone-delivered interventions triggered by predicted HNA states and tailored to detected behavioral patterns (e.g., increased time at home with reduced social contact or adverse sleep changes).

Conclusion

This study shows that long-term, multimodal passive sensing combined with personalized ensemble machine learning can accurately predict daily states of heightened negative affect in individuals with SMI. GPS-derived mobility features were especially informative, and personalized ensembles outperformed both simpler mixed-effects models and purely idiographic models. The work supports the feasibility of deploying just-in-time, smartphone-delivered interventions triggered by predictive models to mitigate negative affect in real time. Future research should increase sampling density of affect (multiple times per day), incorporate richer passive features (e.g., call/text metadata, social media patterns), expand sample sizes to enhance generalizability and clustering reliability, refine dimensionality reduction beyond PCA, reduce wearable data missingness, and prospectively test the clinical impact of algorithm-triggered interventions.

Limitations
  • Affect was assessed once daily; more frequent EMA could capture intra-day variability.
  • Additional smartphone features (e.g., call/text metadata, social media use) were not included and may improve predictions.
  • Although within-person analyses used many time points per participant, the between-subject sample size was modest.
  • Generalizability may be limited due to recruitment via clinical divisions and advertisement, potentially introducing selection bias.
  • Small sample size reduces reliability of clustering based on feature importance.
  • PCA may not have been the optimal dimensionality reduction method, as models using only PCs underperformed compared to those combining PCs with raw features.
  • Wrist-worn actigraphy had high missingness, limiting its contribution to primary analyses (results reported in supplement).