Medicine and Health
Mobile sensing-based depression severity assessment in participants with heterogeneous mental health conditions
B. Lamichhane, N. Moukaddam, et al.
The study investigates whether passive mobile sensing can objectively assess depression severity across heterogeneous mental health conditions. It addresses a limitation of questionnaire-based assessments (e.g., PHQ-9), which rely on subjective recall and may be biased. Prior mobile sensing work often focused on homogeneous cohorts (e.g., students, or clinical groups without healthy controls), which limits generalizability. The authors examine whether features derived from smartphone communication logs (calls/texts) and GPS-based mobility relate to PHQ-9 depression severity, and whether sociability features derived from free-living audio add predictive value. Because depressive symptoms are transdiagnostic, the goal is a generalized approach that applies across healthy individuals, those with major depressive disorder, and those with schizoaffective disorders.
Prior studies have shown associations between mobile sensing and depression: phone communication logs and mobility features correlate with depressive symptomatology; however, most cohorts were homogeneous (e.g., students with none/mild depression) or narrowly clinical without healthy controls. Meta-analytic synthesis across heterogeneous populations has been limited due to heterogeneity in features, designs, and reporting. Free-living audio has been used to infer social ambiance and dyadic interactions, with reported associations to depression, but its predictive utility for depression severity and complementarity to communication and mobility data remained underexplored. Speech-based markers under controlled settings (acoustic/semantic features, response times) show associations with depression, suggesting potential for free-living audio to capture social functioning relevant to mental health. A review suggested different features might be needed for clinical vs non-clinical groups, which complicates transdiagnostic applications; thus, testing generalization across heterogeneous groups is necessary.
Design: Cross-sectional observational pilot study with 32 participants across heterogeneous mental health conditions (healthy individuals, individuals with major depressive disorder, and individuals with schizoaffective disorders). All participants were outpatients, so the data reflect behavior in natural, free-living settings. Demographics: mean age 45.2 ± 13.6 years; 19 female/13 male; racially/ethnically diverse (50% Black/African American, 21.88% Hispanic, 18.75% White). Ethics: IRB approvals from Rice University, Baylor College of Medicine, and Harris Health Systems (H-41811); written informed consent was obtained from all participants.
Data collection (7-day monitoring):
- Smartphone app: logged incoming/outgoing calls (timestamps, durations) and text messages (timestamps); phone numbers one-way hashed. Collected GPS location (available for 19 participants).
- Wearable audioband: continuous daytime free-living audio (~12 hours/day, 8 AM–8 PM) across 7 days.
- Depression severity: PHQ-9.
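The summary says only that phone numbers were one-way hashed; the hash function, key handling, and number normalization are not specified. As a minimal sketch, assuming a keyed HMAC-SHA-256 and digits-only normalization (both hypothetical choices, as are the names `STUDY_KEY` and `hash_phone_number`), each number maps to a stable token, so repeated contacts can still be counted for the network-size features without storing raw numbers.

```python
import hashlib
import hmac

# Hypothetical study-wide secret key; in practice it would be kept off the device.
STUDY_KEY = b"replace-with-a-secret-study-key"


def hash_phone_number(number, key=STUDY_KEY):
    """One-way, keyed hash (HMAC-SHA-256) of a phone number.

    Keeping ASCII digits only is an assumed normalization, so that formatting
    differences such as spaces or parentheses do not create spurious new contacts.
    """
    digits = "".join(ch for ch in number if ch in "0123456789")
    return hmac.new(key, digits.encode("ascii"), hashlib.sha256).hexdigest()
```

A keyed hash is assumed rather than plain SHA-256 because phone numbers come from a small space, and unkeyed hashes could be reversed by enumerating candidate numbers. Counting distinct tokens per day then gives unique-contact counts.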
Feature engineering:
- Communication log features (per modality: calls, texts, and combined call+text):
  - Average daily number of communications.
  - Asymmetry coefficient AC = (out − in)/(out + in), where out and in are the numbers of outgoing and incoming communications.
  - Skewness coefficient SC = sign(out − in)·(1 − H(out, in)), where H is the Shannon entropy of the outgoing and incoming proportions, out/(out + in) and in/(out + in).
  - Average daily network size for incoming communications (number of unique contacts).
  - Average daily network size for outgoing communications.
  - Average call duration per day (calls modality only). Total: 16 features (5 per modality × 3 modalities, plus call duration).
- Mobility features from GPS (available for 19 participants; missing values for the rest were imputed during modeling):
  - Processing included speed estimation from Vincenty distances between consecutive GPS points, segmentation into stationary vs transitional points using a 1.4 m/s speed threshold, and clustering of stationary points with DBSCAN.
  - Average distance traveled per day (sum over stationary points normalized by count).
  - Location variance = log(σ²_latitude + σ²_longitude).
  - Cluster entropy = −Σ t_i log(t_i) and normalized entropy = entropy/log(k), where t_i is the fraction of time spent in cluster i and k is the number of clusters.
- Audio-derived sociability features via the ECoNet pipeline:
  - Voice activity detection (Pyannote/PyanNet-based) to segment speech.
  - Speaker embeddings (TDNN/x-vector model trained on large public datasets) and unsupervised clustering to assign speaker labels; a random forest filter to remove spurious speakers; speaker tracking across days.
  - Features: conversational network size (average number of distinct conversation partners per day), top speakers’ speech ratio (t_top-2-speakers/t_total-speech), and response time in dyadic interactions.
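The communication and mobility formulas above can be sketched in plain Python. This is an illustrative reconstruction, not the authors' code, and all function names are hypothetical. Several choices are assumptions: base-2 entropy in SC (which keeps SC within [−1, 1]), the zero-count conventions, population variance, a small epsilon inside the logarithm, and the haversine distance standing in for the Vincenty distance used in the study. DBSCAN clustering and the ECoNet audio pipeline are not reproduced; the last function only shows the top-2 speech-ratio arithmetic, given per-speaker speech durations.

```python
import math
import statistics

# --- Communication-log features (per modality: calls, texts, or both) ---

def asymmetry_coefficient(n_out, n_in):
    """AC = (out - in) / (out + in); returns 0.0 when there were no events (assumed)."""
    total = n_out + n_in
    return 0.0 if total == 0 else (n_out - n_in) / total


def skewness_coefficient(n_out, n_in):
    """SC = sign(out - in) * (1 - H), with H the base-2 Shannon entropy of the
    outgoing/incoming proportions (base 2 is an assumption)."""
    total = n_out + n_in
    if total == 0:
        return 0.0
    h = 0.0
    for count in (n_out, n_in):
        p = count / total
        if p > 0:
            h -= p * math.log2(p)
    sign = (n_out > n_in) - (n_out < n_in)  # +1, 0, or -1
    return sign * (1.0 - h)


# --- Mobility features ---

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres; a simpler stand-in for the Vincenty distance."""
    r = 6_371_000.0  # mean Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def label_stationary(points, threshold_mps=1.4):
    """points: time-sorted (unix_seconds, lat, lon) tuples.
    A point is stationary if its speed from the previous point is below the threshold;
    the first point has no speed estimate and is assumed stationary."""
    flags = [True] if points else []
    for (t0, la0, lo0), (t1, la1, lo1) in zip(points, points[1:]):
        dt = t1 - t0
        speed = haversine_m(la0, lo0, la1, lo1) / dt if dt > 0 else 0.0
        flags.append(speed < threshold_mps)
    return flags


def location_variance(lats, lons, eps=1e-12):
    """log(var(lat) + var(lon)); eps (assumed) avoids log(0) for a single fixed location."""
    return math.log(statistics.pvariance(lats) + statistics.pvariance(lons) + eps)


def cluster_entropy(time_per_cluster):
    """Return (entropy, normalized entropy) from the time spent in each location cluster."""
    total = sum(time_per_cluster)
    fractions = [t / total for t in time_per_cluster if t > 0]
    entropy = -sum(p * math.log(p) for p in fractions)
    k = len(fractions)  # clusters actually visited
    normalized = entropy / math.log(k) if k > 1 else 0.0
    return entropy, normalized


# --- Audio sociability (given per-speaker speech durations from diarization) ---

def top_speakers_speech_ratio(speech_seconds_by_speaker, top_n=2):
    """Share of all detected speech produced by the top_n most talkative speakers."""
    durations = sorted(speech_seconds_by_speaker.values(), reverse=True)
    total = sum(durations)
    return sum(durations[:top_n]) / total if total > 0 else 0.0
```

With base-2 entropy, SC is 0 for perfectly balanced traffic (H = 1) and ±1 when all traffic flows in one direction (H = 0), so it complements AC by emphasizing how lopsided the exchange is.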
Prediction models and evaluation:
- Regression: Ridge regression (linear regression with L2 regularization), evaluated with leave-one-participant-out (LOPO) cross-validation; the regularization weight α was selected via inner CV on each training set. Baseline: mean PHQ-9 of the training set. Feature sets: Model A (communication + location), Model B (audio only), Model C (early fusion of all features), and late fusion (average of Model A and Model B predictions). The models always operated in the p < n regime (fewer features than training participants); no explicit feature selection was applied beyond regularization.
- Missing data: For participants missing GPS-based features (only 19 had GPS), values were imputed with a k-NN imputer (k=3) fitted on the training data within each CV fold; this outperformed mean and median imputation on this dataset.
- Metrics: RMSE and normalized RMSE (nRMSE = RMSE/range of observed PHQ-9).
- Classification: Ordinal logistic regression (all-threshold variant) for the five PHQ-9 categories (none, mild, moderate, moderately severe, severe), evaluated with weighted F1; multinomial logistic regression (L2) was used for comparison; the baseline predicts the majority class. Binary classification was also evaluated (PHQ-9 ≥ 5 as depression present).
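The regression protocol above (LOPO folds, k-NN imputation fitted only on the training participants, ridge regression, a mean-PHQ-9 baseline, and nRMSE) can be sketched without external libraries. This is a simplified, hypothetical implementation: α is fixed rather than tuned by inner CV, missing values are marked with `None`, features are standardized inside each fold (an assumed preprocessing step), and the k-NN distance (root mean squared difference over co-observed features) only approximates imputers such as scikit-learn's `KNNImputer`.

```python
import math


def _distance(a, b):
    """RMS difference over features observed in both rows (inf if none are shared)."""
    sq = [(x - y) ** 2 for x, y in zip(a, b) if x is not None and y is not None]
    return math.sqrt(sum(sq) / len(sq)) if sq else math.inf


def knn_impute(donor_rows, row, k=3):
    """Fill None entries of `row` with the mean of that feature over its k nearest donors.
    Assumes at least one donor observes each missing feature."""
    filled = list(row)
    for j, value in enumerate(row):
        if value is None:
            donors = sorted((r for r in donor_rows if r[j] is not None),
                            key=lambda r: _distance(row, r))[:k]
            filled[j] = sum(r[j] for r in donors) / len(donors)
    return filled


def _solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ridge(X, y, alpha):
    """Ridge on standardized features with an unpenalized intercept; returns a predictor."""
    n, p = len(X), len(X[0])
    means = [sum(r[j] for r in X) / n for j in range(p)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in X) / n) or 1.0 for j in range(p)]
    Z = [[(r[j] - means[j]) / stds[j] for j in range(p)] for r in X]
    y_mean = sum(y) / n
    gram = [[sum(z[i] * z[j] for z in Z) + (alpha if i == j else 0.0) for j in range(p)]
            for i in range(p)]
    rhs = [sum(z[i] * (t - y_mean) for z, t in zip(Z, y)) for i in range(p)]
    w = _solve(gram, rhs)
    return lambda row: y_mean + sum(
        wj * (v - m) / s for wj, v, m, s in zip(w, row, means, stds))


def lopo_evaluate(X, y, alpha=1.0, k=3):
    """Leave-one-participant-out CV; returns (RMSE, nRMSE) for ridge and the mean baseline."""
    preds, base = [], []
    for i in range(len(X)):
        train_raw = [row for j, row in enumerate(X) if j != i]
        y_train = [v for j, v in enumerate(y) if j != i]
        train = [knn_impute(train_raw, row, k) for row in train_raw]
        test = knn_impute(train_raw, X[i], k)  # donors come from the training fold only
        preds.append(fit_ridge(train, y_train, alpha)(test))
        base.append(sum(y_train) / len(y_train))  # baseline: training-set mean PHQ-9

    def rmse(p):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, y)) / len(y))

    spread = max(y) - min(y)  # range of observed PHQ-9 scores
    return {"ridge": (rmse(preds), rmse(preds) / spread),
            "baseline": (rmse(base), rmse(base) / spread)}
```

Imputing the held-out participant from training donors only, and refitting the feature means and standard deviations in every fold, keeps information about the test participant out of the fitted model, which matters with only 32 participants.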
Associations (Pearson correlations with PHQ-9):
- Communication logs: Significant correlations included network size (incoming text) r=0.47, p=0.014; network size (outgoing text) r=−0.44, p=0.026; skewness coefficient (call+text network) r=0.40, p=0.045; asymmetry coefficient (call+text network) r=0.39, p=0.051 (borderline). Average call duration r=−0.29, p=0.155 and total calls r=0.00, p=0.992 were not significant.
- Mobility (GPS): Normalized entropy r=0.51, p=0.026 (significant); entropy r=−0.29, p=0.232; total distance r=−0.39, p=0.094; location variance r=−0.251, p=0.299.
- Audio sociability: Conversational network size r=0.49, p=0.014 (significant); top speakers' speech r=0.21, p=0.318; response time r=0.39, p=0.062 (trend).
- Audio conversational network size significantly correlated with location entropy features, indicating partial overlap with mobility-derived sociability, while other audio features were largely independent.
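The associations above are Pearson correlations with two-sided p-values. A stdlib-only sketch, assuming pairwise deletion of missing values (so GPS features would use only participants with GPS data) and computing the p-value from the t statistic t = r·√(n − 2)/√(1 − r²) with n − 2 degrees of freedom, integrated numerically with Simpson's rule. The function names are hypothetical; in practice a library routine such as `scipy.stats.pearsonr` performs the same test.

```python
import math


def t_two_sided_p(t, df, steps=2000):
    """Two-sided p-value of Student's t, via Simpson's rule on the density (steps is even)."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    a = abs(t)
    h = a / steps
    s = pdf(0.0) + pdf(a)
    for i in range(1, steps):
        s += (4 if i % 2 else 2) * pdf(i * h)
    area = s * h / 3  # integral of the density from 0 to |t|
    return max(0.0, 1.0 - 2.0 * area)


def pearson_with_p(x, y):
    """Pearson r, two-sided p, and n, dropping pairs where either value is None."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    r = sxy / math.sqrt(sxx * syy)
    if abs(r) >= 1.0:
        return r, 0.0, n  # perfect correlation: p-value is 0
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
    return r, t_two_sided_p(t, n - 2), n
```

For example, r = 0.51 with n = 19 gives p ≈ 0.026, matching the value reported for normalized entropy. This is consistent with, though not proof of, the mobility correlations using only the 19 participants with GPS data.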
Prediction performance (LOPO CV):
- Baseline (mean predictor): RMSE 9.63, nRMSE 0.36.
- Model A (communication + location): RMSE 6.80, nRMSE 0.25.
- Model B (audio-only): RMSE 8.53, nRMSE 0.32.
- Model C (early fusion: all features): RMSE 6.07, nRMSE 0.22; Pearson correlation between predicted and reported PHQ-9 = 0.76, p<0.001.
- Late fusion (average of A and B): RMSE 6.92, nRMSE 0.26.
Classification (5 classes, weighted F1):
- Ordinal logistic regression: Model C 0.46 (best), Model A 0.34, Model B 0.34, Baseline 0.25. Non-ordinal (multinomial) logistic regression underperformed (e.g., Model C 0.34), highlighting the benefit of ordinal modeling.
- Binary classification (PHQ-9≥5): F1 scores: Model C 0.66; Model B 0.57; Model A 0.50; Baseline 0.42.
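The classification results rely on the standard PHQ-9 severity bands (0–4 none, 5–9 mild, 10–14 moderate, 15–19 moderately severe, 20–27 severe) and on support-weighted F1. The sketch below, with hypothetical helper names, maps scores to the five bands and to the binary PHQ-9 ≥ 5 label, computes weighted F1 as per-class F1 averaged with weights equal to each class's share of true labels (the same definition as scikit-learn's `f1_score(..., average="weighted")`), and implements the majority-class baseline. The ordinal logistic regression itself is not reproduced.

```python
from collections import Counter

# Standard PHQ-9 severity bands (upper bound inclusive).
PHQ9_BANDS = [(4, "none"), (9, "mild"), (14, "moderate"),
              (19, "moderately severe"), (27, "severe")]


def phq9_category(score):
    """Map a PHQ-9 total (0-27) to its five-level severity band."""
    if not 0 <= score <= 27:
        raise ValueError("PHQ-9 totals range from 0 to 27")
    return next(label for upper, label in PHQ9_BANDS if score <= upper)


def phq9_binary(score):
    """Binary label used in the study: depression present if PHQ-9 >= 5."""
    return score >= 5


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support in y_true."""
    total = len(y_true)
    score = 0.0
    for label, support in Counter(y_true).items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = support - tp
        score += (support / total) * (2 * tp / (2 * tp + fp + fn))
    return score


def majority_baseline(y_train, n_test):
    """Baseline classifier: predict the most frequent training class for every case."""
    label = Counter(y_train).most_common(1)[0][0]
    return [label] * n_test
```

Because the majority baseline never predicts the minority classes, their F1 is 0, which is why its weighted F1 stays well below that of models that distinguish several severity levels.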
Group-level patterns:
- Both reactive and proactive phone communication patterns were observed; the psychosis group tended to be more proactive. The depression and psychosis groups had smaller text-messaging and in-person conversational network sizes than the healthy group, while the psychosis group's call network size was comparable to that of the healthy group.
Mobile sensing features relating to sociability and mobility are associated with depression severity across heterogeneous mental health conditions, supporting transdiagnostic applicability. Text-based social network size and communication asymmetry/skewness relate to PHQ-9 severity, while GPS-derived normalized entropy and audio-based conversational network size also show significant associations. Participants with psychosis exhibited distinct communication patterns (more proactive, with call network sizes comparable to healthy participants) despite higher depressive symptoms, underscoring the value of heterogeneous cohorts for identifying robust features. Incorporating free-living audio sociability features improved both regression (nRMSE reduced from 0.25 to 0.22) and ordinal classification (weighted F1 increased from 0.34 to 0.46), demonstrating information complementary to smartphone logs and mobility. The results align with prior work on mobile sensing for depression and extend it by leveraging free-living audio and validating across diverse diagnostic groups. This multimodal approach could enhance screening and monitoring of depressive symptoms in broader populations, though longitudinal validation and broader population studies are needed.
Mobile sensing can provide an objective method to assess depression severity across heterogeneous mental health conditions. Free-living audio-derived sociability features complement commonly used smartphone communication and location data, improving both continuous PHQ-9 prediction and categorical severity classification. The approach presents a step toward scalable, transdiagnostic screening and monitoring of depressive symptoms. Future work should expand to larger, multi-site, and longitudinal studies, explore privacy-preserving yet informative audio features, and evaluate end-to-end representation learning approaches to further improve generalizability and performance.
Limitations:
- Small pilot sample and single-site study limit generalizability; sociability and mobility patterns may vary by region, culture, and setting.
- Cross-sectional design; does not capture within-subject longitudinal changes in depression severity.
- Potential confounders (e.g., individual differences in baseline communication habits) can affect features; absolute communication volumes and network sizes may be influenced by inter-individual factors.
- Incomplete GPS data (only 19 participants); the k-NN imputation used to fill missing values may introduce bias.
- Handcrafted features only; the limited data precluded explicit feature selection and representation learning. End-to-end models and ordinal loss functions could improve performance.
- Privacy and ethical concerns for continuous audio sensing; although content was not inferred, broader acceptability and privacy-preserving feature extraction need further investigation.
- Prior evidence suggests some mobility-based predictors may not scale to demographically heterogeneous populations, requiring multi-site validation and possibly invariant feature design.