Medicine and Health
Systematic review and meta-analysis of performance of wearable artificial intelligence in detecting and predicting depression
A. Abd-alrazaq, R. Alsaad, et al.
This systematic review and meta-analysis evaluates the performance of wearable artificial intelligence in detecting and predicting depression, reporting a pooled highest accuracy of 0.89. Despite these promising results, the study by Alaa Abd-Alrazaq and colleagues emphasizes that wearable AI is not yet ready for clinical application and needs further research.
Introduction
Depression is a serious illness that affects ~3.8% of the population worldwide (i.e., 260 million people). Depression causes feelings of sadness and/or loss of interest in activities that were once enjoyed and can lead to a variety of emotional and physical problems. Individuals with depression may have decreased ability to function at home and/or work and may experience changes in appetite, sleep patterns, fatigue, feelings of worthlessness or guilt, poor concentration, impaired decision-making, and increased risk of suicide. If left untreated, depression can become chronic and lead to poor quality of life. One study estimated a 28.9-year loss in quality-adjusted life expectancy due to depression, underscoring the importance of early detection.
Current assessments rely on clinical observations, history, and self-reported questionnaires (e.g., PHQ-9), which are subjective, time-consuming, and difficult to repeat, leading to inaccuracies and challenges in personalized assessment. Global shortages of mental health professionals exacerbate delayed detection, particularly in low- and middle-income countries, and stigma further impedes early identification.
Wearable devices offer an avenue for automatic, objective, efficient, and real-time assessment by passively collecting biosignals such as heart rate, activity, sleep, blood oxygen, and respiratory rate. Forms include watches, bands, jewelry, shoes, and clothing; categories include on-body, near-body, in-body, and electronic textiles. Adoption is increasing globally. Wearable-derived parameters can assess depression symptoms and may enhance detection and prediction, motivating a systematic assessment of how well wearable AI performs at these tasks.
Methodology
Search strategy: Eight databases (MEDLINE via Ovid, PsycINFO via Ovid, EMBASE via Ovid, CINAHL via EBSCO, IEEE Xplore, ACM Digital Library, Scopus, and Google Scholar) were searched on October 3, 2022. Automatic PubMed alerts continued for 3 months (until January 2, 2023). For Google Scholar, only the first 100 hits were screened. Backward and forward citation checking of included studies was performed.
Eligibility criteria: Included studies developed AI algorithms to detect current depression status or predict future occurrence/level of depression using non-invasive on-body wearable data (e.g., smartwatches, smart glasses, smart clothes). Studies could include additional data sources (e.g., questionnaires) but had to evaluate AI performance for detecting/predicting depression and report a confusion matrix and/or performance measures (e.g., accuracy, sensitivity, specificity, RMSE). Excluded: non-wearable, hand-held, near-body, in-body/implantable devices, wearables wired to non-wearables, or those requiring expert sensor placement; studies predicting treatment outcomes; non-English; preprints, reviews, abstracts, posters, protocols, editorials, comments.
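The performance measures named above all derive mechanically from reported results. As a minimal illustration (all counts and scores below are hypothetical, not taken from any included study), accuracy, sensitivity, and specificity follow from a 2x2 confusion matrix, and RMSE from predicted versus observed severity scores:

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive accuracy, sensitivity, and specificity from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # depressed participants correctly flagged
        "specificity": tn / (tn + fp),  # non-depressed participants correctly cleared
    }

def rmse(predicted, observed):
    """Root-mean-square error for continuous severity scores (e.g., PHQ-9)."""
    n = len(predicted)
    return (sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n) ** 0.5

# Hypothetical study: 40 depressed participants, 60 controls.
m = classification_metrics(tp=35, fp=6, fn=5, tn=54)
print(m)  # accuracy 0.89, sensitivity 0.875, specificity 0.90
print(rmse([5.0, 9.5, 14.0], [6.0, 8.0, 15.0]))
```

This is also why the review could recompute measures itself whenever a study reported raw confusion-matrix counts rather than the metrics directly.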
Study selection: Duplicates were removed in EndNote X9. Two reviewers independently screened titles/abstracts, then full texts, resolving disagreements by discussion. Inter-rater agreement: Cohen’s kappa 0.85 (title/abstract) and 0.92 (full text).
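Cohen's kappa, the agreement statistic reported for both screening stages, corrects raw percent agreement for the agreement two raters would reach by chance. A minimal sketch (the screening labels below are hypothetical):

```python
def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same items."""
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater labelled independently at their marginal rates
    expected = sum(
        (rater_a.count(lab) / n) * (rater_b.count(lab) / n) for lab in labels
    )
    return (observed - expected) / (1 - expected)

a = ["include", "exclude", "include", "include", "exclude", "exclude"]
b = ["include", "exclude", "exclude", "include", "exclude", "exclude"]
print(round(cohens_kappa(a, b), 3))  # 0.667
```

Values of 0.85 and 0.92, as reported here, indicate near-perfect agreement on common interpretation scales.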
Data extraction: Two reviewers independently extracted meta-data, wearable devices, AI algorithms, and performance results (calculating measures when raw data were available). Results based solely on non-wearable data were not extracted. When multiple experiments per study existed, one effect size per relevant result was used.
Risk of bias and applicability: A modified QUADAS-2 (with elements from PROBAST) assessed four bias domains (participants, index test/AI, reference standard/ground truth, analysis) and three applicability domains (participants, index test, reference standard). Two reviewers assessed independently with consensus resolution.
Data synthesis and analysis: Narrative synthesis and three-level random-effects meta-analyses were conducted to account for multiple effect sizes per study (level 1: repeated analyses within study; level 2: between-study effects; plus sampling variance). Pooled means were computed for accuracy, sensitivity, specificity, and RMSE. Subgroup analyses stratified by AI algorithms, AI aim (detection vs prediction), wearable device, data source (open vs closed), data types, and reference standards. Heterogeneity was assessed using I² and Cochran’s Q (p<0.05 indicating significant heterogeneity). Heterogeneity thresholds: I² 0–40% (might not be important), 30–60% (moderate), 50–90% (substantial), 75–100% (considerable). Analyses used R v4.2.2.
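For intuition, the pooling and heterogeneity machinery can be sketched in simplified form. The review fitted a three-level model (effect sizes nested within studies); the classic two-level DerSimonian-Laird estimator below ignores that nesting and is shown only to illustrate how a pooled mean, the between-study variance τ², Cochran's Q, and I² relate. The effect sizes and variances are hypothetical.

```python
def dersimonian_laird(effects, variances):
    """Two-level random-effects pooling with Cochran's Q and I^2 (simplified)."""
    k = len(effects)
    w = [1 / v for v in variances]                        # fixed-effect weights
    fixed = sum(wi * e for wi, e in zip(w, effects)) / sum(w)
    q = sum(wi * (e - fixed) ** 2 for wi, e in zip(w, effects))  # Cochran's Q
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)                    # between-study variance
    w_re = [1 / (v + tau2) for v in variances]            # random-effects weights
    pooled = sum(wi * e for wi, e in zip(w_re, effects)) / sum(w_re)
    i2 = max(0.0, (q - (k - 1)) / q) * 100 if q > 0 else 0.0
    return pooled, tau2, i2

accuracies = [0.92, 0.85, 0.78, 0.95, 0.70]    # hypothetical per-study accuracies
variances = [0.001, 0.002, 0.003, 0.001, 0.004]
pooled, tau2, i2 = dersimonian_laird(accuracies, variances)
print(f"pooled={pooled:.3f}, tau^2={tau2:.4f}, I^2={i2:.1f}%")
```

In practice such models are fitted with dedicated tooling (e.g., the metafor package in R, which supports the multilevel structure used here); the point of the sketch is that I² near 100%, as observed throughout this review, means nearly all observed variation reflects genuine between-study differences rather than sampling error.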
Key Findings
Study selection and scope: From 1314 records, 54 studies (2015–2022) were included. Most were journal articles (74.1%). Studies spanned 17 countries, with the USA contributing the most. Sample sizes ranged 8–4036 (mean ~316). Half recruited both depressed patients and healthy controls.
Wearables and AI: Thirty wearable devices were used; wrist placement predominated (92.6%). Common devices: Actiwatch/ActiGraph variants (35.2%) and Fitbit (25.9%). AI tasks were mainly detection (88.9%) with classification approaches (81.5%) most common; some regression (9.3%) and mixed (9.3%). Frequently used algorithms: Random Forest (59.3%), Logistic Regression (24.1%), SVM (20.4%), XGBoost (18.5%). Data sources were closed (63%) or open (37%). Input data types were diverse, most commonly physical activity (87%), sleep (48.1%), heart rate (31.5%), plus smartphone usage, location, social interaction, and others. Ground-truth measures included MADRS (35.2%), PHQ-9/8/4 (25.9%), DSM-IV/5 (9.3%), HDRS (9.3%), BDI-II (9.3%), among others.
Risk of bias: Many studies had unclear participant selection (69% insufficient info). Sample sizes were often insufficient (44%). Most had low risk in index test (87%) and reference standard domains (89%), and 78% had low analysis bias. Applicability concerns were generally low across domains.
Meta-analysis outcomes (considerable heterogeneity throughout):
- Highest performance estimates: accuracy pooled mean 0.89 (95% CI ~0.83–0.93; 75 estimates, 35 studies), sensitivity 0.87 (0.79–0.92; 58 estimates, 29 studies), specificity 0.93 (0.87–0.97; 54 estimates, 28 studies), RMSE 4.55 (3.05–5.05; 5 estimates, 3 studies).
- Lowest performance estimates: accuracy pooled mean 0.70 overall (0.79 with a wide CI in the detailed meta-analysis); sensitivity 0.61 (0.49–0.72; 30 estimates, 21 studies); specificity 0.73 (0.62–0.83; 27 estimates, 20 studies); RMSE 3.76.
Subgroup analyses: Significant differences by algorithm were found for highest/lowest accuracy, highest sensitivity/specificity, and lowest specificity. Wearable device type showed significant differences for lowest sensitivity and lowest specificity. Considerable heterogeneity (I² often >98%) was present across analyses.
Discussion
Wearable AI demonstrates good but not optimal performance for detecting and predicting depression from passive biosignals. Pooled highest metrics suggest strong classification of non-depressed individuals (specificity ~0.93) and good detection of depressed individuals (sensitivity ~0.87), yet lowest-case performance across studies drops notably (sensitivity ~0.61; specificity ~0.73), indicating variability and context dependence. RMSE values for predicting depression severity (≈3.8–4.6) indicate only moderate precision.
Algorithm choice materially influences performance; AdaBoost often outperformed other methods in subgroup analyses, whereas logistic regression and decision trees tended to underperform. Device differences may also affect results; some analyses suggested better performance with certain actigraphy devices than with Fitbit, but these findings may be confounded by shared datasets and small numbers. Overall, findings support wearable AI’s promise for scalable, objective monitoring but highlight substantial heterogeneity, methodological variability, and limitations that preclude immediate clinical adoption. Until performance, generalizability, and robustness are improved, wearable AI should complement, not replace, standard diagnostic and monitoring approaches.
Conclusion
This systematic review and meta-analysis of 54 studies shows that wearable AI can detect and predict depression with generally high pooled best-case accuracy, sensitivity, and specificity, and moderate error for severity prediction, but its performance is inconsistent across studies and not yet clinically ready. The study contributes a comprehensive synthesis of devices, data types, algorithms, ground-truth standards, and rigorous multi-level meta-analytic estimates, along with risk-of-bias and applicability assessments. Future research should: (1) improve study design and sample sizes; (2) standardize ground-truth timing and definitions; (3) evaluate algorithm/device effects with diverse, representative cohorts; (4) integrate multi-modal data (e.g., combining wearable signals with neuroimaging); and (5) assess differential diagnosis to distinguish depression from other mental and physical conditions. In the interim, wearable AI should be used alongside established clinical methods.
Limitations
- High heterogeneity across studies (I² often >95%) due to diverse populations, devices, data types, and modeling choices.
- Participant selection often unclear (69%); many studies used insufficient sample sizes (44%), limiting generalizability.
- Only a minority used appropriate intervals between index tests and reference standards, potentially introducing temporal bias.
- Performance estimates frequently based on small numbers of studies per algorithm or device, risking unstable subgroup conclusions.
- Reporting inconsistencies and multiple analyses per study necessitated multi-level modeling; residual dependence may remain.
- Variability in ground-truth measures and data preprocessing pipelines complicates cross-study comparability.
- Results suggest promise but not readiness for clinical implementation without further validation and standardization.