Medicine and Health
Systematic review and meta-analysis of AI-based conversational agents for promoting mental health and well-being
H. Li, R. Zhang, et al.
This systematic review and meta-analysis by Han Li, Renwen Zhang, Yi-Chieh Lee, Robert E. Kraut, and David C. Mohr finds that AI-based conversational agents can significantly reduce symptoms of depression and distress, particularly when the agents are generative, multimodal, and delivered through mobile messaging. Overall psychological well-being, by contrast, showed no significant change, pointing to the need for research on mechanisms, long-term effects, and safe LLM integration.
Introduction
The study investigates whether AI-based conversational agents (CAs) improve mental health outcomes and which factors influence their effectiveness and user experience. While CAs are increasingly used for diagnosis, psychoeducation, and treatment, most prior systems have been rule-based, limiting context understanding and personalization. Advances in NLP and generative AI have enabled AI-based CAs with richer, adaptive interactions, but concerns remain about privacy, bias, safety, and unpredictable outputs. Prior reviews suggested CAs can reduce depression, anxiety, and distress and enhance well-being, but largely over-represented rule-based systems and did not incorporate recent generative AI/LLM developments or systematically examine user experience. This review aims to synthesize the effectiveness of AI-based CAs on psychological distress and well-being, identify moderators of effects, and narratively summarize user engagement and experience to inform safe, effective, user-centered deployment in mental health care.
Literature Review
Previous reviews and meta-analyses reported that conversational agents can mitigate depression, anxiety, and distress and support well-being and quality of life, but they often focused on specific CA types or disorders and primarily included rule-based systems. Two comprehensive reviews supported CA effectiveness across outcomes, yet the effectiveness of AI-based (as opposed to rule-based) CAs remained underexplored. Rapid developments in generative AI/LLMs introduce new opportunities and risks that prior reviews have not fully assessed. User experience factors (e.g., therapeutic alliance, content quality, communication quality) are known to affect outcomes in psychotherapy but were not conclusively addressed for AI-based CAs in prior work, indicating gaps this study targets.
Methodology
Design: Systematic review and meta-analysis following PRISMA, registered in PROSPERO (CRD42023392187). Searches covered 12 databases from inception to Aug 16, 2022, updated to May 26, 2023, limited to English-language publications.
Eligibility (PICOS): Population: all demographics; Intervention: AI-based CA as the primary intervention (two-way interaction leveraging NLP/ML/AI to understand intent/context and retrieve or generate responses); Comparator: any, including active machine/human controls, psychoeducation, usual care, waitlist, or no comparator (for quasi-experiments); Outcomes: psychological distress and/or psychological well-being (self-report, objective, or third-party measures); Study design: experimental studies (quasi-experiments and RCTs); RCTs were used for meta-analysis.
Screening and selection: After removing duplicates, 7834 records were screened by title/abstract, 533 full texts were assessed; 35 studies (34 full-text articles) met inclusion for systematic review; 15 randomized trials (1744 participants) were eligible for meta-analysis (one RCT lacked data; 19 non-RCTs excluded from meta-analysis). Inter-rater reliability: Cohen’s kappa 0.90 (title/abstract) and 0.83 (full text); disagreements resolved via discussion/third reviewer.
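Cohen's kappa, used here to quantify inter-rater reliability, corrects raw agreement for the agreement expected by chance. The following is a minimal illustrative sketch of that computation (not the authors' code; the example labels, 1 = include and 0 = exclude, are assumptions):

```python
def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters' screening decisions."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    labels = sorted(set(rater_a) | set(rater_b))
    # Observed agreement: proportion of items both raters labeled the same.
    p_o = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    p_e = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

# Toy screening decisions (1 = include, 0 = exclude):
kappa = cohens_kappa([1, 1, 1, 0], [1, 1, 0, 0])  # 0.5 on this toy data
```

Values around 0.90 and 0.83, as reported for title/abstract and full-text screening, indicate strong agreement beyond chance.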
Data extraction: Publication and study details; participant characteristics; CA intervention characteristics (deployment, role, target condition, safety measures); CA design features (name, platform, AI model/technique, interaction mode, personalization, emotion AI, etc.); therapeutic orientation; user engagement and experience metrics; outcomes and measures; theoretical mechanisms; moderators/mediators. Open-ended user feedback was analyzed via inductive thematic analysis.
Risk of bias: Cochrane tool across domains; overall risk judged low only if all domains low; high if any non-performance domain high; otherwise “some concerns.” Performance bias was common due to challenges in blinding CA interventions.
Meta-analysis: Conducted in R (metafor). Primary outcomes: pooled effects on (1) psychological distress (distress, depression, anxiety, stress) and (2) psychological well-being (well-being, positive/negative affect, resilience, mental health self-efficacy). Hedges’ g with 95% CIs computed from post-intervention means/SDs (transformed when needed). Random-effects models a priori; three-level random-effects models accounted for multiple effect sizes per study. Heterogeneity: Q, I²; publication bias: Egger’s test. Sensitivity: leave-one-out analyses.
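To make the pooling arithmetic concrete, here is a simplified Python sketch of Hedges' g (standardized mean difference with small-sample correction) and DerSimonian-Laird random-effects pooling with Q and I². This is an illustration of the standard formulas, not the authors' three-level metafor models; all input numbers are invented:

```python
import math

def hedges_g(m1, sd1, n1, m2, sd2, n2):
    """Hedges' g and its approximate sampling variance from group summaries."""
    df = n1 + n2 - 2
    sp = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df)  # pooled SD
    d = (m1 - m2) / sp
    j = 1 - 3 / (4 * df - 1)          # small-sample correction factor
    g = j * d
    var = (n1 + n2) / (n1 * n2) + g**2 / (2 * (n1 + n2))
    return g, var

def pool_random_effects(effects, variances):
    """DerSimonian-Laird random-effects pooled estimate, 95% CI, Q, and I^2."""
    k = len(effects)
    w = [1 / v for v in variances]
    fixed = sum(wi * y for wi, y in zip(w, effects)) / sum(w)
    q = sum(wi * (y - fixed)**2 for wi, y in zip(w, effects))
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)            # between-study variance
    w_star = [1 / (v + tau2) for v in variances]
    pooled = sum(wi * y for wi, y in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    i2 = max(0.0, (q - (k - 1)) / q) * 100 if q > 0 else 0.0
    return pooled, (pooled - 1.96 * se, pooled + 1.96 * se), q, i2

g, v = hedges_g(10, 2, 50, 9, 2, 50)              # g close to 0.5
summary = pool_random_effects([0.5, 0.7, 0.3], [0.04, 0.05, 0.04])
```

The review's actual models are three-level (effect sizes nested within studies), which this flat sketch does not capture, but the weighting logic is the same.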
Subgroup analyses: Participant characteristics (gender, age group, health status), study design (control group type), and CA features (response generation: retrieval vs generative; interaction mode: text vs multimodal/voice; delivery platform: smartphone/tablet, instant messenger, web, robot). Mixed-effects models for categorical moderators; meta-regression for continuous (gender). GRADE used to rate evidence quality for primary outcomes.
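A subgroup (moderator) analysis asks whether pooled effects differ across categories such as retrieval vs generative response generation. As a deliberately simplified sketch (fixed-effect weights within subgroups and a chi-square Q-between test, rather than the review's mixed-effects three-level models; all numbers are invented):

```python
def subgroup_q_between(groups):
    """groups: dict mapping subgroup name -> list of (effect, variance) pairs.
    Returns per-subgroup pooled (mean, variance) and the Q-between statistic,
    compared against a chi-square with (number of subgroups - 1) df."""
    pooled = {}
    for name, pairs in groups.items():
        w = [1.0 / v for _, v in pairs]                     # inverse-variance weights
        mean = sum(wi * y for wi, (y, _) in zip(w, pairs)) / sum(w)
        pooled[name] = (mean, 1.0 / sum(w))
    grand_w = [1.0 / v for _, v in pooled.values()]
    grand = sum(wi * y for wi, (y, _) in zip(grand_w, pooled.values())) / sum(grand_w)
    # Weighted squared deviations of subgroup means from the grand mean.
    q_between = sum((y - grand) ** 2 / v for y, v in pooled.values())
    return pooled, q_between

# Hypothetical moderator with clearly separated subgroup effects:
pooled, qb = subgroup_q_between({
    "generative": [(1.2, 0.04), (1.3, 0.05)],
    "retrieval": [(0.5, 0.04), (0.55, 0.05)],
})
# qb well above 3.84 (chi-square critical value, df=1) signals moderation.
```

The review instead fits mixed-effects models (F-tests) and meta-regression for the continuous moderator (gender), but the underlying question, whether subgroup means differ beyond sampling error, is the same.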
Key Findings
Study sample: 35 studies across 15 countries/regions (2017–2023; 27 since 2020), N=17,123 participants; 16 RCTs and 19 quasi-experiments; most sample sizes <200; age range 10.7–92 years. For meta-analysis: 15 RCTs, N=1744 participants.
Interventions and design: 23 distinct CAs; 32/35 stand-alone; primary roles psychotherapy/psychoeducation (22/35); platforms: smartphone/tablet (16), instant messengers (9), robots (5), web (3), VR (1), EMDR (1); response generation: retrieval-based (30) vs generative (5); many used NLP; emotion AI in 8 studies; safety features in 15 studies.
Primary meta-analyses:
- Psychological distress: g = 0.70 (95% CI 0.18–1.22), a significant benefit of AI-based CAs vs controls; high heterogeneity (Q = 267.98, I² = 95.3%); Egger's test showed no clear publication bias. Leave-one-out sensitivity estimates ranged from 0.529 to 0.787; excluding two influential generative-CA studies modestly reduced the pooled effect, which remained significant (g = 0.529–0.564).
- Psychological well-being: g = 0.32 (95% CI -0.13 to 0.78), not significant; high heterogeneity (Q = 85.7, I² = 91.3%).
Specific outcomes (≥3 trials):
- Depression: g = 0.644 (95% CI 0.17–1.12), significant improvement.
- Anxiety: g = 0.65 (95% CI -0.46 to 1.77), not significant.
- Positive affect: g = 0.07 (95% CI -0.43 to 0.57), not significant.
- Negative affect: g = 0.52 (95% CI -0.67 to 1.71), not significant.
Moderators (distress outcome):
- Response generation: generative CAs g = 1.244 vs retrieval-based g = 0.523 (F(2,19)=4.883, p=0.019).
- Interaction mode: multimodal/voice g = 0.828 vs text g = 0.665 (F(2,19)=3.655, p=0.045).
- Delivery platform: smartphone/tablet g = 0.963; instant messengers g = 0.751; web-based g = -0.075 (F(3,18)=3.261, p=0.046).
- Age group: middle-aged/older adults g = 0.846 vs adolescents/young adults g = 0.64 (F(2,19)=3.691, p=0.044).
- Health status: clinical/subclinical g = 1.069 vs non-clinical g = 0.107 (F(2,19)=7.152, p=0.005).
- Gender: no moderation (F(1,19)=0.105, p=0.749).
- Control type: no significant differences (F(5,20)=2.598, p=0.06).
Moderators (well-being outcome): no significant subgroup effects across participant or CA features, or control types.
User experience synthesis: Positive experiences tied to therapeutic alliance (empathic, non-judgmental communication; accountability; human-like presence; tailored feedback), content (therapeutic techniques; richness), learning process, accessibility, and text-based interactions (single study). Negative experiences centered on communication breakdowns (misunderstanding/misresponding), content relevance/format issues, impersonal feel, limited initiative, preference for human support in severe cases, and technical glitches.
Evidence quality: GRADE rated as moderate for distress and low for well-being due to heterogeneity and imprecision. Risk of bias overall: low in 2 studies, high in 5, remaining with some concerns; performance bias common; attrition issues present in several trials.
Discussion
Findings indicate AI-based CAs meaningfully reduce psychological distress, with effect sizes larger than prior reviews dominated by rule-based systems, suggesting advantages of advanced AI in simulating human-like, adaptive dialogue. Generative CAs and multimodal/voice interfaces likely enhance social presence and personalization, improving outcomes. Mobile and instant messaging platforms may increase accessibility and engagement compared to web-based delivery. Benefits were greater among clinical/subclinical and older adult populations, consistent with greater room for improvement or higher engagement, while gender did not moderate effects. Despite improvements in distress, well-being effects were not significant, possibly due to fewer trials and the relative stability of well-being constructs requiring longer-term interventions. User experience themes underscore the importance of therapeutic alliance, content quality, and avoiding communication breakdowns as active ingredients aligning with psychotherapy common factors. Overall, results support AI-based CAs as accessible adjuncts to address treatment gaps, while highlighting the need to optimize CA design (e.g., generative approaches, multimodality), deployment channels, and targeting to populations most likely to benefit.
Conclusion
AI-based conversational agents demonstrate significant benefits in alleviating psychological distress, with strongest effects for generative, multimodal/voice CAs delivered via mobile or instant messaging platforms and among clinical/subclinical and older adult populations. However, evidence for improving psychological well-being was not significant. User experience depends on therapeutic relationship quality, engaging content, and robust, contextually appropriate communication. Future research should: (1) elucidate mechanisms of change and mediators (e.g., therapeutic alliance, engagement); (2) standardize outcomes and methods; (3) assess long-term effects and durability across diverse mental health outcomes; (4) rigorously evaluate safety and ethical integration of generative AI/LLMs; (5) compare specific AI models and design features; and (6) identify conditions necessitating human support vs CA-only interventions.
Limitations
- High heterogeneity across studies, outcomes, and measures limited precision and comparability; GRADE downgraded for inconsistency and imprecision.
- Limited number of RCTs assessing psychological well-being reduces statistical power to detect effects.
- Few studies reported follow-up outcomes with variable durations, precluding meta-analysis of long-term effects.
- English-only inclusion may omit relevant non-English studies, limiting generalizability.
- Performance bias due to infeasibility of blinding in CA interventions; attrition and missing-data handling introduced additional bias in some trials.
- Small number of generative CA studies prevented analysis of effects by specific AI models/LLMs.
- Narrative synthesis of user experiences relies on original study reporting quality and may reflect methodological limitations.