Medicine and Health
Large Language Models for Mental Health Applications: Systematic Review
Z. Guo, A. Lai, et al.
The study addresses whether and how large language models (LLMs) can be effectively and safely used in mental health for early screening, digital interventions (eg, conversational agents), and clinical support. It situates the research within the global burden of mental illness, substantial care gaps, stigma-related barriers to seeking help, and rising demand exacerbated by COVID-19. LLMs offer accessible, scalable, nonjudgmental support and potential clinical utility (eg, assessment, feedback), but also pose risks such as inconsistencies, hallucinations, and ethical concerns. Motivated by rapid advances since the transformer architecture (2017) and the surge in interest following ChatGPT’s release, the review systematically synthesizes evidence on models, methods, data sources, and outcomes to evaluate applicability, efficacy, and readiness for clinical use.
Prior work demonstrated the effectiveness of machine learning and NLP in mental health (eg, text categorization, sentiment analysis) and offered general reviews of LLMs in medicine and scoping reviews in mental health. However, the field still lacked a systematic, post-transformer assessment of state-of-the-art LLMs applied specifically to mental health, conducted with rigorous inclusion and exclusion criteria, risk-of-bias assessment, and exclusion of preprints. Existing studies also raised issues around interpretability, ethical use, and clinical integration. This review builds on that literature by comprehensively cataloging applications (screening, conversational agents, clinical decision support), evaluating methodologies and datasets, and appraising risks and ethical challenges.
- Design: Systematic review registered with PROSPERO (CRD42024508617) and conducted per PRISMA guidelines.
- Search strategy: Five databases (MEDLINE via PubMed, IEEE Xplore, Scopus, JMIR, ACM Digital Library) were searched August 3-6, 2024, using (mental health OR mental illness OR mental disorder OR psychiatry) AND (large language models). Database-specific scopes: all metadata (MEDLINE, IEEE Xplore), titles/abstracts/keywords (Scopus), exact match criteria (JMIR), and full text (ACM DL).
- Eligibility: English-language, peer-reviewed articles (original research and reviews) published January 1, 2017, to April 30, 2024, focusing on LLM applications in mental health. Exclusions: non–peer-reviewed works (including preprints), non-LLM or non–mental health topics, duplicates, and very short or nonacademic items.
- Screening: Citations were imported into Zotero, deduplicated, and double-screened by two reviewers (ZG, KL); interrater reliability was measured with Cohen’s kappa (κ=0.84), and disagreements were resolved by a third reviewer (AL).
- Risk of bias: Assessed with the Cochrane Risk of Bias 2 tool (visualized in Multimedia Appendix 2), covering randomization, deviations from intended interventions, missing data, outcome measurement, and selective reporting.
- Data extraction and synthesis: Two reviewers (ZG, KL) extracted application scenarios, model architectures, data sources, methods, and outcomes. Studies were grouped into three categories: (1) detection of mental health conditions and suicidal ideation from text; (2) LLMs as mental health conversational agents; (3) other applications and evaluations of LLMs in mental health. Public datasets were summarized where available; narrative synthesis integrated quantitative results and performance metrics.
- PRISMA flow: 14,265 records identified; 13,967 remained after deduplication; 590 abstracts screened; 101 full texts assessed; 40 studies included.
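The interrater reliability figure above (Cohen’s κ=0.84) quantifies agreement between the two screeners beyond chance. A minimal sketch of how such a value can be computed from paired include/exclude decisions using scikit-learn; the decisions shown are illustrative placeholders, not the review’s data:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical include/exclude decisions from two independent screeners
# (1 = include, 0 = exclude); the actual review double-screened 590 abstracts.
reviewer_zg = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
reviewer_kl = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]

# Cohen's kappa = (observed agreement - chance agreement) / (1 - chance agreement)
kappa = cohen_kappa_score(reviewer_zg, reviewer_kl)
print(f"Cohen's kappa: {kappa:.2f}")  # values above ~0.80 are usually read as near-perfect agreement
```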
- Corpus and categories: 40 studies included: 15 (38%) on detection of mental health conditions/suicidal ideation from text; 7 (18%) on conversational agents (CAs); 18 (45%) on other applications and evaluations (eg, diagnosis, treatment guidance, education).
- Models and data: BERT and its variants were the most commonly used models (25% of studies), and GPT-family models were also widely used (20%). Many studies used social media data (Twitter, Reddit, Weibo) in English, Malay, Chinese, and Portuguese; several used clinical transcripts or interviews.
- Depression and mental health screening from text (the reported metrics are illustrated in the sketch after this list):
- GPT-3.5 achieved F1=0.78 on DAIC-WOZ for depression detection, surpassing CNN baselines (F1=0.53 and 0.59) (Danner et al).
- A RoBERTa-based approach achieved the lowest MAE (3.65) when predicting PHQ-8 scores and classifying three depression levels on E-DAIC (Sadeghi et al).
- PsychBERT (pretrained on social media and academic literature) obtained F1=0.63 across multiple mental health conditions, outperforming CNN (0.57) and LSTM (0.51) baselines (Vajre et al).
- Emotion/sentiment: an emotion-aware BERT Tiny model reached 93.14% accuracy on sentiment and 85.46% on emotion classification, outperforming BERT baselines (Stigall et al).
- Suicidal ideation detection:
- BERTimbau Large (Portuguese) achieved accuracy of 0.955, precision of 0.961, and F1 of 0.954 (Diniz et al).
- BERT correctly classified 88.5% of tweets as suicide-related or off-topic, performing comparably to human analysts (Metzler et al).
- GPT-4 suicide risk assessments aligned with clinicians’ ratings but tended to overestimate suicidal ideation (Levkovich et al).
- Conversational agents (CAs):
- VHope (DialoGPT-based) received expert ratings of 67% for relevance, 78% for human-likeness, and 79% for empathy.
- In 717 evaluations of 239 autism-related questions, raters preferred chief physicians’ answers in 46.86% of cases, GPT-4 in 34.87%, and ERNIE Bot in 18.27%; ChatGPT nonetheless showed higher empathy ratings (mean 3.64, 95% CI 3.57–3.71) than physicians (3.13, 95% CI 3.04–3.21).
- Safety gap: of 25 CAs tested in suicide scenarios, only 2 referred users to a suicide hotline during the conversation (Heston).
- Non-English challenges: Translating English datasets to Polish reduced accuracy from ~90% to ~80% for emotion tasks (Zygadlo et al), highlighting multilingual data scarcity.
- Other applications/evaluations:
- On 100 psychiatry case vignettes, ChatGPT received a grade of “A” in 61 cases, with no responses rated unacceptable (Franco D’Souza et al).
- On neurology board-style questions, GPT-4 answered 85% correctly versus a human average of 73.8% (Schubert et al).
- Depression prognosis/treatment recommendations: GPT-4, Claude, and Bard showed strong agreement with clinicians, recommending psychotherapy plus antidepressants in every case (Elyoseph et al).
- Risks: Prompt sensitivity led to inconsistent or inappropriate medication suggestions; GPT-4 Turbo suggested less suitable or contraindicated medications in 12% of cases (Perlis et al). ChatGPT occasionally produced hazardous advice (eg, recommending high-risk activities to a person with depression) despite safeguards (Grabb).
- Overall: LLMs are promising for detection, triage, education, and supportive interactions with evidence of empathy and diagnostic utility; however, current clinical risks (hallucinations, inconsistency, limited judgment, safety gaps) may outweigh benefits without supervision and robust frameworks.
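Most detection results above are reported as accuracy, precision, and F1 (classification) or MAE (PHQ-8 score regression). A minimal sketch, using scikit-learn on made-up labels, of how these metrics are typically computed when evaluating a text-based screening model; the arrays are illustrative placeholders, not data from any included study:

```python
from sklearn.metrics import accuracy_score, precision_score, f1_score, mean_absolute_error

# Hypothetical gold labels and model predictions for a binary screening task
# (1 = positive for depression or suicidal ideation, 0 = negative).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))

# Hypothetical PHQ-8 severity scores (0-24) for the regression-style evaluation
# reported as MAE (mean absolute error) on E-DAIC.
phq8_true = [4, 12, 19, 7, 15]
phq8_pred = [6, 10, 16, 9, 14]
print("MAE      :", mean_absolute_error(phq8_true, phq8_pred))
```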
Findings indicate LLMs can accurately analyze text for early screening of depression and suicidal ideation, exhibit empathetic interaction qualities, and provide useful clinical guidance in controlled scenarios. These outcomes support the review’s hypothesis that LLMs hold substantial potential to augment mental health services across early detection, digital interventions, and clinical decision support. However, significant barriers prevent routine clinical deployment: sensitivity to prompts, hallucinations, variable judgment, limited crisis-handling capability, and ethical and privacy concerns. The evidence base is skewed toward English and social media data with limited multilingual clinical datasets, and most studies focus on BERT/GPT models. Thus, while LLMs can complement clinicians and improve accessibility and efficiency, their outputs require professional oversight, and their use should be carefully restricted and monitored in high-risk contexts. The results underscore the need for standardized evaluations, safety benchmarks, interpretability efforts, and ethical frameworks before broader clinical integration.
This systematic review synthesizes evidence on LLM applications in mental health, spanning text-based screening, suicidal ideation detection, conversational agents, and clinical decision support. LLMs show promising accuracy, empathy, and potential utility but are constrained by risks of hallucinations, inconsistency, limited clinical judgment, data privacy and bias concerns, and insufficient ethical governance. LLMs should not replace professional care but may serve as valuable aids under clinician supervision. Future work should prioritize: (1) curated, expert-annotated, multilingual mental health datasets; (2) safety and efficacy evaluations tailored to clinical contexts; (3) techniques to reduce hallucinations and improve reliability (eg, retrieval-augmented generation, parameter tuning, model updates); (4) privacy-preserving methods (eg, differential privacy, federated learning, robust de-identification); (5) interpretability via documentation, audits, attention analyses, and knowledge graphs; and (6) development of benchmarked ethical frameworks and collaborative governance for safe integration.
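Point (3) above names retrieval-augmented generation (RAG) as one way to ground responses and reduce hallucinations. A minimal sketch of the retrieval step, assuming a small corpus of vetted guideline snippets and TF-IDF similarity; the snippets, the helper names retrieve and build_prompt, and the choice of TF-IDF are illustrative assumptions, and the final LLM call is left to whichever API is in use:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical corpus of clinician-approved guideline snippets (not from the review).
corpus = [
    "PHQ-9 scores of 10 or above warrant a clinical assessment for depression.",
    "Any expression of suicidal ideation should trigger an immediate crisis referral.",
    "First-line treatment for moderate depression combines psychotherapy with antidepressants.",
]

vectorizer = TfidfVectorizer()
corpus_matrix = vectorizer.fit_transform(corpus)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k corpus snippets most similar to the query."""
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, corpus_matrix)[0]
    top = scores.argsort()[::-1][:k]
    return [corpus[i] for i in top]

def build_prompt(query: str) -> str:
    """Prepend retrieved evidence so the model answers from vetted text."""
    context = "\n".join(f"- {snippet}" for snippet in retrieve(query))
    return f"Use only the evidence below to answer.\nEvidence:\n{context}\nQuestion: {query}"

print(build_prompt("What should happen when a user mentions suicidal thoughts?"))
# The assembled prompt would then be sent to the chosen LLM, with clinician oversight of outputs.
```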
Limitations of LLM use in mental health include prompt sensitivity, inconsistent outputs, hallucinations, lack of clinical judgment and crisis-handling, training-data bias (including language and demographic biases), privacy concerns, and the black-box nature of many models. Safety issues persist, such as occasional inappropriate or hazardous recommendations and poor referral behavior in high-risk scenarios. Limitations of the evidence base include reliance on social media data skewed to younger and English-speaking populations, scarcity of expert-annotated multilingual datasets, and concentration on a narrow set of models (primarily BERT/GPT). Rapid LLM evolution complicates longitudinal comparisons; different versions (eg, GPT-3.5 vs GPT-4) can yield divergent results. The review’s inclusion period and focus on peer-reviewed literature may omit very recent or preprint findings.