Large language models could change the future of behavioral healthcare: a proposal for responsible development and evaluation

Psychology

E. C. Stade, S. W. Stirman, et al.

Discover how large language models like OpenAI's GPT-4 and Google's Gemini could transform behavioral healthcare by enhancing psychotherapy practices. This insightful paper by Elizabeth C. Stade, Shannon Wiltsey Stirman, and their colleagues examines the responsible development and evaluation of clinical LLMs, detailing their potential benefits and the challenges we face.

~3 min • Beginner • English
Introduction
The paper addresses how large language models could responsibly support, augment, or automate components of psychotherapy within behavioral healthcare. It situates LLMs as transformative technologies capable of understanding and generating human-like text, with early attempts already appearing in clinical and quasi-therapeutic contexts. Given the high-stakes, nuanced nature of psychotherapy, where errors can include mishandling suicide or homicide risk, the authors argue for a cautious, evidence-based approach to development and deployment. They provide an overview of the technology, outline stages for integrating LLMs into psychotherapy, and propose best practices for development and evaluation that harness potential benefits (e.g., scalability, personalization) while mitigating risks (e.g., safety, ethics, bias).
Literature Review
The paper reviews prior applications of AI and NLP in behavioral health, including suicide risk detection, identification of therapy homework assignments, and emotion detection in psychotherapy sessions. It contrasts rule-based mental health chatbots (e.g., Woebot, Tessa) with LLM-driven systems, noting current limitations such as poor handling of unanticipated user inputs and high dropout/low engagement in existing apps. Early LLM applications include enhancing empathic expressions in peer counseling, identifying therapist and client behaviors in motivational interviewing, and hybrid systems like Youper. The authors highlight psychotherapy's complexity and risk profile relative to other domains (e.g., productivity tools, autonomous vehicles), underscoring the need for clinical oversight and interdisciplinary collaboration. They also explain how LLMs are trained and adapted (transformers, pretraining, prompting, fine-tuning, RLHF) and why psychotherapy transcripts are underrepresented in training data, which limits LLMs' clinical proficiency.
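To make these adaptation methods concrete, the sketch below shows what zero-shot prompting for one of the reviewed tasks, coding therapist behaviors in motivational interviewing, might look like. This is a minimal illustration under assumptions, not the implementation from any study cited here: the label set, prompt wording, and `classify_utterance` helper are hypothetical, and it uses the OpenAI Python client's chat-completions interface with a placeholder model name.

```python
# Minimal sketch: zero-shot prompting for MI behavior coding.
# Hypothetical example; the label set and prompt are illustrative,
# not a pipeline from the paper or the studies it reviews.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

MI_LABELS = ["reflection", "open question", "affirmation", "advice", "other"]

def classify_utterance(utterance: str) -> str:
    """Ask the model to assign one motivational-interviewing code."""
    prompt = (
        "You are annotating psychotherapy transcripts. "
        f"Label the therapist utterance with exactly one of: {', '.join(MI_LABELS)}.\n"
        f"Utterance: {utterance!r}\nLabel:"
    )
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic labels for evaluation
    )
    return response.choices[0].message.content.strip().lower()

print(classify_utterance("It sounds like part of you wants to cut back on drinking."))
```

Fine-tuning and RLHF would instead adapt the model's weights on curated clinical examples; the authors note that scarce, sensitive psychotherapy transcripts make assembling such datasets a central challenge.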
Methodology
This is a conceptual and proposal paper rather than an empirical study. The authors present: (1) a technical overview of clinical LLMs and their training/adaptation methods; (2) a staged integration framework for psychotherapy (assistive, collaborative, fully autonomous) with analogies to autonomous vehicles; (3) a survey of imminent, mid-, and long-term applications across clinical care, training, and research; (4) design criteria and recommendations for responsible development and evaluation (risk-first evaluation, focus on evidence-based practices, interdisciplinary collaboration, trust and usability). They synthesize existing literature, propose evaluation and design frameworks, and articulate anticipated risks, ethical considerations, and research directions.
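As a concrete reading of the staged framework, the sketch below shows one way an application could gate LLM output by stage: assistive drafts route only to the clinician, collaborative output requires explicit clinician approval before release, and fully autonomous delivery stays disabled pending the safety evidence the authors call for. The `Stage` enum and `route_output` function are hypothetical names for illustration; the paper proposes the stages, not this code.

```python
# Illustrative sketch of the assistive / collaborative / autonomous stages.
# All names here are hypothetical; the paper defines the stages, not this code.
from enum import Enum, auto

class Stage(Enum):
    ASSISTIVE = auto()      # machine-in-the-loop: clinician does the clinical work
    COLLABORATIVE = auto()  # human-in-the-loop: clinician reviews and tailors drafts
    AUTONOMOUS = auto()     # end-to-end AI care; not yet safety-vetted

def route_output(stage: Stage, draft: str, clinician_approved: bool = False) -> str:
    """Decide what happens to an LLM-generated draft at each integration stage."""
    if stage is Stage.ASSISTIVE:
        # Low-risk, modular output (e.g., a documentation summary) goes to the
        # clinician as raw material; it never reaches the patient directly.
        return f"[TO CLINICIAN ONLY] {draft}"
    if stage is Stage.COLLABORATIVE:
        # More integrated output may reach the patient, but only after review.
        if not clinician_approved:
            raise PermissionError("Clinician review required before release.")
        return draft
    # Fully autonomous delivery stays disabled pending risk-first evaluation.
    raise NotImplementedError("Autonomous delivery requires prior safety evidence.")

print(route_output(Stage.ASSISTIVE, "Session summary: patient reports improved sleep."))
```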
Key Findings
• Staged integration framework:
- Stage 1 (Assistive AI, machine-in-the-loop): low-risk, modular tasks (e.g., intake support, psychoeducation, activity planning, basic assessment administration, documentation summarization). Risks include supervision overhead for clinicians.
- Stage 2 (Collaborative AI, human-in-the-loop): more integrated functions (e.g., LLM-recommended assessments with context, structured interventions like CBT-I components, real-time feedback on worksheets), with clinician review and tailoring. Risks include insufficient review by busy clinicians and drafts requiring significant corrections.
- Stage 3 (Fully autonomous AI): end-to-end assessment, case conceptualization, and intervention selection and delivery. Significant risks include missed critical information (e.g., suicide risk) and harmful care; feasibility for complex, comorbid cases is uncertain.
• Imminent applications (Table 2):
- Automating administrative tasks and clinical documentation using session recordings; generating patient handouts/summaries.
- Measuring treatment fidelity (adherence, competence, differentiation, receipt) more efficiently than traditional methods.
- Providing real-time feedback on therapy homework/worksheets.
- Automating aspects of supervision and training, especially for peer counselors and trainees (e.g., suggesting improved responses, highlighting strengths and areas for improvement).
• Long-term applications:
- Decision aids to optimize delivery of EBPs and tailor interventions to patient characteristics and context.
- Fully autonomous care delivering comprehensive, protocolized psychotherapy, subject to rigorous safety and ethical vetting.
- Discovery and development of new therapeutic techniques and mechanisms via data-driven analyses of large psychotherapy corpora; potential to reverse-engineer modular EBPs.
- Progress toward precision medicine in psychotherapy by integrating assessments and outcomes and tailoring to comorbidities and demographics.
• Design criteria for clinical LLMs:
- Detect risk of harm (suicidality, homicide, abuse/violence) using robust memory/context and vector databases; prompt clinicians with legal/ethical guidance and support safety planning (a minimal guardrail sketch follows this list).
- Be "healthy" (avoid undesirable behaviors), with monitoring, auditing, and guardrails.
- Aid psychodiagnostic assessment and diagnosis (e.g., support SCID-like interviews), enabling appropriate intervention selection and outcome monitoring.
- Be responsive and flexible (motivational enhancement, shared decision-making, second-line options).
- Stop when not helping or not confident; facilitate referrals and higher levels of care.
- Be fair, inclusive, and bias-mitigated; monitor and address error/outcome disparities for minoritized groups.
- Be empathetic to a prudent degree, balancing human-likeness against its risks.
- Be transparent about being AI and obtain informed consent.
• Development and evaluation recommendations:
- Focus first on evidence-based practices and their common elements; avoid optimizing for engagement alone.
- Commit to risk-first evaluation (harm minimization), followed by feasibility, acceptability, and effectiveness in head-to-head comparisons with standard treatments; adopt commonsense, risk-graded evaluation standards.
- Involve robust interdisciplinary collaboration between clinicians and technologists in prompting, few-shot examples, fine-tuning datasets, and RLHF; prioritize high-quality data.
- Focus on trust and usability for clinicians and patients; gather data on acceptability and trust to safely transition from assistive to collaborative applications.
• Anticipated impacts:
- Potential changes to workforce structure (greater supervision loads, risk of burnout, shifting roles to top-of-license practice).
- Potential to scale training, fidelity assessment, and clinical trials (large-N, dismantling/mechanism studies) and to challenge core assumptions about the therapeutic alliance and the necessity of humans in psychotherapy.
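The risk-detection criterion above lends itself to a simple guardrail pattern: screen every incoming message before any model-generated reply is released, and escalate to a clinician when the screen fires. The sketch below is a minimal, hypothetical illustration of that pattern; `screen_for_risk` and `notify_clinician` are assumed names, and a real system would use a validated risk classifier with persistent memory/context and the legal/ethical guidance the authors describe, not a keyword list.

```python
# Minimal guardrail sketch: screen for risk before any LLM reply is released.
# Hypothetical illustration only; a real system needs a validated classifier,
# persistent memory/context, and clinician-facing legal/ethical guidance.
RISK_PHRASES = ("kill myself", "end my life", "hurt someone", "no reason to live")

def screen_for_risk(message: str) -> bool:
    """Crude stand-in for a validated suicide/violence risk classifier."""
    text = message.lower()
    return any(phrase in text for phrase in RISK_PHRASES)

def notify_clinician(message: str) -> None:
    # Placeholder: page the on-call clinician and log the event for audit.
    print(f"[ALERT] Clinician review needed: {message!r}")

def handle_message(message: str, generate_reply) -> str:
    """Gate the LLM: flagged messages route to a clinician, not the model."""
    if screen_for_risk(message):
        notify_clinician(message)
        return ("I'm concerned about your safety. A clinician is being notified, "
                "and here are crisis resources you can contact right now.")
    return generate_reply(message)

print(handle_message("I feel like there's no reason to live.", lambda m: "..."))
```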
Discussion
The proposed staged integration and design criteria directly address the central challenge of safely leveraging LLMs in psychotherapy by prioritizing risk detection, ethical safeguards, and clinical efficacy over engagement metrics. Assistive and collaborative stages enable incremental deployment with clinician oversight, reducing risk while delivering practical benefits (documentation, fidelity monitoring, training support, homework feedback). Emphasizing EBPs ensures clinical relevance and minimizes harm, while risk-first evaluation aligns with regulatory and ethical norms. Interdisciplinary development bridges gaps between technical capability and clinical nuance, promoting usable and trustworthy tools. The framework anticipates systemic impacts on workforce and care delivery, advocating for proactive research on supervision capacity, burnout mitigation, and acceptable autonomy levels. Long-term, clinical LLMs could catalyze precision psychotherapy and next-generation clinical science (e.g., large-scale trials and mechanism discovery), potentially revising assumptions about alliance and human factors in therapeutic change. Nonetheless, full autonomy remains uncertain due to complex case conceptualization, context sensitivity, and nonverbal/relational skills that are currently beyond LLMs.
Conclusion
LLMs could substantially augment and, in some instances, partially automate psychotherapy, improving accessibility, consistency, and scalability of care. This paper contributes a roadmap: a staged integration model, imminent and long-term application domains, concrete design criteria, and a risk-first evaluation strategy centered on EBPs, clinical outcomes, safety, transparency, trust, and bias mitigation. It calls for deep interdisciplinary collaboration and commonsense evaluation to ensure ethical, effective deployment. Future work should empirically test clinical LLMs against standard care, develop robust risk detection and diagnostic interviewing, build high-quality, consented datasets for fine-tuning, and investigate precision-tailoring and mechanism discovery. Research should also examine workforce implications, acceptable autonomy thresholds, and the nature and sufficiency of therapeutic alliance with AI systems.
Limitations
The paper is conceptual and does not present empirical evaluations or datasets. Many proposed applications—especially fully autonomous psychotherapy—are theoretical and untested. Safety, legal accountability, and ethical implications of autonomous systems remain unresolved. LLMs currently have limited memory/context, potential undesirable behaviors, and known biases; psychotherapy transcripts are underrepresented in training data. The generalizability of LLM-driven interventions to complex, comorbid cases and diverse populations is uncertain. Nonverbal and relational competencies (e.g., alliance rupture repair) are not yet within LLM capabilities. Implementation details (e.g., data governance, privacy, consent for training) and real-world clinical outcomes require rigorous study.