Evaluation of large language models on mental health: from knowledge test to illness diagnosis

Computer Science

Y. Xu, Z. Fang, et al.

Large language models are reshaping mental health tools. This study systematically evaluates 15 state-of-the-art LLMs (including DeepSeek-R1/V3, GPT-4.1, Llama4, and Alibaba's QwQ) on knowledge testing and diagnostic tasks using Chinese datasets. Results show DeepSeek-R1, QwQ, and GPT-4.1 lead in accuracy, offering guidance for safer model selection.

Introduction
Large language models (LLMs) have rapidly advanced and show strong potential for applications in mental health assessment, counseling, and education. Beyond general-purpose models like GPT-3/4, domain-specific mental health models (e.g., ZhiXin, SoulChat) have emerged. However, the professional competence and risks of LLMs in mental health remain uncertain, and prior evaluations often used outdated models, focused on single aspects, or relied on subjective, manual assessments. This study aims to evaluate state-of-the-art LLMs in Chinese mental health scenarios across two tasks: mental health knowledge assessment and mental illness diagnostic support. The research questions are: (1) How do the latest LLMs compare in knowledge and diagnostic capabilities? (2) What factors influence performance beyond parameter size? The authors hypothesize that while larger models may generally perform better, architectural innovations and fine-tuning strategies also play significant roles. The study fills a gap by comprehensively evaluating recent models (e.g., DeepSeek-R1/V3, GPT-4.1, Llama4, QwQ) using curated datasets and automated, standardized prompts.
Literature Review
The paper surveys emerging work evaluating LLMs for mental health applications. Early studies assessed GPT-3.5-Turbo for zero-shot binary classification of stress, depression, and suicidality in social media text, later expanding to affective computing tasks like personality trait prediction, sentiment analysis, and suicide risk detection. Other work tested LLM responsiveness to psychological prompts but was limited to few models (e.g., GPT-4, ChatGPT). Many prior evaluations used older models (GPT-3.5, Llama 2) and often examined a single aspect (e.g., conversational ability) or used manual, subjective evaluations. Domain-specific models such as ZhiXin and SoulChat were developed with psychiatric datasets and empathy-focused fine-tuning, demonstrating improved diagnosis or empathetic responses. Recent analyses highlight both opportunities and risks of LLMs in digital mental health, including hallucinations, bias, and privacy concerns. This study responds to these gaps by evaluating newer, high-performing models on multiple tasks with automated protocols.
Methodology
Assessment tasks: The study evaluates LLMs on two tasks: (1) a mental health knowledge test using Chinese Academy of Sciences (CAS) Counsellor Qualification Exam questions (single- and multiple-choice) that assess mastery of personality, social, and developmental psychology, mental disorders, counseling theory, psychological assessment, basic skills, and counseling methods; and (2) a mental illness diagnostic test using Chinese translations of the social media datasets Dreaddit and SDCNL, in which models analyze posts and provide possible diagnoses and recommendations. Performance is compared against the reference labels, with accuracy as the primary indicator (recall was also planned), focusing on effectiveness for clinical support.

Datasets: (a) CAS Counselor Qualification Exam Dataset (2023–2024): 744 single-choice and 200 multiple-choice questions covering theory (introduction to psychology, personality and social psychology, developmental psychology, mental health and disorders, introduction to counseling) and operations (counseling theory, psychological assessment, basic counseling skills, counseling methods). (b) Dreaddit: a Reddit-based stress analysis dataset; from the original 190,000-post corpus with labels collected via Amazon Mechanical Turk, a representative subset of 1,151 posts was used, including 64 labeled "stress," 503 "anxiety," and 584 "PTSD." (c) SDCNL: 1,517 Reddit posts from r/SuicideWatch and r/Depression, with 787 labeled "suicide" and 729 "depression," balanced during evaluation.

Evaluated LLMs: Fifteen models spanning families and sizes were tested via APIs: DeepSeek-R1 (671B), DeepSeek-V3 Pro (671B), DeepSeek-R1-1.5B (1.5B), DeepSeek-R1-32B (32B), GPT-4o (~200B), GPT-4.1 (size undisclosed), GLM-4 9B, GLM-4 32B, GLM-Z1 32B, Llama 3.3-70B (70B), Llama4-scout (17B), Gemma-2-27B (27B), Gemma-3-27B (27B), QwQ-32B (32B), and Qwen2.5-72B (72B). Release dates and API providers are listed in Table 1 of the paper.

Prompting and evaluation setup: All models were accessed via their official APIs with default settings (no temperature or system prompt adjustments), and only the first response to each prompt was used. Dataset items were read programmatically from .xls/.csv files and formatted into task-specific prompts using Python, automating the evaluation across datasets to ensure consistent inputs (a minimal sketch of this setup appears at the end of this section). Diagnostic outputs were compared against dataset labels, with accuracy reported (recall was noted as an intended metric).

Analysis: Knowledge test accuracy was computed separately for single-choice and multiple-choice items and summarized into an overall accuracy (Figure 1). Diagnostic accuracy was computed for two subtasks: suicidal tendency detection (SDCNL) and depression diagnosis (Dreaddit). Failure cases were qualitatively analyzed to identify common error sources.
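The paper does not reproduce its evaluation code; the following is a minimal sketch of the kind of automated pipeline described above, assuming an OpenAI-compatible chat API, hypothetical file and column names, and a simplified keyword rule for mapping free-text answers to labels (the paper's exact prompts and parsing are not specified here).

```python
# Minimal sketch of an automated diagnostic evaluation loop.
# Assumptions (not from the paper): an OpenAI-compatible chat API,
# a CSV with hypothetical columns "post" and "label", and a simple
# keyword match to map the model's free-text answer to a label.
import pandas as pd
from openai import OpenAI

client = OpenAI()  # API key / base URL configured via environment variables

PROMPT = (
    "以下是一条社交媒体帖子，请判断发帖人更可能是'抑郁'还是'自杀倾向'，"
    "并给出简短建议。\n帖子：{post}\n回答："
)

def classify(post: str, model: str) -> str:
    """Send one post with default settings and keep only the first reply."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(post=post)}],
    )
    return resp.choices[0].message.content

def evaluate(csv_path: str, model: str) -> float:
    """Run every labeled post through the model and report accuracy."""
    df = pd.read_csv(csv_path)  # hypothetical columns: post, label
    correct = 0
    for _, row in df.iterrows():
        answer = classify(row["post"], model)
        predicted = "suicide" if "自杀" in answer else "depression"
        correct += int(predicted == row["label"])
    return correct / len(df)

# Example usage (hypothetical file name):
# print(evaluate("sdcnl_zh.csv", "gpt-4.1"))
```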
Key Findings
Mental health knowledge test: For single-choice questions (744 items), the top accuracies were DeepSeek-R1 671B at 86.83% and DeepSeek-V3 Pro at 84.68%, with QwQ-32B at 84.27%, GLM-Z1 9B at 78.90%, and GPT-4o at 72.72%. Multiple-choice questions (200 items) were harder: QwQ-32B led at 64.00%, followed by DeepSeek-R1 32B at 57.00%, Qwen2.5-72B and GLM-Z1 9B at 52.50%, GPT-4.1 at 50.00%, and GPT-4o at 44.50%, among others. DeepSeek-R1 671B and DeepSeek-V3 Pro had low multiple-choice accuracies (28.50% and 21.00%), indicating difficulty with multi-select reasoning. Overall knowledge test accuracy (Figure 1) was highest for QwQ-32B (79.98%), followed by Qwen2.5-72B (75.32%), DeepSeek-R1 671B (74.47%), and DeepSeek-V3 Pro (71.19%).

Mental illness diagnosis: On suicidal tendency diagnosis (SDCNL), GPT-4.1 achieved the highest accuracy at 69.53%, followed by DeepSeek-R1 671B at 67.15% and Qwen2.5-72B at 65.70%. On depression diagnosis (Dreaddit), Llama4-scout led with 76.98%, followed by Gemma2-27B at 72.02% and DeepSeek-V3 Pro at 69.59%. Several medium-scale models (e.g., Gemma2-27B) outperformed some larger models, suggesting that architecture and training optimizations can outweigh parameter count.

Error analysis: Misclassifications often stemmed from symptom overlap (e.g., anxiety vs. PTSD), ambiguous or metaphorical language, lack of clinical context, and a tendency to default to high-frequency categories (e.g., "anxiety") when uncertain. Single-turn, text-only inputs limit nuanced assessment.
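As a sanity check, the reported overall knowledge-test accuracies are consistent with an item-weighted average over the 744 single-choice and 200 multiple-choice items; this weighting is an assumption about how Figure 1 aggregates, not something stated explicitly above.

```python
# Check whether the overall figures match an item-weighted average
# of single-choice and multiple-choice accuracy (assumed aggregation).
def overall(acc_single: float, acc_multi: float,
            n_single: int = 744, n_multi: int = 200) -> float:
    return (n_single * acc_single + n_multi * acc_multi) / (n_single + n_multi)

print(round(overall(0.8683, 0.2850), 4))  # DeepSeek-R1 671B -> 0.7447
print(round(overall(0.8427, 0.6400), 4))  # QwQ-32B          -> 0.7998
```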
Discussion
Findings show substantial variability across models and tasks. In knowledge assessments, model size generally correlates with performance, yet multiple-choice results reveal that scale is not sufficient for complex, multi-select reasoning; QwQ-32B excelled overall and in multiple-choice, while large models like DeepSeek-R1 671B underperformed on multiple-choice. In diagnostic tasks, GPT-4.1 led in suicidality detection and Llama4-scout in depression detection, with medium-scale models (e.g., Gemma2-27B) sometimes surpassing larger models. These outcomes address the research questions by demonstrating that architecture, fine-tuning, and training data quality materially affect performance beyond parameter count. The work underscores the promise of modern LLMs for mental health education and triage support while emphasizing current limitations in handling nuanced, ambiguous cases and the need for task-specific optimization. Ethical considerations are critical, including risks of hallucination, bias, and privacy concerns; expert oversight and strong data governance are required for safe deployment.
Conclusion
The study benchmarks 15 contemporary LLMs on Chinese mental health knowledge testing and mental illness diagnosis. Models such as QwQ-32B, DeepSeek-R1 variants, and GPT-4.1 perform strongly in specific areas, but limitations persist—particularly in multiple-choice reasoning and in diagnosing nuanced cases from brief social media posts. Model size correlates with performance but is not determinative; architecture and fine-tuning strategies can yield superior outcomes for mid-scale models. Future research should incorporate clinical data (EHRs, clinical notes), domain-specific fine-tuning, and expert validation to improve reliability, safety, and ethical robustness for real-world applications.
Limitations
Key limitations include: (1) reliance on social media (e.g., Reddit) text that may not reflect clinical presentations, potentially limiting generalizability; (2) single-turn, text-only inputs without clinical context, contributing to ambiguity and symptom overlap errors; (3) inconsistent performance on multiple-choice items, indicating challenges with multi-select reasoning; and (4) ethical and safety concerns inherent to LLMs—hallucinations, training-data biases, and privacy risks—necessitating expert oversight and robust governance. The authors plan to integrate clinical data and expert validation in future work.