Computer Science | Frontiers in Psychiatry

Evaluation of large language models on mental health: from knowledge test to illness diagnosis

Y. Xu, Z. Fang, et al.

Large language models are reshaping mental health tools. This study systematically evaluates 15 state-of-the-art LLMs (including DeepSeek-R1/V3, GPT-4.1, Llama4, and Alibaba’s QwQ) on mental health knowledge testing and mental illness diagnosis in the Chinese context. Results show that DeepSeek-R1, QwQ, and GPT-4.1 lead in both knowledge accuracy and diagnostic performance, offering guidance for safer model selection.


Abstract

Large language models (LLMs) have opened up new possibilities in the field of mental health, offering applications in areas such as mental health assessment, psychological counseling, and education. This study systematically evaluates 15 state-of-the-art LLMs, including DeepSeek-R1/V3 (March 24, 2025), GPT-4.1 (April 15, 2025), Llama4 (April 5, 2025), and QwQ (March 6, 2025, developed by Alibaba), on two key tasks: mental health knowledge testing and mental illness diagnosis in the Chinese context. We use publicly available datasets, including Dreaddit, SDCNL, and questions from the CAS Counsellor Qualification Exam. Results indicate that DeepSeek-R1, QwQ, and GPT-4.1 outperform other models in both knowledge accuracy and diagnostic performance. Our findings highlight the strengths and limitations of current LLMs in Chinese mental health scenarios and provide clear guidance for selecting and improving models in this sensitive domain.

Publisher

Frontiers in Psychiatry

Published On

Aug 06, 2025

Authors

Yijun Xu, Zhaoxi Fang, Weinan Lin, Yue Jiang, Wen Jin, Prasanalakshmi Balaji, Jiangda Wang, Ting Xia

DOI

https://doi.org/10.3389/fpsyt.2025.1646974

Explore these studies to deepen your understanding

Adjacent work that informs or extends this paper's methodology and findings.

Medicine and Health

A framework for human evaluation of large language models in healthcare derived from literature review

T. Y. C. Tam, S. Sivarajkumar, et al.

Education

How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment

A. Gilson, C. W. Safranek, et al.

Psychology

Large language models could change the future of behavioral healthcare: a proposal for responsible development and evaluation

E. C. Stade, S. W. Stirman, et al.
