Computer Science
Frontiers in Psychiatry
Evaluation of large language models on mental health: from knowledge test to illness diagnosis
Y. Xu, Z. Fang, et al.
Large language models are reshaping mental health tools. This study systematically evaluates 15 state-of-the-art LLMs (including DeepSeek-R1/V3, GPT-4.1, Llama4, and Alibaba’s QwQ) on knowledge testing and diagnostic tasks using Chinese datasets. DeepSeek-R1, QwQ, and GPT-4.1 lead in accuracy, a result that offers guidance for safer model selection.
Related Publications
Adjacent work that informs or extends this paper's methodology and findings.
Medicine and Health
A framework for human evaluation of large language models in healthcare derived from literature review
T. Y. C. Tam, S. Sivarajkumar, et al.
Education
How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment
A. Gilson, C. W. Safranek, et al.
Psychology
Large language models could change the future of behavioral healthcare: a proposal for responsible development and evaluation
E. C. Stade, S. W. Stirman, et al.

