Computer Science
Evaluation of large language models on mental health: from knowledge test to illness diagnosis
Y. Xu, Z. Fang, et al.
Large language models are reshaping mental health tools. This study systematically evaluates 15 state-of-the-art LLMs (including DeepSeek-R1/V3, GPT-4.1, Llama4, and Alibaba’s QwQ) on knowledge-testing and diagnostic tasks using Chinese datasets. DeepSeek-R1, QwQ, and GPT-4.1 achieve the highest accuracy, and the results offer guidance for safer model selection. This research was conducted by Y. Xu, Z. Fang, et al.

