Introduction
Large Language Models (LLMs) like ChatGPT are increasingly used for seeking health advice, mirroring the prevalence of using Google for self-diagnosis. This study addresses the lack of comprehensive evaluations of LLMs for clinical decision-making tasks. The potential benefits of LLMs in medicine are considerable, ranging from assisting medical writing and literature searches to enhancing telemedicine and patient care. However, concerns about the spread of misinformation and 'hallucinations' necessitate rigorous evaluation, particularly in the high-stakes environment of medical diagnosis and treatment. This research investigates the clinical accuracy of GPT-3.5 and GPT-4, two commercially available LLMs, across three key clinical decision-making tasks: initial diagnosis, examination steps, and treatment options. The impact of disease frequency (rare, less frequent, frequent) on model performance is also examined. A sub-study compares the performance of two Llama 2 open-source LLMs. Finally, the performance of the LLMs is benchmarked against a naïve Google search for the diagnostic task. The study uses a carefully curated dataset of 110 medical cases, extracted from German clinical casebooks and translated into English, to minimize potential bias from prior training data.
Literature Review
Previous research has demonstrated the potential of ChatGPT to pass medical licensing exams (USMLE, ACLS) and to assist with various medical tasks, including medical writing, literature review, and improving patient care across specialties. However, few thorough evaluations encompass multiple clinical decision support tasks, and fewer still consider the influence of disease frequency. Existing studies demonstrate ChatGPT's ability to generate differential diagnoses for common complaints but lack the breadth of this study.
Methodology
The study used 110 medical cases from German clinical casebooks, translated into English and rephrased into a first-person perspective to mimic patient input. Cases were selected to represent a broad spectrum of diseases across multiple medical specialties and disease frequencies (rare, less frequent, frequent), and the selection process aimed to minimize the risk that the LLMs had already been trained on this specific data. Three tasks were assessed (a prompting sketch follows the list):
1. **Initial Diagnosis:** Patient history and symptoms were provided to GPT-3.5, GPT-4, and Google search, with a request for up to five likely diagnoses. Google search results were evaluated by identifying diagnoses among the top 10 hits.
2. **Examination Options:** Given patient history and symptoms, the LLMs were prompted to suggest up to five important examinations.
3. **Treatment Options:** Given a diagnosis and patient history, the LLMs suggested up to five appropriate therapies.
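As a minimal sketch of how such a query could be issued programmatically, the snippet below uses the OpenAI Python client; the prompt wording, the helper name `ask_for_diagnoses`, and the model identifier are illustrative assumptions, not the study's verbatim protocol.

```python
# Illustrative sketch only: prompt wording and model identifier are assumptions,
# not the authors' exact protocol.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_for_diagnoses(case_vignette: str, model: str = "gpt-4") -> str:
    """Request up to five likely diagnoses for a first-person case vignette."""
    prompt = (
        f"{case_vignette}\n\n"
        "Based on my history and symptoms, what are the five most likely diagnoses?"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Hypothetical usage:
# print(ask_for_diagnoses("I am 45 years old and woke up with sudden chest pain radiating to my left arm..."))
```

Analogous prompts would ask for up to five examinations (task 2) or up to five therapies given a stated diagnosis (task 3).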
Two independent physicians evaluated the LLMs' and Google's responses for each task using a 5-point Likert scale. Inter-rater reliability was assessed using Cohen's kappa. Statistical comparisons (Mann-Whitney U test with Bonferroni correction) were used to analyze the performance differences between models and the impact of disease frequency. A sub-study explored the performance of Llama 2 (7B and 70B parameter models) on a subset of cases (n=18).
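The following is a hedged sketch of how the reported reliability and comparison statistics could be computed, assuming the two physicians' Likert ratings and the per-case model scores are available as Python lists; `cohen_kappa_score` (scikit-learn) and `mannwhitneyu` (SciPy) are real library calls, while the score arrays and the number of comparisons are placeholders.

```python
# Illustrative analysis sketch; the score arrays below are placeholders, not study data.
from sklearn.metrics import cohen_kappa_score
from scipy.stats import mannwhitneyu

# Likert ratings (1-5) from the two physicians for the same set of responses.
rater_a = [5, 4, 4, 3, 5, 2]
rater_b = [5, 4, 3, 3, 5, 2]
kappa = cohen_kappa_score(rater_a, rater_b)

# Per-case consensus scores for two models on the same cases.
gpt4_scores = [5, 4, 5, 3, 5, 4]
gpt35_scores = [4, 3, 4, 2, 4, 3]
stat, p_value = mannwhitneyu(gpt4_scores, gpt35_scores, alternative="two-sided")

# Bonferroni correction: scale each raw p-value by the number of comparisons, capped at 1.
n_comparisons = 3  # e.g., one comparison per task
p_adjusted = min(p_value * n_comparisons, 1.0)

print(f"kappa={kappa:.2f}, U={stat:.1f}, adjusted p={p_adjusted:.3f}")
```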
Key Findings
GPT-4 consistently outperformed GPT-3.5 and Google search across all three tasks (diagnosis, examination, treatment), with many of the differences reaching statistical significance. GPT-4 achieved a median score of 4.5 in each task, significantly higher than GPT-3.5 and Google. All models performed worse on rare diseases, an expected consequence of their underrepresentation in training data, but one that remains crucial to consider for clinical use. The Llama 2 models performed below both GPT-3.5 and GPT-4. Inter-rater reliability was substantial to almost perfect across tasks (κ = 0.76-0.8). Even GPT-4, however, achieved a perfect score (5/5) on only a limited number of cases per task (18 for diagnosis, 24 for examination, and 26 for treatment). A post-hoc analysis suggests that prompt engineering (re-phrasing questions) can improve model performance, particularly for cases involving rare conditions.
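The study does not publish its re-phrased prompts, so the snippet below is only a hypothetical illustration of the kind of re-phrasing the post-hoc analysis refers to: broadening the question so that rarer conditions are not implicitly excluded.

```python
# Hypothetical prompt variants; neither string is taken from the study.
original_prompt = (
    "Based on my symptoms, what are the five most likely diagnoses?"
)
rephrased_prompt = (
    "Based on my symptoms, list five possible diagnoses, "
    "explicitly considering rare conditions that fit this presentation."
)
```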
Discussion
The findings demonstrate the promising potential of LLMs, particularly GPT-4, for clinical decision support. GPT-4's superior performance over GPT-3.5 and Google search underscores the rapid advancement of LLM technology. However, the study also highlights significant limitations, especially in the diagnosis of rare diseases. The observed inconsistencies, even in GPT-4's output, emphasize the need for continued improvement and rigorous testing before clinical deployment. The findings, particularly the stronger performance on more common diseases, indicate that LLMs are best used to supplement, not replace, human clinical expertise. Open-source LLMs such as Llama 2 offer advantages in data privacy and training transparency, although their current performance lags behind commercial models. Future work should focus on developing task-specific medical LLMs trained on high-quality, up-to-date medical data, and on advanced prompt engineering techniques to improve accuracy.
Conclusion
This study provides a comprehensive evaluation of LLMs for clinical decision support, demonstrating the superior performance of GPT-4 while highlighting limitations, particularly with rare diseases. The results suggest promising potential for LLMs as clinical decision support tools but emphasize the need for further development, stringent regulations, and human oversight to ensure accuracy and safety before widespread clinical adoption. The exploration of open-source models reveals their potential to address concerns about data privacy and training transparency.
Limitations
The study's reliance on translated case reports might limit generalizability. The 5-point Likert scale, while facilitating standardized assessment, is inherently subjective. The exclusion of cases relying heavily on lab tests and imaging limits the scope of the evaluation. The assessment represents a performance snapshot in time, as the LLMs are constantly being updated, and prompt engineering can significantly affect the model's output. The study did not evaluate several crucial clinical dimensions of accuracy (quality of life, surrogate parameters, etc.).