
Medicine and Health

Systematic analysis of ChatGPT, Google search and Llama 2 for clinical decision support tasks

S. Sandmann, S. Riepenhausen, et al.

This groundbreaking study by Sarah Sandmann, Sarah Riepenhausen, Lucas Plagwitz, and Julian Varghese explores the capabilities of advanced AI models like GPT-3.5, GPT-4, and Llama 2 in clinical decision support. Discover how GPT-4 outperformed its peers, suggesting effective diagnoses and treatments, while pointing to the crucial need for regulated AI in healthcare!

Introduction
The growing public adoption of large language models (LLMs) such as ChatGPT (GPT-3.5 and GPT-4) has created both opportunities and concerns for their use in medicine. While LLMs excel at text-based tasks, their training on heterogeneous internet corpora introduces risks of misinformation and hallucination, a critical issue in clinical contexts where patients and clinicians may consult such tools much as they use search engines. Prior work has shown that ChatGPT can pass medical exams (e.g., USMLE, ACLS) and assist in medical writing, EHR information extraction, literature support, and specialty workflows (e.g., dental telemedicine, radiology). Diagnostic capability for common complaints has also been demonstrated. However, to the authors' knowledge, there has been no thorough evaluation across key clinical decision-support tasks. This study investigates the clinical accuracy of GPT-3.5 and GPT-4 across three core tasks (initial diagnosis, recommended examinations, and treatment) while examining the impact of disease frequency (rare, less frequent, frequent). It also benchmarks diagnosis against a naïve Google search and explores open-source Llama 2 models in a sub-study.
Literature Review
Background studies indicate LLMs' potential in healthcare: ChatGPT has reportedly passed the USMLE and ACLS exams and shown utility in medical writing, EHR information extraction, literature search support, and style/formatting improvements. Specialty discussions note potential in dental telemedicine and patient-centered radiology. GPT-3.5 can generate differential diagnoses for common chief complaints. At the same time, the literature highlights hallucination risks and misinformation concerns in LLM outputs. Despite these advances, there had been no comprehensive evaluation spanning diagnosis, examination, and treatment recommendations across varied disease frequencies, motivating the present work.
Methodology
Design and data sources: Clinical case reports were extracted from two German casebook publishers (Elsevier, Thieme) covering a broad spectrum of specialties and disease frequencies. Cases were translated to English using DeepL and manually reviewed; language and perspective were adjusted to mimic patient-generated input. To minimize training-data leakage, sources were non-English and access-restricted.

Sampling and selection: From 1020 cases, a 40% random sample (n=408) was drawn to ensure even distribution across sources for subgroup power. Applying the inclusion criteria (sufficient patient or third-party history; diagnosis not overly reliant on laboratory values; exclusion of imaging-dependent cases; no duplicates) yielded 153 eligible cases. Incidence and prevalence figures were researched to stratify diseases into rare, less frequent, and frequent categories. A balanced final set of 110 cases was selected across publishers, specialties, and frequency strata (rare=34, less frequent=39, frequent=37).

Tasks and prompts: Three tasks were assessed per case: (1) initial diagnosis, (2) examination recommendations, and (3) treatment recommendations. For ChatGPT, separate conversations were used for the diagnosis and treatment prompts. Prompting plan: for diagnosis, present the history/symptoms and ask "What are the most likely diagnoses? Name up to five."; for examinations, ask "What are the most important examinations that should be considered in my case? Name up to five."; for treatment, present the history/symptoms and add "My doctor has diagnosed me with (specific diagnosis X). What are the most appropriate therapies in my case? Name up to five." Interactions were conducted via chat.openai.com between April 3 and May 19, 2023, using GPT-3.5 and GPT-4.

Google benchmarking: A naïve Google search was performed in incognito mode at google.com. From each case's history, symptom keywords (with optional modifiers such as "baby" or "child") were used to retrieve the first 10 websites. A non-medical expert inferred diagnoses from those pages, considering alignment with the case history and basic demographics; up to five likely diagnoses were recorded.

Open-source sub-study: The Llama 2 chat models Llama2-7b-chat and Llama2-70b-chat were explored descriptively on a subset (the top-3 and bottom-3 cases per frequency stratum based on GPT-4 cumulative scores; n=18 in total). Example generation settings included temperature 0.6, top-p 0.9, and a maximum sequence length of 4096.

Evaluation: Two independent physicians rated outputs against the casebook solutions. A 5-point Likert scale (1–5) assessed clinical accuracy for each task; final scores were the mean of the two ratings. Inter-rater reliability was measured with weighted Cohen's kappa and 95% CIs.

Statistics: Analyses were performed in R 4.3.1. Paired one-sided Mann–Whitney tests compared GPT-3.5 vs GPT-4 (examination and treatment) and GPT-3.5 vs GPT-4 vs Google (diagnosis), with Bonferroni correction for multiple comparisons (12 tests for diagnosis; 7 each for examination and treatment). Disease-frequency effects (rare vs less frequent vs frequent) were assessed with one-sided Mann–Whitney tests and Bonferroni adjustment. The Llama 2 sub-study was descriptive. Illustrative code sketches of the sampling, prompting, Llama 2 generation, and evaluation steps follow below.
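The sampling step can be illustrated with a minimal pandas sketch. The column names (`source`, `frequency`, `eligible`), the CSV file, and the random seed are illustrative assumptions; the paper does not describe the tooling used for case selection.

```python
import pandas as pd

# Hypothetical case table: one row per casebook case.
# 'source' (publisher), 'frequency' (rare / less frequent / frequent) and
# 'eligible' (inclusion criteria met) are assumed columns for illustration only.
cases = pd.read_csv("casebook_cases.csv")

# Draw a 40% random sample within each publisher so that both sources
# stay evenly represented in the subsample (roughly 408 of 1020 cases).
sampled = cases.groupby("source").sample(frac=0.4, random_state=42)

# After manual application of the inclusion criteria, draw a balanced final
# set per disease-frequency stratum (34 / 39 / 37 cases, as in the study).
eligible = sampled[sampled["eligible"]]
per_stratum = {"rare": 34, "less frequent": 39, "frequent": 37}
final = pd.concat(
    eligible[eligible["frequency"] == stratum].sample(n=k, random_state=42)
    for stratum, k in per_stratum.items()
)
print(final["frequency"].value_counts())
```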
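The three task prompts follow simple templates. The sketch below assembles them from a per-case history string, using the question wording quoted in the prompting plan; how the prompts were actually concatenated or sequenced within a chat session is an assumption.

```python
def build_prompts(history: str, diagnosis: str | None = None) -> dict[str, str]:
    """Assemble the three task prompts for one case.

    `history` is the patient-perspective case history/symptom description;
    `diagnosis` is the casebook diagnosis, needed only for the treatment task.
    """
    prompts = {
        # Task 1: initial diagnosis (asked in its own conversation).
        "diagnosis": f"{history}\nWhat are the most likely diagnoses? Name up to five.",
        # Task 2: examination recommendations (asked as a follow-up question).
        "examination": "What are the most important examinations that should be "
                       "considered in my case? Name up to five.",
    }
    if diagnosis is not None:
        # Task 3: treatment, in a separate conversation with the known diagnosis given.
        prompts["treatment"] = (
            f"{history}\nMy doctor has diagnosed me with {diagnosis}. "
            "What are the most appropriate therapies in my case? Name up to five."
        )
    return prompts
```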
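For the Llama 2 sub-study, only the reported generation settings (temperature 0.6, top-p 0.9, maximum sequence length 4096) are given. The sketch below applies those settings via the Hugging Face `transformers` chat weights; this runtime, the model ID, and the prompt wrapping are assumptions, not a description of the study's actual setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed runtime: Hugging Face hosting of the Llama 2 7B chat weights.
model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

def ask(prompt: str, max_new_tokens: int = 512) -> str:
    # Llama 2 chat expects [INST] ... [/INST] instruction wrapping.
    wrapped = f"[INST] {prompt} [/INST]"
    inputs = tokenizer(
        wrapped, return_tensors="pt", truncation=True, max_length=4096
    ).to(model.device)
    output = model.generate(
        **inputs,
        do_sample=True,      # sampling, as implied by the temperature/top-p settings
        temperature=0.6,
        top_p=0.9,
        max_new_tokens=max_new_tokens,
    )
    # Return only the newly generated continuation.
    return tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```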
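The agreement and frequency-effect analyses were run in R; the sketch below reproduces the same ideas in Python as an assumed equivalent, using scikit-learn for weighted Cohen's kappa and SciPy for a one-sided Mann–Whitney comparison of frequent vs rare cases with a manual Bonferroni factor. The linear kappa weights, the toy ratings, and the number of corrected tests are illustrative assumptions.

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.metrics import cohen_kappa_score

# Inter-rater reliability: weighted Cohen's kappa on the two physicians' Likert ratings.
# Linear weights are an assumption; the summary states only "weighted" kappa.
rater1 = np.array([5, 4, 3, 5, 2, 4])   # toy ratings on the 1-5 scale
rater2 = np.array([5, 4, 4, 5, 2, 3])
kappa = cohen_kappa_score(rater1, rater2, weights="linear")

# Disease-frequency effect: one-sided Mann-Whitney U test asking whether scores on
# frequent diseases exceed those on rare diseases, with a Bonferroni correction.
scores_frequent = np.array([4.5, 5.0, 4.0, 4.75, 4.5])   # toy mean-of-rater scores
scores_rare = np.array([3.0, 4.0, 2.5, 3.5, 4.0])
n_tests = 3   # illustrative number of corrected comparisons
stat, p = mannwhitneyu(scores_frequent, scores_rare, alternative="greater")
p_bonferroni = min(1.0, p * n_tests)

print(f"weighted kappa = {kappa:.2f}, Bonferroni-adjusted p = {p_bonferroni:.3f}")
```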
Key Findings
Inter-rater reliability: Agreement between the two raters ranged from moderate to substantial. Diagnosis: κ=0.80 (GPT-3.5), κ=0.76 (GPT-4), κ=0.80 (Google). Examination: κ=0.53 (GPT-3.5), κ=0.64 (GPT-4). Treatment: κ=0.67 (GPT-3.5), κ=0.73 (GPT-4).

Diagnosis performance (n=110): GPT-4 outperformed both GPT-3.5 and Google. Medians [IQR]: GPT-4 4.5 [3.8–5.7]; GPT-3.5 4.0 [2.7–5.4]; Google 4.0 [2.75–4.75]. Pairwise tests: GPT-4 vs GPT-3.5 p=0.003; GPT-4 vs Google p=0.006; GPT-3.5 vs Google p=0.621 (not significant). All approaches performed better on frequent than on rare diseases. Frequency effects: GPT-3.5 frequent>rare p<0.001; GPT-4 frequent>rare p=0.0003 and less frequent>rare p=0.0067; Google showed no significant difference between rare and less frequent diseases.

Examination performance: GPT-4 median 4.5 [4.0–4.75] vs GPT-3.5 median 4.25 [3.75–4.75]; GPT-4 was superior (p<0.0001). GPT-4 performed significantly better on frequent and less frequent diseases than on rare ones (p=0.0203); GPT-3.5 showed a similar but non-significant trend favoring frequent diseases.

Treatment performance: GPT-4 median 4.5 [4.0–4.75] vs GPT-3.5 median 4.25 [4.0–4.69]; the difference was not significant (p=0.503). Disease frequency showed no clear influence on treatment accuracy.

Ceiling attainment: Full scores of 5/5 were achieved in 18 (diagnosis), 24 (examination), and 26 (treatment) of the 110 cases.

Open-source sub-study: Llama2-7b-chat and Llama2-70b-chat showed slightly lower and more variable performance than the GPT models on the evaluated subset.
Discussion
The study demonstrates a clear performance progression from GPT-3.5 to GPT-4 across clinical decision-support tasks, with GPT-4 surpassing both GPT-3.5 and a naïve Google search for diagnosis. All systems performed worse on rare diseases, consistent with their under-representation in training corpora. GPT-4 also showed advantages in recommending examinations, while its treatment recommendations were comparable to those of GPT-3.5. The sub-study suggests that current open-source Llama 2 variants lag slightly behind their commercial counterparts. These findings address the central question of whether LLMs can provide clinically accurate suggestions across diagnosis, examination, and treatment: modern LLMs, particularly GPT-4, can deliver useful guidance, yet remain imperfect, especially for rare conditions. Case examples illustrate potential pitfalls (e.g., pheochromocytoma treatment sequencing, a missed dermatomyositis diagnosis) and show how prompt refinement or traditional search may sometimes yield better results. Given evolving back-end models, the results should be viewed as time-specific snapshots. The implication is that while LLMs can support clinical reasoning, they require cautious use, oversight, and further optimization, including integration with curated medical knowledge sources and adherence to regulatory standards.
Conclusion
Commercial LLMs show promising capability in clinical decision-support tasks, with GPT-4 improving upon GPT-3.5, outperforming a naïve Google search for diagnosis, and delivering strong examination recommendations. However, accuracy is inconsistent, especially for rare diseases, and fully correct answers are infrequent, so these models should not be used for standalone medical consultation. Open-source LLMs like Llama 2 currently perform slightly worse but remain attractive for privacy and transparency needs. Future work should integrate up-to-date, quality-assured medical knowledge bases; expand evaluation to include lab/imaging data and multidimensional clinical outcomes; and address regulatory, transparency, and safety considerations for medical deployment.
Limitations
- Subjective evaluation using a 5-point Likert scale; other clinical outcome dimensions (e.g., quality of life, safety, life expectancy) were not separately assessed.
- Cases were translated and adapted by non-medical experts to simulate patient input; prompts were not optimized for maximum performance, which may influence outcomes.
- Case selection excluded cases heavily dependent on imaging or laboratory data to focus on initial diagnosis, limiting generalizability to fully worked-up cases.
- Model outputs and performance may vary over time due to continual back-end updates; findings represent a snapshot (interactions from April to May 2023).
- The Google benchmark used a naïve search strategy limited to the first 10 results and basic symptom strings, which may underrepresent expert search behaviors.
- The open-source Llama 2 evaluation was exploratory and limited to a small subset, restricting conclusions about open-source models' comparative performance.