ChatGPT, a large language model (LLM) developed by OpenAI, has garnered attention for its ability to generate human-like responses to prompts. While LLMs have been explored in the medical domain for patient interaction and education, their performance on clinical knowledge question-answering tasks has so far been limited. ChatGPT, with its conversational interface, could potentially bridge the gap between clinical knowledge and dialogic interaction. This study evaluates ChatGPT's performance on questions styled after the USMLE Step 1 and Step 2 exams, which test medical knowledge and clinical reasoning. The researchers hypothesize that ChatGPT could perform at a level comparable to human learners, making it a valuable tool for medical education and knowledge assessment.
Literature Review
Large language models (LLMs) trained on vast amounts of text data have shown promise in various applications, including question answering and dialogue generation. In the medical domain, LLMs have been explored for personalized patient interaction, consumer health education, and medical question answering. However, previous research has indicated that LLM performance on clinical knowledge questions is not yet on par with that of human experts. ChatGPT, with its ability to generate coherent and contextually relevant responses, represents a potential advance in this area.
Methodology
The study used two pairs of data sets to assess ChatGPT's performance on questions related to the USMLE Step 1 and Step 2 exams. The first pair, AMBOSS-Step1 and AMBOSS-Step2, consisted of 100 questions each drawn from the widely used AMBOSS question bank. The second pair, NBME-Free-Step1 and NBME-Free-Step2, was drawn from the free 120-question materials published by the National Board of Medical Examiners (NBME). To standardize the input to ChatGPT, the researchers removed questions containing images or tables and formatted each remaining question as the question stem, the direct question, and the multiple-choice answer options, each separated by a new line. For the AMBOSS questions, a second instance of each question that appended the AMBOSS Attending Tip was also created, to examine the impact of this additional context on performance. ChatGPT's performance was compared with that of two other LLMs, GPT-3 and InstructGPT, accessed through the OpenAI API. The researchers manually reviewed the responses of all models to determine which answer each had selected. Qualitative analysis of ChatGPT's responses used three metrics: logical reasoning, presence of information internal to the question, and presence of information external to the question. These metrics were used to assess the coherence and interpretability of the model's responses.
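To make the input format concrete, the sketch below shows one way the newline-separated prompt described above could be assembled in Python. This is a minimal illustration, not the authors' code: the function name, the option lettering, and the sample question are assumptions, and the commented API call only indicates how the GPT-3 and InstructGPT comparisons might have been issued through the legacy OpenAI SDK.

```python
# Minimal sketch (not the authors' code): format a multiple-choice question as the
# newline-separated prompt described above. The sample question is fabricated for
# illustration only and does not come from AMBOSS or the NBME materials.
from typing import Optional


def format_question(stem: str, question: str, options: dict[str, str],
                    attending_tip: Optional[str] = None) -> str:
    """Join the question stem, direct question, and lettered options with new lines."""
    option_lines = [f"{letter}) {text}" for letter, text in sorted(options.items())]
    parts = [stem, question, *option_lines]
    if attending_tip:
        # Second instance of the question, with the AMBOSS Attending Tip appended.
        parts.append(f"Attending Tip: {attending_tip}")
    return "\n".join(parts)


prompt = format_question(
    stem="A 58-year-old man presents with crushing substernal chest pain radiating to the left arm...",
    question="Which of the following is the most likely diagnosis?",
    options={"A": "Acute pericarditis", "B": "Aortic dissection",
             "C": "Myocardial infarction", "D": "Pulmonary embolism"},
)
print(prompt)

# For the GPT-3 and InstructGPT comparisons, a prompt like this could be sent through
# the legacy OpenAI Python SDK (<1.0), e.g.:
#   import openai
#   response = openai.Completion.create(model="text-davinci-003", prompt=prompt, max_tokens=256)
# The exact model names and parameters used in the study are not reproduced here.
```

The Attending Tip variant is simply a second call to the same function with the tip supplied, mirroring how the study treated the tip-augmented question as a separate instance.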
Key Findings
ChatGPT achieved varying levels of accuracy across the four data sets. On the AMBOSS data sets, ChatGPT scored 44% on Step 1 and 42% on Step 2, while on the NBME data sets it scored 64.4% on Step 1 and 57.8% on Step 2. Notably, ChatGPT's performance decreased with increasing question difficulty on the AMBOSS-Step1 data set. Qualitative analysis revealed that ChatGPT consistently provided a logical justification for its answer selection and included information internal to the question in 96.8% of cases. Importantly, external information was present significantly more often in correct answers than in incorrect ones, suggesting that ChatGPT's ability to connect a prompt with relevant information in its knowledge base contributes to its accuracy.
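The link between external information and accuracy is essentially a two-by-two comparison: correct versus incorrect responses crossed with presence versus absence of external information. The sketch below shows how such a comparison might be run; the use of a chi-square test of independence and the counts themselves are assumptions for illustration, not the study's reported analysis or data.

```python
# Minimal sketch (not the authors' analysis): test whether external information
# appears more often in correct than in incorrect responses. All counts below are
# hypothetical placeholders, not figures from the study.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: correct / incorrect responses; columns: external info present / absent.
counts = np.array([
    [60, 17],   # hypothetical correct responses with / without external information
    [20, 23],   # hypothetical incorrect responses with / without external information
])

chi2, p_value, dof, expected = chi2_contingency(counts)
print(f"chi-square = {chi2:.2f}, p = {p_value:.4f}")
```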
Discussion
This study demonstrates that ChatGPT, a conversational LLM, can perform at a level comparable to a third-year medical student on questions from the USMLE Step 1 and Step 2 exams. The model's ability to provide logical explanations and incorporate relevant information suggests its potential for use as a virtual medical tutor. ChatGPT's dialogic nature and ability to provide context-rich responses could support problem-based learning, clinical problem-solving, and reflective practice, enhancing the learning experience for medical students. The study also highlights the importance of prompt engineering and the impact of additional context on LLM performance.
Conclusion
ChatGPT's performance on the USMLE-related questions suggests its potential as a valuable tool for medical education and knowledge assessment. Its ability to engage in dialogic interactions, provide logical justifications, and access relevant information makes it suitable for simulating small group learning and supporting various aspects of medical education. Future research should explore ChatGPT's specific efficacy in different educational settings and investigate potential novel applications for LLMs in medical education.
Limitations
This study has several limitations. First, ChatGPT was trained on data available only up to 2021, limiting its knowledge to information from that period. Second, the closed nature of the model prevents fine-tuning on task-specific data and makes its stochasticity difficult to examine. Third, the version of ChatGPT used was not the most up-to-date at the time of publication; future iterations are expected to improve performance.