How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment

A. Gilson, C. W. Safranek, et al.

This study by Aidan Gilson, Conrad W Safranek, Thomas Huang, Vimig Socrates, Ling Chi, Richard Andrew Taylor, and David Chartash evaluates how ChatGPT, a natural language processing model, performed on questions within the scope of the USMLE Step 1 and Step 2 exams. It also considers ChatGPT's potential as a tool in medical education, especially for simulating small group learning.

Abstract

Background: Chat Generative Pre-trained Transformer (ChatGPT) is a 175-billion-parameter natural language processing model that can generate conversation-style responses to user input.

Objective: This study aimed to evaluate the performance of ChatGPT on questions within the scope of the United States Medical Licensing Examination Step 1 and Step 2 exams, as well as to analyze responses for user interpretability.

Methods: We used 2 sets of multiple-choice questions to evaluate ChatGPT's performance, each with questions pertaining to Step 1 and Step 2. The first set was derived from AMBOSS, a commonly used question bank for medical students, which also provides statistics on question difficulty and the performance on an exam relative to the user base. The second set was the National Board of Medical Examiners (NBME) free 120 questions. ChatGPT's performance was compared to 2 other large language models, GPT-3 and InstructGPT. The text output of each ChatGPT response was evaluated across 3 qualitative metrics: logical justification of the answer selected, presence of information internal to the question, and presence of information external to the question.

Results: Of the 4 data sets, AMBOSS-Step1, AMBOSS-Step2, NBME-Free-Step1, and NBME-Free-Step2, ChatGPT achieved accuracies of 44% (44/100), 42% (42/100), 64.4% (56/87), and 57.8% (59/102), respectively. ChatGPT outperformed InstructGPT by 8.15% on average across all data sets, and GPT-3 performed similarly to random chance. The model demonstrated a significant decrease in performance as question difficulty increased (P=.01) within the AMBOSS-Step1 data set. We found that logical justification for ChatGPT's answer selection was present in 100% of outputs of the NBME data sets. Internal information to the question was present in 96.8% (183/189) of all questions. The presence of information external to the question was 44.5% and 27% lower for incorrect answers relative to correct answers on the NBME-Free-Step1 (P<.001) and NBME-Free-Step2 (P=.001) data sets, respectively.
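The reported percentages can be reproduced from the counts given in the Results. The following is a minimal sketch, not the authors' analysis code: the dictionary layout and the helper name `accuracy_pct` are assumptions made for illustration, and only the counts come from the abstract. It also shows that the denominator of 189 for internal information matches the combined size of the two NBME sets (87 + 102).

```python
# Correct/total counts as reported in the abstract's Results.
# The data structure and helper below are illustrative assumptions.
counts = {
    "AMBOSS-Step1": (44, 100),
    "AMBOSS-Step2": (42, 100),
    "NBME-Free-Step1": (56, 87),
    "NBME-Free-Step2": (59, 102),
}


def accuracy_pct(correct: int, total: int) -> float:
    """Return accuracy as a percentage rounded to one decimal place."""
    return round(100 * correct / total, 1)


# Per-data-set accuracy, keyed by data set name.
accuracies = {name: accuracy_pct(c, n) for name, (c, n) in counts.items()}

# Combined number of NBME questions across Step 1 and Step 2.
nbme_total = counts["NBME-Free-Step1"][1] + counts["NBME-Free-Step2"][1]

# Share of responses containing information internal to the question (183 reported).
internal_info_pct = accuracy_pct(183, nbme_total)

for name, pct in accuracies.items():
    print(f"{name}: {pct}%")
print(f"Internal information: {internal_info_pct}% of {nbme_total} questions")
```

Rounding to one decimal place matches the precision used in the abstract, so the computed values can be compared directly against the reported figures.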

Publisher

JMIR Medical Education

Published On

Feb 08, 2023

Authors

Aidan Gilson, Conrad W Safranek, Thomas Huang, Vimig Socrates, Ling Chi, Richard Andrew Taylor, David Chartash

DOI

https://doi.org/10.2196/45312
