ChatGPT, a 175-billion-parameter natural language processing model, was evaluated on questions from the United States Medical Licensing Examination (USMLE) Step 1 and Step 2 exams. It achieved accuracies of 44% (44/100), 42% (42/100), 64.4% (56/87), and 57.8% (59/102) on the AMBOSS-Step1, AMBOSS-Step2, NBME-Free-Step1, and NBME-Free-Step2 datasets, respectively, and its performance decreased as question difficulty increased. ChatGPT provided logical justification for its answer selection in all cases and incorporated information internal to the question in 96.8% of responses. Information external to the question was present significantly more often in correct answers than in incorrect ones. These findings suggest ChatGPT's potential in medical education, particularly for simulating small-group learning.
Publisher
JMIR Medical Education
Published On
Feb 08, 2023
Authors
Aidan Gilson, Conrad W Safranek, Thomas Huang, Vimig Socrates, Ling Chi, Richard Andrew Taylor, David Chartash
Tags
ChatGPT
USMLE
medical education
natural language processing
model performance
question difficulty
learning simulation