ChatGPT, a large language model (LLM) developed by OpenAI, has garnered attention for its ability to generate human-like responses to prompts. While LLMs have been explored in the medical domain for patient interaction and education, their performance on clinical knowledge question-answering tasks has so far been limited. ChatGPT, with its conversational interface, could potentially bridge the gap between clinical knowledge and dialogic interaction. This study evaluates ChatGPT's performance on questions styled after the USMLE Step 1 and Step 2 exams, which test medical knowledge and clinical reasoning. The researchers hypothesize that ChatGPT could perform at a level comparable to human learners, making it a valuable tool for medical education and knowledge assessment.
Literature Review
Large language models (LLMs) trained on vast amounts of text data have shown promise in various applications, including question answering and dialogue generation. In the medical domain, LLMs have been explored for personalized patient interaction, consumer health education, and medical question answering. However, previous research has indicated that LLM performance on clinical knowledge questions is not yet on par with that of human experts. ChatGPT, with its ability to generate coherent and contextually relevant responses, represents a potential advance in this area.
Methodology
The study used two pairs of data sets to assess ChatGPT's performance on questions related to the USMLE Step 1 and Step 2 exams. The first pair, AMBOSS-Step1 and AMBOSS-Step2, consisted of 100 questions each drawn from the widely used AMBOSS question bank. The second pair, NBME-Free-Step1 and NBME-Free-Step2, was drawn from the free 120-question materials published by the National Board of Medical Examiners (NBME). To standardize the input to ChatGPT, the researchers removed questions containing images or tables and formatted each remaining question as the question stem, the direct question, and the multiple-choice answer options, each separated by a new line. For the AMBOSS questions, a second instance of each question that appended the AMBOSS Attending Tip was also created, to examine the impact of this additional context on performance. ChatGPT's performance was compared with that of two other LLMs, GPT-3 and InstructGPT, accessed through the OpenAI API. The researchers manually reviewed the responses of all models to determine which answer each had selected. Qualitative analysis of ChatGPT's responses used three metrics: logical reasoning, presence of information internal to the question, and presence of information external to the question. These metrics were used to assess the coherence and interpretability of the model's responses.
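To make the input format concrete, the sketch below shows one way the newline-separated prompt described above could be assembled in Python. This is a minimal illustration, not the authors' code: the function name, the option lettering, and the sample question are assumptions, and the commented API call only indicates how the GPT-3 and InstructGPT comparisons might have been issued through the legacy OpenAI SDK.

```python
# Minimal sketch (not the authors' code): format a multiple-choice question as the
# newline-separated prompt described above. The sample question is fabricated for
# illustration only and does not come from AMBOSS or the NBME materials.
from typing import Optional


def format_question(stem: str, question: str, options: dict[str, str],
                    attending_tip: Optional[str] = None) -> str:
    """Join the question stem, direct question, and lettered options with new lines."""
    option_lines = [f"{letter}) {text}" for letter, text in sorted(options.items())]
    parts = [stem, question, *option_lines]
    if attending_tip:
        # Second instance of the question, with the AMBOSS Attending Tip appended.
        parts.append(f"Attending Tip: {attending_tip}")
    return "\n".join(parts)


prompt = format_question(
    stem="A 58-year-old man presents with crushing substernal chest pain radiating to the left arm...",
    question="Which of the following is the most likely diagnosis?",
    options={"A": "Acute pericarditis", "B": "Aortic dissection",
             "C": "Myocardial infarction", "D": "Pulmonary embolism"},
)
print(prompt)

# For the GPT-3 and InstructGPT comparisons, a prompt like this could be sent through
# the legacy OpenAI Python SDK (<1.0), e.g.:
#   import openai
#   response = openai.Completion.create(model="text-davinci-003", prompt=prompt, max_tokens=256)
# The exact model names and parameters used in the study are not reproduced here.
```

The Attending Tip variant is simply a second call to the same function with the tip supplied, mirroring how the study treated the tip-augmented question as a separate instance.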
Key Findings
ChatGPT achieved varying levels of accuracy across the four data sets. On the AMBOSS data sets, ChatGPT scored 44% on Step 1 and 42% on Step 2, while on the NBME data sets it scored 64.4% on Step 1 and 57.8% on Step 2. Notably, ChatGPT's performance decreased with increasing question difficulty on the AMBOSS-Step1 data set. Qualitative analysis revealed that ChatGPT consistently provided a logical justification for its answer selection and included information internal to the question in 96.8% of cases. Importantly, external information was present significantly more often in correct answers than in incorrect ones, suggesting that ChatGPT's ability to connect a prompt with relevant information in its knowledge base contributes to its accuracy.
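The link between external information and accuracy is essentially a two-by-two comparison: correct versus incorrect responses crossed with presence versus absence of external information. The sketch below shows how such a comparison might be run; the use of a chi-square test of independence and the counts themselves are assumptions for illustration, not the study's reported analysis or data.

```python
# Minimal sketch (not the authors' analysis): test whether external information
# appears more often in correct than in incorrect responses. All counts below are
# hypothetical placeholders, not figures from the study.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: correct / incorrect responses; columns: external info present / absent.
counts = np.array([
    [60, 17],   # hypothetical correct responses with / without external information
    [20, 23],   # hypothetical incorrect responses with / without external information
])

chi2, p_value, dof, expected = chi2_contingency(counts)
print(f"chi-square = {chi2:.2f}, p = {p_value:.4f}")
```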
Discussion
This study demonstrates that ChatGPT, a conversational LLM, can perform at a level comparable to a third-year medical student on questions from the USMLE Step 1 and Step 2 exams. The model's ability to provide logical explanations and incorporate relevant information suggests its potential for use as a virtual medical tutor. ChatGPT's dialogic nature and ability to provide context-rich responses could support problem-based learning, clinical problem-solving, and reflective practice, enhancing the learning experience for medical students. The study also highlights the importance of prompt engineering and the impact of additional context on LLM performance.
Conclusion
ChatGPT's performance on the USMLE-related questions suggests its potential as a valuable tool for medical education and knowledge assessment. Its ability to engage in dialogic interactions, provide logical justifications, and access relevant information makes it suitable for simulating small group learning and supporting various aspects of medical education. Future research should explore ChatGPT's specific efficacy in different educational settings and investigate potential novel applications for LLMs in medical education.
Limitations
This study has several limitations. First, ChatGPT was trained on data available only up to 2021, limiting its knowledge to information from that period. Second, the closed nature of the model prevents fine-tuning on task-specific data and makes its stochasticity difficult to examine. Third, the version of ChatGPT used was not the most up-to-date at the time of publication; future iterations are expected to improve performance.