The increasing use of AI chatbots, particularly ChatGPT, necessitates evaluating their reliability, especially in sensitive areas like healthcare. ChatGPT, trained on a massive dataset, has gained popularity for information retrieval, and a growing share of patients already search for health information online, making it likely that some will turn to the chatbot with ophthalmic questions. This study therefore aimed to determine the reliability and adequacy of the ophthalmologic information ChatGPT provides, focusing on the potential for misinformation and harm to patients who use the chatbot without medical supervision, and to weigh its potential benefits and risks as a patient-education tool.
Literature Review
The introduction cites studies documenting the rising trend of online health-information searches in Europe and the US, as well as recent work demonstrating ChatGPT's ability to perform complex tasks, including passing certain professional examinations (MBA, USMLE). It emphasizes, however, that the reliability of ChatGPT's medical information still requires validation, and highlights the need to verify the accuracy and safety of the chatbot's answers, particularly given its potential use by patients seeking health advice.
Methodology
This study assessed the accuracy of ChatGPT version 3.5 in providing information about five common diseases within each of eight ophthalmology subspecialties (General, Anterior segment and Cornea, Glaucoma, Neuro-ophthalmology, Oculoplastic, Pediatric ophthalmology, Retina and vitreous, Uveitis). For each disease, three questions were asked: 'What is X?', 'How is X diagnosed?', and 'How is X treated?', yielding 120 questions in total. Two independent ophthalmologists graded each response on a scale from -3 (potentially harmful) to 2 (correct and complete) by comparing it against the American Academy of Ophthalmology (AAO) patient guidelines, with the AAO's 'For public & patients - Eye health A-Z' section and guidelines serving as the gold standard; a score of ≥1 was considered optimal. The grading criteria were defined in advance, the graders were trained experts in their respective subspecialties, and peer-reviewed publications were consulted as references when necessary. The detailed grading scale covered multiple levels of accuracy and potential harm, allowing a nuanced assessment of the chatbot's responses.
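As a rough illustration of the protocol described above, the Python sketch below shows one way the three question templates could be expanded into the full set of prompts and how the ordinal grading scale could be represented. The disease names and data structures are hypothetical placeholders assumed for illustration, not the study's actual materials or code.

```python
# Minimal sketch (assumed structure, not the study's actual code) of the
# question protocol and grading scale described above.

QUESTION_TEMPLATES = [
    "What is {disease}?",
    "How is {disease} diagnosed?",
    "How is {disease} treated?",
]

# Ordinal grading scale applied to each response (endpoints as reported;
# intermediate grade labels are not reproduced here).
GRADE_MIN, GRADE_MAX = -3, 2   # -3 = potentially harmful, 2 = correct and complete
OPTIMAL_THRESHOLD = 1          # scores >= 1 were considered optimal

def build_prompts(diseases_by_subspecialty):
    """Expand {subspecialty: [diseases]} into (subspecialty, disease, prompt) tuples."""
    prompts = []
    for subspecialty, diseases in diseases_by_subspecialty.items():
        for disease in diseases:
            for template in QUESTION_TEMPLATES:
                prompts.append((subspecialty, disease, template.format(disease=disease)))
    return prompts

# Hypothetical example: one subspecialty with five diseases -> 15 prompts.
# (Eight subspecialties x five diseases x three questions = 120 prompts.)
example = {
    "Glaucoma": [
        "primary open-angle glaucoma",
        "angle-closure glaucoma",
        "normal-tension glaucoma",
        "ocular hypertension",
        "congenital glaucoma",
    ],
}
print(len(build_prompts(example)))  # 15
```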
Key Findings
Of the 120 questions, 93 (77.5%) scored ≥1, while 27 (22.5%) scored ≤-1, including 9 (7.5%) that scored -3 (potentially harmful). The overall median score was 2 for 'What is X?', 1.5 for 'How is X diagnosed?', and 1 for 'How is X treated?'. Median scores varied across subspecialties, with some demonstrating higher accuracy than others; notably, the 'What is X?' question consistently received the highest median scores in every subspecialty. A substantial number of responses achieved the highest score (2, 'Very good'), indicating accuracy and completeness in line with AAO guidelines, whereas a concerning minority received scores of -1 or lower, signifying incorrect or potentially harmful information. A breakdown by subspecialty reveals variation in performance, likely linked to factors such as data availability and disease complexity in the training dataset. Nevertheless, the Kruskal-Wallis test showed no statistically significant difference in score distribution across subspecialties for any of the three question types. The study's figures display the median scores and ranges of ChatGPT response grades by subspecialty for each question type, illustrating the variability in performance across ophthalmologic domains.
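To make the statistical comparison concrete, the sketch below shows how per-subspecialty medians and a Kruskal-Wallis H-test could be computed from ordinal grades using SciPy. The grade lists are fabricated placeholders for illustration only, not data from the study.

```python
# Sketch of the Kruskal-Wallis comparison across subspecialties for one
# question type. The grade lists below are placeholders, not study data.
from statistics import median
from scipy.stats import kruskal

grades_by_subspecialty = {
    "Glaucoma":            [2, 2, 1, 2, -1],
    "Retina and vitreous": [2, 1, 1, 2, 2],
    "Neuro-ophthalmology": [1, 2, -1, 2, 1],
    # ... one list of five grades per subspecialty for this question type
}

# Per-subspecialty medians (the summary statistic reported in the study).
medians = {name: median(grades) for name, grades in grades_by_subspecialty.items()}
print(medians)

# Kruskal-Wallis H-test: does the grade distribution differ across groups?
h_stat, p_value = kruskal(*grades_by_subspecialty.values())
print(f"H = {h_stat:.2f}, p = {p_value:.3f}")  # p >= 0.05 -> no significant difference
```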
Discussion
ChatGPT's performance varied across different aspects of ophthalmic knowledge and subspecialties. Its higher accuracy in defining conditions compared to diagnosing or treating them suggests that the chatbot's training data may be more consistent and comprehensive for definitions. The variability across subspecialties suggests that data availability and disease complexity affect the model's accuracy. While ChatGPT demonstrated potential as an educational tool, its limitations necessitate stringent human oversight to prevent the dissemination of potentially harmful information. The findings are discussed in the context of other AI chatbots, such as Bing's chatbot (powered by GPT-4) and Google's Bard (powered by LaMDA), and the emergence of specialized medical AI tools, such as MedPal, highlights the evolving landscape of AI applications in healthcare. The authors underscore the importance of evaluating LLMs comprehensively, beyond mere accuracy, with attention to consistency, reliability, and safety, particularly in medical domains.
Conclusion
ChatGPT demonstrated potential as an adjunct to patient education in ophthalmology, but its inherent limitations, including the provision of incomplete, incorrect, and potentially harmful information, highlight the critical need for human medical supervision. Future research should explore different prompts and evaluation methods to better characterize its accuracy and account for its limitations, so that its capabilities can be further refined. The rapidly evolving AI landscape means that continued rigorous evaluation is crucial for the responsible integration of these technologies into patient care.
Limitations
The study's limitations include the single-question approach (without clarification requests), the potential for learned patterns in sequential questioning within the same chat session, and the possibility of selection bias due to purposeful selection of common diseases. The use of an ordinal grading scale, rather than continuous variables, may affect the interpretation of the data. Although the grading system was defined beforehand, the inherent subjectivity in comparing responses to the AAO guidelines may have introduced bias.