Reliability and accuracy of artificial intelligence ChatGPT in providing information on ophthalmic diseases and management to patients

F. Cappellani, K. R. Card, et al.

This study evaluated the accuracy of ophthalmic information provided by ChatGPT version 3.5, revealing a mix of promising insights and potential dangers. Researchers Francesco Cappellani, Kevin R. Card, Carol L. Shields, Jose S. Pulido, and Julia A. Haller discovered that while 77.5% of responses were acceptable, a significant portion provided incomplete or harmful information, emphasizing the critical need for human oversight in medical contexts.

Introduction
The study investigates whether ChatGPT can reliably provide accurate, safe, and complete information about ophthalmic diseases for patients. With increasing public use of online tools and rapid adoption of large language models like ChatGPT to seek health information, the work examines the validity of ChatGPT’s responses against authoritative American Academy of Ophthalmology (AAO) patient guidelines. The purpose is to quantify accuracy across definitions, diagnosis, and treatments for common eye diseases and to evaluate the risks of incomplete, incorrect, or potentially harmful guidance. The importance lies in patient safety and the potential role of AI chatbots as adjuncts to patient education in ophthalmology.
Methodology
Design: Cross-sectional evaluation of ChatGPT (version 3.5) responses to standardized prompts.
Scope: 5 diseases from each of 8 ophthalmology subspecialties (General; Anterior segment and Cornea; Glaucoma; Neuro-ophthalmology; Oculoplastics; Paediatric ophthalmology; Retina and vitreous; Uveitis) were selected from the AAO “For public & patients - Eye health A-Z” resource. For each disease, three sequential prompts were used: (1) What is “X”? (2) How is “X” treated? (3) How is “X” diagnosed?
Data collection: Responses were recorded in a chart, with separate entries per disease to limit learned patterns carrying over between assessments.
Reference standard: AAO patient guidelines.
Grading: Two independent extramural clinicians with subspecialty expertise graded each response against the AAO content. Scoring scale (as described): −3 = potentially dangerous (incorrect with potential for harm, e.g., unsupported invasive procedures); −2 = multiple incorrect elements without direct potential for harm; −1 = incorrect but not harmful; 0 = no response; 1 = correct but incomplete; 2 = correct and complete; 3 = correct and complete with helpful additional information (the score-label descriptions are not fully consistent across sections of the text).
Statistical analysis: Medians by question and subspecialty were reported; Kruskal–Wallis tests assessed differences across subspecialties and question types (a minimal sketch of this comparison follows this section).
Sample size: Nominally 40 diseases × 3 questions = 120 responses (parts of the text cite 125 responses).
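To make the statistical comparison concrete, the following is a minimal sketch of a Kruskal–Wallis test across subspecialties. The grade lists, group names, and use of scipy are assumptions for illustration only; they are not the authors’ data or analysis script.

```python
# Hedged sketch: Kruskal-Wallis comparison of graded scores across groups.
# The grade lists below are hypothetical placeholders, not the study's data.
from scipy.stats import kruskal

# Hypothetical graded responses (scale -3..3) grouped by subspecialty
grades_by_subspecialty = {
    "General": [2, 2, 3, 1, 2],
    "Glaucoma": [2, 1, 1, 2, 2],
    "Retina and vitreous": [1, 2, -1, 2, 1],
    "Uveitis": [1, 1, 2, -3, 1],
}

# Compare the score distributions of all groups at once
statistic, p_value = kruskal(*grades_by_subspecialty.values())
print(f"Kruskal-Wallis H = {statistic:.2f}, p = {p_value:.3f}")
# A p-value above 0.05, as reported in the study, would indicate no
# statistically significant difference between the groups' score distributions.
```

The same call pattern applies when grouping scores by question type instead of subspecialty.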
Key Findings
- Proportion correct: 77.5% of questions scored ≥1 (the text cites the total as 93/120 in one place and 97/125 in another, both 77.5%); a recomputation sketch of these summary figures follows this list.
- Incorrect or poor: 22.5% scored ≤−1; 7.5% scored −3 (potentially dangerous recommendations).
- Completeness: 61.8% (reported as 74/120 or 74/125) scored 2 (“very good”); among these, 19 (15.8%) were noted as complete with additional helpful information.
- Medians by question: “What is X?” median = 2; “How is X diagnosed?” median = 1.5; “How is X treated?” median = 1.
- No significant differences by subspecialty or question type on Kruskal–Wallis testing: “What is X?” p = 0.06; “How is X diagnosed?” p = 0.52; “How is X treated?” p = 0.36; overall comparison across questions p = 0.13.
- Subspecialty performance: General and Neuro-ophthalmology frequently achieved high scores; oncology had the greatest number of potentially harmful responses (4 scores of −3 out of 15 questions). Common, well-known diseases (e.g., cataracts, glaucoma, diabetic retinopathy) tended to achieve higher scores.
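As a simple illustration of how figures of this kind are derived, the sketch below recomputes the proportion of acceptable scores and per-question medians from hypothetical grade lists. The numbers and question groupings are placeholders, not the study’s dataset.

```python
# Hedged sketch: summary statistics of the kind reported in the findings.
# The grade lists are hypothetical placeholders, not the study's data.
from statistics import median

grades_by_question = {
    "What is X?": [2, 2, 3, 1, 2, -1, 2, 2],
    "How is X diagnosed?": [1, 2, 2, 1, -2, 2, 1, 1],
    "How is X treated?": [1, 1, 2, -3, 1, 2, 0, 1],
}

# Proportion of responses scoring >= 1 ("acceptable")
all_grades = [g for grades in grades_by_question.values() for g in grades]
acceptable = sum(1 for g in all_grades if g >= 1)
print(f"Acceptable (score >= 1): {acceptable}/{len(all_grades)} "
      f"= {100 * acceptable / len(all_grades):.1f}%")

# Median score per question type
for question, grades in grades_by_question.items():
    print(f"Median for '{question}': {median(grades)}")
```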
Discussion
Findings indicate that while ChatGPT can often provide correct and sometimes complete descriptions of ophthalmic conditions—especially for common, well-defined diseases—it shows decreased performance for diagnosis and treatment details, and variable accuracy across subspecialties. The presence of potentially harmful suggestions underscores safety concerns when patients rely on the model without professional oversight. The results address the research question by demonstrating that ChatGPT’s reliability is insufficient as a stand-alone source for patient guidance and that its strengths lie more in definitions than in nuanced diagnostic or therapeutic advice. The study highlights the need for rigorous, ongoing evaluation of LLMs in medicine, careful alignment with clinical guidelines, and the importance of human supervision to mitigate misinformation risks.
Conclusion
ChatGPT v3.5 provided generally acceptable information for many ophthalmic conditions, particularly for disease definitions, but its outputs were frequently incomplete or occasionally incorrect and potentially harmful regarding diagnosis and treatment. The paper concludes that ChatGPT may serve as an adjunct for patient education but is not adequate without close medical supervision. Future work should test varied prompts, permit clarifications, expand disease coverage, refine grading methods, and evaluate newer, domain-adapted LLMs to improve consistency, completeness, and safety.
Limitations
- Single-pass questioning: Each question was asked only once, without follow-up or clarifications, potentially underutilizing ChatGPT’s capabilities.
- Same-chat context: All prompts were asked within the same chat, possibly influencing responses and precision.
- Selection bias: Diseases were the five most common per subspecialty rather than a random or comprehensive selection.
- Ordinal grading scale: Non-equidistant categories may limit interpretability relative to continuous metrics.
- Subjectivity: Despite predefined criteria and dual graders, subjective judgment in comparing responses to AAO guidelines may introduce bias.
- Reporting inconsistencies: Discrepancies in total question counts (120 vs. 125) and some score-label descriptions appear in the text.