Introduction
The study explores the application of ChatGPT, a state-of-the-art natural language processing (NLP) model, to translating complex radiology reports into plain language. The rapid growth of ChatGPT's user base (over 100 million users within two months of launch) highlights its potential. ChatGPT's advances over earlier models such as BERT, XLNet, and GPT include a larger model size, chain-of-thought prompting, and reinforcement learning from human feedback (RLHF). Existing research has explored ChatGPT's use in various fields, including literature reviews, medical education, and language translation. This study focuses specifically on its application in the clinical setting for patient education and improved healthcare communication. The challenge lies in translating the technical jargon of radiology reports into patient-friendly language, which can reduce anxiety and improve treatment compliance. The researchers aim to assess ChatGPT's performance and suggest potential improvements, including a comparison with the newer GPT-4 model.
Literature Review
Several studies have examined the use of ChatGPT in healthcare. Patel et al. explored using ChatGPT to draft patient discharge summaries, and Biswas investigated its applications in medical writing tasks spanning patient care and publication. Jeblick et al. evaluated the accuracy and safety of ChatGPT-simplified radiology reports, Rao et al. investigated its use as an adjunct for radiology decision-making, and Sarraju et al. studied its ability to provide appropriate cardiovascular disease prevention recommendations. Together, these studies demonstrate growing interest in applying large language models to clinical tasks, but further research is needed to fully understand their capabilities and limitations.
Methodology
The researchers collected 62 de-identified chest CT screening reports and 76 brain MRI screening reports from the Atrium Health Wake Forest Baptist clinical database. The chest CT reports came from patients aged 53-80 years (mean 66.9 years; 32 male, 30 female), and the brain MRI reports from patients aged 5-98 years (mean 55.0 years; 45 male, 31 female). Three prompts were used to test ChatGPT: 1) translate the report into plain language; 2) provide suggestions for the patient; and 3) provide suggestions for the healthcare provider. Two experienced radiologists rated the quality of the translated reports on a 5-point scale (1 = worst, 5 = best), assessing completeness and correctness, and counted instances of missing or incorrect information. The suggestions were evaluated for relevance and appropriateness. The process was repeated with GPT-4 for comparison. To investigate randomness, 10 translations of the same report were generated with both a vague prompt and an optimized prompt that gave detailed instructions on formatting and what information to include. The effect of prompt engineering was also examined by varying the prompt to reflect different education levels of the intended audience and by using prompts generated by ChatGPT itself or by a prompt engineering tool. Finally, ensemble learning was explored by combining multiple ChatGPT translations into a single report.
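The study does not publish its exact API calls or prompt wording, but the workflow can be sketched as follows. In this illustrative Python example, the model names, the prompt text, and the use of the OpenAI chat API are assumptions that paraphrase the three tasks and the vague-versus-optimized prompt comparison described above.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Vague prompt, as contrasted in the study with a detailed, "optimized" one.
VAGUE_PROMPT = "Translate this radiology report into plain language."

# Illustrative optimized prompt: explicit instructions on formatting and on
# which information must be preserved (wording is assumed, not the study's).
OPTIMIZED_PROMPT = (
    "Translate the following radiology report into plain language that a patient "
    "with no medical background can understand. Keep every finding and measurement "
    "from the original report, explain medical terms in common words, and organize "
    "the answer into sections for findings and overall impression."
)


def ask(prompt: str, report: str, model: str = "gpt-3.5-turbo") -> str:
    """Send one prompt plus the report text to the chat model and return the reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"{prompt}\n\n{report}"}],
    )
    return response.choices[0].message.content


def process_report(report: str, model: str = "gpt-3.5-turbo") -> dict:
    """Run the three tasks from the study on a single de-identified report."""
    return {
        "plain_language": ask(OPTIMIZED_PROMPT, report, model),
        "patient_suggestions": ask(
            "Based on this report, provide suggestions for the patient.", report, model
        ),
        "provider_suggestions": ask(
            "Based on this report, provide suggestions for the healthcare provider.",
            report,
            model,
        ),
    }
```

Calling ask repeatedly on the same report (for example, 10 times) probes the response randomness examined in the study, and passing model="gpt-4" reruns the comparison with the newer model.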
Key Findings
ChatGPT generated plain-language versions of the radiology reports that were generally shorter than the originals, with length reductions of 26.7% for chest CT and 21.1% for brain MRI. The translations were considered patient-friendly, replacing medical jargon with common words and integrating information from different sections of the original reports. Radiologist evaluation showed that ChatGPT performed well, with an average overall score of 4.27 and, on average, only 0.08 instances of missing information and 0.07 instances of misinformation per report. ChatGPT provided relevant suggestions, offering specific suggestions tied to the report findings in 37% of cases. However, ChatGPT exhibited randomness in its responses, sometimes oversimplifying or omitting information. Using a detailed prompt significantly improved translation quality, increasing the rate of complete translations from 55.2% to 77.2%. GPT-4 significantly outperformed ChatGPT, achieving a nearly 100% rate of good translations with the optimized prompt. Ensemble learning did not consistently improve results, indicating that prompt optimization is the more effective strategy. When prompts lacked clear instructions, varying their wording had little effect on translation quality.
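For readers who want to compute similar summary metrics on their own data, the sketch below shows one plausible way to measure length reduction and the rate of acceptable translations. The study's exact definitions are not reproduced here, and the scores in the example are hypothetical, not study data.

```python
def length_reduction_pct(original: str, translation: str) -> float:
    """Percent reduction in word count from the original report to its translation."""
    original_words = len(original.split())
    translated_words = len(translation.split())
    return 100.0 * (original_words - translated_words) / original_words


def acceptable_rate_pct(scores: list[int], threshold: int = 4) -> float:
    """Share of translations scoring at or above a threshold on the 5-point scale.

    The threshold rule is an assumption for illustration; the study's criterion
    for a "complete" translation may differ.
    """
    return 100.0 * sum(score >= threshold for score in scores) / len(scores)


# Hypothetical radiologist scores for translations from a vague vs. an optimized prompt.
vague_scores = [3, 4, 5, 3, 4, 2, 5, 4]
optimized_scores = [5, 4, 5, 5, 4, 4, 5, 5]
print(acceptable_rate_pct(vague_scores), acceptable_rate_pct(optimized_scores))
```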
Discussion
The study demonstrates ChatGPT's potential for translating complex medical reports into accessible language, but it also highlights the need for careful prompt engineering to ensure accuracy and completeness. The randomness of ChatGPT's responses raises concerns about reliability and underscores the importance of clear, specific instructions. The lack of a built-in template for report generation also hurts consistency. GPT-4's superior performance reflects the rapid advancement of large language models. While the approach shows promise, limitations such as incomplete translations and inconsistent responses must be addressed before widespread clinical adoption. The authors also note that safety is easier to demonstrate for communication aids than for tools that directly influence diagnosis or treatment.
Conclusion
This study demonstrates the feasibility of using ChatGPT and GPT-4 to translate radiology reports into plain language. While promising, the approach has limitations related to response randomness and potential oversimplification. Optimized prompts significantly improve translation quality, and GPT-4 offers a substantial improvement over ChatGPT. Future work should focus on addressing these limitations and ensuring reliability before widespread clinical implementation. Large language models hold immense potential in healthcare, assisting with report generation, treatment planning, patient guidance, and psychological support.
Limitations
The study's limitations include a relatively small sample size and the use of de-identified data, which may limit the generalizability of the findings. The evaluation relied on only two radiologists, restricting the range of perspectives considered. Further research with larger datasets and broader evaluation is needed to fully assess the clinical utility and safety of these tools.