
Medicine and Health
Translating Radiology Reports into Plain Language Using ChatGPT and GPT-4 with Prompt Learning: Promising Results, Limitations, and Potential
Q. Lyu, J. Tan, et al.
Discover how innovative research by Qing Lyu, Josh Tan, Michael E Zapadka, Janardhana Ponnatapura, Christopher T Whitlow, Chuang Niu, Ge Wang, and Kyle J Myers explores the power of ChatGPT and GPT-4 in translating complex radiology reports into clear, understandable language. This groundbreaking study reveals impressive results while addressing the challenges of AI-generated medical communication.
~3 min • Beginner • English
Introduction
The study addresses whether large language models, specifically ChatGPT, can accurately and usefully translate radiology reports into plain language for patients and healthcare providers. Contextually, ChatGPT (built on GPT-3.5 with RLHF and prompt-based reasoning) has demonstrated strong performance in general NLP tasks and is being explored for clinical applications. Prior works have used ChatGPT for medical writing, education, report simplification, and decision support. Radiology reports, which often contain dense medical terminology, can be difficult for patients to interpret. The purpose of this study is to evaluate ChatGPT’s ability to produce concise, clear, and comprehensive lay translations of chest CT and brain MRI screening reports, to generate actionable suggestions for patients and providers, to analyze the variability of outputs, to examine the impact of prompt design (including an optimized prompt), and to compare performance with GPT-4. The importance lies in improving patient understanding, reducing anxiety, enhancing adherence, and informing potential safe, effective uses of LLMs in clinical communication.
Literature Review
The paper cites prior work on foundational NLP models (BERT, XLNet, GPT) and the development of ChatGPT with RLHF. It references early explorations of ChatGPT in medical contexts: generating discharge summaries, medical writing and documentation, simplifying radiology reports, adjunct radiology decision-making, and cardiovascular prevention guidance. These studies generally found that ChatGPT can produce appropriate or factually correct outputs for certain tasks, though concerns about reliability and safety remain. The literature positions the current study within efforts to assess ChatGPT’s clinical utility, particularly for patient-facing reformulations of technical reports, and motivates comparison with the newer GPT-4 model.
Methodology
Data: The authors collected 62 low-dose chest CT lung cancer screening reports and 76 brain MRI metastasis screening reports from the Atrium Health Wake Forest Baptist database, dated February 1–13, 2023. Reports were de-identified. Chest CT: ages 53–80 (mean 66.9), 32 male/30 female, finalized by 11 radiologists, 278 ± 57 words, categorized by Lung-RADS (1, 1S, 2, 2S, 3, 4A). Brain MRI: ages 5–98 (mean 55.0), 45 male/31 female, finalized by 14 radiologists, 247 ± 92 words, categorized as no metastases, stable, or worsening.
Experimental design: Each report was processed with three prompts to ChatGPT: (1) translate the report into plain language; (2) provide suggestions for the patient; (3) provide suggestions for the healthcare provider. Responses were collected in mid-February. Additional experiments assessed variability by generating 10 translations for the same report and analyzing 25 predefined key information points (labeled as Good, Missing, Inaccurate, Incorrect). An optimized prompt specifying structure and content requirements (paragraph-wise instructions including screening context, explicit findings with nodule positions/sizes and other structures, conclusions with Lung-RADS and follow-up, and incidental findings) was tested with another 10 translations. Further prompt variants were evaluated (targeting different education levels, self-designed prompt by ChatGPT, and a prompt from PromptPerfect). Ensemble learning was explored by combining 5 translations into a single output using ChatGPT. Finally, the same tasks were run using GPT-4 with both the original and optimized prompts.
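The paper does not publish code, but the prompting workflow described above could plausibly be reproduced along the following lines with the OpenAI Python SDK (v1.x). The prompt texts below paraphrase the paper's description of the basic and optimized prompts; they are illustrative assumptions, not the authors' exact wording.

```python
# Hypothetical reproduction of the study's prompting workflow (OpenAI SDK v1.x).
# Prompt wording is paraphrased from the paper's description, not the authors' prompts.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Vague baseline prompt, as in the original-prompt experiments.
BASIC_PROMPT = "Please translate the following radiology report into plain language:\n\n{report}"

# Structured variant paraphrasing the paper's "optimized" instructions:
# screening context, explicit findings (nodule positions/sizes, other structures),
# conclusion with Lung-RADS category and follow-up, and incidental findings.
OPTIMIZED_PROMPT = (
    "Translate the following chest CT lung cancer screening report into plain "
    "language for a patient, using four paragraphs: (1) the purpose of the screening "
    "exam; (2) all findings, keeping every nodule's position and size and mentioning "
    "the other examined structures; (3) the conclusion, including the Lung-RADS "
    "category and the recommended follow-up; (4) any incidental findings.\n\n"
    "Report:\n{report}"
)

def translate(report: str, prompt_template: str, model: str = "gpt-3.5-turbo") -> str:
    """Send one report through the chat completions endpoint and return the translation.

    "ChatGPT" in the paper roughly corresponds to the gpt-3.5-turbo API model;
    pass model="gpt-4" for the GPT-4 runs.
    """
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt_template.format(report=report)}],
    )
    return response.choices[0].message.content

def suggestions(report: str, audience: str, model: str = "gpt-3.5-turbo") -> str:
    """Ask for actionable suggestions for either the patient or the healthcare provider."""
    prompt = f"Based on the following radiology report, provide suggestions for the {audience}:\n\n{report}"
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

The point of the structured variant is to pin down paragraph order and the details that must be preserved, which is what the study found most effective for reducing omissions.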
Evaluation: Two experienced radiologists (21 and 8 years) assessed translations for completeness (count of missing information points), correctness (count of incorrect/misinterpreted points), and an overall quality score (1–5). Suggestion quality was summarized by frequency of general themes, percent of specific suggestions tied to findings, and percent of inappropriate suggestions. Statistical summaries were computed across chest CT and brain MRI cohorts. Comparative analyses were made across prompts, ensemble vs. non-ensemble, and model type (ChatGPT vs GPT-4).
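The completeness and correctness figures reported below are straightforward per-report averages and score distributions. As a rough illustration (with hypothetical field names and toy values, not the study's data), the aggregation could look like this:

```python
# Minimal sketch of the metric aggregation, assuming radiologist ratings are
# stored as simple dicts; field names and values here are illustrative only.
from collections import Counter
from statistics import mean

ratings = [  # one dict per report: counts from the radiologist review
    {"missing": 0, "incorrect": 0, "overall": 5},
    {"missing": 1, "incorrect": 0, "overall": 4},
    {"missing": 0, "incorrect": 1, "overall": 4},
]

avg_missing = mean(r["missing"] for r in ratings)      # missing information points per report
avg_incorrect = mean(r["incorrect"] for r in ratings)  # incorrect/misinterpreted points per report
score_dist = Counter(r["overall"] for r in ratings)    # distribution of 1-5 overall scores

print(f"missing/report: {avg_missing:.3f}, incorrect/report: {avg_incorrect:.3f}")
print({score: f"{100 * n / len(ratings):.0f}%" for score, n in sorted(score_dist.items())})

# For the variability experiments, each of 10 repeated translations is labelled on
# 25 key information points; percentages are computed over all 250 labels.
labels = ["Good"] * 14 + ["Missing"] * 5 + ["Inaccurate"] * 6  # one translation, 25 points
label_pct = {k: 100 * v / len(labels) for k, v in Counter(labels).items()}
```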
Key Findings
- Length reduction and readability: ChatGPT translations were generally shorter and more patient-friendly. Chest CT: 85.5% (53/62) were shorter with an overall 26.7% reduction; by Lung-RADS: 1 (−20.5%), 2 (−29.0%), 2S (−29.0%), 3 (−54%), 4A (−29.4%); 1S increased length by 13.3%. Brain MRI: 72.4% (55/76) were shorter with overall 21.1% reduction; “stable” −13.1%, “worsening” −34.1%; “no metastases” had a slight 1.8% increase.
- Translation quality (radiologist evaluation): Chest CT: average 0.097 missing and 0.032 incorrect information points per report; 76% scored 5/5. Brain MRI: 0.066 missing and 0.092 incorrect per report; 37% scored 4/5 and 32% scored 5/5. Overall across all reports: 0.080 missing and 0.065 incorrect per report; average overall score 4.268, with 27% scoring 4 and 52% scoring 5.
- Suggestions content: Suggestions were generally relevant, e.g., patients to follow up with doctors and maintain healthy lifestyle; providers to communicate findings clearly and schedule follow-ups. About 37% of cases included specific suggestions tied to particular findings (e.g., manage/evaluate sinus symptoms when paranasal sinus disease was noted).
- Randomness and errors with vague prompts: In 10 repeated translations of one chest CT report analyzed across 25 key points, Good 55.2%, Missing 19.2%, Inaccurate 24.8%, Incorrect 0.8%. Typical issues included omission of detailed nodule positions/sizes and failure to mention “no new nodules,” and misinterpretation of “30 pack-years” as 30 years.
- Optimized prompt improved completeness: With the detailed prompt, in 10 repeats Good increased to 77.2%, Missing dropped to 9.2%, Inaccurate to 13.6%, Incorrect to 0%.
- Alternative prompts: Five semantically similar prompts (including varying education levels and prompts from ChatGPT/PromptPerfect) yielded performance similar to the original prompt and worse than the optimized prompt; the self-designed prompt by ChatGPT performed slightly better than the others but still inferior to the optimized prompt.
- Ensemble learning: Combining 5 translations modestly improved results for the original prompt (Good +6.4%, Missing −0.8%, Inaccurate −4.8%, Incorrect −0.8%) but did not outperform the gains from the optimized prompt; with the optimized prompt, ensemble slightly degraded performance (Good −4.4%, Missing +1.2%, Inaccurate +3.2%, Incorrect 0%). A sketch of this merge step appears after this list.
- GPT-4 vs ChatGPT: GPT-4 significantly improved translation quality. With the original prompt: Good 73.6%, Missing 8.0%, Inaccurate 18.4%, Incorrect 0%. With the optimized prompt: Good 96.8%, Missing 1.6%, Inaccurate 1.6%, Incorrect 0%. GPT-4 with the original prompt was competitive with ChatGPT using the optimized prompt; with the optimized prompt it approached near-perfect performance, with minor formatting variance observed in one case.
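The ensemble step amounts to generating several independent translations of the same report and then asking the model to merge them into one. A sketch under that assumption follows; the merge instruction is paraphrased, not the authors' exact prompt.

```python
# Hedged sketch of the ensemble step: generate n independent translations,
# then ask the model itself to combine them into a single output.
from openai import OpenAI

client = OpenAI()

def ensemble_translation(report: str, n: int = 5, model: str = "gpt-3.5-turbo") -> str:
    # Generate n independent plain-language translations of the same report.
    drafts = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content":
                       f"Translate this radiology report into plain language:\n\n{report}"}],
        )
        drafts.append(resp.choices[0].message.content)

    # Ask the model to merge the drafts, keeping correct details and omitting nothing important.
    joined = "\n\n---\n\n".join(f"Translation {i + 1}:\n{d}" for i, d in enumerate(drafts))
    merge_prompt = (
        "Combine the following plain-language translations of the same radiology report "
        "into one translation that keeps all correct details and omits nothing important:\n\n"
        f"{joined}"
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": merge_prompt}],
    )
    return resp.choices[0].message.content
```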
Discussion
The findings demonstrate that ChatGPT can meaningfully translate radiology reports into plain language, addressing the core question of feasibility and utility for patient education and provider communication. Translations were shorter, clearer, and integrated information across sections, which can reduce patient anxiety and improve comprehension and adherence. However, variability and occasional omissions or misinterpretations were observed, particularly with generic prompts, underscoring the importance of precise, structured prompt design to guide completeness and formatting. The optimized prompt materially improved information preservation and consistency, while semantically similar generic prompts did not. GPT-4’s superior performance indicates rapid model advancement and suggests that newer LLMs may meet higher reliability thresholds for clinical communication support. Ensemble aggregation offered limited benefit compared to prompt optimization. Overall, the results support the potential of LLMs as tools to enhance communication of imaging findings, with careful prompt engineering and oversight to ensure accuracy, completeness, and consistent formatting suitable for patient use.
Conclusion
This study evaluated ChatGPT and GPT-4 for translating chest CT and brain MRI screening radiology reports into plain language and generating patient/provider suggestions. ChatGPT achieved an average overall score of 4.268/5 with low average counts of missing (0.080) and incorrect (0.065) information points per report, but exhibited variability and occasional oversimplification when guided by a vague prompt. A structured, optimized prompt improved completeness (Good 77.2% vs 55.2%) and eliminated incorrectness in repeated trials. GPT-4 further improved performance, approaching near-perfect Good rates with the optimized prompt. The work highlights LLMs’ promise for clinical education and communication. Future work should explore standardized templates, automated checks for completeness and factual consistency, integration into clinical workflows with human oversight, prospective studies of patient comprehension and outcomes, and broader validation across modalities, institutions, and languages.
Limitations
Key limitations include: (1) Variability and uncertainty in LLM outputs, leading to inconsistent formatting and occasional omissions or misinterpretations, especially with generic prompts; (2) Incomplete preservation of detailed findings (e.g., precise nodule positions and sizes) under vague instructions; (3) Single-institution dataset and limited time window (early February) may limit generalizability; (4) Evaluation limited to two radiologists and focused on text translations, not patient comprehension or clinical outcomes; (5) Some analyses (e.g., ensemble learning) showed limited benefit; (6) GPT-4, while markedly better, still exhibited minor randomness/format deviations; (7) Study addressed plain-language translation and suggestions, not end-to-end diagnostic decision-making or safety/risk in clinical deployment.