ZERO-SHOT INFORMATION EXTRACTION FROM RADIOLOGICAL REPORTS USING CHATGPT


D. Hu, B. Liu, et al.

This study investigates ChatGPT's ability to perform zero-shot information extraction from radiological reports. Conducted by Danqing Hu, Bing Liu, Xiaofeng Zhu, Xudong Lu, and Nan Wu, the research shows that ChatGPT achieves competitive performance on CT reports while also revealing its current limitations.

Introduction
The increasing volume of free-text data in electronic health records (EHRs), such as radiological reports, hinders efficient secondary use of medical data. Manual extraction of structured information is time-consuming and error-prone. Information extraction (IE) aims to automate this process, typically involving subtasks like named entity recognition (NER), coreference resolution, and relation extraction. Traditional IE systems rely on rule-based methods, which are difficult to maintain and update, or machine learning methods that require large annotated datasets. Large language models (LLMs) like ChatGPT offer a potential solution by performing well on various NLP tasks without parameter tuning. This study investigates the use of ChatGPT for zero-shot IE, aiming to extract information from radiological reports without the need for training data.
Literature Review
The authors review existing information extraction methods, dividing them into rule-based and machine learning approaches. Rule-based systems utilize regular expressions to identify patterns, but often suffer from low precision due to diverse linguistic expressions. Machine learning methods, including those using deep learning and pre-trained models like BERT, offer improvements but require substantial annotated data. The emergence of LLMs like GPT-3 and ChatGPT, capable of performing various downstream NLP tasks without fine-tuning, suggests a new paradigm for IE. This research builds upon previous work using multi-turn question answering (MTQA) for IE, which is used here as a baseline for comparison.
Methodology
The study used 847 CT reports of lung cancer patients from Peking University Cancer Hospital. Eleven lung cancer-related questions were defined, and two engineers with medical informatics backgrounds manually labeled the answers, with a clinician reviewing the labels to create a gold standard. A three-part prompt template was designed for ChatGPT. The first part included the original CT report; the second part contained an IE instruction and an unfilled question form; and the third part had supplementary requirements (defaulting to 'False' if information was not mentioned). The prompts were generated by combining the template with each CT report. ChatGPT's responses were processed using a rule-based post-processing module with regular expressions to structure the extracted information. A baseline MTQA IE system developed in prior work, using BERT for multi-turn question answering, was used for comparison. To investigate the effect of instructions, a modified prompt template (prompt-PMK) incorporating medical knowledge was created and tested. The consistency of ChatGPT's responses was evaluated by repeating the process three times on 100 randomly selected reports.
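To make the pipeline concrete, the following is a minimal Python sketch of how such a three-part prompt and the rule-based post-processing might look. The instruction wording, the question-form fields, and the regular expressions are illustrative assumptions, not the authors' exact template.

```python
# Minimal sketch of the three-part prompt and rule-based post-processing.
# The exact wording, question form, and matching rules are assumptions,
# not the authors' verbatim template.
import re

QUESTION_FORM = """\
1. Tumor location:
2. Tumor diameter (cm):
3. Lobulation (True/False):
4. Spiculation (True/False):
5. Pleural invasion (True/False):
"""

def build_prompt(ct_report: str) -> str:
    # Part 1: the original CT report text.
    # Part 2: the IE instruction plus an unfilled question form.
    # Part 3: supplementary requirement (default to 'False' when not mentioned).
    return (
        f"CT report:\n{ct_report}\n\n"
        "Please extract the information below from the report and fill in the question form.\n"
        f"{QUESTION_FORM}\n"
        "If an item is not mentioned in the report, answer 'False'."
    )

def post_process(response: str) -> dict:
    # Rule-based structuring of the free-text answer with regular expressions.
    fields = {
        "location": r"location\s*:\s*(.+)",
        "diameter": r"diameter[^:]*:\s*([\d.]+)",
        "lobulation": r"lobulation[^:]*:\s*(true|false)",
        "spiculation": r"spiculation[^:]*:\s*(true|false)",
        "pleural_invasion": r"pleural invasion[^:]*:\s*(true|false)",
    }
    result = {}
    for name, pattern in fields.items():
        match = re.search(pattern, response, re.IGNORECASE)
        result[name] = match.group(1).strip() if match else "False"
    return result
```

In this sketch, `build_prompt` is applied to each of the 847 reports and `post_process` turns ChatGPT's free-text reply into a structured record that can be compared against the gold standard.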
Key Findings
ChatGPT demonstrated competitive performance on certain extraction tasks compared to the MTQA baseline, particularly for simple questions such as tumor location, diameter, lobulation, spiculation, and pleural invasion. Performance was weaker for questions concerning tumor density and lymph node status. Error analysis revealed several issues: difficulty distinguishing between similar terms (e.g., 'streaky shadow' and 'spiculation'); inability to handle multiple tumors described with different densities; and inconsistent interpretation of terms like "mediastinal group 10 lymph nodes." Using prompt-PMK, which incorporated prior medical knowledge, improved accuracy on the spiculation, lobulation, and pleural invasion questions. However, this improvement did not extend to questions about tumor density and lymph node status, and it even degraded recall for mediastinal lymph node status because multiple lymph node descriptions were processed inconsistently. Analysis of the consistency of ChatGPT's responses showed higher consistency for questions about tumor location, diameters, pleural invasion, and lymph node status than for those about tumor density, spiculation, and lobulation.
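The consistency analysis can be pictured as follows: each report is processed several times, and a question counts as consistent on a report when every run returns the same structured answer. The sketch below shows one way such a per-question agreement rate might be computed; the exact metric used in the paper may differ.

```python
# Sketch: per-question consistency across repeated runs on the same reports.
# Assumes each run yields one structured dict per report (e.g. from post_process);
# consistency here is the fraction of reports whose answers agree across all runs,
# which may not be the precise definition used in the paper.
def per_question_consistency(runs: list[list[dict]]) -> dict:
    # runs: e.g. [run1, run2, run3], each a list of dicts aligned by report index
    questions = runs[0][0].keys()
    n_reports = len(runs[0])
    consistency = {}
    for q in questions:
        agree = sum(
            1 for i in range(n_reports)
            if len({run[i].get(q) for run in runs}) == 1
        )
        consistency[q] = agree / n_reports
    return consistency
```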
Discussion
The study demonstrates the potential of LLMs for zero-shot IE in radiology reports. ChatGPT's ability to extract information from unstructured text without training data is a significant advantage over traditional methods. Its success on simpler questions highlights its capacity to understand and reason, albeit with limitations in handling ambiguity and complex relationships between concepts. The inconsistent performance and the challenges with nuanced medical terminology suggest that prompt engineering and the incorporation of medical knowledge, while beneficial for some questions, do not always guarantee improvements and can even reduce accuracy. ChatGPT's inconsistent output also calls for further research into constraining the output format to obtain better structure.
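One possible direction for constraining the output format, not implemented in the paper, is to request a fixed JSON structure and validate the response before accepting it, re-prompting on failure. The sketch below illustrates this idea using a hypothetical `ask_model` callable that sends a prompt to the LLM and returns its text response.

```python
# Sketch of one way to constrain the output format: ask for JSON and validate it,
# re-asking on failure. Illustrative only; not a method used in the paper.
import json

REQUIRED_KEYS = {"location", "diameter", "lobulation", "spiculation", "pleural_invasion"}

def parse_or_none(response: str):
    # Accept the response only if it is valid JSON with all required keys.
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        return None
    return data

def extract_with_retry(ask_model, report: str, max_tries: int = 3):
    # `ask_model` is a hypothetical callable wrapping the LLM API.
    prompt = (
        f"CT report:\n{report}\n\n"
        "Return ONLY a JSON object with the keys "
        f"{sorted(REQUIRED_KEYS)}; use \"False\" for items not mentioned."
    )
    for _ in range(max_tries):
        parsed = parse_or_none(ask_model(prompt))
        if parsed is not None:
            return parsed
    return None
```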
Conclusion
This research demonstrated the application of ChatGPT to zero-shot IE from radiological reports, achieving competitive results in some respects. Incorporating medical knowledge into prompts improved performance on some questions but hurt it on others. Future work should focus on improving ChatGPT's ability to handle complex relationships, improving output consistency, and addressing the privacy concerns associated with processing sensitive medical data, possibly through on-site deployment of smaller, more specialized models.
Limitations
The study's limitations include the reliance on a single LLM (ChatGPT), the potential for bias in the manually created gold standard, and the inherent variability in LLM responses. The limited scope of questions and the specific focus on lung cancer may also limit the generalizability of findings. The potential risks of data leakage associated with using a publicly available LLM also need consideration.