ZERO-SHOT INFORMATION EXTRACTION FROM RADIOLOGICAL REPORTS USING CHATGPT

Computer Science

D. Hu, B. Liu, et al.

This study investigates ChatGPT's ability to perform zero-shot information extraction from radiological reports. Conducted by Danqing Hu, Bing Liu, Xiaofeng Zhu, Xudong Lu, and Nan Wu, the research demonstrates competitive performance on CT reports, highlighting both the efficacy and the existing limitations of the approach.
Introduction

The study addresses the challenge of converting free-text medical documents (e.g., radiological reports) into structured data suitable for analysis. Manual extraction is time-consuming, error-prone, and expensive. Traditional IE pipelines involve subtasks such as named entity recognition, coreference resolution, relation extraction, and post-processing, implemented via rule-based or machine learning approaches. Although deep learning and pre-trained models (e.g., BERT) improve IE performance, they still require large annotated corpora. Recent LLMs (GPT-3/ChatGPT) demonstrate strong zero-/few-shot performance via prompting, suggesting a new paradigm for IE. The research question is whether ChatGPT can perform zero-shot IE on CT reports for lung cancer, and how prompt design and post-processing affect performance.

Literature Review

Prior work spans rule-based IE systems (e.g., regular expressions and domain rules), which can achieve broad coverage but suffer from brittleness and maintenance challenges, and machine learning approaches that learn semantic patterns from data (e.g., CRF, SVM, logistic regression) using hand-crafted features. Deep learning replaced manual features with embeddings (word2vec, GloVe) and RNNs, followed by pre-trained transformers (e.g., BERT) that require task-specific fine-tuning and annotated data. In clinical domains, systems such as cTAKES and MetaMap have been widely applied, and numerous studies have tackled NER and relation extraction in oncology and radiology. More recently, LLMs (GPT-3, ChatGPT) have shown strong zero-/few-shot capabilities via prompting for clinical IE and QA, motivating investigation of zero-shot IE from radiological reports without parameter tuning or annotations.

Methodology

Data: 847 preoperative chest CT reports of lung cancer patients (2010–2018) from the Department of Thoracic Surgery II, Peking University Cancer Hospital. Eleven lung cancer–related questions (covering tumor location; size—long/short diameters; density—solid, ground-glass, mixed ground-glass; spiculation; lobulation; pleural invasion/indentation; mediastinal and hilar lymph node enlargement) were defined under clinician guidance. Two medical informatics engineers annotated answers, with clinician adjudication to form the gold standard. Ethics approval: 2022KT128.

Zero-shot QA framework: A prompt template was designed with three parts: (I) the original CT report; (II) IE instructions with an unfilled question form for the 11 items; (III) supplementary requirements (e.g., default False/“No” if not mentioned). For each CT report, a new ChatGPT dialogue was started via the official web interface, and responses were collected. Only the filled question form portion of the response was extracted.
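The three-part template described above can be sketched as a simple string-assembly function. This is an illustrative reconstruction, not the authors' exact wording: the question phrasings and section labels below are assumptions.

```python
# Hypothetical reconstruction of the three-part zero-shot IE prompt:
# (I) the original CT report, (II) an unfilled 11-item question form,
# (III) supplementary requirements (default "No" when not mentioned).
QUESTIONS = [
    "1. Tumor location:",
    "2. Tumor long diameter (mm):",
    "3. Tumor short diameter (mm):",
    "4. Is the tumor solid? (Yes/No):",
    "5. Is the tumor ground-glass? (Yes/No):",
    "6. Is the tumor mixed ground-glass? (Yes/No):",
    "7. Spiculation present? (Yes/No):",
    "8. Lobulation present? (Yes/No):",
    "9. Pleural invasion/indentation? (Yes/No):",
    "10. Mediastinal lymph node enlargement? (Yes/No):",
    "11. Hilar lymph node enlargement? (Yes/No):",
]

def build_prompt(report_text: str) -> str:
    """Assemble the three-part prompt for one CT report."""
    part1 = f"CT report:\n{report_text}"                          # (I) report
    part2 = ("Read the report and fill in the question form below:\n"
             + "\n".join(QUESTIONS))                              # (II) form
    part3 = ("Requirements: if an item is not mentioned in the "
             "report, answer No.")                                # (III) defaults
    return "\n\n".join([part1, part2, part3])
```

In the study, each report was submitted in a fresh ChatGPT dialogue, so a prompt builder like this would be invoked once per report with no conversational carry-over.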

Post-processing: A rule-based module with regular expressions converted free-form responses into structured outputs. Tumor location keywords (right/left, upper/middle/lower) were normalized to categorical lobes (right upper/middle/lower; left upper/lower; others). Diameters were parsed as floats with units normalized to millimeters. Density questions were enforced to be mutually exclusive using a predefined priority. Boolean items were determined via keywords (e.g., yes/no), with “Not mentioned” treated as False.
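The post-processing rules can be approximated with a few regular expressions. The keyword lists and parsing logic below are assumptions standing in for the authors' exact rules; they illustrate the normalization steps (lobe mapping, unit conversion to millimeters, and the "Not mentioned" → False convention).

```python
import re
from typing import Optional

# Illustrative rule-based post-processing for free-form answers; keyword
# sets are assumptions, not the study's exact rules.
LOBES = {
    ("right", "upper"): "right upper lobe",
    ("right", "middle"): "right middle lobe",
    ("right", "lower"): "right lower lobe",
    ("left", "upper"): "left upper lobe",
    ("left", "lower"): "left lower lobe",
}

def normalize_location(text: str) -> str:
    """Map side/part keywords to one of five lobes, else 'others'."""
    low = text.lower()
    side = "right" if "right" in low else "left" if "left" in low else None
    if side:
        for part in ("upper", "middle", "lower"):
            if part in low:
                return LOBES.get((side, part), "others")
    return "others"

def parse_diameter_mm(text: str) -> Optional[float]:
    """Parse the first number in the answer, normalizing cm to mm."""
    m = re.search(r"(\d+(?:\.\d+)?)\s*(cm|mm)?", text.lower())
    if not m:
        return None
    value = float(m.group(1))
    return value * 10 if m.group(2) == "cm" else value

def parse_boolean(text: str) -> bool:
    """Boolean items: 'Not mentioned' and 'No' both map to False."""
    return bool(re.search(r"\byes\b", text.lower()))
```

Density mutual exclusivity (not shown) would then be enforced on top of these booleans via the predefined priority ordering mentioned above.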

Baseline: A multi-turn question answering (MTQA) IE system from prior work (BERT-based) served as the baseline. It first extracted head entities, then generated tail-entity questions for a second QA round, followed by its own post-processing.

Prompt engineering: Common ChatGPT extraction errors were diagnosed (e.g., confusing streaky shadows with spiculation; irregular margins with lobulation; pleural thickening with invasion; lymph node group naming). A new prompt template with prior medical knowledge (Prompt-PMK) added explicit instructions: (a) extract the first described tumor; (b) enforce mutual exclusivity among density types; (c) clarify that streaky shadow ≠ spiculation; irregular shape ≠ lobulation; pleural thickening ≠ invasion; fissure invasion counts as pleural involvement; group 10 nodes are hilar nodes.

Consistency study: 100 CT reports were randomly selected, each prompted three times with the base template to assess output consistency across runs.
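One simple way to score per-item consistency in such a setup is the fraction of reports for which all three runs return the same answer. The metric below is a plausible sketch, not necessarily the measure used in the study.

```python
# Hypothetical consistency metric: proportion of reports where every run
# agrees on a given item, given one aligned answer list per run.
def consistency_rate(runs):
    """runs: list of per-run answer lists, aligned by report index."""
    n_reports = len(runs[0])
    agree = sum(
        1 for i in range(n_reports)
        if all(run[i] == runs[0][i] for run in runs[1:])
    )
    return agree / n_reports

# Toy usage: three runs over four reports, with one disagreement.
runs = [
    ["Yes", "No", "Yes", "No"],
    ["Yes", "No", "Yes", "No"],
    ["Yes", "No", "No",  "No"],
]
print(consistency_rate(runs))  # 0.75
```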

Key Findings

Overall zero-shot performance: On 847 CT reports, ChatGPT achieved competitive IE performance on several items versus the BERT-based MTQA baseline without any fine-tuning or labeled examples.

Representative ChatGPT metrics (Accuracy / Precision / Recall / F1):

  • Tumor location: 0.985 / 0.951 / 0.987 / 0.966 (MTQA: 0.995 / 0.997 / 0.990 / 0.993)
  • Tumor long diameter: 0.960 / 0.960 / 1.000 / 0.980 (MTQA: 0.957 / 0.957 / 1.000 / 0.978)
  • Tumor short diameter: 0.953 / 0.953 / 1.000 / 0.976 (MTQA: 0.956 / 0.956 / 1.000 / 0.978)
  • Is tumor solid: 0.948 / 0.990 / 0.938 / 0.963 (MTQA: 0.960 / 0.993 / 0.951 / 0.972)
  • Is tumor ground-glass: 0.894 / 0.598 / 0.873 / 0.710 (MTQA: 0.960 / 0.795 / 0.984 / 0.879)
  • Is tumor mixed GGO: 0.924 / 0.774 / 0.591 / 0.670 (MTQA: 0.946 / 0.802 / 0.773 / 0.787)
  • Spiculations: 0.877 / 0.726 / 0.996 / 0.840 (MTQA: 0.976 / 0.988 / 0.938 / 0.963)
  • Lobulations: 0.954 / 0.860 / 1.000 / 0.925 (MTQA: 0.906 / 1.000 / 0.667 / 0.800)
  • Pleural invasion/indentation: 0.913 / 0.899 / 0.917 / 0.908 (MTQA: 0.854 / 0.802 / 0.915 / 0.855)
  • Mediastinal LN enlarged: 0.950 / 0.929 / 0.923 / 0.926 (MTQA: 0.941 / 0.961 / 0.859 / 0.907)
  • Hilar LN enlarged: 0.904 / 0.775 / 0.867 / 0.819 (MTQA: 0.937 / 1.000 / 0.749 / 0.856)

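The per-item metrics above follow the standard binary-classification definitions, which can be recomputed from confusion-matrix counts. The counts in the usage example are illustrative only, not the study's data.

```python
# Standard binary metrics from confusion counts (tp, fp, fn, tn), matching
# the Accuracy / Precision / Recall / F1 columns reported above.
def prf1(tp: int, fp: int, fn: int, tn: int):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative counts for a single boolean item over 200 reports.
acc, p, r, f1 = prf1(tp=90, fp=10, fn=5, tn=95)
print(f"Acc={acc:.3f} P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Note that for items with low prevalence (e.g., pure ground-glass density), accuracy can remain high even when precision and F1 drop, which is visible in the table above.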
Prompt engineering effects (Prompt-PMK vs base): Significant improvements were observed for spiculation (Acc/Prec/Rec/F1: 0.986/0.989/0.967/0.978), lobulation (0.976/0.923/1.000/0.960), and pleural invasion (0.935/0.922/0.942/0.932). However, performance degraded for density questions (solid, pure GGO, mixed GGO) and for lymph node status (mediastinal, hilar), often due to persistent confusion in mutual exclusivity handling and node group nomenclature.

Consistency: Repeated prompting of 100 reports (3 runs each) showed lower consistency for density, spiculation, and lobulation items than for location, size, pleural involvement, and lymph node status. Variability likely stems from ambiguous synonym handling and lack of fixed logical rules in zero-shot responses.

Discussion

The findings demonstrate that ChatGPT can perform zero-shot IE on radiology reports at competitive levels for several clinically relevant attributes (tumor location, size, lobulation, pleural involvement, lymph node status), rivaling a trained BERT-based MTQA system for some items. Performance is weaker for nuanced semantic distinctions (density types and spiculation), where lexical ambiguity and contextual subtleties lead to errors. Incorporating prior medical knowledge into prompts can correct systematic mistakes (e.g., streaky shadow vs spiculation, irregular margin vs lobulation, fissure invasion counting as pleural involvement), but may not universally help and can degrade performance for other attributes (e.g., density, lymph node mapping), highlighting trade-offs in prompt engineering.

The study also underscores output variability across runs and free-text formatting challenges, necessitating robust post-processing and possibly output-constraining strategies. Clinically, these results suggest LLM-based IE could reduce annotation costs and accelerate information structuring from EHRs, but careful prompt design, validation, and domain-specific post-processing remain essential.

Conclusion

A zero-shot IE approach using ChatGPT was proposed for extracting structured information from CT radiology reports. ChatGPT delivered competitive performance on several IE tasks without training data or fine-tuning. Prompt engineering with prior medical knowledge improved certain attributes (spiculation, lobulation, pleural involvement) but did not consistently help for more complex distinctions (density types, lymph node status). Future work should focus on better output constraints, improving consistency, integrating domain logic for mutually exclusive categories, refining lymph node terminology handling, and exploring secure, locally deployable LLM solutions for clinical settings.

Limitations
  • Reliance on ChatGPT’s zero-shot understanding and reasoning, leading to errors in nuanced distinctions (e.g., density categories, lymph node group nomenclature).
  • Inconsistent outputs across identical prompts and unpredictable free-text response formats.
  • Prompt engineering benefits are attribute-dependent; added prior knowledge can degrade performance for some tasks.
  • Post-processing relies on handcrafted rules and may not handle all edge cases.
  • Data privacy concerns when using externally hosted LLMs; local deployment is challenging due to computational demands.
  • Single-institution dataset (847 CT reports) may limit generalizability.