Radiological imaging is crucial for informed medical decision-making. Radiologists document their findings in semi-structured free-text reports, often following personal or institutional schemas. While structured reporting would benefit automated analysis, its adoption faces resistance because it increases radiologists' workload. Natural Language Processing (NLP), the branch of artificial intelligence and linguistics concerned with computer understanding of human language, offers a way to obtain structured information without changing how reports are written. Specifically, Information Extraction (IE), an NLP subfield, can extract clinically relevant information from free-text reports for secondary uses such as prediction, quality assurance, and research. IE comprises tasks such as named entity recognition (NER), relation extraction (RE), and template filling, traditionally addressed with heuristic, machine learning, or deep learning methods. More recently, Large Language Models (LLMs), deep learning models with many parameters trained on massive amounts of text, have emerged, particularly models based on the transformer architecture (e.g., BERT, GPT-3, GPT-4), and have proven highly effective for IE. This review addresses the gap in knowledge regarding the application of LLMs to IE from radiology reports.
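To make the NER formulation concrete, the following is a minimal sketch of transformer-based entity extraction over a report sentence, using the Hugging Face transformers pipeline. The checkpoint name is a publicly available general-purpose NER model standing in for the domain-specific fine-tuned models the reviewed studies used, and the report sentence is invented.

```python
# Minimal sketch: transformer-based NER over a (invented) report sentence.
# "dslim/bert-base-NER" is a general-purpose public checkpoint used here as
# a stand-in; a real system would use a model fine-tuned on radiology text.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",
    aggregation_strategy="simple",  # merge word-piece tokens into entity spans
)

report = "There is a 12 mm nodule in the right upper lobe of the lung."
for entity in ner(report):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```

Encoder models such as BERT treat this as token classification: every word-piece receives a label, and adjacent labeled pieces are merged back into entity spans.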
Literature Review
Several reviews of NLP in radiology exist, but they either predate current developments or focus on specific aspects or clinical domains (Table 1). Two systematic reviews (2016, 2021) cover NLP for radiology report IE, but the former is inaccessible, and the latter's search was limited to Google Scholar and includes only one LLM-based study. Other recent reviews address breast cancer reports, cancer concept extraction from clinical notes, and BERT-based NLP in radiology, without a specific focus on information extraction. The rapid advancement of LLMs therefore calls for a comprehensive overview of their use in this area.
Methodology
This scoping review followed the PRISMA-ScR guideline. A systematic search across five databases (PubMed, IEEE Xplore, ACM Digital Library, Web of Science Core Collection, Embase) was conducted on August 1st, 2023, with a search strategy combining the concepts of radiology and information extraction. The search identified 1,237 records, reduced to 374 after removing duplicates and pre-2018 publications. After title/abstract screening and full-text review, supplemented by a forward search of the references in the included papers that contributed nine further studies, 34 studies met the inclusion criteria: original research published between 2018 and August 2023 that applied LLMs (defined as deep learning models with over one million parameters trained on unlabeled text data) to IE from free-text radiology reports. Data extraction covered model architecture, training and fine-tuning processes, use cases (modalities and anatomical regions), data and annotation details, and reported challenges.
Key Findings
The 34 included studies relied primarily on encoder-based transformer models (mostly BERT variants); the few non-transformer approaches used Bi-LSTM architectures. Only 24% of the transformer-based studies performed further pre-training on in-house data. Reported performance varied widely, and studies used differing variants of the F1-score alongside other measures, which hampers comparison (see the sketch below). External validation, reported in 21% of studies, generally showed decreased performance. Most studies applied Named Entity Recognition (NER), some additionally applied Relation Extraction (RE), and fewer used Question Answering (QA) approaches. The number of extracted concepts ranged from one to 64; the most commonly extracted concepts were abnormalities, anatomical information, diagnoses, observations, and recommendations.

CT, MRI, and X-ray reports were the most common report types, with the thorax the most frequent anatomical region. Dataset sizes ranged from 50 to 10,155 reports. Manual annotation was prevalent (82%), often involving multiple annotators, and inter-annotator agreement (IAA) was reported in 68% of studies, with varying measures and results. Data and source code availability were limited. Commonly reported limitations included single-institution data, lack of external validation, and limited scope (a single modality or clinical area).
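To illustrate why mixing F1 variants hampers comparison, here is a small self-contained sketch; the label counts are invented purely for illustration. Micro-averaging rewards overall hit rate, while macro-averaging exposes a model that misses a rare class entirely; an analogous gap separates chance-corrected IAA measures such as Cohen's kappa from raw percentage agreement.

```python
# Invented toy labels: 8 frequent-class and 2 rare-class gold annotations.
from sklearn.metrics import cohen_kappa_score, f1_score

y_true = ["nodule"] * 8 + ["fracture"] * 2
y_pred = ["nodule"] * 10  # the rare class is never predicted

print(f1_score(y_true, y_pred, average="micro"))                   # 0.80
print(f1_score(y_true, y_pred, average="macro", zero_division=0))  # ~0.44

# Reading the same columns as two annotators' labels: raw agreement is
# 80%, but chance-corrected agreement is 0 because one "annotator"
# always picks the majority class.
print(cohen_kappa_score(y_true, y_pred))                           # 0.0
```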
Discussion
The heterogeneity of performance measures and datasets hampers direct comparison of study results. External validation generally revealed performance drops, highlighting the need for broader data and more robust model generalization. The dominance of encoder-only models (e.g., BERT) may stem from their transparency and freedom from the "hallucination" that affects generative models. Further pre-training showed mixed effects, and its impact on generalizability remains unclear. The prevalence of CT and MRI reports and the focus on the thorax may reflect data availability and procedural frequency. Lack of standardization in annotation processes, IAA calculation, and data reporting impairs comparability, and the limited availability of public datasets and source code restricts reproducibility.
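As a concrete contrast to encoder-based token classification, below is a minimal sketch of the extractive QA formulation some reviewed studies adopted: the target concept is posed as a question, and the model returns a verbatim span from the report, which keeps the output grounded in the source text rather than freely generated. The checkpoint is a publicly available general-purpose SQuAD model, not a radiology-specific one, and the report text is invented.

```python
# Extractive QA: the answer is a span copied from the report, not
# generated text, so the output cannot "hallucinate" new content.
from transformers import pipeline

qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")

report = ("Impression: 12 mm nodule in the right upper lobe. "
          "Follow-up CT recommended in 6 months.")
result = qa(question="What follow-up is recommended?", context=report)
print(result["answer"], round(result["score"], 3))
```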
Conclusion
This review provides a comprehensive overview of LLM-based IE from radiology reports, highlighting the promising results of encoder-based models while noting the limited comparability of studies. The emerging use of generative models after August 2023 warrants further investigation, as do the open challenges of data availability, annotation standardization, and reporting transparency. Future research should focus on developing a reporting framework to enhance comparability, optimizing annotation processes, and creating standardized, multilingual datasets for broader model validation.
Limitations
This review's scope was limited by the search date (August 1st, 2023), which excludes newer applications of generative models. The working definitions of LLMs and of information extraction involve some ambiguity. The search strategy, while aiming for balance, might have missed relevant studies. Data extraction was performed by a single author, although steps were taken to ensure consistency. Finally, the descriptive nature of a scoping review precludes in-depth analysis of the extracted information.