A scoping review of large language model based approaches for information extraction from radiology reports

Medicine and Health

D. Reichenpfader, H. Müller, and K. Denecke

Discover how research by Daniel Reichenpfader, Henning Müller, and Kerstin Denecke explores the potential of natural language processing (NLP) for extracting structured data from radiology reports. This scoping review examines model performance, validation practices, use cases, data and annotation, and open challenges as large language models reshape the field.

Introduction
Radiology reports are typically semi-structured free text, which limits their reuse for secondary purposes despite the potential of structured reporting. NLP, and specifically information extraction (IE), can convert free text into structured clinical data for prediction, quality assurance, and research. With the emergence of large language models (LLMs), particularly transformer-based models such as BERT and GPT-3/4, there is a need to understand how these models shape IE applied to radiology reports. Prior to transformers, contextual embeddings were produced with RNN-based models (e.g., BiLSTM). A research gap exists: no comprehensive overview of LLM-based IE from radiology reports had been available. Research question: What is the state of research regarding information extraction from free-text radiology reports based on LLMs? Subquestions: RQ.01 performance of LLMs for IE from radiology reports; RQ.02 training and modeling (models used, pre-training and fine-tuning design); RQ.03 use cases (modalities and anatomical regions); RQ.04 data and annotation (data size, annotation process, public availability); RQ.05 challenges (open challenges and limitations of existing approaches). The purpose is to summarize recent developments, identify key trends, and highlight future research needs.
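As a concrete illustration of the encoder-based IE approaches discussed throughout the review, the sketch below shows how a BERT-style token-classification model could extract entities from a report using the Hugging Face transformers library. The checkpoint name is a hypothetical placeholder, not a model from any included study.

```python
# A minimal sketch of encoder-based IE framed as named entity recognition,
# assuming the Hugging Face `transformers` library. The checkpoint name is a
# hypothetical placeholder, not a model from any study included in the review.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="example-org/radiology-bert-ner",  # hypothetical fine-tuned BERT model
    aggregation_strategy="simple",           # merge sub-word tokens into entity spans
)

report = (
    "Chest CT shows a 12 mm nodule in the right upper lobe. "
    "No pleural effusion. Recommend follow-up CT in 6 months."
)

for entity in ner(report):
    # Each entity is a dict with the predicted label, extracted text span,
    # character offsets, and a confidence score.
    print(entity["entity_group"], "->", entity["word"], round(float(entity["score"]), 3))
```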
Literature Review
Multiple prior reviews exist on NLP in radiology and related domains but either predate current LLM developments or focus narrowly. Systematic reviews in 2016 and 2021 addressed NLP in radiology reports, with limited LLM coverage. Other recent reviews targeted breast cancer report NLP, cancer concept extraction from clinical notes, and BERT-based radiology NLP without a specific IE focus. Thus, an updated overview of LLM-based IE from radiology reports was lacking.
Methodology
The scoping review was conducted according to the JBI Manual and PRISMA-ScR, following a published protocol.
Search strategy: A preliminary search (Google Scholar, PubMed) informed term selection; the final query combined two dimensions, radiology and information extraction, to balance recall and precision. Systematic searches were run on August 1, 2023 across five databases (PubMed, IEEE Xplore, ACM Digital Library, Web of Science Core Collection, and Embase) without limits. Forward citation chasing of included studies added further records.
Study selection: 1,237 records were identified; after deduplication and exclusion of pre-2018 publications, 374 titles/abstracts were screened; 72 reports were sought for full text (68 retrieved); 34 studies were included after full-text assessment and forward citation searching.
Inclusion criteria (all required): (C.01) retrievable full text; (C.02) published after December 31, 2017; (C.03) peer-reviewed journal or conference publication; (C.04) original research; (C.05) NLP methods for IE from free-text radiology reports; (C.06) LLM-based approach (deep learning model with more than one million parameters, pre-trained on unlabeled text data).
Screening and extraction: Screening was performed by two reviewers using Rayyan, with conflicts resolved to maximize recall. Data extraction was performed by one author after calibration exercises. Changes to the protocol and the PRISMA-S and PRISMA-ScR checklists are provided in the supplements; the data and extraction table are available via OSF.
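For illustration only, the snippet below shows how a two-dimensional Boolean query of this kind can be assembled; the term lists are assumptions made for demonstration and are not the exact terms of the review protocol.

```python
# Illustrative only: assembling a two-dimensional Boolean query that joins a
# radiology dimension and an information-extraction dimension. The term lists
# below are assumptions for demonstration, not the exact terms of the protocol.
radiology_terms = ["radiology report*", "radiologist*", "imaging report*"]
ie_terms = [
    "information extraction",
    "named entity recognition",
    "natural language processing",
    "text mining",
]

def or_block(terms):
    # Join the synonyms of one dimension with OR and wrap them in parentheses.
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"

# Combine both dimensions with AND to balance recall and precision.
query = f"{or_block(radiology_terms)} AND {or_block(ie_terms)}"
print(query)
```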
Key Findings
Study characteristics: 34 studies (2019–2023; peak in 2021 with 11). Based on the corresponding author, publications came mainly from the USA (n=15), China (n=6), UK (n=3), Germany (n=3), Canada (n=2), Japan (n=2), Austria (n=1), Spain (n=1), and The Netherlands (n=1).
Tasks and concepts: Single-label document classification was excluded. Included tasks were document-level multi-class classification (n=2), multi-label classification (n=9, 26%), NER (n=21, 62%; 10 of these also performed relation extraction), and QA-based IE (n=2). Extracted concepts ranged from 1 to 64 entities and included abnormalities, anatomy, breast cancer concepts, findings, devices, diagnoses, observations, pathology, PHI, recommendations, scores (e.g., TI-RADS, tumor response), spatial expressions, staging, and stroke phenotypes. Information model development was often unreported: only 13 studies referenced guidelines, terminologies, or prior work, while 21 (62%) gave no details. Normalization or structuring beyond extraction was described in 3 studies (e.g., rule-based approaches, a hybrid BM25+BERT classifier). Class imbalance was rarely addressed (1 study applied countermeasures; 1 study avoided F1 because of imbalance).
Models: 28/34 studies (82%) used transformer-based architectures; 27 of these were BERT-based and 1 used ERNIE. Six studies used BiLSTM variants, two of which pre-trained word vectors (word2vec). One study combined BERT and BiLSTM. Further pre-training of BERT on in-house data occurred in 8 studies (24%); 18 (53%) used pre-trained BERT without further pre-training. Fine-tuning details were reported in 31 studies (91%) and hyperparameters in 28 (82%).
Performance and validation: Performance reporting was heterogeneous, covering precision, recall, accuracy, and various F1 variants (micro, macro, weighted, pooled, exact/inexact match). External validation was conducted in approximately 7 studies (21%) and generally showed performance degradation on external data; reported drops ranged up to 35% (overall F1) in a BiLSTM multi-label setting, while the smallest drop was 0.74% (micro F1) with further pre-training; one study reported only a 3.16% drop when extracting 64 entities. Statistical testing was reported by 22 studies (65%), using methods including cross-validation, McNemar, Mann-Whitney U, Tukey-Kramer, and DeLong. Hardware details were included by 7 studies (21%).
Data sets: Fine-tuning sizes ranged from 50 to 10,155 reports; external validation sets were 10–31% of fine-tuning sizes. Further pre-training used 50,000 to 3.8 million reports; contrastive pre-training on MIMIC-CXR was also reported; BiLSTM word embeddings were pre-trained on 3.3 million and 317,130 reports. Splits: 23 studies (68%) used train/validation/test splits (most commonly 80/10/10, used in 8 studies, 24%); 7 (21%) used two sets; 4 (12%) used cross-validation. Timeframes were reported by 19 studies (56%), ranging from less than 1 year to 22 years (1999–2021). Public datasets included MIMIC-CXR (once), MIMIC (twice), MIMIC-III (6 studies, 18%), and the Indiana chest X-ray collection (twice); external validation used MIMIC-II and MIMIC-CXR in some studies.
Modalities and anatomical regions: CT (n=16), MRI (n=15), and X-ray (n=14) were most frequent; others included PET-CT (n=1) and ultrasound (n=2), with some unspecified. Anatomical regions comprised thorax (n=17), brain (n=8), head (n=5), body of newborn (n=6), and others (n=8), with some unspecified.
Annotation: 28 studies (82%) used exclusively manual annotation; five had dual independent annotators; two used automated assistance with manual review. Tagging schemes included IOB2, BISO, and BIOES. Annotators spanned radiologists, residents, clinicians, students, and engineers (1–5 annotators). Annotation guidelines existed in few studies (3 with existing guidelines; 4 with instructions but no details); 23 (68%) did not mention guidelines. Inter-annotator agreement (IAA) was mentioned by 23 studies (68%); results were reported by 16 (47%), with Cohen's kappa ranging from 0.81 to 0.937. Annotation tools were reported in 11 studies (e.g., Brat, Doccano, TagEditor, Talen, self-developed).
Data and code availability: Data were available upon request in 5 studies (15%); one study claimed availability without actually providing the data; one dataset was released on GitHub; one used credentialed-access annotations; 22 (65%) did not mention data availability. Code availability was claimed in 10 studies (29%).
Generative models: None of the included studies employed decoder-only generative LLMs for IE within the cutoff window; most used encoder-based (BERT) or pre-transformer (BiLSTM) approaches.
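The difficulty of comparing reported scores can be made concrete with a small sketch: the same set of predictions yields different F1 values depending on the averaging scheme, here computed with scikit-learn on made-up entity labels.

```python
# Why reported F1 scores are hard to compare: the same predictions give
# different values depending on the averaging scheme. A minimal sketch using
# scikit-learn on made-up entity labels (illustrative data only).
from sklearn.metrics import f1_score

y_true = ["finding", "finding", "anatomy", "device", "anatomy", "finding"]
y_pred = ["finding", "anatomy", "anatomy", "device", "finding", "finding"]

for average in ("micro", "macro", "weighted"):
    score = f1_score(y_true, y_pred, average=average)
    print(f"{average:>8} F1 = {score:.3f}")
```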
Discussion
Comparability of performance across studies is limited by heterogeneous datasets, label sets, and metrics (various F1 formulations, exact/inexact matching). External validation generally reduced performance, emphasizing the need for multi-institutional evaluation; while further pre-training may help generalizability in some settings, its benefit is not guaranteed. Validation methodologies varied (cross-validation vs. simple splits), and statistical tests were inconsistently applied; the appropriateness of specific tests (e.g., DeLong for nested models) remains debated. Clear definitions and reporting of pre-training, further pre-training, transfer learning, and fine-tuning are needed, as the terminology is often used inconsistently.
No generative models were identified within the review window; possible reasons include nascent adoption in healthcare, limited domain-specific pre-training early on, and hallucination concerns. Encoder-only models such as BERT provide contextual embeddings for downstream tasks and are perceived as more transparent and verifiable for IE than generative models. Only two studies used extractive QA for IE, despite its suitability for span extraction. Reports focused on CT and MRI modalities and thoracic/brain regions, reflecting dataset availability and clinical prevalence.
Annotation practices varied widely (guidelines, IAA methods, tools), hindering reproducibility and comparability. Good practices include reporting per-class and aggregated scores, providing formulas for the metrics used, applying patient-level splits, and publishing descriptive annotation statistics and corpus complexity analyses. Post-cutoff literature suggests that large generative models do not consistently outperform encoder-based methods for structured IE in radiology and remain resource-intensive and less explainable, though they may offer better generalization via in-context learning and improved aggregation; combining LLMs with knowledge graphs may mitigate hallucinations. Overall, encoder-based models offer efficiency and precision for IE, whereas generative models provide flexibility at higher computational and explainability costs.
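To illustrate the extractive QA strategy mentioned above, the sketch below phrases each target concept as a question and reads the answer span directly off the report, assuming the Hugging Face transformers library; the general-purpose SQuAD checkpoint is only a stand-in, not a model evaluated in the review.

```python
# A minimal sketch of extractive question answering as an IE strategy,
# assuming the Hugging Face `transformers` library. A general-purpose SQuAD
# checkpoint stands in for a domain-adapted model; it is not one of the
# models evaluated in the review.
from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

report = (
    "MRI of the brain demonstrates a 2 cm enhancing lesion in the left "
    "frontal lobe with surrounding vasogenic edema."
)

# Each target concept is phrased as a question; the answer is a verbatim span
# of the report, which keeps the extraction verifiable against the source text.
for question in ("Where is the lesion located?", "How large is the lesion?"):
    answer = qa(question=question, context=report)
    print(question, "->", answer["answer"], f"(score {answer['score']:.2f})")
```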
Conclusion
This scoping review synthesizes LLM-based approaches for extracting structured information from radiology reports (2018 to August 1, 2023). Included studies predominantly used encoder-only BERT variants or pre-transformer BiLSTM architectures; no generative models for IE were identified within the review window. Results indicate promising performance of encoder-based and pre-transformer models, though comparison is impeded by heterogeneous tasks, datasets, and metrics. External validation was uncommon and typically revealed performance drops, underscoring generalizability challenges. Research focused on CT/MRI modalities and thoracic regions; public datasets and open-source code were limited; annotation processes lacked standardization. Future work should: (1) standardize reporting (metrics, data splits, statistical tests); (2) improve and standardize annotation workflows (guidelines, IAA, tooling); (3) develop and release multilingual, standardized datasets to enable external validation; (4) systematically evaluate generative LLMs for IE with attention to hallucinations, efficiency, and explainability; and (5) create a reporting framework for clinical NLP applications to facilitate comparability and reproducibility.
Limitations
Key limitations include definitional ambiguity for both IE (scope of tasks) and LLMs (inclusion of BiLSTM-based models), which may affect study inclusion. The search strategy did not enumerate many specific model names to maintain manageable query complexity; no search updates were performed, and some venues (e.g., ACL Anthology) and recent arXiv works were outside scope, potentially missing relevant generative approaches. Sensitive data constraints and deployment limitations for closed-source LLMs may have reduced eligible studies. Heterogeneous and incomplete reporting across included studies (metrics, datasets, annotation) precluded comprehensive quantitative synthesis. Data extraction was performed by one author (calibrated on two studies), which may introduce extraction bias. Finally, the rapidly evolving LLM landscape means newer works (post–Aug 2023) were not captured in the synthesis.