Large language model use in clinical oncology
N. Carl, F. Schramm, et al.
The rapid emergence of ChatGPT in 2022 catalyzed an expansion of AI research in medicine, including oncology. LLMs from multiple organizations (e.g., OpenAI, Microsoft, Google, Meta) are being explored for patient information provision, therapy management, and prognostication from clinical text, with capabilities in content generation, translation, and medical question-answering. Despite promise, studies report variable performance and outdated or incorrect outputs. From a computational perspective, LLMs are based predominantly on transformer architectures, with a broader trend toward multimodal models that can handle both text and images—an important shift for complex oncology data. Given mixed outcomes and the evolving landscape, this systematic review aimed to: (1) comprehensively analyze current literature on LLM applications in oncology; (2) perform a meta-analysis quantifying LLM performance in medical question-answering; and (3) analyze methodologies and constraints to guide future research.
Prior reviews published through mid-2023 summarized the opportunities and limitations of LLMs in oncology but lacked meta-analytic synthesis. Background literature highlights the rise of transformer-based architectures and the transition toward multimodal AI capable of integrating text and image data, which is pertinent for oncology workflows. Earlier works emphasized potential benefits (e.g., improved patient information and clinician communication) alongside concerns about variable accuracy and outdated content. This review updates the field by incorporating newer studies (2021–2024) and formally synthesizing performance metrics via meta-analysis.
Protocol and registration: The review followed PRISMA and QUADAS principles (as no LLM-specific systematic-review guideline exists) and was registered in PROSPERO (CRD42023429956).
Search strategy: Two reviewers searched PubMed/MEDLINE (last access 19 March 2024) using: “LLM” OR “Large Language Model” OR “ChatGPT” AND “(oncology OR (cancer)*)”. The term “ChatGPT” was included to capture relevant LLM literature that names ChatGPT explicitly.
Study selection: Included studies were original, peer-reviewed, English-language research (2021–2024) on LLM applications in oncology with available abstracts. Excluded were studies that did not address LLMs in oncology or lacked rigorous methodology, as well as preprints, systematic reviews, non–peer-reviewed articles, non-English publications, and articles without available full text. Two reviewers independently screened titles and abstracts and then assessed full texts; disagreements were resolved by discussion or by a third reviewer.
Data analysis: Two reviewers independently extracted study characteristics using a standardized form.
Reporting and evaluation framework: To appraise methodological reporting across studies, the authors proposed an evaluation framework with items covering prompt sources, models, questioning procedures (including prompt engineering and test–retest), and output evaluation (raters, blinding, metrics, grading, and controls).
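The evaluation framework described above can be pictured as a per-study checklist. The following is a minimal sketch only: the field names paraphrase the framework's categories and are assumptions, not the authors' actual item wording, and the unweighted completeness score is a hypothetical convenience rather than a metric used in the review.

```python
from dataclasses import dataclass, fields


@dataclass
class ReportingChecklist:
    """Hypothetical per-study checklist; item names are illustrative assumptions."""
    prompt_source_stated: bool = False
    model_and_version_stated: bool = False
    prompt_engineering_described: bool = False
    test_retest_performed: bool = False
    raters_described: bool = False
    blinding_reported: bool = False
    metrics_defined: bool = False
    grading_scheme_defined: bool = False
    controls_included: bool = False


def completeness(checklist: ReportingChecklist) -> float:
    """Fraction of checklist items a study reports (unweighted)."""
    values = [getattr(checklist, f.name) for f in fields(checklist)]
    return sum(values) / len(values)


# Invented example study that reports three of the nine items.
example = ReportingChecklist(
    prompt_source_stated=True,
    model_and_version_stated=True,
    metrics_defined=True,
)
print(f"Reporting completeness: {completeness(example):.2f}")
```

Recording items as explicit booleans makes under-reporting visible per study and per item, which is the kind of gap the framework is meant to expose.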
Study yield: Of 483 records, 110 full texts were screened; 34 studies met inclusion (January 2021–March 2024).
Application domains and tasks: Most studies examined LLMs’ medical knowledge via question answering (32/34). Fewer addressed patient involvement/compliance (1/34) or translation/summarization for patients (2/34). Topics spanned multiple cancer entities; many focused on diagnostic appropriateness (14/34) and especially treatment recommendations (30/34).
Prompt sources and procedures: Inputs were drawn from guidelines, official forums, FAQs, clinical cases, and exam banks; English predominated. The median number of questions was 51 (range 8–293). Only about 62% of studies provided their prompts or outputs in the text or a supplement. Questioning procedures, including prompt engineering and test–retest, were frequently under-reported; one study (Holmes et al.) provided the most comprehensive procedure details.
Models assessed: Most studies evaluated GPT-3.5 and/or GPT-4; some compared multiple LLMs. One study reported fine-tuning GPT-3.5 Turbo on a narrow renal cell carcinoma (RCC) question-answering set and achieving 100% accuracy, illustrating both task-specific optimization and the risk of overfitting.
Meta-analysis of medical question-answering (medQA) performance:
• Single-model studies: GPT-3.5 mean accuracy 63.6% (SD 0.23); GPT-4 mean accuracy 78.0% (SD 0.16).
• Comparative benchmarks: mean accuracies 79% (SD 0.10) for GPT-4, 73% (SD 0.17) for GPT-3.5, and 51% (SD 0.15) for Bard (LaMDA).
Reported I² values were low (0% and 21% in different subsets), but the broad performance ranges still indicate substantial inter-study variability.
Evaluation metrics: Across studies, 26+ distinct correctness terms/metrics were used, including binary (e.g., accuracy, sensitivity, specificity, precision, recall, F1, agreement), one-dimensional (e.g., Likert scales, number of correct responses, cosine similarity), and multidimensional tools (e.g., DISCERN, SERVQUAL, PEMAT-P, AIP, VGT). In a subset of studies, readability was assessed with Flesch Reading Ease and Flesch-Kincaid (FK) grade level.
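For readers unfamiliar with the two readability scores named above, the sketch below implements the standard published Flesch Reading Ease and Flesch-Kincaid grade-level formulas. It takes word, sentence, and syllable counts as inputs, since syllable counting is tool-dependent; the counts in the example are invented for illustration and do not come from any included study.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid grade level: approximate US school grade needed."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Invented counts for a hypothetical LLM answer: 100 words, 5 sentences, 150 syllables.
fre = flesch_reading_ease(100, 5, 150)
fk = flesch_kincaid_grade(100, 5, 150)
print(f"Flesch Reading Ease: {fre:.2f}, FK grade: {fk:.2f}")
```

Both scores depend only on average sentence length and average syllables per word, so they measure surface complexity and say nothing about medical correctness.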
Overall, performance varied by model, task, and domain and was sensitive to prompting strategy and evaluation design.
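The I² statistic reported in the meta-analysis expresses the share of variability between studies that exceeds what chance alone would produce. This summary does not state how the authors pooled results, so the sketch below assumes the conventional approach: inverse-variance weights, Cochran's Q, and Higgins' I². The accuracies and variances are hypothetical, not data from the review.

```python
from typing import Sequence


def i_squared(effects: Sequence[float], variances: Sequence[float]) -> float:
    """Return Higgins' I² in percent from per-study effects and sampling variances."""
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("need at least two studies with matching variances")
    weights = [1.0 / v for v in variances]  # inverse-variance weights
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    if q <= df:
        return 0.0  # I² is truncated at zero
    return 100.0 * (q - df) / q


# Hypothetical study accuracies (as fractions) and their sampling variances.
accuracies = [0.6, 0.8]
variances = [0.01, 0.01]
print(f"I² = {i_squared(accuracies, variances):.1f}%")  # I² = 50.0%
```

Because I² is truncated at zero whenever Q does not exceed its degrees of freedom, a value of 0% can coexist with a wide spread of accuracies when the individual studies are small or imprecise.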
The review demonstrates that oncology LLM research predominantly tests off-the-shelf models’ encoded medical knowledge through question answering, with wide performance variability attributable to model version (GPT-4 generally outperforming GPT-3.5), oncologic subdomain, and methodological choices (prompt sources, prompt engineering, test–retest, and grading schemes). The under-reporting of questioning procedures and lack of LLM-specific reporting standards likely contribute to heterogeneity and hinder reproducibility and comparability. Clinically, strong interest in treatment recommendations reflects potential for patient advisory and decision support, but trust, legal accountability, privacy, and regulatory concerns remain. Few studies assess patient interactions or real-world clinical integration, highlighting a translational gap between in-silico benchmarks and in-vivo workflows. Methods to improve utility include retrieval-augmented generation (RAG) to ground outputs in current guidelines and provide traceable citations, potentially improving accuracy, timeliness, and explainability. While targeted fine-tuning can yield high accuracy on narrow tasks, it risks overfitting and limited generalizability. As oncology often involves multimodal data, future LLMs and evaluations should address text and visual modalities to better reflect real clinical complexity.
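To make the retrieval-augmented generation idea concrete, here is a minimal sketch of the retrieval and prompt-assembly steps only. It assumes a toy bag-of-words cosine similarity in place of a real embedding model, a hypothetical three-passage corpus of placeholder text (not actual guideline content), and no call to any LLM; all function names and source IDs are illustrative.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, corpus: Dict[str, str], k: int = 1) -> List[Tuple[str, str]]:
    """Return the k passages most similar to the question, best first."""
    q = Counter(tokenize(question))
    ranked = sorted(corpus.items(),
                    key=lambda item: cosine(q, Counter(tokenize(item[1]))),
                    reverse=True)
    return ranked[:k]


def build_prompt(question: str, passages: List[Tuple[str, str]]) -> str:
    """Assemble a grounded prompt that asks the model to cite source IDs."""
    context = "\n".join(f"[{pid}] {text}" for pid, text in passages)
    return ("Answer using only the sources below and cite their IDs.\n"
            f"{context}\n\nQuestion: {question}")


# Placeholder passages standing in for guideline excerpts (hypothetical).
corpus = {
    "guideline-A": "placeholder passage about adjuvant chemotherapy options for colon cancer",
    "guideline-B": "placeholder passage about screening intervals for breast cancer",
    "guideline-C": "placeholder passage about follow-up imaging for renal cell carcinoma",
}
question = "What follow-up imaging is used for renal cell carcinoma?"
top = retrieve(question, corpus, k=1)
print(build_prompt(question, top))
```

Keeping the source ID attached to each retrieved passage is what allows an answer to carry traceable citations; a production system would replace the toy similarity with a proper retriever over current guideline text.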
Current evidence shows LLMs, particularly GPT-4, can achieve moderate-to-high accuracy on oncology question-answering tasks, but performance is inconsistent across domains and methodologies. The field lacks standardized, LLM-specific reporting of prompting procedures, evaluation metrics, and grading methods, limiting cross-study comparability and clinical translation. Future work should: (1) adopt emerging LLM-focused reporting guidelines and transparently document prompt engineering and test–retest procedures; (2) evaluate models in real-time clinical settings and patient-facing use; (3) employ grounding methods such as RAG for up-to-date, explainable outputs; and (4) progress toward multimodal capabilities reflecting oncology practice. These steps are critical for reliable, ethical, and effective integration of LLMs into oncology care.
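One simple way to document a test–retest procedure is to repeat each prompt several times and report how often the model gives its most common answer. The sketch below assumes this modal-agreement measure, which is one possible choice and not a method prescribed by the review; `ask` is a hypothetical callable standing in for any model API, and the stub replies are invented.

```python
from collections import Counter
from itertools import cycle
from typing import Callable


def repeat_agreement(ask: Callable[[str], str], prompt: str, runs: int = 5) -> float:
    """Fraction of repeated runs that return the most frequent (modal) answer."""
    answers = [ask(prompt).strip().lower() for _ in range(runs)]
    modal_count = Counter(answers).most_common(1)[0][1]
    return modal_count / runs


# Deterministic stub in place of a real model: answers "A" four times, "B" once.
_replies = cycle(["A", "A", "B", "A", "A"])


def stub_model(prompt: str) -> str:
    return next(_replies)


score = repeat_agreement(stub_model, "Hypothetical multiple-choice question", runs=5)
print(f"Test-retest agreement: {score:.1f}")  # Test-retest agreement: 0.8
```

Reporting the number of runs alongside the agreement score lets readers judge whether a single-run accuracy figure is stable.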
- Search limited to PubMed/MEDLINE and inclusion of the keyword “ChatGPT” may have introduced selection bias and missed studies not explicitly naming ChatGPT.
- Potential publication bias may overrepresent positive findings.
- High methodological heterogeneity (prompting strategies, domains, metrics) complicates meta-analytic interpretation.
- Under-reporting of questioning procedures and lack of LLM-specific reporting standards limit reproducibility and comparability.
- No included studies reported real-time clinical deployment, limiting external validity for clinical workflows.
- Some affiliations and methodological details across the literature were incompletely reported, reflecting broader reporting gaps in the field.

