Leveraging Large Language Models for Precision Monitoring of Chemotherapy-Induced Toxicities: A Pilot Study with Expert Comparisons and Future Directions

Medicine and Health

O. R. Sarrias, M. P. M. D. Prado, et al.

This study evaluated the capacity of Large Language Models to classify chemotherapy toxicities, finding them to match the accuracy of expert oncologists in general toxicity ratings. The authors highlight the potential of LLMs for improving patient monitoring while alleviating the burden on medical professionals. The research was conducted by Oskitz Ruiz Sarrias and colleagues.

Introduction
The study addresses whether a Large Language Model can accurately identify and grade subjective chemotherapy-induced toxicities from patient descriptions, using CTCAE v5 criteria. As context, LLMs such as GPT-4 can process and generate clinical text, potentially supporting documentation, patient interaction, and decision-making. In oncology, continuous monitoring of treatment toxicities is crucial but resource-intensive, and round-the-clock personalized follow-up is often unfeasible with human resources alone. The purpose of this pilot is to test the feasibility and accuracy of an LLM in interpreting subjective symptom narratives similarly to experienced oncologists, enabling scalable, personalized monitoring that could reduce risks associated with treatment toxicities and optimize resource allocation.
Literature Review
Applications of LLMs in cancer care show promise in diagnosis support, EHR summarization, toxicity monitoring, and tumor board assistance, but they require expert verification because of error rates and potentially outdated information. Studies in radiation oncology reported time savings with GPT-4 summarization and chatbot approaches, yet reliability assessments found substantial inaccuracies: about 34.1% of ChatGPT responses were incorrect, 26.2% lacked essential context, and only 39.7% were correct and complete, underscoring the need for safeguards and rigorous review. In toxicity monitoring, patient-reported outcomes (PROs) and electronic PROs (ePROs) have demonstrated feasibility, high patient satisfaction, and potential improvements in communication and safety. A randomized controlled trial of symptom self-reporting showed improved health-related quality of life (HRQL), fewer emergency visits and hospitalizations, and longer duration of chemotherapy. The NCI’s PRO-CTCAE enhances documentation of symptomatic adverse events by capturing frequency, severity, and interference directly from patients. Technology-enabled systems (e.g., online platforms and AMTRA) show high compliance and improved patient–provider communication. Overall, the literature highlights both the potential of AI/LLMs and the challenges of integrating PRO data into workflows, ensuring accuracy, avoiding bias, maintaining transparency, and protecting privacy.
Methodology
- Design: Comparative pilot analysis to evaluate an LLM’s ability to classify subjective chemotherapy toxicities.
- Ethics: No real patients were involved; fictitious cases were used to assess feasibility before studies with real patients.
- Participants: Thirteen practicing medical oncologists from Basurto University Hospital (Osakidetza), all familiar with CTCAE v5, completed their assigned evaluations (initially planned n=15; final n=13 after two nonresponses). They were split into two groups (8 and 5 oncologists) for efficiency; each group evaluated 15 cases (30 cases in total). Inclusion criteria were current clinical practice, CTCAE v5 familiarity, and availability; the exclusion criterion was inability to complete the assigned cases.
- Fictitious cases: Thirty cases were created using expert knowledge to reflect realistic symptom profiles and proportional incidence/severity. For each profile, GPT-4 generated naturalistic Spanish narratives expressing symptoms and daily-life context, and OpenAI Text-to-Speech produced an approximately 3-minute Spanish audio per case. Cases were reviewed for realism.
- Toxicity framework: CTCAE v5 grading was used (Grade 0, none; 1, mild; 2, moderate with limitations in instrumental activities of daily living; 3, severe or potentially disabling, possibly requiring hospitalization, with limitations in activities of daily living; 4, life-threatening; 5, death).
- Expert evaluation: Oncologists listened to the audios and graded the reported toxicities per CTCAE v5.
- LLM model: A customized, contextualized GPT-4 (training data through December 2023) was configured with detailed CTCAE v5 guidance and indexed references. Importantly, no real classification examples were provided; the model relied solely on this contextualization.
- Analysis: Because each case had 8 or 5 oncologist ratings, consensus metrics used both the mode and the mean of expert responses. The LLM evaluated each case 10 times to capture response variability, and the final model output per case was likewise summarized by the mode and mean across those 10 runs. Accuracy was computed in two ways: (1) general categories (0, 1–2, 3–4) by agreement with the oncologist mode; and (2) specific categories (exact CTCAE grade). A minimal sketch of these computations follows this list.
- Error taxonomy: Mild errors were overestimates relative to the experts (conservative, low safety risk); severe errors were underestimates (higher safety risk). False alarms were instances where the model assigned a severe grade (3–4) while the expert mean did not.
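The consensus, accuracy, and error-classification steps described under Analysis and Error taxonomy can be made concrete with a short sketch. The snippet below is illustrative only: the function names, the example ratings, the tie-breaking behavior of the mode, and the rounding of mean ratings are assumptions rather than details reported by the authors.

```python
from statistics import mean, mode

# CTCAE v5 grades are integers 0-5; the study's general categories collapse
# them into three triage bins: none (0), mild/moderate (1-2), severe (3-4).
def general_category(grade: int) -> str:
    if grade == 0:
        return "none"
    if grade in (1, 2):
        return "mild-moderate"
    return "severe"

def consensus(ratings: list[int]) -> tuple[int, int]:
    """Summarize a set of ratings by both its mode and its rounded mean."""
    return mode(ratings), round(mean(ratings))

def error_type(llm_grade: int, expert_grade: int) -> str:
    """Error taxonomy: overestimates are 'mild' (conservative),
    underestimates are 'severe' (higher safety risk)."""
    if llm_grade == expert_grade:
        return "correct"
    return "mild" if llm_grade > expert_grade else "severe"

# Hypothetical example: one toxicity in one case, rated by 8 oncologists
# and evaluated by the LLM in 10 independent runs.
expert_ratings = [2, 2, 1, 2, 3, 2, 2, 1]
llm_runs = [3, 2, 3, 3, 3, 2, 3, 3, 3, 2]

expert_mode, expert_mean = consensus(expert_ratings)
llm_mode, _ = consensus(llm_runs)

specific_match = llm_mode == expert_mode                                      # exact-grade agreement
general_match = general_category(llm_mode) == general_category(expert_mode)   # triage-level agreement
false_alarm = (general_category(llm_mode) == "severe"
               and general_category(expert_mean) != "severe")

print(error_type(llm_mode, expert_mode), specific_match, general_match, false_alarm)
```

Aggregating these per-case flags over all 30 cases would yield accuracy and error-rate percentages of the kind reported under Key Findings.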
Key Findings
- Expert variability: Considerable dispersion in oncologist ratings was observed, attributed to the lack of interactive questioning with fictitious, pre-recorded audios. Entropy analysis showed higher variability for some subjective toxicities (e.g., anorexia 0.794; depression 0.664; taste alteration 0.637) and lower variability for hematuria (0.137); an illustrative entropy calculation follows this list.
- LLM accuracy (vs. oncologist consensus):
  - General categories: 81.5% (mode comparison) and 85.7% (mean comparison).
  - Specific categories (exact grade): 64.4% (mode) and 64.6% (mean).
- Error profile:
  - Mild vs. severe errors: 96%/4% (mode) and 96.4%/3.6% (mean), indicating most errors were conservative overestimates.
  - False alarms (severe grade predicted when experts did not): 8.9% (mode) and 3% (mean).
- Expert benchmarks:
  - Individual oncologist accuracy ranges: general 72.5–89.0% (mode) and 66.7–89.2% (mean); specific 64.2–80.0% (mode) and 57.0–76.0% (mean).
  - 95% CI for median oncologist accuracy: general 81.3–87.9% (mode) and 81.9–86.9% (mean); specific 72.9–77.2% (mode) and 67.6–75.6% (mean).
- Overall: The LLM’s general-category accuracy falls within expert ranges and confidence intervals, while its specific-category accuracy sits at the lower end of expert ranges and below the median CIs, indicating a need for improvement in exact grading.
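The entropy values quoted in the first bullet can, in principle, be computed along the lines below. The authors do not specify the exact formula or normalization they used, so this sketch assumes Shannon entropy of the distribution of expert grades for a given toxicity, scaled to the [0, 1] range; the toxicity names and ratings shown are purely hypothetical.

```python
import math
from collections import Counter

def normalized_entropy(ratings: list[int], n_levels: int = 6) -> float:
    """Shannon entropy of the grade distribution, divided by the maximum
    possible entropy log2(n_levels) (here assuming CTCAE grades 0-5),
    so that 0 means full agreement and 1 means maximal disagreement."""
    total = len(ratings)
    probs = [count / total for count in Counter(ratings).values()]
    h = -sum(p * math.log2(p) for p in probs)
    return h / math.log2(n_levels)

# Hypothetical expert ratings for two toxicities across evaluated cases:
# one with substantial disagreement, one with near-complete agreement.
anorexia_ratings = [1, 2, 2, 3, 1, 2, 0, 2]
hematuria_ratings = [0, 0, 0, 0, 0, 1, 0, 0]

print(round(normalized_entropy(anorexia_ratings), 3))   # higher value: more dispersion
print(round(normalized_entropy(hematuria_ratings), 3))  # lower value: near-consensus
```

Higher values indicate more dispersion in expert grades, which the study observed mainly for subjective toxicities such as anorexia and depression.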
Discussion
The study directly addresses whether an LLM can interpret subjective symptom narratives and classify chemotherapy toxicities similarly to oncologists using CTCAE v5. Findings show LLM general-category accuracy within expert ranges and confidence intervals, supporting feasibility for triage-level categorization and continuous monitoring. However, exact grade agreement lags behind the median expert performance, suggesting that while the model can reliably separate absent or mild–moderate toxicity from severe toxicity, fine-grained grading needs enhancement. The variability among oncologists underscores inherent challenges in subjective toxicity assessment, exacerbated by the study’s non-interactive, audio-based setup. Despite these constraints, the LLM maintained a low rate of severe (underestimation) errors and modest false-alarm rates, making it potentially useful for conservative, safety-oriented monitoring workflows that can reduce clinician workload and enable earlier interventions. For clinical adoption, targeted medical task training, interactive patient questioning, larger and more diverse datasets, and robustness testing are recommended to improve specificity, reduce severe underestimation, and generalize across settings.
Conclusion
LLMs, specifically a contextualized GPT-4, can classify subjective chemotherapy toxicities with performance comparable to expert oncologists for general severity categories, though exact grade classification requires improvement. Implemented thoughtfully, such systems could enhance patient monitoring, enable earlier interventions, and improve care efficiency. Future work should include task-specific training with relevant clinical datasets, validation in studies with real patients, incorporation of interactive capabilities for real-time clarification, expansion of sample sizes, and attention to accuracy, transparency, bias mitigation, and data privacy to safely integrate LLMs into clinical practice.
Limitations
- Use of fictitious cases and pre-recorded audios instead of real patient interactions, limiting clinical nuance and contributing to expert variability.
- Lack of interactive questioning by both oncologists and the LLM, which is critical for clarifying subjective symptoms.
- The LLM was contextualized but not specifically trained on labeled toxicity cases; targeted training might improve exact grading.
- Reliance on text transcriptions of the audios, losing prosodic cues such as tone and pauses that can inform severity assessments.
- Pilot scope with limited sample size (30 cases, 13 oncologists); findings require validation with real patients and larger, more diverse datasets.