Leveraging Large Language Models for Precision Monitoring of Chemotherapy-Induced Toxicities: A Pilot Study with Expert Comparisons and Future Directions

O. R. Sarrias, M. P. M. D. Prado, et al.

This study evaluated the capacity of Large Language Models (LLMs) to classify chemotherapy-induced toxicities, finding that they matched the accuracy of expert oncologists in general toxicity ratings. The authors highlight the potential of LLMs for improving patient monitoring while alleviating the burden on medical professionals. The research was conducted by Oskitz Ruiz Sarrias and colleagues.

Introduction
Large Language Models (LLMs), such as OpenAI's GPT models, have shown transformative potential across many fields, particularly medicine. Their ability to understand and generate contextual text, and to adapt to new tasks without task-specific training, is reshaping clinical practice by enhancing documentation, patient interaction, and decision-making. In oncology, LLMs offer significant potential for improving patient care through continuous monitoring of chemotherapy-induced toxicities, a task that often exceeds available human resources. However, existing research has not sufficiently explored LLMs' accuracy in identifying and assessing subjective toxicities based on patient descriptions. This pilot study aims to evaluate an LLM's ability to classify these toxicities accurately, as a step toward personalized and continuous patient care. The continuous and precise monitoring that oncology requires is currently hampered by resource limitations: healthcare providers cannot offer 24/7 personalized care to all patients. An AI system capable of real-time evaluation and alert generation could therefore improve care quality and optimize resource allocation. This study specifically evaluates whether an LLM can correctly identify the presence and severity of subjective toxicities (those relying on patient descriptions rather than analytical tests), comparing its interpretation with that of experienced oncologists. An LLM able to understand and grade these subjective toxicities accurately could transform patient monitoring, enabling personalized and continuous care without additional human resources, while improving care quality and reducing the risks associated with treatment toxicities. The integration of LLMs into cancer care is an emerging area with promising results in toxicity monitoring, medical record management, and decision support. Existing studies, however, also highlight challenges such as error rates and the need for expert verification, so implementation must be careful and rigorously validated.
Literature Review
Several recent studies explore LLMs in cancer care, demonstrating their potential to improve diagnostic and treatment accuracy and to support both doctors and patients. LLMs show promise in toxicity monitoring by processing patient-reported symptoms and facilitating early identification for timely interventions. They have also been used to manage electronic health records (EHRs) and to provide summaries in tumor boards. However, inconsistencies exist: some studies report significant error rates and outdated information from LLMs, emphasizing the need for thorough expert verification. In radiotherapy-related toxicity monitoring, a pilot study showed that GPT-4, used in summary or chatbot format, saved oncologists time but raised accuracy concerns. Other research found that a substantial percentage of ChatGPT responses in radiation oncology contained incorrect or incomplete information. In summary, while LLMs offer value across many aspects of cancer care, their effective implementation requires careful verification processes and clear guidelines to minimize the risks associated with inaccurate information. Addressing ethical and technical challenges such as diverse data management, algorithm transparency, bias avoidance, and data privacy is crucial for their safe integration into clinical practice. Existing research also highlights advances in toxicity monitoring using technology and patient-reported outcomes (PROs). Studies demonstrate the feasibility and benefits of online platforms for self-reporting toxicity symptoms, which improve patient-provider communication, satisfaction, and safety through real-time data and alerts. Electronic PRO (ePRO) assessments enhance care quality, and symptom self-reporting has shown positive effects on health-related quality of life (HRQL) and reduced hospitalizations. Despite these advances, challenges remain in integrating PRO data into clinical workflows and systems.
Methodology
This comparative pilot study evaluated an LLM's ability to classify subjective chemotherapy-induced toxicities. Thirty fictitious patient cases were created using expert knowledge and OpenAI's GPT-4, simulating realistic toxicological scenarios with diverse toxicities in proportions reflecting real-world incidence. The scenarios were carefully reviewed and validated for clinical appropriateness and realism. For each scenario, GPT-4 generated a realistic patient narrative, including personal details and daily-life experiences to enhance realism. The narratives were then converted into audio recordings using OpenAI's Text-to-Speech model and subsequently evaluated for realism by experts. Thirteen experienced oncologists (initially fifteen, but two were unavailable), all familiar with CTCAE v5.0, evaluated the audio recordings of the 30 fictitious cases and classified the reported toxicities according to CTCAE v5.0 severity grades (0-4). For efficiency, the oncologists were split into two groups, each evaluating 15 cases. OpenAI's GPT-4 (version trained with data up to December 2023), contextualized with the CTCAE v5.0 criteria, served as the LLM. The contextualization included detailed explanations of the classification method, toxicity descriptions, and grade interpretations; no worked classification examples were provided, so the model operated without task-specific training. The LLM classified each case ten times to capture response variability, and both the mode and the mean of these ten classifications were compared with the oncologists' evaluations. Two primary metrics were used: (1) accuracy in general categories, grouping toxicities into 'no toxicity', 'mild', and 'severe'; and (2) accuracy in specific categories, comparing the exact grade assigned by the LLM with that of the oncologists. Misclassifications were categorized as mild errors (the LLM assigned a higher grade than the oncologists, overestimating toxicity) or severe errors (the LLM assigned a lower grade, underestimating toxicity). False alarms were counted when the LLM classified a toxicity as severe while the oncologists' mean evaluation was not severe. The oncologists' individual accuracies were compared to establish the range of expert-level performance, and 95% confidence intervals for their median accuracy were calculated; the LLM's performance was then compared against these benchmarks.
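The study's code and exact prompts are not reproduced in this summary. The following Python sketch illustrates how the repeated-classification and aggregation step described above could be implemented with the OpenAI Python SDK; the model name, prompt wording, JSON output format, and function names are illustrative assumptions, not the authors' actual setup.

```python
# Illustrative sketch (not the authors' code): classify one fictitious case
# narrative ten times and aggregate the grades by mode and mean, assuming the
# OpenAI Python SDK and a CTCAE v5.0 context prompt prepared beforehand.
import json
from statistics import mode, mean
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CTCAE_CONTEXT = """You are an oncology assistant. Using the CTCAE v5.0
definitions provided below, grade each subjective toxicity mentioned by the
patient from 0 (absent) to 4 (life-threatening).
[... CTCAE v5.0 descriptions would be included here ...]
Respond only with JSON, e.g. {"nausea": 2, "fatigue": 1}."""

def classify_once(narrative: str) -> dict[str, int]:
    """Ask the model for one CTCAE grading of a patient narrative."""
    response = client.chat.completions.create(
        model="gpt-4",  # model name is illustrative
        messages=[
            {"role": "system", "content": CTCAE_CONTEXT},
            {"role": "user", "content": narrative},
        ],
        temperature=1.0,  # keep sampling variability across repetitions
    )
    return json.loads(response.choices[0].message.content)

def classify_case(narrative: str, repetitions: int = 10) -> dict:
    """Classify the same case several times and aggregate per toxicity."""
    runs = [classify_once(narrative) for _ in range(repetitions)]
    toxicities = {tox for run in runs for tox in run}
    aggregated = {}
    for tox in toxicities:
        grades = [run.get(tox, 0) for run in runs]  # missing mention -> grade 0
        aggregated[tox] = {"mode": mode(grades), "mean": mean(grades)}
    return aggregated
```

Aggregating by both mode and mean mirrors the two comparison strategies reported in the findings; temperature is left at its default so that the ten repetitions genuinely vary.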
Key Findings
The study revealed considerable variability in the oncologists' evaluations, attributable to the lack of direct patient interaction. When the mean of the LLM's responses was compared with the mean of the oncologists' responses, the LLM achieved 85.7% accuracy in general toxicity categories and 64.6% in specific categories. Of the LLM's misclassifications, 96.4% were mild errors (grade overestimated) and 3.6% were severe errors (grade underestimated); false alarms occurred in 3% of cases. Individual oncologist accuracy ranged from 66.7% to 89.2% for general categories and from 57.0% to 76.0% for specific categories, and the 95% confidence intervals for the oncologists' median accuracy were 81.9% to 86.9% (general) and 67.6% to 75.6% (specific). The LLM's accuracy in general categories fell within the expert range and confidence intervals, demonstrating comparable performance; its accuracy in specific categories, however, was below both the expert range and the confidence intervals, indicating a need for improvement. Using the mode of the LLM's responses instead of the mean, the LLM achieved 81.5% accuracy in general categories and 64.4% in specific categories, a slightly lower performance than with the mean.
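To make these metrics concrete, the sketch below shows one plausible way to compute them. The grade-to-category mapping (0 = no toxicity, 1-2 = mild, 3-4 = severe) and the bootstrap procedure for the median confidence interval are assumptions for illustration; the summary does not specify the exact procedures used.

```python
# Hypothetical sketch of the evaluation metrics reported above, under the
# assumed mapping grade 0 = "no toxicity", 1-2 = "mild", 3-4 = "severe",
# with a simple bootstrap for the median-accuracy confidence interval.
import numpy as np

def to_general(grade: int) -> str:
    """Collapse a CTCAE grade into the general category used for scoring."""
    if grade == 0:
        return "no toxicity"
    return "mild" if grade <= 2 else "severe"

def accuracies(llm_grades: list[int], reference_grades: list[int]) -> tuple[float, float]:
    """General-category and exact-grade (specific) agreement rates."""
    general = np.mean([to_general(a) == to_general(b)
                       for a, b in zip(llm_grades, reference_grades)])
    specific = np.mean([a == b for a, b in zip(llm_grades, reference_grades)])
    return float(general), float(specific)

def error_profile(llm_grades: list[int], reference_grades: list[int]):
    """Split disagreements into mild (overestimated) and severe (underestimated) errors."""
    errors = [(a, b) for a, b in zip(llm_grades, reference_grades) if a != b]
    mild = sum(a > b for a, b in errors)    # LLM assigned a higher grade
    severe = sum(a < b for a, b in errors)  # LLM assigned a lower grade
    false_alarms = sum(to_general(a) == "severe" and to_general(b) != "severe"
                       for a, b in zip(llm_grades, reference_grades))
    return mild, severe, false_alarms

def median_ci(per_oncologist_accuracy: list[float], n_boot: int = 10_000, seed: int = 0):
    """Bootstrap 95% CI for the median accuracy across oncologists (assumed method)."""
    rng = np.random.default_rng(seed)
    accs = np.asarray(per_oncologist_accuracy)
    medians = [np.median(rng.choice(accs, size=accs.size, replace=True))
               for _ in range(n_boot)]
    return tuple(np.percentile(medians, [2.5, 97.5]))
```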
Discussion
The findings indicate that LLMs can classify subjective chemotherapy-induced toxicities with accuracy comparable to expert oncologists, especially in general toxicity categories. The LLM's performance within the expert range for general categories suggests its potential for assisting in patient monitoring, while the lower accuracy in specific categories indicates that further refinement is needed to reach expert-level precision. The study's limitations (use of fictitious cases, lack of patient interaction, and reliance on audio transcriptions) shape how the results should be interpreted. The variability in the oncologists' evaluations, stemming from the absence of direct patient interaction, underscores the importance of incorporating real-time interaction in future studies. Despite these limitations, the results suggest significant potential for LLMs to enhance patient monitoring and reduce oncologists' workload. Early detection of toxicities would enable timely interventions and reduce severe complications, a substantial potential benefit. The limitations also highlight the need for task-specific LLM training and validation with real patients to achieve more accurate and reliable performance. The study established the feasibility of using fictitious cases to justify future studies with real patients and demonstrated the accuracy achievable by a contextualized LLM without task-specific training. The model's promising performance, despite these constraints, justifies further research to realize its clinical potential.
Conclusion
This study demonstrates the potential of LLMs in classifying chemotherapy-induced toxicities with accuracy comparable to experts, particularly in general categories. While the model shows promise, further research focusing on specific training, real-patient validation, and interactive capabilities is crucial. Ethical considerations regarding data accuracy, transparency, and privacy are paramount for safe clinical integration.
Limitations
The primary limitations of this study are the use of fictitious cases and pre-recorded audio instead of real patient interactions, the lack of an interactive component in the LLM evaluation, and the reliance on audio transcriptions rather than direct processing of audio input. These limitations could have influenced the variability in oncologist evaluations and the LLM's performance. Furthermore, the LLM was contextualized but not specifically trained for this medical task, potentially limiting its accuracy. Future studies with real patients are necessary to validate these findings and address these limitations.