Introduction
Clinical trials are crucial for advancing medical research and expanding treatment options, particularly in oncology. However, patient recruitment faces significant challenges: only a small percentage of eligible adults participate in cancer trials. One major hurdle is the time-consuming process of manually reviewing patient electronic health records (EHRs) against complex trial inclusion/exclusion criteria. Physicians struggle to systematically review patients against numerous trials, resulting in missed opportunities for patients. The information relevant to trial eligibility is often embedded in unstructured EHR data, such as medical notes, making large-scale interpretation challenging. Existing natural language processing (NLP) approaches often rely on rule-based systems, which are inflexible and difficult to scale. While recent studies show promise in using large language models (LLMs) for patient-trial matching, they frequently use simplified or synthetic datasets, failing to capture real-world complexities such as long context (multiple notes per patient) and diverse EHR systems. This research addresses these limitations by evaluating an end-to-end clinical trial matching system that applies LLMs to real-world EHR data, demonstrating the feasibility and effectiveness of the approach at scale.
Literature Review
Two main NLP approaches to patient-trial matching exist: converting trial criteria into structured queries for efficient data retrieval, and extracting key data elements from patient notes for structured filtering. Both rely on rule-based engineering and face scalability issues. Recent research exploring LLMs for patient-trial matching shows promise but primarily uses simplistic or synthetic data that lacks the complexity of real-world scenarios. These studies often focus on narrowly defined variables using specialized models, leading to overfitting and incomplete handling of complex criteria. Many approaches also depend on proprietary LLMs, hindering deployment in privacy-sensitive healthcare settings. This study aims to be the first comprehensive end-to-end evaluation of a clinical trial matching system on real-world EHR data and LLMs, directly ingesting unstructured patient notes and clinical trial criteria without rule-based processing.
Methodology
The PRISM pipeline comprises several modules: a chunking module that segments patient notes into passages the LLM can process; a retrieval module that identifies relevant information within the chunks; a trial composition module that converts trial criteria into simplified questions suitable for the LLM; a question-answering module in which the LLM (OncoLLM, or other models for comparison) answers those questions from the retrieved context; and a scoring module that aggregates the LLM's answers into a final score for each patient-trial pair. OncoLLM, a 14B-parameter model fine-tuned on a combination of real-world and synthetic oncology data, is compared against several other LLMs (GPT-3.5, GPT-4, Qwen14B-Chat, Mistral-7B-Instruct, Mixtral-8x7B-Instruct, Meditron, MedLlama, and TrialLlama). The dataset consists of 98 cancer patients, each paired with one ground-truth trial and nine comparison trials. Three scoring methods rank trials for each patient (patient-centric) and patients for each trial (trial-centric): Simple Counting, Iterative Tier, and Weighted Tier; the latter two weight criteria by clinical importance via a tiered system. The dataset was manually annotated by medical doctors, with inter-annotator agreement of 64% across all annotators and 70% between the top two. A cost analysis comparing OncoLLM and GPT-4 is also performed.
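To make the tiered aggregation concrete, a minimal sketch of a Weighted Tier-style score follows. The tier weights, verdict values, and the `weighted_tier_score` helper are illustrative assumptions for exposition, not the paper's actual parameters or implementation.

```python
# Illustrative sketch of a Weighted Tier-style aggregation, where
# per-criterion LLM answers are combined into one patient-trial score.
# Tier weights and verdict values below are assumed, not from the paper.

def weighted_tier_score(answers):
    """Aggregate (tier, verdict) pairs into a single score in [-1, 1].

    tier 1 marks the most clinically important criteria; verdict is
    "met", "not_met", or "na" (insufficient evidence in the notes).
    """
    tier_weights = {1: 3.0, 2: 2.0, 3: 1.0}          # assumed weighting
    verdict_values = {"met": 1.0, "na": 0.0, "not_met": -1.0}

    total = weight_sum = 0.0
    for tier, verdict in answers:
        w = tier_weights[tier]
        total += w * verdict_values[verdict]
        weight_sum += w
    # Normalize so trials with different criteria counts are comparable.
    return total / weight_sum if weight_sum else 0.0

# Example: two top-tier criteria met, one minor criterion unresolved.
score = weighted_tier_score([(1, "met"), (1, "met"), (3, "na")])
```

Normalizing by the total weight keeps scores comparable across trials with different numbers of criteria, which matters when ranking many trials per patient.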
Key Findings
OncoLLM significantly outperforms GPT-3.5 and other similarly sized models in criterion-level accuracy (63% vs. 53% for GPT-3.5) and nearly matches GPT-4 (68%). When ambiguous 'N/A' answers are excluded, OncoLLM's accuracy rises to 66%, exceeding that of the open-source models. In patient-centric ranking, OncoLLM consistently outperforms GPT-3.5 across all three scoring methods, placing the ground-truth trial within the top three 65.3% of the time (Weighted Tier method). In trial-centric ranking, OncoLLM again performs best, achieving an NDCG of 0.68 (Weighted Tier) versus 0.62 for GPT-3.5. Error analysis shows that OncoLLM's top-ranked trials are often genuinely eligible, and interpretation accuracy with evidence citations is high (75.26% for answers, 90.91% for explanations). The cost analysis shows OncoLLM is far more economical than GPT-4: roughly $170 versus $6,055 for 98 patients with ten trials each, or about $0.17 per patient-trial pair compared to about $6.18 for GPT-4. Analysis of where relevant information appears in patient notes demonstrates the importance of using the full patient history, although the most recent notes often contain the most critical data.
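For readers unfamiliar with the trial-centric metric: NDCG (normalized discounted cumulative gain) rewards placing relevant items near the top of a ranked list, discounting relevance logarithmically by rank. A minimal self-contained sketch of the standard formula (not the paper's code):

```python
import math

def dcg(relevances):
    # Discounted cumulative gain: relevance at rank i is divided by
    # log2(i + 2), so top-ranked relevant items contribute the most.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    # Normalize by the DCG of the ideal (descending) ordering,
    # so a perfect ranking scores exactly 1.0.
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0

# Example: the single eligible patient ranked second out of four.
print(round(ndcg([0, 1, 0, 0]), 3))  # prints 0.631
```

Here `relevances` lists a binary eligibility label per ranked patient; the reported scores of 0.68 vs. 0.62 would then reflect how consistently each model surfaces eligible patients near the top.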
Discussion
The results demonstrate the potential of using LLMs, specifically smaller fine-tuned models, for efficient and accurate clinical trial matching. PRISM's ability to handle unstructured data and long-context scenarios, combined with its comparable performance to medical doctors, presents a significant advancement over existing methods. The cost-effectiveness of OncoLLM makes it a practical solution for institutions with privacy and budget concerns. While the study focuses on patient-centric matching, the framework can also be applied to trial-centric searches. The tiered scoring approach effectively incorporates domain expertise for improved ranking performance. The findings suggest LLMs are nearing readiness for real-world clinical applications in trial matching.
Conclusion
PRISM, utilizing the OncoLLM model, offers a scalable and cost-effective solution for clinical trial matching, surpassing the performance of several existing LLMs while maintaining comparable accuracy to qualified medical professionals. Future work should focus on integrating structured data, improving embedding-based retrievers, and standardizing annotation processes to further enhance accuracy and reliability. This study paves the way for widespread adoption of AI-driven clinical trial matching, significantly improving patient access to clinical trials and accelerating medical research.
Limitations
The study relies primarily on unstructured data, potentially missing crucial information from structured EHR fields (like lab values). The accuracy of the system, although improved, is not yet perfect, partly due to challenges in obtaining a 'true' ground truth in real-world clinical settings. Inter-annotator variability highlights the need for enhanced annotation strategies and improved methods for defining patient eligibility criteria. Further research should explore hybrid approaches using both structured and unstructured data and refine the eligibility criteria to reduce ambiguity and improve accuracy. The study is limited to a single cancer center's data, and generalizability to other institutions needs to be validated.