PRISM: Patient Records Interpretation for Semantic clinical trial Matching system using large language models
S. Gupta, A. Basu, et al.
Discover PRISM, a system developed by Shashi Gupta, Aditya Basu, and colleagues that uses large language models to automate the labor-intensive process of matching patients to clinical trials. Its domain-tuned OncoLLM model outperforms general-purpose baselines such as GPT-3.5 and approaches GPT-4 accuracy, while preserving patient privacy through deployment on private, HIPAA-compliant infrastructure at a fraction of GPT-4's cost.
Introduction
The study addresses the challenge of efficiently and accurately matching patients to clinical trials, a task hindered by labor-intensive manual review of unstructured EHR notes and complex inclusion/exclusion criteria. In oncology, critical eligibility details (e.g., staging) are often absent from structured data, making scalable automation especially difficult. Prior NLP approaches rely on rule-based systems or structured query conversion, which are hard to scale across diverse trials and EHR systems. Recent LLM-based methods show promise but have largely been evaluated on simplified or synthetic datasets and short contexts, limiting real-world applicability. The authors propose PRISM, an end-to-end pipeline leveraging LLMs to directly process unstructured patient notes and free-text trial criteria, hypothesizing that a domain-tuned model (OncoLLM) can achieve performance comparable to clinicians while enabling privacy-preserving deployment and cost efficiency.
Literature Review
Two conventional NLP strategies dominate prior work: (1) translating trial inclusion/exclusion criteria into structured queries to retrieve from clinical data warehouses, and (2) extracting key patient elements from notes to structure and filter trials. These methods often require extensive rule engineering and face scaling limitations across heterogeneous trials and EHRs. Recent LLM-driven studies demonstrate improved patient-trial matching but typically use synthetic or simplified notes, limited numbers of notes per patient, and narrow trial sets, often relying on proprietary LLMs. TrialGPT showed GPT-3.5 can rank trials; TrialLlama suggested open-source models (e.g., LLAMA 2) can outperform GPT-3.5 for privacy-sensitive deployments. Other contemporaneous efforts used GPT-4 with retrievers but evaluated only on curated charts (2–5 notes) and a single trial with generic criteria (13 criteria), reducing complexity relative to real settings. The present study differentiates itself by using full EHR journeys (often 100–200 notes), over 200 trials, and roughly 10,000 criteria, and by introducing a fine-tuned, smaller, deployable model (OncoLLM).
Methodology
Study approval: MCW IRB #00044894.

Problem formulation: The task is framed as compositional question answering. For a patient P with notes N and trial criteria T, each criterion is decomposed into questions Q with answers in {Yes, No, NA}. A composition function aggregates the answers through logical operators to produce a match score S used for ranking.

Dataset preparation: Patients were identified from a deidentified institutional data warehouse containing EHR notes and enrollment information. For each of 98 patients, one ground-truth trial (the trial the patient actually enrolled in) was chosen based on criteria complexity, and notes were truncated to the pre-enrollment period. For ranking, 9 additional active trials of the same cancer type (in which the patient was not enrolled) were selected per patient as negatives, yielding 980 patient-trial pairs.

PRISM pipeline:
(1) Trial composition module: Free-text trial criteria from ClinicalTrials.gov are converted by GPT-3.5 into simplified, independent Yes/No/NA questions. Complex criteria are decomposed into multiple questions connected with Boolean logic in disjunctive normal form (DNF). Manual evaluation across 50 trials found ~1% missed questions, ~2% incorrectly formed questions, and ~89% correct Boolean logic.
(2) Chunking and retrieval: Patient notes are filtered and chunked with spaCy into three-sentence chunks with one-sentence overlap to preserve context. Duplicate chunks are removed by exact match and cosine similarity (threshold 0.96), retaining the most recent copy. Embeddings are computed with Azure OpenAI Ada, and cosine similarity retrieves the chunks most relevant to each question.
(3) Question answering: Retrieved chunks are ordered chronologically (with note date and type) to support temporal reasoning. Zero-shot prompting with chain-of-thought, temperature 0, and a maximum output of 8k characters are used. Models are served via Azure HIPAA-compliant APIs or vLLM on HIPAA-compliant infrastructure (4×A100 80GB). The output JSON includes a question explanation, an answer explanation, the final answer (Yes/No/NA), and a confidence score (1–5).
(4) Answers to criteria: For each criterion's logical tree, if some question answers are NA, the system marginalizes over the possible completions: P(criterion met | data) is summed over NA assignments with uniform priors (1/2^N). Thresholds: Yes if >0.66, No if <0.34, otherwise NA. Logical short-circuiting applies when the known answers already determine the truth value (e.g., a conjunction with a known false conjunct).
(5) Scoring module: Three trial-level scoring methods are implemented (see the sketch following this list). Simple counting: the proportion of criteria met. Iterative tier: criteria are grouped into tiers T1–T4 by clinical importance; tiers are traversed from most to least important, stopping at the first violation, and the score is the proportion met up to the violation, normalized by the total number of criteria. Weighted tier: criterion-level scores s(x) are assigned (met = 1; NA = 0.5; unmet in T1 = −0.5; unmet in other tiers = 0), aggregated per tier, and combined with weights w1=2, w2=1.5, w3=1, w4=0.5.
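The answer-to-criteria aggregation and the weighted-tier scoring can be made concrete with a short sketch. The Python below is a minimal illustration under assumptions the paper does not fix: criteria are represented as explicit DNF lists of (question, expected answer) pairs, explicit short-circuiting is folded into the marginalization, and tier weights are applied per criterion rather than to per-tier aggregates. It is not the authors' implementation.

```python
from itertools import product

# A criterion is a DNF expression: a list of conjunctions, each conjunction a
# list of (question_id, expected_answer) pairs; answers are "Yes", "No", or "NA".

def dnf_satisfied(dnf, assignment):
    """True if any conjunction has every question answered as expected."""
    return any(all(assignment[q] == want for q, want in conj) for conj in dnf)

def criterion_status(dnf, answers, yes_thr=0.66, no_thr=0.34):
    """Marginalize over NA answers with uniform priors, then threshold."""
    na_qs = sorted({q for conj in dnf for q, _ in conj if answers[q] == "NA"})
    if not na_qs:  # no unknowns: evaluate the logical tree directly
        return "Yes" if dnf_satisfied(dnf, answers) else "No"
    completions = list(product(["Yes", "No"], repeat=len(na_qs)))  # 2^N cases
    p_met = 0.0
    for combo in completions:
        filled = dict(answers)
        filled.update(zip(na_qs, combo))
        if dnf_satisfied(dnf, filled):
            p_met += 1.0 / len(completions)  # uniform prior 1/2^N per completion
    if p_met > yes_thr:
        return "Yes"
    if p_met < no_thr:
        return "No"
    return "NA"

# Weighted-tier trial score: met = 1, NA = 0.5, unmet in tier 1 = -0.5,
# unmet in tiers 2-4 = 0; tier weights w1..w4 = 2, 1.5, 1, 0.5.
TIER_WEIGHTS = {1: 2.0, 2: 1.5, 3: 1.0, 4: 0.5}

def weighted_tier_score(criteria):
    """criteria: list of (tier, status) pairs, status in {"Yes", "No", "NA"}."""
    def s(tier, status):
        if status == "Yes":
            return 1.0
        if status == "NA":
            return 0.5
        return -0.5 if tier == 1 else 0.0
    return sum(TIER_WEIGHTS[t] * s(t, st) for t, st in criteria)

# Example: a criterion "stage III-IV AND no prior immunotherapy" where the
# second question is unanswered -> P(met) = 0.5 -> status "NA".
dnf = [[("q_stage_3_4", "Yes"), ("q_prior_immunotherapy", "No")]]
print(criterion_status(dnf, {"q_stage_3_4": "Yes", "q_prior_immunotherapy": "NA"}))  # NA
print(weighted_tier_score([(1, "Yes"), (2, "NA"), (3, "No")]))  # 2*1 + 1.5*0.5 + 1*0 = 2.75
```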
Q&A dataset for accuracy benchmarking: From 10,000 notes across 50 cancer patients, 720 binary questions were created from trial criteria. GPT-3.5 preselected relevant note segments (98% precision, 94% recall) to reduce manual annotation load, and five physicians annotated the answers (Yes/No/NA). Inter-annotator agreement averaged 64% across the five annotators and 70% between the two best.

OncoLLM model: Built on Qwen-1.5 14B and fine-tuned via supervised fine-tuning with LoRA (rank 64, alpha 32, dropout 0.1) on several thousand proprietary chunk-question pairs plus ~2B tokens of synthetic oncology text (a configuration sketch is shown below). Training used 8×A100 GPUs (40–80GB) for ~1,500 GPU hours, with a learning rate of 3e-4 and batch size 1 with gradient accumulation of 16, for 5 epochs on clinical data and 1 epoch on synthetic data. No patient data or trials used in the reported evaluation appeared in training.

Costing: OncoLLM runtime on vLLM was measured in tokens/sec and costed using GCP GPU VM hourly rates; GPT-4 costs were estimated from Azure pricing.
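The reported fine-tuning setup (LoRA rank 64, alpha 32, dropout 0.1; learning rate 3e-4; batch size 1 with 16-step gradient accumulation) maps naturally onto the Hugging Face peft/transformers stack. The sketch below is an illustrative configuration only: the base checkpoint id, target modules, output path, and trainer wiring are assumptions, not details published by the authors.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model

base = "Qwen/Qwen1.5-14B"  # assumed Hugging Face id for the Qwen-1.5 14B base model
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base)

lora_cfg = LoraConfig(
    r=64,                 # LoRA rank reported in the paper
    lora_alpha=32,
    lora_dropout=0.1,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
)
model = get_peft_model(model, lora_cfg)

args = TrainingArguments(
    output_dir="oncollm-sft",          # hypothetical output path
    learning_rate=3e-4,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    num_train_epochs=5,                # 5 epochs on clinical data; synthetic data trained separately for 1
    bf16=True,
    logging_steps=50,
)
# A supervised fine-tuning loop (e.g., a Trainer over tokenized chunk-question
# pairs) would consume `model`, `tokenizer`, and `args`; the dataset is proprietary.
```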
Key Findings
Question-level accuracy (720 Q&A pairs): OncoLLM achieved 63% accuracy, outperforming GPT-3.5 Turbo (53%), Qwen14B-Chat (43%), Mistral-7B-Instruct (41%), Mixtral-8x7B-Instruct (49%), and domain/task-specific models (Meditron, MedLlama, TrialLlama). GPT-4 reached 68%, but at much greater computational cost and scale. McNemar's test comparing GPT-3.5 and OncoLLM yielded p=0.0013 (<0.05), indicating a statistically significant difference. Excluding questions marked NA by experts raised OncoLLM accuracy to 66% (from 63%) and GPT-4 to 72% (from 68%), while GPT-3.5 and the other open-source models decreased, indicating inflated performance on ambiguous inputs. Concept-wise accuracy across 13 oncology concepts and four importance tiers showed OncoLLM generally second-best overall and outperforming GPT-4 on biomarkers.

Ranking performance (metrics sketched below): Patient-centric evaluation (98 patients, 10 trials per patient; metric: proportion of ground-truth trials ranked in the top 3): OncoLLM scored 0.65 with weighted tier, 0.63 with iterative tier, and 0.62 with simple counting, versus 0.59, 0.61, and 0.57 for GPT-3.5 Turbo. Trial-centric evaluation (36 trials; NDCG with binary relevance): OncoLLM scored 0.68 (weighted), 0.66 (iterative), and 0.62 (simple), versus 0.62, 0.63, and 0.62 for GPT-3.5 Turbo.

Criteria coverage on ground-truth trials (98 pairs): OncoLLM met 62% of criteria on average versus 55.4% for GPT-3.5; for cases where the ground-truth trial ranked in the top 3, OncoLLM met 66.7% versus 59% for GPT-3.5. OncoLLM also produced fewer NA answers than GPT-3.5 on the simplified questions for top-3 cases, improving practical utility.

Error analysis: For 10 trials ranked top-1 by OncoLLM that were not the patient's actual enrollment, a clinical data expert found 9 of 10 patients eligible, implying an adjusted top-3 success rate of ~95%.

Criteria-level interpretation audit: Final answers were correct 75.26% of the time; when the answer was correct, explanations were correct 90.91% of the time and citations were correct 86.71% (partially correct 6.29%).

Cost-benefit: The estimated cost of matching 98 patients × 10 trials was ~$170 total for OncoLLM (~$0.17 per patient-trial pair) versus ~$6,055 for GPT-4 (~$6.18 per patient-trial pair), roughly 35× higher for GPT-4. Training OncoLLM cost ~$2,688 (7 days × $16/hour on 8×A100), a one-time cost amortizable across screenings.

Use of all notes: Relevant evidence spans long periods (often more than a year), with a concentration in the most recent 50% of notes, underscoring the need to process the full EHR while recognizing the importance of recency.
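For readers unfamiliar with the two ranking measures, the following minimal Python sketch shows how a patient-centric top-3 hit rate and a trial-centric NDCG with binary relevance are typically computed; the function names and the toy example are illustrative and not taken from the paper.

```python
import math

def top_k_hit_rate(rankings, k=3):
    """Patient-centric metric: fraction of patients whose ground-truth trial
    appears in the top k of their ranked trial list.
    rankings: list of (ranked_trial_ids, ground_truth_trial_id) pairs."""
    hits = sum(1 for ranked, truth in rankings if truth in ranked[:k])
    return hits / len(rankings)

def ndcg_binary(ranked_relevance, k=None):
    """Trial-centric metric: NDCG with binary relevance labels
    (e.g., 1 = enrolled/eligible patient, 0 = not), given in ranked order."""
    if k is not None:
        ranked_relevance = ranked_relevance[:k]
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked_relevance))
    ideal = sorted(ranked_relevance, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Example: ground-truth trial ranked 2nd of 10 counts as a top-3 hit; a trial
# whose only relevant patient is ranked 3rd gets NDCG = 1/log2(4) = 0.5.
print(top_k_hit_rate([(["t7", "t3", "t1", "t9"], "t3")]))  # 1.0
print(ndcg_binary([0, 0, 1, 0]))                           # 0.5
```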
Discussion
Findings demonstrate that an end-to-end LLM pipeline using unstructured EHR notes and free-text clinical trial criteria can approach or match clinician performance at criteria-level tasks while enabling effective two-way ranking (patient-to-trial and trial-to-patient). OncoLLM, a smaller, fine-tuned model, closely approaches GPT-4 accuracy with substantially lower costs and can be deployed within privacy-compliant infrastructures, addressing key barriers to real-world adoption. The ranking analyses validate that higher question-level accuracy translates into superior retrieval of relevant trials/patients, and that weighting criteria by clinical importance (tiering) improves ranking utility in line with clinical workflows. The reduced proportion of NA responses and higher proportion of criteria met on ground-truth trials make OncoLLM’s outputs more actionable for human-in-the-loop screening. Audits of explanations and citations show high correctness, supporting safe verification and traceability. The distribution of relevant evidence across the entire patient timeline emphasizes the necessity of comprehensive note ingestion with chronological structuring to answer temporal and longitudinal criteria. Collectively, the results indicate LLM-driven matching can substantially reduce manual effort and improve identification of eligible trials, including those not previously offered to patients.
Conclusion
The study introduces PRISM, a scalable end-to-end system that ingests unstructured patient notes and free-text trial criteria to perform criteria-level reasoning and ranking for clinical trial matching. The domain-tuned OncoLLM outperforms GPT-3.5, approaches GPT-4 on criteria accuracy, and yields superior patient- and trial-centric ranking, all at a fraction of GPT-4’s cost while supporting privacy-preserving deployment. Weighted tiering aligned with clinical importance improves ranking utility, and explanation/citation audits support reliable human verification. Future work should integrate structured data (e.g., labs) with unstructured notes, improve and rigorously evaluate retrievers, optimize tier weights, and standardize annotation and evaluation protocols to reduce variability and enhance generalizability. These advances will further enable robust, cost-effective, and privacy-sensitive AI systems for clinical trial screening in real-world oncology settings.
Limitations
Current reliance solely on unstructured notes can miss critical information commonly recorded in structured formats (e.g., lab values). Embedding-based retrievers have inherent limitations and were not fine-tuned in this study; their impact on outcomes warrants deeper evaluation. Obtaining definitive ground truth is difficult; annotator variability was notable, particularly for ambiguous criteria, lowering inter-annotator agreement. Some evaluated domain/task-specific baseline models had limited context windows, restricting direct comparisons on all samples. Although OncoLLM performs strongly, end-to-end accuracy has not reached ideal levels, and broader external validation across institutions and cancer types is needed.