Large Language Models in Randomized Controlled Trials Design: Observational Study
Medicine and Health | Journal of Medical Internet Research

L. Jin, J. C. L. Ong, et al.

Large language models may reshape randomized controlled trial design: GPT-4-Turbo-Preview replicated 72% of trial design elements, excelling in recruitment (88%) and intervention planning (93%) while enhancing diversity and pragmatism, though eligibility criteria (55%) and outcome measurement (53%) need refinement. Expert oversight remains essential for safety and ethics.
Abstract
Background: Randomized controlled trials (RCTs) face challenges such as limited generalizability, insufficient recruitment diversity, and high failure rates, often due to restrictive eligibility criteria and inefficient patient selection. Large language models (LLMs) have shown promise in various clinical tasks, but their potential role in RCT design remains underexplored.
Objective: This study investigates the ability of LLMs, specifically GPT-4-Turbo-Preview, to assist in designing RCTs that enhance generalizability and recruitment diversity and reduce failure rates, while maintaining clinical safety and ethical standards.
Methods: We conducted a noninterventional, observational study analyzing 20 parallel-arm RCTs, comprising 10 completed and 10 registered studies published after January 2024 to mitigate pretraining biases. The LLM was tasked with generating RCT designs based on input criteria, including eligibility, recruitment strategies, interventions, and outcomes. The accuracy of LLM-generated designs was quantitatively assessed by 2 independent clinical experts, who compared them with clinically validated ground truth data from ClinicalTrials.gov. We conducted statistical analysis using natural language processing (NLP)–based methods, including Bilingual Evaluation Understudy (BLEU), Recall-Oriented Understudy for Gisting Evaluation (ROUGE)-L, and Metric for Evaluation of Translation with Explicit ORdering (METEOR). Qualitative assessments were performed using Likert scale ratings (1-3) for domains such as safety, clinical accuracy, objectivity or bias, pragmatism, inclusivity, and diversity.
Results: The LLM achieved an overall accuracy of 72% in replicating RCT designs. Recruitment and intervention designs demonstrated high agreement with the ground truth, achieving 88% and 93% accuracy, respectively. However, the LLM showed lower accuracy in designing eligibility criteria (55%) and outcomes measurement (53%). NLP-based objective scores averaged BLEU=0.04, ROUGE-L=0.20, and METEOR=0.18. Qualitative evaluations showed LLM-generated designs scored above 2 points and closely matched the original designs across all domains, indicating strong clinical alignment. LLM-based designs ranked similarly high in safety, clinical accuracy, and objectivity or bias in published RCTs and were noninferior to original designs in registered RCTs in multiple domains. LLMs enhanced diversity and pragmatism, key factors in improving RCT generalizability and addressing failure rates.
Conclusions: LLMs, such as GPT-4-Turbo-Preview, show potential in improving RCT design, particularly in recruitment and intervention planning, while enhancing generalizability and diversity. Expert oversight and regulatory measures remain essential to ensure patient safety and ethics. The findings support further integration of LLMs into clinical trial design, with continued refinement needed for eligibility and outcomes measurement.
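For readers unfamiliar with the overlap metrics named in the abstract, the following is a minimal sketch of a ROUGE-L style score, which is based on the longest common subsequence (LCS) between a reference text and a generated text. It is not the authors' evaluation code: the whitespace tokenization, the balanced F-measure, and the function names `lcs_length` and `rouge_l` are assumptions made for illustration, and the example strings are hypothetical rather than taken from the study.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Tokens match: extend the subsequence found so far.
                curr.append(prev[j - 1] + 1)
            else:
                # No match: carry over the best length seen so far.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate):
    """ROUGE-L style F-measure (assumed balanced, beta = 1)."""
    # Assumption: simple lowercase whitespace tokenization.
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


# Hypothetical eligibility text, not from the study.
reference = "adults aged 18 to 65 with type 2 diabetes"
candidate = "adults with type 2 diabetes"
score = rouge_l(reference, candidate)
```

Scores of this kind reward shared word order rather than shared meaning, so two clinically equivalent designs phrased differently can still score low on lexical overlap.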
Publisher
Journal of Medical Internet Research
Published On
Sep 03, 2025
Authors
Liyuan Jin, Jasmine Chiat Ling Ong, Kabilan Elangovan, Yuhe Ke, Alexandra Pyle, Daniel Shu Wei Ting, Nan Liu
Tags
Large language models, RCT design, Generalizability, Recruitment diversity, GPT-4-Turbo-Preview, Eligibility criteria, Outcome measurement