Medicine and Health | Journal of Medical Internet Research

Large Language Models in Randomized Controlled Trials Design: Observational Study

L. Jin, J. C. L. Ong, et al.

Large language models may reshape randomized controlled trial design: GPT-4-Turbo-Preview replicated 72% of trial design elements, excelling in recruitment (88%) and intervention planning (93%) while enhancing diversity and pragmatism, though eligibility criteria and outcome measurement need refinement. Expert oversight remains essential for safety and ethics.

Abstract

Background: Randomized controlled trials (RCTs) face challenges such as limited generalizability, insufficient recruitment diversity, and high failure rates, often due to restrictive eligibility criteria and inefficient patient selection. Large language models (LLMs) have shown promise in various clinical tasks, but their potential role in RCT design remains underexplored.

Objective: This study investigates the ability of LLMs, specifically GPT-4-Turbo-Preview, to assist in designing RCTs that enhance generalizability, recruitment diversity, and reduce failure rates, while maintaining clinical safety and ethical standards.

Methods: We conducted a noninterventional, observational study analyzing 20 parallel-arm RCTs, comprising 10 completed and 10 registered studies published after January 2024 to mitigate pretraining biases. The LLM was tasked with generating RCT designs based on input criteria, including eligibility, recruitment strategies, interventions, and outcomes. The accuracy of LLM-generated designs was quantitatively assessed by 2 independent clinical experts by comparing them to clinically validated ground truth data from ClinicalTrials.gov. We conducted statistical analysis using natural language processing–based methods, including Bilingual Evaluation Understudy (BLEU), Recall-Oriented Understudy for Gisting Evaluation (ROUGE)-L, and Metric for Evaluation of Translation with Explicit ORdering (METEOR). Qualitative assessments were performed using Likert scale ratings (1-3) for domains such as safety, clinical accuracy, objectivity or bias, pragmatism, inclusivity, and diversity.

Results: The LLM achieved an overall accuracy of 72% in replicating RCT designs. Recruitment and intervention designs demonstrated high agreement with the ground truth, achieving 88% and 93% accuracy, respectively. However, LLMs showed lower accuracy in designing eligibility criteria (55%) and outcomes measurement (53%). NLP-based objective scores averaged BLEU=0.04, ROUGE-L=0.20, and METEOR=0.18. Qualitative evaluations showed LLM-generated designs scored above 2 points and closely matched the original designs across all domains, indicating strong clinical alignment. LLM-based designs ranked similarly high in safety, clinical accuracy, and objectivity or bias in published RCTs and were noninferior to original designs in registered RCTs in multiple domains. LLMs enhanced diversity and pragmatism, key factors in improving RCT generalizability and addressing failure rates.

Conclusions: LLMs, such as GPT-4-Turbo-Preview, show potential in improving RCT design, particularly in recruitment and intervention planning, while enhancing generalizability and diversity. Expert oversight and regulatory measures remain essential to ensure patient safety and ethics. The findings support further integration of LLMs into clinical trial design, with continued refinement needed for eligibility and outcomes measurement.
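To make the overlap metrics more concrete, the following is a minimal sketch of how a ROUGE-L style score could be computed between a ground-truth design text and an LLM-generated one. It is not the authors' implementation: the whitespace tokenization, the balanced F1 form (rather than a weighted F-measure), and the function names `lcs_length` and `rouge_l_f1` are all assumptions made for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L style F1 from LCS-based precision and recall.

    Assumption: lowercase whitespace tokenization and a balanced F1;
    published ROUGE implementations may tokenize and weight differently.
    """
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical eligibility texts, invented for this example only.
    truth = "adults aged 18 to 65 years with type 2 diabetes"
    generated = "patients with type 2 diabetes between 18 and 65 years old"
    print(round(rouge_l_f1(truth, generated), 2))
```

Because the score rewards only shared word sequences, two clinically equivalent descriptions phrased differently can score low, which is one plausible reading of why lexical-overlap scores may sit well below expert-rated accuracy.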

Publisher

Journal of Medical Internet Research

Published On

Sep 03, 2025

Authors

Liyuan Jin, Jasmine Chiat Ling Ong, Kabilan Elangovan, Yuhe Ke, Alexandra Pyle, Daniel Shu Wei Ting, Nan Liu

DOI

https://doi.org/10.2196/67469

