
Medicine and Health

Large Language Model Analysis of Reporting Quality of Randomized Clinical Trial Articles: A Systematic Review

A. Srinivasan, J. Berkowitz, et al.

Incomplete reporting in randomized clinical trials obscures bias and limits reproducibility. Using a zero-shot large-language-model pipeline, the authors automated CONSORT assessment across 21,041 RCTs, mapping reporting quality over time and across disciplines and uncovering persistent gaps in critical trial details. Research conducted by Apoorva Srinivasan, Jacob Berkowitz, Nadine A. Friedrich, Sophia Kivelson, and Nicholas P. Tatonetti.
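The pipeline's core operation is a per-item yes/no judgment. The authors' exact prompts and output parsing are not reproduced on this page, so the prompt wording, the assess_item helper, and the single item shown below are illustrative assumptions; only the model (GPT-4o-mini) and the zero-shot, item-by-item design come from the paper. A minimal sketch using the OpenAI Python SDK:

```python
# Minimal sketch of a zero-shot CONSORT item check; prompt text and helper
# name are assumptions, not the authors' published pipeline.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CONSORT_ITEM = "Allocation concealment mechanism"  # one of the 21 items assessed

def assess_item(article_text: str, item: str) -> bool:
    """Ask the model whether a single CONSORT item is reported in the article."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,  # keep the audit decision as deterministic as possible
        messages=[
            {"role": "system",
             "content": "You audit randomized clinical trial reports against "
                        "the CONSORT checklist. Answer strictly YES or NO."},
            {"role": "user",
             "content": f"CONSORT item: {item}\n\n"
                        f"Article text:\n{article_text}\n\n"
                        "Is this item explicitly reported? Answer YES or NO."},
        ],
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")
```

In the study, a check of this kind was repeated for all 21 CONSORT items on each article, and the binary decisions were aggregated into compliance proportions.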

Abstract
IMPORTANCE Incomplete reporting in randomized clinical trials (RCTs) obscures bias and limits reproducibility. Manual audits for adherence to the Consolidated Standards of Reporting Trials (CONSORT) guideline cannot keep pace with publication volume.

OBJECTIVES To build and validate a zero-shot large-language-model (LLM) pipeline for automated CONSORT assessment and to map reporting quality over time and across biomedical disciplines and trial features.

DESIGN, SETTING, AND PARTICIPANTS This systematic review included RCTs that were indexed on PubMed, available in English, open access, human-participant research, and published between MONTH 1966 and MONTH 2024. PubMed PDFs were converted to XML and linked with Semantic Scholar and ClinicalTrials.gov metadata. GPT-4o-mini was tested on the 50-article CONSORT-Text Classification Model (CONSORT-TM) benchmark, checked against expert review in 70 randomly sampled RCTs, and then applied to the full sample.

EXPOSURE Publication year, biomedical discipline, funding source, trial phase, US Food and Drug Administration regulation, and oversight features.

MAIN OUTCOMES AND MEASURES The LLM judged whether each of 21 CONSORT items was met. Primary outcomes were (1) model performance vs expert review (precision, recall, and macro F1 score) and (2) the proportion of items reported.

RESULTS Of 53 137 screened PDFs, 21 041 RCTs (median [IQR] publication year, 2014 [2003-2020]; 30 disciplines) were included, with a registry-linked subset of 1790 RCTs that had a median (IQR) planned enrollment of 210 (95-440) participants. In the 70-article validation set (2210 decisions), LLM outputs matched expert judgments 91.7% of the time (2026 of 2210 decisions); the macro F1 score on CONSORT-TM was 0.86 (95% CI, 0.84-0.87). Mean CONSORT compliance increased from 27.3% (95% CI, 27.0%-27.6%) in 1966-1990 to 57.0% (95% CI, 56.8%-57.2%) in 2010-2024. However, reporting of critical elements remained uncommon, such as the allocation-concealment mechanism (16.1% [95% CI, 15.6%-16.6%]) and external-validity discussion (1.6% [95% CI, 1.5%-1.8%]). Compliance varied across disciplines, from 35.2% (95% CI, 34.8%-35.6%) in pharmacology to 63.4% (95% CI, 62.1%-64.7%) in urology, and showed only negligible associations with clinical trial characteristics (all Cramér V <0.10).

CONCLUSIONS AND RELEVANCE In this systematic review of RCTs, a zero-shot LLM audited CONSORT adherence at scale, uncovering persistent reporting gaps and wide disciplinary variation across biomedical fields and underscoring the need for targeted editorial action to boost transparency and reproducibility.
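For readers who want to see how the abstract's two kinds of statistics are computed, a minimal sketch follows: macro F1 (with precision and recall) for model-vs-expert agreement, and Cramér V for the association between compliance and a trial characteristic. The labels and contingency table below are invented toy data for illustration, not the study's data.

```python
# Toy reproduction of the abstract's metric types; all numbers are invented.
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score
from scipy.stats.contingency import association

# Expert vs LLM binary decisions for item-level validation (toy example)
expert = [1, 0, 1, 1, 0, 1, 0, 1]
llm    = [1, 0, 1, 0, 0, 1, 1, 1]

print("precision:", precision_score(expert, llm))
print("recall:   ", recall_score(expert, llm))
print("macro F1: ", f1_score(expert, llm, average="macro"))

# Cramér V from a contingency table: item reported (rows) x funding source (cols)
table = np.array([[120, 80],
                  [ 60, 90]])
print("Cramér V:", association(table, method="cramer"))
```

A Cramér V below 0.10, as reported for all trial characteristics in the study, indicates a negligible association.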
Publisher
JAMA Network Open
Published On
Aug 28, 2025
Authors
Apoorva Srinivasan, Jacob Berkowitz, Nadine A. Friedrich, Sophia Kivelson, Nicholas P. Tatonetti
Tags
CONSORT adherence
randomized clinical trials
large language model
zero-shot auditing
reporting transparency
biomedical disciplines
reproducibility