Large Language Model Analysis of Reporting Quality of Randomized Clinical Trial Articles: A Systematic Review



A. Srinivasan, J. Berkowitz, et al.

Incomplete reporting in randomized clinical trials obscures bias and limits reproducibility. Using a zero-shot large-language-model pipeline, the authors automated CONSORT assessment across 21,041 RCTs, mapping reporting quality over time and across disciplines and uncovering persistent gaps in critical trial details. Research conducted by Apoorva Srinivasan, Jacob Berkowitz, Nadine A. Friedrich, Sophia Kivelson, and Nicholas P. Tatonetti.

Introduction
Randomized clinical trials (RCTs) underpin evidence-based medicine, but methodological flaws and incomplete reporting can bias results and impede reproducibility and guideline development. The CONSORT statement (1996, updated in 2001 and 2010) aims to standardize and improve trial reporting via a 25-item checklist, yet adherence remains inconsistent. Manual audits are labor-intensive and limited to small samples; earlier automated, rule-based or traditional machine-learning approaches have struggled with the complexity of full-text reporting. This study asks whether a zero-shot large language model (LLM) can accurately and scalably assess CONSORT compliance and, using that capability, characterize temporal trends, disciplinary differences, and associations with trial features across decades of RCT publications.
Literature Review
Prior NLP work on RCTs largely addressed PICO extraction, risk of bias, and abstract classification rather than comprehensive CONSORT compliance. Early compliance tools used rule-based methods, while transformer-based models improved performance but required substantial labeled data. Initial LLM applications showed promise on small corpora but mixed zero-shot performance (e.g., macro F1 ≈0.51 on the target dataset). Research on CONSORT adherence has typically relied on manual coding in small, discipline-specific samples or keyword checks for guideline references, limiting scale and comprehensiveness. Recent large-scale analyses of methods sections suggest improvements over time but do not cover all CONSORT items. This work advances the field by validating a zero-shot LLM against a benchmark and expert review, then applying it to a large corpus of full-text RCTs to comprehensively assess CONSORT reporting.
Methodology
Design: Systematic review and automated assessment of CONSORT compliance using zero-shot LLMs.

Datasets: Benchmark evaluation used the CONSORT-TM corpus of 50 RCT publications annotated at the sentence level for 37 CONSORT items. For the large-scale analysis, 53,137 open-access human RCTs (PubMed, 1966-2024) were identified; 21,041 full-text PDFs met inclusion criteria and were converted to XML (PyMuPDF v1.26.3). Metadata were enriched via Semantic Scholar. For 1,790 articles, NCT identifiers enabled linkage to ClinicalTrials.gov to extract trial characteristics (phase, funder, FDA status, data monitoring committee, safety outcomes). Only primary results articles were included; protocol/rationale articles were excluded.

Models and prompting: Three proprietary models (GPT-4, GPT-4o, GPT-4o-mini), accessed via an Azure HIPAA-compliant endpoint, and one open-source model (Llama-2-7B-chat) were tested in zero-shot mode. Each CONSORT criterion was assessed independently for each article by passing the entire article text and a definition of the item in a structured prompt; the model returned JSON with the criterion, a rationale (step-by-step explanation), a decision (MET/NOT MET), and a confidence level (Low/Medium/High). A code sketch of this per-item step appears at the end of this section.

Evaluation and deployment: Performance was measured using precision, recall, accuracy, and macro/micro F1 scores. The model with the best speed-accuracy trade-off (GPT-4o-mini; LLM 3) was applied to the full corpus, prompted item by item for each article. Confidence filtering retained only high-confidence decisions for downstream analyses; medium-confidence outputs were excluded, and no low-confidence outputs occurred.

Human validation: A stratified random sample of 70 articles (spanning eras and 30 Scimago specialties) was independently reviewed by 4 experts, who judged each model decision as correct, partially correct, or incorrect. Inter-rater reliability (double annotation on 10 articles) yielded Cohen κ=0.64, indicating substantial agreement. Error analysis informed the exclusion of four CONSORT items (3b, 6b, 7b, 14b) that describe events not universally applicable and were often misclassified as nonreporting.

Analysis: Temporal trends across four eras (1966-1990, 1990-2000, 2000-2010, 2010-2024), discipline-level patterns (Scimago specialties), and trial-level factors (phase, funder, FDA regulation, region, monitoring, adverse events) were assessed. Statistics included Wilson 95% CIs for proportions, χ²/Fisher exact tests for categorical comparisons, Pearson correlations for continuous associations, and Cramer V effect sizes interpreted as negligible (<0.10), small (0.10-0.19), medium (0.20-0.29), or large (≥0.30). Analyses were conducted in Python 3.8 with pandas 2.0, SciPy 1.10, and statsmodels 0.14.
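To make the per-item assessment concrete, the following is a minimal sketch of how such a zero-shot prompt-and-parse step could be implemented. The prompt wording, the use of the public OpenAI Python client rather than the authors' Azure endpoint, and the helper names (assess_item, keep_high_confidence) are illustrative assumptions; the paper's exact prompts are not reproduced here.

```python
# Minimal sketch of the per-item, zero-shot CONSORT assessment described above.
# Assumptions (not from the paper): exact prompt wording, the public OpenAI
# client instead of the authors' Azure HIPAA-compliant endpoint, and the
# helper names assess_item / keep_high_confidence.
import json
from openai import OpenAI

client = OpenAI()

PROMPT_TEMPLATE = """You are auditing the reporting of a randomized clinical trial.
CONSORT item {item_id}: {item_definition}

Read the full article text below and decide whether this item is reported.
Respond in JSON with the keys: "criterion", "rationale" (step-by-step explanation),
"decision" ("MET" or "NOT MET"), and "confidence" ("Low", "Medium", or "High").

Article text:
{article_text}"""


def assess_item(article_text, item_id, item_definition):
    """Assess one CONSORT item for one article and return the parsed JSON verdict."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",                      # "LLM 3", the deployed model
        temperature=0,                            # deterministic, matching run-to-run identical results
        response_format={"type": "json_object"},  # force a JSON object reply
        messages=[{
            "role": "user",
            "content": PROMPT_TEMPLATE.format(
                item_id=item_id,
                item_definition=item_definition,
                article_text=article_text,
            ),
        }],
    )
    return json.loads(response.choices[0].message.content)


def keep_high_confidence(verdicts):
    """Confidence filtering: retain only high-confidence decisions for downstream analyses."""
    return [v for v in verdicts if v.get("confidence") == "High"]
```

In the study, every CONSORT item was assessed with a separate call of this kind for each article, and only high-confidence decisions entered the downstream analyses.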
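Similarly, the interval estimates and effect sizes named in the Analysis paragraph can be computed with the libraries the paper lists (SciPy and statsmodels). The sketch below shows Wilson 95% CIs and Cramer V; the function names and example counts are illustrative, not study data.

```python
# Minimal sketch of the interval and effect-size calculations named under "Analysis",
# using the libraries the paper lists (SciPy, statsmodels). The counts below are
# invented illustrations, not study data.
import numpy as np
from scipy.stats import chi2_contingency
from statsmodels.stats.proportion import proportion_confint


def wilson_ci(successes, total, alpha=0.05):
    """Wilson 95% CI for a reported proportion (e.g., share of articles meeting an item)."""
    return proportion_confint(successes, total, alpha=alpha, method="wilson")


def cramers_v(table):
    """Cramer's V effect size for a contingency table (e.g., compliance by FDA regulation)."""
    table = np.asarray(table)
    chi2, p, _, _ = chi2_contingency(table, correction=False)  # no Yates correction for the effect size
    n = table.sum()
    k = min(table.shape) - 1
    return np.sqrt(chi2 / (n * k)), p


# Illustrative use with invented counts:
low, high = wilson_ci(successes=120, total=200)   # proportion of articles reporting an item
v, p = cramers_v([[120, 80], [150, 50]])          # e.g., met/not met by regulatory status
```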
Key Findings
Corpus: 21,041 open-access human RCT articles (886,788 item-level assessments); median publication year 2014 (IQR 2003-2020). Registry-linked subset: 1,790 trials; median planned enrollment 210 (IQR 95-440).

Model performance: All LLMs outperformed prior state-of-the-art by ≥40 percentage points in macro F1. LLM 1 (GPT-4) macro F1=0.89 (95% CI 0.88-0.90), precision=0.93; LLM 2 (GPT-4o) macro F1=0.84; LLM 3 (GPT-4o-mini) macro F1=0.86 (95% CI 0.84-0.87), precision=0.97; LLM 4 (Llama-2-7B-chat) macro F1=0.74. LLM 3 was selected for deployment due to the best speed-accuracy trade-off and produced identical results run to run. In the 70-article validation set (2,210 decisions), LLM outputs matched experts 91.7% of the time. High-confidence decisions (90.8% of items) achieved macro F1=0.95 and precision=0.97, whereas medium-confidence decisions were unreliable (macro F1=0.31).

Item reporting: Frequently reported items included background/rationale (95.9%) and objectives/hypotheses (89.2%). Critical methodological elements were often missing: external validity/generalizability discussion (1.6%), allocation concealment mechanism (16.1%), and protocol access (2.2%).

Temporal trends: Mean CONSORT compliance rose from 27.3% (1966-1990) to 33.9% (1990-2000; +24.3% relative, P<.001) and to 57.0% (2010-2024; +26.7% relative, P<.001), yet overall compliance remained below 60%, indicating persistent gaps despite guideline updates and journal adoption.

Disciplinary variation: Reporting was highest in urology/nephrology (63.4%) and critical care (62.3%) and lowest in pharmacology (35.2%) and radiology (40.5%).

Trial characteristics and oversight: Phase 2 trials showed the highest compliance (66.6%); early-phase trials were lower (≈58-59%). European trials had higher compliance (67.2%) than North American trials (63.8%). FDA-regulated RCTs had lower compliance (61.1%) than non-FDA-regulated RCTs (67.4%; P<.001), as did trials with data-monitoring committees; trials reporting serious adverse events or deaths showed slight decreases. All effect sizes were negligible (Cramer V <0.10).

Adjustments: Four CONSORT items (3b, 6b, 7b, 14b) were excluded from final analyses due to misclassification of non-occurring events and low prevalence (<5%), minimizing impact on overall compliance estimates.
Discussion
A zero-shot LLM can reliably audit CONSORT adherence, achieving near state-of-the-art performance without fine-tuning and aligning with human expert judgments. Applying this approach across >21,000 RCTs reveals steady improvements in reporting since the introduction and updates of CONSORT, but critical methodological details—especially randomization procedures, allocation concealment, and protocol accessibility—remain frequently underreported. Wide variation by discipline likely reflects differing research cultures, norms, and journal enforcement policies, suggesting opportunities for targeted editorial and educational interventions. Differences associated with regulatory and safety oversight were statistically significant but practically small, implying that compliance with academic reporting standards is not guaranteed by regulatory processes alone. These findings address the study’s central question by demonstrating scalable, reliable LLM-based assessment and by identifying persistent gaps and heterogeneity that stakeholders can target to enhance transparency and reproducibility.
Conclusion
This systematic review demonstrates that a zero-shot LLM can accurately and scalably quantify CONSORT compliance in RCT publications, enabling large-scale monitoring of reporting quality over nearly six decades. Despite improvements, significant gaps persist in key methodological reporting, with substantial variation across disciplines. Embedding automated LLM auditing into journal and registry workflows could provide real-time feedback to authors and editors, accelerating progress toward transparent, reproducible clinical research. Future work will refine prompts (e.g., two-stage detection for event-dependent items) and expand beyond open-access articles to capture broader publication practices.
Limitations
The analysis is limited to open-access RCT articles and focuses on the presence of reporting items rather than their quality or accuracy. Confidence-based filtering and expert validation mitigate but do not eliminate LLM hallucination risks; improved uncertainty quantification is needed. Applying consistent criteria across decades may be influenced by evolving reporting standards. Only principal results manuscripts were assessed, potentially missing information in separate protocol/rationale articles or registries. Excluding items 3b, 6b, 7b, and 14b to avoid systematic false negatives likely inflates overall performance; future two-stage prompts are planned to re-incorporate these items.