
Medicine and Health

Large language models streamline automated machine learning for clinical studies

S. T. Arasteh, T. Han, et al.

This innovative study by Soroosh Tayebi Arasteh and colleagues explores the potential of ChatGPT Advanced Data Analysis (ADA) in bridging the gap between machine learning and clinical practice. With ADA autonomously creating ML models that match or exceed those developed by experts, this research offers exciting possibilities for enhancing clinical data analysis and democratizing ML in medicine.

Introduction
Machine learning (ML) is increasingly used in medicine for diagnosis and outcome prediction, supported by growing data availability, computational power, and research activity. Despite this momentum, developing, implementing, and validating ML models remains complex for most clinicians and medical researchers, limiting broader adoption. Automated machine learning (AutoML) platforms help non-technical users by automating algorithm selection and tuning, but they typically require structured interfaces and do not translate natural language commands into executable code. Large language models (LLMs), particularly GPT-4 with Advanced Data Analysis (ADA), offer a conversational interface that can reason, write, execute, and refine code, potentially lowering barriers to advanced analytics. However, their validity and reliability for sophisticated clinical trial data analysis have not been systematically evaluated. This study investigates whether ChatGPT ADA can autonomously develop, implement, and execute ML models on real-world clinical trial datasets without specific methodological guidance and whether its performance matches that of specialized data scientists.
Literature Review
Prior work established AutoML platforms (e.g., MATLAB Classification Learner, Google Vertex AI, Microsoft Azure) as viable tools for non-experts to train ML models by automating model selection and hyperparameter tuning. Although useful, these platforms generally do not support full natural language-to-code pipelines. Recent advances in LLMs (e.g., GPT-4) with code execution capabilities (formerly Code Interpreter, now Advanced Data Analysis) extend AutoML by allowing users to specify tasks in plain language and have the system generate and run the necessary code. While LLMs have shown promise in medical reasoning and education, their role in autonomously conducting end-to-end ML analyses on clinical trial datasets and matching expert performance had not been validated prior to this work.
Methodology
Study design: Four large real-world clinical trial datasets spanning different medical domains were analyzed: (1) metastatic disease prediction in endocrinologic oncology (pheochromocytoma/paraganglioma), (2) esophageal cancer screening (cytologic and epidemiologic features), (3) hereditary hearing loss (genetic variants), and (4) cardiac amyloidosis (EHR-derived features). The same training/test splits as in the original studies were used when available; for cardiac amyloidosis, external validation data were unavailable, so an 80/20 internal split (1712/430) was applied per the original methodology.

Workflow: For each dataset, a new ChatGPT ADA session (GPT-4 with the ADA feature) was initiated. The prompt included a brief description of the study background, objectives, and dataset availability. No specific instructions on preprocessing, model choice, or hyperparameters were provided. ChatGPT ADA was tasked to autonomously (i) select appropriate preprocessing and the optimal ML model and (ii) generate predictions for the test data, with ground-truth labels withheld during model development. All intermediate Python code produced by ADA was inspected for correctness.

Benchmarking: Performance metrics (e.g., AUROC, accuracy, F1-score, sensitivity, specificity) for ChatGPT ADA were computed in Python (NumPy, SciPy, scikit-learn, pandas) and compared with (a) the best-performing models reported in the original benchmark publications and (b) a validity re-implementation by a seasoned data scientist, who re-implemented and optimized the best original model on the training data while adhering closely to the published methods.

Statistical analysis: Comparative evaluations were performed in a paired manner on the test sets. Bootstrapping with replacement (1000 redraws) was used to estimate means, standard deviations, and 95% confidence intervals. Multiple comparisons were adjusted using the false discovery rate at a family-wise alpha of 0.05. Thresholds for F1-score, sensitivity, and specificity were determined by Youden's criterion (a sketch of this evaluation procedure follows below).

Explainability and reproducibility: ChatGPT ADA was instructed to perform SHAP analyses autonomously to identify the top 10 influential features per model. The data scientist reviewed ADA's code and independently replicated the SHAP computations (TreeExplainer) to confirm the outputs. Reproducibility was tested across separate sessions on three consecutive days using identical data and prompts; model choices and parameters were consistent, with deviations only when computational resources were limited.
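The following is a minimal illustrative sketch of the bootstrap evaluation and Youden-criterion thresholding described above; it is not the authors' published code, and the inputs (`y_true` as binary test labels, `scores` as predicted positive-class probabilities) are assumed names.

```python
# Minimal illustrative sketch of the bootstrap evaluation (assumed inputs,
# not the authors' code): y_true = binary labels, scores = predicted
# probabilities for the positive class on one test set.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(seed=0)

def bootstrap_auroc(y_true, scores, n_redraws=1000):
    """Resample the test set with replacement and summarize AUROC."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    aurocs = []
    n = len(y_true)
    for _ in range(n_redraws):
        idx = rng.integers(0, n, size=n)
        if len(np.unique(y_true[idx])) < 2:  # skip draws missing a class
            continue
        aurocs.append(roc_auc_score(y_true[idx], scores[idx]))
    aurocs = np.asarray(aurocs)
    ci_low, ci_high = np.percentile(aurocs, [2.5, 97.5])
    return aurocs.mean(), aurocs.std(), (ci_low, ci_high)

def youden_threshold(y_true, scores):
    """Decision threshold maximizing Youden's J = sensitivity + specificity - 1."""
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    return thresholds[np.argmax(tpr - fpr)]
```

Paired comparisons between two models would score both on the same bootstrap draws, and p-values across datasets would then be adjusted with a false-discovery-rate procedure (e.g., Benjamini-Hochberg via statsmodels' multipletests).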
Dataset-specific implementations (examples):
- Metastatic disease (endocrinologic oncology): ChatGPT ADA selected a Gradient Boosting Machine (GBM) with standard scaling for numeric features, one-hot encoding for categorical features, and median imputation for missing data (a minimal pipeline sketch follows at the end of this section). The validity re-implementation used AdaBoost with hyperparameters tuned via 10-fold cross-validation.
- Esophageal cancer (gastroenterology): ChatGPT ADA selected a GBM and addressed class imbalance; the validity re-implementation used LightGBM with L1/L2 regularization, per the original study.
- Hereditary hearing loss (otolaryngology): Features were binary/ordinal genetic variant counts. ChatGPT ADA selected a Random Forest and used zero-imputation, consistent with the domain assumption that missing implies absence; the validity re-implementation followed the original best-model specification (e.g., SVM or RF as reported) with stratified cross-validation and scaling where appropriate.
- Cardiac amyloidosis (cardiology): High-dimensional binary features. ChatGPT ADA selected a Random Forest without scaling; the validity re-implementation used a Random Forest with cross-validation and grid search, akin to the original study.

Ethics and data: Retrospective analysis approved by the RWTH Aachen University ethics committee (EK 028/19). Datasets were retrieved from public repositories as cited in the original studies.

Hardware: 8-core CPU, 16 GB RAM; no GPU.
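As a concrete illustration of the kind of pipeline described for the metastatic disease dataset (median imputation, standard scaling, one-hot encoding, and a GBM), the following is a minimal sketch with hypothetical column and file names; it is not the code generated by ChatGPT ADA.

```python
# Illustrative sketch only (hypothetical column/file names), not ChatGPT ADA's
# generated code: median imputation plus standard scaling for numeric features,
# one-hot encoding for categorical features, and a gradient boosting classifier.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["age", "tumor_size_cm"]       # hypothetical feature names
categorical_cols = ["sex", "tumor_location"]  # hypothetical feature names

preprocess = ColumnTransformer([
    ("numeric", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("categorical", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),  # assumption for categoricals
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", GradientBoostingClassifier(random_state=0)),
])

# Hypothetical usage with pre-defined train/test splits:
# train = pd.read_csv("train.csv"); test = pd.read_csv("test.csv")
# model.fit(train[numeric_cols + categorical_cols], train["metastatic"])
# test_scores = model.predict_proba(test[numeric_cols + categorical_cols])[:, 1]
```

An analogous pipeline with a Random Forest (and zero-imputation of the variant counts) would correspond to the hearing-loss and cardiac setups described above.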
Key Findings
- Across four clinical trial datasets, ChatGPT ADA autonomously selected and executed ML workflows whose performance matched or exceeded published benchmarks.
- Metastatic disease (endocrinologic oncology): ChatGPT ADA (GBM) outperformed the original best model (AdaBoost) in the benchmark comparison, with AUROC 0.949 vs 0.942, accuracy 0.920 vs 0.907, and F1 0.806 vs 0.755. In head-to-head testing against the validity re-implementation (AdaBoost), performance was similar: AUROC 0.949 ± 0.015 (95% CI: 0.917–0.974) vs 0.951 ± 0.014 (95% CI: 0.920–0.977), p = 0.464.
- Esophageal cancer (gastrointestinal oncology): The original best model (LightGBM) reached AUROC 0.960; ChatGPT ADA (GBM) achieved AUROC 0.981, accuracy 0.979, F1 0.959, sensitivity 0.963, and specificity 0.952. Head-to-head comparison with the validity re-implementation showed no significant difference: AUROC 0.979 ± 0.004 (95% CI: 0.970–0.986) vs 0.978 ± 0.005 (95% CI: 0.967–0.986), p = 0.496.
- Hereditary hearing loss (otolaryngology): ChatGPT ADA (Random Forest) achieved AUROC 0.771 (other metrics not reported), comparable to the original best model (Random Forest, AUROC 0.773).
- Cardiac amyloidosis (cardiology): ChatGPT ADA (Random Forest) achieved AUROC 0.964, accuracy 0.900, F1 0.907, sensitivity 0.906, and specificity 0.904, exceeding the originally reported metrics; head-to-head comparison with the validity re-implementation showed no significant difference (AUROC ~0.954 vs 0.952, p = 0.539).
- Overall, no significant differences were detected between ChatGPT ADA-crafted models and the expert re-implementations across datasets (all p ≥ 0.072). In several cases, ChatGPT ADA exceeded the originally published models' performance.
- Explainability: SHAP analyses run autonomously by ChatGPT ADA identified clinically plausible top features (e.g., sex/age and specific cytologic findings for esophageal cancer, specific pathogenic variants for hearing loss, and prior cardiomyopathy for the cardiac dataset). A sketch of this SHAP step follows below.
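For readers unfamiliar with this explainability step, the following is a minimal sketch of how such a top-10 SHAP ranking can be computed with TreeExplainer, assuming a fitted tree-based classifier `model` and a test-feature DataFrame `X_test` (both hypothetical names); it is not the authors' code.

```python
# Minimal SHAP sketch (hypothetical variable names, not the authors' code):
# rank features by mean absolute SHAP value and report the top 10.
import numpy as np
import shap

explainer = shap.TreeExplainer(model)        # model: fitted tree-based classifier
shap_values = explainer.shap_values(X_test)  # X_test: pandas DataFrame of test features

# Depending on the SHAP version and model, binary classifiers may return one
# array per class or a (samples, features, classes) array; keep the
# positive-class attributions in either case.
if isinstance(shap_values, list):
    shap_values = shap_values[1]
elif shap_values.ndim == 3:
    shap_values = shap_values[:, :, 1]

importance = np.abs(shap_values).mean(axis=0)
top10 = sorted(zip(X_test.columns, importance), key=lambda t: t[1], reverse=True)[:10]
for name, score in top10:
    print(f"{name}: {score:.4f}")
```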
Discussion
Findings demonstrate that a code-executing LLM (ChatGPT ADA) can autonomously perform end-to-end ML analyses on diverse clinical datasets, achieving performance comparable to expert-crafted models and often surpassing originally published benchmarks. The approach reduces technical barriers for clinicians by handling preprocessing, model selection, training, and evaluation from natural language prompts, potentially democratizing ML in clinical research and practice. The workflow showed transparency and trust-building features: ADA displayed intermediate Python code and supported SHAP-based explainability, revealing plausible, domain-consistent feature importance. ADA’s preprocessing choices (e.g., zero-imputation for binary genetic variant data) sometimes reflected domain knowledge and, in one case, were more appropriate than the data scientist’s default approach. Consistency tests indicated stable model selection and parameters across sessions, with variations only due to resource constraints. Together, these results suggest that LLM-driven AutoML can streamline clinical ML pipelines while maintaining validity and reliability. Nonetheless, real-world deployment must address data privacy, security, black-box concerns, and algorithmic bias, and should incorporate safeguards, audits, and external validation to ensure trustworthy clinical use.
Conclusion
ChatGPT Advanced Data Analysis can autonomously design and execute ML pipelines for clinical datasets with performance comparable to expert-crafted models, and in some cases superior to published benchmarks. By lowering technical barriers, it may help bridge the gap between ML developers and clinical practitioners, accelerating research and supporting data-driven decision-making. However, such tools should augment—not replace—specialized training and resources. Future work should evaluate robustness on less curated, noisier real-world data; assess generalizability across more domains; explore prompt sensitivity; and institute rigorous privacy, transparency, and bias mitigation practices, including external validation.
Limitations
- Data curation: Analyses relied on well-curated clinical datasets; performance on messy real-world data with extensive missingness or irregularities remains untested.
- External validation: External validation was unavailable for the cardiac amyloidosis dataset; only internal validation was performed.
- Implementation variability: Differences in preprocessing, data splitting, model configurations, and hyperparameter choices between ADA and the re-implementations limit strict comparability.
- Training/data bias: Potential exposure of ADA to literature up to 2021 may introduce bias; its foundational data may encode algorithmic biases.
- Prompt sensitivity: LLM outputs can depend on prompt phrasing; performance may vary with different prompts.
- Privacy and security: Risks include inadvertent disclosure of sensitive patient data through prompts; retention of user inputs by the service provider may raise confidentiality concerns.
- Proprietary black box: The commercial, closed-source nature can reduce transparency and trust, with potential for commercial and algorithmic biases.
- Computational constraints: Limited resources can force alternative model choices, though ADA communicated such constraints when they occurred.