Medicine and Health
Heterogeneity and predictors of the effects of AI assistance on radiologists
F. Yu, A. Moehring, et al.
This large-scale study by Feiyang Yu, Alex Moehring, Oishi Banerjee, Tobias Salz, Nikhil Agarwal and Pranav Rajpurkar uncovers unexpected insights into how AI assistance affects radiologists' performance in chest X-ray diagnosis. The research highlights the vital role of accurate AI models and reveals that traditional experience-based characteristics do not predict who benefits from assistance.
~3 min • Beginner • English
Introduction
The study addresses how AI assistance affects individual radiologists’ diagnostic performance and which factors predict this heterogeneity. Despite demonstrated average benefits of AI in medical image interpretation, clinicians vary in expertise, experience and decision-making styles, making a one-size-fits-all approach suboptimal. Prior work largely analyzed clinician groups, with limited exploration of individual-level effects and predictors, and often measured changes in predictions rather than accuracy. This study investigates heterogeneous treatment effects (change in performance from without to with AI) across 140 radiologists on 15 chest X-ray tasks, testing whether experience-based characteristics (years of experience, thoracic subspecialty, prior AI tool use), direct measures of diagnostic skill (unassisted performance), and AI error (magnitude and direction) predict the effect of assistance. Performance is examined for calibration (absolute error) and discrimination (AUROC).
Literature Review
Previous studies of clinician–AI collaboration often evaluated aggregate clinician performance and overlooked individual heterogeneity. Some explored heterogeneity by seniority, task expertise, or experience level, but with limitations: emphasis on prediction changes rather than accuracy changes, omission of potentially relevant predictors such as AI tool experience, and reliance on indirect skill proxies (e.g., years of experience) with little validation of direct diagnostic skill measures. There remains limited understanding of whether direct measures of diagnostic performance predict who benefits from AI assistance. This work builds on and addresses these gaps by systematically examining experience-based factors, direct unassisted performance, and AI error characteristics as predictors of heterogeneous effects across multiple pathologies.
Methodology
Design and participants: A large-scale diagnostic study evaluated 140 radiologists on 324 retrospective patient cases from Stanford University’s healthcare system. Radiologists received onboarding training on the assistive AI system and example AI predictions to calibrate interpretation. Two experimental designs were used: (1) Non-repeated-measure design with 107 radiologists, each reviewing 60 cases: 30 without AI and 30 with AI, with half of each set presented with clinical histories; cases were randomly assigned and treatment conditions randomized. (2) Repeated-measure design with 33 radiologists, each reviewing the same 60 cases under four conditions across four sessions (with/without AI × with/without clinical histories), with random order and a two-week washout between sessions. Analyses combined clinical-history conditions from both designs.
Tasks and AI system: Fifteen chest X-ray pathologies were evaluated: abnormal, airspace opacity, atelectasis, bacterial/lobar pneumonia, cardiomediastinal abnormality, cardiomegaly, consolidation, edema, lesion, pleural effusion, pleural other, pneumothorax, rib fracture, shoulder fracture and support device hardware. Assistive AI predictions were generated by the CheXpert model (DenseNet121-based), presented to radiologists as probabilities on a 0–100 scale for all 15 pathologies simultaneously. Ground truth probabilities were the average of five board-certified thoracic radiologists’ probability estimates (0–100). Binary labels for certain analyses were obtained by thresholding the ground truth probability at 50.
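To make the label construction concrete, here is a minimal sketch in Python; the array values and variable names are illustrative, not taken from the paper's code:

```python
import numpy as np

# Ground truth: five board-certified thoracic radiologists each give a
# probability (0-100) per case and pathology; the label is their average.
annotator_probs = np.array([70, 85, 60, 75, 90])  # hypothetical estimates
ground_truth_prob = annotator_probs.mean()        # 76.0

# Binary label for certain analyses: threshold the averaged probability at 50.
binary_label = int(ground_truth_prob > 50)        # 1 -> pathology present
```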
Performance metrics: Calibration performance was measured by absolute error (|radiologist-predicted probability − ground truth probability| on a 0–100 scale). Treatment effect was defined as improvement in absolute error = (unassisted error) − (assisted error). Discrimination performance was measured using AUROC (with ground truth probabilities thresholded at 50), and treatment effect on AUROC = (assisted AUROC) − (unassisted AUROC).
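A minimal sketch of the two metrics follows, assuming for simplicity that the same cases are read with and without AI (as in the repeated-measure arm); function names are illustrative:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def treatment_effect_abs_error(unassisted_probs, assisted_probs, truth_probs):
    """Calibration: improvement in absolute error on the 0-100 scale.
    Positive values mean AI assistance reduced the radiologist's error."""
    truth = np.asarray(truth_probs, dtype=float)
    unassisted_err = np.abs(np.asarray(unassisted_probs, dtype=float) - truth)
    assisted_err = np.abs(np.asarray(assisted_probs, dtype=float) - truth)
    return unassisted_err.mean() - assisted_err.mean()

def treatment_effect_auroc(unassisted_probs, assisted_probs, truth_probs):
    """Discrimination: (assisted AUROC) - (unassisted AUROC), with ground-truth
    probabilities binarized at 50; assumes both classes occur among the cases."""
    labels = (np.asarray(truth_probs, dtype=float) > 50).astype(int)
    return (roc_auc_score(labels, assisted_probs)
            - roc_auc_score(labels, unassisted_probs))
```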
Statistical approach: To reduce overestimation of heterogeneity due to measurement error, empirical Bayes shrinkage was applied to individual treatment effects and performance metrics. Experience-based predictors (years of experience dichotomized at the median of 6 years; thoracic subspecialty yes/no; prior experience with AI tools yes/no) were collected for 136 radiologists via a post-experiment survey. A linear regression combining these characteristics estimated each radiologist's mean treatment effect and was used to split radiologists into subgroups for comparison (with balanced training/test splits); individual-predictor analyses split radiologists at the median of each predictor.
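The shrinkage step can be illustrated with the standard normal–normal empirical Bayes estimator; this is a generic sketch of the technique, not necessarily the authors' exact implementation:

```python
import numpy as np

def empirical_bayes_shrink(estimates, standard_errors):
    """Shrink noisy per-radiologist estimates toward the grand mean.

    Normal-normal model: observed_i ~ N(theta_i, se_i^2), theta_i ~ N(mu, tau^2).
    The posterior mean pulls each estimate toward mu in proportion to its noise.
    """
    est = np.asarray(estimates, dtype=float)
    se2 = np.asarray(standard_errors, dtype=float) ** 2
    mu = est.mean()
    # Method-of-moments between-radiologist variance: observed variance
    # minus the average sampling variance, floored at zero.
    tau2 = max(est.var(ddof=1) - se2.mean(), 0.0)
    weight = tau2 / (tau2 + se2)  # near 1 for precise estimates, near 0 for noisy ones
    return mu + weight * (est - mu)
```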
Unassisted performance predictor: Linear regressions assessed whether unassisted error predicts treatment effect, and whether unassisted error predicts assisted error, with split sampling to avoid reversion-to-the-mean bias. Separate patient case sets were used to compute unassisted performance (independent variable) and treatment effect or assisted performance (dependent variable). Attenuation bias from measurement error was adjusted analytically. A naive (non-split) analysis was also shown to illustrate spurious correlations.
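The need for split sampling can be demonstrated with a small, entirely synthetic simulation (not the study's data): even when the true treatment effect is zero and unrelated to skill, regressing the effect on unassisted error measured from the same cases produces a spurious positive slope, while disjoint case halves do not.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rad, n_cases = 140, 60

# Simulated radiologists: skill varies, but by construction AI assistance
# has no effect, so any estimated slope below is pure artifact.
true_error = rng.normal(15.0, 3.0, size=n_rad)            # per-radiologist skill
unassisted = true_error[:, None] + rng.normal(0, 10, size=(n_rad, n_cases))
assisted = true_error[:, None] + rng.normal(0, 10, size=(n_rad, n_cases))

def slope(x, y):
    return np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

# Naive analysis: the same cases measure both predictor and outcome, so the
# shared noise creates a spurious positive slope (reversion to the mean).
x = unassisted.mean(axis=1)
print("naive slope:", round(slope(x, x - assisted.mean(axis=1)), 3))

# Split-sample analysis: disjoint case halves remove the shared noise.
a, b = np.arange(30), np.arange(30, 60)
x_split = unassisted[:, a].mean(axis=1)
y_split = unassisted[:, b].mean(axis=1) - assisted[:, b].mean(axis=1)
print("split slope:", round(slope(x_split, y_split), 3))  # approximately zero
```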
AI error predictors: AI predictions were binned by absolute error into five ranges to test heterogeneity of treatment effect across AI error bins for all pathologies aggregated and per-pathology, using joint equality tests and multiple-testing correction. Signed error (AI-predicted probability minus ground truth) was binned into ten ranges to assess whether underestimation vs overestimation differentially affects treatment effect. AUROC-based analogs were performed where definable.
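A sketch of the binning step, with illustrative names and bin edges chosen to match the five absolute-error and ten signed-error ranges described above:

```python
import numpy as np

def mean_effect_by_bin(errors, treatment_effects, edges):
    """Mean treatment effect within consecutive error bins.

    For the five absolute-error ranges pass |AI prob - truth| with
    edges = [0, 20, 40, 60, 80, 100]; for the ten signed ranges pass
    (AI prob - truth) with edges = np.linspace(-100, 100, 11).
    """
    errors = np.asarray(errors, dtype=float)
    effects = np.asarray(treatment_effects, dtype=float)
    idx = np.digitize(errors, np.asarray(edges)[1:-1])  # bin index per prediction
    return np.array([effects[idx == k].mean() if np.any(idx == k) else np.nan
                     for k in range(len(edges) - 1)])
```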
Inference: Linear models were estimated with cluster-robust, two-way clustered standard errors (radiologist and patient case). Wald tests and F-tests were used as appropriate; Benjamini–Hochberg correction controlled FDR across 15 pathologies. Sensitivity/specificity treatment effects were also examined. Alternative definitions using binary ground truth labels were analyzed to verify robustness.
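The multiple-testing step is standard and can be reproduced with statsmodels; the p-values below are hypothetical placeholders, and the clustered-SE estimation itself is omitted:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Hypothetical Wald-test p-values for the 15 pathologies (illustrative only).
pvals = np.array([0.001, 0.03, 0.20, 0.004, 0.45, 0.01, 0.08, 0.002,
                  0.60, 0.015, 0.33, 0.05, 0.007, 0.12, 0.90])

# Benjamini-Hochberg adjustment controls the FDR across the 15 tests.
reject, pvals_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(np.round(pvals_adj, 3), reject)
```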
Data and code: The 324 cases are available under a Stanford data use agreement (link provided in paper). The clinician–AI collaboration dataset is accessible upon request via OSF, and analysis code is available via Zenodo. Ethics: The study was deemed exempt by the MIT Committee on the Use of Humans as Experimental Subjects (E-2953).
Key Findings
- Substantial heterogeneity of AI assistance effects: Across all pathologies (absolute-error metric), individual treatment effects ranged from −1.295 to 1.440 (IQR 0.797). Among high-prevalence labels, the range was widest for abnormal: −8.914 to 5.563 (IQR 3.245). Sensitivity and specificity also changed heterogeneously: average treatment effects on sensitivity ranged from 1.9% to 11.8% (IQR 1.9%) and on specificity from −4.0% to 3.1% (IQR 1.6%).
- Oracle subgrouping confirms heterogeneity: Splitting 136 radiologists by an oracle predictor at the median treatment effect produced a significant difference between subgroups on all pathologies aggregated (difference ≈ −0.828; P<0.001) and for each high-prevalence label (adjusted P<0.001).
- Experience-based characteristics do not predict effects: A combined model using years of experience (≤6 vs >6), thoracic subspecialty (yes/no), and prior AI tool experience (yes/no) did not significantly predict treatment effect on all pathologies or individual pathologies (P>0.05 after correction). Individually, none of these predictors showed significant differences (except isolated uncorrected findings), for both absolute error and AUROC metrics.
- Unassisted performance is a poor predictor of treatment effect: The regression coefficient of unassisted error predicting treatment effect on all pathologies aggregated was not significant (P>0.05). Some per-pathology associations were nominally significant without multiple-testing correction, but none remained after correction. For AUROC, unassisted AUROC was also a poor predictor of AUROC treatment effect (P>0.05). Without split sampling, strong spurious ("hallucinated") correlations appeared (e.g., coefficient ≈ 0.357, P<0.001), illustrating reversion-to-the-mean bias.
- Higher-performing radiologists remain higher-performing with AI: Unassisted error strongly predicted assisted error for all pathologies aggregated (P<0.001) and most individual pathologies (adjusted P<0.05), indicating relative performance ranking is largely preserved under AI assistance. Similarly, unassisted AUROC predicted assisted AUROC for all pathologies aggregated (P<0.001).
- AI error magnitude predicts treatment effect: Binning by AI absolute error revealed significant heterogeneity in treatment effect for all pathologies aggregated (P<0.001). When AI absolute error <20, mean treatment effect was +0.679 (95% CI: 0.492 to 0.865; n=176,130). When AI absolute error >80, mean treatment effect was −16.845 (95% CI: −24.288 to −9.403; n=371). Many individual pathologies also showed significant differences across AI error bins; more accurate AI generally led to better effects for abnormal, airspace opacity, cardiomediastinal abnormality, cardiomegaly, pleural effusion, pleural other, pneumothorax and support device hardware, whereas reverse trends appeared for bacterial/lobar pneumonia and consolidation.
- Direction of AI error matters: Given the same absolute error, AI underestimation (negative signed error) led to better treatment effects than overestimation on all pathologies aggregated (P<0.001). Several individual pathologies showed similar patterns (airspace opacity, atelectasis, cardiomegaly, consolidation, lesion) after multiple-testing correction.
- Robustness to alternative metrics/labels: Results were generally consistent when using binary ground truth labels for absolute and signed errors, corroborating findings on predictor relationships and AI error effects.
Discussion
The study demonstrates strong heterogeneity in how AI assistance affects individual radiologists’ calibration and discrimination performance. Conventional experience-based factors (years of experience, thoracic subspecialty, prior AI tool use) do not reliably identify who benefits or is harmed, challenging assumptions from prior smaller or narrower studies. Direct measures of diagnostic skill (unassisted performance) also fail to predict treatment effect, contradicting the notion that lower performers benefit more. Proper statistical design is critical: split sampling revealed that naive analyses can produce spurious associations due to reversion to the mean.
The findings suggest that, absent reliable predictors, radiologists’ responses to AI should be empirically assessed in realistic simulations prior to targeted deployment. Importantly, AI error characteristics strongly influence outcomes: lower AI error produces more favorable treatment effects on aggregate, while high AI error can be detrimental, implying that clinicians have difficulty consistently distinguishing accurate from inaccurate AI advice. Moreover, underestimation by AI can be less harmful than overestimation for a given error magnitude. These insights emphasize the need for high-quality AI systems, better tools to help clinicians identify and interpret potentially inaccurate AI outputs, and personalized clinician–AI collaboration strategies tailored to tasks and settings.
Conclusion
This work provides large-scale evidence of substantial individual heterogeneity in the effects of AI assistance on radiologists and shows that common experience-based characteristics and direct measures of unassisted performance do not predict who benefits. Instead, AI error magnitude and direction are key determinants of treatment effects, with accurate AI predictions producing better outcomes and overestimation tending to be more harmful than underestimation at equal error.
For practice, these results argue for individualized, data-driven deployment strategies, development of more accurate AI models, and assistive interfaces that help clinicians assess AI reliability. Future research should explore cognitive and behavioral predictors of AI benefit, evaluate richer assistive outputs (e.g., explanations, localizations, or report drafts), analyze temporal learning effects in clinician–AI interaction, and design experiments that decouple patient case confounding (e.g., via artificial predictions) and examine task granularity across hierarchical pathology labels.
Limitations
- Temporal dynamics not assessed: The randomization needed to eliminate confounding precluded analysis of whether radiologists' incorporation of AI advice improves with repeated exposure.
- Limited assistive information: AI assistance provided only probabilities without explanatory aids (e.g., localization or rationales), which may limit effective use and generalizability to richer AI tools.
- Task granularity and interdependence: The 15 pathologies include hierarchical/overlapping concepts (e.g., airspace opacity vs atelectasis/edema/consolidation). Simultaneous presentation of all AI predictions may cause cross-influence among tasks, complicating attribution of effects.
- Patient-case confounding in AI error analyses: Using actual AI predictions on real cases makes it hard to fully disentangle effects of case difficulty from AI accuracy; controlled studies with artificial predictions could isolate mechanisms.
- AUROC bin analyses were limited for some pathologies due to insufficient cases per bin, restricting discrimination-focused inferences.