Medicine and Health
Quantifying the impact of AI recommendations with explanations on prescription decision making
M. Nagendran, P. Festor, et al.
This study by Myura Nagendran, Paul Festor, Matthieu Komorowski, Anthony C. Gordon, and Aldo A. Faisal examines how AI recommendations affect physicians' prescription decisions in the ICU. With 86 participants, the research shows that AI recommendations significantly sway dosing decisions, yet simple explanations do not add to this influence, challenging assumptions about the value of explainability in clinical decision support.
~3 min • Beginner • English
Introduction
AI-driven clinical decision support systems could improve medical care but face a translation gap in real-world clinical environments, including data-rich intensive care units. Clinicians, researchers, and regulators increasingly demand explainable AI that not only recommends actions but justifies them. However, most practical XAI evaluations focus on general tasks or diagnostic problems with clear gold standards, limiting applicability to complex, non-diagnostic decisions such as hemodynamic management in sepsis. Using the AI Clinician (an AI system for sepsis resuscitation), the study investigates how clinicians’ prescriptions are influenced by additional information and whether influence depends on information source (peers vs AI) and the presence of simple explanations (feature importance). The goal is to quantify influence on dosing decisions, assess interactions with clinician attitudes and experience, and evaluate whether self-reported usefulness of explanations aligns with actual behavioral adherence.
Literature Review
Prior work highlights a gap between AI system performance and clinical adoption in ICU settings. Practical evaluations of XAI often emphasize non-clinical or diagnostic tasks with gold standards, which may not generalize to complex therapeutic decisions. Limited clinical XAI studies suggest explanations may not reliably mitigate automation bias or improve decision quality, with evidence that clinicians sometimes fail to reject poor AI advice. In related experiments, explanations increased perceived AI influence regardless of explanation quality, suggesting reassurance rather than critical appraisal may drive adherence. These insights motivate testing whether simple feature-importance explanations add value beyond AI recommendations in complex treatment decisions like sepsis resuscitation.
Methodology
Design: Human-AI interaction vignette study using a modified between-subjects, multi-factorial design with four arms: (i) Baseline (no extra information), (ii) Peer (distribution of peer clinician doses), (iii) AI (AI Clinician suggested doses), (iv) XAI (AI suggested doses plus simple feature-importance explanation). Each ICU doctor experienced all four arms on different subsets of patient scenarios.
Participants: N=86 ICU doctors (31 senior (consultant/attending), 42 intermediate (registrar/fellow), 13 junior). Median age 37 years (IQR 34–43); median clinical experience 11 years (IQR 9–19). Inclusion: practicing physicians with at least 4 months in adult ICU and current or recent ICU work. Convenience sampling was used; participation was remote or in person; informed consent was obtained; ethics approval was granted by Imperial College London RGIT (ICREC 21127) and the Israel Biomedical Research Center (I-0219-07) for use of anonymized MIMIC-III data.
Procedure: Each participant completed 16 trials on a computer interface (HTML/JavaScript, built with jsPsych). The first four trials were baseline; the subsequent 12 comprised the main experiment, with scenarios randomly assigned to arms. For each trial, participants reviewed a fixed-format patient data table (demographics, diagnosis, treatment snapshots) designed to resemble UK electronic health records and prescribed continuous doses for two interventions to be applied over the next hour: intravenous fluids and a vasopressor (noradrenaline-equivalent).
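As a rough illustration of this trial structure, here is a minimal Python sketch of how one participant's 16-trial sequence could be assembled; the even three-scenarios-per-arm split, the function name, and all variables are assumptions for illustration, not the authors' published code.

```python
import random

ARMS = ["baseline", "peer", "ai", "xai"]

def build_trial_sequence(scenario_ids, seed=0):
    """Assemble one participant's 16-trial sequence (hypothetical scheme).

    Assumption: 4 initial baseline trials, then 12 main-experiment scenarios
    split evenly across the four arms (3 each); the study's actual
    randomization procedure is not reproduced here.
    """
    rng = random.Random(seed)
    scenarios = list(scenario_ids)
    rng.shuffle(scenarios)

    baseline_block = [(s, "baseline") for s in scenarios[:4]]   # first 4 trials

    arm_labels = ARMS * 3                                       # 12 labels, 3 per arm
    rng.shuffle(arm_labels)
    main_block = list(zip(scenarios[4:16], arm_labels))         # 12 main trials

    return baseline_block + main_block

trials = build_trial_sequence(range(24), seed=42)   # 24 vignettes in the study
```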
Arms and stimuli:
- Baseline: patient data only.
- Peer: violin plots (boxplot plus kernel-density estimate) showing the distribution of IV fluid and vasopressor doses prescribed by other doctors in the MIMIC-III dataset for patients in similar states (a proxy for peer practice); a plotting sketch follows after this list.
- AI: text display of AI Clinician suggested doses for fluid and vasopressor.
- XAI: AI suggested doses plus a simple explanation via feature importance highlighting influential variables.
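To make the Peer-arm stimulus concrete, below is a minimal matplotlib sketch of a violin (kernel-density) outline with an overlaid boxplot, using synthetic dose values rather than MIMIC-III data; the dose units follow those described in the study, but all numbers are placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic peer doses for one patient state (placeholders, not MIMIC-III data).
rng = np.random.default_rng(0)
peer_fluid = rng.lognormal(mean=5.0, sigma=0.6, size=200)   # IV fluid, mL/h
peer_vaso = rng.lognormal(mean=-2.0, sigma=0.8, size=200)   # noradrenaline-eq, mcg/kg/min

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
for ax, doses, label in zip(axes, [peer_fluid, peer_vaso],
                            ["IV fluid (mL/h)", "Vasopressor (mcg/kg/min)"]):
    ax.violinplot(doses, showmedians=True)   # kernel-density outline
    ax.boxplot(doses, widths=0.1)            # overlaid boxplot
    ax.set_ylabel(label)
    ax.set_xticks([])
fig.suptitle("Peer-arm stimulus: doses from similar MIMIC-III states (synthetic)")
plt.tight_layout()
plt.show()
```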
Patient scenarios: 24 vignettes in total. Twelve were expert-selected to cover four categories: (i) both the AI fluid and vasopressor suggestions similar to human clinicians in MIMIC-III; (ii) only the vasopressor suggestion similar; (iii) only the fluid suggestion similar; (iv) neither similar. These spanned a wide range of vasopressor support (including >0.5 mcg/kg/min). The other 12 were selected by clustering the entire MIMIC-III cohort (n=17,083) into 12 clusters and choosing a representative patient near each cluster centroid, to improve coverage of the dataset's state space.
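The exact clustering pipeline is not detailed in this summary; a conventional way to obtain one representative patient per cluster is k-means on standardized state features, taking the member closest to each centroid. The sketch below assumes that approach and uses a synthetic feature matrix in place of the real cohort.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the MIMIC-III cohort (n=17,083 in the study);
# columns would be physiological/demographic state variables.
X = np.random.default_rng(0).normal(size=(17083, 20))

# Cluster the state space into 12 groups and take the patient closest to
# each centroid as that cluster's representative vignette.
Xs = StandardScaler().fit_transform(X)
km = KMeans(n_clusters=12, n_init=10, random_state=0).fit(Xs)

representatives = []
for k in range(km.n_clusters):
    members = np.flatnonzero(km.labels_ == k)
    dists = np.linalg.norm(Xs[members] - km.cluster_centers_[k], axis=1)
    representatives.append(members[np.argmin(dists)])   # index of representative patient
```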
Data collection: Pre-experiment demographics, clinical experience, and AI attitude (four-item questionnaire) were collected; post-experiment, participants rated likelihood of using AI, perceived usefulness of AI and explanations, and usefulness of presenting peer and AI suggestions together. For each trial, recorded outputs included prescribed fluid and vasopressor doses and time to complete the scenario. Data were standardized to UK clinical units.
Outcomes and analyses: The primary behavioral outcome was adherence/influence, measured as the absolute difference between the participant-selected dose and the suggested dose (or the change from baseline); inter- and intra-clinician dose variability was also analyzed. Associations between adherence and composite AI attitude (first principal component of the four questionnaire items), and between adherence and years of experience, were tested with linear least-squares regression and Gaussian-process fits. Variability effects were examined relative to whether recommendations were higher or lower than baseline prescriptions.
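A schematic of these analyses with synthetic placeholder data (hypothetical variable names; the real per-trial data are not reproduced here) could look like the following: the adherence metric as an absolute dose deviation, the attitude composite as the first principal component of the four items, and a least-squares association test.

```python
import numpy as np
from scipy import stats
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Per-trial doses in a recommendation arm (synthetic placeholders).
prescribed = rng.uniform(0, 500, size=300)   # participant-selected dose
suggested = rng.uniform(0, 500, size=300)    # AI-suggested dose
adherence = np.abs(prescribed - suggested)   # smaller = closer adherence

# Composite AI attitude: first principal component of the four questionnaire items.
attitude_items = rng.integers(1, 6, size=(86, 4)).astype(float)
pc1 = PCA(n_components=1).fit_transform(attitude_items)[:, 0]

# Linear least-squares association between attitude and per-doctor mean adherence
# (placeholder values; the study reported non-significant coefficients).
per_doctor_adherence = rng.uniform(0, 200, size=86)
slope, intercept, r, p, se = stats.linregress(pc1, per_doctor_adherence)
```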
Key Findings
- Influence of additional information: Providing recommendations (peer, AI, or XAI) influenced prescriptions. AI information significantly shifted dosing relative to baseline; peer information had a weaker/non-significant effect. Simple XAI (feature importance) did not increase influence beyond AI alone.
- Practice variation: Inter-clinician dose variability shifted with the direction of the recommendation relative to baseline: when the recommendation was higher than the baseline prescription, across-doctor variability increased; when it was lower, variability decreased (see the sketch after this list).
- Attitudes and experience: Composite AI attitude (first principal component, explaining 69% of variance) was not significantly associated with adherence to AI suggestions: fluid-dose difference coefficient = -0.208 (p = 0.075), vasopressor = -0.074 (p = 0.092). Years of clinical experience showed no meaningful association with adherence (e.g., fluids r = -0.086, p = 0.047 by linear least-squares regression, interpreted by the authors as not significant).
- Self-reports vs behavior: Post-experiment, doctors in training versus those not in training reported a mean likelihood of using AI of 2.55 (SD 0.96) vs 2.16 (SD 1.07), p = 0.091, and mean perceived usefulness of 2.42 (SD 1.03) vs 1.97 (SD 1.11), p = 0.296. Self-reported usefulness of explanations did not correlate with actual adherence to XAI suggestions.
- Overall: The marginal impact of simple feature-importance XAI on prescription decisions was low; AI suggestions alone primarily drove behavioral change.
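For the practice-variation finding above, here is a minimal sketch (synthetic arrays, hypothetical names) of how across-doctor variability could be compared by recommendation direction.

```python
import numpy as np

rng = np.random.default_rng(0)

# One entry per (doctor, scenario) trial in a recommendation arm (synthetic placeholders).
baseline_dose = rng.uniform(0, 400, size=500)      # doctor's own baseline prescription
recommended_dose = rng.uniform(0, 400, size=500)   # peer/AI/XAI suggestion
final_dose = rng.uniform(0, 400, size=500)         # dose chosen with the suggestion shown
scenario_id = rng.integers(0, 24, size=500)        # which vignette the trial used

# Split trials by whether the recommendation exceeded the baseline prescription,
# then compare across-doctor variability (SD of final doses within each scenario).
higher = recommended_dose > baseline_dose

def across_doctor_sd(mask):
    sds = [final_dose[mask & (scenario_id == s)].std()
           for s in np.unique(scenario_id[mask])
           if np.sum(mask & (scenario_id == s)) > 1]
    return float(np.mean(sds))

print("SD when recommendation > baseline:", across_doctor_sd(higher))
print("SD when recommendation < baseline:", across_doctor_sd(~higher))
```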
Discussion
The study addressed whether supplemental information, particularly AI recommendations and explanations, influences clinicians’ dosing decisions in complex, non-diagnostic tasks. Results show AI recommendations substantially affect prescriptions, while peer information exerts less influence. Crucially, adding a simple feature-importance explanation did not yield additional behavioral adherence beyond the AI suggestion, indicating limited marginal utility of basic XAI in this context.
The findings suggest that clinician attitudes toward AI and years of experience do not reliably predict adherence to AI recommendations, and that self-reported usefulness of explanations does not track actual behavior. This challenges the practice of relying on subjective XAI evaluations as proxies for effectiveness in clinical decision support.
These outcomes align with concerns about automation bias and the nuanced role of explanations observed in prior work: explanations may reassure users without improving critical appraisal of advice quality. In high-stakes clinical environments, this underscores the need for rigorous behavioral metrics rather than self-report alone to evaluate XAI. The observed modulation of inter-clinician variability by the direction of recommendations indicates that AI can systematically shape practice variation, which has implications for standardizing care while also potentially amplifying variation depending on context.
Overall, the study supports cautious integration of AI-CDSS, emphasizes the limited added value of simple explanations for influencing prescribing behavior, and highlights the importance of designing evaluations that capture real behavioral impact and resilience to poor advice.
Conclusion
This study quantifies the influence of AI and simple explanations on ICU clinicians’ prescription decisions for fluids and vasopressors in sepsis scenarios. AI recommendations significantly altered prescribing behavior, whereas peer information was less impactful. Simple feature-importance XAI did not increase adherence beyond AI alone. Clinician attitude and experience did not meaningfully predict adherence, and self-reported usefulness of explanations did not correlate with behavioral influence.
These findings suggest prioritizing the design of AI recommendations themselves and caution against relying on basic XAI or self-reported measures to assess effectiveness. Future work should explore richer, clinically meaningful explanation modalities (e.g., confidence intervals, uncertainty estimates, graphical/range-based recommendations), user-centered presentation formats, and methods to mitigate automation bias. Studies leveraging higher-fidelity, longitudinal clinical contexts and qualitative methodologies (e.g., think-aloud, interviews) could illuminate the cognitive mechanisms underpinning AI influence and inform safer deployment.
Limitations
- Explanation modality: XAI used simple feature importance, a lower-information explanation that may limit impact.
- Clinical context and user experience: Effects likely depend on specific clinical tasks and users’ familiarity with AI systems.
- Scenario coverage: Despite a two-pronged selection strategy, gaps in the patient state space remain relative to the full MIMIC-III cohort.
- Low-fidelity vignettes: Static scenarios lack dynamic patient evolution and feedback about treatment effects.
- Sampling: Convenience sample of ICU doctors may not represent the broader clinician population.
- Presentation of AI output: Recommendations were isolated point estimates without confidence bounds, ranges, or uncertainty visualization, which may affect adherence.
- Cognitive process unobserved: Internal decision-making processes were not directly measured; reliance on self-reports and final prescriptions limits inference about mechanisms.
- Potential ceiling effects and heuristics: Repetitive, burdensome decisions may have led to heuristic adoption, potentially masking any added value of explanations.