Mitigating the impact of biased artificial intelligence in emergency decision-making


H. Adam, A. Balagopalan, et al.

In a groundbreaking study, Hammaad Adam, Aparna Balagopalan, Emily Alsentzer, Fotini Christia, and Marzyeh Ghassemi reveal how biased AI recommendations can sway emergency decision-making in mental health crises. Their findings highlight the dangers of deploying flawed algorithms in medicine, and the authors show how reframing AI advice can mitigate these biases. Discover how the framing of AI advice can support unbiased decisions in high-stakes scenarios!

Introduction
The study investigates how biased AI recommendations affect human decision-making in high-stakes health-related scenarios and whether changing the framing of AI guidance can mitigate harmful effects. As AI and machine learning tools proliferate in healthcare, prior work shows that model errors can degrade clinician decisions and that experts may be as susceptible as non-experts to incorrect recommendations. Moreover, large language models and clinical NLP systems can encode biases against racial and religious minorities. However, less is known about the practical impact of deploying such biased models in realistic settings and how to design AI advice to preserve fair human judgment. The authors evaluate whether biased AI can induce disparities in emergency response decisions and whether presenting AI output prescriptively versus descriptively changes adherence and bias.
Literature Review
The paper situates its contribution within prior work showing that: (1) AI/ML tools in healthcare can adversely affect decisions when models err, and explainability alone may not help users detect mistakes; (2) language models such as BERT and GPT-3, as well as clinical embeddings, can encode biases against racial and religious minorities, for example associating Muslims with violence; (3) biased models have been documented in criminal justice and population health management; and (4) how a recommendation is presented and framed influences human judgment, with risk scores and prescriptive outputs both used across domains (e.g., pretrial risk assessment, autonomous diagnostics). The authors also highlight how easily bias can be induced via fine-tuning, and the need for auditing, human-factors evaluation, and careful design of human-AI interaction, as underscored by regulatory guidance and HCI frameworks.
Methodology
Design: A randomized web-based experiment (Qualtrics, May–December 2021) with 954 consented participants: 438 clinicians and 516 non-experts.
Recruitment: Clinicians were recruited via emails to staff and residents at hospitals in the US and Canada; non-experts via social media and university mailing lists.
Ethics: The study met criteria for IRB exemption (MIT COUHES; 45 CFR 46).
Exclusions: Participants flagged as bots, those completing the survey in under 5 minutes, and duplicates were excluded (15 duplicate clinician and 2347 duplicate non-expert responses removed).
Procedure: Each respondent viewed 8 vignettes summarizing calls to a fictitious crisis hotline involving a male experiencing a mental health emergency. Vignettes included scenario details and indicated race (Caucasian or African-American) and religion (Muslim or not mentioned); identities were randomly assigned per vignette and participant. Respondents chose whether to send medical help or to contact the police, with instructions that police should be called only if there was a risk of violence.
Experimental groups (between-subjects): (1) Baseline, no AI: 102 clinicians, 108 non-experts; (2) Prescriptive unbiased AI: 87 clinicians, 114 non-experts; (3) Prescriptive biased AI: 90 clinicians, 103 non-experts; (4) Descriptive unbiased AI: 80 clinicians, 94 non-experts; (5) Descriptive biased AI: 79 clinicians, 97 non-experts.
AI recommendations: Generated by GPT-2 models fine-tuned on custom datasets to induce bias or neutrality. The biased model was more likely to suggest police action for African-American or Muslim subjects; the unbiased model suggested police equally across groups.
Presentation styles: Prescriptive ("you should call for police/medical help") versus descriptive (a flag indicating risk of violence versus no flag, where a flag corresponds to a police recommendation). Model bias and presentation style were fixed within each participant across their 8 vignettes.
Statistical analysis: Analyses were run separately for clinicians and non-experts within each experimental group. Mixed-effects logistic regression models estimated the impact of race and religion on the decision to call police, with random intercepts for respondent and vignette. Additional mixed-effects logistic models estimated adherence: the effect of an AI recommendation to call police on the likelihood of choosing police. Odds ratios with 95% CIs are reported; statistical significance was assessed with two-sided tests based on z-statistics. Robustness checks included controls for respondent characteristics (race, gender, attitudes toward policing); power analysis and covariate balance are reported in the Supplementary Materials.
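To make the statistical approach concrete, here is a minimal sketch of the main model specification, assuming a long-format dataset with one row per respondent-vignette pair and hypothetical column names (call_police, race, religion, respondent_id, vignette_id). It is an illustration of the described analysis, not the authors' code; it uses the variance-component (Bayesian mixed GLM) formulation in statsmodels to fit a logistic model with random intercepts for respondent and vignette, then converts the fixed-effect estimates into odds ratios with approximate 95% intervals.

```python
# Illustrative sketch (not the authors' code): mixed-effects logistic
# regression of the decision to call police on subject race and religion,
# with random intercepts for respondent and vignette.
import numpy as np
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

# Hypothetical long-format data: one row per respondent x vignette.
# Assumed columns:
#   call_police    1 if the respondent chose to call police, else 0
#   race           "African-American" or "Caucasian"
#   religion       "Muslim" or "not mentioned"
#   respondent_id, vignette_id   grouping identifiers
df = pd.read_csv("responses_prescriptive_biased.csv")  # hypothetical file

model = BinomialBayesMixedGLM.from_formula(
    "call_police ~ C(race, Treatment('Caucasian'))"
    " + C(religion, Treatment('not mentioned'))",
    vc_formulas={"respondent": "0 + C(respondent_id)",
                 "vignette": "0 + C(vignette_id)"},
    data=df,
)
fit = model.fit_vb()  # variational Bayes fit
print(fit.summary())  # fixed-effect and variance-component estimates

# Convert fixed-effect posterior means (log-odds) into odds ratios with
# approximate 95% intervals, mirroring how results are reported below.
ors = np.exp(fit.fe_mean)
lo = np.exp(fit.fe_mean - 1.96 * fit.fe_sd)
hi = np.exp(fit.fe_mean + 1.96 * fit.fe_sd)
for name, o, l, h in zip(fit.model.fep_names, ors, lo, hi):
    print(f"{name}: OR={o:.2f} (95% CI {l:.2f}-{h:.2f})")
```

The adherence analysis follows the same pattern, with an indicator for "AI recommended police" replacing the race and religion terms.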
Key Findings
- Baseline decisions: No inherent bias was detected. Clinicians (baseline): African-American vs Caucasian OR 0.84 (95% CI 0.6–1.17); Muslim vs not mentioned OR 0.85 (0.6–1.2). Non-experts: African-American OR 1.1 (0.81–1.5); Muslim OR 0.73 (0.53–1.01).
- Biased prescriptive AI induces disparities. Under prescriptive biased recommendations, respondents were significantly more likely to call police for African-American and Muslim men. Clinicians: African-American OR 1.54 (1.06–2.25), Muslim OR 1.49 (1.01–2.21). Non-experts: African-American OR 1.55 (1.13–2.11), Muslim OR 1.72 (1.24–2.38). Effects remained after controlling for respondent demographics and attitudes (a worked example below shows how such ORs and CIs are derived from model coefficients).
- Biased descriptive AI does not induce disparities. With descriptive biased flags, decisions remained unbiased (no significant race or religion effects).
- Algorithmic adherence depends on presentation style. Odds of calling police when the AI recommended police. Prescriptive style: clinicians, unbiased OR 2.74 (1.76–4.25), biased OR 2.82 (1.81–4.4); non-experts, unbiased OR 2.87 (1.95–4.21), biased OR 3.82 (2.66–5.48). Descriptive style: clinicians, unbiased OR 1.57 (1.04–2.38), biased OR 0.87 (0.57–1.33); non-experts, unbiased OR 1.18 (0.8–1.73), biased OR 0.99 (0.69–1.43).
- Expertise did not protect against biased prescriptive influence: clinicians and non-experts were similarly susceptible.
- Fine-tuning can easily induce bias: GPT-2 fine-tuned on roughly 2000 short examples produced consistently biased recommendations.
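As referenced in the list above, the small helper below illustrates how an odds ratio and 95% confidence interval are obtained from a logistic-regression coefficient (log-odds) and its standard error. The input numbers are hypothetical, chosen only so the output lands near the clinician estimate of 1.54 (1.06–2.25) quoted above; they are not the study's actual coefficients.

```python
import math

def odds_ratio_ci(beta: float, se: float, z: float = 1.96) -> tuple[float, float, float]:
    """Convert a log-odds coefficient and its standard error into an
    odds ratio with a 95% confidence interval (normal approximation)."""
    return (math.exp(beta),
            math.exp(beta - z * se),
            math.exp(beta + z * se))

# Hypothetical values for illustration only (not the study's estimates):
# coefficient 0.43 on the African-American indicator, standard error 0.19.
or_, lo, hi = odds_ratio_ci(0.43, 0.19)
print(f"OR = {or_:.2f} (95% CI {lo:.2f}-{hi:.2f})")
# An OR above 1 whose CI excludes 1 indicates significantly higher odds
# of choosing to call the police for that group.
```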
Discussion
The study shows that even when human decision-makers exhibit no baseline bias, prescriptive recommendations from a biased AI can induce discriminatory choices, increasing the likelihood of calling police on African-American and Muslim individuals in mental health crises. Presenting AI outputs descriptively (as risk flags) mitigated this effect, enabling respondents, especially clinicians, to exercise their own judgment and maintain fairness while still benefiting from some adherence to unbiased descriptive advice. These findings underscore the importance of auditing models for bias and inaccuracy and of carefully designing recommendation styles in human-AI systems. The ease with which biased outputs were produced by fine-tuning GPT-2 highlights risks in prevalent transfer-learning workflows. Beyond healthcare, similar risks exist in domains such as hiring, grading, and content moderation, where language models are increasingly deployed. Effective, ethical deployment requires rigorous human-factors evaluation, selection of recommendation framing that reduces undue adherence, and broader bias-mitigation strategies (e.g., peer networks), recognizing that expertise alone may not counteract biased AI influence.
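To illustrate how little effort the transfer-learning risk involves, here is a minimal sketch of fine-tuning GPT-2 on a small text corpus with the Hugging Face transformers Trainer. The file name finetune_examples.txt, the example format, and the hyperparameters are assumptions for illustration; the authors' actual fine-tuning data and settings are described in their paper and supplement.

```python
# Illustrative sketch (not the authors' code): fine-tuning GPT-2 on a small
# corpus of call summaries paired with recommendations. A few thousand short
# examples can noticeably shift the model's outputs, which is why biased
# fine-tuning data is dangerous.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical text file: one training example per line, e.g.
# "Summary: <vignette details> Recommendation: call for medical help"
dataset = load_dataset("text", data_files={"train": "finetune_examples.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-recommender",
                           num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()  # causal language modeling on the custom examples
```

Whether the resulting recommender is biased depends entirely on the composition of the fine-tuning corpus, which is the point the authors make about auditing such workflows.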
Conclusion
The paper demonstrates that biased AI, when presented prescriptively, can induce discriminatory emergency response decisions among both clinicians and non-experts, while descriptive framing preserves unbiased decision-making. Contributions include experimental evidence of framing effects as a mitigation strategy for AI bias in high-stakes decisions and quantification of algorithmic adherence by user type and style. The authors advocate thorough validation of both model performance and human-AI interaction before deployment, continuous auditing for bias, and careful selection/testing of recommendation styles. Future research should examine subtler cues of identity (e.g., names, accents, images) that may trigger implicit biases, evaluate additional domains and decision types, and develop standardized frameworks for presenting AI advice to balance influence with human judgment.
Limitations
- Identity signaling: Race and religion were communicated explicitly in text; such direct mention may not trigger implicit biases as strongly as subtler cues (e.g., names, accents, images). The study cannot rule out baseline biases under different stimulus designs.
- Scenario generalizability: The decision task, while analogous to real clinical judgments, is a vignette-based mental health emergency context and not strictly medical; findings may not generalize to all clinical settings or tasks.
- Model presentation: Only two framing styles (prescriptive vs descriptive flag) were tested; other formats (e.g., calibrated risk scores with action thresholds) were not explored.
- Recruitment and sample: Participants were recruited online and via institutions in the US and Canada, which may limit generalizability; exclusions based on completion speed and bot/duplicate detection may bias the sample (not explicitly assessed).