Medicine and Health
Peeking into a black box, the fairness and generalizability of a MIMIC-III benchmarking model
E. Röösli, S. Bozkurt, et al.
This study examines bias and fairness in a MIMIC-III-trained model for predicting in-hospital mortality, revealing significant disparities affecting Black and publicly insured ICU patients. Conducted by Eliane Röösli, Selen Bozkurt, and Tina Hernandez-Boussard, the work emphasizes the necessity of robust fairness assessments in healthcare model evaluations.
~3 min • Beginner • English
Introduction
The growth of digital health data has spurred widespread development of AI models for clinical decision support, with the promise of more efficient, effective care. Yet mounting evidence shows that such models can be biased, risking harm to minorities and historically disadvantaged populations. Bias may arise because models learn from data that reflect an inequitable healthcare system, potentially reinforcing disparities through feedback loops. Prior work has proposed fairness reporting frameworks and numerous formal definitions of fairness (e.g., classification parity and calibration), but a standard, comprehensive evaluation set is still emerging. Given the central role of open datasets such as MIMIC for research, education, and benchmarking, it is important to understand their inherent biases, demographic representativeness, and risks of overfitting. This study conducts an empirical case study of a widely used MIMIC-III in-hospital mortality (IHM) benchmark model. The objectives are to: (1) replicate the original benchmark cohort and evaluate demographics; (2) externally validate the model on an independent EHR system (STARR) to test generalizability; and (3) retrain and test the model on STARR, applying at each stage a fairness and generalizability framework to assess performance parity and calibration across gender, ethnicity, and insurance payer type.
Literature Review
The study builds on a growing body of work documenting bias and fairness concerns in healthcare AI. Prior frameworks span algorithmic audits, life-cycle bias screenings, and single-attribute fairness studies, yet no universal fairness definition exists; many can be grouped into anti-classification, classification parity, and calibration. The MIMIC database is a foundational open EHR resource that has catalyzed thousands of studies and serves as a benchmark for AI development and education. Harutyunyan et al. developed a MIMIC-III multi-task benchmarking pipeline for several prediction tasks, including in-hospital mortality, which has since been adopted by many research teams with limited comprehensive fairness evaluations. Given the potential for single-site benchmarks to embed bias and the widespread influence of MIMIC, there is a need for rigorous fairness and generalizability assessments beyond traditional performance reporting (e.g., accuracy, AUROC), especially under class imbalance where such metrics can be misleading.
Methodology
Data sources: MIMIC-III v1.4 includes deidentified ICU EHRs from Beth Israel Deaconess Medical Center (2001–2012). STARR includes EHR data from Stanford Health Care and Stanford Children’s Hospital (ICU stays from Nov 2014–Jul 2019). Demographics were mapped to consistent categories across the two datasets: gender (female/male), insurance (Medicare, Medicaid, private, other), and ethnicity (White [non-Hispanic], Black, Asian, Hispanic, Other).
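To make the cross-dataset stratification concrete, here is a minimal sketch of how raw demographic labels might be harmonized into the shared categories above; the source labels and the `harmonize` helper are hypothetical illustrations, not the study's actual mapping.

```python
# Illustrative only: hypothetical raw labels mapped to the shared categories.
ETHNICITY_MAP = {
    "WHITE": "White (non-Hispanic)",
    "BLACK/AFRICAN AMERICAN": "Black",
    "ASIAN": "Asian",
    "HISPANIC OR LATINO": "Hispanic",
}

INSURANCE_MAP = {
    "MEDICARE": "Medicare",
    "MEDICAID": "Medicaid",
    "PRIVATE": "Private",
}

def harmonize(raw_label: str, mapping: dict) -> str:
    """Map a raw demographic label to a shared category, defaulting to 'Other'."""
    return mapping.get(raw_label.strip().upper(), "Other")
```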
Cohort construction: For both MIMIC and STARR, the benchmark's exclusion criteria were mirrored: admissions with multiple ICU stays or ICU transfers were excluded, patients under 18 years were excluded, and only events mapping to the 17 benchmark physiological variables were retained. The IHM cohort then consists of ICU stays with a length of stay greater than 48 hours and at least one observation in the first 48 hours after ICU admission.
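As a rough illustration of this exclusion cascade, the following pandas sketch filters a hypothetical per-ICU-stay table; the column names are assumptions for illustration, not the benchmark's actual schema.

```python
import pandas as pd

def build_ihm_cohort(stays: pd.DataFrame) -> pd.DataFrame:
    """Apply benchmark-style exclusions to a per-ICU-stay table (illustrative columns)."""
    cohort = stays[stays["icu_stays_in_admission"] == 1]   # drop admissions with multiple ICU stays/transfers
    cohort = cohort[cohort["age"] >= 18]                    # adults only
    cohort = cohort[cohort["los_hours"] > 48]               # length of stay > 48 hours
    cohort = cohort[cohort["events_first_48h"] > 0]         # observations present in the first 48 h
    return cohort
```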
Benchmark model: The focus is the channel-wise LSTM (without deep supervision) from Harutyunyan et al. The architecture processes each physiological variable in its own bidirectional LSTM channel before combining the per-channel representations, using 48-hour input windows to capture changes in patient trajectory. The outcome is in-hospital mortality before discharge.
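A condensed Keras sketch of this channel-wise idea is shown below; layer sizes, the value-plus-mask channel layout, and other hyperparameters are illustrative assumptions rather than the published benchmark configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

N_STEPS, N_CHANNELS, CHANNEL_DIM = 48, 17, 2  # hourly steps; value + mask per variable (assumed layout)

inputs = layers.Input(shape=(N_STEPS, N_CHANNELS * CHANNEL_DIM))

# Encode each physiological variable with its own bidirectional LSTM channel.
channel_encodings = []
for c in range(N_CHANNELS):
    channel = layers.Lambda(
        lambda x, c=c: x[:, :, c * CHANNEL_DIM:(c + 1) * CHANNEL_DIM]
    )(inputs)
    channel_encodings.append(
        layers.Bidirectional(layers.LSTM(8, return_sequences=True))(channel)
    )

# Combine per-channel representations over time, then predict mortality risk.
merged = layers.Concatenate()(channel_encodings)
summary = layers.LSTM(64)(merged)
output = layers.Dense(1, activation="sigmoid")(summary)

model = tf.keras.Model(inputs, output)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(name="auroc")])
```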
Three-stage analytical framework: (1) Internal validation: train and test on MIMIC; (2) External validation: apply the MIMIC-trained model to STARR; (3) Retraining: train on STARR and test internally on STARR. At each stage, conduct: (A) descriptive cohort analyses (demographics, outcome distribution, missingness), (B) performance and fairness evaluations (discrimination and calibration overall and stratified), and (C) comprehensive reporting per MINIMAR.
Performance metrics: Threshold-based precision, recall, accuracy at a 0.5 decision threshold; ranking metrics AUROC and AUPRC (with AUPRC baseline equal to the event rate, important under class imbalance). Uncertainty via bootstrap (10,000 resamples) to obtain 95% CIs.
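A minimal sketch of this evaluation, assuming NumPy arrays of labels and predicted risks, could look as follows; the helper name and defaults are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def bootstrap_ci(y_true, y_prob, metric, n_boot=10_000, seed=0):
    """Point estimate plus non-parametric bootstrap 95% CI for a ranking metric."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    rng = np.random.default_rng(seed)
    point = metric(y_true, y_prob)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if y_true[idx].sum() == 0:            # skip resamples with no events
            continue
        scores.append(metric(y_true[idx], y_prob[idx]))
    lo, hi = np.percentile(scores, [2.5, 97.5])
    return point, (lo, hi)

# AUROC and AUPRC; note the AUPRC baseline equals the event rate y_true.mean().
# auroc, auroc_ci = bootstrap_ci(y_true, y_prob, roc_auc_score)
# auprc, auprc_ci = bootstrap_ci(y_true, y_prob, average_precision_score)
```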
Calibration assessment: Validation plots per Steyerberg et al., using Hosmer–Lemeshow-style risk groups whose boundaries follow the skew-adjusting quantile transform f(q)=q^5 rather than equal-sized groups, Wilson-score binomial CIs for each risk group, and a LOWESS smoothing line. Calibration-in-the-large compares average predicted versus observed risk (with bootstrap CIs). Comorbidity-risk plots relate risk-score percentiles to the Charlson comorbidity index (computed over the 2 years before ICU admission; available for STARR only) under demographic stratification.
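The sketch below illustrates one way these checks could be coded; the exact way the f(q)=q^5 transform is applied to group boundaries is an assumption, as are the function names.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.proportion import proportion_confint

def calibration_groups(y_true, y_prob, n_groups=10):
    """Observed vs mean predicted risk per group, with Wilson CIs on the observed rate."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    qs = np.linspace(0, 1, n_groups + 1) ** 5   # assumed application of the f(q)=q^5 transform
    edges = np.quantile(y_prob, qs)
    groups = np.clip(np.searchsorted(edges, y_prob, side="right") - 1, 0, n_groups - 1)
    rows = []
    for g in range(n_groups):
        mask = groups == g
        if not mask.any():
            continue
        events, n = int(y_true[mask].sum()), int(mask.sum())
        lo, hi = proportion_confint(events, n, alpha=0.05, method="wilson")
        rows.append({"mean_predicted": y_prob[mask].mean(),
                     "observed": events / n,
                     "wilson_lo": lo, "wilson_hi": hi})
    return pd.DataFrame(rows)

# Calibration-in-the-large: difference between mean predicted and observed risk.
# citl_gap = y_prob.mean() - y_true.mean()
```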
Fairness evaluations: Classification parity was assessed via AUROC and AUPRC across protected groups, interpreted carefully in light of differing event rates under class imbalance. Calibration fairness was assessed via calibration-in-the-large and comorbidity-risk analyses across gender, ethnicity, and insurance payer type. SMOTE was tested during training as a mitigation for class imbalance but did not resolve these concerns.
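Here is a simple sketch of the classification-parity check, computing discrimination metrics per protected group alongside each group's event rate (the per-group AUPRC baseline); the column names are illustrative assumptions.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score, average_precision_score

def stratified_discrimination(df: pd.DataFrame, group_col: str,
                              label_col: str = "ihm", prob_col: str = "risk") -> pd.DataFrame:
    """Per-group AUROC/AUPRC with event rates, so parity gaps can be read against differing baselines."""
    rows = []
    for group, sub in df.groupby(group_col):
        if sub[label_col].nunique() < 2:      # metrics undefined without both outcome classes
            continue
        rows.append({
            group_col: group,
            "n": len(sub),
            "event_rate": sub[label_col].mean(),   # AUPRC baseline for this group
            "auroc": roc_auc_score(sub[label_col], sub[prob_col]),
            "auprc": average_precision_score(sub[label_col], sub[prob_col]),
        })
    return pd.DataFrame(rows)

# Example: stratified_discrimination(test_predictions, "ethnicity")
```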
Key Findings
Descriptive cohorts: MIMIC included 18,094 patients and 21,339 ICU stays with an overall IHM rate of 13.23%; STARR included 6,066 patients and 6,407 ICU stays with an IHM rate of 10.18%. STARR had a similar gender distribution but fewer patients aged ≥70 and was more ethnically diverse (approximately half non-Hispanic White vs two-thirds in MIMIC). The insurance mix was similar (≈ one-third private, >50% Medicare). Female patients had higher IHM in both datasets, and IHM rose with age.
Missingness: In MIMIC, capillary refill rate (CRR), FiO2, and height were missing in over two-thirds of ICU stays; CRR was effectively unavailable. For eleven of the 17 variables, fewer than 2% of stays had no measurement at all in the first 48 hours. STARR had better coverage for FiO2 and height but more variables that were entirely missing for a given stay; four variables had full coverage in more than 50% of STARR stays, whereas no variable exceeded 20% in MIMIC.
Performance (discrimination): Stage 1 (MIMIC internal): test IHM 11.56%; AUROC and accuracy >0.8; AUPRC lower (≈0.50). Event recall was low (~25%), indicating poor sensitivity to true positives despite good AUROC/accuracy. Stage 2 (external on STARR; IHM 10.18%): AUROC remained similar while AUPRC dropped by almost 0.1; event recall fell below 20%, identifying only about one in five high-risk patients. Stage 3 (retrained on STARR; IHM 10.19%): AUROC and AUPRC improved vs stage 2; event precision improved markedly, but event recall remained very low.
Calibration: Stage 1: generally well-calibrated with slight underestimation at low risk and overestimation at high risk. Stage 2: still reasonably calibrated but with stronger underestimation at low risk (statistically significant). Stage 3: consistent risk underestimation across all strata; underestimation slightly less severe at low risk but worse at high risk than stage 2.
Classification parity (fairness): Stage 1: despite similar event rates for Medicaid and privately insured patients (8.5% vs 8.3%), Medicaid had worse AUROC/AUPRC; White patients (higher IHM, 11.3%) outperformed Black patients (9.7%), violating parity; there were no notable gender differences; high performance for Hispanic patients aligned with their notably lower IHM (7.3%). Stage 2: insurance groups showed minimal performance differences, and Medicaid predictions were much better than in MIMIC. However, Black patients again had markedly worse discrimination than White patients despite similar IHM (9.3% vs 8.7%); Asian and Hispanic performance was similar to or slightly better than that of White patients; gender differences remained negligible. Stage 3: despite similar event rates across insurance groups, Medicaid again underperformed on AUROC/AUPRC; predictions for female patients were more accurate than for male patients despite their higher IHM; Black patients had the worst discrimination among ethnic groups, with performance gaps persisting even though Black patients had the lowest IHM (~7.3%), which should favor performance under imbalance.
Calibration-in-the-large: Stage 1: overall underestimation (predicted 10.4% vs observed 11.6%). Private insurance was best calibrated; Medicare risk was underestimated; Medicaid risk was slightly overestimated. Calibration was stable across gender; by ethnicity, Hispanic risk was overestimated, and minority-group CIs were wider. Stage 2: underestimation intensified (predicted 7.0% vs observed 10.2%); Medicaid and private remained relatively well calibrated while Medicare risk was strongly underestimated; female risk was more underestimated than male; overall, insurance-group calibration differences largely diminished. Stage 3: persistent underestimation for all groups; female underestimation remained worse; the Black group showed the best calibration-in-the-large despite poor discrimination.
Comorbidity-risk relationships (STARR): High-risk Medicaid patients exhibited disproportionately high comorbidity burden, whereas high-risk privately insured patients showed lower comorbidity at a given risk percentile. Medicare patients had consistently higher comorbidity (~+1 Charlson point) across risk percentiles. At higher risk strata, males had slightly higher comorbidity than females. Hispanic and Asian patients tended to have higher comorbidity than White and Black patients at the same risk percentiles; group-specific curves diverged notably at high risk.
Discussion
The study’s framework exposed important limitations of a widely used MIMIC-III IHM benchmark when examined through fairness and generalizability lenses. Despite strong AUROC/accuracy, low event recall across stages indicates a substantial class imbalance problem that undermines clinical utility, particularly for identifying high-risk patients. External validation demonstrated partial generalizability, yet fairness analyses revealed systematic disparities: Medicaid patients (vs private) and Black patients (vs White) consistently experienced worse discrimination despite similar or even more favorable event rates. Calibration disparities, especially for Medicare and female patients in external settings, highlight systematic risk underestimation that could translate into under-allocation of care for certain groups. Comorbidity-risk analyses further suggested that high-risk Medicaid patients’ predictions may be disproportionately driven by chronic disease burden, reflecting broader systemic inequities in care. Together, these findings address the research objective by showing that open-benchmark models can mask bias and performance deficiencies without comprehensive, stratified evaluations and external validation, and that relying on aggregate metrics alone (e.g., accuracy, AUROC) is insufficient under class imbalance.
Conclusion
This case study demonstrates that a prominent MIMIC-III IHM benchmark exhibits class imbalance challenges and fairness concerns, particularly affecting Black and publicly insured patients. While the model shows some generalizability, discrimination and calibration disparities persist across demographic groups, and event recall remains low even after retraining on an external dataset. The work underscores the necessity of comprehensive fairness and performance assessments—including external validation, parity analyses (AUROC/AUPRC with event-rate context), calibration checks, and transparent reporting (e.g., MINIMAR)—for benchmark pipelines based on open data. Future research should develop and adopt standardized fairness evaluation protocols, investigate mitigation strategies tailored to class imbalance and demographic disparities, expand multi-center validation beyond single academic sites, and refine benchmarks/data curation to enhance representativeness and reduce embedded systemic biases.
Limitations
Key limitations include: (1) the intrinsic class imbalance in the outcome, which constrains fairness assessments and yields low recall; (2) the lack of multi-center data beyond two academic health systems, limiting generalizability; and (3) similarities between MIMIC and STARR settings (affluent academic hospitals), which may attenuate the diversity of clinical environments. Additionally, differences in data completeness and variable availability across datasets and the limited availability of comprehensive pre-ICU comorbidity data in MIMIC constrained certain analyses.