The limits of fair medical imaging AI in real-world generalization

Medicine and Health

Y. Yang, H. Zhang, et al.

This study examines the challenges of fairness in medical AI for disease classification across multiple imaging modalities, showing how demographic shortcuts can lead to biased predictions. Conducted by Yuzhe Yang, Haoran Zhang, Judy W. Gichoya, Dina Katabi, and Marzyeh Ghassemi, the research finds that models encoding less demographic information tend to remain fairer across diverse clinical settings, and it distills best practices for equitable AI deployment.

Introduction
Medical AI systems are increasingly deployed in clinical settings, so they must be evaluated not only for overall performance but also for potential biases affecting protected groups. Despite strong performance on medical imaging tasks, models can reflect and amplify biases present in their training data, leading to disparities such as underdiagnosis of certain racial groups. Recent studies show that deep models can infer demographic attributes (race, sex, age) from medical images, raising the concern that disease classifiers may exploit demographic features as shortcut heuristics with no causal clinical basis. This work addresses four questions: (1) Do disease classification models use demographic information as a shortcut, and does this produce biased predictions? (2) Can state-of-the-art methods remove such shortcuts and yield fair, locally optimal models in-distribution (ID)? (3) How do fairness and shortcut reliance behave under real-world distribution shifts, where the shortcut correlations may no longer hold? (4) Which algorithms and ID model-selection criteria lead to globally optimal models that remain fair out of distribution (OOD)?
Literature Review
Prior literature documents performance disparities of medical imaging models across subgroups and highlights risks of shortcut learning, where models rely on spurious correlations such as site or demographic features. Studies have shown that AI can predict demographics like race, sex, and age from medical images at levels exceeding human experts, and that chest X-ray classifiers can underdiagnose Black patients. The fairness literature often uses equalized odds/equal opportunity to assess group-level error disparities. Work on robust and fair learning proposes reweighting, distributionally robust optimization, and adversarial approaches to remove group information and mitigate shortcuts. However, fairness-accuracy tradeoffs and incompatibilities with calibration by group are known theoretical challenges, and transfer of fairness under distribution shift remains an open problem.
Methodology
Study scope and datasets: The primary focus is chest X-ray (CXR) classification across six datasets (MIMIC-CXR, CheXpert, NIH, SIIM-ACR, PadChest, VinDr) spanning multiple geographies, with additional validation in dermatology (ISIC; ‘No Finding’) and ophthalmology (ODIR; ‘Retinopathy’). Tasks: four binary CXR tasks (‘No Finding’, ‘Effusion’, ‘Pneumothorax’, ‘Cardiomegaly’), dermatology ‘No Finding’, and ophthalmology ‘Retinopathy’. Data were split 70/15/15 into train/validation/test; samples with missing sex or age were excluded; race labels were available for MIMIC-CXR and CheXpert.

Fairness definition and metrics: Class-conditioned error rates were evaluated per demographic group, with a focus on underdiagnosis risk. For ‘No Finding’ (the healthy label), false positive rate (FPR) gaps were compared; for disease labels, false negative rate (FNR) gaps. Equality in these metrics corresponds to equal opportunity, and FPR/FNR gaps also indicate potential shortcut usage. Additional metrics included AUROC (overall and worst-group), calibration (ECE and ECE gap), average precision, and F1.

Distribution-shift quantification: ID shifts across groups in prevalence P(Y|A) and representation P(X|A) were assessed using total variation distance and MMD on foundation-model embeddings (MedCLIP), with permutation/binomial tests and Bonferroni correction. For OOD (cross-dataset) shifts, differences in label distribution P(Y), prevalence P(Y|A), covariates P(X), and group-conditional representation P(X|A) were examined.

Modeling and algorithms: DenseNet-121 models initialized from ImageNet were trained across tasks, attributes (age, sex, race, sex&race), and algorithms: ERM; sample reweighting/resampling (ReSample, GroupDRO); group-adversarial removal of attribute information (DANN, CDANN); and general robustness via an exponential moving average of weights (MA). A random hyperparameter search was conducted (12 configurations per task–algorithm–attribute combination), with early stopping by worst-group AUROC, yielding 3,456 models trained on MIMIC-CXR (4 tasks × 4 attributes × 6 algorithms × 12 hyperparameter configurations × 3 seeds). Multi-label training on 14 CXR labels was also explored.

Demographic-encoding measurement: For each trained disease model, the feature extractor was frozen and a logistic-regression head was trained to predict demographic attributes from the penultimate representations; the attribute-prediction AUROC is reported as the measure of demographic encoding.

OOD evaluation and model selection: Models were trained on one dataset (typically MIMIC-CXR) and evaluated externally on the five other datasets. Eight ID-only selection criteria were proposed, including minimum ID fairness gap, best ID performance, and minimal demographic encoding (minimum attribute-prediction accuracy/AUROC). Each criterion was compared against an oracle that uses OOD labels to pick the model with the smallest OOD fairness gap.

OOD fairness decomposition: The OOD fairness gap is expressed as the ID fairness gap plus the difference in how distribution shift changes each group’s performance, attributing OOD disparities to heterogeneous shift impact across groups.

Illustrative formalizations and code sketches of these metrics and procedures follow below.
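To make the fairness metrics concrete, a minimal formalization is given here; the notation (Ŷ the prediction, Y the label, A the demographic attribute, a and b two groups) is chosen for illustration and mirrors the equal-opportunity style gaps described above.

```latex
% Class-conditioned error-rate gaps between groups a and b (equal opportunity):
% 'No Finding' (healthy label) -- underdiagnosis appears as a false-positive gap:
\mathrm{Gap}_{\mathrm{FPR}} = \bigl| P(\hat{Y}=1 \mid Y=0, A=a) - P(\hat{Y}=1 \mid Y=0, A=b) \bigr|
% Disease labels -- underdiagnosis appears as a false-negative gap:
\mathrm{Gap}_{\mathrm{FNR}} = \bigl| P(\hat{Y}=0 \mid Y=1, A=a) - P(\hat{Y}=0 \mid Y=1, A=b) \bigr|
```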
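A minimal sketch (not the authors' released code) of how the per-group error rates, the resulting fairness gap, and the worst-group AUROC used for early stopping could be computed, assuming NumPy arrays of labels, scores, and attribute values:

```python
# Per-group FPR/FNR, fairness gap, and worst-group AUROC (illustrative sketch).
import numpy as np
from sklearn.metrics import roc_auc_score

def group_metrics(y_true, y_score, group, threshold=0.5):
    """y_true: {0,1} labels; y_score: predicted probabilities; group: attribute value per sample."""
    y_pred = (y_score >= threshold).astype(int)
    fpr, fnr, auroc = {}, {}, {}
    for g in np.unique(group):
        m = group == g
        neg, pos = y_true[m] == 0, y_true[m] == 1
        fpr[g] = y_pred[m][neg].mean() if neg.any() else np.nan       # P(Yhat=1 | Y=0, A=g)
        fnr[g] = (1 - y_pred[m][pos]).mean() if pos.any() else np.nan  # P(Yhat=0 | Y=1, A=g)
        auroc[g] = roc_auc_score(y_true[m], y_score[m])
    return fpr, fnr, auroc

def fairness_gap(rates):
    """Largest between-group difference in an error rate (FPR for 'No Finding', FNR for diseases)."""
    vals = [v for v in rates.values() if not np.isnan(v)]
    return max(vals) - min(vals)

def worst_group_auroc(aurocs):
    """Selection / early-stopping criterion: the minimum per-group AUROC."""
    return min(aurocs.values())
```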
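The representation-shift comparison for P(X|A) can be sketched as a kernel MMD with a permutation test over embeddings (the study uses MedCLIP embeddings; the Gaussian kernel and median-heuristic bandwidth below are illustrative implementation choices, not taken from the paper):

```python
# Kernel-MMD permutation test between two groups' embedding sets (illustrative sketch).
import numpy as np

def _gaussian_kernel(X, Y, bandwidth):
    # Pairwise squared distances, then RBF kernel values.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth ** 2))

def mmd2(X, Y, bandwidth):
    # Biased estimate of the squared maximum mean discrepancy.
    return (_gaussian_kernel(X, X, bandwidth).mean()
            + _gaussian_kernel(Y, Y, bandwidth).mean()
            - 2 * _gaussian_kernel(X, Y, bandwidth).mean())

def mmd_permutation_test(X, Y, n_perm=1000, seed=0):
    """X, Y: embedding arrays of shape (n_a, d) and (n_b, d) for two demographic groups."""
    rng = np.random.default_rng(seed)
    Z = np.vstack([X, Y])
    # Median-heuristic bandwidth over nonzero pairwise distances (illustrative choice).
    dists = np.sqrt(((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1))
    bandwidth = np.median(dists[dists > 0])
    observed = mmd2(X, Y, bandwidth)
    null = np.empty(n_perm)
    for i in range(n_perm):
        perm = rng.permutation(len(Z))
        null[i] = mmd2(Z[perm[:len(X)]], Z[perm[len(X):]], bandwidth)
    # Permutation p-value with the standard +1 correction.
    return observed, (np.sum(null >= observed) + 1) / (n_perm + 1)
```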
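DANN-style group-adversarial training inserts a gradient-reversal layer between the shared feature extractor and an attribute-prediction head, so the backbone is discouraged from encoding attribute information while the disease head trains normally. The sketch below is a generic PyTorch illustration; the backbone, heads, and weight `lam` are placeholders, not the study's exact configuration.

```python
# Minimal gradient-reversal (DANN-style) training step; shapes and weights are placeholders.
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Flip the gradient flowing from the attribute head into the shared features.
        return -ctx.lam * grad_output, None

def dann_step(backbone, disease_head, attr_head, x, y, a, lam=1.0):
    """One training step: disease loss plus adversarial attribute loss on reversed features."""
    feats = backbone(x)                                      # shared representation
    disease_logits = disease_head(feats).squeeze(-1)         # binary disease task
    attr_logits = attr_head(GradReverse.apply(feats, lam))   # attribute discriminator
    loss = (nn.functional.binary_cross_entropy_with_logits(disease_logits, y.float())
            + nn.functional.cross_entropy(attr_logits, a))
    return loss
```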
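The demographic-encoding probe can be sketched as follows: freeze the trained disease model's feature extractor, fit a logistic-regression head on its penultimate-layer features, and report the attribute-prediction AUROC. Here `extract_features` is a hypothetical helper standing in for a forward pass through the frozen backbone.

```python
# Linear probe for demographic encoding on frozen features (illustrative sketch).
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def encoding_auroc(extract_features, X_train, a_train, X_test, a_test):
    Z_train = extract_features(X_train)    # embeddings from the frozen feature extractor
    Z_test = extract_features(X_test)
    probe = LogisticRegression(max_iter=1000).fit(Z_train, a_train)
    scores = probe.predict_proba(Z_test)
    if scores.shape[1] == 2:               # binary attribute, e.g. sex
        return roc_auc_score(a_test, scores[:, 1])
    return roc_auc_score(a_test, scores, multi_class="ovr")  # e.g. age bins or race
```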
Key Findings
- Medical imaging models encode protected attributes: disease-model representations contain substantial demographic information (age, race, sex, and the intersection of sex and race) across CXR, ophthalmology, and dermatology.
- Encoding correlates with unfairness: strong positive correlations between attribute-encoding AUROC and fairness (underdiagnosis) gaps, for example ‘No Finding’–age R=0.82 (P=4.7×10^-8); ‘No Finding’–sex&race R=0.81 (P=8.4×10^-9); ‘Cardiomegaly’–age R=0.81 (P=1.9×10^-7); ‘Effusion’–race R=0.71 (P=6.4×10^-6); ‘Pneumothorax’–sex R=0.59 (P=2.3×10^-3). Fairness gaps reach ~30% for age on CXR and up to ~43% FNR for age in ophthalmology.
- Shortcut mitigation yields locally optimal ID fairness: rebalancing and adversarial/robust methods (GroupDRO, DANN, CDANN) reduced FPR/FNR gaps in-distribution without notable AUROC loss, forming Pareto fronts in which small performance sacrifices yield large fairness gains across modalities.
- Tradeoffs with other metrics: fairer models (smaller fairness gaps) often had worse calibration (larger ECE gaps) and lower average precision/F1. For ‘No Finding’, the ECE gap versus fairness gap showed a strong negative correlation (age: R=-0.85, P=7.5×10^-42), indicating calibration–fairness incompatibilities.
- Fairness does not reliably transfer OOD: while ID and OOD performance (AUROC) correlated highly, ID–OOD fairness correlations varied and could be negative (e.g., ‘Pneumothorax’ with sex&race: R=-0.50, P=4.4×10^-4). Models that were Pareto-optimal ID often lost Pareto optimality OOD.
- OOD fairness decomposition: in the CheXpert→MIMIC ‘No Finding’ (sex) example, the ID gap was ~-0.1% (not significant) while the OOD FPR gap was 3.2%, driven by a larger FPR increase in females (+3.9%) than males (+0.8%); asymmetric distribution-shift impact drives OOD unfairness (worked through below).
- Globally optimal selection favors minimal demographic encoding: selecting models with minimum ID attribute encoding (minimum attribute-prediction accuracy/AUROC) significantly reduced the average increase in the OOD fairness gap compared with selecting by minimum ID fairness gap (e.g., P=9.60×10^-4 and P=1.95×10^-12, Wilcoxon).
- Algorithm choice for OOD fairness: when selecting, per algorithm, the model with the lowest ID fairness gap, DANN achieved the lowest average OOD fairness gap relative to the oracle (DANN vs ERM: P=1.86×10^-17), suggesting that adversarial removal of demographic information supports fairness transfer.
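The decomposition in the CheXpert→MIMIC example can be written out explicitly. With Δ_g denoting the change in group g's FPR from ID to OOD (sign convention chosen here for illustration), the rounded figures above roughly recover the reported 3.2% OOD gap:

```latex
% OOD fairness gap = ID fairness gap + group-asymmetric impact of distribution shift
\mathrm{Gap}^{\mathrm{OOD}}_{\mathrm{FPR}}
  = \mathrm{Gap}^{\mathrm{ID}}_{\mathrm{FPR}} + \bigl(\Delta_{\mathrm{female}} - \Delta_{\mathrm{male}}\bigr)
  \approx -0.1\% + (3.9\% - 0.8\%) \approx 3\%
```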
Discussion
The study answers key questions about demographic shortcut use, mitigation, and generalization. Disease classifiers encode substantial demographic information and exhibit subgroup fairness gaps, supporting the hypothesis that demographic shortcuts can contribute to unfair underdiagnosis. Mitigating shortcuts with rebalancing and adversarial/robust methods can produce locally optimal models with improved ID fairness at small performance costs, but such gains are not guaranteed to transfer across domains. Empirically, ID fairness correlates weakly and inconsistently with OOD fairness, and models that are ID Pareto-optimal often are not OOD Pareto-optimal. Decomposing OOD fairness reveals that heterogeneous distribution-shift effects across groups significantly drive observed OOD disparities, highlighting the need to manage both ID fairness and group-specific robustness under shift. These findings are clinically significant: reliance on demographic encodings may not be appropriate if they reflect non-causal or site-specific correlations, and removing such encodings can help, but practitioners must weigh fairness against calibration and other utility metrics. Regulatory implications include the importance of external validation and ongoing monitoring of both overall and per-group performance, as ID assurances may not predict real-world fairness. Finally, demographic variables may sometimes be causally relevant; decisions about their use should be context-specific and align model reliance to true causal effects, with caution about mislabeled variables and evolving distributions.
Conclusion
Medical imaging AI models encode protected attributes and display fairness gaps across demographics. While shortcut mitigation can yield locally optimal, fairer models in-distribution, fairness frequently does not transfer under distribution shift. Minimizing demographic encoding in representations and employing adversarial debiasing (e.g., DANN) improves the likelihood of achieving globally optimal models with better OOD fairness. Best practices include: evaluating demographic encoding; selecting models with minimal attribute encoding subject to performance constraints; assessing tradeoffs with calibration and other metrics; externally validating across multiple sites; and continuously monitoring per-group performance during deployment. Future work should develop principled methods and theory for fairness under distribution shifts, explore active data collection to reduce group-specific shift impact, and refine fairness definitions suited to clinical contexts while aligning model reliance on demographics with causal relevance.
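A minimal sketch of the selection rule suggested by these best practices: among candidate models that satisfy an in-distribution performance constraint, pick the one whose representations encode the least demographic information. The `Candidate` fields, example values, and the AUROC threshold are illustrative assumptions, not values from the paper.

```python
# ID-only model selection favoring minimal demographic encoding, subject to a
# performance constraint. Fields, threshold, and example numbers are illustrative.
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    id_auroc: float              # in-distribution task AUROC
    attr_encoding_auroc: float   # attribute-prediction AUROC from the linear probe

def select_model(candidates, min_id_auroc=0.80):
    """Among sufficiently accurate models, choose the one with the least attribute encoding."""
    eligible = [c for c in candidates if c.id_auroc >= min_id_auroc]
    pool = eligible if eligible else candidates  # fall back if the constraint filters everything
    return min(pool, key=lambda c: c.attr_encoding_auroc)

models = [
    Candidate("erm_seed0", id_auroc=0.86, attr_encoding_auroc=0.91),
    Candidate("dann_seed0", id_auroc=0.84, attr_encoding_auroc=0.62),
    Candidate("groupdro_seed1", id_auroc=0.85, attr_encoding_auroc=0.78),
]
print(select_model(models).name)  # -> "dann_seed0"
```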
Limitations
- Fairness–utility tradeoffs: improving equalized odds/equal opportunity often worsens calibration (ECE gap), average precision, and F1, reflecting known incompatibilities and potential leveling-down effects.
- Limited demographic availability: race labels were available only for some datasets (MIMIC-CXR, CheXpert), constraining cross-site race analyses.
- Task and modality scope: the study focused on specific CXR tasks and single tasks in dermatology and ophthalmology; results may differ for other tasks, modalities, or clinical settings.
- Retrospective, observational datasets: potential label noise and confounding (e.g., site, equipment) can affect estimates of fairness and encoding.
- Uncontrolled distribution shifts: real-world OOD conditions encompass multiple, interacting shifts; the decomposition is preliminary and may not capture all factors.
- Metric dependence: the emphasis is on equalized odds/equal opportunity; other fairness notions may be more appropriate depending on clinical use, and thresholding choices and operating points affect FPR/FNR gaps.