The limits of fair medical imaging AI in real-world generalization

Medicine and Health

Y. Yang, H. Zhang, et al.

This study examines fairness challenges in medical AI for disease classification across multiple imaging modalities, showing how demographic shortcuts lead to biased predictions. Conducted by Yuzhe Yang, Haoran Zhang, Judy W. Gichoya, Dina Katabi, and Marzyeh Ghassemi, the research finds that models which encode less demographic information tend to remain fairer when deployed in new clinical settings, pointing to best practices for equitable AI applications.

Introduction
The increasing use of AI in clinical settings necessitates rigorous evaluation of both model performance and potential biases affecting different demographic groups. While deep learning has achieved impressive results in medical imaging, concerns remain about its tendency to perpetuate existing data biases, resulting in performance disparities among protected subgroups. For example, chest X-ray (CXR) classifiers have shown systematic underdiagnosis of Black patients, potentially delaying crucial care. Addressing this requires understanding the sources of bias and implementing corrective measures. Recent studies highlight AI's unexpected ability to infer demographic information (race, sex, age) from medical images, far surpassing radiologists' capabilities. This raises the concern that disease prediction models might exploit these demographic correlations as 'shortcuts', statistical associations that lack clinical relevance, much as a model might use hospital location as a proxy for disease prevalence. This research addresses four key questions: 1) Do disease classification models utilize demographic information as shortcuts, leading to biased predictions? 2) Can state-of-the-art methods effectively remove these shortcuts and create fair ('locally optimal') models? 3) How does algorithmic fairness interact with shortcuts when data shifts occur in real-world clinical deployments (out of distribution, OOD)? 4) Which algorithms and model selection criteria yield 'globally optimal' models that maintain fairness in OOD settings? The study focuses on CXR prediction models trained on six large international radiology datasets (MIMIC-CXR, CheXpert, NIH, SIIM, PadChest, VinDr), with additional validation on dermatology and ophthalmology data.
Literature Review
Existing literature demonstrates AI's potential to exacerbate healthcare disparities. Studies have revealed fairness gaps in deep learning models across medical imaging tasks, including chest X-ray classification, where underdiagnosis of Black patients is a significant concern. Research has also shown that deep learning models can extract demographic information from medical images, raising concerns about the use of demographic shortcuts in disease prediction. Such shortcuts may improve in-distribution performance yet fail to hold in real-world scenarios where the data distribution shifts. Prior work has investigated methods for mitigating bias and improving model fairness, but the generalizability of these methods to OOD scenarios remains an open question.
Methodology
The study used six publicly available CXR datasets (MIMIC-CXR, CheXpert, NIH, SIIM, PadChest, VinDr), focusing on four binary classification tasks ('No Finding', 'Effusion', 'Pneumothorax', 'Cardiomegaly') known to exhibit performance disparities between groups. Additional analyses used dermatology (ISIC) and ophthalmology (ODIR) images. Fairness was assessed with class-conditioned error rates (false positive rate, FPR, for 'No Finding'; false negative rate, FNR, for the disease labels), targeting equalized odds. Distribution shifts between datasets (prevalence, representation, label and covariate shifts) were quantified with statistical tests (two-sample binomial proportion test, MMD distance and a permutation-based hypothesis test).

A grid of deep convolutional neural networks (DenseNet-121) was trained using six algorithms: empirical risk minimization (ERM), ReSample, GroupDRO, DANN, CDANN, and exponential moving average (MA). These algorithms fall into three broad approaches to fairness: sample reweighting, adversarial removal of group information, and general robustness improvement. Demographic attribute encoding was assessed by training a predictor of sensitive attributes (age, race, sex, and sex-race intersections) on frozen representations from the disease classification models, and the correlation between attribute encoding and fairness gaps was analyzed.

Models were evaluated in both in-distribution (ID) and out-of-distribution (OOD) settings. Locally optimal models (the Pareto front trading off performance and fairness) were identified in the ID setting, and the transferability of ID optimality to OOD settings was investigated. A decomposition method was proposed to analyze changes in OOD fairness, attributing them to the combination of the ID fairness gap and the disparate impact of the distribution shift on each group. Finally, model selection criteria for obtaining globally optimal models with good OOD fairness were evaluated, including selecting the model with minimum ID demographic encoding, minimum ID fairness gap, or best overall ID performance. Statistical analyses included Pearson correlation coefficients, two-sided t-tests, a one-tailed Wilcoxon rank-sum test and non-parametric bootstrap sampling.
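To make the fairness metric concrete, the following sketch computes the gap in a class-conditioned error rate between demographic subgroups. The function names and implementation details are our own illustration under the definitions above, not the authors' code.

```python
# Minimal sketch of the equalized-odds-style fairness gap: the difference in a
# class-conditioned error rate (FNR for disease labels, FPR for 'No Finding')
# between demographic subgroups. Names are illustrative assumptions.
import numpy as np

def class_conditioned_error(y_true, y_pred, error_type="fnr"):
    """FNR (misses among true positives) or FPR (false alarms among true negatives)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    mask = (y_true == 1) if error_type == "fnr" else (y_true == 0)
    wrong = (y_pred == 0) if error_type == "fnr" else (y_pred == 1)
    return wrong[mask].mean() if mask.any() else float("nan")

def fairness_gap(y_true, y_pred, groups, error_type="fnr"):
    """Largest difference in the class-conditioned error rate across demographic groups."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    rates = [class_conditioned_error(y_true[groups == g], y_pred[groups == g], error_type)
             for g in np.unique(groups)]
    return np.nanmax(rates) - np.nanmin(rates)
```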
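The attribute-encoding assessment can be sketched in the same spirit: fit a lightweight probe on frozen features from the disease model to predict a sensitive attribute, and read off its AUROC, where a higher value indicates stronger demographic encoding. The logistic-regression probe here is an assumed simplification for illustration; the study trains its own attribute predictor on frozen representations.

```python
# Hedged sketch of an attribute-encoding probe: a simple classifier fitted on
# frozen features to predict a binary sensitive attribute (e.g. sex). The
# choice of probe and the function name are assumptions for illustration.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def attribute_encoding_auroc(train_feats, train_attr, test_feats, test_attr):
    """AUROC of a linear probe predicting a sensitive attribute from frozen features."""
    probe = LogisticRegression(max_iter=1000).fit(train_feats, train_attr)
    return roc_auc_score(test_attr, probe.predict_proba(test_feats)[:, 1])
```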
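Under the group error-rate definition above, the decomposition of the OOD fairness gap can be read as an algebraic identity: the OOD gap equals the ID gap plus the difference in how strongly the shift changes each group's error rate. The notation and function below are ours, not necessarily the paper's exact formulation.

```python
# Hedged sketch of the OOD fairness-gap decomposition for two groups a and b,
# using signed gaps (group a minus group b); err_* are class-conditioned error rates.
def decompose_ood_gap(err_id_a, err_id_b, err_ood_a, err_ood_b):
    id_gap = err_id_a - err_id_b           # fairness gap in distribution
    shift_a = err_ood_a - err_id_a         # effect of the shift on group a
    shift_b = err_ood_b - err_id_b         # effect of the shift on group b
    disparate_impact = shift_a - shift_b   # how unevenly the shift hits the two groups
    ood_gap = id_gap + disparate_impact    # identity: equals err_ood_a - err_ood_b
    return id_gap, disparate_impact, ood_gap
```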
Key Findings
The study confirmed that medical imaging AI models leverage demographic shortcuts, resulting in significant fairness gaps (Fig. 2); stronger encoding of demographic information correlated strongly with larger fairness disparities (Fig. 2g). Algorithmic methods designed to remove shortcuts effectively reduced fairness gaps in the ID setting, creating 'locally optimal' models (Fig. 3). However, these locally optimal models did not consistently maintain fairness in OOD settings (Fig. 4), demonstrating that ID fairness does not necessarily transfer to OOD scenarios: the correlation between ID and OOD performance was high, whereas the correlation between ID and OOD fairness was inconsistent and often negative (Fig. 4). Optimizing for fairness alone worsened other clinically meaningful metrics such as calibration (Extended Data Fig. 5), highlighting trade-offs between fairness and other aspects of model performance. A decomposition of OOD fairness gaps revealed that the disparate impact of distribution shift on different groups is a significant contributor (Extended Data Fig. 7). Surprisingly, model selection criteria that minimized ID demographic encoding (lowest attribute prediction accuracy and AUROC) led to better OOD fairness than selecting solely on the ID fairness gap (Fig. 5). Among the algorithms tested, those that explicitly remove demographic information from the embeddings (DANN) achieved the lowest average OOD fairness gap (Fig. 5b). These findings were consistent across the radiology, ophthalmology and dermatology datasets and tasks.
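Read concretely, that selection rule amounts to choosing, from the grid of candidate models, the one whose ID representations encode the least demographic information, rather than the one with the smallest ID fairness gap or the best ID AUROC. A minimal sketch, assuming the illustrative probe from the Methodology section:

```python
# Hedged sketch of selection by minimum ID demographic encoding: among
# candidate models, pick the one whose frozen ID representations are least
# predictive of the sensitive attribute (lowest attribute-probe AUROC).
def select_by_min_encoding(candidate_auroc):
    """candidate_auroc: dict mapping model name -> ID attribute-probe AUROC."""
    return min(candidate_auroc, key=candidate_auroc.get)

# Illustrative (made-up) scores: the adversarially trained model, which
# encodes the least demographic information, would be selected.
selected = select_by_min_encoding({"erm": 0.91, "groupdro": 0.85, "dann": 0.62})
```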
Discussion
The study's findings have several important implications. They caution against the assumption that eliminating demographic shortcuts automatically yields better fairness in real-world deployments: while removing shortcuts improves ID fairness, the improvement may not transfer to OOD scenarios, and trade-offs with other important metrics exist. The complex interplay between demographics, disease and causal factors must also be considered carefully, as some demographic variables may themselves be causal factors in certain diseases. The proposed decomposition attributes changes in OOD fairness to the ID fairness gap and the disparate impact of distribution shifts, suggesting that both the ID fairness gap and the group-specific effects of distribution shift must be addressed to achieve OOD fairness. The results further highlight the need for continuous monitoring of both overall and per-group performance of AI models in clinical settings, challenging the current regulatory reliance on developer assurances and underscoring the need for real-world performance monitoring that includes fairness assessments. Finally, the choice of fairness definition should be tailored to the specific clinical context, carefully balancing fairness against overall performance.
Conclusion
This study demonstrates the limitations of focusing solely on ID fairness when developing medical AI models. The findings highlight the importance of considering OOD generalization and the trade-offs between fairness and other clinically relevant metrics. Minimizing demographic encoding in ID settings emerged as a promising strategy for achieving better OOD fairness. Future research should focus on developing methods for building models robust to arbitrary distribution shifts and creating fairness-aware model selection criteria suitable for real-world clinical deployment.
Limitations
The study's reliance on publicly available datasets may limit how well the findings generalize to other clinical settings. The chosen fairness metrics might not capture all aspects of fairness in healthcare. The decomposition of OOD fairness is a preliminary approach and may require refinement. The study also focuses on specific disease prediction tasks; other tasks might exhibit different behavior.