
Medicine and Health
Generative models improve fairness of medical classifiers under distribution shifts
I. Ktena, O. Wiles, et al.
Recent research by Ira Ktena and colleagues at Google DeepMind shows that generative AI, particularly diffusion models, can substantially improve machine learning performance in healthcare. By generating synthetic data to bolster underrepresented groups in medical imaging, the study demonstrates gains in diagnostic accuracy and fairness across multiple modalities.
~3 min • Beginner • English
Introduction
The advent of machine learning in healthcare promises advances across applications such as dermatology, computational pathology and radiology, often motivated by shortages of expert clinicians. However, widespread clinical deployment remains limited, in part due to performance degradation under out-of-distribution (OOD) shifts between training and deployment populations, diseases, acquisition technologies or environments. Few studies evaluate on external cohorts, and many models fail to generalize to new settings. Beyond OOD robustness, underrepresentation of specific groups, conditions or hospitals leads to fairness and equity challenges even in-distribution, with poorer performance on rarer groups or intersections. Prior work has shown disparities across subgroups in radiology, histopathology and dermatology, but robustness to distribution shifts and statistical fairness are rarely addressed together. This work proposes leveraging diffusion-based generative models to learn augmentations that increase data diversity in a configurable way, using available labeled and unlabeled data to improve both robustness and fairness without penalizing overrepresented groups. The approach is validated across histopathology, chest X-ray and dermatology, demonstrating high-quality generation and improvements in diagnostic accuracy and fairness; expert evaluation in dermatology showed synthetic images yielded diagnostic performance comparable to real images for diagnosable cases.
Literature Review
Related work has extensively used generative adversarial networks (GANs) for medical imaging augmentation to improve performance in tasks such as liver lesion classification, diabetic retinopathy, mammography, dermoscopy and segmentation. Although effective, GANs can suffer from limited diversity and quality. Denoising diffusion probabilistic models (DDPMs) have recently achieved state-of-the-art image generation and have been explored for medical domains, including 3D modalities and text-conditioned chest X-ray generation. Ethical distinctions are emphasized between augmenting datasets with synthetic images and replacing real data with them, particularly in the absence of privacy guarantees. In fairness for health, studies have documented biases across the ML pipeline leading to disparities by sex, age, race/ethnicity and socioeconomic status in imaging tasks; for example, sex imbalance in chest X-ray training data is associated with accuracy disparities, and higher underdiagnosis rates (falsely predicting the absence of disease) have been reported for some racial minorities, with compounding effects for intersectional identities. These findings motivate methods that can improve fairness and robustness simultaneously under distribution shifts.
Methodology
Overview: The method uses conditional diffusion models to learn augmentations that enrich training data for downstream diagnostic classifiers. Steps: (1) Train a generative model (a DDPM with a UNet backbone) on available labeled and unlabeled images, conditioning on diagnostic labels and optionally on a property (e.g., hospital ID or a demographic attribute); where higher resolution is needed, train an upsampler diffusion model with the same conditioning. (2) Sample synthetic images according to a fair sampling strategy: preserve the original diagnostic label distribution p(Y) while sampling the attribute A uniformly (p(A) uniform), balancing subgroups while maintaining disease prevalence; drawing multiple samples per (Y, A) pair increases diversity. (3) Train the diagnostic model on a mix of real labeled and synthetic images, treating the real-to-synthetic mixing ratio α as a hyperparameter. The theoretical motivation is to construct a training distribution that approximates a desired fair distribution by mixing real and generated samples and tuning α.
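As a concrete illustration of steps (2) and (3), here is a minimal Python sketch, assuming synthetic images have already been generated by the conditional diffusion model; `fair_condition_plan`, `mixed_batch` and `alpha` are illustrative names, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def fair_condition_plan(labels, attributes, n_synthetic):
    """Draw (y, a) conditioning pairs: y ~ empirical p(Y), a ~ uniform p(A)."""
    label_values, label_counts = np.unique(labels, return_counts=True)
    p_y = label_counts / label_counts.sum()            # preserve disease prevalence
    attr_values = np.unique(attributes)                # e.g. hospital IDs, skin tones
    ys = rng.choice(label_values, size=n_synthetic, p=p_y)
    attrs = rng.choice(attr_values, size=n_synthetic)  # uniform over subgroups
    return list(zip(ys, attrs))

def mixed_batch(real_ds, synth_ds, batch_size, alpha):
    """Compose a training batch with fraction `alpha` of real examples.

    Assumes both datasets are larger than the batch; indices are drawn
    without replacement within a batch.
    """
    n_real = int(round(alpha * batch_size))
    real_idx = rng.choice(len(real_ds), size=n_real, replace=False)
    synth_idx = rng.choice(len(synth_ds), size=batch_size - n_real, replace=False)
    return [real_ds[i] for i in real_idx] + [synth_ds[i] for i in synth_idx]
```

Under this convention, the best-performing mixes reported below correspond to alpha = 0.5 for histopathology, 0.0 for radiology and 0.25 for dermatology.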
Experimental protocol and metrics: The approach is evaluated on histopathology, chest radiology and dermatology. Performance metrics reflect each task: histopathology uses top-1 accuracy (binary tumor vs not); radiology uses ROC-AUC across five thoracic conditions (multi-label); dermatology uses sensitivity for high-risk conditions (basal cell carcinoma, melanoma, SCC/SCCIS), alongside top-3 accuracy and balanced accuracy. Fairness metrics quantify performance gaps across sensitive attributes: for binary attributes, the gap between subgroups; for categorical attributes (e.g., hospital ID, ethnicity), the difference between best and worst subgroup; for continuous attributes (e.g., age), discretized buckets are used.
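A plausible implementation of the best-minus-worst subgroup gap is sketched below; the helper name `subgroup_gap` is ours, and substituting top-1 accuracy or high-risk sensitivity for the metric recovers the histopathology and dermatology variants.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def subgroup_gap(y_true, y_score, groups, metric=roc_auc_score):
    """Return (gap, per-group scores): best minus worst subgroup performance.

    For a binary attribute this reduces to the difference between the two
    subgroups; continuous attributes such as age should be discretized into
    buckets before being passed in.
    """
    y_true, y_score, groups = map(np.asarray, (y_true, y_score, groups))
    scores = {g: metric(y_true[groups == g], y_score[groups == g])
              for g in np.unique(groups)}
    return max(scores.values()) - min(scores.values()), scores
```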
Datasets and setups:
- Histopathology (CAMELYON17/WILDS): Five hospitals; three in-distribution (ID) for training, two held out as OOD. Labeled training: 302,436 patches; unlabeled: 1.8M patches from the training hospitals. Task: presence of metastases. Diffusion model trained at 96×96 on labeled plus unlabeled ID data, conditioned on the label or on label+hospital ID (conditioning vectors zero-padded for unlabeled samples). Sampling assumed a uniform hospital ID distribution with preserved label prevalence. Diagnostic classifier: patch-based ResNet-152. Synthetic-to-real ratio: 50:50.
- Chest radiology (CheXpert for training, ChestX-ray14 for OOD evaluation): Training data: 201,055 CheXpert images (male 119,352; female 81,703). OOD evaluation: 17,723 NIH ChestX-ray14 images (male 10,228; female 7,495). Two DDPMs: a low-resolution 64×64 model and an upsampler to 224×224, trained on labeled ID data only; no unlabeled or OOD data were available during training. Conditioning on the diagnostic label only. The best diagnostic training mix was purely synthetic (synthetic-to-real 100:0). BiT-ResNet-152 backbone with separate binary heads per condition.
- Dermatology: 27-class multiclass setting with metadata attributes (sex, age, Fitzpatrick skin tone). Three evaluation datasets: ID holdout (n=1,349), OOD1 (n=6,639; Australian cohort, enriched for high-risk cases), OOD2 (n=3,900; Colombian teledermatology). Generative pipeline: a 64×64 DDPM plus an upsampler to 256×256, trained on labeled ID data and unlabeled data from the ID and OOD domains (no OOD labels used). Conditioning on the label or label+demographic attribute (zero-padded for unlabeled images; see the conditioning sketch after this list). Optimal diagnostic training mix: 75% synthetic, 25% real. To study bias severity, training datasets were artificially skewed along sex, skin tone and age at multiple skew levels; separate diffusion models were trained per skew setting.
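The zero-padded conditioning mentioned for the histopathology and dermatology pipelines might be encoded as follows; this is a sketch under our own assumptions (the function name, the one-hot encoding and the attribute dimensionality are illustrative, not taken from the paper).

```python
import numpy as np

def encode_condition(label, attribute, n_labels, n_attrs):
    """One-hot (label, attribute) conditioning; zero vectors when unknown."""
    y = np.zeros(n_labels, dtype=np.float32)
    if label is not None:
        y[label] = 1.0
    a = np.zeros(n_attrs, dtype=np.float32)
    if attribute is not None:
        a[attribute] = 1.0
    return np.concatenate([y, a])  # passed to the diffusion UNet as conditioning

# Labeled dermatology image, class 3, skin-tone bucket 1 (hypothetical sizes):
#   encode_condition(3, 1, n_labels=27, n_attrs=6)
# Unlabeled image: an all-zero conditioning vector:
#   encode_condition(None, None, n_labels=27, n_attrs=6)
```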
Models and training: DDPMs follow Ho et al., with classifier-free guidance and UNet backbones; standard hyperparameters (see Extended Data Table 2). Upsampler preprocessing included bilinear upsampling with anti-aliasing, Gaussian noise/blur, quantization and normalization. Diagnostic backbones: ResNet-152 for histopathology, BiT-ResNet-152 for radiology and BiT-ResNet-101 for dermatology. Heuristic augmentation baselines included ImageNet-style color/geometric transforms and RandAugment; additional baselines covered different pretraining (JFT, ImageNet-21K), oversampling and focal loss. All methods were trained/fine-tuned with matched steps and batch sizes; results were averaged over five seeds.
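For context, classifier-free guidance trains the model with conditioning randomly dropped and, at sampling time, blends conditional and unconditional predictions; a generic sketch (`eps_model`, `null_cond` and `guidance_scale` are our names, not the paper's) follows.

```python
def guided_eps(eps_model, x_t, t, cond, null_cond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional noise
    prediction toward the conditional one at each denoising step."""
    eps_uncond = eps_model(x_t, t, null_cond)  # e.g. all-zero conditioning
    eps_cond = eps_model(x_t, t, cond)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

Larger `guidance_scale` values trade sample diversity for stronger adherence to the conditioning signal.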
Analysis: Distribution shifts were quantified via maximum mean discrepancy (MMD) on penultimate-layer features across domains (ID train, ID eval, OOD, Generated). PCA-based analysis quantified representation compression. Dermatology synthetic image realism and diagnosability were assessed by board-certified dermatologists rating 488 synthetic images across common and high-risk classes; real-image benchmarks included 789 real images rated by 101 dermatologists.
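A compact recipe for an MMD statistic on penultimate-layer features, with an RBF kernel and bandwidth chosen by us for illustration (the paper does not specify this exact estimator):

```python
import numpy as np

def mmd2_rbf(X, Y, sigma=1.0):
    """Biased (V-statistic) estimate of squared MMD with an RBF kernel."""
    def k(A, B):
        # Pairwise squared Euclidean distances, then the Gaussian kernel.
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

# e.g. mmd2_rbf(feats_id_train, feats_generated) for (n, d) feature arrays
```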
Key Findings
- Across modalities, adding synthetic data improved robustness and fairness relative to strong baselines under in-distribution and OOD evaluation.
- Table 1 summary (synthetic-to-real ratios and improvements): Histopathology (50:50) achieved a 48.5% relative improvement in top-1 accuracy over the baseline without augmentations, with a 30.0% absolute reduction in the in-distribution fairness gap (best–worst hospital). Radiology (100:0, purely synthetic) achieved a 5.2% relative ROC-AUC improvement with a 0.031 absolute reduction in the OOD AUC fairness gap (sex). Dermatology (75:25) achieved a 27.3% relative improvement in high-risk sensitivity and a 0.044 absolute reduction in the OOD fairness gap (sex).
- Dermatology detailed results: Under more heavily skewed training data, label+property conditioning yielded 27.3% better high-risk sensitivity in-distribution and 63.5% better OOD relative to the baseline, while narrowing the OOD fairness gap 7.5-fold. Heuristic augmentations improved accuracy but often harmed fairness OOD; oversampling modestly helped fairness; synthetic data improved both accuracy and fairness, especially OOD.
- Radiology: Methods using synthetic data improved the AUC–fairness trade-off for sex (and race in-distribution), outperforming baselines including JFT and ImageNet-21K pretraining, RandAugment, and focal loss; all methods trained/fine-tuned under identical regimes.
- Histopathology label efficiency: Synthetic data consistently outperformed baselines across varying labeled-data budgets in-distribution and in the low-data OOD regime. Combining color augmentation with synthetic data gave roughly 5% OOD improvement in the high-data regime and roughly 4.3% in the low-data regime. Models trained with synthetic data matched the ID performance of baselines trained on 1,000 labeled samples while using far less labeled data, amounting to roughly 3× label efficiency in the low-data regime.
- Expert evaluation of synthetic dermatology images: Raters deemed 50.0 ± 12.6% of 488 synthetic images diagnosable; on synthetic images they achieved 56.0 ± 11.9% top-1 and 67.7 ± 12.5% top-3 accuracy, comparable to performance on real images (top-1 54.0 ± 21.1%, top-3 67.1 ± 22.7% across 789 real images rated by 101 dermatologists).
- Distribution alignment: Learned augmentations reduced domain mismatch; MMD decreased on average by 18.6% versus heuristic augmentations (statistically significant). Synthetic data also yielded more compact representations, requiring 5.4% fewer principal components to retain 90% of the variance (see the sketch after this list), suggesting reduced reliance on spurious domain-specific correlations.
- In some settings (histopathology, radiology), training downstream classifiers purely on generated data still outperformed real-only baselines; in dermatology, real data remained more essential.
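The representation-compactness figure above (5.4% fewer components for 90% variance) can be reproduced in spirit with a few lines of PCA bookkeeping; this is our sketch, not the authors' analysis code.

```python
import numpy as np

def n_components_for_variance(features, threshold=0.90):
    """Smallest number of principal components explaining `threshold` of
    the total variance of an (n_samples, n_features) array."""
    centered = features - features.mean(axis=0, keepdims=True)
    # Squared singular values of the centered matrix are proportional to
    # the per-component variances.
    s = np.linalg.svd(centered, compute_uv=False)
    explained = (s ** 2) / (s ** 2).sum()
    return int(np.searchsorted(np.cumsum(explained), threshold) + 1)
```

Comparing this count for features of classifiers trained with and without synthetic data yields the compression comparison reported above.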
Discussion
The study addresses the dual challenge of OOD robustness and subgroup fairness by using diffusion-generated augmentations to rebalance attributes while preserving disease prevalence. Conditioning on labels and relevant properties (hospital ID or demographics) enables targeted enrichment of underrepresented subgroups. Results across histopathology, radiology and dermatology show consistent gains in diagnostic performance and reduced fairness gaps, with the largest benefits under distribution shifts. Analyses suggest mechanisms for improvement: synthetic images provide canonical, diverse exemplars that better align feature distributions across domains and encourage learning disease-specific features rather than spurious domain cues, improving generalization. Label efficiency indicates potential for substantial performance gains with limited labeled data, pivotal in clinical contexts with rare conditions or scarce expert labels. Although synthetic data can, in some tasks, replace real data during classifier training, they are best viewed as a regularizer and complement to real data. The approach is scalable and potentially synergistic with privacy-preserving techniques (e.g., federated learning) if private synthetic data can be generated, though privacy risks remain. Overall, generative augmentation can improve fairness transfer across distribution shifts, a notoriously difficult goal in medical ML.
Conclusion
This work demonstrates that conditional diffusion models can generate high-quality, steerable augmentations that improve diagnostic accuracy and statistical fairness for medical imaging classifiers under distribution shifts. By preserving disease prevalence while sampling sensitive attributes uniformly, the method enriches underrepresented subgroups without penalizing overrepresented ones. Improvements were observed across three distinct modalities, with particularly strong OOD gains and notable label efficiency. Future directions include: advancing generative model quality and privacy guarantees to mitigate memorization and bias; reducing reliance on predefined categorical sensitive attributes (e.g., moving toward representation-based conditioning); accounting for unobserved characteristics and social determinants; extending to whole-slide analysis in pathology; and leveraging complementary augmentation regimes (including ensembling) to further reduce the number of underserved individuals.
Limitations
- Synthetic data are not a replacement for robust, representative data collection; they should complement real data with rigorous external evaluation.
- Risks include generative model biases or poor quality exacerbating disparities, failure to generate certain (label, attribute) combinations, or producing incorrect images that mislead downstream classifiers.
- Synthetic images may appear realistic, leading to overconfidence; transparency about and tagging of synthetic images are important for reuse.
- Evaluation conclusions are limited by dataset diversity and representativeness; OOD validation is essential. Patch-level evaluation in histopathology may have limited clinical applicability versus whole-slide analysis.
- Sensitive attributes may be missing or inadequately captured; binary sex labels do not encompass gender diversity; Fitzpatrick scale has limitations for darker skin tones; structural inequities encoded in data can be propagated.
- Privacy concerns remain; without guarantees, synthetic data may memorize or leak sensitive training information.