Improving model fairness in image-based computer-aided diagnosis
M. Lin, T. Li, et al.
The study addresses the growing concern that deep learning models used for medical image-based diagnosis may encode and amplify biases across protected attributes such as race, sex, and age, leading to under- and over-diagnosis in certain groups. Fairness here is defined as the absence of prejudice toward individuals or groups based on inherent or acquired characteristics. While prior work has identified biases in medical imaging AI, methods to reduce such bias often degrade model performance and have rarely been evaluated on large, diverse datasets. The authors aim to reduce model decision bias at both individual and intersectional subgroup levels by optimizing a metric aligned with clinical ranking use cases. They evaluate fairness and performance across four public datasets (COVID-19 detection on MIDRC chest X-rays, thorax abnormality detection on MIMIC-CXR, primary open-angle glaucoma (POAG) detection on OHTS optic disc images, and late age-related macular degeneration (AMD) detection on AREDS fundus images), hypothesizing that a training approach targeting marginal pairwise equal opportunity can reduce disparities while preserving overall AUC.
The paper situates its contribution within literature demonstrating high-performing medical imaging AI alongside ethical concerns of bias and fairness in healthcare ML. Prior studies have documented biases by race, sex, and age in medical AI (e.g., underdiagnosis in underserved populations). Existing fairness-improving approaches (e.g., constraint-based group fairness, pruning, confounder-free training, adversarial reweighting) often reduce overall performance and are seldom validated on large, real-world datasets. Conventional fairness metrics such as equalized odds and demographic parity focus on binarized decision disparities (FNR/FPR) and require thresholds, which may not align with clinical ranking and resource allocation. Pairwise Fairness (marginal pairwise equal opportunity) for bipartite ranking has been proposed as a scale-invariant, threshold-free alternative better suited for clinical decision support. The authors build on this notion to directly optimize fairness in medical image classification.
Datasets: Four large, publicly available cohorts were used. (1) MIDRC: chest X-ray repository for COVID-19 diagnosis; 77,887 images from 27,799 individuals with age, sex, and race metadata. (2) MIMIC-CXR: chest radiographs labeled via CheXpert; 212,567 PA/AP images from 227,827 studies with self-reported age, sex, and race. (3) OHTS: optic disc images for primary open-angle glaucoma (POAG) diagnosis; 37,399 images from 1,636 participants; gold-standard labels from masked certified readers. (4) AREDS: color fundus photographs for late AMD; 66,060 images from 4,566 patients with sex, age, and genotypes (CFH rs1061170, ARMS2 rs10490924). Race was excluded from the AREDS analysis because the Black subgroup was very small (<3.7%).
Fairness metric: Marginal pairwise equal opportunity (Pairwise Fairness) computes the probability that a randomly selected positive sample from a subgroup is ranked above a randomly selected negative sample from the entire dataset, equivalent to an AUC-like ranking measure per subgroup. Pairwise Fairness Difference (PFD) is defined as the max minus min Pairwise Fairness across subgroups; larger PFD indicates greater disparity.
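The metric described above can be sketched in a few lines of plain Python. This is an illustrative reading of the definition, not the authors' code: the function names, the toy scores, and the choice to count ties as half a correct ordering (the usual AUC convention) are all assumptions made for the example.

```python
from itertools import product


def pairwise_fairness(scores, labels, groups, group):
    """Estimate P(positive from `group` is scored above a negative from the whole dataset)."""
    positives = [s for s, y, g in zip(scores, labels, groups) if y == 1 and g == group]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive in the group and one negative overall")
    # Assumption: ties count as half a correct ordering, as in the usual AUC estimate.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(positives, negatives))
    return wins / (len(positives) * len(negatives))


def pairwise_fairness_difference(scores, labels, groups):
    """Return (PFD, per-group Pairwise Fairness), where PFD = max - min over subgroups."""
    per_group = {g: pairwise_fairness(scores, labels, groups, g)
                 for g in sorted(set(groups))}
    return max(per_group.values()) - min(per_group.values()), per_group


# Toy data (made up for illustration): two subgroups, four samples each.
scores = [0.9, 0.8, 0.3, 0.2, 0.6, 0.4, 0.7, 0.1]
labels = [1, 1, 0, 0, 1, 1, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

pfd, per_group = pairwise_fairness_difference(scores, labels, groups)
print(per_group)  # {'A': 1.0, 'B': 0.75}
print(pfd)        # 0.25
```

In this toy case every positive from group A outranks every negative, while each positive from group B is outranked by one negative (score 0.7), so the gap of 0.25 is the PFD.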
Models: Primary backbone was DenseNet-201 (ImageNet-pretrained) for AREDS, OHTS, MIMIC-CXR; DenseNet-121 (CheXpert-pretrained) for MIDRC. To test generalizability, ResNet-152 was also evaluated on MIDRC and OHTS. Final layers were replaced with a 2-output fully connected layer (abnormal vs normal).
Proposed training objective: Instead of binary cross-entropy, training optimizes marginal ranking loss for the subgroup with the lowest current Pairwise Fairness within each batch. For each batch: (a) compute Pairwise Fairness per subgroup; (b) select subgroup with the minimum value; (c) form pairs between positive samples from that subgroup and negatives from the entire training set; (d) apply margin ranking loss over all pairs. This directly corrects mis-ordered rankings, emphasizing the worst-off group to promote equitable improvements and reduce PFD while maintaining overall performance.
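Steps (a) to (d) can be illustrated with a small framework-free sketch. It is not the authors' implementation: the function names are hypothetical, the margin of 1.0 and the mean reduction are assumed (the summary does not state them), and negatives are drawn from the batch for simplicity, whereas the text above draws them from the entire training set.

```python
def group_pairwise_fairness(scores, labels, groups, group):
    """Fraction of (group positive, any negative) pairs that are ordered correctly."""
    positives = [s for s, y, g in zip(scores, labels, groups) if y == 1 and g == group]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    correct = sum(p > n for p in positives for n in negatives)
    return correct / (len(positives) * len(negatives))


def worst_group_margin_ranking_loss(scores, labels, groups, margin=1.0):
    """Return (worst subgroup, mean hinge loss over its positive-negative pairs)."""
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    # Only subgroups with at least one positive sample in the batch can be scored.
    candidates = sorted({g for y, g in zip(labels, groups) if y == 1})
    if not candidates or not negatives:
        raise ValueError("batch needs at least one positive and one negative sample")
    # (a) + (b): compute Pairwise Fairness per subgroup and pick the minimum.
    worst = min(candidates,
                key=lambda g: group_pairwise_fairness(scores, labels, groups, g))
    # (c): pair positives of the worst subgroup with all negatives.
    positives = [s for s, y, g in zip(scores, labels, groups) if y == 1 and g == worst]
    # (d): margin ranking (hinge) loss, max(0, margin - (positive - negative)).
    losses = [max(0.0, margin - (p - n)) for p in positives for n in negatives]
    return worst, sum(losses) / len(losses)


# Toy batch (made-up scores): group B's only positive is outranked by one negative.
scores = [2.0, 0.5, 1.5, 0.2]
labels = [1, 1, 0, 0]
groups = ["A", "B", "A", "B"]
worst, loss = worst_group_margin_ranking_loss(scores, labels, groups)
print(worst, loss)
```

In an actual training loop the same quantity would be computed on tensors so that gradients can flow; PyTorch's `torch.nn.MarginRankingLoss` implements the same hinge form and would be one natural choice, although the summary does not say which implementation was used.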
Preprocessing and training: All images resized to 224×224×3. For MIDRC, DICOMs converted to JPG, normalized to [0,255], inverted if necessary, histogram equalization applied, saved at quality 95. Data augmentations: random rotations (0–10 degrees), horizontal/vertical flips. Optimizer: Adam with learning rate 1e-4; batch size 96; train for 20 epochs; best model by dev-set AUC saved. Implementation: PyTorch. Hardware: Intel Core i9-9960X CPU, NVIDIA Quadro RTX 6000 GPU.
Experimental protocol: MIDRC, AREDS, OHTS split at patient level into 80% train and 20% hold-out test; MIMIC-CXR used official splits. Experiments repeated five times to report means and standard deviations. Evaluation metrics: overall AUC and PFD (difference between max and min subgroup Pairwise Fairness). Relative change between proposed and baseline defined as (proposed − baseline)/baseline, reported for AUC and PFD.
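A patient-level split, as opposed to an image-level one, keeps all images of a person on the same side of the train/test boundary. The sketch below shows one simple way to do this; the function name, the seed, and the synthetic patient IDs are assumptions for illustration, and the summary does not describe how the authors performed their split.

```python
import random


def patient_level_split(patient_ids, test_fraction=0.2, seed=0):
    """Split image indices so that no patient appears in both train and test."""
    unique_patients = sorted(set(patient_ids))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(unique_patients)
    n_test = max(1, round(len(unique_patients) * test_fraction))
    test_patients = set(unique_patients[:n_test])
    train_idx = [i for i, p in enumerate(patient_ids) if p not in test_patients]
    test_idx = [i for i, p in enumerate(patient_ids) if p in test_patients]
    return train_idx, test_idx


# Hypothetical cohort: 10 patients with 3 images each (30 images in total).
patient_ids = [f"p{i // 3}" for i in range(30)]
train_idx, test_idx = patient_level_split(patient_ids)

train_patients = {patient_ids[i] for i in train_idx}
test_patients = {patient_ids[i] for i in test_idx}
print(len(train_idx), len(test_idx))             # 24 6
print(train_patients.isdisjoint(test_patients))  # True
```

Because the 80/20 ratio is applied to patients rather than images, the image-level proportions are only approximately 80/20 when patients contribute different numbers of images.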
Baseline: Standard deep convolutional network trained with binary cross-entropy loss matching each backbone.
- Across four tasks and cohorts, the proposed marginal ranking loss consistently reduced fairness disparities (PFD) while maintaining overall AUC relative to a binary cross-entropy baseline. In many cases AUC improved.
- Dataset/task scales: MIDRC COVID-19 (77,887 images; 27,799 individuals), MIMIC-CXR thorax abnormality (212,567 images; 227,827 studies), OHTS POAG (37,399 images; 1,636 individuals), AREDS Late AMD (66,060 images; 4,566 individuals).
- Quantitative relative changes (proposed vs baseline), means over five runs:
  • COVID-19 (MIDRC): Age AUC −2.00%, PFD −40.25%; Sex AUC −0.96%, PFD −53.79%; Race AUC −2.44%, PFD −39.73%; Age–Race AUC −2.54%, PFD −47.69%.
  • Thorax abnormality (MIMIC-CXR): Age AUC −0.01%, PFD −35.74%; Sex AUC −0.73%, PFD −35.33%; Race AUC −1.21%, PFD +31.70%; Age–Sex AUC −0.92%, PFD −49.24%.
  • POAG (OHTS): Age AUC +1.42%, PFD −53.82%; Sex AUC +0.72%, PFD −35.74%; Race AUC +2.32%, PFD −35.10%; Age–Sex AUC +0.34%, PFD −43.85%.
  • Late AMD (AREDS): Age AUC +0.02%, PFD −25.22%; Sex AUC +0.06%, PFD +5.00%; CFH AUC +0.15%, PFD −29.06%; ARMS2 AUC +0.15%, PFD −49.73%; Age–CFH AUC +0.06%, PFD −28.37%.
- Overall, 15 of the 17 evaluated cases showed PFD reductions, and 12 of those reductions exceeded 35%. Most relative AUC changes were within about 1%; the exceptions were modest AUC decreases for some MIDRC and MIMIC-CXR attributes and AUC increases for some OHTS attributes.
- Subgroup vulnerability patterns (lower AUC vs counterparts):
  • MIDRC COVID-19: male, age ≥75, and Other races had lower AUCs.
  • MIMIC-CXR thorax abnormality: age ≥60, male, and Black had lower AUCs.
  • OHTS POAG: age <60, female, and Other races had lower AUCs.
  • AREDS Late AMD: age <65 had lowest AUC; sex AUCs comparable; genotype CFH TT and ARMS2 GG had lowest AUCs.
- Intersectional groups exhibited amplified disparities; proposed method reduced PFDs while keeping AUC comparable: MIDRC age–race, MIMIC-CXR age–sex, OHTS age–sex, AREDS age–CFH.
- Generalizability: Using ResNet-152 on MIDRC and OHTS reproduced the DenseNet findings—lower PFD and comparable or higher AUCs across attributes and intersectional groups.
- Data imbalance effects observed: higher disease prevalence or smaller subgroup sample sizes associated with lower AUC and increased disparities (e.g., OHTS age groups, MIDRC race distribution).
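The tallies quoted in the findings above can be re-derived from the listed relative PFD changes. The short script below simply transcribes those percentages into a dictionary (the variable names are ours) and counts the cases; it adds no data beyond what the list reports.

```python
# Relative PFD change (%) of the proposed method vs the baseline, as listed above.
pfd_change = {
    "MIDRC": {"Age": -40.25, "Sex": -53.79, "Race": -39.73, "Age-Race": -47.69},
    "MIMIC-CXR": {"Age": -35.74, "Sex": -35.33, "Race": 31.70, "Age-Sex": -49.24},
    "OHTS": {"Age": -53.82, "Sex": -35.74, "Race": -35.10, "Age-Sex": -43.85},
    "AREDS": {"Age": -25.22, "Sex": 5.00, "CFH": -29.06, "ARMS2": -49.73,
              "Age-CFH": -28.37},
}

# Flatten to (dataset, attribute, change) triples.
cases = [(dataset, attribute, change)
         for dataset, attributes in pfd_change.items()
         for attribute, change in attributes.items()]

reduced = [c for c in cases if c[2] < 0]            # PFD went down
large = [c for c in reduced if c[2] < -35.0]        # reduction larger than 35%
increased = [(d, a) for d, a, change in cases if change > 0]

print(len(cases), len(reduced), len(large))  # 17 15 12
print(increased)  # [('MIMIC-CXR', 'Race'), ('AREDS', 'Sex')]
```

The two cases where PFD rose are the same ones noted in the limitations: race on MIMIC-CXR and sex on AREDS.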
The findings support the hypothesis that directly optimizing marginal pairwise equal opportunity can reduce subgroup disparities in medical image classification while maintaining overall utility. By focusing training on the subgroup with the lowest current Pairwise Fairness and using margin ranking loss to correct mis-ordered positive–negative rankings, the method improves fairness across both individual and intersectional attributes. This aligns with clinical use where risk scores are used for triage and decision support, making pairwise ranking a more appropriate fairness target than threshold-dependent metrics. The study emphasizes advantages of PFD over traditional fairness metrics: it is scale- and threshold-invariant, evaluates bipartite ranking relevant to clinical prioritization, and better captures disparities in how probability scores are used. Results across large, diverse datasets reveal pervasive biases and show that fairness improvements do not necessarily entail performance loss; AUC mostly remained within ±1% while PFD often dropped by >35%. Analyses highlight the role of data imbalance in inducing bias—unequal prevalence and sample size across subgroups correlate with lower AUC and higher disparities. While oversampling can sometimes help, the proposed training strategy more consistently reduces PFD with minimal AUC trade-off. Intersectional disparities were larger than single-attribute disparities, often involving age, pointing to disease epidemiology and dataset imbalance as contributing factors. The method also proved robust when subgroup disparities were already small, further reducing PFD, and generalized across backbones (DenseNet, ResNet).
The study introduces a training approach that optimizes marginal pairwise equal opportunity to improve fairness in medical image-based diagnosis. Across four large-scale tasks, the method substantially reduced disparities among individual and intersectional subgroups while maintaining overall AUC, addressing concerns about biased AI in clinical imaging. The approach is model-agnostic and generalizable across backbones and datasets, suggesting suitability for clinical deployment to promote equitable outcomes. Future research should evaluate and mitigate calibration bias in predicted probabilities, extend the method to continuous attributes and multi-class settings, and continue investigating strategies that balance fairness with performance under real-world data imbalance constraints.
- The evaluation focused on binarized classifiers and did not assess calibration of predicted probabilities, which may reflect over- or under-confidence in specific subgroups.
- Some subgroup analyses were limited by data imbalance and small sample sizes (e.g., AREDS Black subgroup too small to analyze), potentially affecting generalizability of subgroup-specific conclusions.
- Fairness improvements were primarily measured via Pairwise Fairness and PFD; other fairness notions (e.g., equalized odds) were not concurrently optimized, and in two cases (MIMIC-CXR race and AREDS sex) PFD increased rather than decreased.
- The approach was evaluated on four public datasets; broader validation across additional modalities, institutions, and prospective clinical settings is needed.