Racial Disparities in Pulse Oximetry Cannot Be Fixed With Race-Based Correction

N. Patwari, D. Huang, et al.

This study by Neal Patwari, Di Huang, and Francesca Bonetta-Misteli reveals a concerning racial bias in pulse oximeters, which leads to a higher incidence of occult hypoxemia among patients of color. It highlights the crucial need to address the variance in pulse oximeter measurements between Black and white patients, emphasizing that simply adjusting hypoxemia thresholds won't suffice for equitable care.
Introduction

Pulse oximeters have been central to COVID-19 care and influence triage, treatment eligibility, and reimbursement decisions. Decades of studies have demonstrated racial disparities in pulse oximetry accuracy, with higher positive bias for patients racialized as Black, especially at lower oxygenation levels, leading to greater odds of occult hypoxemia compared to white patients. This contributes to inequitable access to care and potentially worse outcomes for patients of color. The study distinguishes statistical bias (mean error of SpO2 vs SaO2) from discriminatory impact, and investigates whether race-based correction could equalize hypoxemia detection performance. The authors hypothesize that disparities are not limited to mean bias but also include differences in error variance and distribution shape, which would prevent equalizing performance through race-based thresholds alone.

Literature Review

Prior work over three decades documents higher SpO2 readings relative to SaO2 in patients of color, particularly Black patients, and increased odds of occult hypoxemia (reported odds ratios ~1.4–3.0). Several papers discuss the dependence of bias on oxygenation level and question race-based corrections in clinical algorithms due to ethical, practical, and historical concerns. Clinical commentaries have requested more detailed statistical characterization (beyond means) of pulse oximeter errors by race. Regulatory and methodological discussions highlight insufficient representation of darkly pigmented subjects in device validation and the need to account for skin pigmentation in studies.

Methodology

Data source: eICU Collaborative Research Database (eICU-CRD), 2014–2015 ICU records from 208 hospitals with >139k patients. Extracted 335k SaO2 measurements (arterial blood gas, ABG) from 73k patients and ~140 million SpO2 measurements from 196k patients (vitalPeriodic, physicalExam, nurseCharting tables). For each SaO2, selected the temporally closest SpO2 for the same patient; retained pairs if measured within T=10 minutes and race was not “other/unknown.” Sensitivity checks with T=5 and T=20 minutes yielded similar results. Final dataset: 218k (SaO2, SpO2) pairs; racial distribution of pairs: white 80.8%, Black 11.3%, Asian 1.9%, Hispanic 5.2%, Native American 0.8%. Due to sparse data below SpO2 87% for some groups, analyses stratifying by SpO2 range were limited to SpO2 ≥87%.

Metrics and definitions: SpO2 error = SpO2 − SaO2. Statistical bias = E[SpO2 − SaO2]. Stratified analyses by measured SpO2 ranges: Low [87–91], Medium [92–96], High [97–100]. Hypoxemia defined by SaO2 < 88%.

Statistical analyses: Compared mean errors using two-sided Welch’s t-test vs white group (unequal variances). Compared variances using one-sided F-tests (Asian vs white, Black vs white). Assessed tail risk (P[error > 10]) and tested differences with one-sided tests of equal proportions (normal approximation).

Hypoxemia detection model: Modeled hypoxemia detection using a single-threshold detector on SpO2: decide hypoxemia if SpO2 < γ. Evaluated probability of detection (true positive rate) and probability of false alarm (false positive rate) across γ values (half-integer thresholds to align with integer SpO2). Explored whether a single or race-dependent γ could equalize performance across racial groups.
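The pairing step and the error metrics described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names `pair_nearest` and `error_summary`, the in-memory list layout, the toy readings, and the choice to treat the 10-minute limit as inclusive are all assumptions made for the example.

```python
from bisect import bisect_left
from statistics import mean, stdev


def pair_nearest(sao2, spo2, max_gap_min=10.0):
    """Pair each (time, SaO2) with the temporally closest (time, SpO2).

    Both inputs are lists of (time_in_minutes, value) for ONE patient.
    Pairs farther apart than max_gap_min are dropped (inclusive limit
    is an assumption of this sketch).
    """
    spo2 = sorted(spo2)
    times = [t for t, _ in spo2]
    pairs = []
    for t_abg, sa in sao2:
        i = bisect_left(times, t_abg)
        # Candidates: the SpO2 readings just before and just after the ABG time.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(spo2)]
        if not candidates:
            continue
        j = min(candidates, key=lambda k: abs(times[k] - t_abg))
        if abs(times[j] - t_abg) <= max_gap_min:
            pairs.append((sa, spo2[j][1]))
    return pairs


def error_summary(pairs, tail=10.0):
    """Bias, standard deviation, and tail probability of SpO2 - SaO2."""
    errors = [sp - sa for sa, sp in pairs]
    return {
        "bias": mean(errors),                       # estimate of E[SpO2 - SaO2]
        "sd": stdev(errors) if len(errors) > 1 else 0.0,
        "p_tail": sum(e > tail for e in errors) / len(errors),
    }


# Toy readings for one hypothetical patient (not taken from eICU-CRD).
sao2 = [(0.0, 90.0), (60.0, 86.0), (200.0, 95.0)]
spo2 = [(3.0, 93.0), (58.0, 97.0), (75.0, 92.0), (150.0, 96.0)]

pairs = pair_nearest(sao2, spo2)
print(pairs)
print(error_summary(pairs))
```

In the toy data, the third blood gas has no SpO2 reading within 10 minutes, so it is dropped and two pairs remain. Computing the same summary per racial group would give the bias, spread, and tail-risk figures that the analysis compares.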

Key Findings
  • Dataset: 218k matched (SaO2, SpO2) pairs; pairs from white patients constituted 80.8% of data.
  • Overall positive bias across all groups (mean 1.71%), indicating SpO2 tends to overestimate SaO2.
  • Mean bias by race (Table 1): white 1.58% (median 1.00%), Black 2.60% (median 1.70%), Asian 2.47% (median 1.30%), Hispanic 1.47% (median 1.20%), Native American 1.81% (median 1.40%). Black and Asian vs white differences were highly significant (Welch’s t-test p < 1e−13); Hispanic and Native American differences were not significant (p > 0.05).
  • Bias vs SpO2 level: For all groups, mean bias decreased as SpO2 decreased; racial disparity in bias increased at lower SpO2. Black–white bias gap ~0.93 at high SpO2, ~1.70 at low/medium SpO2.
  • Variance differences: Standard deviation of SpO2 error decreased with increasing SpO2 for all groups, but was consistently higher for Black and Asian vs white patients; variance differences significant (one-sided F-test p < 0.001).
  • Heavier tails for Black patients: Probability of large error (SpO2 − SaO2 > 10): white 0.0486, Black 0.0813 (≈67% higher than white, p < 0.001), Asian 0.0537 (p = 0.068), Hispanic 0.0384, Native American 0.0407.
  • Hypoxemia detection (SaO2 < 88%) using single-threshold γ on SpO2: No single γ yields identical detection and false-alarm probabilities for Black and white patients. Example: γ=88.5 gives detection 30% and false alarm 2.5% for white vs detection 22% and false alarm 2.8% for Black. Race-dependent thresholds cannot simultaneously equalize both metrics; e.g., γ=90.5 for Black raises detection to 29.4% but false alarm remains higher (4.0% vs 2.5% for white at γ=88.5). For Asian vs white, performance curves nearly overlap; a ~1.0 shift in γ (or subtracting ~1 from Asian SpO2) approximately aligns performance across a range, though limited by smaller Asian sample size.
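The single-threshold detector behind the last finding can be sketched as below. The helper name `detection_rates` and the toy pairs are hypothetical, and the numbers it prints are not the study's results; it only shows how the probability of detection and the probability of false alarm are read off a set of (SaO2, SpO2) pairs for a given γ.

```python
def detection_rates(pairs, gamma, sao2_cutoff=88.0):
    """Return (P_detection, P_false_alarm) for the rule 'SpO2 < gamma'.

    pairs: list of (SaO2, SpO2); ground-truth hypoxemia is SaO2 < sao2_cutoff.
    """
    hypoxemic = [sp for sa, sp in pairs if sa < sao2_cutoff]
    normoxemic = [sp for sa, sp in pairs if sa >= sao2_cutoff]
    # Detection: fraction of truly hypoxemic pairs that the rule flags.
    p_d = sum(sp < gamma for sp in hypoxemic) / len(hypoxemic)
    # False alarm: fraction of non-hypoxemic pairs that the rule flags.
    p_fa = sum(sp < gamma for sp in normoxemic) / len(normoxemic)
    return p_d, p_fa


# Toy (SaO2, SpO2) pairs for illustration only.
toy = [(85, 87), (86, 90), (84, 92), (87, 88),
       (92, 94), (95, 96), (90, 88), (97, 99)]

for gamma in (88.5, 90.5):
    p_d, p_fa = detection_rates(toy, gamma)
    print(f"gamma={gamma}: detection={p_d:.2f}, false alarm={p_fa:.2f}")
```

Running this per racial group over a grid of half-integer γ values would trace out one detection versus false-alarm curve per group, which is the comparison the bullet above summarizes.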
Discussion

Findings demonstrate that racial disparities in pulse oximetry are not solely differences in mean error but also include larger variance and heavier error tails for patients racialized as Black. Consequently, race-based corrections or race-dependent thresholds cannot equalize hypoxemia detection performance between Black and white patients. Potential mechanisms include optical attenuation by melanin, especially at red wavelengths (~650 nm), which reduces signal-to-noise ratio and increases variance in SpO2 estimates, and the social construction of race in the U.S., which groups a broader range of skin pigmentation under “Black,” increasing within-group variance of measurement errors. These results imply that equitable clinical monitoring requires addressing both bias and variance in device design and calibration, collecting and analyzing skin color alongside race/ethnicity, ensuring adequate representation of patients of color in validation studies, and refining regulatory standards to detect and mitigate disparities before deployment. While Asian vs white disparities in detection performance appear potentially correctable by a simple shift, this does not generalize to Black vs white and is limited by smaller sample sizes for Asian patients.
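One way to see why a threshold shift can absorb a difference in mean error but not a difference in spread is a simple Gaussian error model. The sketch below assumes SpO2 = SaO2 + bias + sd·Z with Z standard normal; this model and every parameter value are hypothetical choices for illustration, not quantities estimated in the study.

```python
from statistics import NormalDist


def p_alarm(s, gamma, bias, sd):
    """P(SpO2 < gamma | SaO2 = s) when SpO2 = s + bias + sd * Z, Z ~ N(0, 1)."""
    return NormalDist(mu=s + bias, sigma=sd).cdf(gamma)


# Hypothetical reference group.
ref = dict(gamma=88.5, bias=1.5, sd=3.0)

# Case 1: second group differs only in bias; shift gamma by the bias gap (1.0).
shift_only = dict(gamma=88.5 + 1.0, bias=2.5, sd=3.0)

# Case 2: second group differs in bias AND spread; the same shift is applied.
shift_and_spread = dict(gamma=88.5 + 1.0, bias=2.5, sd=4.5)

for s in (84, 87, 90, 93):
    a = p_alarm(s, **ref)
    b = p_alarm(s, **shift_only)
    c = p_alarm(s, **shift_and_spread)
    print(f"SaO2={s}: ref={a:.3f}  bias-only={b:.3f}  bias+spread={c:.3f}")
```

Under this assumed model, the bias-only group matches the reference at every SaO2 once γ is shifted by the bias gap. With the larger spread, the alarm probability is lower than the reference for hypoxemic SaO2 values and higher for non-hypoxemic ones. Raising γ further would increase both probabilities together, so it cannot close both gaps at once.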

Conclusion

Using a large multicenter ICU database, the study shows that SpO2 overestimates SaO2 on average across all groups and that Black patients experience both higher mean bias and greater variance in SpO2 errors than white patients. Because disparities exist in both mean and variance, no single threshold or race-based correction can equalize hypoxemia detection performance between Black and white patients. Achieving equitable pulse oximetry will require improved device design and calibration that reduce both bias and variance across skin tones, routine collection of skin color data, over-representation of patients of color in validation studies, and strengthened regulatory requirements. Future work should quantify the role of skin pigmentation directly, control for clinical confounders, and design sensing approaches robust to skin-dependent optical effects.

Limitations
  • Retrospective analysis using clinically collected data; potential unmeasured confounders (e.g., organ dysfunction scores such as SOFA) were not controlled and could influence associations.
  • Race/ethnicity categories are limited and socially constructed; lack of direct skin color measurements constrains mechanistic interpretation and may conflate race with pigmentation effects.
  • Dataset is skewed toward white patients (80.8% of pairs), limiting precision for Asian and Native American groups, especially at low SpO2 (<87%), and widening confidence intervals.
  • Matching window of 10 minutes (though sensitivity analyses at 5 and 20 minutes were similar) may introduce temporal mismatch noise.
  • Device makes/models and probe placements are heterogeneous and not controlled, potentially adding variability.
  • Findings on approximate correctability for Asian vs white are tentative because Asian patients contribute far fewer data points (1.9% of pairs vs 80.8% for white patients, roughly 40× fewer).