Medicine and Health
Quantifying disparities in intimate partner violence: a machine learning method to correct for underreporting
D. Shanmugam, K. Hou, et al.
PURPLE is a machine learning method, developed by Divya Shanmugam, Kaihua Hou, and Emma Pierson, for estimating the relative prevalence of underreported health conditions such as intimate partner violence. By modeling underreporting directly, PURPLE surfaces demographic disparities in health data that observed diagnosis rates can mask, yielding more plausible estimates than traditional methods.
Introduction
There are substantial disparities in women's health by race, age, socioeconomic status, and other factors, and policymakers often rely on relative prevalence ratios to identify and prioritize groups disproportionately affected. Underreporting of conditions—a pervasive problem in women’s health—distorts observed diagnoses and can misrepresent true disparities, particularly when diagnosis sensitivity varies across groups. Intimate partner violence (IPV) is a motivating example: only about a quarter of true cases are correctly diagnosed and diagnosis rates differ by race, risking biased disparity estimates if uncorrected. Existing approaches from epidemiology often require unavailable external information (e.g., test sensitivity/specificity, multiple tests, priors), while standard positive–unlabeled (PU) learning methods rely on separability or positive-subdomain assumptions unlikely to hold in clinical data. The paper introduces PURPLE, a method to estimate relative prevalence under underreporting using only observed diagnoses, symptoms, and group membership, under three assumptions (no false positives, random diagnosis within groups, and constant p(y=1|x) across groups). The goal is to enable accurate estimation of relative prevalences even when absolute prevalence is unidentifiable, and to quantify IPV disparities across demographic groups in large emergency department datasets.
Literature Review
The work situates within two bodies of literature: (1) Epidemiological methods for prevalence estimation under imperfect tests (e.g., latent class models, multiple tests, priors), which often require external ground truth or additional testing information not typically available for underreported women’s health conditions. (2) Machine learning methods for PU learning that assume only some positives are labeled and unlabeled data mix positives and negatives. Many PU methods recover prevalence under restrictive separability or positive-subdomain assumptions (i.e., regions with only positives), which are unrealistic in health settings where symptoms rarely imply a condition with 100% certainty (e.g., IPV-related symptoms are nonspecific). The authors instead adopt an assumption common in domain adaptation and some healthcare contexts—covariate shift implying constant p(y=1|x) across groups—to replace separability assumptions, and they provide assumption checks and robustness analysis. They also compare against PU baselines (Negative/NTC, KM2, DEDPUL, BBF) and discuss limitations of epidemiologic approaches requiring external diagnostics information.
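A small numeric sketch (with invented prevalence and sensitivity values, not figures from the paper) illustrates why raw diagnosis rates can mask true disparities when diagnosis sensitivity differs by group, which is the problem PURPLE is built to correct:

```python
# Hypothetical illustration: observed diagnosis rates vs. true prevalence.
# Under the model p(s=1|g) = p(y=1|g) * c_g (no false positives, random
# diagnosis within groups), group-specific sensitivities c_g distort the
# observed ratio. All numbers below are made up for illustration.

prev = {"A": 0.10, "B": 0.05}   # true prevalence p(y=1|g) (assumed)
c    = {"A": 0.20, "B": 0.40}   # diagnosis sensitivity p(s=1|y=1,g) (assumed)

true_ratio = prev["A"] / prev["B"]                # 2.0: group A twice as affected
observed = {grp: prev[grp] * c[grp] for grp in prev}  # p(s=1|g)
observed_ratio = observed["A"] / observed["B"]    # 1.0: disparity vanishes

print(true_ratio, observed_ratio)
```

Here the naive ratio of observed diagnosis rates suggests no disparity at all, because group A's lower diagnosis sensitivity exactly offsets its higher true prevalence.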
Methodology
Overview: PURPLE (Positive Unlabeled Relative Prevalence Estimator) estimates relative prevalence between groups using observed diagnoses s, features/symptoms x, and group membership g. The key insight is that estimating p(y=1|x) up to an unknown constant factor suffices to estimate the relative prevalence p(y=1|g=a)/p(y=1|g=b).

Model: p(s=1|x,g) = p(y=1|x) · p(s=1|y=1,g), under three assumptions:
(1) No false positives: p(s=1|y=0) = 0.
(2) Random diagnosis within groups (Selected Completely at Random within g): p(s=1|y=1,g) = c_g, independent of x.
(3) Constant p(y=1|x) across groups (covariate shift): p(y=1|x,g) = p(y=1|x).

Estimation: Parameterize p(y=1|x) with a logistic model (a single-layer neural network with logistic activation, equivalent to logistic regression) and p(s=1|y=1,g) with group-specific constants c_g; fit by minimizing the cross-entropy between predicted and empirical p(s=1|x,g). The model is identifiable only up to a multiplicative constant β on p(y=1|x) (with the reciprocal factor absorbed into c_g), but this constant cancels in the relative prevalence.

Relative prevalence estimator: the ratio of the mean predicted p(y=1|x) across groups, [(1/n_a) Σ_{i:g_i=a} ŷ_i] / [(1/n_b) Σ_{i:g_i=b} ŷ_i], where ŷ_i = p̂(y=1|x_i). The derivation shows that estimating p(y|x) up to a constant factor suffices, since the constant cancels between numerator and denominator.

Implementation: PyTorch single-layer NN with logistic activation; Adam optimizer (lr=0.001, eps=1e-8, no weight decay); early stopping on validation cross-entropy. L1 regularization over the high-dimensional ICD features, with λ ∈ {1e-2, 1e-1, 1, 10, 100, 0} selected by maximizing AUC against s on the validation set.

Data and experiments:
- Synthetic (Gauss-Synth): two 5D Gaussians for the groups, with logistic labeling and group-specific diagnosis frequencies; separable and non-separable variants.
- Semi-synthetic (MIMIC-semi-synth): MIMIC-IV ICD one-hot features with simulated y from suspicious symptom sets (common symptoms; high-relative-prevalence symptoms; endometriosis-correlated symptoms; IPV-related head/neck/facial injury codes), then s drawn with group-specific diagnosis frequencies.
- Real datasets: MIMIC-IV ED (293,297 ED visits; Boston-area hospital) and NEDS 2019 (33.1M ED visits, weighted to 143M nationally). Inclusion: female, age ≥18. Features: one-hot ICD codes (positivity-defining codes excluded from x). Positives (s=1): ICD codes indicating IPV (e.g., E967.3 "Battering by an intimate partner", V61.11 "Counseling for a victim of spousal or partner abuse"; full list in the Supplement). Groups g: MIMIC-IV ED — race/ethnicity (Black, white, Asian, Hispanic/Latino), insurance (Medicare, Medicaid), marital status (Legally Unmarried, Married, Divorced); NEDS — race/ethnicity, insurance status, income quartile (by ZIP-code median household income), and urban/rural county designation (population categories including central/fringe metropolitan).

Splits: five randomized train/validation/test splits; each patient appears in only one split.

Assumption checks: (a) Compare PURPLE's constrained model fit to an unconstrained model of p(s=1|x,g) that allows x–g interactions; similar performance supports the assumptions, while a worse fit suggests violations. (b) Check group-wise calibration of the predicted p(s=1|x,g).

Robustness: If p(y=1|x) differs by a constant multiplicative factor across groups (p(y=1|x,g=A) = α·p(y=1|x,g=B), α>1), PURPLE's estimate is a lower bound on the true relative prevalence. Sensitivity analyses show only modest degradation under violations of the no-false-positives and random-diagnosis-within-groups assumptions.

Baselines: Negative/NTC (assumes no underreporting), KM2 (kernel mean embedding mixture proportion estimator), DEDPUL (density heuristic on NTC outputs), and BBF (bootstrap-based positive subset). An oracle that uses the true y is included as an unattainable upper bound.
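The estimation procedure above can be sketched end-to-end in NumPy on Gauss-Synth-style data. This is a minimal illustration, not the paper's implementation (which uses a PyTorch single-layer network): it fits the same factored model p(s=1|x,g) = σ(θ·x+b)·σ(γ_g) with a hand-rolled Adam loop, and all constants (dimensions, means, sensitivities, learning rate) are illustrative choices:

```python
# Minimal NumPy sketch of PURPLE on synthetic two-group data (assumed setup).
import numpy as np

rng = np.random.default_rng(0)
n = 20_000                                    # samples per group

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# --- Simulate two groups with different covariate distributions p(x|g) ---
xA = rng.normal([1.0, 0.0], 1.0, size=(n, 2))
xB = rng.normal([-1.0, 0.0], 1.0, size=(n, 2))
x = np.vstack([xA, xB])
g = np.repeat([0, 1], n)                      # 0 = group A, 1 = group B

w_true, b_true = np.array([1.0, 1.0]), -2.0   # shared p(y=1|x) across groups
p_true = sigmoid(x @ w_true + b_true)
y = rng.random(2 * n) < p_true                # latent condition
c_true = np.array([0.3, 0.6])                 # group diagnosis sensitivities
s = y & (rng.random(2 * n) < c_true[g])       # observed diagnosis (no false +)

# --- Fit theta, b, gamma_g by minimizing cross-entropy with Adam ---
params = np.zeros(5)                          # [theta1, theta2, b, gammaA, gammaB]
m, v = np.zeros(5), np.zeros(5)
for t in range(1, 3001):
    p = sigmoid(x @ params[:2] + params[2])   # p_hat(y=1|x)
    c = sigmoid(params[3 + g])                # c_hat_g
    q = np.clip(p * c, 1e-12, 1 - 1e-12)      # predicted p(s=1|x,g)
    dq = (q - s) / (q * (1 - q))              # d(BCE)/dq
    dz = dq * c * p * (1 - p)                 # chain rule into the logit
    dgam = dq * p * c * (1 - c)
    grad = np.array([
        (dz * x[:, 0]).mean(), (dz * x[:, 1]).mean(), dz.mean(),
        (dgam * (g == 0)).mean(), (dgam * (g == 1)).mean(),
    ])
    m = 0.9 * m + 0.1 * grad                  # Adam moment updates
    v = 0.999 * v + 0.001 * grad ** 2
    mhat, vhat = m / (1 - 0.9 ** t), v / (1 - 0.999 ** t)
    params -= 0.05 * mhat / (np.sqrt(vhat) + 1e-8)

# --- Relative prevalence: ratio of mean predicted p(y=1|x) across groups ---
p_hat = sigmoid(x @ params[:2] + params[2])
est_ratio = p_hat[g == 0].mean() / p_hat[g == 1].mean()
true_ratio = p_true[g == 0].mean() / p_true[g == 1].mean()
print(est_ratio, true_ratio)                  # the two should be close
```

Note that even if the learned p̂(y=1|x) is off from the truth by a constant factor (the unidentifiable β), that factor appears in both the numerator and the denominator of `est_ratio` and cancels.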
Key Findings
Validation:
- Synthetic data: PURPLE accurately recovers relative prevalence in both separable and non-separable settings, outperforming PU baselines (Negative/NTC, KM2, BBF, DEDPUL), which can err substantially, especially when classes are non-separable. PURPLE's accuracy is stable across varying differences in p(x|g).
- Semi-synthetic MIMIC-based data: Across four symptom regimes (common, high relative prevalence, endometriosis-related, IPV-related injuries), prior methods often err by more than 4x; PURPLE remains consistently accurate.

Real-world IPV disparities (female patients, age ≥18):
- Insurance status: IPV is more prevalent among Medicaid patients than non-Medicaid patients (relative prevalence: NEDS 2.44 ± 0.07; MIMIC-IV ED 2.65 ± 0.31) and less prevalent among Medicare patients (NEDS 0.37 ± 0.01; MIMIC-IV ED 0.38 ± 0.04).
- Race/ethnicity: White patients have the lowest relative prevalence among the four race groups in both datasets. In NEDS, the white vs. non-white relative prevalence is 0.82 ± 0.02, indicating higher prevalence among non-white patients overall. In MIMIC-IV ED, racial differences are noisier due to the smaller sample, and the ordering is not fully consistent across datasets.
- Marital status (MIMIC-IV ED): higher IPV prevalence among Legally Unmarried patients (relative prevalence 1.48 ± 0.21).
- Geography (NEDS): higher IPV prevalence in central metropolitan counties with population >1 million (relative prevalence 1.18 ± 0.02).
- Income (NEDS): IPV prevalence decreases with income: bottom-quartile relative prevalence 1.16 ± 0.02 vs. top-quartile 0.87 ± 0.03.

Importance of the underreporting correction: using observed diagnoses alone (p(s=1|g)) yields inconsistent trends with income; PURPLE's correction aligns with prior literature showing IPV decreases with income. The analysis suggests IPV is less likely to be correctly diagnosed in lower-income women, mirroring broader underdiagnosis patterns in conditions such as dementia, asthma, and depression.
Discussion
The study addresses the challenge of estimating disparities when outcomes are underreported by introducing PURPLE, which estimates relative prevalence without requiring absolute prevalence or restrictive separability assumptions. By assuming no false positives, random diagnosis within groups, and constant p(y=1|x) across groups, the method learns p(y|x) up to scale and accurately infers relative prevalence. Synthetic and semi-synthetic validations demonstrate strong performance compared to PU baselines, particularly in realistic non-separable health data. Applying PURPLE to large ED datasets reveals higher IPV prevalence among Medicaid patients, non-white patients (with some dataset-dependent variability), legally unmarried individuals, residents of lower-income ZIP codes, and metropolitan counties. Correcting for underdiagnosis yields more plausible socioeconomic trends (e.g., decreasing IPV with increasing income) consistent with prior work, underscoring the necessity of modeling underreporting for equitable health policy. While the constant p(y|x) assumption may not universally hold (e.g., age dependencies), provided checks help assess applicability, and a lower-bound guarantee ensures conservative disparity estimates under certain violations. The cross-dataset comparison highlights the importance of large samples and replication to robustly assess racial disparities.
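The scale-cancellation argument invoked above ("learns p(y|x) up to scale") can be written out in a few lines. This is a sketch under the paper's stated assumptions, not its full derivation. Suppose the fitted model satisfies p̂(y=1|x) = β·p(y=1|x) for some unknown β > 0; then

```latex
\begin{aligned}
\frac{\mathbb{E}_{x \sim p(x \mid g=a)}\!\left[\hat{p}(y=1 \mid x)\right]}
     {\mathbb{E}_{x \sim p(x \mid g=b)}\!\left[\hat{p}(y=1 \mid x)\right]}
&= \frac{\beta \, \mathbb{E}_{x \sim p(x \mid g=a)}\!\left[p(y=1 \mid x)\right]}
        {\beta \, \mathbb{E}_{x \sim p(x \mid g=b)}\!\left[p(y=1 \mid x)\right]} \\
&= \frac{p(y=1 \mid g=a)}{p(y=1 \mid g=b)},
\end{aligned}
```

where the last step uses p(y=1|g) = E_{x∼p(x|g)}[p(y=1|x,g)] together with the constant-p(y=1|x) assumption, so the unknown constant β cancels and the relative prevalence is identified even though the absolute prevalence is not.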
Conclusion
This work introduces PURPLE, a machine learning method to estimate relative prevalence under underreporting using only observed diagnoses, symptoms, and group membership, without requiring separability assumptions or external diagnostic parameters. PURPLE outperforms existing PU-based approaches on synthetic and semi-synthetic health data and reveals robust IPV disparities across demographic groups in two large ED datasets. Correcting for underdiagnosis produces more credible disparity patterns (notably with income), informing better-targeted interventions. Future work includes applying PURPLE to other underreported women’s health conditions (e.g., polycystic ovarian syndrome, endometriosis, traumatic brain injury) and to non-health domains with underreporting (e.g., police misconduct, hate speech), extending modeling capacity for interactions/nonlinearities, and further validating assumptions and robustness across diverse settings.
Limitations
- Assumption dependence: results rely on (i) no false positives, (ii) random diagnosis within groups (diagnosis probability independent of x given y and g), and (iii) constant p(y=1|x) across groups. Violations can bias estimates; although diagnostics (model fit vs. an unconstrained model, group-wise calibration) and the robustness analysis (lower-bound guarantee under multiplicative deviations) mitigate risks, they cannot rule out all misspecifications.
- Absolute prevalence unidentifiable: PURPLE estimates relative, not absolute, prevalence when only positive and unlabeled data are available.
- Population scope: analyses are conditional on ED utilization among adult females; findings may not generalize to populations not interacting with EDs or to males, for whom symptom–IPV relationships may differ.
- Data and sample size: some subgroup estimates (e.g., race in MIMIC-IV ED) are noisier due to smaller sample sizes, and demographic variable availability differs between datasets.
- Potential diagnostic coding errors: IPV coding may include some false positives or misclassification; sensitivity analyses show performance degrades as false positives increase.
- Random diagnosis within groups may be partially violated if diagnosis probabilities depend on specific symptom patterns; simulations indicate errors grow with stronger dependence.