Leveraging electronic health records to identify risk factors for recurrent pregnancy loss across two medical centers: a case-control study

J. Roger, F. Xie, et al.

Discover groundbreaking insights into recurrent pregnancy loss (RPL) from a major case-control study conducted by leading researchers including Jacquelyn Roger and Feng Xie. This research identifies key risk factors and paves the way for future clinical interventions.

Introduction
Pregnancy loss affects nearly one third of clinically recognized pregnancies, and recurrent pregnancy loss (RPL), defined as two or more losses, affects an estimated 5–6% of ever-pregnant individuals and up to 16% of parous individuals. Known causes include maternal age, chromosomal and uterine anomalies, antiphospholipid antibody syndrome, and uncontrolled endocrine/metabolic conditions, yet about half of RPL cases remain unexplained. Prior studies linking additional factors to RPL have generally been hypothesis-driven with limited scope. Electronic health records (EHRs) provide longitudinal, multimodal data that can be mined to discover broader phenotypic associations. While EHR studies have examined diseases following RPL or idiopathic RPL correlates, none have focused on identifying potential risk factors occurring before or near RPL onset. This study aims to use large-scale EHR data from two medical centers to identify diagnoses associated with RPL, generate etiologic hypotheses, compare patterns by age, evaluate potential confounding by healthcare utilization, and validate findings across centers.
Literature Review
Existing literature identifies multiple contributors to RPL including advanced maternal age, chromosomal abnormalities, uterine anomalies, antiphospholipid syndrome, and endocrine/metabolic disorders. Additional associations have been reported with lifestyle factors, genetic variation, infertility-related diagnoses (e.g., endometriosis, PCOS), hereditary thrombophilias, infections, environmental exposures, and paternal factors such as sperm DNA fragmentation. Large-scale EHR analyses have shown increased long-term cardiovascular and gastrointestinal disease risk after RPL in Denmark, and idiopathic RPL associations with tubulointerstitial nephritis, infertility, and ectopic pregnancy in UK Biobank. Evidence on menstrual abnormalities and miscarriage risk has been mixed and often underpowered. The potential roles of the vaginal/uterine microbiome and metabolic syndrome in pregnancy loss are suggested but not fully resolved. Collectively, prior work underscores multifactorial RPL etiology and the need for broader, data-driven risk factor discovery.
Methodology
Design: Case-control association study using de-identified, OMOP-formatted EHRs from two independent centers (UCSF: 6,400,834 patients, 1982–2022; Stanford: 3,604,034 patients, 1999–2022).
Patients: RPL cases had an RPL diagnosis or evidence of at least two pregnancy losses, with rules ensuring that the losses were distinct events. Controls had an uncomplicated live birth and were excluded for any history of loss or specified adverse pregnancy outcomes. Additional demographic, data-quality, and data-sufficiency filters were applied.
Exposure ascertainment: ICD-9, ICD-9-CM, ICD-10, and ICD-10-CM diagnoses recorded from any time before RPL onset (or before the first live birth for controls) through one year afterward were included, so that evaluations performed around the time of the outcome were captured. ICD codes were aggregated to Phecodes to harmonize across coding systems and to reduce sparsity and over-specificity.
Gender data: To capture gender identity beyond the OMOP gender_id field, the authors queried gender-diverse concepts and updated gender identity assignments; inclusion required a female gender_id or any incident pregnancy record.
Analyses:
1) The main association analysis tested each candidate Phecode's association with RPL using generalized additive models (GAMs) with maternal age (smoothing spline), race, and ethnicity as covariates; p-values were FDR-adjusted (Benjamini–Hochberg).
2) Age-stratified analyses compared associations in patients <35 vs ≥35 at the first RPL or birth record.
3) Healthcare utilization sensitivity analyses re-ran the GAMs with an added covariate for the number of visits within the study window, to evaluate potential confounding by utilization.
Visualization: UMAP was applied to one-hot encoded non-pregnancy diagnoses to summarize patient-level disease patterns. Coordinate distributions were compared by outcome and by age stratum using Mann–Whitney U tests; additional plots were colored by demographic and utilization metrics.
External validation: All analyses were reproduced independently in the Stanford OMOP EHR using identical phenotyping and analysis pipelines. Inter-center validation included the overlap of significant diagnoses, correlations of odds ratios, and comparison plots.
IRB approvals were obtained at both institutions. Code and phenotype concept lists were made available in the supplementary materials and on GitHub.
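To illustrate the Benjamini–Hochberg adjustment named in the main analysis, here is a minimal pure-Python sketch. The function name and the example p-values are hypothetical and are not taken from the authors' code; the sketch shows only the standard step-up procedure, not the GAM fitting that produces the p-values.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four candidate diagnoses (not study data).
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
significant = [q < 0.05 for q in adj]
print([round(q, 4) for q in adj], significant)
```

A diagnosis would be called significant when its adjusted value falls below the chosen FDR threshold; the 0.05 used here is an assumption, since this summary does not state the threshold.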
Key Findings
Cohorts: UCSF included 3,840 RPL patients and 17,259 controls; Stanford included 4,656 RPL patients and 36,019 controls (total: 8,496 RPL patients and 53,278 controls). RPL patients were older than controls at both sites. A smaller share of RPL patients than controls was identified as Hispanic/Latino (UCSF: 11.7% vs 16.6%; Stanford: 16.4% vs 29.3%). Median visit counts were similar at UCSF (RPL 42.5 vs control 41), whereas RPL patients had higher utilization at Stanford (31 vs 14 visits).
Candidate diagnoses: UCSF tested 1,612 Phecodes, of which 120 were significant (51 positive, 69 negative). Stanford tested 1,662 Phecodes, of which 367 were significant (330 positive, 37 negative). The centers shared 1,576 Phecodes; 88 were significant at both, with highly correlated odds ratios (Spearman r=0.946, p<0.001). Of these 88, 42 were positive at both centers, 34 were negative at both, and 12 were discordant. The overlaps of positives (42/1,576) and negatives (34/1,576) were each greater than expected by chance (hypergeometric p<0.001 for both).
Notable positive associations (validated across centers):
- Menstrual abnormalities: absent/infrequent menstruation, excessive/frequent menstruation, irregular menstrual cycles/bleeding, other menstrual disorders (UCSF ORs 1.67–2.67; Stanford ORs 2.74–4.95; all p<0.001).
- Infertility-related: infertility, endometriosis, polycystic ovaries (PCOS), ovarian dysfunction, ovarian failure (UCSF ORs 2.15–5.07; Stanford ORs 2.73–9.69; all p<0.001).
- Additional positives: infections (e.g., vaginitis/vulvovaginitis, pelvic inflammatory disease), immunologic conditions, ovarian cyst, and metabolic syndrome.
Negative associations (validated): Primarily diagnoses related to childbirth and pregnancy, likely reflecting their higher frequency among live-birth controls.
Discordant associations: Some mental health and glucose-related diagnoses were significant at both centers but in opposite directions, likely reflecting site-specific screening practices.
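The "greater than expected by chance" statement can be checked with a hypergeometric upper-tail probability. The sketch below rests on a loudly stated assumption: this summary does not report how many of each site's significant Phecodes fall within the 1,576 shared ones, so the full per-site counts (51 and 330 positives; 69 and 37 negatives) stand in for them. The function name is hypothetical, and the authors' exact test setup may differ.

```python
from fractions import Fraction
from math import comb


def hypergeom_upper_tail(population, successes, draws, observed):
    """P(X >= observed) when `draws` items are taken without replacement
    from `population` items, of which `successes` are marked."""
    total = comb(population, draws)
    upper = min(successes, draws)
    tail = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    # Exact rational arithmetic avoids overflow in the huge binomials.
    return float(Fraction(tail, total))


SHARED = 1576  # Phecodes tested at both centers (from the text)

# ASSUMPTION: per-site significant counts are used as if all were shared.
p_pos = hypergeom_upper_tail(SHARED, successes=330, draws=51, observed=42)
p_neg = hypergeom_upper_tail(SHARED, successes=37, draws=69, observed=34)

# Expected overlap under independence, for comparison with 42 and 34.
print(round(51 * 330 / SHARED, 1), round(69 * 37 / SHARED, 1))
print(p_pos < 0.001, p_neg < 0.001)
```

Under these stand-in counts, chance alone would produce roughly 11 shared positives and fewer than 2 shared negatives, which is why observing 42 and 34 yields very small tail probabilities.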
Age-stratified analyses: Odds ratios tended to be higher in patients <35 than in patients ≥35.
- UCSF: Among 1,419 diagnoses tested in both strata, 128 were significant in at least one; ORs were highly rank-correlated across strata (Spearman r=0.889, p<0.001), and 111/128 were higher in the <35 stratum.
- Stanford: Among 1,512 diagnoses, 342 were significant in at least one stratum; the correlation across strata was modest (r=0.365, p<0.001), and 239/342 were higher in the <35 stratum.
Healthcare utilization sensitivity:
- UCSF: With visit count added, 138/1,612 diagnoses were significant (42 positive, 96 negative). ORs from the main and sensitivity analyses were almost perfectly correlated (r=0.997), with a median 13% decrease in ORs.
- Stanford: With visit count added, 162/1,662 diagnoses were significant (56 positive, 106 negative). ORs remained strongly correlated with the main results (r=0.912) but showed a median 49% decrease.
- Sustained validation: Of the 42 inter-center validated positive associations from the main analysis, 33 remained positive and significant at both sites after utilization adjustment; all 34 validated negatives remained significant and negative. In the utilization-adjusted models, 90 diagnoses were significant at both centers, with very high OR correlation (r=0.944): 33 positive, 56 negative, 1 discordant.
UMAP: At UCSF, coordinate distributions differed significantly by outcome and by age stratum; at Stanford, they differed by outcome but not by age stratum.
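The odds-ratio correlations reported above compare how diagnoses are ordered by effect size in two sets of results. As a rough illustration, Spearman's coefficient can be computed as the Pearson correlation of ranks. The function names and the odds ratios below are made up for the example and are not study values.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical odds ratios for the same five diagnoses in two analyses.
analysis_a = [1.8, 2.5, 4.0, 0.5, 0.7]
analysis_b = [2.9, 2.6, 6.1, 0.6, 0.4]
print(round(spearman(analysis_a, analysis_b), 3))  # 0.8 for these made-up values
```

Because only ranks matter, a uniform shrinkage of all odds ratios (as seen after utilization adjustment) can leave the correlation high even when the effect sizes themselves change substantially.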
Discussion
This large, two-center EHR study identified a spectrum of diagnoses associated with RPL before or near onset, replicating known risk factors and revealing potentially novel associations. Menstrual abnormalities showed robust positive associations with RPL; given their ease of screening and plausible mechanisms (e.g., luteal insufficiency, thyroid/ovarian dysfunction, or uterine structural causes), these findings suggest clinical value in taking a menstrual history and monitoring menstrual patterns. Infertility and infertility-related diagnoses, including endometriosis, PCOS, and ovarian dysfunction/failure, were positively associated with RPL, supporting a possible link between impaired conception and impaired pregnancy maintenance and adding to previously mixed evidence. Associations with vaginitis/vulvovaginitis and pelvic inflammatory disease, together with prior literature, point to a potential role of the vaginal and uterine microbiome in pregnancy loss. Metabolic syndrome was strongly associated with RPL at both sites; its interplay with PCOS and inflammatory/oxidative pathways merits further mechanistic investigation. Negative associations were largely childbirth-related conditions, consistent with their higher frequency among live-birth controls. Site-discordant findings (e.g., mental health and glucose-related diagnoses) likely reflect different screening practices and underscore the value of multi-center validation. Age-stratified results showed stronger associations in younger patients (<35), consistent with the idea that the growing baseline contribution of age to miscarriage risk reduces the relative influence of other factors among older patients. Sensitivity analyses demonstrated that healthcare utilization can materially affect effect sizes (substantially at Stanford, where utilization differed by outcome, and minimally at UCSF), highlighting the need to assess utilization in EHR-based studies and to consider potential collider bias when adjusting for it. The persistence of most inter-center validated findings after utilization adjustment supports their robustness across differing utilization patterns. Overall, these findings generate hypotheses on RPL etiologies across menstrual, endocrine, metabolic, infectious, and immunologic domains and emphasize the utility of external validation in EHR-based association studies.
Conclusion
Across two large EHR datasets, the study identified 42 diagnoses with significant positive associations with RPL at both centers, including multiple menstrual abnormalities and infertility-related diagnoses (endometriosis, PCOS, ovarian dysfunction/failure). Results replicated established risk factors and highlighted potentially novel contributors such as metabolic syndrome and infections implicating the vaginal/uterine microbiome. Age-stratified analyses suggested stronger associations in patients under 35, and sensitivity analyses showed that healthcare utilization can influence effect sizes, with most validated associations remaining robust after adjustment. This work provides an externally validated, scalable framework for EHR-based RPL risk factor discovery and offers phenotype definitions and code for reproducibility. Future research should investigate the biological mechanisms underlying these associations, assess prediction of high-risk patients, explore targeted interventions, and evaluate generalizability across diverse populations and care settings.
Limitations
Potential misclassification includes unrecorded prior losses among controls, which could attenuate associations. Some associations lacked a clear clinical rationale and may reflect chance findings from multiple testing or residual confounding, including unmeasured social and behavioral factors. Both cohorts derive from academic centers in the same region, limiting generalizability; external validation beyond these settings is needed. Differential evaluation and screening of RPL and control patients for specific diagnoses could bias associations; while a utilization sensitivity analysis was performed, establishing diagnosis-specific data provenance would require targeted studies. Adjusting for utilization may introduce collider bias, and the resulting changes in effect sizes reflect a mix of confounding and noncollapsibility rather than pure confounding. EHR gender fields are limited; although augmented methods were used, misclassification is still possible.
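The noncollapsibility point can be made concrete with a small worked example. The counts below are invented for illustration and are not study data. The covariate is balanced across exposure groups, so it is not a confounder, yet the odds ratio within each stratum differs from the odds ratio obtained after pooling. This shows only that an odds ratio can change under adjustment without confounding; it does not reproduce the direction or size of the changes reported in the study.

```python
def odds_ratio(exposed_cases, exposed_noncases, unexposed_cases, unexposed_noncases):
    """Cross-product odds ratio from a 2x2 table."""
    return (exposed_cases * unexposed_noncases) / (exposed_noncases * unexposed_cases)


# Hypothetical strata of a covariate (e.g., low vs high visit count).
# Each tuple: exposed cases, exposed non-cases, unexposed cases, unexposed non-cases.
# Every stratum has 100 exposed and 100 unexposed patients, so the
# covariate is independent of exposure.
strata = [
    (80, 20, 50, 50),
    (50, 50, 20, 80),
]

stratum_ors = [odds_ratio(*table) for table in strata]

# Collapse over the covariate by summing the four cells across strata.
pooled_table = tuple(sum(cells) for cells in zip(*strata))
pooled_or = odds_ratio(*pooled_table)

print(stratum_ors)          # [4.0, 4.0]
print(round(pooled_or, 2))  # 3.45
```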