Introduction
Pregnancy loss, encompassing miscarriage and stillbirth, is a common complication, affecting nearly one-third of clinically recognized pregnancies. Recurrent pregnancy loss (RPL), defined as two or more pregnancy losses, occurs in an estimated 5-6% of all pregnant individuals and up to 16% of parous women. RPL causes significant emotional distress for families and is associated with a poorer prognosis for future pregnancies. The etiology of RPL is multifaceted and incompletely understood, with known contributing factors including advanced maternal age, chromosomal abnormalities, uterine anomalies, antiphospholipid syndrome, and hormonal or metabolic imbalances. Prior research has explored lifestyle factors, genetic variations, infertility diagnoses, thrombophilias, infections, and environmental exposures as potential contributors, yet roughly half of RPL cases remain unexplained. Electronic health records (EHRs) provide a rich source of longitudinal patient data that can be leveraged to identify previously unrecognized risk factors through computational analysis. While previous studies have used EHR data to examine disease incidence after RPL or characterize lifetime phenotypic associations in idiopathic RPL, there is a lack of EHR-based studies focused specifically on identifying potential risk factors for RPL before or near its onset. This study aimed to fill this gap by conducting a large-scale case-control study using de-identified EHR data from two independent medical centers: the University of California, San Francisco (UCSF), and Stanford University. By comparing the frequency of over 1600 candidate diagnoses in RPL and live-birth patients, we sought to identify novel associations and generate hypotheses for further investigation into RPL etiologies.
Literature Review
Existing literature on RPL highlights its multifactorial nature and the challenges in identifying causative factors. Numerous studies have investigated various potential risk factors, including:
* **Maternal Factors:** Advanced maternal age, chromosomal abnormalities (numerical or segmental), congenital uterine abnormalities, antiphospholipid syndrome, and uncontrolled hormonal or metabolic conditions.
* **Lifestyle Factors:** Diet, smoking, alcohol consumption, and exercise levels.
* **Genetic Factors:** Genetic variations associated with increased risk.
* **Infertility-Associated Diagnoses:** Endometriosis, polycystic ovary syndrome (PCOS), and other conditions related to infertility.
* **Thrombophilias:** Hereditary clotting disorders.
* **Infections:** Various infections implicated in pregnancy loss.
* **Environmental Exposures:** Air pollution and exposure to certain chemicals.
* **Paternal Factors:** Sperm DNA fragmentation and paternal health.
Despite these studies, roughly half of RPL cases still have no identifiable cause, which underscores the need for further research into the underlying mechanisms.
Methodology
This case-control study used de-identified EHR data from UCSF and Stanford, encompassing 8496 RPL patients and 53278 control patients with live births. Patients were selected using specific OMOP concept lists for pregnancy outcomes (provided in Supplementary File 1) so that the RPL and control groups could be identified accurately. To address limitations in sex and gender data in EHRs, a combined approach was used, incorporating gender_id and additional diagnostic codes to improve gender categorization. More than 1600 candidate diagnoses were analyzed, drawn from a window that ran from before RPL onset (or first live birth for controls) to one year after the event. Extending the window past the event allowed the inclusion of diagnostic evaluations that often follow a pregnancy outcome. ICD diagnostic codes were mapped to Phecodes to aggregate similar diagnoses and to allow comparison across ICD-9 and ICD-10 codes. Association analyses were conducted using generalized additive models (GAMs), adjusting for maternal age, race, and ethnicity. The Benjamini-Hochberg method was applied to correct for multiple testing by controlling the false discovery rate. For visualization, UMAP (Uniform Manifold Approximation and Projection) was used to reduce the dimensionality of the diagnosis data and reveal patterns in patient diagnostic histories. Age-stratified analyses compared associations in younger (<35 years) and older (≥35 years) patients. A sensitivity analysis assessed the impact of healthcare utilization on the association results by including the number of patient visits as a covariate in the models. Finally, external validation was performed by replicating the analyses in the Stanford EHR database to assess the robustness of the findings.
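The Benjamini-Hochberg correction named above can be illustrated in a few lines. This is a minimal, standard-library-only sketch and not the study's code: the function names, the example p-values, and the 0.05 threshold are assumptions chosen for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        candidate = p_values[idx] * m / rank
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


def significant(p_values, alpha=0.05):
    """Flag tests whose adjusted p-value is at or below alpha (assumed 0.05)."""
    return [q <= alpha for q in benjamini_hochberg(p_values)]


if __name__ == "__main__":
    # Hypothetical p-values, one per candidate diagnosis.
    example = [0.01, 0.04, 0.03, 0.005]
    print(benjamini_hochberg(example))
    print(significant(example, alpha=0.03))
```

With more than 1600 diagnoses tested, a raw 0.05 cutoff on unadjusted p-values would be expected to admit many chance findings. Ranking the p-values and scaling each by the number of tests divided by its rank is what controls the false discovery rate.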
Key Findings
The study identified several positive associations between diagnoses and RPL that were validated across both UCSF and Stanford EHR databases. Key findings include:
* **Menstrual Abnormalities:** A constellation of menstrual irregularities, including absent/infrequent, excessive/frequent, and irregular menstruation, was strongly associated with RPL, with elevated odds ratios in both cohorts. These findings are noteworthy because previous, smaller studies typically reported non-significant results.
* **Infertility-Associated Diagnoses:** Both a general “infertility” diagnosis and more specific diagnoses like endometriosis, PCOS, ovarian dysfunction, and ovarian failure were positively associated with RPL. This supports the longstanding, albeit debated, connection between infertility and RPL.
* **Infections:** Conditions such as vaginitis/vulvovaginitis and pelvic inflammatory disease were positively associated with RPL, suggesting a possible role of the vaginal and uterine microbiome in pregnancy loss risk.
* **Metabolic Syndrome:** A strong positive association was observed between metabolic syndrome and RPL in both cohorts.
* **Age Stratification:** Odds ratios for many RPL-associated diagnoses were higher in patients younger than 35 compared to those 35 years and older, indicating potential age-dependent effects.
* **Healthcare Utilization Sensitivity Analysis:** Controlling for healthcare utilization (number of visits) generally led to modest reductions in odds ratios, particularly in the Stanford cohort, suggesting a potential confounding effect of utilization on some associations. However, most validated associations remained significant after adjusting for utilization.
The study also identified negative associations, mainly related to diagnoses associated with later-stage pregnancies (i.e., those occurring primarily in the control group), and a few discordant results between the two centers that might be attributable to center-specific screening practices or other variations in clinical care.
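For readers less familiar with the odds ratios reported above, the sketch below computes a crude (unadjusted) odds ratio and a Wald 95% confidence interval from a 2x2 table. The study itself estimated odds ratios from GAMs adjusted for maternal age, race, and ethnicity, so this only illustrates the quantity being reported. The function name and all counts are hypothetical.

```python
import math


def odds_ratio_ci(exposed_cases, unexposed_cases,
                  exposed_controls, unexposed_controls, z=1.96):
    """Crude odds ratio with a Wald confidence interval (z=1.96 gives ~95%)."""
    cells = [exposed_cases, unexposed_cases,
             exposed_controls, unexposed_controls]
    # Haldane-Anscombe correction: add 0.5 to every cell if any cell is zero.
    if any(c == 0 for c in cells):
        cells = [c + 0.5 for c in cells]
    a, b, c, d = cells
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) from the four cell counts.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


if __name__ == "__main__":
    # Hypothetical counts: 20 of 100 cases and 10 of 100 controls
    # carry the diagnosis of interest.
    or_, lo, hi = odds_ratio_ci(20, 80, 10, 90)
    print(f"OR = {or_:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```

An odds ratio above 1 means the diagnosis is more common among cases than controls, and an interval that excludes 1 corresponds to a positive association at the chosen confidence level. Running the same function separately on the <35 and ≥35 subsets would give a crude version of the age-stratified comparison.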
Discussion
This large-scale, multi-center study provides substantial evidence supporting the association of several known and novel diagnoses with RPL. The replication of known RPL associations, such as chromosomal anomalies and endocrine conditions, validates the methodology and strengthens confidence in the newly identified associations. The strong and consistent association found between RPL and various menstrual abnormalities is particularly noteworthy. Given the ease of monitoring and screening, menstrual abnormalities may serve as easily accessible indicators of underlying risk factors, potentially leading to earlier interventions. The association with infertility-related diagnoses further underscores the complex interplay between these conditions. The finding of associations with infections and metabolic syndrome opens new avenues for research into the role of inflammation, oxidative stress, and hormonal imbalances in RPL pathogenesis. The age-stratified analysis suggests that the relative contribution of different risk factors may vary across age groups. The sensitivity analysis highlights the importance of considering healthcare utilization as a potential confounder in EHR-based studies, with the magnitude of its influence varying across datasets. The overall consistency of results between the two medical centers strengthens the generalizability of the findings, although further validation in diverse populations remains important.
Conclusion
This study represents the first large-scale, externally validated EHR analysis focusing on diagnoses preceding and near RPL onset. It identifies several known and novel potential risk factors, highlighting menstrual abnormalities, infertility-associated conditions, infections, and metabolic syndrome. The findings emphasize the multifactorial nature of RPL and provide a foundation for further research into the underlying mechanisms. Future studies should examine the identified associations in more detail, explore potential biological pathways, and pursue the development of predictive models to identify women at high risk for RPL.
Limitations
Despite its strengths, this study has limitations. The reliance on EHR data may lead to underreporting of some pregnancy losses or misclassification of diagnoses. The study population, drawn from two academic medical centers in a specific region, may not fully represent the general population. While the study controlled for several confounders, residual confounding from unmeasured social and behavioral factors might remain. Diagnosis-specific differences in evaluation frequencies could also influence results, although the sensitivity analysis partly addressed this concern. The large number of comparisons increases the possibility of false positives, although multiple testing correction was applied. Finally, the exact counts for some diagnoses were redacted for patient de-identification. Future research incorporating more detailed data on social determinants of health and utilizing more sophisticated statistical methods could further enhance our understanding of RPL.