Introduction
Electronic health records (EHRs) are increasingly valuable for research, offering statistical power for large-scale analyses, diverse feature sets (clinical risk factors, lab results, notes, imaging), and repeated measures for disease progression modeling. However, EHR data is susceptible to biases like ascertainment bias (data acquired based on clinical need) and selection bias due to missingness. Sampling all individuals with complete data amplifies these biases, leading to spurious associations and poor generalizability. In contrast, sampling individuals with regular primary care can reduce ascertainment bias by providing longitudinal data outside illness contexts. Analyzing unstructured data (free-text notes) can also reduce missing data bias. Both strategies aim to create EHR samples resembling traditional research cohorts, improving the validity of established analysis methods. This study developed the Community Care Cohort Project (C3PO), a multi-institutional EHR-based cohort for cardiovascular disease research. Its two main goals were to mitigate ascertainment bias and minimize data missingness. A deep natural language processing (NLP) model was developed to recover vital signs from unstructured notes. The study compared C3PO's effectiveness to that of Convenience Samples (from the same EHR but without the longitudinal primary care requirement). The hypothesis was that risk scores derived from prospective cohorts would perform better in C3PO, indicating reduced bias.
Literature Review
The introduction section extensively cites prior research highlighting the power and limitations of EHR data for research. Studies demonstrating the potential of EHRs for various analyses, including epidemiological and genetic studies, are referenced. Conversely, studies focusing on the inherent biases within EHR data, particularly ascertainment and selection biases, are also reviewed, emphasizing the need for robust methods to mitigate these biases. The introduction establishes a clear need for the methods employed in the current study to improve the quality and generalizability of research conducted using EHR data.
Methodology
The Community Care Cohort Project (C3PO) was constructed using data from the Mass General Brigham (MGB) multi-institutional EHR, encompassing 3.6 million individuals with at least one ambulatory visit between 2000 and 2018. To approximate a community-based cohort, individuals were selected based on receiving regular primary care within the MGB network. Rule-based heuristics were developed and validated to identify primary care office visits using CPT codes and a manually curated list of clinic locations. Individuals with at least two primary care visits 1-3 years apart were included. The start of follow-up was defined as the second visit of the earliest qualifying pair. Ethical approval was obtained through the MGB Institutional Review Board.
Cohort validation involved two methods: comparison to an existing MGH primary care practice registry (showing substantial overlap) and manual chart review of a random sample of algorithm-positive and -negative individuals (demonstrating excellent inter-rater reliability and high positive predictive value).
A data ingestion pipeline (JEDI) was created to integrate diverse EHR data types into a unified, indexed file system (HDF5). Data included demographics, vital signs, notes, lab results, medications, and diagnostic reports. Individuals aged <18 or ≥90, or those with missing demographic data, were excluded, resulting in 520,868 individuals in the final C3PO cohort.
Two established cardiovascular risk prediction models were used: the Pooled Cohort Equations (PCE) for myocardial infarction/stroke and the CHARGE-AF score for atrial fibrillation. Exposure data (age, sex, race, height, weight, blood pressure, smoking status, prevalent diseases) were extracted from the EHR. Outcomes were defined using validated EHR-based classification schemes. Convenience Samples were created from the same EHR, including all individuals with complete data for each risk score, regardless of primary care visits.
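The inclusion rule described above (at least two primary care visits 1-3 years apart, with follow-up starting at the second visit of the earliest qualifying pair) can be illustrated with a minimal Python sketch. This is not the study's code. The function name `follow_up_start`, the day-based thresholds (365 and 1,095 days), and the tie-breaking choice (pairs ordered by their first visit) are assumptions made for illustration only.

```python
from datetime import date, timedelta
from typing import Optional, Sequence

# Assumed day-based reading of "1-3 years apart"; the study may define this differently.
MIN_GAP = timedelta(days=365)
MAX_GAP = timedelta(days=3 * 365)


def follow_up_start(visits: Sequence[date]) -> Optional[date]:
    """Return the second visit of the earliest qualifying pair, or None.

    A pair qualifies when the two primary care visits are between
    MIN_GAP and MAX_GAP apart. Pairs are scanned in order of their
    first visit (an assumption about what "earliest" means).
    """
    ordered = sorted(set(visits))
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            gap = second - first
            if gap > MAX_GAP:
                break  # later visits are only further away
            if gap >= MIN_GAP:
                return second
    return None


# Hypothetical visit history for one individual.
example_visits = [date(2005, 1, 10), date(2005, 6, 1), date(2006, 3, 15)]
print(follow_up_start(example_visits))  # 2006-03-15
```

An individual for whom the function returns None would not enter the cohort under this reading of the rule.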
Natural language processing (NLP) with a fine-tuned Bio + Discharge Summary BERT model was used to recover missing vital signs from unstructured notes. A rule-based regular expression approach was first used to label vital signs in the notes, and the BERT model was then fine-tuned on those labels. The model's performance was evaluated using separate evaluation and testing sets, and post-processing steps were applied to harmonize units and impose physiological constraints. Ablation studies were conducted to assess the impact of individual post-processing steps. Statistical analyses involved calculating cumulative incidence using the Kaplan-Meier method, incidence rates, Cox proportional hazards models, and assessments of model discrimination (c-index) and calibration (GND test, calibration slope, ICI). Calibration was assessed both before and after recalibration to the sample baseline hazard. Analyses were performed in Python and R.
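The rule-based labeling and the post-processing steps (unit harmonization and physiological constraints) might look roughly like the sketch below. It covers only the regular-expression stage, not the BERT model. The patterns, function names, and plausibility ranges are hypothetical and chosen for illustration; the summary does not give the study's actual rules.

```python
import re
from typing import Optional, Tuple

# Hypothetical patterns; real clinical notes need far more variants.
BP_PATTERN = re.compile(
    r"\b(?:BP|blood pressure)\s*[:=]?\s*(\d{2,3})\s*/\s*(\d{2,3})\b",
    re.IGNORECASE,
)
WEIGHT_PATTERN = re.compile(
    r"\b(?:wt|weight)\s*[:=]?\s*(\d{2,3}(?:\.\d+)?)\s*(kg|lbs?|pounds)\b",
    re.IGNORECASE,
)
LB_TO_KG = 0.45359237


def extract_bp(note: str) -> Optional[Tuple[int, int]]:
    """Return (systolic, diastolic) in mmHg, or None if absent or implausible."""
    match = BP_PATTERN.search(note)
    if match is None:
        return None
    sbp, dbp = int(match.group(1)), int(match.group(2))
    # Assumed physiological constraints, for illustration only.
    if not (50 <= sbp <= 300 and 20 <= dbp <= 200 and sbp > dbp):
        return None
    return sbp, dbp


def extract_weight_kg(note: str) -> Optional[float]:
    """Return weight in kilograms, converting from pounds when needed."""
    match = WEIGHT_PATTERN.search(note)
    if match is None:
        return None
    value = float(match.group(1))
    unit = match.group(2).lower()
    kg = value if unit == "kg" else value * LB_TO_KG  # unit harmonization
    if not (20 <= kg <= 350):  # assumed plausibility range
        return None
    return round(kg, 1)


note = "Vitals: BP 128/76, Wt: 180 lbs"
print(extract_bp(note))         # (128, 76)
print(extract_weight_kg(note))  # 81.6
```

Values that fail a constraint are dropped rather than corrected here, which is one possible design choice among several.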
Key Findings
C3PO comprised 520,868 individuals (mean age 48, 61% women) with a median follow-up of 7.2 years. NLP increased the availability of vital signs from 54.9% to 68.8%; missingness therefore fell from 45.1% to 31.2%, a 31% relative reduction. NLP-derived vital signs showed excellent correlation with structured data (Pearson r 0.95-0.99). In C3PO, the incidence of atrial fibrillation and of myocardial infarction/stroke was lower, and risk models were better calibrated, than in the Convenience Samples. Specifically, calibration error for myocardial infarction/stroke ranged from 0.012-0.030 in C3PO versus 0.028-0.046 in Convenience Samples, while atrial fibrillation calibration error was 0.028 in C3PO versus 0.036 in Convenience Samples. The PCE model demonstrated strong association with incident MI/stroke (hazard ratio per 1-SD increase 2.04-2.51), moderate discrimination (c-index 0.724-0.768), and some miscalibration in both C3PO and Convenience Samples, although miscalibration was less in C3PO. CHARGE-AF showed strong association with incident AF (HR per 1-SD 2.56), moderate discrimination (c-index 0.782), and substantial underestimation of AF risk in both C3PO and the Convenience Samples, but again, less so in C3PO. Cumulative risk curves showed an abrupt rise in incident diagnoses shortly after the start of follow-up in the Convenience Samples but not in C3PO.
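For readers unfamiliar with how such cumulative risk curves are built, the following is a minimal pure-Python sketch of the Kaplan-Meier estimator, reporting cumulative incidence as one minus the estimated survival. The function name and the toy data are hypothetical, and the study's own analysis code may differ. An abrupt early rise in the curve corresponds to many events occurring shortly after follow-up begins.

```python
from typing import List, Sequence, Tuple


def km_cumulative_incidence(
    times: Sequence[float], events: Sequence[int]
) -> List[Tuple[float, float]]:
    """Kaplan-Meier cumulative incidence, 1 - S(t), at each event time.

    times  : follow-up time for each individual
    events : 1 if the outcome occurred at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve: List[Tuple[float, float]] = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            survival *= 1 - n_events / at_risk
            curve.append((t, 1 - survival))
        at_risk -= n_leaving  # events and censored both leave the risk set
    return curve


# Toy data: five individuals, two of them censored.
for t, ci in km_cumulative_incidence([1, 2, 2, 3, 4], [1, 1, 0, 0, 1]):
    print(f"t={t}: cumulative incidence={ci:.2f}")
```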
Discussion
This study demonstrates that intentional sampling of individuals from a large, multi-institutional EHR based on longitudinal primary care, combined with NLP to recover missing data, improves the validity of EHR-based prediction. C3PO, with over half a million individuals, more closely resembles traditional epidemiological cohort studies, potentially leading to more generalizable findings. Compared to Convenience Samples, C3PO showed less bias and greater data density. NLP reduced vital sign missingness by 31%. The lower incidence rates and better model calibration in C3PO compared to the Convenience Samples suggest that C3PO's design is effective in reducing ascertainment bias. The abrupt increase in incident diagnoses in the Convenience Samples might indicate misclassification of prevalent disease as incident disease. The consistent performance of established risk models in C3PO, aligning with expectations from their original validation studies, further supports the reduced bias. While model calibration in C3PO wasn't perfect, the findings provide strong evidence that EHR samples enriched for individuals receiving longitudinal primary care can significantly reduce bias in developing disease-related models. The study acknowledges potential biases associated with selecting a primary care population (e.g., higher likelihood of insurance) and the limitations of EHR data in general (residual indication bias). The authors also note that the C3PO sampling design is adaptable to various epidemiological analyses.
Conclusion
The study successfully developed C3PO, an EHR-based resource with over half a million individuals, leveraging longitudinal primary care data and NLP to reduce bias. This approach significantly improves the quality and generalizability of research using EHRs. The JEDI pipeline, used for data processing, is publicly available, facilitating future research efforts. While limitations remain, the results demonstrate the potential of well-designed EHR cohorts to significantly advance clinical modeling and discovery.
Limitations
The study acknowledges several limitations. Despite efforts to reduce bias by selecting individuals with regular primary care, residual indication bias is unavoidable in EHR data. Although NLP reduced vital sign missingness, considerable missingness remains for other features. The NLP model's performance in other datasets is unknown. Misclassification of exposures and outcomes is possible, and the C3PO selection process might not easily extend to other EHRs. The predominantly White population limits generalizability to diverse racial groups. The study's observational nature prevents causal inferences.