Medicine and Health
Cohort design and natural language processing to reduce bias in electronic health records research
S. Khurshid, C. Reeder, et al.
Learn how electronic health record (EHR) research can mitigate bias and missing data through careful cohort sampling and natural language processing. The study reports a substantial reduction in missing baseline vital signs and improved risk model calibration, enhancing the generalizability of EHR-based research.
~3 min • Beginner • English
Introduction
Electronic health record (EHR) databases enable large-scale discovery and prediction by offering rich, longitudinal clinical data across diverse modalities. However, EHR data are prone to biases, notably ascertainment bias from clinically driven data capture and selection bias related to missingness, which can impair generalizability. The authors propose two design strategies to reduce bias: (1) constructing a cohort of individuals receiving longitudinal primary care to approximate a prospective community-based cohort, and (2) using natural language processing (NLP) to recover missing data from unstructured notes. They developed the Community Care Cohort Project (C3PO) from the Mass General Brigham EHR and evaluated its validity by deploying established cardiovascular risk models (Pooled Cohort Equations for MI/stroke and CHARGE-AF for atrial fibrillation), comparing performance against convenience samples constructed without a longitudinal primary care requirement. The hypothesis was that risk scores derived in traditional cohorts would perform more favorably, and in particular be better calibrated, in C3PO than in the convenience samples, indicating reduced bias.
Literature Review
Prior work highlights the strengths of EHRs for epidemiology, genomics, and longitudinal modeling, but also emphasizes vulnerabilities to selection, ascertainment, and measurement biases. Convenience sampling, often used to maximize power, can exacerbate these biases and missingness, leading to spurious associations and poor external validity. Established cardiovascular risk scores like the Pooled Cohort Equations (PCE) and CHARGE-AF have shown consistent performance in community cohorts, making them suitable benchmarks to detect bias when applied in EHR samples. Advances in NLP, particularly domain-adapted transformer models such as BioBERT and clinical BERT variants, have demonstrated strong performance on clinical information extraction, suggesting potential to reduce missingness by extracting structured variables from free text.
Methodology
Cohort construction: Using the Mass General Brigham (MGB) EHR (3.6M individuals with ≥1 ambulatory visit, 2000–2018), the authors identified primary care visits via CPT codes and a curated list of 431 primary care clinic locations. Inclusion required ≥2 primary care visits 1–3 years apart; follow-up began at the second qualifying visit to allow baseline ascertainment and reduce prevalent-incident misclassification. Individuals aged <18 or ≥90 at baseline and those with missing demographics were excluded, yielding 520,868 participants (C3PO). Validation included overlap with an MGH primary care registry (up to 93.3% overlap without temporal/age criteria) and blinded manual chart review of algorithm-positive/negative cases with excellent inter-rater agreement (kappa 0.78–1) and PPV ≥85%.
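The inclusion rule above can be sketched as a simple filter over each patient's primary care visit dates. This is an illustrative reading only: the day-count bounds for "1–3 years apart" and the choice of the earliest qualifying pair are assumptions, not the authors' exact implementation.

```python
from datetime import date
from itertools import combinations

def c3po_baseline(pc_visits, min_gap=365, max_gap=3 * 365):
    """Return the baseline date -- the second visit of the earliest pair of
    primary care visits 1-3 years apart -- or None if the patient does not
    qualify. Day-count gap bounds are an illustrative approximation."""
    qualifying = [later
                  for earlier, later in combinations(sorted(pc_visits), 2)
                  if min_gap <= (later - earlier).days <= max_gap]
    # Follow-up begins at the second qualifying visit, allowing a baseline
    # ascertainment window and reducing prevalent-incident misclassification.
    return min(qualifying) if qualifying else None
```

In practice the study also applied age (18–89 at baseline) and complete-demographics filters before arriving at the 520,868-person cohort.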
Data infrastructure: The JEDI Extractive Data Infrastructure (HDF5-based) integrated demographics, vitals, labs, medications, diagnostic tests, and free-text notes. Egress pipelines produced long- and wide-format analytic files. JEDI is publicly available.
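The long-to-wide egress step can be illustrated with a toy pivot. This is not JEDI's actual API (which is HDF5-based), only a minimal sketch of the reshaping the egress pipelines perform, assuming long-format rows of the form `(patient_id, variable, value)`.

```python
def long_to_wide(records):
    """Pivot long-format (patient_id, variable, value) rows into one
    wide-format mapping per patient. If a variable repeats for a patient,
    the last value wins in this simplified sketch."""
    wide = {}
    for pid, variable, value in records:
        wide.setdefault(pid, {})[variable] = value
    return wide
```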
NLP for vital sign recovery: To reduce missingness in baseline vital signs (height, weight, systolic and diastolic blood pressure), the team fine-tuned a Bio + Discharge Summary BERT model using weak labels from a regex-based pipeline. Training data comprised 34,310 notes from 900 individuals (116,644 labeled vital instances). Evaluation and test sets were built from separate 50-patient samples. The model was trained for 5 epochs (categorical cross-entropy), compared against original BERT and a regex baseline. Inference was run on 9,522,262 notes from 401,826 patients with notes in the 3 years pre-baseline. Post-processing included token extension, unit harmonization (kg, cm, mmHg), physiologic plausibility constraints (height 91–305 cm; weight 20–450 kg; SBP 50–300 mmHg; DBP 20–200 mmHg), and filtering of “optimal weight” phrases. Accuracy/yield ablation and manual validation (200 values) supported high accuracy.
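The post-processing stage can be sketched as unit harmonization followed by a plausibility filter. The plausibility windows below are the ones reported in the paper; the conversion table is an illustrative assumption and does not reproduce the authors' full harmonization logic.

```python
# Physiologic plausibility windows from the paper (harmonized units: cm, kg, mmHg).
PLAUSIBLE = {
    "height": (91.0, 305.0),
    "weight": (20.0, 450.0),
    "sbp": (50.0, 300.0),
    "dbp": (20.0, 200.0),
}

# Illustrative conversions to harmonized units; assumed, not from the paper.
CONVERT = {
    ("height", "in"): lambda v: v * 2.54,
    ("weight", "lb"): lambda v: v * 0.45359237,
}

def harmonize(vital, value, unit):
    """Convert an extracted value to harmonized units, then drop it if it
    falls outside the physiologic plausibility window (returns None)."""
    value = CONVERT.get((vital, unit), lambda v: v)(value)
    lo, hi = PLAUSIBLE[vital]
    return value if lo <= value <= hi else None
```

The full pipeline additionally extends tokens around model predictions and filters phrases such as "optimal weight" that do not describe the patient's measured value.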
Risk models and outcomes: Two validated scores were implemented: Pooled Cohort Equations (PCE) for 10-year MI/stroke risk and CHARGE-AF for 5-year AF risk. Baseline exposures (age, sex, race, height, weight, BP, smoking, lipids, comorbidities) were derived from structured EHR, with NLP values used when tabular values were missing (within 3 years of baseline; height from any time). Outcomes: AF was defined using a validated EHR algorithm (PPV 92%); MI and stroke required ≥2 ICD codes from validated sets (PPV ≥85%). Analyses were restricted to published age ranges (PCE 40–79; CHARGE-AF 46–90) and disease-free at baseline.
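The tabular-first, NLP-fallback rule for baseline exposures can be sketched as below. The trailing 3-year window arithmetic and the tuple representation are illustrative assumptions; the height exemption (usable from any time) is noted but not implemented in this minimal version.

```python
from datetime import date

def baseline_value(tabular, nlp, baseline, window_days=3 * 365):
    """Pick a baseline exposure value: prefer the structured (tabular)
    measurement, falling back to the NLP-extracted one, and require the
    measurement to fall within `window_days` before baseline.
    `tabular` and `nlp` are (value, measurement_date) tuples or None."""
    for source in (tabular, nlp):  # tabular first, NLP as fallback
        if source is None:
            continue
        value, when = source
        if 0 <= (baseline - when).days <= window_days:
            return value
    return None
```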
Comparator cohorts: Convenience Samples were built from the same source EHR by including all individuals with complete score components (no primary care requirement). Baseline was the earliest time all components were present within a 3-year window. Individuals without follow-up or with prevalent outcome at baseline were excluded. Follow-up ended at last encounter.
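One plausible reading of the convenience-sample baseline rule (earliest time all score components are present within a 3-year window) is sketched below; the exact window semantics are an assumption.

```python
from datetime import date, timedelta

def convenience_baseline(component_dates, window=timedelta(days=3 * 365)):
    """Earliest measurement date at which every score component has been
    observed within a trailing 3-year window, or None if no such date
    exists. `component_dates` maps component name -> list of dates."""
    candidates = sorted({d for dates in component_dates.values() for d in dates})
    for t in candidates:
        if all(any(t - window <= d <= t for d in dates)
               for dates in component_dates.values()):
            return t
    return None
```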
Statistical analysis: For longitudinal analyses, person-time ended at event, death, last encounter, age 90, or August 31, 2019. Performance metrics included HR per 1-SD increase (Cox models), discrimination (IPC-weighted c-index), and calibration via GND test, calibration slope, and Integrated Calibration Index (ICI). Recalibration to the cohort baseline hazard was performed. Kaplan–Meier cumulative risk and stratified curves by predicted risk were produced. Bootstrap CIs (500–1000 iterations) were used for ICI and comparisons. Analyses used Python 3.8 and R 4.0.
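The censoring rule for person-time reduces to taking the earliest of the candidate end dates; a minimal sketch, assuming each date is supplied precomputed (e.g., the 90th-birthday date):

```python
from datetime import date

ADMIN_CENSOR = date(2019, 8, 31)  # administrative censoring date from the paper

def end_of_followup(event=None, death=None, last_encounter=None, age90=None):
    """Person-time ends at the earliest of: event, death, last encounter,
    the 90th birthday, or administrative censoring (Aug 31, 2019)."""
    dates = [d for d in (event, death, last_encounter, age90) if d is not None]
    return min(dates + [ADMIN_CENSOR])
```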
Key Findings
Cohort: C3PO included 520,868 individuals (mean age 48 years; 61% women), with a median follow-up of 7.2 years (Q1 2.6, Q3 12.9) and a median of 30 office visits and 13 primary care visits per person. Convenience Samples had shorter follow-up and fewer visits.
NLP recovery: At baseline, 286,009 individuals (54.9%) had all four vitals (height, weight, SBP, DBP) from tabular data; after NLP, this increased to 358,411 (68.8%), a 31% reduction in missingness. NLP-extracted vitals agreed closely with tabular values obtained the same day: Pearson r for height 0.99, weight 0.97, SBP 0.95, DBP 0.95 (all p<0.01). Bland–Altman analyses showed good agreement (e.g., height limits of agreement −2.97 to 2.99 cm; weight −8.64 to 9.29 kg; SBP −9.85 to 9.67 mmHg; DBP −6.29 to 6.24 mmHg) without systematic bias. Compared to regex, the NLP model yielded more extractions with high accuracy.
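The Bland–Altman limits of agreement quoted above are the mean paired difference ± 1.96 standard deviations; a minimal sketch of that computation (standard formula, not the authors' code):

```python
import math

def bland_altman_limits(a, b):
    """95% limits of agreement (mean difference +/- 1.96 SD of paired
    differences) for two measurement methods, e.g. NLP vs. tabular vitals."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    sd = math.sqrt(sum((d - mean) ** 2 for d in diffs) / (n - 1))
    return mean - 1.96 * sd, mean + 1.96 * sd
```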
MI/stroke (PCE) in C3PO: N=198,184; 49,289 (24.9%) would have been excluded without NLP recovery. Ten-year cumulative MI/stroke risk 8.0% (95% CI 7.8–8.1), incidence 8.4/1000 person-years (95% CI 8.2–8.5). PCE HR per 1-SD ranged 2.04–2.51 across sex/race strata; c-index 0.724–0.768. Calibration showed ICI 0.012–0.030; GND χ² 21–487; calibration slopes 0.60–0.88 (best calibration in Black men: ICI 0.012, slope 0.88). Recalibration did not consistently improve fit.
MI/stroke in Convenience Sample: N=340,226; paradoxically higher 10-year risk 10.6% (95% CI 10.5–10.7) and incidence 11.7/1000 person-years (95% CI 11.5–11.8) despite lower baseline comorbidity. Early sharp rise in incidence shortly after baseline suggested prevalent disease misclassified as incident. Discrimination similar (c-index 0.727–0.770), but calibration worse: ICI 0.028–0.046; GND χ² 36–1797; slopes 0.56–0.87. Recalibration did not fully correct miscalibration (e.g., recalibrated ICI up to 0.047).
AF (CHARGE-AF) in C3PO: N=174,644; 38,528 (22.1%) would have been excluded without NLP. Five-year AF events 7,877; cumulative risk 5.8% (95% CI 5.7–6.0), incidence 12.1/1000 person-years (95% CI 11.8–12.3). HR per 1-SD 2.56 (95% CI 2.50–2.61); c-index 0.782 (95% CI 0.777–0.787). Original CHARGE-AF underestimated risk (ICI 0.028; GND χ² 1,856); recalibration improved calibration (ICI 0.019; slope 0.77).
AF in Convenience Sample: N=501,272; higher 5-year risk 6.9% (95% CI 6.9–7.0) and incidence 15.1/1000 person-years (95% CI 14.9–15.3). Early incidence spike post-baseline again observed. Discrimination similar (c-index 0.781), but calibration worse (ICI 0.036; slope 0.69); remained less favorable after recalibration (ICI 0.028).
Overall: Selecting individuals with longitudinal primary care and using NLP to recover missing vitals reduced missingness, mitigated apparent ascertainment bias, reduced early incidence spikes, and yielded better-calibrated risk prediction than convenience sampling from the same EHR. NLP primarily increased effective sample size and precision without materially changing point estimates of model performance.
Discussion
By emulating a community-based cohort through selection of patients receiving longitudinal primary care and defining baseline at the second qualifying visit, C3PO reduced ascertainment bias and the misclassification of prevalent disease as incident events. Compared with convenience samples, C3PO exhibited lower early incidence spikes and better calibration of established risk models (PCE, CHARGE-AF) while maintaining similar discrimination, aligning more closely with performance seen in traditional cohorts. Deep-learning NLP substantially reduced vital sign missingness and improved statistical power with high agreement to structured data, illustrating that recovering actual values from unstructured notes is a practical strategy to address non-random missingness in EHR research. The JEDI pipeline facilitates scalable, modular integration of diverse EHR data, enabling future statistical and machine learning models with improved generalizability. Remaining miscalibration even within C3PO suggests room for further optimization via advanced recalibration, reweighting, or methods explicitly addressing residual biases and cohort-specific baseline hazards.
Conclusion
The study introduces C3PO, a half-million–person EHR cohort constructed to reduce bias by sampling individuals with longitudinal primary care and by recovering missing data through deep-learning NLP. Compared with convenience sampling, C3PO showed more plausible event trajectories and better-calibrated risk predictions for MI/stroke and AF, while NLP reduced vital sign missingness by about one-third with high accuracy. The publicly available JEDI pipeline supports scalable, harmonized data processing to enable broad discovery using heterogeneous EHR data. Future work should compare cohort design–based bias mitigation with alternative methods (e.g., inverse probability weighting), extend NLP extraction to other clinical variables (labs, imaging-derived measures), refine recalibration strategies, and evaluate generalizability across more diverse populations and external health systems.
Limitations
Key limitations include: (1) residual indication and selection biases inherent to EHR data despite primary care–based sampling; (2) focus on clinical risk scores to assess bias rather than broader frameworks (e.g., phenome-wide or genetic associations); (3) persistent missingness for other features (e.g., cholesterol), with extraction of some data types (labs, imaging) being more complex; (4) uncertain performance and transportability of the NLP model across external datasets; (5) potential misclassification of exposures and outcomes despite validated algorithms; (6) reliance on EHR codes and a curated list of in-network primary care locations, which may not generalize to other systems; (7) predominantly White cohort limiting generalizability to different racial/ethnic compositions; and (8) observational design precluding causal inference.