Med-BERT: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction

L. Rasmy, Y. Xiang, et al.

Med-BERT is a contextualized embedding model tailored to structured electronic health records, developed by Laila Rasmy, Yang Xiang, Ziqian Xie, Cui Tao, and Degui Zhi. Pretrained on large-scale EHR data, it improves disease prediction accuracy across clinical databases and reduces the amount of labeled data needed, pointing toward more cost-efficient AI in healthcare.

Introduction
The study addresses the challenge that deep learning models for disease prediction from EHRs require large labeled datasets, which are often unavailable due to rarity of conditions, data quality issues, and governance barriers. The research question is whether a BERT-style, self-supervised pretraining on massive structured EHR data can learn contextualized representations that transfer to downstream clinical prediction tasks, especially when fine-tuning data are limited. The authors propose Med-BERT, adapting BERT’s architecture and pretraining strategy to the structured, hierarchical, and temporally sparse modality of EHRs, and evaluate its ability to improve disease prediction accuracy and generalizability across institutions.
Literature Review
Prior transfer learning in EHRs has included reusing internal layers of trained deep models (e.g., RNNs), with mixed generalizability, and static embedding approaches (word2vec, GloVe) that lack deep contextualization. Contextualized models from NLP (ULMFiT, ELMo, BERT, XLNet) have transformed language tasks and have begun appearing in clinical domains. The two most closely related EHR works are BEHRT and G-BERT: BEHRT used masked LM pretraining on CPRD primary care data with a small vocabulary and evaluated diagnosis prediction; G-BERT integrated ontologies via a graph neural network and a modified masked LM on MIMIC-III ICU data. Limitations of these works include smaller vocabularies and cohorts, different or non-standard evaluation metrics, and limited cross-dataset validation. Med-BERT differs by using a much larger, widely adopted ICD-9/10 vocabulary (~82k codes), a much larger multi-institutional pretraining cohort, domain-specific input embeddings, dual pretraining tasks, and evaluation on clinically stringent phenotypes with cross-dataset testing.
Methodology
Data sources: Pretraining used Cerner Health Facts (v2017), a de-identified multi-institutional EHR database (>600 sites, ~68M patients overall). The Med-BERT pretraining cohort comprised 28,490,650 patients. Downstream evaluations covered two tasks: (1) diabetes-to-heart failure (DHF) prediction in Cerner (DHF-Cerner), and (2) pancreatic cancer (PaCa) prediction in Cerner (PaCa-Cerner) and in Truven MarketScan claims (PaCa-Truven, v2015). Cohort characteristics included an average age of ~41 years (pretraining) and 61–65 years (evaluation), an average of 7–19 visits per patient, and varying ICD-10 prevalence across datasets.

EHR representation: Each patient is a sequence of visits, and each visit is a list of clinical diagnosis codes (ICD-9/10). Codes within a visit were ordered using priority information (e.g., principal vs. secondary diagnoses). Patients with fewer than three diagnosis codes or inconsistent records were excluded. The vocabulary was the union of ICD-9 and ICD-10 diagnosis codes (~82,603 tokens in pretraining).

Model architecture and inputs: Med-BERT adopts BERT's transformer architecture with stacked multi-head self-attention layers. Input embeddings comprise (1) code embeddings (the diagnosis token), (2) serialization embeddings (relative priority order within a visit), and (3) visit embeddings (position of each visit within the patient sequence). Unlike NLP BERT, [CLS]/[SEP] tokens were not used. For sequence-level predictions, outputs from codes within visits are aggregated and passed through a feed-forward layer to form sequence representations.

Pretraining tasks: Two self-supervised tasks were used: (a) masked language modeling (masked LM) over diagnosis codes, which predicts masked codes from bidirectional context using standard masking strategies (e.g., the majority replaced by [MASK]); and (b) prolonged length-of-stay (LOS) prediction, a patient-level binary classification of whether the patient had any hospitalization with LOS > 7 days. Together these tasks capture code-level and sequence-level contextual dependencies.

Training details: Maximum sequence length of 512 tokens; one diagnosis code masked per patient per input sequence for masked LM; AdamW optimizer with a 5e-5 learning rate; dropout of 0.1; implemented with the TensorFlow BERT code; trained on an Nvidia Tesla V100 (32 GB) for ~3 weeks (~45 million steps); pretraining data split approximately 70/20/10 into train/validation/test.

Fine-tuning and evaluation: For each downstream task, a prediction head (e.g., a simple feed-forward layer, or an RNN such as GRU/Bi-GRU/RETAIN rolled over the token embeddings) is attached on top of the pretrained Med-BERT, and both the Med-BERT and head parameters are fine-tuned via gradient descent. Cohorts are randomly split 70/10/20 into train/validation/test, ensuring test patients are not in the pretraining set. The primary metric is AUC. Baselines include GRU, Bi-GRU, and RETAIN; static word2vec-style embeddings (t-W2V) with the same base models; logistic regression (LR) and random forest (RF); and an untrained (randomly initialized) Med-BERT with identical architecture, used to isolate the effect of pretraining.
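The three-part input embedding can be illustrated with a short sketch. The snippet below is a minimal, hypothetical PyTorch rendering of the idea (the authors' implementation builds on the TensorFlow BERT code); the vocabulary size matches the reported ~82,603 codes, but the embedding dimension, maximum visit count, and the toy patient are illustrative assumptions rather than the paper's settings.

```python
# Minimal sketch of Med-BERT's three input embeddings (code, serialization
# within a visit, visit position), summed per token as in BERT.
# Dimensions and the toy patient are illustrative, not the authors' values.
import torch
import torch.nn as nn

class MedBertInputEmbeddings(nn.Module):
    def __init__(self, vocab_size=82603, max_serial=50, max_visits=512, dim=192):
        super().__init__()
        self.code_emb = nn.Embedding(vocab_size, dim)    # one vector per ICD-9/10 code
        self.serial_emb = nn.Embedding(max_serial, dim)  # priority order inside a visit
        self.visit_emb = nn.Embedding(max_visits, dim)   # index of the visit in the sequence
        self.dropout = nn.Dropout(0.1)

    def forward(self, code_ids, serial_ids, visit_ids):
        # All inputs are (batch, seq_len) integer tensors aligned token by token.
        x = self.code_emb(code_ids) + self.serial_emb(serial_ids) + self.visit_emb(visit_ids)
        return self.dropout(x)

# Toy patient: two visits, codes already ordered by diagnosis priority.
code_ids   = torch.tensor([[101, 2054, 77, 5]])  # hypothetical code indices
serial_ids = torch.tensor([[0,   1,    0,  1]])  # position within each visit
visit_ids  = torch.tensor([[0,   0,    1,  1]])  # which visit each code belongs to
emb = MedBertInputEmbeddings()(code_ids, serial_ids, visit_ids)
print(emb.shape)  # torch.Size([1, 4, 192])
```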
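The two pretraining targets can likewise be derived from a patient record with straightforward logic. The sketch below assumes a toy record structure (a list of visits, each with its diagnosis codes and a length of stay in days) that is not the Cerner schema, and masks a single code per input sequence as described above.

```python
# Hedged sketch of deriving the two pretraining targets for one patient:
# (a) a masked-LM target (one hidden diagnosis code) and
# (b) the prolonged-LOS label (any hospitalization longer than 7 days).
import random

MASK_TOKEN = "[MASK]"

def build_pretraining_example(visits, rng):
    """visits: list of dicts like {"codes": [...], "los_days": int} (toy schema)."""
    # Flatten diagnosis codes across visits, keeping visit order.
    tokens = [code for visit in visits for code in visit["codes"]]

    # Masked LM: hide one diagnosis code and keep it as the prediction target.
    mask_pos = rng.randrange(len(tokens))
    mlm_label = tokens[mask_pos]
    masked_tokens = list(tokens)
    masked_tokens[mask_pos] = MASK_TOKEN

    # Prolonged LOS: 1 if any hospitalization lasted more than 7 days.
    prolonged_los = int(any(v["los_days"] > 7 for v in visits))

    return masked_tokens, mask_pos, mlm_label, prolonged_los

patient = [
    {"codes": ["E11.9", "I10"], "los_days": 2},              # visit 1
    {"codes": ["I50.9", "N18.3", "E78.5"], "los_days": 9},   # visit 2
]
print(build_pretraining_example(patient, random.Random(0)))
```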
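For fine-tuning, one of the heads described above (a bidirectional GRU rolled over the contextualized token outputs, followed by a linear classifier) might look like the following sketch; the `encoder` referenced in the comments is a stand-in for the pretrained Med-BERT transformer, and the hidden size is an assumption.

```python
# Minimal sketch of a fine-tuning head over Med-BERT token outputs:
# a bidirectional GRU whose final states feed a binary disease classifier.
import torch
import torch.nn as nn

class DiseasePredictionHead(nn.Module):
    def __init__(self, hidden=192):
        super().__init__()
        self.gru = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, 1)

    def forward(self, token_states):
        # token_states: (batch, seq_len, hidden) contextualized code embeddings.
        _, h_n = self.gru(token_states)                # h_n: (2, batch, hidden)
        summary = torch.cat([h_n[0], h_n[1]], dim=-1)  # final states of both directions
        return torch.sigmoid(self.classifier(summary)).squeeze(-1)

# During fine-tuning both the encoder and the head receive gradients, e.g.:
#   probs = head(encoder(code_ids, serial_ids, visit_ids))
#   loss = nn.functional.binary_cross_entropy(probs, labels.float())
```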
Key Findings
- Pretraining scale and coverage: 28,490,650 patients; an ICD-9/10 vocabulary of ~82,603 codes; an average of 8 visits per patient in pretraining.
- Across tasks, Med-BERT boosts AUC over base models by 1.21–6.14 percentage points:
  • DHF-Cerner: GRU 83.93 to GRU+Med-BERT 85.14 (+1.21); Bi-GRU 82.82 to 85.39 (+2.57).
  • PaCa-Cerner: GRU 78.26 to 82.13 (+3.87); Bi-GRU 76.09 to 82.23 (+6.14).
  • PaCa-Truven: GRU 78.17 to 83.07 (+4.90); Bi-GRU 76.79 to 80.57 (+3.78).
- Med-BERT alone with a feed-forward (FFL) head achieved competitive AUCs: 85.18 (DHF-Cerner), 81.67 (PaCa-Cerner), 79.98 (PaCa-Truven).
- An untrained (randomly initialized) Med-BERT underperforms the pretrained model (e.g., DHF-Cerner 82.76, PaCa-Cerner 15.16, PaCa-Truven 75.98), highlighting the benefit of pretraining.
- Compared with static embeddings (t-W2V), Med-BERT yields larger or comparable gains when added to the same base models.
- Small-sample efficiency: with limited fine-tuning data, Med-BERT can match or exceed models trained on datasets up to 10× larger (e.g., Bi-GRU+Med-BERT on 5,000 samples ≈ Bi-GRU alone on >50,000 samples), and in some settings boosted AUC by >20% at very small sample sizes.
- Cross-dataset generalization: significant gains on PaCa-Truven (claims data) indicate transferability beyond the Cerner pretraining source.
- Interpretability: attention visualizations show medically meaningful, context- and visit-dependent code relationships across layers, demonstrating contextualized representations.
Discussion
Med-BERT effectively transfers knowledge from massive, unlabeled EHR data to downstream disease prediction, improving accuracy and data efficiency. The model is particularly beneficial in low-data regimes (hundreds to a few thousand samples), enabling competitive performance where conventional deep models struggle. By using ICD-9/10 codes and multi-institutional pretraining, Med-BERT generalizes to external datasets (e.g., Truven), supporting deployment across institutions. Attention visualizations reveal layer-wise syntactic and semantic patterns, enabling interpretability and offering insights into temporal and contextual dependencies across visits. While LR can be competitive at very small sample sizes (<500), Med-BERT’s advantages grow with more data, aligning with clinical needs where moderate sample sizes are attainable. Overall, Med-BERT demonstrates that contextualized embeddings tailored to EHR structure can serve as a general-purpose foundation for diverse clinical prediction tasks.
Conclusion
The study presents Med-BERT, a BERT-style contextualized embedding model tailored to structured EHRs and pretrained on ~28.5M patients. Domain-specific input embeddings and dual pretraining objectives (masked LM and prolonged LOS) enable robust contextual representation learning. Extensive evaluations on diabetes-to-heart failure and pancreatic cancer prediction across Cerner and Truven datasets show consistent AUC improvements, strong performance with limited training data, cross-dataset generalizability, and interpretable attention patterns. Med-BERT can reduce data collection and labeling burdens and accelerate clinical AI development. Future work includes incorporating additional modalities (medications, procedures, labs, time intervals) and exploring alternative contextualized pretraining approaches and task-specific interpretability tools.
Limitations
- Inputs were limited to diagnosis codes (ICD-9/10); other modalities (medications, procedures, labs, vitals) were not included in this version.
- Time intervals between visits were not modeled explicitly, potentially missing temporal dynamics.
- For very small training sizes (<500), traditional models (e.g., logistic regression) can outperform deep models even with Med-BERT, limiting applicability in ultra-low-data settings.
- Some implementation details (e.g., exact masking proportions) are inherited from BERT but not fully elaborated; additional design choices may affect performance.
- Evaluation focused on two prediction tasks; broader task diversity and prospective validation remain future work.