TransformEHR: transformer-based encoder-decoder generative model to enhance prediction of disease outcomes using electronic health records

Medicine and Health

Z. Yang, A. Mitra, et al.

TransformEHR is a cutting-edge model, developed by Zhichao Yang, Avijit Mitra, Weisong Liu, Dan Berlowitz, and Hong Yu, that predicts disease outcomes from electronic health records. With strong results in forecasting pancreatic cancer onset and intentional self-harm among patients with PTSD, it promises a meaningful step forward for clinical interventions.

Introduction
The study addresses the challenge of predicting future diseases and outcomes from longitudinal EHR data, where patients often present with multiple co-occurring and correlated conditions. Prior EHR transformer models (e.g., MedBERT, BEHRT, BERT variants) commonly use masked code prediction objectives that predict only a fraction of ICD codes within a visit, potentially missing interdependencies among multi-morbid conditions. The research question is whether a generative encoder-decoder transformer pretrained to predict the complete set of ICD codes for the next visit, conditioned on longitudinal history, can better capture disease co-occurrence structures and improve downstream clinical predictions, including difficult tasks such as pancreatic cancer onset and intentional self-harm risk among PTSD patients. The purpose is to develop and evaluate TransformEHR with a novel pretraining objective and architecture, and to assess its generalizability across internal and external datasets. The importance lies in enabling accurate multi-disease predictions, improving early detection (e.g., pancreatic cancer), informing interventions for rare outcomes (e.g., self-harm), and facilitating transfer learning for institutions with limited data.
Literature Review
Early EHR prediction efforts used traditional ML models (logistic regression, SVMs, random forests, gradient boosting) for single outcomes (e.g., CHF, sepsis mortality, ventilation, type 2 diabetes, PTSD onset). With larger cohorts and compute, deep learning models, especially transformer-based approaches, have become state-of-the-art. BERT-based EHR models (e.g., MedBERT, BEHRT, BERTIm) rely on pretrain-finetune paradigms with masked language modeling adapted to EHRs (randomly masking a fraction of codes). However, such objectives may not reflect clinical reality where multiple, correlated conditions co-occur within a visit, and bidirectional encoder-only models may not align with the temporal past-to-future prediction use case. Generative sequence-to-sequence frameworks (e.g., transformer encoder-decoder with cross-attention) have shown advantages in sequence modeling tasks, motivating their adaptation to EHRs to predict complete future visit code sets and better model comorbidity structures.
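To make this contrast concrete, the minimal Python sketch below applies both objectives to a toy visit sequence; the helper names, the 15% masking rate, and the [MASK] token are illustrative assumptions rather than any model's exact preprocessing.

    import random

    # Toy longitudinal record: one inner list of ICD codes per visit,
    # ordered chronologically (codes here are arbitrary examples).
    visits = [
        ["I10", "E11.9"],            # visit 1
        ["E11.9", "N18.3"],          # visit 2
        ["I10", "E11.9", "F43.10"],  # visit 3 (most recent)
    ]

    MASK = "[MASK]"

    def mask_codes(visits, rate=0.15, seed=0):
        """BERT-style objective: randomly hide a fraction of individual codes
        anywhere in the record; the model recovers only those codes."""
        rng = random.Random(seed)
        masked, targets = [], []
        for visit in visits:
            out = []
            for code in visit:
                if rng.random() < rate:
                    out.append(MASK)
                    targets.append(code)
                else:
                    out.append(code)
            masked.append(out)
        return masked, targets

    def mask_next_visit(visits):
        """Visit-masking objective (TransformEHR-style): hide every code of the
        most recent visit and ask the model to generate its complete code set,
        conditioned on all earlier visits."""
        history, future = visits[:-1], visits[-1]
        return history, list(future)  # target keeps the full, ordered code set

    print(mask_codes(visits))       # a few scattered codes to recover
    print(mask_next_visit(visits))  # the whole next visit to generate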
Methodology
Data and cohorts: Pretraining used 6,475,218 patients from the US Veterans Health Administration (VHA) across >200 facilities, from 2016-01-01 to 2019-12-31. For evaluation, two disease/outcome agnostic prediction (DOAP) datasets were constructed: (1) common diseases/outcomes (10 ICD-10 codes with prevalence ≥2%); (2) uncommon diseases/outcomes (10 ICD-10 codes with prevalence 0.4%–0.5%, per FDA orphan guidelines). External generalizability was assessed on a non-VHA ICU cohort (MIMIC-III subset with ICD-10 coding), totaling 29,247 patients. Additional downstream tasks were pancreatic cancer onset and intentional self-harm among PTSD patients.

Input representation: Longitudinal EHR sequences were organized at the visit level. Predictors included demographics (gender, age, race, marital status) and ICD-10-CM diagnostic codes grouped by visit. Within a visit, ICD codes were ordered by provider-assigned priority (primary, secondary, etc.). Multi-level trainable embeddings were used: code embeddings for individual ICD codes, visit embeddings for visit identity, and time embeddings. Time was represented via positional embeddings applied either to absolute visit dates or to relative day differences from the last visit. The final representation of each token was the sum of its code, visit, and time embeddings (a minimal sketch of this representation follows below).

Pretraining objective (visit masking): Unlike prior EHR-BERT approaches that randomly mask a fraction of codes within visits, TransformEHR masks all codes in a single future visit and trains the model to generate the complete set of ICD codes for that visit, conditioned on prior visits (past-to-future). This objective encourages modeling of inter-code dependencies within a visit and of longitudinal dependencies across visits.

Architecture: TransformEHR is a transformer encoder-decoder model. The encoder is bidirectional (BERT-style) over the longitudinal EHR context. The decoder is unidirectional (left-to-right), generating the codes of the masked future visit sequentially in order of code priority (primary first, then secondary, etc.). Cross-attention allows the decoder to attend to encoder representations of relevant past visits and codes, and decoder self-attention allows conditioning on already generated codes when predicting subsequent codes within the same visit.

Pancreatic cancer cohort: ICD-10-CM C25 (3-digit) defined pancreatic cancer. Cases comprised 4,639 patients aged ≥45 with a first pancreatic cancer diagnosis 12–36 months after their last visit and without prior cancer diagnoses. Controls comprised 3,089 patients aged ≥45 without any cancer diagnosis.

Intentional self-harm among PTSD cohort: From the 5% split not used in pretraining, patients with PTSD were identified; first-time intentional self-harm cases were ascertained from VHA sources (CDW and SPAN) using ICD-10-CM codes mapped from validated ICD-9 definitions. The observed self-harm rate among PTSD patients was 1.9%. All inpatient and outpatient ICD codes were included.

Training details: Maximum sequence length was 512 ICD codes; during pretraining, 15% of visits were randomly dropped from inputs to prevent memorization. Hyperparameters included a warmup ratio of 0.1, learning rate of 1e-3, dropout of 0.1, and weight decay of 1e-3. Pretraining used 4 Nvidia Tesla P40 GPUs for approximately 6 days (~280k steps, batch size ~48). Baseline BERT models were pretrained on the same cohort for fairness. Finetuning appended a task-specific linear classification head for binary predictions (disease/outcome presence).
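As a concrete illustration of the input representation described above, the following minimal PyTorch sketch sums code, visit, and time embeddings into a single token representation; the vocabulary size, embedding dimension, and use of relative day offsets are illustrative assumptions, not the study's actual configuration.

    import torch
    import torch.nn as nn

    class EHRInputEmbedding(nn.Module):
        """Minimal sketch: each ICD-code token's representation is the sum of a
        code embedding, a visit embedding (which visit it belongs to), and a
        time embedding (here, days relative to the last visit). Sizes are
        illustrative, not the paper's actual configuration."""
        def __init__(self, vocab_size=10000, max_visits=100, max_days=4000, dim=256):
            super().__init__()
            self.code_emb = nn.Embedding(vocab_size, dim)
            self.visit_emb = nn.Embedding(max_visits, dim)
            self.time_emb = nn.Embedding(max_days, dim)

        def forward(self, code_ids, visit_ids, day_offsets):
            # All inputs: LongTensor of shape (batch, seq_len)
            return self.code_emb(code_ids) + self.visit_emb(visit_ids) + self.time_emb(day_offsets)

    # Toy batch: 1 patient, 5 ICD-code tokens spread over 3 visits.
    emb = EHRInputEmbedding()
    code_ids = torch.tensor([[11, 52, 52, 873, 11]])
    visit_ids = torch.tensor([[0, 0, 1, 2, 2]])         # which visit each code came from
    day_offsets = torch.tensor([[320, 320, 95, 0, 0]])  # days before the last visit
    tokens = emb(code_ids, visit_ids, day_offsets)
    print(tokens.shape)  # torch.Size([1, 5, 256])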
Model configuration and finetuning: TransformEHR used six layers in both encoder and decoder to match parameter counts with baselines. Models were finetuned with five random seeds; the train/validation/test split was 70/10/20, and one-sided Student's t-tests assessed significance (a sketch of this seed-wise comparison follows below). Each finetuning task completed within ~12 hours on a Tesla P40.

Ablations: Components assessed included (1) visit masking vs. code masking; (2) encoder-decoder vs. encoder-only; (3) time embedding strategies. Visit masking outperformed code masking (improvement 95% CI: 2.52%–2.96%).
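The five-seed comparison could be checked with a short script like the sketch below; the AUROC values are placeholders, and the independent-samples form of the one-sided Student's t-test is an assumption (a paired test over shared seeds and splits would also be reasonable).

    from scipy import stats

    # Hypothetical per-seed AUROC scores from finetuning each model with five
    # random seeds (placeholder numbers, not results from the paper).
    transformehr_auroc = [0.792, 0.788, 0.795, 0.790, 0.793]
    bert_auroc         = [0.771, 0.775, 0.769, 0.774, 0.772]

    # One-sided Student's t-test: is TransformEHR's mean AUROC greater than BERT's?
    t_stat, p_value = stats.ttest_ind(transformehr_auroc, bert_auroc, alternative="greater")
    print(f"t = {t_stat:.3f}, one-sided p = {p_value:.4f}")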
Key Findings
- Pretraining objective and architecture: TransformEHR's encoder-decoder with cross-attention and visit-masked generative pretraining improved EHR prediction across tasks.
- DOAP (future-visit ICD prediction): TransformEHR improved AUROC over BERT in both categories: +3.96% on 10 common diseases/outcomes and +5.92% on 10 uncommon diseases/outcomes, covering both new and recurrent occurrences.
- Pancreatic cancer onset: TransformEHR improved the area under the precision-recall curve by ~2% versus the prior model (p < 0.001); AUROC improved from 76.89 to 78.64.
- Intentional self-harm among PTSD patients: TransformEHR improved the area under the precision-recall curve by 24% versus BERT (13.34 to 16.67; 95% CI: 15.11–18.23; p = 0.007) and outperformed LSTM and logistic regression. PPV reached 8.8% among the top 10% highest-risk PTSD patients, exceeding cost-effectiveness thresholds for clinical screening; PPV ranged from approximately 3.14% to 8.50% across 10%–60% thresholds (a sketch of this PPV-at-threshold calculation follows below).
- Subgroup consistency: AUROC was consistent across genders, ages, races, and marital statuses, with no significant differences across age strata.
- Impact of history length: Attention-based models (TransformEHR, BERT) benefited from longer longitudinal histories (more than five prior visits), with TransformEHR AUROC improving by about 19% over short histories alone.
- Generalizability: Internally, performance on unseen VHA facilities was statistically similar to that on seen facilities. Externally (non-VHA ICU cohort), pretraining improved TransformEHR AUROC by 2.3% (95% CI: 0.8%–3.6%, p = 0.005), a stronger transfer gain than BERT's.
- Ablation: Visit masking yielded significant gains over code masking (95% CI of improvement: 2.52%–2.96%).
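For readers who want to reproduce the screening-style metric above, the sketch below computes the positive predictive value among the top fraction of patients ranked by predicted risk, on synthetic data; the score distribution and outcome rate are illustrative assumptions, not the study's data.

    import numpy as np

    def ppv_at_top_fraction(y_true, risk_scores, fraction=0.10):
        """PPV among the top `fraction` of patients ranked by predicted risk:
        of the flagged patients, what share actually had the outcome?"""
        y_true = np.asarray(y_true)
        risk_scores = np.asarray(risk_scores)
        n_flagged = max(1, int(np.ceil(fraction * len(risk_scores))))
        flagged = np.argsort(-risk_scores)[:n_flagged]  # highest-risk patients
        return y_true[flagged].mean()

    # Synthetic example: 1000 patients with a ~2% outcome rate, illustrative only.
    rng = np.random.default_rng(0)
    y_true = rng.binomial(1, 0.02, size=1000)
    risk_scores = rng.random(1000) + 0.3 * y_true  # scores weakly correlated with outcome
    print(f"PPV at top 10%: {ppv_at_top_fraction(y_true, risk_scores, 0.10):.3f}")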
Discussion
The results support the hypothesis that a generative encoder-decoder transformer pretrained to predict complete next-visit ICD code sets better captures comorbidity structures and temporal dependencies than encoder-only masked-code models. Cross-attention enables selective focus on clinically relevant past visits/codes for each predicted code, and decoder self-attention leverages already predicted primary diagnoses to inform secondary/rarer codes, improving uncommon code prediction. The model demonstrated substantial gains on both common and uncommon conditions, and on difficult tasks such as early pancreatic cancer onset and intentional self-harm risk among PTSD patients, where higher PPV indicates practical utility for targeted clinical interventions. Attention-based architectures benefited from longer longitudinal histories, aligning with the need to incorporate rich temporal context in EHRs. The strong internal and external generalizability indicates robust transfer learning, which is valuable for institutions with limited data. Overall, TransformEHR advances multi-disease and outcome prediction by aligning the pretraining objective with the clinical use case of past-to-future prediction and by modeling intra-visit code dependencies.
Conclusion
TransformEHR introduces a novel visit-masked, encoder-decoder generative transformer for EHR prediction that learns to generate complete next-visit ICD code sets from longitudinal histories. It establishes new state-of-the-art performance across multiple tasks, notably improving prediction for uncommon conditions and challenging outcomes such as pancreatic cancer onset and intentional self-harm among PTSD patients, while demonstrating strong internal and external generalizability. The approach enables multi-disease predictions from a single pretrained model and facilitates effective finetuning with limited data, offering practical benefits for clinical screening and intervention systems. Future work will incorporate additional EHR modalities (procedures, medications, labs, clinical notes), expand to more diseases and outcomes, refine outcome definitions (e.g., self-harm coding nuances), and address computational challenges associated with larger vocabularies and embeddings.
Limitations
- Input modalities: Pretraining and prediction used diagnostic ICD codes and demographics; omitting procedures, medications, labs, and unstructured note-derived phenotypes may limit performance. Integrating these would enlarge the vocabulary and embedding matrices, increasing memory and time costs.
- Task scope: Single-disease and single-outcome evaluations were limited (pancreatic cancer onset; intentional self-harm among PTSD patients); broader coverage is needed.
- Outcome definition: Intentional self-harm identification relied on ICD-10-CM codes mapped from validated ICD-9 definitions; some events may still be missed or misclassified. Contextual factors around PTSD and self-harm could improve accuracy.
- Computational constraints: Scaling to larger vocabularies and multimodal inputs requires increased GPU resources.
- Reporting: Some artifacts in tables suggest potential inconsistencies; rigorous standardization of metrics and thresholds across experiments would improve interpretability.