TransformEHR: transformer-based encoder-decoder generative model to enhance prediction of disease outcomes using electronic health records

Medicine and Health

Z. Yang, A. Mitra, et al.

TransformEHR is a transformer-based encoder-decoder model developed by Zhichao Yang, Avijit Mitra, Weisong Liu, Dan Berlowitz, and Hong Yu that predicts disease outcomes from electronic health records. With strong results in forecasting pancreatic cancer onset and intentional self-harm among patients with PTSD, it promises a meaningful step forward for clinical intervention.

Introduction
The widespread adoption of electronic health records (EHRs) has spurred the development of data mining techniques for predicting clinical diseases and outcomes. Early work used regression and traditional machine learning (ML) models, but deep learning models, particularly transformer-based models, have shown superior performance. These deep learning models often employ a pretraining-finetuning paradigm: pretraining on large datasets to learn better representations of clinical features (like ICD codes), followed by finetuning on specific tasks. Existing models like MedBERT, BEHRT, and BERTIM have limitations in their pretraining objectives, typically predicting only a fraction of ICD codes per visit. This study addresses this limitation by proposing a new pretraining strategy that predicts the complete set of disease codes and outcomes at each visit, leveraging the complex interrelations among different diseases and outcomes. This new approach is crucial because many diseases are correlated (e.g., obesity, diabetes, hypertension), and their collective influence on disease trajectories is significant. The study aims to improve clinical predictive modeling by using a novel pretraining objective and exploring a generative encoder-decoder framework, addressing the limitations of existing encoder-only models.
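To make the new objective concrete, the toy Python sketch below (illustrative data and function names, not the authors' code) contrasts BERT-style masking of a random fraction of individual codes with whole-visit masking, which forces the model to generate a future visit's complete code set from the patient's history:

    import random

    # A patient's longitudinal record as a list of visits; each visit holds
    # the full set of ICD-10-CM codes assigned at that encounter (toy data).
    patient = [
        ["E66.9", "I10"],           # visit 1: obesity, hypertension
        ["E11.9", "I10", "E78.5"],  # visit 2: diabetes, hypertension, hyperlipidemia
        ["I50.9"],                  # visit 3: heart failure
    ]

    def bert_style_mask(visits, frac=0.15):
        """Mask a random fraction of individual codes across all visits,
        as in encoder-only pretraining objectives."""
        flat = [(i, j) for i, v in enumerate(visits) for j in range(len(v))]
        chosen = random.sample(flat, max(1, int(frac * len(flat))))
        masked, targets = [list(v) for v in visits], {}
        for i, j in chosen:
            targets[(i, j)] = masked[i][j]
            masked[i][j] = "[MASK]"
        return masked, targets

    def whole_visit_mask(visits):
        """Hold out the entire final visit: the model must predict every
        code of the future visit from the preceding history."""
        history, future_visit = visits[:-1], visits[-1]
        return history, future_visit

Because the whole-visit objective asks for every code of the future visit at once, correlated conditions (e.g., obesity, diabetes, and hypertension) must be predicted jointly rather than one masked token at a time.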
Literature Review
Prior research has successfully used longitudinal EHRs to predict clinical outcomes, applying regression analysis and traditional machine learning (ML) techniques such as support vector machines, random forests, and gradient boosting to single diseases or outcomes, including congestive heart failure, septic mortality, mechanical ventilation, type 2 diabetes, and PTSD. With the advent of large datasets and greater computational resources, deep learning models, particularly transformer-based models, have emerged as state-of-the-art (SOTA) approaches. These often utilize a pretrain-finetune paradigm, in which models are first trained on large EHR datasets and then further trained on specific tasks. However, existing SOTA models like MedBERT, BEHRT, and BERTIM have limitations in their pretraining objectives, often focusing on predicting a subset of ICD codes within each visit and neglecting the correlations among the multiple diseases or outcomes a patient may experience concurrently. This study aims to advance the state of the art by addressing these limitations.
Methodology
TransformEHR utilizes a transformer-based encoder-decoder architecture. The encoder processes longitudinal EHR data, including demographic information and ICD-10-CM codes, representing each visit with a visit embedding, each ICD code with a code embedding, and time with a positional or relative time embedding; visit embeddings also allow provider information to be integrated. The encoder is bidirectional (left-to-right and right-to-left), capturing both past and future context within the input history. Given the encoder's representation and the information from previous visits, the decoder generates the complete set of ICD codes for a future visit. This supports a novel pretraining objective, predicting all ICD codes of a future visit, unlike previous models that predicted only a fraction of codes. The decoder uses cross-attention to identify relevant ICD codes from previous visits and generates codes sequentially in order of priority (primary diagnosis first). The model is pretrained on a large cohort (6.5 million patients) of longitudinal EHR data from the Veterans Health Administration (VHA) and then finetuned on specific tasks. The study evaluates TransformEHR on disease/outcome agnostic prediction (DOAP) datasets covering common and uncommon diseases, and on two specific tasks: pancreatic cancer onset and intentional self-harm among patients with PTSD. Generalizability is assessed on internal (unseen VHA data) and external (MIMIC-III) datasets. Ablation studies evaluate the individual contributions of visit masking, the encoder-decoder architecture, and the time embedding. Comparisons are made with baseline models such as logistic regression, LSTM, and BERT, using AUROC, AUPRC, sensitivity, and positive predictive value (PPV).
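A minimal PyTorch sketch of an architecture along these lines follows; the dimensions, vocabulary sizes, and exact wiring are assumptions for illustration, not the authors' released implementation. Each input token (an ICD code) is embedded as the sum of a code, visit, and time embedding; the encoder reads the history bidirectionally, and the decoder generates the future visit's codes autoregressively with cross-attention to that history:

    import torch
    import torch.nn as nn

    class TransformEHRSketch(nn.Module):
        """Encoder-decoder sketch with summed code/visit/time embeddings
        (all sizes are illustrative assumptions)."""

        def __init__(self, n_codes, n_visits=64, n_time_buckets=512, d=256):
            super().__init__()
            self.code_emb = nn.Embedding(n_codes, d)
            self.visit_emb = nn.Embedding(n_visits, d)
            self.time_emb = nn.Embedding(n_time_buckets, d)
            self.transformer = nn.Transformer(
                d_model=d, nhead=8,
                num_encoder_layers=4, num_decoder_layers=4,
                batch_first=True,
            )
            self.out = nn.Linear(d, n_codes)  # logits over the ICD vocabulary

        def forward(self, codes, visit_ids, time_ids,
                    tgt_codes, tgt_visit_ids, tgt_time_ids):
            # Encoder input: the full history, attended bidirectionally.
            src = (self.code_emb(codes) + self.visit_emb(visit_ids)
                   + self.time_emb(time_ids))
            # Decoder input: the future visit's codes generated so far, in
            # priority order (primary diagnosis first) under a causal mask.
            tgt = (self.code_emb(tgt_codes) + self.visit_emb(tgt_visit_ids)
                   + self.time_emb(tgt_time_ids))
            causal = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
            h = self.transformer(src, tgt, tgt_mask=causal)  # cross-attends to history
            return self.out(h)

Pretraining would apply a cross-entropy loss between these logits and the held-out visit's full code sequence; finetuning swaps in a task-specific target such as pancreatic cancer onset.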
Key Findings
TransformEHR demonstrates superior performance compared to existing methods. In the disease/outcome agnostic prediction (DOAP) task, it shows a 3.96% relative increase in AUROC for common diseases/outcomes and a 5.92% relative increase for uncommon ones compared to BERT. For the specific prediction tasks, TransformEHR significantly improved performance on pancreatic cancer onset and intentional self-harm among PTSD patients. Specifically, for intentional self-harm in patients with PTSD, TransformEHR reached an AUPRC of 16.67% (p = 0.007), a 24% relative improvement over BERT. The model achieved a PPV of 8.8% for intentional self-harm prediction among the top 10% of high-risk PTSD patients, exceeding the threshold for cost-effectiveness of suicide prevention interventions. Ablation studies confirmed the positive impact of visit masking, the encoder-decoder architecture, and the time embedding. Generalizability evaluations on internal (unseen VHA data) and external (MIMIC-III) datasets demonstrate TransformEHR's strong transfer learning capabilities.
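The PPV-among-top-10% figure corresponds to a standard ranking evaluation, sketched below with synthetic data (the function name and event rate are illustrative, not the study's):

    import numpy as np

    def ppv_at_top_fraction(y_true, risk_scores, frac=0.10):
        """PPV among the top fraction of patients ranked by predicted risk:
        of those flagged highest-risk, what share actually had the event?"""
        n_flagged = max(1, int(frac * len(risk_scores)))
        top_idx = np.argsort(risk_scores)[::-1][:n_flagged]
        return y_true[top_idx].mean()

    # Synthetic cohort: 1,000 patients, ~1% event rate, weakly informative scores.
    rng = np.random.default_rng(0)
    y = (rng.random(1000) < 0.01).astype(int)
    scores = rng.random(1000) + 0.5 * y
    print(f"PPV among top 10% risk: {ppv_at_top_fraction(y, scores):.3f}")

With a rare outcome such as intentional self-harm, a PPV of 8.8% in the top risk decile sits far above the base rate, which is what makes it meaningful as a screening signal.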
Discussion
TransformEHR's superior performance is attributed to its novel pretraining objective, which captures the complex interrelationships between different diseases and outcomes, and its encoder-decoder architecture, which allows for effective use of past and future context. The use of cross-attention and decoder self-attention enables the model to focus on relevant past information and generate a complete set of ICD codes for a future visit. The high PPV for intentional self-harm prediction highlights TransformEHR's potential for developing effective clinical screening and intervention tools. The model's generalizability suggests its applicability across diverse healthcare settings with varying data resources. The results underscore the importance of pretraining in improving the quality of latent representations and the effectiveness of using longitudinal EHR data for accurate prediction of complex clinical events. The improvements seen in predicting uncommon diseases highlight the model's ability to handle imbalanced datasets, a significant challenge in clinical prediction.
Conclusion
TransformEHR offers a significant advancement in clinical outcome prediction using EHR data. Its superior performance, generalizability, and potential for practical application in suicide prevention make it a promising tool for improving healthcare. Future work should explore incorporating additional data sources (procedure codes, medications, lab results, free-text notes), expanding the range of diseases and outcomes considered, and addressing limitations in how intentional self-harm is represented in ICD codes. Further evaluation of cost-effectiveness and integration into clinical workflows would also be valuable.
Limitations
The study has several limitations. The model currently only uses diagnostic ICD codes and demographic information; incorporating other data types (procedure codes, medications, lab results, free-text notes) could improve performance. The evaluation of single disease prediction was limited to pancreatic cancer, and intentional self-harm prediction was restricted to PTSD patients. The representation of intentional self-harm using ICD codes might not capture all cases perfectly. The computational cost of training the model with a larger vocabulary is significant, requiring substantial computational resources.