Medicine and Health

Vision-language foundation model for echocardiogram interpretation

M. Christensen, M. Vukadinovic, et al.

This summary covers research by Matthew Christensen, Milos Vukadinovic, Neal Yuan, and David Ouyang on EchoCLIP, a vision-language foundation model for echocardiography. Without task-specific training, the model can assess cardiac function and identify implanted intracardiac devices from echocardiogram images.

Introduction

Echocardiography, or cardiac ultrasound, is a crucial non-invasive method for evaluating heart function and diagnosing heart disease. Artificial intelligence (AI) has shown promise in improving the accuracy of echocardiographic measurements and disease diagnoses, but existing approaches often focus on narrow tasks and require separate training for each one, which limits their generalizability and scalability. Vision-language foundation models, which learn representations from large image and text datasets, offer a potential solution. These models encode images and text into compact representations, enabling them to generalize beyond predefined tasks. However, their application in biomedical imaging has been hampered by the limited availability of large, annotated clinical datasets. This study addresses this challenge by developing EchoCLIP, a foundation model trained on more than one million echocardiogram videos and their corresponding expert interpretations.

Literature Review

Recent advancements in AI have led to the development of vision-language foundation models that excel at generalizing beyond narrowly defined tasks. These models learn rich representations from large image and text datasets, enabling zero-shot performance on various downstream tasks. In the biomedical field, such models have been applied to biological and medical datasets, including modality-specific models for chest X-rays, retinal imaging, and pathology images. However, their application to echocardiography has been hindered by the limited size and diversity of available datasets. Prior AI models for echocardiography were often trained on far fewer examples, typically hundreds or thousands, compared with the roughly one million videos used in this research. This paper demonstrates the potential of a large, diverse dataset of echocardiograms and clinician interpretations for training a powerful foundation model.
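To make the zero-shot idea concrete, here is a minimal sketch of how a vision-language model can classify an image without task-specific training: the image embedding is compared with the embeddings of candidate text prompts, and the closest prompt wins. The three-dimensional vectors, the prompt strings, and the function names `cosine` and `zero_shot_label` are all hypothetical and chosen for illustration; a real model would produce the embeddings with its image and text encoders.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_label(image_embedding, prompt_embeddings):
    """Return the prompt whose embedding is closest to the image embedding."""
    scores = {
        prompt: cosine(image_embedding, emb)
        for prompt, emb in prompt_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

# Toy embeddings (assumed values, not model output).
prompts = {
    "pacemaker present": [0.9, 0.1, 0.0],
    "no pacemaker": [0.0, 0.2, 0.9],
}
image = [0.8, 0.3, 0.1]

label, scores = zero_shot_label(image, prompts)
print(label)
```

Because the decision depends only on similarity in the shared embedding space, new tasks can be posed by writing new prompts rather than by retraining the model.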

Methodology

EchoCLIP was trained on a dataset of 1,032,975 echocardiogram videos and corresponding expert text reports, sourced from over a decade of clinical imaging at Cedars-Sinai Medical Center. The dataset covers 224,685 echocardiography studies across 99,870 patients. A method for compressing echocardiography reports was developed to simplify the matching of clinical text assessments to images. The model employs a ConvNeXt-Base image encoder and a Byte-Pair Encoding (BPE) text tokenizer. A long-context variant, EchoCLIP-R, was also developed using a custom tokenizer based on common echocardiography concepts. The model's performance was evaluated on several benchmark tasks: assessing cardiac function through left ventricular ejection fraction (LVEF) and pulmonary artery pressure (PAP), identifying implanted intracardiac devices, identifying unique patients across multiple videos, and identifying clinical transitions such as heart transplants and cardiac surgery. The model's interpretability was explored using PromptCAM, a modified class activation mapping method, and UMAP for visualizing embeddings.
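The summary does not spell out the training objective. Assuming EchoCLIP follows the standard CLIP recipe its name suggests, the sketch below shows a symmetric contrastive loss over a batch of paired image and text embeddings: each image should score highest against its own report, and each report against its own image. The function names and the temperature of 0.07 are assumptions for illustration, not values taken from the study.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair i is (image_embs[i], text_embs[i])."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # Scaled cosine similarity between every image and every text.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image view
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# Toy batch: correctly paired embeddings versus deliberately swapped ones.
images = [[1.0, 0.0], [0.0, 1.0]]
matched_loss = clip_loss(images, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = clip_loss(images, [[0.0, 1.0], [1.0, 0.0]])
print(matched_loss < swapped_loss)
```

The loss is low when matching pairs sit on the diagonal of the similarity matrix and high when they do not, which is what pushes images and their reports toward the same region of the embedding space.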

Key Findings

EchoCLIP demonstrated strong performance across a range of tasks without explicit task-specific training. In assessing cardiac function, EchoCLIP achieved a mean absolute error (MAE) of 7.1% when predicting LVEF in an external validation dataset. For implanted device identification, EchoCLIP achieved area under the curve (AUC) scores of 0.84, 0.92, and 0.97 for pacemakers, percutaneous mitral valve repair, and artificial aortic valves, respectively. EchoCLIP-R accurately identified unique patients across multiple videos (AUC of 0.86), identified clinical transitions such as heart transplants (AUC of 0.79) and cardiac surgery (AUC of 0.77), and enabled robust image-to-text search. The study also showed that EchoCLIP prioritizes image features relevant to the associated text, indicating that the model learns semantically meaningful imaging features.
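For readers less familiar with the two metrics reported above, the following sketch computes MAE for a regression task such as LVEF prediction and AUC for a binary task such as device detection. The input values are made-up toy numbers, not data from the study, and the function names are illustrative.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between reference and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy LVEF values in percentage points (assumed, for illustration only).
lvef_true = [55, 60, 35, 25]
lvef_pred = [50, 62, 45, 30]
mae = mean_absolute_error(lvef_true, lvef_pred)
print(mae)  # 5.5

# Toy device-detection scores: 1 = device present, 0 = absent.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.4, 0.7, 0.8, 0.6]
auc = roc_auc(labels, scores)
print(auc)  # 4 of the 6 positive/negative pairs are ranked correctly
```

An MAE of 7.1% therefore means predictions were off by about 7 percentage points of ejection fraction on average, while an AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect ranking.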

Discussion

The results demonstrate the feasibility and effectiveness of training vision-language foundation models on large, diverse datasets of echocardiography studies and expert interpretations. EchoCLIP's performance on external validation data indicates that it generalizes and is robust to domain shift. The ability of EchoCLIP-R to perform tasks that are challenging for human clinicians highlights the potential of foundation models to streamline clinical workflows. The interpretability analysis supports the conclusion that the model learns semantically meaningful features, which may help build trust and support clinical adoption. By leveraging a large clinical database, the study reduces the need for laborious manual labeling, offering a scalable approach to developing advanced AI models in cardiology.

Conclusion

This research showcases the potential of vision-language foundation models for echocardiogram interpretation. EchoCLIP and EchoCLIP-R demonstrate strong performance on various tasks, including cardiac function assessment, device identification, and patient identification across time. The use of a large clinical database significantly reduces the need for manual labeling, offering a scalable approach for developing AI models in cardiology. Future work could focus on incorporating video encoders, using multiple echocardiographic views, and implementing automatic report generation to further enhance the model's capabilities.

Limitations

The study has several limitations. The model currently uses an image encoder instead of a video encoder, potentially missing motion-based information. Only the apical-four-chamber view was used, limiting the information captured. Future iterations should address both points by incorporating video encoders and data from multiple views. The external validation dataset also contains echocardiograms collected using a different methodology. Finally, the generalizability of the findings to other populations and clinical settings needs further investigation.
