Vision-language foundation model for echocardiogram interpretation
M. Christensen, M. Vukadinovic, et al.
Echocardiography is the most common non-invasive modality for evaluating cardiac structure and function and is central to diagnosis, risk stratification and treatment assessment in cardiology. Prior AI approaches in echocardiography have typically targeted narrow, task-specific objectives and have not leveraged vision-language foundation models that can generalize across tasks. Advances in representation learning with large-scale image–text data have enabled foundation models that perform zero-shot tasks with robustness beyond conventional CNNs. However, medical foundation models are often constrained by dataset size and heterogeneity. This study introduces EchoCLIP, a vision-language foundation model trained to align echocardiogram images with expert cardiologist interpretations, aiming to perform diverse interpretation tasks without task-specific training and to evaluate generalization across internal and external cohorts. The work also investigates a long-context variant, EchoCLIP-R, for improved report-level retrieval and temporal clinical reasoning, and proposes a multimodal interpretability approach (PromptCAM).
The paper situates EchoCLIP within prior work on AI for echocardiography that has improved measurement accuracy and disease detection but remained task-specific. It references the emergence of vision-language foundation models (e.g., CLIP) that learn transferable representations from image–text pairs and generalize to zero-shot tasks. In biomedical domains, related foundation models have been developed for chest X-rays, retinal imaging, wearable biosignals, and pathology, as well as large language models encoding clinical knowledge. A key bottleneck highlighted is the limited size and diversity of medical imaging datasets used for foundation model training, often relying on publicly available data that may not capture the full spectrum of clinical presentations. The study leverages large-scale clinical reporting databases to address these constraints and align imaging with clinician-generated text interpretations.
Data curation: Echocardiogram data were sourced from the Cedars-Sinai Medical Center echocardiography laboratory (2011–2022). From apical-four-chamber views, 1,032,975 video–text pairs were assembled from 224,685 studies across 99,870 patients. DICOM images were retrieved from a vendor-neutral archive, linked to structured cardiologist reports (Syngo Dynamics), de-identified, and converted to 224×224 RGB AVI videos. Splits by patient: training 921,981 videos (84,990 patients), validation 10,000 videos (5,358 patients), internal test 100,994 videos (10,001 patients). External validation used a random subset (n=5,000) of EchoNet-Dynamic (Stanford Healthcare).
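The patient-level split described above keeps every video from one patient in a single partition, so that test patients are never seen during training. A minimal sketch of that idea follows; the function name, the mapping format, and the split fractions are illustrative assumptions, not details taken from the study.

```python
import random
from collections import defaultdict

def split_by_patient(video_to_patient, val_frac=0.05, test_frac=0.10, seed=0):
    """Assign all videos of a patient to the same split.

    video_to_patient: dict mapping a video id to a patient id (hypothetical format).
    The fractions apply to patients, not videos, and are illustrative only.
    """
    patients = sorted(set(video_to_patient.values()))
    random.Random(seed).shuffle(patients)  # deterministic shuffle for a given seed
    n_val = int(len(patients) * val_frac)
    n_test = int(len(patients) * test_frac)
    val = set(patients[:n_val])
    test = set(patients[n_val:n_val + n_test])

    splits = defaultdict(list)
    for video, patient in video_to_patient.items():
        if patient in val:
            splits["val"].append(video)
        elif patient in test:
            splits["test"].append(video)
        else:
            splits["train"].append(video)
    return dict(splits)
```

Splitting on patient ids rather than video ids is what prevents leakage when one patient contributes several studies.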
Model design and training: EchoCLIP uses a CLIP-like architecture with a ConvNeXt-Base image encoder and a decoder-only transformer text encoder, initialized from weights pretrained on LAION-400M. Text was tokenized with GPT-2 BPE under a 77-token context limit, so the model was trained on report snippets rather than full reports. Training followed the CLIP contrastive objective, which aligns paired image and text embeddings (unit-normalized before cosine similarity), for 50 epochs with a batch size of 1,024 on two NVIDIA RTX A6000 (48 GB) GPUs (about 2 weeks). The learning rate was warmed up to 5e-5 over 2,000 steps and then cosine-decayed to zero. Sampling one random frame per video in each epoch served as augmentation. The best checkpoint was the one with the lowest mean cross-modal retrieval rank (MCMRR) on the validation set.
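The contrastive objective and the learning-rate schedule can be sketched in plain Python. This is an illustration under stated assumptions rather than the study's training code: the temperature is fixed here (CLIP learns it), the warmup is assumed to be linear, and the function names are hypothetical.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over a batch's cosine-similarity matrix.

    Row i of image_embs is paired with row i of text_embs.
    temperature is an assumed fixed value for this sketch.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]

    def cross_entropy(rows):
        # Mean of -log softmax(row)[k], where k is the index of the true pair.
        total = 0.0
        for k, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[k]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(transposed))

def learning_rate(step, total_steps, peak=5e-5, warmup=2000):
    """Warmup to `peak` (assumed linear), then cosine decay to zero."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))
```

The loss is low when each image embedding is closest to its own report embedding and high when pairs are mismatched, which is the property the retrieval metrics later measure.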
Long-context text encoding (EchoCLIP-R): A custom template-based tokenizer compressed the highly structured echocardiography reports using regex-driven phrase normalization, mapping text to a compact vocabulary (~770 tokens) plus number and severity tokens. Mean tokenized report length fell from 530.3±154.7 tokens (BPE) to 63.8±26.7 tokens, which made full-report inputs possible and improved retrieval at the expense of some classification and regression performance.
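To illustrate how a template-based tokenizer can shrink a structured report, the sketch below maps each recognized sentence to one template token followed by number or severity tokens. The three templates, the token names, and the sentence-splitting rule are hypothetical; the actual vocabulary of about 770 tokens is not reproduced here.

```python
import re

# Hypothetical templates: (template token, regex with capture groups for slots).
TEMPLATES = [
    ("<LVEF>", re.compile(r"ejection fraction (?:is|of) (\d+)\s*%")),
    ("<LA_DILATION>", re.compile(r"(mild|moderate|severe) left atrial dilation")),
    ("<PERICARDIAL_EFFUSION>", re.compile(r"(no|small|moderate|large) pericardial effusion")),
]

def tokenize_report(report):
    """Return a compact token list for a report.

    Sentences that match no template are dropped in this sketch.
    Splitting on '.' assumes integer-valued measurements only.
    """
    tokens = []
    for sentence in re.split(r"[.\n]+", report.lower()):
        for name, pattern in TEMPLATES:
            match = pattern.search(sentence)
            if match:
                tokens.append(name)
                for value in match.groups():
                    # Captured slots become number or severity tokens.
                    tokens.append(f"<NUM_{value}>" if value.isdigit() else f"<{value.upper()}>")
                break
    return tokens

report = ("Left ventricular ejection fraction is 55%. "
          "Moderate left atrial dilation. No pericardial effusion.")
print(tokenize_report(report))
```

Here a three-sentence report becomes six tokens, which shows the kind of compression that lets a full report fit inside a short context window.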
Task adaptation without supervised fine-tuning: For zero-shot classification, task-specific positive prompts were embedded and their cosine similarity with the video embeddings was computed, averaging across multiple semantically equivalent prompts and across frames (first 20 frames, stride 2). For regression (e.g., LVEF), prompt variants spanning candidate values (e.g., 0–100%) were embedded; per-frame similarities were ranked, the top 20% of candidate values were retained and aggregated, and their median was used as the final prediction. For EchoCLIP-R, full-report prompts sampled from the validation data were used instead of short phrases, to match the training distribution and improve performance.
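The zero-shot procedures can be sketched on precomputed embeddings. This is a simplified reading of the description above: similarities are averaged across frames before candidates are ranked (the exact order of aggregation is an assumption), and all function names are hypothetical.

```python
import math
import statistics

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def zero_shot_score(frame_embs, prompt_embs):
    """Classification score: mean cosine similarity over all (frame, prompt) pairs."""
    return statistics.fmean([cosine(f, p) for f in frame_embs for p in prompt_embs])

def zero_shot_regress(frame_embs, prompt_embs, values, keep_frac=0.2):
    """Regression: prompt_embs[i] embeds a prompt stating the candidate values[i].

    Rank candidates by frame-averaged similarity, keep the top `keep_frac`,
    and return the median of the retained candidate values.
    """
    scores = [zero_shot_score(frame_embs, [p]) for p in prompt_embs]
    ranked = sorted(zip(scores, values), reverse=True)
    keep = max(1, int(len(ranked) * keep_frac))
    return statistics.median([v for _, v in ranked[:keep]])
```

Taking the median of the best-matching candidates, rather than the single top match, makes the estimate less sensitive to one prompt that happens to score unusually high.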
Retrieval and similarity analyses: Retrieval measured image-to-report and report-to-image ranking on the internal test set of 21,484 unique reports/videos, reporting mean rank and MCMRR (average of both directions). Patient matching and clinical change detection used cosine similarity between EchoCLIP-R embeddings across pairs of studies, evaluating AUC for same-patient identification and pre-/post-procedure discrimination (cardiac surgery and orthotopic heart transplant). Temporal similarity trajectories were computed within ±200-day windows relative to procedures.
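The retrieval metrics can be computed from a square similarity matrix in which entry (i, j) compares video i with report j and the true pairs lie on the diagonal. The sketch below assumes that layout and resolves ties in favor of the true match; both choices are assumptions for illustration.

```python
def mean_rank(sim):
    """Mean rank of the true match, where rank 1 is best.

    sim[i][j] is the similarity of query i with candidate j;
    the true match of query i is candidate i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Count candidates scoring strictly higher than the true match.
        ranks.append(1 + sum(1 for s in row if s > row[i]))
    return sum(ranks) / len(ranks)

def mcmrr(sim):
    """Mean cross-modal retrieval rank: average of both retrieval directions."""
    transposed = [list(col) for col in zip(*sim)]
    return 0.5 * (mean_rank(sim) + mean_rank(transposed))

sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.2, 0.1, 0.7],
]
print(mean_rank(sim), mcmrr(sim))
```

In this toy matrix the second video ranks its own report second, so the image-to-report mean rank is 4/3, while every report ranks its own video first.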
Interpretability: PromptCAM, a cosine similarity-based saliency mapping approach, highlights image regions most associated with prompted text concepts (e.g., TAVR, Impella, Pacemaker, MitraClip). UMAP was applied to image embeddings to visualize clustering by disease states and measurements.
Evaluation: Zero-shot performance was assessed for device identification, chamber dilation, LV hypertrophy, pericardial effusion/tamponade, heart failure with reduced EF, LVEF thresholds, and quantitative tasks (LVEF, PAP). Metrics included AUC, MAE, sensitivity, specificity (often at Youden’s index), and cross-modal retrieval ranks.
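For reference, the classification metrics named above can be computed directly from scores and binary labels. The sketch below is a generic implementation, not the study's evaluation code; it assumes both classes are present and uses the pairwise (Mann-Whitney) definition of AUC.

```python
def auc(scores, labels):
    """Probability that a random positive scores above a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def youden_threshold(scores, labels):
    """Return (J, threshold, sensitivity, specificity) maximizing J = sens + spec - 1.

    A case is predicted positive when its score is >= threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = (-1.0, None, 0.0, 0.0)
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        sens, spec = tp / n_pos, tn / n_neg
        j = sens + spec - 1.0
        if j > best[0]:  # keep the lowest threshold among equal J values
            best = (j, t, sens, spec)
    return best

def mae(predictions, targets):
    """Mean absolute error, as used for LVEF and PAP."""
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)
```

Youden's index picks the operating point that balances sensitivity and specificity equally, which is why sensitivity and specificity reported at that threshold are comparable across tasks with different prevalence.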
- Scale and training: EchoCLIP was trained on 1,032,975 video–text pairs from 224,685 studies (99,870 patients) using contrastive image–text learning without task-specific labels.
- Zero-shot device identification (internal test): AUCs—MitraClip 0.97 (0.97–0.98), TAVR 0.92 (0.91–0.92), pacemaker/defibrillator leads 0.84 (0.84–0.85).
- Chamber size and pathology (internal test): Severe dilation AUCs—right atrium 0.97 (0.97–0.98), right ventricle 0.92 (0.91–0.92), left ventricle 0.92 (0.92–0.93), left atrium 0.91 (0.90–0.92). Tamponade AUC 0.96 (0.94–0.98). Severe LV hypertrophy AUC 0.82 (0.81–0.83). Extended Data Table 1 reports sensitivities/specificities (e.g., MitraClip sensitivity 0.89, specificity 0.94; tamponade sensitivity 0.86, specificity 0.95).
- Cardiac function (LVEF): Zero-shot MAE 8.4% (internal) and 7.1% (external EchoNet-Dynamic). Threshold classification AUCs: LVEF<50% 0.89–0.90; <40% 0.93–0.94; <30% 0.95–0.97.
- Pulmonary artery pressure (PAP): MAE 10.8 mm Hg (internal) and 10.8 mm Hg (external).
- Retrieval (EchoCLIP-R): Mean image-to-report rank ~209/21,484 and report-to-image ~203/21,484 (top 1% retrieval); top-10 retrieval rates ~33–34%. MCMRR: CLIP baseline 10,743.0; EchoCLIP 571.3; EchoCLIP-R 206.1 (best).
- Longitudinal similarity and clinical transitions (EchoCLIP-R): Same-patient identification AUC 0.86 (0.85–0.87). Pre-/post-cardiac surgery AUC 0.77 (0.75–0.79). Pre-/post-orthotopic heart transplant AUC 0.79 (0.76–0.82). Cosine similarity decreased after major procedures and correlated with changes in reported LVEF across studies.
- Generalization and robustness: Performance was similar across age, sex, and image quality strata; external validation confirmed generalizability for LVEF and PAP.
- Interpretability: PromptCAM localized clinically relevant regions for devices (TAVR, Impella, pacemaker, MitraClip). UMAP revealed clusters aligned with cardiovascular diseases and measurements.
The study demonstrates that aligning large-scale echocardiography videos with expert text interpretations yields a vision-language foundation model capable of accurate zero-shot interpretation across diverse tasks without supervised fine-tuning. EchoCLIP generalizes to an external healthcare system with LVEF estimation accuracy comparable to blinded human assessments and prior supervised models, suggesting robustness to domain shifts. EchoCLIP-R extends capabilities to long-context retrieval and temporal reasoning: matching images to comprehensive reports, recognizing the same patient across studies, and detecting clinically meaningful changes after major interventions. The interpretability approach (PromptCAM) indicates that the model has learned semantically meaningful imaging features tied to specific clinical concepts. Collectively, these results support the viability of medical foundation models trained on clinician-generated reporting to automate preliminary echocardiography interpretation, potentially improving access and efficiency, particularly in resource-limited or emergent settings.
EchoCLIP establishes a foundation model for echocardiography that learns from over 1 million image–text pairs to perform zero-shot device detection, structural and functional assessment, and quantitative estimation (LVEF, PAP), with strong internal and external performance. EchoCLIP-R enhances long-context understanding for retrieval and longitudinal similarity analysis, enabling patient identification across studies and detection of clinical transitions. The work contributes a scalable training paradigm leveraging routine clinical reports and introduces PromptCAM for multimodal interpretability. Future directions include integrating video encoders, incorporating multiple echocardiographic views and Doppler data, adapting the model for visual question answering, and developing automatic report generation, alongside rigorous validation toward regulatory approval and clinical deployment.
- Image-based encoder used instead of a video encoder; motion cues critical to echocardiography were not fully exploited.
- Only the apical-four-chamber view was used, excluding information from other standard views and Doppler modalities.
- EchoCLIP’s text encoder (BPE) required report snippets due to context limits; although EchoCLIP-R addressed this, it traded off some classification/regression performance for retrieval.
- Differences in institutional measurement techniques (e.g., LVEF estimation methods) may affect cross-site performance, though observed errors are within clinical variability.
- Specialist models may outperform on narrowly defined tasks despite EchoCLIP’s broad generalization.
- Training dataset is not publicly shareable due to identifiability concerns, potentially limiting external reproducibility (though external validation was performed).