A foundation model for clinical-grade computational pathology and rare cancers detection

E. Vorontsov, A. Bozkurt, et al.

Discover Virchow, a foundation model developed by a team including Eugene Vorontsov and Kristen Severson that excels at pan-cancer detection. Trained on data from over 100,000 patients, it achieves strong accuracy even on rare cancer variants, paving the way for new applications in computational pathology.

Introduction
Pathologic analysis of tissue is central to cancer diagnosis and treatment. The field is transitioning from glass slides to digitized whole-slide images (WSIs), enabling computational pathology to support diagnosis, characterization and discovery. Early AI systems focused on workflow augmentation; recent studies aim to infer prognosis, therapeutic response and genomic alterations directly from routine hematoxylin and eosin (H&E) WSIs, potentially reducing reliance on immunohistochemistry (IHC) and sequencing. Foundation models trained with self-supervised learning (SSL) on large image corpora have transformed vision tasks by learning generalizable embeddings without curated labels. In pathology, a successful foundation model should capture a wide spectrum of histomorphologic patterns (cellular morphology, tissue architecture, staining characteristics, nuclear morphology, mitoses, necrosis, inflammation, neovascularization and biomarker expression) to enable broad WSI-level predictions, including detection and subtyping of common and rare cancers, biomarker quantification, cell/event counting and therapy response prediction.

The authors introduce Virchow, a million-slide-scale pathology foundation model trained on ~1.5 million H&E WSIs from ~100,000 patients across 17 tissue groups, using a 632M-parameter ViT-H trained with DINOv2. The study evaluates Virchow embeddings on pan-cancer detection across nine common and seven rare cancers, generalization to external sites and unobserved tissues, comparisons against other foundation models and against clinical-grade specialist products, and digital biomarker prediction and tile-level linear-probing benchmarks.
Literature Review
The paper situates Virchow among prior advances in self-supervised and foundation models. Scaling laws in AI suggest that performance improves with model and data size. For natural images, models trained on millions to hundreds of millions of images achieve strong generalization, often using vision transformers. In pathology, recent works have trained self-supervised models on 30,000–400,000 WSIs with 28–307M parameters, showing that in-domain SSL features outperform natural-image pretraining and that performance scales with data and model size. Methods such as contrastive learning, masked image modeling (iBOT, MAE) and DINO/DINOv2 have been adapted to pathology. Earlier clinical-grade computational pathology models often used multiple-instance learning with ImageNet pretraining on datasets of ~10,000 WSIs for specific tissues (e.g., prostate, breast, lymph node). This work scales training to 1.5 million WSIs, employs a ViT-H with DINOv2, and compares against contemporary pathology foundation models (UNI, Phikon, CTransPath) as well as commercial specialist detectors to assess generalization, data efficiency and performance on rare cancers and biomarkers.
Methodology
Dataset and preprocessing: The training corpus comprises 1,488,550 de-identified H&E WSIs from 119,629 patients licensed from MSKCC, scanned at ×20 (0.5 μm/px) on Leica scanners and spanning 17 tissue groups and both biopsy (63%) and resection (37%) specimens. WSIs were downsampled 16× to detect foreground tissue via HSV thresholding; non-overlapping 224×224 tiles containing ≥25% tissue were collected. Approximately 13 billion candidate tiles were available; Virchow was trained on 2 billion tiles sampled with replacement.

Model and self-supervised training: Virchow is a ViT-H/14 (632M parameters) trained with DINOv2's student–teacher self-supervised algorithm. Two global and multiple local crops are taken per tile; the student matches the teacher's CLS token for global representations and masked patch tokens for local representations, and the teacher is an exponential moving average (EMA) of the student. Training used AdamW (β1=0.9, β2=0.999), fp16, 131,072 prototypes, a teacher temperature schedule of 0.04–0.07 over 186k iterations, and a reciprocal-sqrt learning-rate schedule with 495k warmup and a linear cooldown over the last 819,200 iterations. Distributed batches sampled one WSI per GPU with 256 tiles per WSI.

Embeddings: For each 224×224 tile, the Virchow embedding is the concatenation of the CLS token and the mean of the remaining 256 patch tokens, yielding a 2,560-d vector.

Pan-cancer detection task: Weakly supervised multiple-instance learning (MIL) with the Agata aggregator converts per-tile embeddings into specimen-level predictions. Agata employs a memory-efficient cross-attention variant with learned queries and GELU-projected keys/values, weighting tiles at cost linear in the number of tiles. Training used cross-entropy and AdamW (learning rate 3e-4) for 25 epochs; for each specimen, gradients were backpropagated through only the slide with the highest predicted cancer probability. Baseline aggregators were trained identically on embeddings from UNI, Phikon and CTransPath.

Training data: 89,417 slides across 40,402 specimens from the MSKCC corpus, with specimen-level cancer labels extracted via rule-based NLP from clinical reports (block-level labels for prostate).

Evaluation data: 22,932 slides from 6,142 specimens across 16 cancer types, split into internal (MSKCC; 3,033 specimens) and external (3,109 specimens) cohorts. Cancers were stratified into common (breast, prostate, lung, colon, skin, bladder, uterus, pancreas, head & neck) and rare (liver, stomach, brain, ovary, cervix, testis, bone) groups; some tissues (cervix, testis, head & neck) were unobserved during Virchow or aggregator training.

Metrics: AUC and specificity at 95% sensitivity with 95% CIs; pairwise DeLong's test for AUC, and Cochran's Q followed by McNemar's test for specificity, with Holm correction.

Clinical-grade comparisons: The Virchow-based pan-cancer model was compared against the specialist commercial models Paige Prostate, Paige Breast and Paige Breast Lymph Node (BLN) on their product testing datasets and on rare-variant benchmarks curated by pathologists. Training set sizes for the products substantially exceeded the tissue-specific counts available to the pan-cancer model. AUCs and stratified analyses (e.g., macrometastases vs micrometastases) were reported with statistical testing.

Error analysis: A pathologist reviewed errors at an operating point targeting ~95% sensitivity and ~85% specificity per tissue on a pan-tissue benchmark (2,419 slides across 18 tissues); free-text annotations were categorized into failure modes for false negatives and false positives.
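The HSV-based foreground filtering described above is reported only at a high level. The sketch below is a minimal illustration, assuming a 16×-downsampled RGB thumbnail; the saturation cutoff (0.07) and the helper names (`tissue_mask`, `keep_tile`) are assumptions, not the paper's exact values.

```python
import numpy as np
from PIL import Image

def tissue_mask(thumbnail: Image.Image, sat_thresh: float = 0.07) -> np.ndarray:
    """Binary foreground mask from a 16x-downsampled WSI thumbnail.

    Stained tissue is saturated in HSV space, while glass background is
    near-white (low saturation). The 0.07 cutoff is illustrative.
    """
    hsv = np.asarray(thumbnail.convert("HSV"), dtype=np.float32) / 255.0
    return hsv[..., 1] > sat_thresh

def keep_tile(tile_mask: np.ndarray, min_tissue: float = 0.25) -> bool:
    """Keep a candidate 224x224 tile if >= 25% of its pixels are tissue."""
    return float(tile_mask.mean()) >= min_tissue
```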
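The 2,560-d embedding definition maps directly onto code. A minimal sketch, assuming the encoder returns a (batch, 257, 1280) token tensor with the CLS token first, as a ViT-H/14 produces on 224×224 input:

```python
import torch

def tile_embedding(tokens: torch.Tensor) -> torch.Tensor:
    """Concatenate the CLS token with the mean of the 256 patch tokens.

    tokens: (batch, 257, 1280) ViT-H/14 output on 224x224 tiles
            (1 CLS token + 16*16 = 256 patch tokens of width 1,280).
    returns: (batch, 2560) tile embeddings.
    """
    cls_tok = tokens[:, 0]              # (batch, 1280)
    patch_mean = tokens[:, 1:].mean(1)  # (batch, 1280)
    return torch.cat([cls_tok, patch_mean], dim=-1)
```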
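Agata itself is proprietary and described only at a high level; the following is a simplified sketch of a cross-attention MIL aggregator with learned queries and GELU-projected keys/values, whose cost is linear in the number of tiles. All dimensions, the single-query design and the classifier head are assumptions.

```python
import torch
import torch.nn as nn

class CrossAttentionMIL(nn.Module):
    """Simplified cross-attention MIL aggregator in the spirit of Agata.

    A small set of learned queries attends over per-tile embeddings, so
    cost grows linearly with the number of tiles. Sizes are illustrative.
    """

    def __init__(self, dim: int = 2560, hidden: int = 512, n_queries: int = 1):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, hidden))
        self.to_kv = nn.Sequential(nn.Linear(dim, 2 * hidden), nn.GELU())
        self.classifier = nn.Linear(n_queries * hidden, 1)
        self.scale = hidden ** -0.5

    def forward(self, tiles: torch.Tensor) -> torch.Tensor:
        # tiles: (n_tiles, dim) embeddings for one specimen
        k, v = self.to_kv(tiles).chunk(2, dim=-1)             # (n_tiles, hidden) each
        attn = (self.queries @ k.T * self.scale).softmax(-1)  # (n_queries, n_tiles)
        pooled = attn @ v                                     # (n_queries, hidden)
        return self.classifier(pooled.flatten())              # cancer logit
```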
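The operating metric, specificity at 95% sensitivity, can be read off an ROC curve. A small helper (the function name is my own):

```python
import numpy as np
from sklearn.metrics import roc_curve

def specificity_at_sensitivity(y_true, y_score, target: float = 0.95) -> float:
    """Specificity (1 - FPR) at the first ROC point reaching the target
    sensitivity (TPR). roc_curve returns fpr and tpr in non-decreasing
    order, so searchsorted finds the earliest qualifying threshold."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    idx = np.searchsorted(tpr, target, side="left")
    return 1.0 - fpr[min(idx, len(fpr) - 1)]
```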
Biomarker prediction: Case-level binary classification with Agata on tile embeddings predicted biomarker status from H&E slides corresponding to the blocks used for MSK-IMPACT sequencing or IHC/FISH. Biomarkers: colon MSI (dMMR/MSI-H), breast CDH1 biallelic loss, bladder FGFR alterations, endometrial PTEN oncogenic mutations, lung EGFR oncogenic mutations, prostate AR amplification, gastric HER2 IHC/FISH positivity, skin BRAF V600 and ovarian FGA ≥30%. Datasets were split into train/test without patient overlap; because the datasets are small, multiple learning rates were tried and the best test AUC was reported per embedding type to benchmark representational quality. 95% CIs were computed via DeLong's method; head-to-head statistical comparisons were generally underpowered.

Tile-level linear probing: To assess embedding separability without aggregators, frozen-encoder linear classifiers were trained on Virchow and baselines (UNI, Phikon, CTransPath, DINO, PLIP, natural-image DINOv2) with a standardized protocol: SGD with a cosine learning-rate schedule from 0.01 to 0 over 12,500 iterations, batch size 4,096, Z-scored embeddings, validation-based checkpoint selection and no augmentation. Benchmarks included an internal PanMSK cancer vs non-cancer tile task (in-distribution) and public out-of-distribution (OOD) datasets: CRC (stain-normalized and non-normalized test sets), Camelyon17-WILDS, MHIST, TCGA TIL, PCam, MIDOG and TCGA CRC-MSI. Metrics included (weighted) F1, accuracy and balanced accuracy with bootstrapped 95% CIs; McNemar's test was used for significance.

Qualitative feature analysis: On the CoNSeP dataset, tile embeddings were extracted over 4×4 grids of resized images; PCA on the tile features of each image, followed by thresholding of the principal components, revealed emergent semantic separation (e.g., malignant epithelium, inflammatory cells).

Software, data and code: Training and analysis used Python, PyTorch, Torchvision, PyTorch Lightning, scikit-learn, cuCIM, OpenSlide and Pillow. The Virchow model is available for non-commercial research on Hugging Face, and an SDK for downstream WSI tasks is provided. Proprietary MSKCC data are accessible by request; public benchmark datasets are cited and linked.
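The standardized linear-probing protocol translates into a short PyTorch loop. The sketch below assumes precomputed frozen embeddings `train_x` (float tensor) and integer labels `train_y`; data loading and validation-based checkpoint selection are elided.

```python
import torch
import torch.nn as nn

def linear_probe(train_x: torch.Tensor, train_y: torch.Tensor, n_classes: int,
                 iters: int = 12_500, batch: int = 4_096):
    """Linear probe per the paper's protocol: SGD with a cosine LR schedule
    from 0.01 to 0 over 12,500 iterations, batch size 4,096, Z-scored
    embeddings, no augmentation. train_y holds integer class labels."""
    mu, sd = train_x.mean(0), train_x.std(0) + 1e-8
    x = (train_x - mu) / sd                       # Z-score each embedding dim
    head = nn.Linear(x.shape[1], n_classes)
    opt = torch.optim.SGD(head.parameters(), lr=0.01)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=iters)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(iters):
        idx = torch.randint(0, len(x), (batch,))  # sample a mini-batch
        loss = loss_fn(head(x[idx]), train_y[idx])
        opt.zero_grad()
        loss.backward()
        opt.step()
        sched.step()
    return head, (mu, sd)                         # reuse mu/sd at test time
```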
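The qualitative analysis reduces to a PCA projection plus a threshold per image. A minimal sketch, with the zero threshold and the function name being illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

def pc_threshold(tile_embeddings: np.ndarray, thresh: float = 0.0) -> np.ndarray:
    """Project one image's tile embeddings (e.g., a flattened 4x4 grid)
    onto the first principal component and threshold it. In the paper,
    thresholded principal components visually separated structures such
    as malignant epithelium and inflammatory cells."""
    pc1 = PCA(n_components=1).fit_transform(tile_embeddings)[:, 0]
    return pc1 > thresh
```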
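Loading the released model for research use goes through timm's Hugging Face hub integration. The repo id and constructor arguments below follow the public model card at the time of writing and should be verified against it before use:

```python
import timm
import torch
from timm.layers import SwiGLUPacked

# Repo id and kwargs as documented on the model card; verify before use.
model = timm.create_model(
    "hf-hub:paige-ai/Virchow", pretrained=True,
    mlp_layer=SwiGLUPacked, act_layer=torch.nn.SiLU,
).eval()

tile = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed H&E tile
with torch.inference_mode():
    tokens = model(tile)            # (1, 257, 1280): CLS + 256 patch tokens
# Tile embedding as defined in Methodology: CLS ++ mean(patch tokens).
embedding = torch.cat([tokens[:, 0], tokens[:, 1:].mean(1)], dim=-1)  # (1, 2560)
```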
Key Findings
- Virchow embeddings enabled the strongest pan-cancer detection overall and across most individual cancer types versus UNI, Phikon and CTransPath.
- Overall specimen-level AUCs (all cancers): Virchow 0.950; UNI 0.940; Phikon 0.932; CTransPath 0.907 (all pairwise differences P<0.0001). Specificity at 95% sensitivity (all cancers): Virchow 0.725; UNI 0.689; Phikon 0.629; CTransPath 0.523.
- Rare cancers: Virchow AUC 0.937 vs UNI 0.925, Phikon 0.917 and CTransPath 0.878. Some rare tissues (cervix, bone) were harder for all models (AUC < 0.9), but Virchow led (cervix 0.875 vs 0.830, 0.810, 0.753; bone 0.841 vs 0.813, 0.822, 0.728).
- Generalization: AUCs were similar on internal (MSKCC) and external consultation data, with Virchow outperforming baselines in both settings. Virchow also led on tissues unobserved during training (e.g., cervix, testis, head & neck); approximately 19.5% of evaluation specimens included unobserved tissues.
- Scaling: pan-cancer detection performance scaled approximately log-linearly with model parameter count and improved with more training tiles, with diminishing returns at the largest data scales.
- Clinical-grade comparison: despite far fewer tissue-specific training specimens and no product-style label curation, the Virchow-based pan-cancer model achieved AUCs approaching the specialist models: prostate 0.980 (vs Paige Prostate 0.995; P<0.05), breast 0.985 (vs Paige Breast 0.992; P<0.01) and BLN 0.971 (approaching Paige BLN). It significantly outperformed Paige BLN for macrometastases (0.999 vs 0.994; P<0.05) and showed no significant difference in several other BLN or stratified breast comparisons. On rare variants, the pan-cancer model often surpassed the specialists, including some lymphoma detections in prostate/lymph node and rare breast histologies (e.g., adenoid cystic carcinoma, carcinoma with apocrine differentiation, metaplastic subtypes, secretory carcinoma).
- Error analysis: false negatives were most often due to minimal cancer foci (45.2%), borderline malignancies (11.9%), very subtle features (9.5%), treatment effects, extensive necrosis or artifacts. False positives commonly reflected precursor lesions without invasion (53.2%), artifacts (17.0%), reactive stromal/lymphoid changes (14.9%), reactive epithelial changes (11.7%) or rare benign neoplasms (3.2%).
- Biomarkers: across nine digital biomarkers, Virchow achieved the top AUC on 7/9 tasks, top-2 on 8/9 and top-3 on 9/9. Representative Virchow AUCs: breast CDH1 0.986; colon MSI 0.958; bladder FGFR up to ~0.90; endometrial PTEN ~0.85; lung EGFR ~0.85; prostate AR ~0.85; ovarian FGA ~0.78–0.85; gastric HER2 ~0.96; skin BRAF ~0.90 (exact per-task values are plotted; CIs via bootstrapping; statistical power was limited for head-to-head comparisons).
- Tile-level linear probes: Virchow matched or surpassed baselines on 7/8 tile benchmarks, including OOD tasks (Camelyon17-WILDS; CRC without stain normalization), with minimal performance drop under stain-processing shifts (~−0.005 weighted F1).
- Unsupervised feature analyses showed emergent semantic separation in embeddings (e.g., malignant epithelium, inflammatory cells), supporting interpretability.
Discussion
The study set out to test whether a large, in-domain self-supervised foundation model can provide generalizable, data-efficient image embeddings for diverse clinical pathology tasks. The results demonstrate that Virchow markedly improves pan-cancer detection across tissues, including rare cancers and external/OOD data, while approaching specialist clinical product performance with less tissue-specific supervision. Gains persisted for unobserved tissues, suggesting strong transfer and robustness. The foundation model also underpinned biomarker prediction from routine H&E that was competitive with, and sometimes superior to, alternatives, highlighting the potential to reduce additional testing and accelerate treatment decisions. Scaling analyses indicate that increased model capacity and training data continue to yield measurable benefits, consistent with broader foundation model trends. Qualitative and tile-level results indicate that Virchow learns meaningful, semantically structured features that contribute to robustness (e.g., resilience to stain-normalization differences) and provide a basis for interpretability. Collectively, the findings support the hypothesis that sufficiently scaled pathology foundation models can serve as versatile backbones for a wide range of downstream clinical tasks, improving performance particularly where labeled data are scarce (rare cancers/variants and niche biomarkers).
Conclusion
Virchow, a 632M-parameter ViT-H trained with DINOv2 on 1.5 million H&E WSIs, delivers state-of-the-art, clinically relevant performance across pan-cancer detection, including rare cancers and external sites, and strongly competitive biomarker prediction directly from routine histology. The pan-cancer detector built on Virchow approaches specialist commercial products overall and surpasses them on certain rare variants, despite using fewer tissue-specific labels and simpler aggregation. These results position large-scale pathology foundation models as practical building blocks for clinical AI, particularly in data-limited scenarios. Future work should: (1) extend from tile-level to slide-level pretraining to further boost data efficiency and capture multi-scale context; (2) explore pathology-specific self-supervised objectives and augmentations better aligned to histopathology’s long-tailed entities, limited color space and scale characteristics; (3) study aggregator architectures and training strategies tailored to gigapixel WSIs; (4) investigate data curation, balancing and distillation to maintain rare pattern coverage while reducing redundancy; and (5) address deployment constraints via model compression/distillation and hardware-aware design for clinical settings.
Limitations
- Training data originated from a single center (MSKCC) with limited scanner diversity, which may constrain domain breadth.
- Embeddings were generated at the tile level (×20, 0.5 μm/px) and require a trained aggregator for slide/specimen predictions; multi-scale or slide-level pretraining was not explored.
- The pan-cancer training labels lacked the quality control and subpopulation enrichment typical of commercial products; while this underscores data efficiency, it may also limit absolute performance.
- Large model size poses deployment challenges; distillation and optimization may be necessary for clinical use.
- Owing to the scale, the study could not comprehensively assess data balancing, curation or distillation strategies; preserving rare features while reducing redundancy remains challenging.
- Statistical power was limited for some rare-variant and biomarker comparisons, reducing the ability to claim head-to-head significance in all cases.