Medicine and Health
Pan-cancer diagnostic consensus through searching archival histopathology images using artificial intelligence
S. Kalra, H. R. Tizhoosh, et al.
In this study, Shivam Kalra, H. R. Tizhoosh, and colleagues investigate whether AI-powered image search can support diagnostic consensus in histopathology. Analyzing nearly 30,000 whole-slide images, they show that majority voting among retrieved similar cases offers a path toward computational consensus in cancer type and subtype diagnosis.
Introduction
The study addresses whether content-based image search of whole-slide images (WSIs) can support diagnostic consensus in pathology by retrieving visually similar, previously diagnosed cases. With widespread digitization of pathology slides and advances in AI, most prior work has focused on supervised classification and segmentation to aid diagnosis. However, variability in visual assessment among pathologists is well documented, and algorithmic classifiers may not directly facilitate consensus-building or virtual peer review. Content-based image retrieval (CBIR), an unsupervised approach that searches using image pixels rather than text, offers decision support by returning similar images and associated metadata without making direct diagnostic decisions. The purpose of this work is to validate a large-scale, AI-enabled image search engine (Yottixel) on the TCGA archive to determine if majority voting among retrieved similar cases can establish a computational consensus for cancer type and subtype recognition across multiple organs. The significance lies in potentially reducing inter- and intra-observer variability through search-driven consensus, complementing pathologist workflows with immediate access to visually similar, curated cases.
Literature Review
CBIR systems in medicine have been explored for decades, with early work focusing on general medical imaging. Recent advances in deep learning have revitalized image retrieval research, including supervised and unsupervised approaches leveraging convolutional neural networks and hashing. With the advent of digital pathology and high-resolution WSIs, research has begun to target histopathology-specific image search and analysis. The Yottixel system builds on prior work in deep feature extraction, clustering, and barcode-based indexing (e.g., Radon barcodes, deep barcodes) to enable scalable, efficient retrieval tailored to pathology. Compared to classifier-centric AI studies that require extensive labeled training and yield deterministic outputs, CBIR emphasizes similarity-based retrieval and can assist with ambiguous or borderline cases by surfacing visually similar cases with established diagnoses.
Methodology
Data: The study used the TCGA WSI archive (Genomic Data Commons), starting with 30,072 WSIs. After quality exclusions (e.g., poor staining, low resolution, out-of-focus/unreadable regions), 29,120 WSIs at 20x magnification were processed (about 6 TB compressed). The dataset spans 25 anatomic sites and 32 cancer subtypes, including 17,425 frozen section slides and 11,579 permanent H&E slides; 26,564 neoplastic and 2,556 non-neoplastic specimens. TCGA subtype codes and patient counts are listed in the paper (e.g., BRCA 1097, GBM 604, UCEC 558, etc.).
Indexing and search (Yottixel):
- Tissue extraction: Segment tissue regions from WSI to create binary masks isolating tissue from background.
- Mosaicking: Tile tissue regions into fixed-size patches at fixed magnification (e.g., 500×500 µm at 20x). Apply k-means clustering to group patches into classes. Uniformly sample 5–20% of patches per cluster to form a representative mosaic per WSI (typically 70–100 patches).
- Feature mining: Pass mosaic patches through pretrained CNNs (ImageNet-based). Use activations from late pooling/first fully connected layers as feature vectors (~1,000–4,000 features per patch).
- Bunch of barcodes (BoB): Convert patch feature vectors to binary barcodes using the MinMax algorithm, producing a set ('bunch') of barcodes per WSI for storage in the index. Pairwise comparisons use Hamming distance between barcodes for efficient retrieval.
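The barcode-and-distance idea can be illustrated with a minimal sketch. The Python snippet below is an assumption-laden stand-in, not the authors' implementation: it binarizes a feature vector by the sign of consecutive differences (a simplification of the paper's MinMax barcoding), and compares two bunches of barcodes with a median-of-minimum Hamming rule, which is one plausible set-to-set matching scheme. Random vectors stand in for the CNN features of mosaic patches.

```python
import numpy as np

def binarize_features(feature_vec: np.ndarray) -> np.ndarray:
    # Threshold the sign of consecutive differences; a simple stand-in for
    # the MinMax-style barcoding described in the paper (exact rule may differ).
    diffs = np.diff(feature_vec)
    return (diffs > 0).astype(np.uint8)

def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    # Count differing bits between two equal-length barcodes.
    return int(np.count_nonzero(a != b))

def wsi_distance(bunch_a, bunch_b) -> float:
    # Set-to-set distance between two "bunches of barcodes": match each query
    # patch barcode to its closest candidate barcode, then take the median
    # (an illustrative aggregation, not necessarily the paper's exact formula).
    per_patch = [min(hamming_distance(p, q) for q in bunch_b) for p in bunch_a]
    return float(np.median(per_patch))

# Toy usage: random vectors stand in for deep features of mosaic patches.
rng = np.random.default_rng(0)
query_bunch = [binarize_features(rng.normal(size=1024)) for _ in range(5)]
candidate_bunch = [binarize_features(rng.normal(size=1024)) for _ in range(5)]
print(wsi_distance(query_bunch, candidate_bunch))
```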
Validation protocol:
- Leave-one-patient-out (LOPO): When querying with a WSI, exclude all slides from the same patient so the query cannot trivially match that patient's other slides.
- Horizontal search (cancer-type recognition): Query against the entire repository irrespective of primary site to validate organ/type recognition and explore use cases such as origin search for metastases.
- Vertical search (subtype recognition): Restrict search to WSIs from the same primary site to identify cancer subtypes; only organs with at least two subtypes were evaluated.
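To make the protocol concrete, here is a brief sketch of how LOPO filtering and the horizontal/vertical distinction might be applied before ranking. The IndexedSlide structure and the search signature are illustrative assumptions, and the ranking reuses the hypothetical wsi_distance helper from the earlier snippet.

```python
from dataclasses import dataclass, field

@dataclass
class IndexedSlide:
    slide_id: str
    patient_id: str
    primary_site: str                              # e.g., "Brain"
    subtype: str                                   # e.g., "GBM"
    barcodes: list = field(default_factory=list)   # bunch of barcodes for this WSI

def search(query, archive, mode="horizontal", top_n=10):
    # Leave-one-patient-out: drop every slide belonging to the query's patient.
    candidates = [s for s in archive if s.patient_id != query.patient_id]
    # Vertical search restricts candidates to the query's primary site;
    # horizontal search keeps the whole repository.
    if mode == "vertical":
        candidates = [s for s in candidates if s.primary_site == query.primary_site]
    # Rank by the bunch-of-barcodes distance (wsi_distance from the earlier sketch).
    candidates.sort(key=lambda s: wsi_distance(query.barcodes, s.barcodes))
    return candidates[:top_n]
```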
Metrics and analysis:
- Report top-n hit rates (top-3, top-5, top-10) and conservative majority-n accuracy/recall (majority-5 and majority-10; also majority-7 and majority-20 in vertical-search analyses). A search is successful by majority-n only if the majority among the top-n results matches the query's type/subtype.
- Correlation analysis: Compute correlation between number of patients per category and consensus accuracy; apply Cox–Stuart trend test for monotonic trends.
- Visualization and error analysis: t-SNE of pairwise distances (3,000 randomly sampled slides) to inspect clustering and outliers; confusion heatmap of relative subtype frequencies among top-10 results; chord diagram to visualize inter-type relationships.
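A minimal sketch of the majority-n evaluation follows, reading "majority" as a strict majority (more than n/2) of the top-n retrieved labels; the helper names are illustrative rather than the authors' code.

```python
from collections import Counter

def majority_n_correct(query_label, retrieved_labels, n):
    # Correct under majority-n only if a single label holds a strict majority
    # (> n/2) of the top-n results and that label matches the query's label.
    top = retrieved_labels[:n]
    label, count = Counter(top).most_common(1)[0]
    return label == query_label and count > n // 2

def majority_n_accuracy(results, n):
    # results: list of (query label, ranked labels of retrieved slides) pairs.
    hits = sum(majority_n_correct(q, ranked, n) for q, ranked in results)
    return hits / len(results)

# Example: "GBM" wins 3 of the top 5 retrieved labels, so majority-5 is a hit.
print(majority_n_correct("GBM", ["GBM", "LGG", "GBM", "GBM", "LGG"], n=5))
```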
Compute: High-performance storage and GPUs were used to index ~20 million 1,000×1,000-pixel tiles, creating ~3 million barcodes for real-time search. Reproducibility tests showed stable results; minor non-determinism from k-means does not affect hit rates or majority votes once indexing converges.
Key Findings
- Large-scale feasibility: Successfully indexed and searched ~29,120 WSIs (25 anatomic sites, 32 subtypes), ~20 million tiles, and ~3 million barcodes from almost 11,000 patients.
- Majority-vote consensus: Majority-n voting among top search results provides conservative, clinically appropriate consensus compared to standard top-n hit rates.
- Performance (horizontal search, permanent slides; top-10 hit rate): Brain 98.99%, Pulmonary 98.46%, Prostate/Testis 97.43%, Breast 95.96%, GI 95.54%, Urinary tract 95.41%, Gynecological 95.28%, Endocrine 94.55%, Liver/pancreaticobiliary 93.85%, Head & neck 90.55%, Melanocytic 88.20%, Mesenchymal 87.37%, Hematopoietic 84.61% (with corresponding majority-5/10 accuracies and recalls reported).
- Performance (horizontal search, frozen sections; top-10 hit rate): Brain 97.44%, Gynecological 97.60%, Pulmonary 95.34%, GI 95.12%, Breast 93.44%, Prostate/Testis 91.92%, Urinary tract 90.25%, Endocrine 84.78%, Melanocytic 83.83%, Liver/pancreaticobiliary 81.48%, Hematopoietic 78.45%, Head & neck 70.88%, Mesenchymal 56.37%.
- Vertical search (subtype consensus) highlights:
- Frozen sections majority-5 accuracy exceeding 90% for KIRC, GBM, COAD, UCEC, PCPG; examples include PRAD 98.33%, SKCM up to 99.56% (majority-20), THYM 97.58%, LIHC 93.36%.
- Diagnostic (permanent) slides majority-5 accuracies: GBM 91.18%, LGG 89.77%, UCEC 92.22%, KIRC 91.66%, COAD 76.14% (improving with larger n), ACC 93.83%, PCPG 88.77%, PRAD 98.43%, SKCM 99.57%, THYM 98.87%; most exceed or approach 90%. Reported headline examples: permanent slides prostate adenocarcinoma 98%, skin cutaneous melanoma 99%, thymoma 100%; frozen sections bladder urothelial carcinoma 93%, kidney renal clear cell carcinoma 97%, ovarian serous cystadenocarcinoma 99%.
- Data size effect: Accuracy improves with more available patients/images. Positive correlations between patient count and consensus accuracy were observed: vertical search r≈0.5456 (frozen) and r≈0.5974 (diagnostic); horizontal search r≈0.7780 (frozen) and r≈0.7201 (diagnostic). Cox–Stuart trend tests supported an upward monotonic trend (p-values > 0.95 for all settings).
- Optimal majority window: Majority of top-7 often yielded highest accuracy; retrieving too many images can degrade accuracy for rare subtypes due to scarcity of correct matches among larger n.
- Error structure: Confusion heatmap shows expected overlaps (e.g., READ vs COAD; LUAD vs LUSC; LIHC among CHOL results). Chord diagram reveals morphological relationships (e.g., adenocarcinomas across organs; LGG–GBM; urothelial resemblance to squamous tumors). t-SNE demonstrates coherent clustering by subtype with plausible outliers.
- Privacy and practicality: BoB barcodes are non-reversible, supporting privacy; once indexed, real-time search is computationally efficient and feasible for clinical environments.
Discussion
The findings demonstrate that CBIR-driven retrieval of visually similar, previously diagnosed WSIs can provide a robust, conservative computational consensus via majority voting, addressing the core question of aiding diagnostic agreement in pathology. Horizontal search validates organ/type recognition across the pan-cancer TCGA dataset, while vertical search shows high consensus accuracies for many subtypes when sufficient cases exist, supporting the notion of 'virtual peer review'. The search framework avoids direct algorithmic labeling, instead augmenting pathologist decision-making with evidence-based, visually matched cases and metadata. Observed confusions largely reflect genuine morphologic similarities (e.g., LUAD vs LUSC; COAD vs READ) and high-grade tumor patterns, aligning with clinical differentials. Accuracy correlates positively with dataset size, underscoring the need for comprehensive, well-characterized archives. Visualization (t-SNE) and heatmaps confirm coherent subtype grouping and interpretable error patterns. Overall, the approach effectively leverages large WSI repositories to mitigate inter- and intra-observer variability and can be integrated into diagnostic workflows without compromising patient privacy.
Conclusion
This pan-cancer validation shows that an AI-enabled CBIR system (Yottixel) can index and search very large WSI archives and achieve high majority-vote consensus for cancer type and subtype recognition, particularly when a category contains a large number of previously diagnosed cases. The study introduces a conservative majority-based evaluation appropriate for clinical support and highlights a strong positive relationship between dataset size and achievable consensus accuracy. Future work should: (1) expand curated archives across more tissue types, especially hematopathology; (2) perform detailed subtype-specific consensus studies on carefully curated datasets; (3) conduct comprehensive discordance assessments with and without computational consensus in clinical settings; and (4) explore methodological refinements (e.g., thresholding strategies, normalization of distances) and the use of single-patch searches for fine-grained tasks such as grading and mitotic counts.
Limitations
- Dataset constraints: TCGA includes many frozen sections with potential artifacts and compromised morphology; some preparation details unspecified; limited representation of hematopathology (few lymph node cases); potential research/institutional selection biases; exclusion of tumors post-neoadjuvant therapy limits generalizability.
- Class imbalance and rarity: Some subtypes have few patients, reducing consensus accuracy and causing accuracy drops when increasing n in majority voting due to scarcity of correct matches.
- Similarity vs classification: Evaluation treats retrieval as classification via majority voting, potentially undervaluing nuanced similarity relationships and anatomic proximities; histologic similarity perceived by experts may not perfectly align with Hamming distance-based measures.
- Magnification/resolution effects: Certain distinctions (e.g., DLBC vs undifferentiated non-hematopoietic tumors) may require multiple magnifications and ancillary studies beyond 20x image features.
- Non-determinism: Minor non-determinism from k-means during indexing, though convergence minimizes impact.