
Prediction of tumor origin in cancers of unknown primary origin with cytology-based deep learning
F. Tian, D. Liu, et al.
TORCH is a deep-learning method that predicts tumor origin in cancers of unknown primary (CUP) from cytological images of pleural and peritoneal effusions. Across five multi-center testing sets it identified malignancy and localized origin with high accuracy, and it raised the diagnostic scores of junior pathologists who used it as decision support. In CUP patients, treatment concordant with TORCH predictions was associated with longer overall survival.
Introduction
Cancers of unknown primary (CUP) are metastatic malignancies without an identifiable origin after standard workup, comprising approximately 3–5% of cancers. Many CUP cases initially present with pleural or peritoneal effusions where cytology is often the first or only feasible diagnostic modality, especially in late-stage patients who cannot undergo surgery or core biopsy. While cytology can differentiate malignant from benign effusions and sometimes broad histologic types (for example, adenocarcinoma versus squamous), precise site-of-origin prediction remains challenging. Immunohistochemistry panels can localize the primary in fewer than 30% of CUP cases, leaving a substantial clinical need. Accurate origin prediction is crucial for tailoring therapy and improving outcomes. This study aims to develop and validate a cytology-based deep learning system (TORCH) that identifies malignancy and predicts broad tumor origins (digestive, female reproductive including breast, respiratory, blood/lymphatic) from effusion smears, and to assess its potential clinical impact in CUP management.
Literature Review
Deep convolutional neural networks have shown expert-level performance across pathology tasks, including detection of breast cancer nodal metastases, Gleason grading in prostate biopsies, and prediction of EBV-associated gastric cancer from histology. Prior AI work on CUP origin prediction has focused on whole-slide histology images (for example, Lu et al., Nature 2021). However, cytology is frequently the only accessible modality in effusion-based metastases, and AI for cytology-based origin prediction has rarely been reported. Manual cytology interpretation is hampered by low cellularity, cell degeneration, atypia, and interobserver variability, all of which limit diagnostic accuracy. AI-assisted image analysis may therefore improve malignancy detection and origin localization in effusions, complementing existing immunohistochemistry and molecular profiling methods.
Methodology
Study design and datasets: Retrospective and prospective, multi-center study across four tertiary hospitals in China. From June 2010 to October 2023, 90,572 cytology smear images (76,183 patients) were collected; after excluding 24,808 malignant images without supporting evidence for primary origin and 8,544 blank/poor-quality images, 57,220 images from 43,688 patients remained. The data were split into training (29,883 images from 20,638 individuals) and five testing sets: three internal (Tianjin n=4,186; Zhengzhou n=6,234; Suzhou n=2,379; total 12,799) and two fully unseen external sets (Tianjin-P prospectively enrolled n=3,933; Yantai retrospectively enrolled n=10,605; total 14,538). Cases were stratified by supporting evidence into high- and low-certainty groups. Cytology originated from pleural (hydrothorax) and peritoneal (ascites) fluids; benign conditions included cirrhosis, nephrotic syndrome, etc. Malignant primary systems were grouped into four categories (digestive, female reproductive including breast, respiratory, blood/lymphatic) plus a benign class for the 5-way output.
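The cohort counts above can be cross-checked with simple arithmetic. The short sketch below only re-adds the numbers quoted in this paragraph; the variable names are illustrative and not taken from the study.

```python
# Counts copied from the study-design paragraph; names are illustrative.
collected = 90_572
excluded_no_origin_evidence = 24_808
excluded_blank_or_poor_quality = 8_544

training = 29_883
internal = {"Tianjin": 4_186, "Zhengzhou": 6_234, "Suzhou": 2_379}
external = {"Tianjin-P": 3_933, "Yantai": 10_605}

retained = collected - excluded_no_origin_evidence - excluded_blank_or_poor_quality
internal_total = sum(internal.values())
external_total = sum(external.values())
testing_total = internal_total + external_total

# prints: 57220 12799 14538 27337
print(retained, internal_total, external_total, testing_total)

# Training plus all five testing sets should account for every retained image.
assert retained == training + testing_total
```

The totals agree with the 57,220 retained images and with the n=27,337 testing images reported in the key findings.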
Image processing: Cytology smears were acquired at ×200 or ×400. Original images (≈2797×1757 px) were tiled into 224×224 patches. Quality control removed blank and out-of-focus tiles. A bag-of-patches (multiple-instance learning, MIL) represented each image.
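To make the tiling step concrete, here is a minimal sketch of how patch coordinates could be generated for one image. It assumes non-overlapping tiles and that partial tiles at the right and bottom edges are discarded; the summary does not state the stride or the edge handling, so both are assumptions, and `tile_origins` is a hypothetical helper name.

```python
def tile_origins(width, height, tile=224):
    """Top-left (x, y) corners of non-overlapping tiles that fit fully inside the image.

    Assumption: stride equals tile size and partial edge tiles are dropped.
    """
    return [
        (x, y)
        for y in range(0, height - tile + 1, tile)
        for x in range(0, width - tile + 1, tile)
    ]


# Approximate image size quoted above (about 2797 x 1757 px).
origins = tile_origins(2797, 1757)
columns = len({x for x, _ in origins})
rows = len({y for _, y in origins})
print(columns, rows, len(origins))  # prints: 12 7 84
```

Under these assumptions each image yields a bag of at most 84 patches before quality control removes blank or out-of-focus tiles.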
Feature extraction: Self-supervised contrastive learning (MoCo) with a ResNet-50 backbone and an MLP projection head learned representations from 1,360,892 TCGA histology patches (11,607 patients) and from the 29,883 training cytology images. This yielded two feature extractors, one for cytology and one for histology. Patch features (typically 1,024–2,048-D) were stacked into image-level feature matrices. Clinical variables (age, sex, sampling site: hydrothorax vs ascites) were embedded into feature vectors of matching dimension.
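For readers unfamiliar with MoCo, the sketch below shows its two core ingredients in plain Python: the InfoNCE contrastive loss for one query, and the momentum update of the key encoder. The temperature (0.07) and momentum (0.999) are defaults from the original MoCo paper, not values reported for TORCH, and the toy vectors stand in for L2-normalized embeddings.

```python
import math


def info_nce(query, key_pos, key_negs, temperature=0.07):
    """InfoNCE loss for one query: cross-entropy with the positive key as the target.

    Assumption: all vectors are already L2-normalized; temperature is MoCo's default.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    logits = [dot(query, key_pos) / temperature]
    logits += [dot(query, k) / temperature for k in key_negs]
    peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum_exp - logits[0]


def momentum_update(key_params, query_params, m=0.999):
    """Key encoder follows the query encoder as an exponential moving average."""
    return [m * k + (1 - m) * q for k, q in zip(key_params, query_params)]


# Toy 2-D embeddings (hypothetical): two views of one patch should agree.
q = [1.0, 0.0]
loss_aligned = info_nce(q, [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
loss_misaligned = info_nce(q, [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
print(loss_aligned < loss_misaligned)  # prints: True
```

The loss is small when the query matches its positive key and large when a negative key matches better, which is what drives the encoder toward patch-level representations without labels.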
Model architecture: TORCH ensembles 12 models, one for each combination of four MIL backbones and three input configurations. Backbones: (1) attention-based MIL (ABMIL), (2) ABMIL with multiple attention branches (ABMIL-MB), (3) Transformer-based MIL (TransMIL), and (4) TransMIL with cross-modality attention for fusing image and clinical features. Inputs per backbone: (a) cytology features + clinical variables, (b) histology features + clinical variables, (c) both feature types + clinical variables. The ensemble averages the class probability outputs of its members. Tasks: (i) binary classification of malignant vs benign (using P_cancer = 1 − P_benign), and (ii) multiclass origin prediction among the four malignant systems.
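The ensembling rule described above can be sketched as follows. This is an illustration only: the class order, the label strings, and the three toy probability vectors are hypothetical (the real ensemble has 12 members), but the averaging and the P_cancer = 1 − P_benign rule follow the description.

```python
from itertools import product

BACKBONES = ["ABMIL", "ABMIL-MB", "TransMIL", "TransMIL-cross-modality"]
INPUTS = ["cytology+clinical", "histology+clinical", "both+clinical"]
MEMBERS = list(product(BACKBONES, INPUTS))  # 4 backbones x 3 inputs = 12 members

# Hypothetical class order; index 0 is the benign class.
CLASSES = ["benign", "digestive", "female_reproductive", "respiratory", "blood_lymphatic"]


def ensemble_average(member_probs):
    """Average per-class probabilities over ensemble members."""
    n = len(member_probs)
    return [sum(p[i] for p in member_probs) / n for i in range(len(CLASSES))]


def summarize(avg_probs):
    """Return P_cancer and the malignant systems ranked by averaged probability."""
    p_cancer = 1.0 - avg_probs[0]
    ranked = sorted(zip(CLASSES[1:], avg_probs[1:]), key=lambda kv: kv[1], reverse=True)
    return p_cancer, [name for name, _ in ranked]


# Toy outputs from three members (made-up numbers).
toy = [
    [0.10, 0.20, 0.50, 0.15, 0.05],
    [0.20, 0.10, 0.40, 0.25, 0.05],
    [0.00, 0.30, 0.60, 0.05, 0.05],
]
p_cancer, ranking = summarize(ensemble_average(toy))
print(len(MEMBERS), round(p_cancer, 3), ranking[:3])
```

The ranked list is what top-1, top-2, and top-3 origin predictions would be read from.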
Training: Self-supervised feature learning used SGD for 200 epochs (initial lr 0.015, weight decay 1e-5, cosine decay, batch 128). Downstream MIL models trained end-to-end with Adam (lr 2e-4, weight decay 1e-5, batch size 1) for 100 epochs, selecting the best epoch from 60 onward by validation loss. Implementation in PyTorch 1.12.1 on DGX A100.
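Two details of this training recipe are easy to illustrate: the cosine learning-rate decay and the rule for choosing the checkpoint. The sketch below assumes the schedule steps once per epoch, has no warm-up, and decays to zero, and that "from 60 onward" means epochs 60 through 100 inclusive with 1-based numbering; none of these details are stated in the summary.

```python
import math


def cosine_lr(epoch, total_epochs=200, base_lr=0.015, min_lr=0.0):
    """Cosine-decayed learning rate at a given epoch (0-based).

    Assumptions: per-epoch stepping, no warm-up, decay to min_lr=0.
    """
    cosine = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return min_lr + (base_lr - min_lr) * cosine


def select_best_epoch(val_losses, first_eligible=60):
    """Pick the eligible epoch (1-based) with the lowest validation loss.

    val_losses[i] is the validation loss recorded after epoch i + 1.
    """
    eligible = range(first_eligible, len(val_losses) + 1)
    return min(eligible, key=lambda e: val_losses[e - 1])


print(cosine_lr(0), cosine_lr(200))  # prints: 0.015 0.0

# Hypothetical loss curve: the early dip at epoch 10 is ignored by the rule.
losses = [1.0] * 100
losses[9] = 0.1
losses[74] = 0.3
print(select_best_epoch(losses))  # prints: 75
```

Restricting selection to later epochs avoids picking an early checkpoint whose low validation loss is a fluctuation rather than a sign of convergence.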
Evaluation: Primary metric AUROC with DeLong 95% CIs; accuracy, sensitivity, specificity, precision, NPV via Clopper–Pearson. Multiclass performance via micro-averaged one-vs-rest ROC. Top-n (n=1,2,3) accuracy assessed for origin prediction. Stratified analyses by sampling site (hydrothorax vs ascites), carcinoma vs noncarcinoma, and adenocarcinoma vs nonadenocarcinoma. Ablation excluded clinical variables to assess impact; perturbation analysis randomized or flipped clinical fields to quantify dependence. Human comparison involved two junior and two senior pathologists interpreting 495 images (with clinical metadata), scored on a 0–2 scale; an additional 496 images tested junior performance with TORCH assistance. Inter-rater reliability used Fleiss’ kappa.
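Two of the metrics named above, top-n accuracy and the Clopper–Pearson exact binomial interval, can be sketched in standard-library Python. The bisection solver is a simple stand-in for the beta-quantile formula that statistics packages normally use, and the toy probabilities are made up for illustration.

```python
import math


def top_n_accuracy(prob_rows, true_idx, n):
    """Fraction of cases whose true class is among the n highest-probability classes."""
    hits = 0
    for probs, truth in zip(prob_rows, true_idx):
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        hits += truth in ranked[:n]
    return hits / len(true_idx)


def _binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with terms computed in log space."""
    total = 0.0
    for i in range(k + 1):
        log_term = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log(1.0 - p)
        )
        total += math.exp(log_term)
    return min(total, 1.0)


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for a proportion k / n."""
    def solve(cdf_at, target):
        # The binomial CDF decreases as p grows, so bisect on p in (0, 1).
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2.0
            if cdf_at(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2.0

    lower = 0.0 if k == 0 else solve(lambda p: _binom_cdf(k - 1, n, p), 1.0 - alpha / 2.0)
    upper = 1.0 if k == n else solve(lambda p: _binom_cdf(k, n, p), alpha / 2.0)
    return lower, upper


# Toy 3-class predictions (hypothetical numbers).
rows = [[0.1, 0.6, 0.3], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
truth = [1, 1, 0]
print([round(top_n_accuracy(rows, truth, n), 3) for n in (1, 2, 3)])
print(clopper_pearson(8, 10))
```

Top-n accuracy can only rise as n grows, which matches the top-1/2/3 pattern reported in the key findings.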
Clinical outcome association: For 391 CUP patients (uncertainty cases) treated between April 2020 and February 2023, oncologists classified initial treatments as concordant/discordant with TORCH-predicted origins based on guidelines and records. Overall survival was compared by Kaplan–Meier and log-rank tests; multivariable Cox models included age, sex, AI prediction type, specimen source, number of metastatic sites, and concordance. Karnofsky performance and RECIST response were summarized for 310 patients receiving palliative chemotherapy/targeted therapy.
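As a reminder of what the Kaplan–Meier comparison computes, here is a minimal product-limit estimator. The follow-up times and event flags are invented toy values, not study data, and the function names are illustrative.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times:  follow-up durations (e.g., months)
    events: 1 if death was observed at that time, 0 if the patient was censored
    Returns [(t, S(t))] at each distinct time with at least one death.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = censored = 0
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            censored += 1 - data[i][1]
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= deaths + censored
    return curve


def median_survival(curve):
    """First time at which estimated survival falls to 0.5 or below (None if never)."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Toy cohort of six patients (hypothetical values, in months).
toy_times = [5, 8, 12, 12, 20, 27]
toy_events = [1, 0, 1, 1, 0, 1]
curve = kaplan_meier(toy_times, toy_events)
print(curve, median_survival(curve))
```

In the study, one such curve would be estimated for the concordant group and one for the discordant group, with the log-rank test comparing them.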
Interpretability: Attention heatmaps localized salient patches; pathologists assessed whether heatmaps covered main tumor areas (coverage ≥60% and ≥80%).
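Attention heatmaps of this kind are typically read off the per-patch weights of an attention-based MIL pooling layer. The sketch below uses the standard ABMIL form, a_k = softmax_k(w · tanh(V h_k)); the exact attention module in TORCH is not specified in this summary, and the parameters V and w and the toy features are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(patch_features, V, w):
    """ABMIL-style pooling: score each patch, softmax the scores, average the features.

    Returns (weights, pooled); the weights are the per-patch heatmap values.
    """
    scores = []
    for h in patch_features:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(wi * hi for wi, hi in zip(w, hidden)))
    weights = softmax(scores)
    dim = len(patch_features[0])
    pooled = [sum(a * h[d] for a, h in zip(weights, patch_features)) for d in range(dim)]
    return weights, pooled


# Three toy patches with 2-D features; V and w are made-up parameters.
features = [[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
w = [1.0, 1.0]
weights, pooled = attention_pool(features, V, w)
print([round(a, 3) for a in weights])
```

Mapping each weight back to its patch's position in the image gives the heatmap that pathologists compared against the main tumor areas.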
Key Findings
- Overall performance across five testing sets (n=27,337): micro-averaged one-vs-rest AUROC 0.969 (95% CI 0.967–0.970). Internal sets AUROC: Tianjin 0.953 (0.949–0.958), Zhengzhou 0.962 (0.960–0.965), Suzhou 0.979 (0.976–0.983). External sets AUROC: Tianjin-P 0.958 (0.954–0.962), Yantai 0.978 (0.977–0.980).
- Malignancy detection (overall): AUROC 0.974 (0.972–0.976), accuracy 92.6% (92.2–92.9), sensitivity 92.8% (92.3–93.2), specificity 92.4% (92.0–92.8).
- Origin localization: Overall top-1/2/3 accuracy 82.6% / 95.9% / 98.9%. Per internal test sets, top-1 accuracy: Tianjin 76.3%, Zhengzhou 80.8%, Suzhou 87.4%. External lower-/upper-bound top-1 accuracy accounting for uncertainty cases: Tianjin-P 70.2–81.7%; Yantai 75.1–88.1% (without uncertainty: 79.3% and 86.3%, respectively).
- Category-specific metrics (combined): Female reproductive system AUROC 0.960 (0.958–0.962), accuracy 88.1% (87.7–88.5), sensitivity 92.5% (91.8–93.2), specificity 86.9% (86.4–87.3). Digestive, respiratory, blood/lymphatic, and benign metrics provided in Extended Data Tables (e.g., benign accuracy 92.6%, sensitivity 92.8%).
- Sampling site stratification: Higher micro-averaged AUROC in hydrothorax vs ascites (0.970 vs 0.966; P<0.001). Ascites showed higher AUROC for digestive (0.892 vs 0.775; P<0.001) and female reproductive (0.951 vs 0.945; P=0.012); hydrothorax higher for respiratory (0.929 vs 0.808; P<0.001). No significant differences for benign (0.972 vs 0.975; P=0.068) or blood/lymphatic (0.967 vs 0.951; P=0.122).
- Histologic subtype stratification: Comparable AUROC for carcinoma vs noncarcinoma (0.938 vs 0.939; P=0.244). Within carcinoma, adenocarcinoma AUROC 0.942 vs non-adenocarcinoma 0.925 (P=0.002).
- Ablation (removing age, sex, and sampling site) degraded performance: AUROC 0.969→0.925 (P<0.001), top-1 accuracy 82.6%→68.9% (P<0.001), top-2 accuracy 95.9%→88.7% (P<0.001). Perturbation analysis (fractional drop in top-1 accuracy): Δ_age=6.70%, Δ_sex=26.5%, Δ_site=37.5%.
- AI vs pathologists (n=495): TORCH top-1 accuracy 78.8% (75.4–82.0) exceeded two juniors (42.6%, 44.0%) and two seniors (69.7%, 57.0%); all P<0.001. Across five categories, TORCH had higher accuracy (0.896 vs 0.813; P=0.038), sensitivity (0.880 vs 0.485; P<0.001), and precision (0.634 vs 0.486; P<0.001). Diagnostic score higher for TORCH (1.677 vs 1.265; P<0.001).
- Junior pathologists with TORCH assistance (n=496) improved top-1 accuracy from 43.3% to 62.3% (P<0.001), approaching seniors (63.3%; P=0.777), with notable gains in digestive system accuracy and female reproductive system sensitivity. Diagnostic score improved from 1.101 to 1.326 (P<0.001).
- Clinical outcomes in CUP (n=391): Treatment concordant with TORCH predictions associated with longer median OS (27 vs 17 months; log-rank P=0.006) and better Karnofsky scores at 3–6 months (52.1±18.8 vs 41.8±19.5; P<0.001). Among 310 treated with systemic therapy, concordant group showed better OS (P=0.032) and higher proportions of PR/SD vs PD. Concordance was independently favorable in multivariable Cox models (all patients HR 0.528, 95% CI 0.374–0.746, P<0.001; systemic therapy subset HR 0.498, 95% CI 0.336–0.737, P=0.001).
- Interpretability: Attention heatmaps captured main tumor regions with composite accuracy 87.7% (pathologist assessment; coverage thresholds reported).
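The headline metric in these findings, the micro-averaged one-vs-rest AUROC, pools every (case, class) pair into a single binary problem before computing the area under the ROC curve. The sketch below is a minimal rank-based illustration with made-up probabilities; it does not reproduce the study's DeLong confidence intervals.

```python
def auroc(labels, scores):
    """Rank-based AUROC (Mann-Whitney U), using average ranks for tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def micro_ovr_auroc(prob_rows, true_idx):
    """Flatten one-hot labels and class probabilities, then compute one AUROC."""
    labels, scores = [], []
    for probs, truth in zip(prob_rows, true_idx):
        for cls, p in enumerate(probs):
            labels.append(1 if cls == truth else 0)
            scores.append(p)
    return auroc(labels, scores)


# Toy 3-class example (hypothetical probabilities and labels).
toy_probs = [[0.1, 0.6, 0.3], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
toy_truth = [1, 1, 0]
micro = micro_ovr_auroc(toy_probs, toy_truth)
print(round(micro, 4))
```

Because every case contributes one positive pair and several negative pairs, micro-averaging weights classes by their frequency, which is worth keeping in mind when comparing the overall figure with the per-category AUROCs.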
Discussion
The study addresses the diagnostic challenge of localizing tumor origin in CUP patients presenting with pleural or peritoneal effusions, where cytology may be the only feasible modality. TORCH, a cytology-based deep learning ensemble integrating image and basic clinical features (age, sex, sampling site), robustly distinguished malignant from benign effusions and localized tumor origin to broad organ systems across multi-institutional internal and external datasets. Performance generalized across sites, specimen types, and diagnostic certainty levels. Category-level results tracked the sampling site: digestive and female reproductive origins were identified better in ascites, and respiratory origins in hydrothorax. Ablation and perturbation analyses show that clinical context, especially specimen site and sex, materially improves image-based predictions.
Compared with practicing pathologists, TORCH achieved higher accuracy, sensitivity, and precision in origin classification and improved junior pathologists’ performance when used as decision support, reducing interobserver variability. Importantly, concordance between TORCH-predicted origin and initial treatment correlated with improved overall survival and functional outcomes in CUP, suggesting potential clinical utility for guiding therapy selection when origins are uncertain. Attention heatmaps aligned with pathologist-identified tumor regions, offering interpretable cues on morphological determinants (e.g., glandular structures, papillary clusters, cytologic atypia) underlying predictions.
Overall, TORCH provides an effective ancillary tool for cytology-based origin prediction, bridging gaps where histology, immunohistochemistry, or molecular profiling are unavailable or inconclusive. Its benefit is most evident in real-world effusion cytology scenarios, supporting more personalized and potentially effective treatment strategies in CUP.
Conclusion
TORCH is a cytology-based deep learning framework that accurately differentiates malignant from benign effusions and predicts broad tumor origin categories from pleural and peritoneal cytology, generalizing across multiple centers and datasets. In the reader study it outperformed the participating pathologists and raised junior pathologists' performance when used as decision support. In CUP patients, treatment concordant with its origin predictions was associated with longer overall survival, although this association comes from observational data. These findings support TORCH as a practical adjunct for clinical decision-making in effusion cytology. Future work should include prospective randomized trials to confirm clinical benefit, expansion to rarer origins (e.g., mesothelium, urinary, nervous, bone/soft tissue), incorporation of richer clinical and multi-omics data, architectural enhancements that exploit spatial relationships among patches, and evaluation across diverse international populations to ensure generalizability.
Limitations
- Granularity: The model predicts origin at the organ-system level rather than specific primary sites; finer localization requires additional data (e.g., whole-slide histology or molecular profiling).
- Coverage of tumor types: Current model does not encompass several malignant entities with low effusion prevalence (e.g., mesothelioma, urinary, nervous, bone/soft tissue, melanoma, thymoma); limited data constrained inclusion.
- Population and setting: Data are from four Chinese centers; performance may vary in other regions, ethnicities, and practice patterns. Visual field selection and staining differences could affect generalizability.
- Input scope: Only age, sex, and specimen site were integrated with cytology images; other relevant clinical, radiologic, biomarker, and genomic variables were not modeled and could improve performance.
- Data scale and architecture: Despite strong results, training data volume remains modest compared with large-scale natural image tasks; model architecture may be further optimized, including modeling spatial relationships among patches and improved multimodal fusion.
- Cytology-specific constraints: Cytology provides less structural information than whole-slide tissue histology; sampling inadequacy, cell degeneration, and artifacts can degrade performance.