Deep learning detects premalignant lesions in the Fallopian tube



J. M. A. Bogaerts, J. Bokhorst, et al.

Discover groundbreaking advancements in the detection of tubo-ovarian high-grade serous carcinoma through an innovative deep-learning algorithm developed by Joep M. A. Bogaerts and colleagues. This powerful model achieves an impressive AUROC of 0.98, significantly enhancing the diagnostic process for pathologists. Join us in exploring how this research promises to improve cancer screening and diagnosis.

Introduction
Tubo-ovarian high-grade serous carcinoma (HGSC), the most common epithelial ovarian cancer, is thought to originate in the fallopian tube where precursor lesions harboring TP53 mutations have been identified. The most clinically relevant precursor is serous tubal intraepithelial carcinoma (STIC), which exhibits cyto-morphological alterations comparable to HGSC and a characteristic immunohistochemical profile (aberrant p53 expression and increased Ki-67). Lesions resembling STIC but not meeting all diagnostic criteria are termed serous tubal intraepithelial lesions (STIL). Accurate identification of these precursors is crucial given their implications for prevention and management strategies, including opportunistic salpingectomy and risk-reducing salpingo-oophorectomy (RRSO) in high-risk individuals (e.g., BRCA1/2 carriers). However, diagnosis is challenging and subject to interobserver variability. This study aims to develop and evaluate a deep learning model to assist pathologists by detecting regions of aberrant epithelium consistent with STIC/STIL on H&E-stained whole slide images of fallopian tubes.
Methodology
Study design and data: A retrospective multi-institutional dataset was assembled comprising STIC/STIL cases (n=323) and benign controls (n=359). The data were split into a training set (n=169; 118 STIC/STIL, 51 controls), an internal test set (n=327; 131 STIC/STIL, 196 controls), and an external test set (n=186; 74 cases, 112 controls). The training and internal test sets were sourced from Radboudumc and PALGA; the external test set came from Johns Hopkins University and Eurofins PAMM. One H&E-stained slide per specimen was used; IHC was available for subsets.

Reference standard: For training and internal testing, a reference standard was established by expert review. Regions of interest (n=571 still images at 20×, 0.5 µm/pixel) from potential STIC/STIL cases (n=249) and randomly sampled controls (n=247) were reviewed on an online platform by a panel of 15 experienced gynecologic pathologists, with 2–3 reviewers per image. Interobserver agreement (Cohen’s kappa) was computed; agreement was high for aberrant vs. non-aberrant epithelium and lower for the detailed categorical diagnoses (normal, STIC, suspicious for STIC, STIL, HGSC, other).

Annotations: Digital whole slide images were annotated using ASAP. STIC and STIL regions were exhaustively annotated according to the reference labels. Other tissues (invasive carcinoma, normal epithelium, cystic epithelium, non-epithelial tissue) were annotated sparsely. No hand-drawn annotations were made in control slides.

Model development: A two-phase pipeline based on U-Net was implemented.
- Phase 1: Epithelium segmentation using a U-Net with a MobileNet-v2 backbone to separate epithelium from non-epithelial tissue.
- Phase 2: Detection/segmentation of aberrant epithelium (STIC/STIL) vs. normal epithelium, HGSC, and non-epithelial regions using a U-Net with a ResNet-50 backbone.
Phase 1 outputs were used to refine coarse aberrant annotations and provide clear demarcations for Phase 2 training. Manual hard-negative mining was applied in both phases.
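The two-phase composition can be illustrated with a minimal NumPy sketch: Phase-1 epithelium probabilities gate the Phase-2 aberrant-epithelium map so that predictions outside segmented epithelium are suppressed. The function name, the toy probability maps, and the 0.5 epithelium cutoff are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def gate_by_epithelium(aberrant_prob, epithelium_prob, epi_threshold=0.5):
    """Restrict Phase-2 aberrant-epithelium probabilities to pixels that
    Phase 1 segmented as epithelium (illustrative composition only)."""
    epithelium_mask = epithelium_prob >= epi_threshold
    return np.where(epithelium_mask, aberrant_prob, 0.0)

# Toy 4x4 probability maps standing in for whole-slide likelihood maps.
aberrant = np.array([[0.9, 0.8, 0.1, 0.0],
                     [0.7, 0.6, 0.0, 0.0],
                     [0.0, 0.0, 0.2, 0.3],
                     [0.0, 0.0, 0.4, 0.5]])
epithelium = np.array([[1.0, 1.0, 0.0, 0.0],
                       [1.0, 1.0, 0.0, 0.0],
                       [0.0, 0.0, 1.0, 1.0],
                       [0.0, 0.0, 1.0, 1.0]])
gated = gate_by_epithelium(aberrant, epithelium)
# Pixels outside the epithelium mask are zeroed; epithelial pixels keep
# their aberrant probability.
```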
Training parameters: Input patches were 512×512 px at 1.0 µm/pixel, sampled to balance aberrant and healthy tissue within mini-batches (batch size 10). Data augmentation included random flips, rotations, elastic deformations, blurring, and gamma/brightness, color, and contrast changes. Networks were initialized with ImageNet-pretrained weights. The learning rate started at 1e-4 and was halved after 25 epochs without validation improvement. Training ran for up to 150 epochs (500 iterations per epoch; 5000 patches per epoch), with early stopping after 50 epochs without validation-loss improvement. Outputs were likelihood maps. Development used Python 3.8.

Evaluation: Phase 1 outputs were visually inspected. For Phase 2, slide-level predictions were computed by identifying connected components of segmented regions with STIC/STIL probabilities, averaging the probabilities per object, and taking the highest object probability as the slide-level score. ROC curves and AUROC with 95% CIs were computed using scikit-learn with bootstrapping (1000 iterations; NumPy 1.21; SciPy 1.7.1). Visual checks compared detected regions with morphology and IHC. Kappa statistics for the reference standard were calculated in IBM SPSS v27.
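The slide-level aggregation and bootstrap AUROC CI described above can be sketched as follows. This is an illustrative reimplementation, not the authors' code: the 0.5 segmentation cutoff, function names, and random seed are assumptions, and it uses `scipy.ndimage` and scikit-learn, consistent with the libraries reported.

```python
import numpy as np
from scipy import ndimage
from sklearn.metrics import roc_auc_score

def slide_level_score(prob_map, seg_threshold=0.5):
    """Label connected components of the thresholded STIC/STIL probability
    map, average the probability per object, and return the highest
    object mean as the slide-level score (0.0 if no object)."""
    mask = prob_map >= seg_threshold
    labels, n_objects = ndimage.label(mask)
    if n_objects == 0:
        return 0.0
    object_means = ndimage.mean(prob_map, labels=labels,
                                index=np.arange(1, n_objects + 1))
    return float(np.max(object_means))

def bootstrap_auroc_ci(y_true, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap 95% CI for the AUROC (the paper reports
    1000 bootstrap iterations)."""
    rng = np.random.default_rng(seed)
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if len(np.unique(y_true[idx])) < 2:  # resample must contain both classes
            continue
        aucs.append(roc_auc_score(y_true[idx], scores[idx]))
    lo, hi = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return roc_auc_score(y_true, scores), (lo, hi)

# Toy probability map with two segmented objects (means 0.9 and 0.6).
toy_map = np.array([[0.9, 0.9, 0.0],
                    [0.0, 0.0, 0.0],
                    [0.0, 0.0, 0.6]])
score = slide_level_score(toy_map)
```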
Key Findings
- Performance (ROC/AUROC):
  - Internal test set AUROC: 0.98 (95% CI: 0.96–0.99).
  - External test set AUROC: 0.95 (95% CI: 0.90–0.99).
- Operating threshold analysis:
  - At the F1-optimal threshold of 0.35 (internal test set): sensitivity 0.98, specificity 0.93; slide-level errors: 3 false negatives and 13 false positives out of 327 slides.
  - External test set at threshold 0.35: sensitivity 0.98, specificity 0.67; errors: 2 false negatives and 37 false positives out of 186 slides.
  - At a sensitivity-favoring threshold of 0.30 (internal test set): sensitivity 1.00, specificity 0.82; errors: 0 false negatives, 36 false positives (i.e., +23 FPs vs. threshold 0.35). This threshold would detect the two STIL cases missed at 0.35.
- Error analysis and visualization:
  - Visual overlays showed good alignment between predicted aberrant regions, morphology, and IHC (p53, Ki-67).
  - Internal false negatives at threshold 0.35 comprised two STIL cases and one STIC with concomitant HGSC (the HGSC component was detected). The two external false negatives included one case debated as reactive epithelium vs. STIC; its HGSC component was detected.
  - Common false-positive scenarios: increased lymphocytes/inflammation, surface epithelial detachment, and artifacts (e.g., mechanical damage, partial destaining).
- Interobserver agreement (reference standard):
  - Aberrant vs. non-aberrant epithelium: median κ = 0.86 (range approx. 0.81–0.97 across readers).
  - Full diagnostic categories: median κ = 0.53 (range approx. 0.35–0.64), highlighting diagnostic variability even among experts.
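The threshold trade-off reported above can be reproduced in miniature on synthetic slide scores. The data below are invented for illustration and do not correspond to the study's slides; they merely show how lowering the threshold from 0.35 to 0.30 recovers a borderline positive at the cost of an extra false positive.

```python
import numpy as np

def sens_spec(y_true, scores, threshold):
    """Slide-level sensitivity and specificity at a given operating
    threshold (positive = score >= threshold)."""
    y_true = np.asarray(y_true)
    pred = np.asarray(scores) >= threshold
    tp = np.sum(pred & (y_true == 1))
    fn = np.sum(~pred & (y_true == 1))
    tn = np.sum(~pred & (y_true == 0))
    fp = np.sum(pred & (y_true == 0))
    return tp / (tp + fn), tn / (tn + fp)

# Synthetic scores: one true positive sits at 0.32 (missed at 0.35),
# and one control sits at 0.33 (flagged once the threshold drops).
y = [1, 1, 1, 0, 0, 0]
s = [0.90, 0.40, 0.32, 0.33, 0.10, 0.05]
sens_35, spec_35 = sens_spec(y, s, 0.35)  # sensitivity 0.67, specificity 1.00
sens_30, spec_30 = sens_spec(y, s, 0.30)  # sensitivity 1.00, specificity 0.67
```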
Discussion
This work presents, to the authors' knowledge, the first deep learning model aimed at detecting STIC/STIL in H&E-stained whole slide images of fallopian tubes. Leveraging a two-phase U-Net pipeline and expert-derived reference standards across multiple international centers, the model achieved high discriminative performance on independent internal and external test sets. The algorithm’s heatmap-like outputs enable pathologists to visually verify detected regions alongside morphology and IHC, potentially improving screening efficiency and diagnostic confidence. Given the rarity and diagnostic difficulty of STIC, prioritizing sensitivity is reasonable in a screening context, even at the expense of specificity. Threshold tuning demonstrated that sensitivity can be maximized with manageable increases in false positives, which can be triaged by pathologists. The model complements, rather than replaces, pathologists by flagging regions of interest that warrant closer review. The multi-institutional nature of the dataset, with varying scanners and staining protocols, underscores the model’s robustness. Still, broader validation on additional datasets and protocols is essential. As understanding of precursor lesions and their relationship to early HGSC evolves, DL tools like this could aid clinical decision-making and standardize detection across institutions.
Conclusion
An innovative deep learning model was developed to automatically detect regions of aberrant epithelium consistent with STIC/STIL in fallopian tube whole slide images. The model demonstrated high AUROC on both internal and external test sets, and its interpretable outputs can assist pathologists in reliable detection and prioritization of suspicious regions. Future work should include larger-scale multicenter validations, expansion of training data (especially isolated STIL), technical refinements, and exploration of more granular classification across precursor lesion subtypes to support clinical decision-making.
Limitations
- Dataset size, while large for a rare lesion, remains modest for deep learning; isolated STIL cases are relatively few.
- Data heterogeneity (multiple institutions, scanners, and staining protocols), though beneficial for robustness, necessitates further validation across additional centers and protocols to confirm generalizability.
- Interobserver variability in diagnosing STIC/STIL is moderate; the model targets detection rather than fine-grained classification, reflecting the ambiguity even among experts.
- Model performance depends on the operating threshold, creating a sensitivity-specificity trade-off; higher sensitivity increases false positives requiring manual review.
- False positives often arise from inflammation, epithelial detachment, and artifacts; mitigating these may require additional training data or post-processing strategies.