
Predicting biochemical recurrence of prostate cancer with artificial intelligence
H. Pinckaers, J. V. Ipenburg, et al.
This study examines whether deep learning can predict biochemical recurrence of prostate cancer after surgery. Using tissue from 685 patients, Hans Pinckaers and colleagues report results suggesting that a learned model can uncover tissue patterns that may add prognostic information beyond traditional grading systems.
Introduction
Prostate cancer is a common malignancy among men, and radical prostatectomy is a frequent curative treatment. Postoperative surveillance uses serum PSA: a rise after initially undetectable levels indicates biochemical recurrence, a prognostic indicator for metastasis and prostate cancer mortality. Current risk stratification incorporates ISUP grade, diagnostic PSA, and TNM stage, but the ISUP grade, which is based on Gleason growth patterns, has limitations, including interobserver variability and coarse morphological categorization that may miss finer prognostic features. The authors hypothesize that deep learning can identify additional, within-grade morphological patterns predictive of recurrence from H&E-stained tissue and thereby augment prognostication beyond ISUP grade. The study aims to develop and validate a deep learning system (DLS) that predicts time to biochemical recurrence directly from tissue microarray (TMA) cores and to interpret the learned patterns using concept-based explainability.
Literature Review
Deep learning has achieved state-of-the-art performance in image classification and has demonstrated the ability to infer genetic and prognostic information from histomorphology, including mutation prediction in lymphoma and lung cancer, and survival prediction in colorectal cancer and intrahepatic cholangiocarcinoma. Prior prostate cancer studies with biochemical recurrence endpoints either relied on manually selected features (Leo et al.) or used learned slide encodings coupled with classical models without end-to-end training on outcomes (Yamamoto et al.); both approaches constrain the discovery of new features. The black-box nature of deep learning motivates explainability; Automatic Concept Explanations (ACE) can surface semantically coherent tissue patterns learned by networks, enabling interpretation of prognostic concepts.
Methodology
Study design and cohorts: The DLS biomarker was developed using a nested case-control study from Johns Hopkins Hospital (JHH) via the Prostate Cancer Biorepository Network. From 4860 men who underwent radical retropubic prostatectomy (1993–2001), 524 matched case-control pairs (724 unique patients) were assembled; matching used incidence density sampling on age at surgery, race, pathologic stage, and Gleason sum. Four TMA cores (0.6 mm) per patient were extracted from the highest-grade tumor nodule; after excluding cores with <25% tissue or lacking clear cancer cells, 2343 TMA spots from 685 unique patients remained (492 recurrence cases). Patients were split into a development set (268 matched pairs; 503 unique patients) and a held-out matched test set (91 pairs; 182 unique patients). Follow-up ran to 2005 (median 4.0 years).
Outcome definition: Recurrence was defined as PSA > 0.2 ng/mL on ≥2 occasions after an undetectable post-prostatectomy nadir, or as an earlier event indicating recurrence (local recurrence, systemic metastases, or prostate cancer death).
External validation cohort: 204 patients from NYU Langone Medical Center (2001–2003), followed to 2019 (median 5 years). Recurrence was defined as a single PSA ≥ 0.4 ng/mL, or PSA ≥ 0.2 ng/mL followed by increasing values. Four TMA subsamples per case were taken from the largest or any higher-grade focus; 620 TMA spots were included.
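The two PSA-based recurrence rules differ between cohorts, and a small sketch can make the difference concrete. The function names are hypothetical, the input is assumed to be a chronological list of PSA values (ng/mL) measured after the undetectable nadir, and "followed by increasing values" is read here as the next measurement being strictly higher, which is an interpretation rather than the authors' stated rule. The non-PSA events in the JHH definition (local recurrence, metastases, death) are not modeled.

```python
from typing import Sequence


def recurred_jhh(psa: Sequence[float]) -> bool:
    """JHH-style rule: PSA > 0.2 ng/mL on at least two occasions.

    Assumes `psa` holds only post-nadir measurements in chronological order.
    """
    return sum(1 for value in psa if value > 0.2) >= 2


def recurred_nyu(psa: Sequence[float]) -> bool:
    """NYU-style rule: a single PSA >= 0.4, or PSA >= 0.2 followed by a rise.

    Assumption: "followed by increasing values" means the very next
    measurement is strictly higher than the current one.
    """
    for i, value in enumerate(psa):
        if value >= 0.4:
            return True
        if value >= 0.2 and i + 1 < len(psa) and psa[i + 1] > value:
            return True
    return False
```

Under these assumptions, a series such as 0.05, 0.25, 0.30 counts as recurrence under both rules, whereas a single isolated value of 0.45 counts only under the NYU-style rule.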
Image acquisition and preprocessing: JHH slides scanned with Hamamatsu NanoZoomer-XR (0.23 µm/px); NYU with Leica Aperio AT2 (0.25 µm/px). TMA spots extracted using QuPath. TMAs resized to 1.0 µm/px and cropped to 768×768 pixels. Extensive augmentations applied (flips, rotations, warping, random crops, HSV color jitter, JPEG compression, elastic transforms, Gaussian blur, contrast/gamma/brightness changes, embossing, sharpening, Gaussian noise, cutout) using albumentations and fast.ai.
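The resampling arithmetic implied by these numbers can be sketched as follows. The helper names and the 4000-pixel example image are hypothetical, and centering the 768×768 crop is an assumption; the summary does not say how the crop was positioned.

```python
from typing import Tuple


def rescale_factor(source_um_per_px: float, target_um_per_px: float = 1.0) -> float:
    """Factor by which pixel dimensions shrink when resampling to the target spacing."""
    return source_um_per_px / target_um_per_px


def scaled_size(pixels: int, factor: float) -> int:
    """Pixel length after resampling, rounded to the nearest integer."""
    return round(pixels * factor)


def center_crop_box(width: int, height: int, size: int = 768) -> Tuple[int, int, int, int]:
    """(left, top, right, bottom) of a centered size x size crop (assumed placement)."""
    if width < size or height < size:
        raise ValueError("image is smaller than the requested crop")
    left = (width - size) // 2
    top = (height - size) // 2
    return (left, top, left + size, top + size)


# Hypothetical 4000-pixel-wide spot image from each scanner.
jhh_width = scaled_size(4000, rescale_factor(0.23))  # Hamamatsu, 0.23 um/px
nyu_width = scaled_size(4000, rescale_factor(0.25))  # Aperio, 0.25 um/px
```

At 1.0 µm/px, a 768-pixel crop spans 768 µm, which is wider than the 0.6 mm (600 µm) core diameter, so a well-centered crop can contain the whole core.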
Labels: For cases with events, TMA spots labeled 0–4 by years to first event (0: within 1 year; 4: ≥4 years). Non-event cases labeled 4.
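A minimal sketch of this labeling scheme is shown below. The function name is hypothetical, and the intermediate labels 1 to 3 are assumed to be whole years elapsed (so an event at exactly 1.0 years gets label 1); the summary states only the two endpoints of the scale.

```python
import math
from typing import Optional


def spot_label(years_to_event: Optional[float]) -> int:
    """Map time to first event onto the 0-4 training label.

    None means no event was observed; such cases receive label 4.
    Assumption: labels 1-3 are the number of whole years elapsed.
    """
    if years_to_event is None:
        return 4
    if years_to_event < 0:
        raise ValueError("time to event cannot be negative")
    return min(4, math.floor(years_to_event))
```

One consequence of this scheme is that a patient who recurs after 6 years and a patient who never recurs receive the same label, which is the coarseness noted in the limitations.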
Model development: Convolutional neural networks (ResNet50-D pretrained on ImageNet) implemented in PyTorch. Optimization used Lookahead with RAdam (learning rate 2e-4), minibatch size 16, weight decay 7e-3, dropout p=0.15 before the final fully connected layer, and EfficientNet-style stochastic depth (residual drop p=0.3). Hyperparameters tuned via Bayesian Optimization (details in Supplementary Notes 1). Model selection via concordance index on validation folds with a moving average of weights over 5 epochs. Cross-validation: development set split into three folds using IterativeStratification (scikit-multilearn), respecting matched variables. Final patient-level prediction used the TMA spot with the highest score. Fifteen CNNs were trained per fold, and the five best performers from each fold formed the final DLS ensemble.
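Two steps in this paragraph, taking the highest-scoring spot per patient and scoring models by concordance index, can be illustrated with a plain-Python sketch. The function names are hypothetical. The scores are assumed to be oriented so that higher means higher recurrence risk, and the concordance index is a simple Harrell-style version that skips tied times; neither detail is confirmed by the summary.

```python
from itertools import combinations
from typing import Dict, Hashable, Iterable, Sequence, Tuple


def patient_scores(spot_scores: Iterable[Tuple[Hashable, float]]) -> Dict[Hashable, float]:
    """Keep the highest spot score per patient (scores assumed risk-oriented)."""
    best: Dict[Hashable, float] = {}
    for patient_id, score in spot_scores:
        if patient_id not in best or score > best[patient_id]:
            best[patient_id] = score
    return best


def concordance_index(times: Sequence[float], events: Sequence[bool],
                      risks: Sequence[float]) -> float:
    """Fraction of comparable pairs in which the earlier event has the higher risk.

    A pair is comparable when the patient with the shorter time had an event.
    Pairs with tied times are skipped; tied risks count as half concordant.
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue
        first, second = (i, j) if times[i] < times[j] else (j, i)
        if not events[first]:
            continue  # the earlier patient was censored, so the order is unknown
        comparable += 1
        if risks[first] > risks[second]:
            concordant += 1.0
        elif risks[first] == risks[second]:
            concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 1.0 means every comparable pair is ranked correctly, 0.5 corresponds to random ranking, and values below 0.5 indicate a reversed ordering.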
Statistical analysis: For JHH test set (nested case-control), conditional logistic regression estimated odds ratios (OR) per unit increase in the continuous DLS output (DLS multiplied by -1 so higher values indicate higher recurrence risk). Primary model included only the DLS; secondary multivariable model adjusted for non-matched variables: preoperative PSA, positive surgical margins, binned year of surgery, and a continuous ISUP grade term to correct residual differences (matching was on Gleason sum). Hazard ratios (HR) using Cox proportional hazards regression were computed for the NYU cohort (univariable and multivariable including DLS, ISUP grade, pathological stage, surgical margin status, and preop PSA). Proportional hazards assumption tested via correlation of residuals with follow-up rank (all p>0.01). Kaplan–Meier curves plotted for NYU (not applicable to JHH due to design). Missing PSA (1 JHH, 2 NYU) imputed by median.
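For 1:1 matched pairs and a single covariate, conditional logistic regression reduces to a one-parameter likelihood, which the sketch below maximizes with Newton's method. It is an illustration under those assumptions, not the authors' analysis code, which presumably relied on a statistics package; the function name and the toy scores are hypothetical. The raw DLS output is negated first, as described above, so that higher values mean higher risk.

```python
import math
from typing import Sequence


def paired_conditional_logit(case_risk: Sequence[float], control_risk: Sequence[float],
                             max_iter: int = 100, tol: float = 1e-10) -> float:
    """Log odds ratio per unit of risk score for 1:1 matched pairs.

    Maximizes sum(log(sigmoid(beta * d))) with d = case risk - control risk.
    If every d has the same sign, no finite maximum exists and beta just
    keeps growing until max_iter is reached.
    """
    diffs = [c - k for c, k in zip(case_risk, control_risk)]
    beta = 0.0
    for _ in range(max_iter):
        grad = 0.0
        hess = 0.0
        for d in diffs:
            p = 1.0 / (1.0 + math.exp(-beta * d))  # P(observed case is the case)
            grad += d * (1.0 - p)
            hess -= d * d * p * (1.0 - p)
        if hess == 0.0:
            break  # all differences are zero, so beta is not identifiable
        step = grad / hess
        beta -= step
        if abs(step) < tol:
            break
    return beta


# Toy raw DLS outputs (hypothetical): lower raw values mean earlier recurrence.
case_dls = [1.0, 1.0, 1.0, 2.0]
control_dls = [2.0, 2.0, 2.0, 1.0]
beta = paired_conditional_logit([-x for x in case_dls], [-x for x in control_dls])
odds_ratio = math.exp(beta)
```

In this toy data the case has the higher risk score in three of four pairs, each by exactly one unit, so the fitted odds ratio per unit increase is 3.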
Explainability (ACE): From the best-performing single CNN, TMA images were tiled into 256×256 patches (excluding >50% whitespace), padded to 768×768, and intermediate features from layer 42/50 were extracted. Dimensionality reduced by PCA (50 components), then k-means clustering (k=15) produced concept clusters. Concepts were ranked by the DLS score of originating cores to identify patterns associated with recurrence timing. For each concept, 25 examples were reviewed by a uropathology-specialized pathologist blinded to case characteristics and model predictions.
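The first ACE step, tiling an image and discarding mostly white tiles, can be sketched in plain Python. The function name is hypothetical, the image is assumed to be a grayscale grid of 0-255 values, and the brightness cutoff of 220 for "whitespace" is an illustrative assumption; the summary gives only the 256×256 tile size and the 50% limit. Padding, feature extraction, PCA, and k-means are not shown.

```python
from typing import List, Sequence, Tuple


def tissue_tiles(image: Sequence[Sequence[int]], tile: int = 256,
                 white_level: int = 220,
                 max_white_fraction: float = 0.5) -> List[Tuple[int, int]]:
    """Return (top, left) corners of non-overlapping tiles that contain enough tissue.

    A pixel counts as whitespace when its value is >= white_level (assumed cutoff).
    Tiles with more than max_white_fraction whitespace are excluded.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    kept: List[Tuple[int, int]] = []
    for top in range(0, height - tile + 1, tile):
        for left in range(0, width - tile + 1, tile):
            white = sum(
                1
                for row in range(top, top + tile)
                for col in range(left, left + tile)
                if image[row][col] >= white_level
            )
            if white / (tile * tile) <= max_white_fraction:
                kept.append((top, left))
    return kept
```

A 768×768 image yields a 3×3 grid of candidate tiles, so each TMA spot contributes at most nine patches before whitespace filtering.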
Key Findings
- Development and data: 2343 TMA spots from 685 unique JHH patients (492 recurrence cases); held-out matched test set: 91 pairs (182 patients). External validation: 204 NYU patients (620 TMA spots), 38 recurrences (19%).
- JHH test set (primary matched analysis): DLS marker strongly associated with recurrence, OR 3.28 (95% CI 1.73–6.23; p<0.005) per unit increase in risk score (range approximately 0–3, with two values slightly below 0).
- JHH test set (multivariable): Adjusting for preop PSA, positive surgical margins, year of surgery, and residual ISUP differences, OR 3.32 (95% CI 1.63–6.77; p=0.001) per unit increase.
- NYU external validation: Univariable HR 5.78 (95% CI 2.44–13.72; p<0.005) per unit increase; multivariable HR 3.02 (95% CI 1.10–8.29; p=0.03) adjusting for ISUP grade, pathological stage, surgical margins, and preop PSA. Kaplan–Meier curves using median and quartile-based thresholds showed clear separation between low- and high-risk groups.
- Explainability: ACE revealed semantically meaningful concepts correlating with recurrence timing. Adverse concepts predominantly displayed Gleason pattern 4 and some 5, including cribriform-like architecture; favorable concepts showed predominantly Gleason pattern 3 with well-formed glands. These align with expected histopathologic prognostic patterns, supporting model validity.
- Robustness: Strong and similar associations across two institutions with different scanners and staining protocols, indicating robustness to preanalytical variation.
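The reported p-values can be sanity-checked against the confidence intervals with a standard Wald approximation on the log scale. This is a reader's consistency check, not part of the study; the function name is hypothetical, and the check assumes the intervals are symmetric on the log scale, as is usual for odds and hazard ratios.

```python
import math

Z_95 = 1.959964  # two-sided 95% quantile of the standard normal distribution


def wald_p_from_ci(estimate: float, lower: float, upper: float) -> float:
    """Approximate two-sided p-value from a ratio estimate and its 95% CI."""
    standard_error = (math.log(upper) - math.log(lower)) / (2 * Z_95)
    z = math.log(estimate) / standard_error
    return math.erfc(abs(z) / math.sqrt(2))


reported = {
    "JHH univariable OR": (3.28, 1.73, 6.23),
    "JHH multivariable OR": (3.32, 1.63, 6.77),
    "NYU univariable HR": (5.78, 2.44, 13.72),
    "NYU multivariable HR": (3.02, 1.10, 8.29),
}
for name, (estimate, lower, upper) in reported.items():
    print(f"{name}: approximate p = {wald_p_from_ci(estimate, lower, upper):.4f}")
```

Under this approximation, all four estimates come out consistent with the p-values listed above: both univariable results fall well below 0.005, and the multivariable results round to 0.001 and 0.03.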
Discussion
The study demonstrates that a deep learning-derived morphological biomarker from H&E-stained prostatectomy TMA cores captures prognostic information for biochemical recurrence beyond conventional ISUP grading. The DLS, trained end-to-end on time-to-recurrence labels, was strongly associated with recurrence risk in both an internal matched test set and an external cohort, supporting the hypothesis that within-grade morphological features contribute prognostically. Kaplan–Meier stratification suggests the biomarker is particularly effective at distinguishing men with relatively rapid recurrence within approximately five years post-surgery. Concept-based explanations identified patterns consistent with recognized prognostic morphologies, such as cribriform and disorganized pattern 4/5 associated with early recurrence and well-formed pattern 3 glands associated with later recurrence, which enhances interpretability and clinical plausibility. The consistency across institutions supports generalizability. Collectively, these findings underscore the potential of AI-driven histomorphology to augment existing risk stratification and inform adjuvant treatment considerations.
Conclusion
A deep learning-based visual biomarker derived from TMA hotspots of prostatectomy specimens predicts the speed of biochemical recurrence and adds prognostic value beyond ISUP grade. In a matched JHH test set, the DLS achieved a multivariable OR of 3.32 (95% CI 1.63–6.77; p=0.001) per unit increase; in an external NYU cohort, it achieved a multivariable HR of 3.02 (95% CI 1.10–8.29; p=0.03). Explainability analyses revealed clinically interpretable concepts aligned with Gleason patterns. Future work should validate the biomarker on whole-slide prostatectomy sections across multifocal tumors, investigate endpoints closer to clinical utility such as time-to-metastasis and prostate cancer-specific mortality, and leverage more granular follow-up to employ survival-based loss functions for model training.
Limitations
- Study design: The nested case-control design in the JHH cohort precluded Kaplan–Meier visualization and Cox modeling in that set; matching was on Gleason sum with residual ISUP differences adjusted via a continuous term.
- Follow-up granularity and duration: JHH median follow-up was 4 years, and labels were binned into coarse 0–4 year categories, potentially limiting long-term prognostic discrimination and the use of survival-based loss functions. Few patients were at risk at longer times, reducing reliability of late separation in survival curves.
- Sampling constraints: Use of TMA cores from the highest-grade focus captures limited regions of often multifocal tumors, potentially missing more aggressive patterns elsewhere, including extraprostatic extension or perineural invasion. Generalization to whole-slide sections requires further validation.
- Morphologic covariates: Detailed data on cribriform growth and intraductal carcinoma were not consistently available for multivariable analyses; interobserver variability in these features is known.
- Endpoint choice: Biochemical recurrence, while more common and earlier to occur than clinical endpoints, is a surrogate; endpoints such as metastasis and disease-specific mortality may be more directly useful but are harder to link to primary tumor morphology over long intervals.
- Preanalytical variability: Although robustness across two institutions was shown, broader validation across additional sites, staining protocols, and scanners is needed.