
Artificial intelligence unravels interpretable malignancy grades of prostate cancer on histology images
O. Eminaga, F. Saad, et al.
This study introduces an AI-driven grading system for prostate cancer (PCa) that outperformed Gleason grade groups in predicting patient outcomes across several external cohorts, improving patient risk stratification.
Introduction
The study addresses the limitations of the current Gleason grading system, particularly interobserver variability that affects reproducibility and clinical decision-making. The authors propose replacing reliance on microscopic pattern-based grading with an AI-derived, outcome-calibrated grading system learned from long-term prognostic endpoints (biochemical recurrence and cancer-specific death) using histology images. The objective is to develop and validate a new, interpretable, calibrated risk-based grading system for prostate cancer that is independent of Gleason grade groups and demonstrates superior prognostic performance across multiple external cohorts.
Literature Review
The paper situates its contribution within prior AI efforts that aimed to replicate or assist Gleason grading using supervised learning. Previous studies (e.g., Bulten et al., Ström et al., Nagpal et al.) achieved high agreement with expert pathologists (quadratically weighted Cohen’s kappa up to ~0.918; linearly weighted kappa ~0.833), and AI-assisted reading improved concordance among pathologists. However, these approaches inherit the Gleason system’s reader dependency and potential social/cognitive biases because their ground-truth labels come from small groups of experts. The authors argue for an interpretable, calibrated model optimized for prognostic outcomes rather than for reproducing Gleason patterns, thereby seeking to improve generalizability and clinical utility.
Methodology
Data: Multi-institutional datasets totaling 2647 radical prostatectomy (RP) cases with long-term follow-up (≥10 years) were used. Development cohort: 600 PCa cases from two institutions; external validation cohorts included CPCBN (approximately 890 cases; three institutions), PROCURE (287 patients; 16 digital TMA slots), and PLCO (861 patients; 1502 H&E-stained whole-slide images). Only representative slides/cores from the RP index lesion were used. Ethical approval and informed consent were obtained, and data access followed each cohort’s governance protocols. PCa regions were manually demarcated on whole-slide images (WSIs) following senior pathologist instruction.
Image processing and labeling: TMA cores/WSIs were tiled into patches. Patch labels were derived from patient-level biochemical recurrence (BCR) status. Case-level BCR scores were obtained by averaging core/slide predictions.
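The aggregation step can be illustrated with a minimal sketch. It assumes patch predictions are first averaged within each core/slide and then across cores/slides of a case; the summary only states that core/slide predictions were averaged, so the patch-level mean, the function name, and the tuple layout are illustrative assumptions rather than the authors' implementation.

```python
from collections import defaultdict
from statistics import mean


def case_level_scores(patch_preds):
    """Aggregate patch-level BCR probabilities to one score per case.

    patch_preds: iterable of (case_id, slide_id, probability) tuples.
    Assumption: patches are averaged per core/slide, then cores/slides
    are averaged per case (hypothetical layout, not from the paper).
    """
    per_slide = defaultdict(list)
    for case_id, slide_id, prob in patch_preds:
        per_slide[(case_id, slide_id)].append(prob)

    per_case = defaultdict(list)
    for (case_id, _slide_id), probs in per_slide.items():
        per_case[case_id].append(mean(probs))

    return {case_id: mean(scores) for case_id, scores in per_case.items()}


example = [
    ("A", "core1", 0.2), ("A", "core1", 0.4),  # core1 mean = 0.3
    ("A", "core2", 0.9),                       # core2 mean = 0.9
    ("B", "core1", 0.1),
]
scores = case_level_scores(example)
```

Averaging per core first keeps a core with many patches from dominating the case score; a flat mean over all patches would be an equally plausible reading of the text.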
Model development: A novel convolutional architecture was identified via neural architecture search (PlexusNet) and grid search. Training used Adam optimization and cross-entropy loss, with early stopping and regularization; 3-fold cross-validation guided model selection. Comparative models included ResNet variants, VGG-16, and EfficientNet trained on the same development set. The final model emphasized reduced capacity (substantially fewer parameters and feature maps than the comparators) and used patches at ×10 objective magnification; higher magnifications, attention aggregation, and Cox-based deep convolutional modeling did not yield performance gains.
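As a rough illustration of the 3-fold cross-validation mentioned above, the sketch below builds folds at the patient level so that patches from one patient never appear in both training and validation sets. Patient-level grouping is an assumption made here for illustration; the summary does not describe how the folds were constructed, and the function name is hypothetical.

```python
import random


def patient_level_folds(patient_ids, k=3, seed=0):
    """Yield (train_ids, val_ids) sets for k-fold cross-validation.

    Assumption: folds are split by patient, not by patch, to avoid
    leakage between training and validation data.
    """
    unique = sorted(set(patient_ids))
    random.Random(seed).shuffle(unique)
    folds = [unique[i::k] for i in range(k)]  # near-equal fold sizes
    for i in range(k):
        val = set(folds[i])
        train = set(unique) - val
        yield train, val


# One entry per patch; repeated IDs stand for several patches per patient.
ids = ["p1", "p1", "p2", "p3", "p3", "p4", "p5", "p6"]
splits = list(patient_level_folds(ids, k=3))
```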
Risk stratification: CHAID analysis on BCR scores defined four risk groups: low (≤5%), low-intermediate (6–42%), high-intermediate (43–74%), and high (≥75%).
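The reported cut-offs translate into a simple lookup, sketched below. It assumes the case-level BCR score is a probability in [0, 1] that is rounded to a whole percent before comparison, since the published ranges (≤5, 6–42, 43–74, ≥75) are stated in integer percent; the rounding rule and the function name are assumptions.

```python
def risk_group(bcr_score):
    """Map a case-level BCR score in [0, 1] to one of four risk groups.

    Thresholds follow the CHAID-derived cut-offs quoted in the text.
    Assumption: scores are rounded to whole percent before comparison.
    """
    if not 0.0 <= bcr_score <= 1.0:
        raise ValueError("bcr_score must lie in [0, 1]")
    pct = round(bcr_score * 100)
    if pct <= 5:
        return "low"
    if pct <= 42:
        return "low-intermediate"
    if pct <= 74:
        return "high-intermediate"
    return "high"


groups = [risk_group(s) for s in (0.03, 0.30, 0.60, 0.90)]
```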
Evaluation: Performance metrics included AUROC, Heagerty’s c-index, generalized concordance probability, calibration (Harrell’s resampling model calibration; comparison to KM estimates within 10 years), and model fit (AIC/BIC). Survival analyses comprised univariate/multivariate weighted Cox regression (accounting for non-proportional hazards), Fine–Gray competing risk regression for cancer-specific mortality, and Kaplan–Meier analyses. Nested partial likelihood ratio tests compared models with risk groups and/or Gleason grade groups (GG). Multicollinearity was assessed via VIF. Interpretability analyses included SHAP/LIME and feature distribution inspection; pathologists independently sorted image groups corresponding to risk strata to assess human interpretability and concordance with AI.
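The model-fit comparison rests on the standard definitions AIC = 2k − 2 ln L and BIC = k ln n − 2 ln L, where lower values indicate better fit. The sketch below applies them to two hypothetical Cox models; the log-likelihoods, parameter counts, and the choice of n are made-up illustrative values, not results from the study (for Cox models, n is sometimes taken as the number of events rather than the number of patients).

```python
import math


def aic(log_lik, n_params):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_lik


def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 ln(L)."""
    if n_obs <= 0:
        raise ValueError("n_obs must be positive")
    return n_params * math.log(n_obs) - 2 * log_lik


def compare_fits(fits, n_obs):
    """fits: {model_name: (log_likelihood, n_params)} -> AIC/BIC table."""
    return {
        name: {"aic": aic(ll, k), "bic": bic(ll, k, n_obs)}
        for name, (ll, k) in fits.items()
    }


# Hypothetical numbers, for illustration only.
table = compare_fits(
    {"clinical + risk groups": (-1200.0, 7), "clinical + GG": (-1215.0, 8)},
    n_obs=300,
)
```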
Key Findings
- External validation (CPCBN): c-index 0.682 ± 0.018; generalized concordance probability 0.927 (95% CI: 0.891–0.952); AUROC 0.714 (95% CI: 0.673–0.752). At a BCR score threshold of 0.5, sensitivity (recall) was 50.0%, specificity 83.2%, and precision 56.3%. Calibration was good for predicting 5- and 10-year BCR.
- Model comparison: The novel model achieved better effect sizes and higher generalized concordance probabilities for BCR prognosis than ResNet, VGG-16, and EfficientNet, while using 8–32× fewer feature maps in the last convolutional layer and 24–125× fewer parameters. EfficientNet did not outperform the novel model (non-nested partial likelihood ratio test). Higher magnification patches, attention aggregation, and Cox deep convolutional modeling did not improve performance.
- Risk groups: CHAID-derived thresholds produced four distinct risk categories with clear survival separation in Kaplan–Meier analyses across external cohorts.
- BCR-free survival: In CPCBN and PROCURE cohorts, the BCR score was an independent prognostic factor alongside clinical variables (e.g., PSA, tumor stage, age, margin status). Models including the novel risk groups showed better fit (lower AIC/BIC) than models with GG.
- Cancer-specific survival (CSS): Across CPCBN, PROCURE, and PLCO cohorts, the novel risk score was an independent prognostic factor for cancer-specific mortality. In CPCBN, tumor stage and the novel risk score were significant, whereas GG was not. Fine–Gray competing risk analyses confirmed the independent prognostic value of the novel risk groups. No PCa-related deaths occurred in the low-risk group in any of the three cohorts, whereas the corresponding lowest-grade stratum under GG-based stratification included patients who died of PCa in two of the three cohorts.
- CRPC: Among men with BCR (PROCURE cohort), CRPC frequency increased with higher risk groups; Kendall’s tau = 0.22, z = 4.227, p < 0.0001. The low-risk group had no CRPC cases. In multivariate Cox analysis, the novel risk score was an independent prognosticator for CRPC, whereas tumor stage, nodal stage, and surgical margin status were not.
- Interpretability: Five experienced genitourinary pathologists, blinded to risk groups and clinical data, showed strong concordance with the AI-defined image groupings/risk strata. Low-risk images were predominantly Gleason pattern 3; high-risk images predominantly patterns 4/5; intermediate groups showed mixed patterns 3/4. Feature analyses (e.g., the 23rd representative feature; Levene test P < 0.0001) revealed distinct distributions across risk groups and a histopathologic gradient (loss of organized glandular architecture) aligning with risk.
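The sensitivity, specificity, and precision quoted for the 0.5 BCR score threshold follow from a standard confusion matrix. A minimal sketch, assuming a case counts as predicted-positive when its score is at or above the threshold (whether the boundary is inclusive is an assumption) and using made-up scores and labels:

```python
def threshold_metrics(scores, labels, threshold=0.5):
    """Sensitivity, specificity, and precision at a score threshold.

    scores: case-level BCR scores in [0, 1]; labels: 1 if BCR occurred.
    Assumption: score >= threshold counts as a positive prediction.
    """
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = score >= threshold
        if predicted and label:
            tp += 1
        elif predicted:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # identical to recall
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
    }


# Illustrative data only, not taken from the study.
metrics = threshold_metrics(
    scores=[0.9, 0.6, 0.4, 0.2, 0.7, 0.1],
    labels=[1, 0, 1, 0, 1, 0],
)
```

Sensitivity and recall are the same quantity, which is why a single value suffices for both.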
Discussion
The findings demonstrate that an AI-derived, calibrated, and interpretable grading system based on prognostic outcomes can outperform the traditional Gleason grade groups in predicting biochemical recurrence and cancer-specific mortality. By decoupling risk stratification from expert-dependent pattern proportions and instead anchoring it to long-term outcomes, the proposed system reduces the impact of interobserver variability and associated biases. The robust external validation across three independent cohorts (CPCBN, PROCURE, PLCO), superior model fit (AIC/BIC) compared with GG, and consistent performance in BCR-free survival and CSS analyses underscore clinical relevance. The clear separation of risk groups, their association with CRPC development after BCR, and concordance with pathologists’ independent assessments suggest that the model’s risk groups are human-interpretable and align with recognizable histopathologic features. Good calibration supports the use of model scores as risk estimates and facilitates integration into clinical decision tools and nomograms. Overall, the approach suggests a path toward standardized, reproducible malignancy grading that complements or supersedes Gleason-based systems for prognostication.
Conclusion
The study introduces and externally validates a novel, AI-driven malignancy grading system for prostate cancer that yields four interpretable risk groups with superior prognostic performance to Gleason grade groups. The model is efficient, well-calibrated, and generalizes across multiple cohorts, independently predicting BCR, cancer-specific mortality, and CRPC risk following BCR. The approach fosters synergy with pathologists by providing interpretable image groupings aligned with familiar histopathologic patterns while mitigating interobserver variability. Future work should focus on broader clinical integration (e.g., incorporation into prognostic nomograms), prospective validation, assessment across diverse clinical settings and specimen types, and further investigation of robustness to sampling variability, tissue degradation, and other pre-analytic factors.
Limitations
- Uncertainty remains regarding the model’s ability to overcome sampling errors, tissue fragmentation, and degradation of histologic material, as noted by the authors.
- Data sharing is restricted due to transfer agreements and privacy concerns, limiting open external replication.
- Training data selection excluded certain Gleason pattern combinations (e.g., pattern 5 cores and some others) to encourage learning alternative features, which may affect generalizability to underrepresented patterns.
- The development relied on TMA cores and selected WSI regions; performance in other specimen types or broader clinical workflows requires further prospective validation.