UroPredict: Machine learning model on real-world data for prediction of kidney cancer recurrence (UroCCR-120)


G. Margue, L. Ferrer, et al.

UroPredict is a machine learning model, developed by a team including Gaëlle Margue and Loïc Ferrer, that predicts postoperative disease-free survival for patients with kidney cancer from real-world data. By identifying which patients are most likely to relapse after surgery, it aims to support personalized follow-up care and adjuvant therapy choices.

Introduction
Kidney cancer incidence is rising globally, and renal cell carcinoma (RCC) is frequently detected at localized stages where surgery is the standard of care. Despite surgery, recurrence risk at 5 years ranges from 20% to 50% depending on stage. Current surveillance recommendations rely on prognostic scores with only moderate predictive performance, leading to uncertainty in tailoring follow-up imaging and selecting candidates for adjuvant therapy. In the context of personalized medicine, there is a need for accurate, individualized prediction of recurrence risk after nephrectomy to optimize surveillance intensity and guide adjuvant therapy decisions. This study aimed to develop and validate a real-world, machine learning-based model to predict individual postoperative disease-free survival (DFS) in localized or locally advanced RCC and to compare its performance against conventional prognostic scores.
Literature Review
Multiple established prognostic systems are used for RCC but often show only moderate performance, particularly on external validation. The UISS (Fuhrman grade, ECOG, pT) classifies patients into three risk groups with c-indices around 0.56–0.72 across external validations. The SSIGN system (stage, size, grade, necrosis) initially reported high c-index (~0.84 for CSS in ccRCC) but external validations show reduced accuracy (c-index ~0.63–0.78). Leibovich models tailored to histologic subtypes (clear cell, papillary, chromophobe) achieved c-indices up to 0.83–0.86 in development cohorts, but 0.73–0.81 on external validation. The GRANT score (grade, age, nodes, tumor) classifies into two risk groups and demonstrated an external concordance of ~0.59. Prior ML efforts (e.g., DeepSurv, Naive Bayes) in smaller cohorts suggested improved accuracy but often lacked detailed methodology and individual-level interpretability. Known prognostic factors such as TNM stage, Fuhrman grade, histological subtype (ccRCC worse than papillary/chromophobe), ECOG performance status, and microvascular invasion are consistently associated with outcomes. Inflammatory biomarkers, notably neutrophil-to-lymphocyte ratio (NLR), are also linked with poorer prognosis in RCC. These data support the inclusion of clinical, pathological, and inflammatory variables in advanced predictive models.
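The c-indices quoted above measure concordance: the probability that, of two comparable patients, the one assigned the higher predicted risk is the one who recurs first. A minimal pure-Python sketch of Harrell's c-index can make the metric concrete; this is illustrative only, not the study's evaluation code, and production survival libraries handle ties and censoring more carefully:

```python
def concordance_index(times, events, risk_scores):
    """Harrell's c-index: among comparable pairs, the fraction where the
    patient with the higher predicted risk is observed to fail earlier.
    times: follow-up times; events: 1 = recurrence observed, 0 = censored;
    risk_scores: higher = predicted worse prognosis."""
    concordant, tied, comparable = 0, 0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # a pair is comparable only if patient i is known to fail
            # strictly before patient j's observed time
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
                elif risk_scores[i] == risk_scores[j]:
                    tied += 1
    return (concordant + 0.5 * tied) / comparable

# toy example: predicted risk ordering matches the failure ordering exactly
print(concordance_index([2, 5, 8, 10], [1, 1, 1, 0], [0.9, 0.6, 0.3, 0.1]))  # 1.0
```

A value of 0.5 corresponds to random ranking, which is why external c-indices of ~0.56–0.63 for some conventional scores indicate only modest discriminative power.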
Methodology
Study design and population: Data were drawn from the French UroCCR prospective database (NCT03293563), labeled by the French NCI and HAS. Included were patients undergoing partial or radical nephrectomy (open or laparoscopic) between May 2000 and January 2020 for localized or locally advanced RCC (pT any, N0). Exclusions: hereditary RCC, non-primary renal tumors, benign lesions, concomitant malignancies, metastases at diagnosis, no post-surgery follow-up information, insufficient data, and conditions likely to alter blood tests (hematopathies, chronic inflammatory diseases). Data were collected prospectively with informed consent and regulatory approvals.

Train-test split and validation: Participating sites were randomly assigned to training or testing cohorts (~2:1 ratio) with similar DFS distributions. Model development and risk group thresholds were optimized on the training set and evaluated on the test set for external validation.

Preprocessing: Categorical variables with unbalanced modalities were recoded and one-hot encoded; numerical variables were normalized. Missing data were handled via multiple imputation (3 imputations) using the MICE algorithm with gradient-boosted decision trees. Non-informative features (null variance, redundancy, imbalance) were removed.

Models and tuning: Time-to-event models trained included Cox proportional hazards models with LASSO regularization, random survival forests, and gradient-boosted survival trees. Hyperparameters were tuned via repeated cross-validation (3 × 10-fold) with Bayesian optimization maximizing integrated AUC (iAUC) over 6–60 months post-surgery.

Evaluation: Discrimination was assessed using iAUC and time-dependent AUC; calibration with the time-dependent Brier score. Censoring was addressed using Kaplan–Meier-based inverse probability of censoring weighting (IPCW). Confidence intervals used the Nadeau–Bengio correction in cross-validation and permutation bootstrapping for external validation.
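The IPCW step mentioned above re-weights uncensored observations by the inverse of a Kaplan-Meier estimate of the censoring distribution, so that patients lost to follow-up do not bias time-dependent metrics. A simplified pure-Python sketch (not the authors' implementation; it ignores tied times and assumes weights are requested before the last censoring time):

```python
def censoring_survival_curve(times, events):
    """Kaplan-Meier estimate G(t) of the *censoring* distribution:
    censoring (event == 0) is treated as the 'event' of interest.
    Returns a step function as a list of (time, G) pairs."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    at_risk = len(times)
    g, curve = 1.0, []
    for i in order:
        if events[i] == 0:                 # a censoring "event"
            g *= 1.0 - 1.0 / at_risk
        curve.append((times[i], g))
        at_risk -= 1
    return curve

def ipcw_weight(curve, t):
    """IPCW weight 1 / G(t-): uncensored observations evaluated at time t
    are up-weighted to stand in for comparable patients censored earlier."""
    g = 1.0
    for time, surv in curve:
        if time < t:
            g = surv
        else:
            break
    return 1.0 / g

curve = censoring_survival_curve([1, 2, 3, 4], [1, 0, 1, 0])
# one of three at-risk patients was censored before t=3, so G(3-) = 2/3
print(round(ipcw_weight(curve, 3), 6))  # 1.5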
Final model and features: The best-performing approach combined multiple imputation with a Cox PH model, yielding a final signature of 24 clinical, pathological, and biological variables. Permutation-based feature importance identified tumor size, histological subtype, and age at surgery as top contributors.

Interpretability and clinical utility: Individual predictions were explained with SHAP values to show feature-level risk contributions. Decision curve analysis assessed clinical utility for predicting 5-year recurrence risk across threshold probabilities (10–50%).

Risk stratification: Using training data, thresholds were selected to define four risk groups (very low, low, medium, high), targeting one group with very low actual relapse at 5 years and one with high relapse. Thresholds were informed by false omission rate and positive predictive value analyses at 5 years, then applied to the test cohort. Group DFS curves were estimated via Kaplan–Meier and compared by log-rank tests.

Comparative evaluation: The ML model was compared against the UISS, SSIGN, GRANT, and Leibovich scores. Due to missingness, conventional scores were computed only in subsets. Pairwise comparisons used one-sided p-values with bootstrapping on the applicable subsets.
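Decision curve analysis compares strategies by net benefit, which trades true positives against false positives weighted by the odds of the chosen risk threshold. A minimal sketch of the standard net-benefit formula on toy inputs (illustrative values, not the study's data):

```python
def net_benefit(pred_risks, outcomes, threshold):
    """Decision-curve net benefit of 'treat if predicted risk >= threshold':
    NB = TP/N - FP/N * (pt / (1 - pt)), where pt is the threshold probability.
    outcomes: 1 if the event (e.g., 5-year recurrence) occurred, else 0."""
    n = len(outcomes)
    treated = [(p >= threshold, y) for p, y in zip(pred_risks, outcomes)]
    tp = sum(1 for flag, y in treated if flag and y == 1)
    fp = sum(1 for flag, y in treated if flag and y == 0)
    return tp / n - (fp / n) * (threshold / (1.0 - threshold))

# toy example: 5 patients, outcome = 1 if recurrence by 5 years
nb = net_benefit([0.8, 0.6, 0.4, 0.2, 0.1], [1, 1, 0, 0, 0], 0.30)
print(round(nb, 3))  # 0.314
```

On this scale, a net benefit of 0.10 at a 30% threshold (as reported in the findings) can be read as 10 additional true recurrences detected per 100 patients without an increase in false positives.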
Key Findings
- Cohort: 3372 surgically treated RCC patients; median follow-up 30 months. Training cohort n=2241; test cohort n=1131. Baseline characteristics were broadly similar between cohorts, and DFS curves did not differ significantly (p=0.67).
- Final ML model: Cox PH with 24 variables after multiple imputation and feature filtering. Most important features: tumor size, histological subtype, age at surgery.
- Discrimination and calibration (test set): integrated AUC 0.81 (95% CI 0.77–0.85); integrated Brier score 0.11 (0.10–0.13). Time-specific AUCs: 0.86 at 0.5 years (0.80–0.91), 0.86 at 1 year (0.80–0.90), 0.81 at 2 years (0.76–0.85), 0.71 at 5 years (0.61–0.80). Time-specific Brier scores: 0.04 (0.03–0.05) at 0.5 years; 0.06 (0.05–0.07) at 1 year; 0.10 (0.08–0.12) at 2 years; 0.18 (0.14–0.22) at 5 years. Performance decreased over time, likely due to fewer patients at risk and fewer observed events.
- Decision curve analysis: higher net benefit than alternative strategies for 10–50% risk thresholds. At a 30% threshold, net benefit was 0.10, equating to 10 additional true recurrences detected per 100 patients without increasing false positives.
- Individual predictions: SHAP explanations provided per-patient risk attribution. One example patient had a predicted 63% 5-year recurrence risk (vs. a 20% cohort average), driven by clear cell subtype, Fuhrman grade 4, necrosis, and large tumor size; younger age was protective.
- Risk stratification (test cohort): four groups with good separation (AUC 0.78, 95% CI 0.74–0.83). Very low risk: 19% of patients, actual 5-year recurrence <2%; low risk: 43%, 5-year DFS 83%; medium risk: 22%, DFS 54%; high risk: 17%, DFS 49%.
- Comparison to conventional scores: the ML model outperformed GRANT (p<0.001), SSIGN (p=0.01), and UISS (p<0.001) on applicable subsets. Conventional scores were calculable only for subsets due to missingness (e.g., UISS n=882, SSIGN n=946, GRANT n=1008, Leibovich 2018 n=578).
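The group-level DFS figures above are Kaplan-Meier estimates. The estimator itself is simple; a pure-Python sketch on toy follow-up data (illustrative only, not the study's analysis code):

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate S(t), returned as a list of
    (event_time, S) steps. events: 1 = recurrence observed, 0 = censored."""
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    s, curve = 1.0, []
    for t in event_times:
        at_risk = sum(1 for ti in times if ti >= t)   # event-free just before t
        d = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        s *= 1.0 - d / at_risk                        # product-limit update
        curve.append((t, s))
    return curve

# toy data: times in years, event = 1 for observed recurrence
curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
print([(t, round(s, 2)) for t, s in curve])  # [(1, 0.8), (2, 0.6), (3, 0.3)]
```

In the study, curves like these were computed per risk group and compared with log-rank tests to confirm that the four strata separate meaningfully.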
Discussion
The study addresses the clinical need for accurate postoperative recurrence risk prediction in localized/locally advanced RCC, where existing prognostic scores have moderate performance and limited applicability due to missing data requirements. By leveraging a large, prospectively collected, real-world multicenter dataset and modern ML methods for survival analysis, the authors developed a model with strong discrimination and reasonable calibration that generalizes to an external test cohort. Key predictors (tumor size, histology, age, Fuhrman grade, ECOG PS, vascular invasion, and inflammatory markers like NLR) align with established prognostic factors, supporting the model’s face validity. Interpretability via SHAP enables patient-specific explanations, which can enhance clinician trust and shared decision-making. Decision curve analysis demonstrates potential clinical benefit in guiding surveillance intensity and adjuvant therapy selection across plausible risk thresholds. Stratification into four risk categories identifies a sizable very-low-risk group that may warrant de-escalated surveillance and a high-risk group potentially suitable for adjuvant therapy discussions. Performance declines over longer horizons, likely reflecting increased censoring and fewer events, highlighting the importance of periodic model recalibration or horizon-specific modeling. Overall, the ML approach improves upon conventional risk scores in both accuracy and applicability by handling incomplete data and providing individualized risk estimates.
Conclusion
Applying machine learning to a large, prospective, real-world RCC cohort enabled accurate individualized prediction of disease-free survival after nephrectomy, outperforming established prognostic scores and supporting clinically meaningful risk stratification. The model’s interpretability and decision curve–demonstrated utility suggest it can inform tailored surveillance and adjuvant therapy decisions. Future work should focus on external validations across diverse healthcare systems, prospective impact studies, integration into clinical workflows, periodic recalibration for longer-term horizons, and potential augmentation with imaging, molecular, or genomics data to further enhance performance.
Limitations
- Time-dependent performance decline: both discrimination and calibration decreased at later time points (e.g., 5-year AUC 0.71), likely due to fewer patients at risk and fewer observed events.
- Comparative analyses limited by missing data: conventional scores (UISS, SSIGN, GRANT, Leibovich) could be computed only in subsets of the test cohort because of incomplete observations, potentially affecting the power and generalizability of those comparisons.
- Scope: the model was developed for localized/locally advanced RCC after surgery and may not generalize to metastatic disease or nonsurgical populations without further validation.
- Some calibration metrics for comparators (e.g., SSIGN for DFS) were not available because those scores were defined for cancer-specific survival (CSS), limiting full head-to-head calibration comparison.