Cross-site transportability of an explainable artificial intelligence model for acute kidney injury prediction

X. Song, A. S. L. Yu, et al.

This study by Xing Song and colleagues uses an explainable artificial intelligence model to predict acute kidney injury (AKI) and shows that differing risk factors across health systems hamper clinical adoption. Beyond documenting the performance loss that occurs when the model is moved between sites, it proposes a method for estimating how well a model will transport to, and adapt within, a new hospital.

Introduction

Acute kidney injury (AKI) affects 10–15% of all inpatients and over 50% of critically ill patients, and is associated with markedly increased mortality and long-term adverse outcomes. Despite expanding EHR data and advances in AI, implementing prediction models in practice requires interpretability and transportability across settings. Prior AKI models show variable performance, and deep learning approaches have lacked independent validation on diverse populations. Transportability is challenged by patient heterogeneity, clinical process variability, differences in EHR configuration, and non-interoperable data. This study asks whether an explainable AKI prediction model developed within a common data model (PCORnet CDM) is transportable across health systems, identifies sources of performance heterogeneity, and proposes a method to predict transportability without full data disclosure from target hospitals.

Literature Review

Previous machine-learning AKI models reported AUROCs of ~0.71–0.80 in derivation, 0.66–0.80 internal validation, and 0.65–0.71 external validation. A deep learning model (Google DeepMind) predicted a substantial fraction of AKI but was trained on a predominantly male, Caucasian VA population without broad external validation. Gradient boosting models (e.g., Koyner et al.) achieved AUROC >0.85 for AKI stage 2 within 48 h in selected settings. Literature emphasizes the need for external validation and attention to calibration drift and generalizability. Common data models (e.g., PCORnet CDM, OHDSI) enhance data harmonization but vocabulary granularity and local coding differences persist, affecting cross-site model portability.

Methodology

Data source: Multi-center EHR data from the Greater Plains Collaborative (GPC), a PCORnet Clinical Data Research Network of 12 health systems across 9 US states. Six sites participated: University of Kansas Medical Center (source) and five external sites (University of Texas Southwestern, University of Nebraska Medical Center, University of Missouri, Medical College of Wisconsin, Marshfield Clinic Research Institute). De-identified CDM-conformant data were used.
Cohort: Adults 18–90 years, hospitalized ≥2 days, with ≥2 serum creatinine (SCr) measurements, 2010–2018. Exclusions: eGFR <15 mL/min/1.73 m² before admission, prior renal replacement therapy (RRT) or transplant, RRT within 48 h of first SCr, burn patients.
Feature extraction and processing: All variables present in PCORnet CDM v4 were used, including demographics, diagnoses (ICD-9/10), procedures (ICD/CPT), labs (LOINC), medications (RxNorm/NDC), and selected vital signs (e.g., BP, height, weight, BMI). Data were time-stamped and aggregated to daily observation vectors for discrete-time prediction windows. Automated curation included outlier removal (1st/99th percentiles), one-hot encoding for categorical codes, cumulative exposure-days for medications, last observation carried forward (sample-and-hold) using the most recent measurement within each interval, and engineered features (e.g., SCr change, BP trends). The source dataset had 153,821 eligible encounters (1,064,619 encounter-days) and 38,920 unique variables (142,167,783 observations), with 1,933 variables shared across all six sites.
AKI outcomes: KDIGO SCr-based definitions: Any AKI (increase by ≥0.3 mg/dL within 48 h or ≥1.5× baseline within 7 days); Stage 2 (2.0–2.9× baseline); Stage 3 (≥3× baseline, or ≥4 mg/dL after a 0.3 mg/dL acute rise, or RRT initiation). Urine output was not used. Baseline SCr was defined as the most recent prior measurement, or the admission SCr if none was available.
Outcomes were computed at SCr measurement times and carried forward, with right-censoring at 7 days of hospital stay.
Modeling: Gradient Boosting Trees in a discrete-time survival framework (DS-GBT). Prediction tasks: 24 h and 48 h risk for any AKI, AKI ≥2, and AKI 3. Hyperparameters (tree depth 2–10, learning rate 0.01–0.1, min child weight 1–10, early stopping) were tuned via cross-validation.
Missing-data handling: Categorical missing values were coded as 0; numerical missing values were handled by tree-level learned splits that include missingness; carry-forward was used when prior values existed.
Splits: Source site encounters post-2017 were used as temporal validation; 2010–2016 encounters were split into derivation (70%), calibration (15%), and internal validation (15%). External validation applied the source-trained model to each target site (transported model). Models were also refit on each target site’s local data (refitted model).
Calibration: Assessed via Hosmer–Lemeshow (HL) statistics, with F-tests for differences.
Explainability: Feature importance by gain; SHAP values (bootstrapped) for marginal effects; interactive dashboards were created.
Transportability metric: Adjusted Maximum Mean Discrepancy (adjMMD) computed over joint feature distributions between source and target, with adjustments to penalize missingness and enable computation when variables are absent (extreme-value imputation strategy; kernel density approaches). The procedure identifies a minimal feature set (top-k by source importance) that maximizes the correlation of adjMMD with the AUROC drop (ΔAUC). Linear regression and mixed-effects models relate ΔAUC to adjMMD.
Evaluation metrics: AUROC and AUPRC overall and in subgroups (age, baseline SCr, days since admission); calibration via HL; external validation across five sites; comparisons with LASSO and limited-feature GBT.
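
The SCr-based staging rules listed under AKI outcomes can be sketched as a small function. This is an illustrative simplification, not the study's code: the name `kdigo_stage` and its parameters are hypothetical, the caller is assumed to have already determined the baseline SCr and the absolute SCr rise over the previous 48 h, and the 7-day window for the ratio-based criteria is not tracked here.

```python
def kdigo_stage(baseline_scr, current_scr, rise_48h, rrt_started=False):
    """Return an SCr-based AKI stage (0 = no AKI, 1, 2, or 3).

    baseline_scr : baseline serum creatinine in mg/dL (assumed > 0)
    current_scr  : current serum creatinine in mg/dL
    rise_48h     : absolute SCr increase over the previous 48 h in mg/dL
    rrt_started  : True if renal replacement therapy was initiated

    Hypothetical helper: thresholds follow the definitions quoted in the
    summary above, and urine output is ignored, as in the study.
    """
    if baseline_scr <= 0:
        raise ValueError("baseline_scr must be positive")

    ratio = current_scr / baseline_scr
    acute_rise = rise_48h >= 0.3

    # Stage 3: >=3x baseline, or >=4 mg/dL after an acute rise, or RRT.
    if rrt_started or ratio >= 3.0 or (current_scr >= 4.0 and acute_rise):
        return 3
    # Stage 2: at least 2x baseline but below the stage 3 threshold.
    if ratio >= 2.0:
        return 2
    # Stage 1 (any AKI): >=1.5x baseline or an acute rise of >=0.3 mg/dL.
    if ratio >= 1.5 or acute_rise:
        return 1
    return 0


# Example: a rise from 1.0 to 2.5 mg/dL is 2.5x baseline, i.e. stage 2.
example_stage = kdigo_stage(baseline_scr=1.0, current_scr=2.5, rise_48h=1.5)
```

In a discrete-time setup such as the one described, a function like this would be evaluated at each SCr measurement and the resulting stage carried forward to the following days.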

Key Findings

Source site performance (48 h lead time):

Temporal hold-out AUROC: Any AKI 0.76 (95% CI 0.75–0.78); AKI ≥2 0.81 (0.76–0.86); AKI 3 0.87 (0.78–0.93).
AUPRC: AKI ≥2 0.14 (0.08–0.23); AKI 3 0.23 (0.08–0.42).
Excluding SCr and BUN: AUROC Any AKI 0.75 (0.74–0.76); AKI ≥2 0.82 (0.78–0.86); AKI 3 0.85 (0.78–0.89).
Performance improved with more hospital days of data; calibration improved after simple recalibration (HL from 42.8, p=0.01, to <30, p>0.02 for Any AKI).
24 h prediction performed better: AUROC Any AKI 0.84 (0.82–0.84); AKI ≥2 0.90 (0.88–0.93); AKI 3 0.93 (0.89–0.96).
DS-GBT outperformed LASSO (48 h AKI ≥2: AUROC 0.78 [0.73–0.84], AUPRC 0.06 [0.04–0.11] for LASSO).

Explainability (top predictors and effects):

Top features included SCr level and change, vancomycin exposure, BP minima and slopes (systolic/diastolic), age, BMI, height, chest X-ray. Without SCr/BUN, piperacillin–tazobactam, bilirubin, and anion gap were among top features.
Notable marginal effects (approximate odds ratios, obtained by exponentiating log-odds changes): SCr >0.5 mg/dL increased odds for AKI ≥2 by exp(4) ≈ 55-fold; vancomycin exposure ~1.8-fold; BMI >40 ~1.3-fold; anion gap >13 mEq/L ~1.3-fold. U-shaped associations were seen for BP minima and BP slopes; chest X-ray was associated with ~1.3-fold increased odds; piperacillin–tazobactam ~1.6-fold.

External validation (48 h):

Transported model AUROC for AKI ≥2 ranged 0.68 (0.66–0.71) to 0.80 (0.77–0.82); AUPRC 0.09 (0.07–0.12) to 0.15 (0.12–0.19).
Refitting on local data improved performance. Example (Any AKI, Site5): Refitted AUROC 0.83 (0.81–0.85), AUPRC 0.15 (0.13–0.17) vs transported 0.71 (0.70–0.71), AUPRC 0.08 (0.08–0.09).
Calibration differences varied by site and task; some sites showed minimal calibration change for severe AKI and for models excluding SCr/BUN.
24 h external validation showed higher AUROCs: transported 0.66–0.81; refitted 0.83–0.89 for Any AKI.

Feature heterogeneity across sites:

Many predictive features were site-specific (e.g., vancomycin and piperacillin–tazobactam representations; various glucose labs). Common predictors across all sites included SCr, height, BMI, age, INR, hemoglobin; second-tier common features included blood pressure summaries and several serum chemistries (bilirubin, chloride, potassium, phosphate).
Marginal effects of common features varied in strength and sometimes direction (e.g., hemoglobin association differed between sites), indicating population and documentation differences.

Transportability metric (adjMMD):

adjMMD increased when important variables were missing and decreased as distributions aligned; sensitive to representation gaps (e.g., vancomycin RxNorm differences).
Using only the top 13 source-important variables yielded a Pearson correlation of 0.95 between adjMMD and the AUROC drop (ΔAUC) for 48 h AKI ≥2. Linear relation: ΔAUC = −0.018 + 0.344 × adjMMD; each 0.1 increase in adjMMD implies a ~0.0344 AUROC decrease at the target (e.g., expected AUROC ≈ 0.81 − 0.0344 ≈ 0.78).
Without SCr/BUN, a minimal set of 33 features achieved robust prediction: ΔAUC = −0.028 + 0.367 × adjMMD.
Robustness across derivation sites and models: In 16/18 scenarios, Pearson r >0.9; mixed-effects model across sites/models: ΔAUC = −0.014 + 0.257 × adjMMD (slope 95% CI [0.08, 0.44]); sample- and model-agnostic validations showed no significant RSS differences (p=0.66 and p=0.57).
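
The reported linear relation for 48 h AKI ≥2 (AUROC drop = −0.018 + 0.344 × adjMMD) can be turned into a quick back-of-the-envelope calculation. The sketch below only applies the coefficients quoted above; the function names are hypothetical, and the adjMMD value of 0.1 is an arbitrary input chosen for illustration, not a result from the study.

```python
def expected_auroc_drop(adj_mmd, intercept=-0.018, slope=0.344):
    """Predicted AUROC drop at a target site for a given adjMMD.

    Defaults are the coefficients quoted for 48 h AKI >=2; a positive
    return value means the target AUROC is expected to be lower.
    """
    return intercept + slope * adj_mmd


def expected_target_auroc(source_auroc, adj_mmd, intercept=-0.018, slope=0.344):
    """Source AUROC minus the predicted drop."""
    return source_auroc - expected_auroc_drop(adj_mmd, intercept, slope)


# Illustrative input only: source AUROC 0.81 and a hypothetical adjMMD of 0.1.
drop = expected_auroc_drop(0.1)            # -0.018 + 0.0344 = 0.0164
target = expected_target_auroc(0.81, 0.1)  # 0.81 - 0.0164 = 0.7936

# Each additional 0.1 of adjMMD adds slope * 0.1 = 0.0344 to the drop.
per_step = expected_auroc_drop(0.3) - expected_auroc_drop(0.2)
```

Note that the intercept is negative, so the full predicted drop at adjMMD = 0.1 (about 0.016) is smaller than the per-step slope effect of 0.0344.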

Discussion

The study demonstrates that an explainable DS-GBT model built within the PCORnet CDM can predict AKI with good accuracy and interpretability and that transportability varies markedly across sites. Refitting with local data typically improves performance, highlighting the influence of population case-mix, clinical practice, and data representation differences (e.g., coding granularity in RxNorm/LOINC). SHAP-based analysis reveals both shared and site-specific risk factors and nonlinear, sometimes non-monotonic, associations (e.g., BP and age effects), providing clinical insight and aiding model auditing. Importantly, the proposed adjMMD provides a practical, data-light way to anticipate performance drops when transporting models, using a small set of highly informative features from the source model. This enables target institutions to estimate whether a transported model will be acceptable as-is or requires recalibration/refitting, potentially accelerating external validation and adoption. The CDM infrastructure facilitates multi-site evaluation but does not fully resolve vocabulary granularity and documentation inconsistencies, which remain significant contributors to heterogeneity. Overall, the findings address the core research questions on interpretability and transportability, and present a generalizable framework to quantify and manage performance changes when deploying AI models across health systems.
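
To make the distribution-distance idea concrete, the sketch below computes a plain (unadjusted) squared Maximum Mean Discrepancy between two samples with an RBF kernel. It is not the authors' adjMMD: the missingness penalty and imputation adjustments described in the Methodology are omitted, and the function names, the kernel bandwidth `gamma`, and the synthetic Gaussian data are all assumptions made for illustration.

```python
import math
import random


def rbf(u, v, gamma):
    """RBF kernel exp(-gamma * ||u - v||^2) between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


def mean_kernel(xs, ys, gamma):
    """Average kernel value over all pairs (x, y)."""
    total = sum(rbf(x, y, gamma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd2_biased(xs, ys, gamma=0.5):
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    return (mean_kernel(xs, xs, gamma)
            + mean_kernel(ys, ys, gamma)
            - 2.0 * mean_kernel(xs, ys, gamma))


def sample(rng, n, dim, shift):
    """Synthetic stand-in for site data: n Gaussian vectors centred at shift."""
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
source = sample(rng, 100, 3, 0.0)        # hypothetical source-site features
similar_site = sample(rng, 100, 3, 0.0)  # same distribution as the source
shifted_site = sample(rng, 100, 3, 2.0)  # features shifted away from source

mmd_similar = mmd2_biased(source, similar_site)
mmd_shifted = mmd2_biased(source, shifted_site)
```

The shifted sample yields a much larger discrepancy than the similar one, which is the behavior a transportability metric of this kind relies on: the further the target's feature distribution sits from the source's, the larger the expected performance drop.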

Conclusion

This work develops an explainable, discrete-time gradient boosting model for 24–48 h AKI prediction using PCORnet CDM data and validates it across six health systems. The model achieves strong internal and temporal performance, with meaningful clinical explanations via SHAP. External validation reveals substantial cross-site variability driven by differences in populations and data representations. The introduced adjMMD metric, computed on a minimal set of top features, accurately predicts AUROC deterioration when transporting models, offering a practical tool to guide deployment decisions (e.g., accept, recalibrate, refit). Future work should focus on improved data harmonization and ontology mappings, statistical model updating and transfer learning strategies to enhance transportability, inclusion of additional real-time variables (e.g., heart rate, oxygen saturation), and prospective impact studies coupling prediction with early intervention bundles.

Limitations

AKI definition relied on SCr-based KDIGO criteria; baseline SCr often unavailable (2/3 community-acquired), leading to use of admission SCr and potential underestimation of AKI incidence.
Follow-up right-censored at 7 days; performance not assessed for longer hospital stays.
Procedures captured via CPT billing codes due to CDM constraints; procedure orders available in EMR were not used.
PCORnet CDM lacked several important variables (e.g., heart rate, oxygen saturation, Braden scale), potentially limiting predictive performance.
Data heterogeneity persists despite CDM (vocabulary granularity, local coding practices), contributing to transportability issues.
Generalizability limited by participating sites and CDM-populated elements; privacy constraints precluded public data sharing.
