Dynamic prediction of mortality after traumatic brain injury using a machine learning algorithm
R. Raj, J. M. Wennervirta, et al.
This study presents a retrained and validated machine learning model that predicts mortality risk in traumatic brain injury patients, showing promising results as a clinical decision support tool. The research was conducted by Rahul Raj, Jenni M. Wennervirta, and their colleagues.
Introduction
Traumatic brain injury (TBI) is a leading cause of death and disability worldwide, with an estimated global incidence of 369–790 per 100,000 and an increasing trend. Approximately 10% of TBI patients require ICU admission, with hospital and six-month mortality of about 15% and 20%, respectively. Increased intracranial pressure (ICP) is the main cause of TBI-related death, and cerebral perfusion pressure (CPP), defined as mean arterial pressure (MAP) minus ICP (CPP = MAP − ICP), reflects the degree of brain injury. Optimizing ICP and CPP is central to severe TBI management. Fixed thresholds are commonly recommended (ICP <20–22 mmHg; CPP 60–70 mmHg), though the appropriate values may vary with cerebrovascular reactivity, age, and sex.
Despite the prognostic value of ICP and CPP for mortality, their dynamic prognostic information beyond static thresholds has not been translated into individualized, clinically useful tools. Time spent above ICP thresholds correlates with mortality, suggesting that quantifying the dynamic effects of ICP and CPP could support standardized, data-driven decisions and reduce biases, especially during prolonged ICU stays. Current TBI prediction models are usually static, based on admission parameters, and explain only about 35% of the variance in outcome after severe TBI. Given the dynamic nature of TBI and lesion progression, longitudinal monitoring is required to optimize treatment.
A key translational challenge in machine learning is the lack of external validation. This study evaluates the external validity of a previously developed ICP–MAP–CPP algorithm in two international cohorts after retraining with additional data. Performance was assessed via the area under the receiver operating characteristic curve (AUC), the area under the precision-recall curve (AUPRC), accuracy, and rates of false positives (predicting death when the patient survives) and false negatives (predicting survival when the patient dies). The algorithm was calibrated to minimize false positives to avoid the clinical risk of inappropriate withdrawal of life-sustaining treatment.
Methodology
Design: Multicenter observational retrospective study including adult (≥16 years) ICU patients with TBI and at least 24 h of ICP data. Training and internal validation were conducted using Finnish ICU cohorts; external validation was performed in Swedish (Stockholm) and U.S. (eICU) cohorts.
Cohorts:
- Training: 686 patients from four Finnish university hospital ICUs (Helsinki 2010–2019; Kuopio 2004–2013; Turku 2003–2013; Tampere 2007–2017), covering ~85% of Finland’s population and following Brain Trauma Foundation guidelines.
- External validation 1 (Stockholm cohort): 464 patients admitted to the neurosurgical ICU at Karolinska University Hospital (2006–2019) with electronically stored ICU data (1–2 min medians).
- External validation 2 (eICU cohort): 174 patients from the Philips eICU Collaborative Research Database (2014–2015) identified by APACHE IV TBI diagnoses and ≥24 h ICP monitoring; data stored as 5-min medians.
Data and preprocessing: ICP, MAP, and CPP median values at 1–5 min intervals were collected for the first 120 h after ICU admission. Extreme values were excluded (ICP >100 or <0 mmHg; MAP >150 or <20 mmHg). Missing values were not imputed; if an entire time window lacked data, that patient was excluded from that window’s estimate.
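The exclusion and missing-data rules above can be illustrated with a minimal sketch. It is not the study's code: the function names `clean_signals` and `window_mean` are hypothetical, and deriving CPP from MAP − ICP is an assumption made for illustration (the study collected CPP medians directly).

```python
import numpy as np

def clean_signals(icp, map_):
    """Mask out-of-range ICP/MAP medians (mmHg) and derive CPP = MAP - ICP."""
    icp = np.asarray(icp, dtype=float).copy()
    map_ = np.asarray(map_, dtype=float).copy()
    # Exclusion limits quoted in the text.
    icp[(icp > 100) | (icp < 0)] = np.nan
    map_[(map_ > 150) | (map_ < 20)] = np.nan
    # Assumption: CPP is recomputed here; it is NaN wherever either input is missing.
    cpp = map_ - icp
    return icp, map_, cpp

def window_mean(values):
    """Mean of the available values, or None when the whole window is missing."""
    values = np.asarray(values, dtype=float)
    if np.all(np.isnan(values)):
        return None  # no imputation: the patient is skipped for this window
    return float(np.nanmean(values))

# Illustrative values only (not study data).
icp, map_, cpp = clean_signals([12, 150, -3, 18], [80, 90, 85, 10])
print(window_mean(cpp))
```

Returning `None` for an empty window, rather than filling in a value, mirrors the choice described above to exclude the patient from that window's estimate.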
Outcome: The primary outcome was 30-day all-cause mortality (training and Stockholm validation). In eICU, in-hospital mortality was used because 30-day data were unavailable. For false positives in the Stockholm cohort, 1-year functional outcome was also assessed on the Glasgow Outcome Scale (GOS), with GOS 1–3 classed as unfavorable.
Algorithm and features: A dynamic logistic regression model using rolling time windows provided mortality predictions every 8 h after the first 24 h. Feature engineering included:
- From first 24 h window (begin): means.
- From last 8 h window (end): means, minima, maxima, means of differences (diff), variances (var), linear trend coefficients (coef), percentile-based trends (q90, q10).
- ICP-specific: % time >20 mmHg (ht20), % time <10 mmHg (lt10).
- MAP-specific: % time >120 mmHg (ht120).
- Age in deciles (agec).
In total, 54 engineered features plus age were considered. Features were computed in 4 h rolling windows; predictions were output every 8 h after 24 h.
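To make the feature list above more concrete, here is a rough sketch of window-level summaries for a single signal. It is an illustration under stated assumptions, not the published feature code: the function name `window_features`, the `minutes_per_sample` parameter, and the `mean`/`min`/`max` keys are hypothetical, the slope is fitted with ordinary least squares, and the composite features named later (such as `cpp_diff_coef`) are not reproduced because their exact construction is not given here.

```python
import numpy as np

# Prediction schedule described in the text: first output at 24 h, then every 8 h up to 120 h.
PREDICTION_HOURS = list(range(24, 121, 8))

def window_features(values, minutes_per_sample=5.0):
    """Summarise one window of a signal (e.g. ICP medians in mmHg)."""
    x = np.asarray(values, dtype=float)
    t = np.arange(x.size) * minutes_per_sample / 60.0  # elapsed time in hours
    ok = ~np.isnan(x)
    x, t = x[ok], t[ok]  # missing samples are dropped, not imputed
    if x.size < 2:
        return None  # not enough data to describe this window
    slope = np.polyfit(t, x, 1)[0]  # linear trend, mmHg per hour
    return {
        "mean": float(x.mean()),
        "min": float(x.min()),
        "max": float(x.max()),
        "var": float(x.var()),
        "diff": float(np.diff(x).mean()),        # mean of successive differences
        "coef": float(slope),                    # linear trend coefficient
        "ht20": float((x > 20).mean() * 100.0),  # % of samples above 20 mmHg
        "lt10": float((x < 10).mean() * 100.0),  # % of samples below 10 mmHg
    }

# Illustrative ICP values only (not study data).
features = window_features([8, 12, 16, 20, 24])
print(features)
```

The percentage-of-time features are computed here as a share of valid samples, which equals a share of time only when samples are evenly spaced.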
Retraining and selection: The algorithm was retrained on the extended Finnish cohort (n=686). Stratified five-fold cross-validation was used. Recursive feature elimination selected 14 final features. The most important features included: cpp_diff_coef, map_diff_coef, map_diff_begin, icp_diff_coef, cpp_diff_begin, icp_q90_coef, icp_end, agec, cpp_end, cpp_var_coef, icp_diff_end, icp_coef, map_var_coef, icp_var_coef.
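The selection step can be sketched with scikit-learn as follows. This is a hedged illustration on synthetic data, not the authors' pipeline: the sample size, class balance, scaling step, solver settings, and random seeds are all assumptions, and no hyperparameter tuning is shown.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for 54 engineered features plus age (55 columns in total).
X, y = make_classification(
    n_samples=300, n_features=55, n_informative=10,
    weights=[0.8, 0.2], random_state=0,
)

# Recursive feature elimination down to 14 features, then a logistic regression.
model = make_pipeline(
    StandardScaler(),
    RFE(LogisticRegression(max_iter=1000), n_features_to_select=14),
    LogisticRegression(max_iter=1000),
)

# Stratified five-fold cross-validation keeps the outcome ratio similar in each fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_auc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

# Refit on all data to inspect which columns survive elimination.
model.fit(X, y)
selected = model.named_steps["rfe"].support_
print("AUC per fold:", np.round(fold_auc, 2))
print("Selected feature indices:", np.flatnonzero(selected))
```

Placing the elimination step inside the pipeline means it is repeated within each training fold, so the held-out fold does not influence which features are kept. Changing `random_state` can change the selected set.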
Evaluation metrics: Time-dependent AUC (discrimination), AUPRC (precision-recall), accuracy, and false positive (fp) and false negative (fn) rates at each prediction time point. Default classification threshold was 50% predicted risk; sensitivity analyses used 25% and 75% thresholds. The model was calibrated to minimize fp to reduce risk of inappropriate treatment withdrawal.
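A small sketch of these metrics at a single prediction time point is given below. It assumes that the fp and fn rates are expressed as a share of all patients, which is consistent with the reported figures (for example, 75.9% accuracy + 1.1% fp + about 23% fn in eICU sums to roughly 100%). The function name `evaluate`, the `>=` comparison at the threshold, and the use of average precision as the AUPRC estimate are illustrative choices, not details confirmed by the study.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def evaluate(y_true, risk, threshold=0.5):
    """Discrimination and error rates for one prediction time point."""
    y_true = np.asarray(y_true, dtype=int)  # 1 = died, 0 = survived
    risk = np.asarray(risk, dtype=float)    # predicted mortality risk in [0, 1]
    pred = (risk >= threshold).astype(int)
    n = y_true.size
    fp = int(np.sum((pred == 1) & (y_true == 0)))  # predicted death, patient survived
    fn = int(np.sum((pred == 0) & (y_true == 1)))  # predicted survival, patient died
    return {
        "auc": float(roc_auc_score(y_true, risk)),
        "auprc": float(average_precision_score(y_true, risk)),
        "accuracy": float(np.mean(pred == y_true)),
        "fp_rate": fp / n,  # assumption: share of all patients
        "fn_rate": fn / n,  # assumption: share of all patients
    }

# Illustrative labels and risks only (not study data).
y_true = [0, 0, 0, 0, 1, 1, 0, 1]
risk = [0.1, 0.2, 0.3, 0.6, 0.7, 0.4, 0.05, 0.9]
results = {thr: evaluate(y_true, risk, thr) for thr in (0.25, 0.5, 0.75)}
for thr, res in results.items():
    print(thr, res)
```

AUC and AUPRC do not depend on the threshold; only accuracy and the fp and fn rates move when the threshold is shifted between 25%, 50%, and 75%.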
Statistical analysis: Analyses were conducted in Stata 15 and Python 3.7.12 (Google Colab). Libraries: scikit-learn 1.0.1, pandas 1.1.5, numpy 1.19.5, matplotlib 3.2.2, seaborn 0.11.2, joblib 1.1.0, bayesian-optimization 1.2.0. Standard tests were applied for variable distributions, with chi-square tests for categorical comparisons and Wilcoxon rank-sum tests for non-parametric continuous variables.
Implementation properties: The model uses low-frequency (1–5 min) medians to mitigate artifacts and interruptions common during ICU care and is fully automated without manual input. Predictions are intended particularly for prolonged ICU courses (≥3 days) when clinical decisions become more challenging.
Code and data access: Original and retraining/testing code available at the provided repositories. Finnish data via FINDATA; eICU via PhysioNet; Swedish data on request per regulations.
Key Findings
- Data scope: Retraining on 686 Finnish patients (~62,000 h of data); external validation on 638 patients across Stockholm (n=464) and eICU (n=174) cohorts (~60,000 h total).
- Internal (training cohort): AUC increased from 0.67 (24 h) to 0.79 (120 h); AUPRC from 0.33 (24 h) to 0.55 (120 h). False positive rate decreased from 11.7% to 2.2%; false negative rate remained ~11.7–13.1%. Accuracy at 120 h: 85.4%.
- Stockholm cohort (external validation): Mean ICP monitoring 97.3 h. AUC increased from 0.66 (24 h) to 0.79 (120 h). AUPRC increased from 0.31 to 0.57. False positive rate decreased from 20.3% (24 h) to 2.4% (120 h); false negative rate ~7% at 120 h. Accuracy at 120 h: 90.3%. At 50% threshold, 55% of false positives had unfavorable 1-year functional outcome (GOS 1–3).
- eICU cohort (external validation): Mean ICP monitoring 87.7 h. AUC increased from 0.67 (24 h) to 0.73 (120 h). AUPRC increased from 0.48 to 0.64. False positive rate decreased from 10.3% to 1.1%; false negative rate ~23% at 120 h. Accuracy at 120 h: 75.9%.
- Threshold sensitivity: Lowering threshold to 25% increased fp and decreased fn; raising to 75% decreased fp but increased fn in both validation cohorts.
- Feature importance: The top predictors were dynamic difference-based slope features for CPP and MAP (cpp_diff_coef, map_diff_coef), the relation of early mean MAP to subsequent differences (map_diff_begin), and dynamic ICP differences (icp_diff_coef). Mean ICP was higher in non-survivors than survivors across cohorts during the first 120 h.
Discussion
The ICP–MAP–CPP dynamic algorithm demonstrated improving discrimination and precision over time in two diverse external cohorts, achieving AUCs up to 0.79 (Stockholm) and 0.73 (eICU) and maintaining very low false positive rates (<2.5%) by 120 h. This calibration prioritizes avoiding false positives to reduce the risk of inappropriate withdrawal of life-sustaining treatment. Notably, over half of the Stockholm false positives had unfavorable 1-year outcomes, suggesting limited clinical harm from conservative management in such cases. The algorithm’s dynamic risk outputs can alert clinicians to prognosis changes during prolonged ICU care, serving as an objective decision-support tool for treatment continuation or de-escalation and aiding communication with families.
Performance differences between cohorts likely reflect case-mix and healthcare system differences. The eICU cohort, despite being younger, had higher mortality (30% vs. 13% in Stockholm) and lower admission Glasgow Coma Scale (GCS) scores, indicating more severe injuries and distinct etiologies (e.g., differing rates of high-energy trauma, assaults, and gunshot injuries in the USA). Socioeconomic, ethnic, and insurance-related factors prevalent in the U.S. may also influence outcomes. Despite these differences, the algorithm performed robustly in both settings, supporting its potential generalizability. The use of low-frequency ICU data and a fully automated pipeline enhances feasibility and resilience to data interruptions common in ICU environments. Prospective studies are needed to define clinically actionable changes in risk over time and to evaluate whether algorithm-guided care improves outcomes or cost-effectiveness.
Conclusion
A retrained, dynamic logistic regression model based on ICP, MAP, CPP, and age provided time-improving predictions of mortality after TBI, externally validated in Swedish and U.S. cohorts with low false positive rates and acceptable discrimination. The tool shows promise as an automated, bedside decision-support system to standardize and inform treatment during prolonged ICU care. Future research should prospectively evaluate clinical integration, define thresholds for actionable changes in predicted risk, assess impact on long-term functional outcomes, and explore extensions to predict granular functional outcomes beyond 30-day mortality.
Limitations
- Outcome definition in eICU: 30-day mortality was unavailable; in-hospital mortality was used. Although median hospital stay differed markedly between survivors (21 days) and non-survivors (5 days), residual misclassification due to post-discharge deaths before 30 days is possible.
- Cohort heterogeneity: Differences in injury severity, mechanisms, demographics, and healthcare systems (e.g., socioeconomic and insurance factors in the U.S.) may affect performance and generalizability.
- Missing data handling: No imputation was performed; patients with completely missing values in a window were excluded for that time window, which may introduce bias if missingness is informative.
- Dynamic feature selection: Included features may vary depending on cross-validation fold randomization, potentially affecting model consistency.
- External validation scope: Validation was limited to one Swedish center and a multi-center U.S. database; performance in other regions and healthcare settings remains to be established.