
Calibrated confidence learning for large-scale real-time crash and severity prediction
M. R. Islam, D. Wang, et al.
Md Rakibul Islam, Dongdong Wang, and Mohamed Abdel-Aty introduce a framework for real-time crash and severity prediction. The study achieves high sensitivity and low false alarm rates using spatial ensemble modeling techniques.
Introduction
The study addresses the challenge of predicting both crash likelihood and crash severity in real time within a single framework. Prior work primarily optimized accuracy for crash likelihood without equal emphasis on computational efficiency, model size, or false alarm rate (FAR), and rarely modeled severity jointly with crash prediction. Real-world traffic data are not independent and identically distributed (non-IID) across space because segment characteristics are heterogeneous, and severity distributions make the non-IID problem worse. The authors aim to develop a scalable, energy-efficient, and deployable system that predicts crash occurrence and, conditional on a predicted crash, the severity level (KABCO) using only readily available real-time traffic and weather data. The research questions are how to handle spatial heterogeneity, reduce FAR and computational cost, improve model generalization with limited local data, and calibrate output confidence so that crash likelihood can be mapped to severity. The work builds on spatial ensemble learning with local model regularization and post-calibration to deliver reliable, real-time crash and four-level severity predictions, and discusses deployment strategies and sustainability impacts.
Literature Review
A comprehensive review across IEEE Xplore, Accident Analysis & Prevention, Transportation Research Part C/B, ASCE Library, Scientific Reports, Transportation Research Record (TRR), Web of Science, Scopus, and ACM showed that most studies focus on real-time crash likelihood rather than integrated crash-and-severity prediction. Typical data sources include crash reports, loop detectors/MVDS, AVI, RITIS, ATSPM, cameras, and weather stations; features are often aggregated over the 5–10 min before crash time. Severity is commonly modeled in two or three levels by combining KABCO categories (e.g., KA/BC/O or KAB/C/O), and fewer studies use four levels. Sampling strategies address class imbalance (random undersampling, SMOTE, case-control). Many studies used variables that are not available in real time (e.g., driver/vehicle attributes). Methods include ordinal/sequential logistic/probit models, mixed/random parameter models, Bayesian models with spatial priors, and machine learning (SVM, RF, XGBoost, ANN, CNN, KNN, trees, gradient boosting), with performance assessed by DIC, WAIC, AUC, accuracy, sensitivity, specificity, and FAR. Few works applied ensemble learning to crash likelihood; none used ensemble methods to manage spatial heterogeneity for severity prediction or jointly predicted crashes and four-level severity in a single pipeline. Identified gaps: no integrated crash-then-severity real-time framework, limited attention to reducing FAR and computational burden, and inadequate treatment of non-IID spatial heterogeneity for severity.
Methodology
Study area: I-75 southbound mainline (District 5, Florida, USA), 67.6 miles with 97 segments.
Data sources and features: Crash data come from Signal Four Analytics (S4A) and the State Safety Office GIS (SSOGIS) for 2019-01-01 to 2022-06-30; MVDS traffic data (speed, volume, occupancy) are recorded every 30 s; weather data (precipitation, visibility, cloud cover) come from Visual Crossing. Traffic metrics are aggregated to 5-min windows. There are 30 features in total: 27 traffic features (mean, SD, and CoV of speed, volume, and occupancy for each of the target, immediate upstream, and immediate downstream segments) and 3 weather variables. Post-crash turbulence is excluded by removing the 1 h after each crash. Outlier removal rules are applied; rows with missing features are dropped from the training data, while at evaluation missing features are linearly interpolated.
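The 27 traffic features follow from 3 statistics × 3 variables × 3 segments. A minimal sketch of that aggregation is given here, assuming ten 30-s readings per 5-min window and a population standard deviation; the function names, the dictionary layout, and the sample readings are hypothetical and not taken from the study.

```python
from statistics import mean, pstdev

def aggregate_window(readings):
    """Summarize one 5-min window of 30-s readings for a single variable."""
    mu = mean(readings)
    sd = pstdev(readings)  # population SD (assumption; sample SD is equally plausible)
    cov = sd / mu if mu else 0.0  # coefficient of variation, 0.0 when the mean is zero
    return {"mean": mu, "sd": sd, "cov": cov}

def traffic_features(window_by_segment):
    """Build traffic features from {segment: {variable: [readings]}}."""
    features = {}
    for segment in ("target", "upstream", "downstream"):
        for variable in ("speed", "volume", "occupancy"):
            stats = aggregate_window(window_by_segment[segment][variable])
            for stat_name, value in stats.items():
                features[f"{segment}_{variable}_{stat_name}"] = value
    return features

# Hypothetical readings: ten 30-s samples per variable, reused for all segments.
sample = {
    "speed": [65, 66, 64, 63, 67, 65, 66, 64, 65, 65],
    "volume": [12, 14, 11, 13, 12, 15, 13, 12, 14, 14],
    "occupancy": [5, 6, 5, 5, 6, 7, 6, 5, 6, 6],
}
window = {segment: sample for segment in ("target", "upstream", "downstream")}
features = traffic_features(window)
print(len(features))  # number of traffic features in one row
```

Adding the three weather variables to this dictionary would give the 30-feature row described above.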
Targets and labeling: The crash prediction index (CPI) is a binary crash indicator (1 for crash, 0 otherwise). For a crash at 3:30 pm, the six 5-min aggregated windows from 10–15 min to 5–10 min before the crash are labeled as crash (CPI=1); all other windows are labeled non-crash. The crash severity index (CSI) has four levels (O=0, BC=1, A=2, K=3) and gives the severity of predicted crashes.
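One reading of the labeling rule that yields exactly six windows is a 5-min window sliding in 1-min steps between 15 and 5 min before the crash. The summary does not state the step size, so the 1-min step below is an assumption, as are the function name, its parameters, and the example date.

```python
from datetime import datetime, timedelta

# CSI encoding as given in the text.
SEVERITY_TO_CSI = {"O": 0, "BC": 1, "A": 2, "K": 3}

def crash_windows(crash_time, window_min=5, step_min=1,
                  nearest_end=5, farthest_start=15):
    """Return (start, end) pairs of the windows labeled CPI=1 for one crash.

    Assumes windows slide in `step_min` steps (hypothetical) from
    `farthest_start` minutes before the crash up to a window that ends
    `nearest_end` minutes before it.
    """
    windows = []
    start = crash_time - timedelta(minutes=farthest_start)
    last_end = crash_time - timedelta(minutes=nearest_end)
    while start + timedelta(minutes=window_min) <= last_end:
        windows.append((start, start + timedelta(minutes=window_min)))
        start += timedelta(minutes=step_min)
    return windows

# Hypothetical crash at 3:30 pm.
crash = datetime(2021, 3, 1, 15, 30)
labeled = crash_windows(crash)
for start, end in labeled:
    print(start.strftime("%H:%M"), "to", end.strftime("%H:%M"))
```

Under this assumption the first labeled window runs from 3:15 to 3:20 pm (10–15 min before) and the last from 3:20 to 3:25 pm (5–10 min before).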
Train-test split: Time-series split by period. Observation (training): 2019-01-01 to 2021-06-30 with 2,080 crash events (CPI=1) and 66,997,275 non-crash events; Forecasting (test): 2021-07-01 to 2022-06-30 with 1,035 crash events and 27,099,425 non-crash events. Same training data used for all-crash, rear-end, and sideswipe/angle models; testing restricted to corresponding subsets when evaluating crash-type-specific models.
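The split is by calendar period rather than by random sampling, and the reported counts imply an extreme class imbalance. A small sketch of both points follows, using only the dates and counts quoted above; the helper names are hypothetical.

```python
from datetime import date

START = date(2019, 1, 1)       # first day of the observation period
TRAIN_END = date(2021, 6, 30)  # last day of the observation period
TEST_END = date(2022, 6, 30)   # last day of the forecasting period

def assign_split(day):
    """Assign a calendar day to 'train', 'test', or None (outside the study)."""
    if START <= day <= TRAIN_END:
        return "train"
    if TRAIN_END < day <= TEST_END:
        return "test"
    return None

def non_crash_per_crash(n_crash, n_non_crash):
    """Number of non-crash samples for every crash sample."""
    return n_non_crash / n_crash

train_ratio = non_crash_per_crash(2_080, 66_997_275)
test_ratio = non_crash_per_crash(1_035, 27_099_425)
print(assign_split(date(2021, 7, 1)), round(train_ratio), round(test_ratio))
```

With these counts there are roughly 32,000 non-crash samples per crash sample in training and roughly 26,000 in testing, which is why sampling strategies and FAR control matter so much in this setting.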
Modeling framework: Three-layered approach.
1) Spatial ensemble learning to address non-IID heterogeneity:
- Per-segment zonal expert models trained on near-IID local data (crash and non-crash clustered per segment), implemented as lightweight multilayer perceptrons (MLPs).
- Ensemble aggregation via importance-weighted sum of expert outputs; weights derived from a 2D Gaussian kernel over spatial distance between prediction datapoint and model zone (higher weight for closer segments; uniform 1/N if no prior expertise info).
- Teacher network: 10-layer fully connected MLP with tanh activations; trained with mean squared error loss to output probability P(y|x,D) for CPI and CSI.
2) Confidence calibration through local regularization during training:
- To reduce overfitting and over-confidence in local models trained on limited data, three regularization strategies were examined: weight decay (L2 with λ≈1e−3), label smoothing (α sampled U[0.1, 0.2]), and knowledge distillation (KD).
- KD: Distill from the ensemble teacher to a reduced student MLP (3 layers) using combined distillation and MSE losses, balancing with hyperparameter α. Benefits: model compression, improved generalization, reduced FAR.
3) Global severity post-calibration:
- Use the calibrated crash likelihood outputs to infer severity levels via temperature scaling of the output confidence, mapping higher predicted crash probabilities to higher severities (K, A, BC, O), with temperatures selected on a validation set (e.g., T≈1.3, 1.1, 1.0, 0.7). This post-calibration relies on the empirical observation that severe crashes exhibit more anomalous traffic features (e.g., high speed/variance) and therefore a higher predicted crash likelihood.
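The three layers can be illustrated with a compact sketch. It is written under several assumptions that the summary does not confirm: the kernel is isotropic with a hypothetical bandwidth `sigma`, the distillation weight `alpha` multiplies the teacher-matching term, both loss terms are MSE, and temperature scaling is applied to the logit of a single crash probability. How the scaled confidences are turned into K/A/BC/O labels is not specified here, so the sketch stops at the scaled probability.

```python
import math

def gaussian_weights(point, zone_centers, sigma=1.0):
    """Normalized importance weights from an isotropic 2D Gaussian kernel."""
    raw = [
        math.exp(-((point[0] - cx) ** 2 + (point[1] - cy) ** 2) / (2 * sigma ** 2))
        for cx, cy in zone_centers
    ]
    total = sum(raw)
    if total == 0.0:  # fall back to uniform 1/N weights
        return [1.0 / len(raw)] * len(raw)
    return [r / total for r in raw]

def ensemble_probability(expert_probs, weights):
    """Importance-weighted sum of the zonal experts' outputs."""
    return sum(w * p for w, p in zip(weights, expert_probs))

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def distillation_loss(student, teacher, labels, alpha=0.5):
    """alpha * MSE(student, teacher) + (1 - alpha) * MSE(student, labels)."""
    return alpha * mse(student, teacher) + (1 - alpha) * mse(student, labels)

def temperature_scale(prob, temperature):
    """Divide the logit of `prob` by `temperature` and map back to (0, 1)."""
    eps = 1e-12
    p = min(max(prob, eps), 1 - eps)
    logit = math.log(p / (1 - p))
    return 1.0 / (1.0 + math.exp(-logit / temperature))

# Hypothetical zone centers and expert outputs for one prediction point.
weights = gaussian_weights((0.2, 0.0), [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)])
teacher_prob = ensemble_probability([0.9, 0.6, 0.1], weights)
loss = distillation_loss([0.7], [teacher_prob], [1.0], alpha=0.5)
softened = temperature_scale(teacher_prob, 1.3)   # T > 1 pulls toward 0.5
sharpened = temperature_scale(teacher_prob, 0.7)  # T < 1 pushes away from 0.5
print(round(teacher_prob, 3), round(softened, 3), round(sharpened, 3))
```

The nearest zone dominates the ensemble output, and a temperature above 1 makes a confident prediction less extreme while a temperature below 1 makes it more extreme.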
Benchmarks for comparison: Binary classifiers (per severity level vs non-crash), SMOTE-balanced classifiers, random undersampling, and an ensemble of 10 binary classifiers. All benchmark methods also integrated the same post-calibration step for fair comparison.
Metrics: Accuracy, Sensitivity (TPR), and False Alarm Rate (FAR), reported for each severity level and by crash type (rear-end, sideswipe/angle). Computational efficiency assessed by CPU training time and parallelization scalability.
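For reference, the three metrics can be computed from confusion-matrix counts as sketched here. The summary does not define FAR explicitly; the sketch assumes the common definition FP / (FP + TN), and the counts are made up for illustration.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity (TPR), and false alarm rate from raw counts.

    FAR is taken as FP / (FP + TN), an assumed definition.
    """
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "far": fp / (fp + tn),
    }

# Hypothetical counts: 100 crash samples and 1,000 non-crash samples.
metrics = classification_metrics(tp=90, fp=200, tn=800, fn=10)
print(metrics)
```

When non-crash samples vastly outnumber crash samples, accuracy is dominated by the true-negative rate, so it tracks 1 − FAR closely while sensitivity has to be read separately.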
Key Findings
Performance of calibrated confidence learning (CCL) with knowledge distillation (KD) substantially exceeded benchmarks across four severity levels (K, A, BC, O), achieving high sensitivity with low FAR:
- Overall CCL(KD): Accuracy per level (K/A/BC/O): 0.826/0.781/0.738/0.713; Sensitivity: 0.917/0.833/0.856/0.877; FAR: 0.174/0.219/0.263/0.287 (Table 2). These results surpass binary classifiers, SMOTE, undersampling, and an ensemble of 10 binary classifiers.
- Regularization analysis (Table 3): Knowledge distillation yielded the best performance versus spatial ensemble alone, weight decay, and label smoothing. KD improved K-level sensitivity by ~10% and reduced FAR by ~25% relative to non-KD variants.
- Crash-type results (Table 4): For rear-end crashes, CCL achieved strong performance (Accuracy K/A/BC/O: 0.799/0.789/0.744/0.723; Sensitivity: 0.667/0.812/0.865/0.889; FAR: 0.202/0.211/0.256/0.277). For sideswipe/angle (K not available), sensitivity was 0.723/0.818/0.846 for A/BC/O, with higher FAR than rear-end, reflecting feature similarity between non-crash and lower severity states.
- Spatial patterns: Urban segments generally showed higher sensitivity and lower FAR than rural segments, likely due to denser detector spacing and more reliable traffic measurements. Segments near curves, ramps, merges/diverges showed poorer performance, suggesting benefits from denser sensing and improved sensor accuracy.
- Feature ablation (Table 5): Removing volume features caused the largest degradation in accuracy and FAR across severity levels, indicating traffic volume is a key predictor; removing speed and occupancy also degraded performance but less severely. FAR varied more than sensitivity/accuracy under feature ablations.
- Computational efficiency: Run sequentially, the proposed approach trained the ensemble teacher (10-layer MLP) in ~3 CPU hours and distilled the student in a further ~1.5 hours; with 100-core parallelization, total training plus distillation dropped to ~3 minutes. Direct monolithic modeling on all data took >24 CPU hours. Local training converged in <30 epochs versus >100 epochs for global models, which substantially reduces training time and energy use. The distilled student (3-layer MLP) provided faster inference, with potentially greater gains at larger scale.
- Headline sensitivities across all crash types: fatal (K) 91.7%, severe (A) 83.3%, minor (BC) 85.6%, PDO (O) 87.7% with low FAR (K 17.4%, A 21.9%, BC 26.3%, O 28.7%).
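The training-time figures in the computational efficiency item are consistent with near-linear scaling across cores. A back-of-the-envelope check is sketched here, assuming the ~1.5 hours of distillation is also CPU time and that per-segment work parallelizes perfectly; neither assumption is stated in the summary, and the function name is hypothetical.

```python
def ideal_parallel_minutes(cpu_hours, cores):
    """Wall-clock minutes under perfect linear scaling (an idealization)."""
    return cpu_hours * 60 / cores

sequential_hours = 3.0 + 1.5  # teacher training plus student distillation
minutes = ideal_parallel_minutes(sequential_hours, 100)
print(minutes)
```

The idealized estimate is a little under three minutes, in line with the reported ~3 minutes once scheduling and communication overhead are allowed for.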
Discussion
The integrated framework addresses spatial heterogeneity, overfitting in local models, and the mapping of crash likelihood to severity. Spatial ensemble learning captures non-IID segment-level differences while maintaining small, efficient local models. Knowledge distillation regularizes local experts and compresses models, enhancing generalization and reducing FAR. Post-calibration exploits the calibrated confidence from crash likelihood to infer four-level severity without training fine-grained severity classifiers end-to-end, improving efficiency and performance. The findings inform traffic safety management, supporting real-time interventions such as variable speed limits, queue warnings, ramp metering, dynamic merge control, and part-time shoulder use, targeted by predicted severity and location. Spatial analysis suggests prioritizing denser and higher-quality sensing in segments with high traffic fluctuations (e.g., ramps/curves) and in rural areas to reduce FAR and improve sensitivity. Feature analysis highlights traffic volume as an important early warning indicator, consistent with literature (low volume/high speed associated with higher severity). The framework’s substantial training-time and energy savings support sustainable deployment. Overall, the approach provides a reliable, deployable system for predicting crash occurrence and severity with low false alarms, enabling proactive, severity-aware traffic management.
Conclusion
This study presents the first unified, deployable framework that predicts real-time crash likelihood followed by four-level crash severity (K, A, BC, O) using only readily available traffic and weather data. The approach combines spatial ensemble learning to manage non-IID heterogeneity, local model regularization via knowledge distillation to improve generalization and reduce FAR, and global post-calibration (temperature scaling) to map crash likelihood confidence to severity levels. On a large-scale I-75 dataset (67.6 miles, 97 segments; about 67 million non-crash training samples), the method achieved high sensitivities with low FAR across all severity levels and outperformed strong baselines. It was robust across crash types and provided actionable insights into spatial performance patterns and critical features (especially volume). The methodology greatly reduced training time and computational cost, supporting energy-efficient, sustainable modeling and real-time deployment by traffic management centers to trigger targeted ATM strategies. Future research should optimize model updating and retraining as new data and features become available, improve robustness under sensor noise or failures, address communication latency in real-time operations, and explore larger-scale deployments to further realize the inference speed gains from knowledge distillation.
Limitations
Identified challenges include: optimizing retraining/update processes to incorporate new data and features efficiently; ensuring robustness to noisy or missing sensor data (e.g., loop detector failures) to prevent severe performance degradation; addressing communication latency and reliability in real-time deployments; and dealing with limited data for rare classes (e.g., fatal crashes), which can constrain sensitivity despite improvements. Performance degrades in segments with high traffic fluctuations (curves, ramps, merges/diverges) and in rural areas with sparser sensing, suggesting dependency on sensor density and quality.