
Engineering and Technology

Deep learning corrosion detection with confidence

W. Nash, L. Zheng, et al.

Researchers Will Nash, Liang Zheng, and Nick Birbilis developed a deep learning model for pixel-level corrosion segmentation that pairs each prediction with a confidence estimate. Their approach outperforms previously reported methods and offers new insight for inspection decision-making.
Introduction

The study addresses the challenge of reliable, automated corrosion detection, a problem of significant economic and safety importance given corrosion’s annual cost of 3–4% of GDP. Prior deep learning approaches for corrosion segmentation suffer from limited, private datasets and lack uncertainty estimation, leading to false positives/negatives, especially on out-of-distribution inputs. The authors propose a pixel-level deep learning corrosion detector augmented with Bayesian techniques to quantify prediction uncertainty at each pixel, aiming to improve decision-making in industrial inspection contexts by indicating confidence and guiding attention to uncertain regions.

Literature Review

Recent progress in computer vision with architectures such as LeNet, AlexNet, VGG, DenseNet, FCN and U-Net has improved accuracy in many tasks, but these deterministic models lack calibrated uncertainty and can fail on out-of-distribution data. Prior corrosion detection efforts include FCN, U-Net, and Mask R-CNN trained on private datasets, with the best reported F1-Score ~0.71 after edge refinement. The authors previously achieved an F1-Score of 0.55 with an FCN trained for 50,000 epochs, and observed false positives on faces and foliage. Bayesian neural networks (BNNs) extend deep networks by placing distributions over weights to produce probabilistic outputs. Three scalable Bayesian approaches are widely used: variational inference (Gaussian weight distributions), Monte Carlo dropout (Bernoulli dropout applied at both training and test time), and deep ensembles (multiple independently initialized models whose outputs approximate posterior samples). Uncertainty is often categorized as epistemic (model) and aleatoric (data) uncertainty, though definitions vary; epistemic uncertainty should decrease with more diverse data, while aleatoric uncertainty reflects inherent input noise (e.g., lighting, resolution) and cannot be reduced after capture.

Methodology

Base architecture: HRNetV2 for semantic segmentation, initialized with weights pre-trained on MS COCO Stuff.

Bayesian variants: (1) Variational inference: variational convolution layers are inserted at the end of each branch, with separate mu and sigma parameters so that kernel weights are sampled from a normal distribution on each forward pass. (2) Monte Carlo dropout: in-line (Bernoulli) dropout is applied at the end of each branch during both training and inference. (3) Ensemble: HRNetV2 is trained multiple times from different random initializations; at inference, the input is run through each model and the mean and standard deviation of the outputs are aggregated.

Aleatoric output and loss: each variant adds an aleatoric uncertainty head that outputs a log-variance map (treated as s = log sigma^2). The loss adapts binary cross-entropy to a Bayesian binary cross-entropy that jointly trains the segmentation logits and the aleatoric uncertainty, with an added 0.5*s term.

Dataset: 225 JPEG images of corrosion captured at an industrial site with a consumer DSLR; resolutions range from ~50,496 to 36,329,272 pixels. Pixels are expert-labeled as corrosion or background. Ten-fold cross-validation is used because of the small dataset size; data distribution characteristics (image dimensions and RGB histograms) are provided in supplementary figures.

Training protocol: transfer learning from HRNetV2 pre-trained on COCO Stuff, with the newly added layers (aleatoric head, variational parameters) initialized from a normal distribution. Optimizer: RMSProp with learning rate 0.0001. Per fold: 80 epochs with standard binary cross-entropy, then a further 40 epochs (Monte Carlo dropout and ensemble) or 70 epochs (variational) with the Bayesian binary cross-entropy. Validation runs every 10 epochs with checkpointing. Implemented in PyTorch; code and trained models are available at https://github.com/StuvX/SpotRust.

Inference and uncertainty estimation: perform N stochastic forward passes and stack the outputs to [N, C, H, W].
The prediction is the mean of the stacked outputs, thresholded at 0.75 to label pixels as corrosion. Aleatoric uncertainty is the mean of the log-variance maps; epistemic uncertainty is the dispersion (e.g., standard deviation) across the stochastic outputs. To assess the utility of the uncertainty estimates, F1-Score is computed as a function of threshold for both raw and uncertainty-adjusted outputs, where the adjustment is f(x)_adj = exp(y + s). Sparsity analyses remove pixels in order of decreasing uncertainty and compare the normalized MSE of the remaining pixels against an oracle curve ordered by the actual per-pixel loss (binary cross-entropy), assessing how well uncertainty ranks error-prone pixels. Evaluation metrics are Mean Intersection over Union and F1-Score computed from TP, TN, FP, and FN; F1 is favored because of the class imbalance between corrosion and background.
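The loss and the inference-time aggregation described above can be sketched in a few lines. This is a minimal NumPy illustration, not the repository's API: the function names are invented, and the exp(-s) attenuation in the loss follows the standard heteroscedastic form, which is an assumption consistent with the stated 0.5*s term rather than the authors' exact notation.

```python
import numpy as np

def bayesian_bce(y, p, s, eps=1e-7):
    """Bayesian binary cross-entropy: standard BCE attenuated by the
    predicted log-variance s = log(sigma^2), plus the 0.5*s regularizer
    that discourages predicting infinite uncertainty everywhere.
    (The exp(-s) weighting is an assumed, standard heteroscedastic form.)"""
    p = np.clip(p, eps, 1 - eps)
    bce = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return np.mean(0.5 * np.exp(-s) * bce + 0.5 * s)

def aggregate_passes(probs, log_var, threshold=0.75):
    """Aggregate N stochastic forward passes (arrays of shape [N, H, W]).
    Prediction = mean over passes; epistemic = dispersion (std) across
    passes; aleatoric = mean of the log-variance maps; pixels whose mean
    exceeds 0.75 are labeled corrosion."""
    mean_pred = probs.mean(axis=0)
    epistemic = probs.std(axis=0)
    aleatoric = log_var.mean(axis=0)
    return mean_pred, epistemic, aleatoric, mean_pred > threshold

def f1_score(pred, truth):
    """F1 from TP/FP/FN, the paper's preferred metric under class imbalance."""
    tp = np.logical_and(pred, truth).sum()
    fp = np.logical_and(pred, ~truth).sum()
    fn = np.logical_and(~pred, truth).sum()
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0
```

Note that with s = 0 the Bayesian loss reduces to half the ordinary BCE, so switching from the plain to the Bayesian objective mid-training (as in the schedule above) keeps the two phases numerically comparable.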

Key Findings
  • Overall accuracy: All three Bayesian variants achieved high F1-Scores on the 10-fold cross-validation test sets, surpassing previously reported best-in-class F1 ~0.71 and approaching or exceeding an estimated human benchmark F1 ~0.81 (from MS COCO human labeling analysis).
  • Test set F1-Scores (min, max, average; raw / uncertainty-adjusted):
      • Variational: min 0.82 / 0.78; max 0.92 / 0.87; avg 0.88 / 0.84.
      • Monte Carlo dropout: min 0.81 / 0.75; max 0.93 / 0.86; avg 0.88 / 0.80.
      • Ensemble: min 0.86 / 0.73; max 0.93 / 0.93; avg 0.89 / 0.86.
    The prior FCN baseline on a subset achieved F1 ~0.55.
  • Example-level OoD performance (Table 2, threshold 0.75) varied widely depending on image quality and corrosion characteristics. For relatively clear, well-lit corrosion (e.g., Fig. 5a–e), F1 often exceeded 0.8–0.9 (e.g., ensemble 0.99 for Fig. 5e). For difficult cases with fine, dispersed corrosion, shadows or overexposure (Fig. 5f–j), F1 could drop substantially (e.g., variational 0.02–0.47; ensemble 0.13–0.43). On a novel bridge column (Fig. 12), F1 ranged 0.67–0.83 across models.
  • Uncertainty behavior:
      • Epistemic uncertainty was distinctly lower in true positive regions. The ensemble yielded the clearest, most detailed epistemic maps; Monte Carlo dropout displayed orthogonal banding; the variational model tended to concentrate epistemic uncertainty near corrosion edges. Maximum epistemic uncertainty was generally higher on novel (OoD) images than on training-distribution images, suggesting it can act as a pseudo-confidence indicator.
      • Aleatoric uncertainty was higher in shadows, overexposed areas, dark paint, and visually unclear regions, and was also elevated in corroded areas (likely due to darker regions and some hedging behavior). Aleatoric maps were broadly consistent across methods aside from contrast differences.
  • Threshold and uncertainty adjustment effects:
      • The optimal segmentation threshold varies by image, and mis-specified thresholds can degrade F1 substantially. On training-distribution images, epistemic adjustment improved maximum F1 for Monte Carlo dropout and the ensemble but reduced it for the variational model; on novel images, epistemic adjustment reduced maximum F1 across all models. Aleatoric adjustment generally increased F1 across wider threshold ranges for the ensemble and Monte Carlo dropout, with limited effect for the variational model.
      • Sparsity curves indicated that epistemic uncertainty tracked the oracle more closely (better ranking of error-prone pixels), although its influence on F1 was smaller than that of aleatoric adjustment in many cases.
  • Error patterns: OoD false positives commonly occurred on foliage, water, and text (signage, timestamps), and under lighting conditions unlike those in training images. False negatives were less frequent than false positives, aligning with a preference to avoid missed corrosion in decision-making.
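The sparsity analysis behind these curves can be sketched as follows: pixels are removed in order of decreasing uncertainty and the normalized MSE of the remainder is tracked; the oracle curve instead removes pixels ranked by their true per-pixel error. A well-ranked uncertainty map hugs the oracle. This NumPy sketch is illustrative; the function name and fraction grid are not from the paper's code.

```python
import numpy as np

def sparsity_curve(errors, ranking, fractions=np.linspace(0.0, 0.9, 10)):
    """Normalized MSE of the pixels that remain after removing the top
    `fraction` of pixels, ranked by `ranking` in descending order.

    errors:  flat array of squared per-pixel errors
    ranking: flat array used to order removal (an uncertainty map, or
             `errors` itself to obtain the oracle curve)
    """
    order = np.argsort(-ranking)                  # most uncertain first
    sorted_err = errors[order]
    base = errors.mean()                          # full-image MSE for normalization
    curve = []
    for f in fractions:
        keep = sorted_err[int(f * errors.size):]  # drop the top fraction
        curve.append(keep.mean() / base)
    return np.array(curve)
```

Calling `sparsity_curve(errors, errors)` yields the oracle; comparing it against `sparsity_curve(errors, epistemic_map)` shows how well epistemic uncertainty identifies error-prone pixels.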

Discussion

The study demonstrates that integrating Bayesian uncertainty into a strong segmentation backbone (HRNetV2) for corrosion detection yields high pixel-level accuracy while providing actionable uncertainty maps. These uncertainty estimates help address the core challenge: deterministic models can be overconfident and unreliable on unfamiliar inputs. Epistemic maps highlight where the model is less certain (often background or domain-shifted regions), useful for flagging questionable detections and guiding additional inspection or caution. Aleatoric maps reflect input quality issues (shadows, overexposure) and pinpoint areas requiring improved image capture or closer examination. Among approaches, deep ensembles produced both the highest segmentation accuracy and the most interpretable epistemic uncertainty, supporting better operational decision-making. Nonetheless, performance remains sensitive to threshold selection and dataset domain; OoD images still trigger false positives, indicating that broader, more diverse training data and multi-task/contextual cues are needed for robust generalization.

Conclusion

This work introduces a pixel-level deep learning corrosion detector augmented with three Bayesian variants (variational inference, Monte Carlo dropout, and deep ensembles) that output both segmentation and uncertainty estimates. On a newly curated, expertly labeled dataset of 225 images, all variants achieve high F1-Scores, exceeding prior reported benchmarks and approaching estimated human-level performance, while providing epistemic and aleatoric maps that inform confidence and data quality. The ensemble variant offers the best combination of accuracy, stability (deterministic inference), and clarity of uncertainty outputs, making it the most useful for practitioners. Future directions include substantially expanding and diversifying labeled datasets (potentially via expert crowdsourcing) to improve generalization; extending labels to related defects (paint blisters, delamination, peeling); tailoring localized datasets for specific deployment contexts; exploring additional sensing modalities (e.g., infrared) to reduce false positives; and leveraging multi-task learning and heterogeneous pretraining within ensemble frameworks to enhance robustness in unfamiliar settings.

Limitations
  • Small, domain-specific dataset (225 images from a single industrial site) limits generalizability; authors estimate thousands more images (~9,000) would be needed to reliably approach human performance across diverse conditions.
  • No publicly available, standardized corrosion dataset prevents fair cross-study comparisons; the dataset here has controlled access due to legal constraints.
  • Performance is sensitive to threshold selection, which is unknown at deployment and varies across images; uncertainty adjustments can help but are not uniformly beneficial (e.g., epistemic adjustments can reduce maximum F1, especially on OoD data).
  • Models exhibit false positives on OoD content such as foliage, water, and text, and under lighting conditions unlike training data.
  • Variational and Monte Carlo dropout variants produce stochastic outputs per run, which may complicate reproducibility compared to ensembles.
  • Aleatoric and epistemic definitions and disentanglement are approximate and model-dependent; aleatoric estimates may be influenced by learned hedging behavior.