Benchmarking and Improving Bird's Eye View Perception Robustness in Autonomous Driving

S. Xie, L. Kong, et al.

This research by Shaoyuan Xie, Lingdong Kong, Wenwei Zhang, Jiawei Ren, Liang Pan, Kai Chen, and Ziwei Liu introduces RoboBEV, a benchmark that evaluates the robustness of BEV-based perception models across diverse corruption conditions and identifies factors that improve the resilience of 3D perception systems.

Introduction

The paper addresses the under-explored question of how robust BEV-centric 3D perception systems are to natural, real-world corruptions and sensor anomalies that occur in autonomous driving. While camera-centric BEV methods have achieved strong results on standard datasets, their resilience under distribution shifts (e.g., adverse weather, lighting changes, sensor artifacts, and temporal failures) is not well understood. The authors propose RoboBEV, a comprehensive benchmark to systematically evaluate BEV robustness under eight corruption types across three severities and under complete sensor failure in multi-modal settings. The purpose is to reveal robustness trends, identify architectural and training factors that improve resilience (e.g., pre-training, depth-free transformations, temporal fusion), and provide actionable insights and tools to guide the development of safer, real-world-ready BEV perception systems.

Literature Review

The authors review five areas:

(1) Camera-based BEV perception: methods with explicit depth branches (e.g., LSS-inspired BEVDet, BEVDepth, BEVerse) and depth-free transformer/query-based approaches (e.g., DETR3D, PETR, BEVFormer, PolarFormer, SRCN3D, Sparse4D, SOLOFusion), extending to map segmentation, multi-view depth, and semantic occupancy. Existing works excel in-distribution but provide limited evidence of robustness to corruptions.

(2) LiDAR-based 3D perception: point-, voxel-, pillar-, and hybrid-based detectors and segmentation methods, with multiple representations (point, range, BEV, voxel) and their relations to BEV. Despite strong progress, real-world robustness remains to be thoroughly validated.

(3) Robustness under adversarial attacks: vision models (classification, detection, segmentation) are vulnerable, and limited work studies adversarial robustness for camera-centric 3D detection. This paper, however, targets average-case, natural corruptions rather than worst-case adversaries.

(4) Robustness under natural corruptions: benchmarks like ImageNet-C and ObjectNet show that synthetic corruption robustness correlates with real-world robustness. Some recent works study 3D robustness, depth robustness, or point cloud corruptions, but a comprehensive BEV-centric benchmark is missing.

(5) Robustness enhancements using CLIP: CLIP pretraining improves OOD robustness; however, end-to-end fine-tuning can erode OOD gains. Recent techniques propose weighted updates or robust fine-tuning. This work explores leveraging CLIP’s robustness for BEV tasks while retaining OOD benefits.

Methodology

Benchmark design: RoboBEV evaluates BEV perception robustness using nuScenes-C, constructed by applying natural and sensor-related corruptions to the nuScenes validation set. It covers eight corruption types at three severity levels (easy, moderate, hard), with intra-severity diversity: Brightness, Dark, Fog, and Snow (environmental); Motion Blur and Color Quantization (sensor-induced); and two temporal corruptions tailored for multi-frame BEV methods, Camera Crash (dropping cameras) and Frame Lost (dropping frames). The dataset comprises 866,736 images (1600×900). For multi-modal fusion models, complete sensor failure is also simulated: camera failure zeroes all camera pixels, while LiDAR failure retains only the points within a frontal [-45°, 45°] FOV, because removing every point would cause the models to collapse entirely.

Corruption parameterization: Each corruption’s severity is controlled via parameters (e.g., HSV brightness adjustments; darkness scale factors; fog thickness/smoothness; snow noise/blur/blending; motion blur radius/sigma; color quantization bit depth; number of dropped cameras; frame drop probability). Severity levels are chosen to challenge models without completely destroying performance.

Metrics: The benchmark follows the nuScenes detection metrics (NDS, mAP, mATE, mASE, mAOE, mAVE, mAAE), averaged across severities per corruption. Two robustness metrics are introduced: (1) mean Corruption Error (mCE), which compares a model with a baseline (DETR3D) by aggregating 1 − NDS over corruptions and severities; lower mCE indicates better relative robustness. (2) mean Resilience Rate (mRR), the ratio of corrupted NDS to clean NDS averaged across corruptions; higher mRR indicates better relative performance retention.

Evaluation scope: 33 BEV models are evaluated across four tasks (3D detection, map segmentation, depth estimation, and semantic occupancy), including 30 camera-only and 3 camera–LiDAR fusion models. Experiments include camera corruptions, fusion robustness with clean LiDAR vs. corrupted cameras, and complete sensor failures for models trained on multi-modal inputs. Implementation leverages open-source configurations/checkpoints (plus re-implementations for fair comparisons) on MMDetection3D; a public model zoo is provided.

Validity studies: (1) Pixel distribution analysis compares synthesized corruptions with real-world datasets (ACDC, Cityscapes, Foggy-Cityscapes, nuScenes) to validate statistical similarity. (2) Corruption-augmented training: models trained with the synthetic corruptions are tested on real-world domain shifts (e.g., day-to-night, dry-to-rain) to verify improved generalization.

Robustness enhancement strategies: (1) Corruption-augmented training for several detectors (BEVFormer, DETR3D, PETR/PETRv2, BEVDet). (2) CLIP backbone adaptation to BEV tasks with three schemes: freezing the backbone and training the detection head; end-to-end fine-tuning; and a two-stage procedure (head alignment on frozen CLIP, then joint fine-tuning), with and without corruption augmentation. (3) Analysis of pre-training effects (e.g., FCOS3D and M-BEV masked pretraining), depth-free BEV transformations, and temporal fusion (e.g., SOLOFusion variants).
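
As a rough illustration of the two robustness metrics described above, the following Python sketch computes mCE and mRR from per-corruption, per-severity NDS scores. It assumes an ImageNet-C-style normalization (the model's summed 1 − NDS divided by the DETR3D baseline's, per corruption, then averaged) and reports both metrics as percentages. The function names and the toy NDS values are hypothetical and are not taken from the paper.

```python
from statistics import mean

def mean_corruption_error(model_nds, baseline_nds):
    """mCE in percent: per-corruption error of the model relative to the baseline.

    Both arguments map corruption name -> list of NDS values, one per severity.
    """
    ratios = []
    for name, scores in model_nds.items():
        model_err = sum(1.0 - s for s in scores)
        base_err = sum(1.0 - s for s in baseline_nds[name])
        ratios.append(model_err / base_err)
    return 100.0 * mean(ratios)

def mean_resilience_rate(model_nds, clean_nds):
    """mRR in percent: corrupted NDS retained relative to clean NDS."""
    rates = [mean(scores) / clean_nds for scores in model_nds.values()]
    return 100.0 * mean(rates)

# Toy numbers for illustration only (not results from the paper).
baseline = {"Dark": [0.30, 0.25, 0.20], "Fog": [0.40, 0.38, 0.36]}
candidate = {"Dark": [0.35, 0.30, 0.25], "Fog": [0.45, 0.42, 0.39]}

mce_baseline = mean_corruption_error(baseline, baseline)    # baseline vs. itself
mce_candidate = mean_corruption_error(candidate, baseline)  # below 100 means more robust
mrr_candidate = mean_resilience_rate(candidate, clean_nds=0.50)
```

Under this normalization the baseline scores 100 against itself, so an mCE below 100 reads as "more robust than the baseline", while mRR depends only on the model's own clean and corrupted scores.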

Key Findings
  • Clean vs. corrupted performance: A strong linear correlation between clean NDS and mCE suggests that models performing well in-distribution also tend to achieve better absolute performance under corruption; however, mRR varies notably, indicating that some models overfit to clean nuScenes and retain performance poorly under corruptions.
  • Depth-free vs. depth-based: Depth-based pipelines degrade significantly under image corruptions, especially Dark and Snow, due to errors in intermediate depth estimation. Depth-free BEV transformations show better robustness trends.
  • Pre-training boosts robustness: Model pre-training (e.g., FCOS3D initialization; masked pretraining M-BEV) generally improves robustness to semantic corruptions (Color Quant, Motion Blur, Dark), though gains on temporal corruptions are limited. Example: pre-trained BEVDet improved mRR by large margins on Quant (+22.5%), Motion (+17.2%), and Dark (+27.8%) despite slightly lower clean NDS.
  • Temporal fusion helps when leveraged effectively: Longer and richer temporal fusion (SOLOFusion-fusion) yields better robustness (e.g., lowest mCE 92.86 and solid mRR 64.53 among many candidates). Not all temporal models gain under Camera Crash or Frame Lost; effectiveness depends on fusion strategy and number of frames.
  • Backbone effects: ResNet backbones show broad robustness; VoVNet-V2 is more robust to Snow; Swin Transformer variants are particularly vulnerable to lighting changes (Bright/Dark). Feature-space Gramian analyses align with these end-to-end findings.
  • Fusion under camera corruptions: With clean LiDAR and corrupted cameras, fusion models often still benefit from LiDAR, but corrupted camera features can hurt, especially in Dark (the paper reports performance reductions when darkened camera inputs are added to clean LiDAR). Multi-modal models are heavily reliant on LiDAR; under LiDAR failure, mAP can drop by ~89–95% (BEVFusion, TransFusion), whereas camera failure causes milder degradation.
  • Corruption augmentation is effective: Training with synthetic corruptions improves robustness on nuScenes-C and real-world shifts. mRR improvements include: BEVFormer 0.6040→0.7427, DETR3D 0.7077→0.8506, PETR 0.6503→0.8555, PETRv2 0.8642→0.9144, BEVDet 0.5854→0.8210. Cross-domain day-to-night performance improved by 45.8% with corruption augmentation.
  • CLIP-based robustness transfer: Naively freezing CLIP or fine-tuning it end-to-end provides limited OOD robustness. A two-stage procedure (head alignment, then joint fine-tuning), especially with corruption augmentation, better transfers CLIP’s robustness to BEV tasks; reported NDS gains under Dark, Fog, and Snow are 23.1%, 11.8%, and 15.8%, respectively, over end-to-end tuning.
  • Pixel distributions vs. performance: Pixel histogram shifts do not directly predict robustness; Motion Blur causes relatively small pixel distribution shift but large performance drops, while Bright and Fog shift histograms more yet cause smaller performance gaps.
  • Task-wide impact: Beyond detection, BEV map segmentation, depth estimation, and semantic occupancy also degrade notably under Dark and Snow, revealing shared vulnerabilities across tasks.
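
To make the corruption-augmentation figures quoted in the list above easier to compare, this small Python sketch derives the absolute and relative mRR gain per detector. The before/after values are copied from the list; the helper name `gains` is hypothetical.

```python
# mRR before -> after corruption-augmented training, as quoted in the list above.
mrr = {
    "BEVFormer": (0.6040, 0.7427),
    "DETR3D": (0.7077, 0.8506),
    "PETR": (0.6503, 0.8555),
    "PETRv2": (0.8642, 0.9144),
    "BEVDet": (0.5854, 0.8210),
}

def gains(before, after):
    """Return (absolute gain, relative gain in percent)."""
    absolute = after - before
    return absolute, 100.0 * absolute / before

summary = {name: gains(*pair) for name, pair in mrr.items()}
largest = max(summary, key=lambda name: summary[name][0])
smallest = min(summary, key=lambda name: summary[name][0])

for name, (absolute, relative) in summary.items():
    print(f"{name:10s} +{absolute:.4f} mRR ({relative:.1f}% relative)")
```

On these figures, BEVDet shows the largest absolute gain, and PETRv2, which starts from the highest mRR, shows the smallest.
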
Discussion

The study systematically demonstrates that BEV perception robustness is influenced by architectural choices, training regimes, and temporal modeling. The findings address the core research question by showing: (1) depth-free BEV transformations and model pre-training consistently enhance robustness to semantic corruptions; (2) robust temporal fusion, especially with longer context and effective fusion designs, improves resilience; (3) multi-modal fusion must be designed to mitigate negative transfer from corrupted camera inputs and reduce over-reliance on LiDAR; and (4) corruption-augmented training and a two-stage CLIP-based adaptation effectively improve out-of-distribution robustness without sacrificing clean performance. These insights are directly relevant to deploying BEV systems in safety-critical autonomous driving, where environmental conditions, sensor artifacts, and partial failures are commonplace. The observation that pixel-level statistics are weak predictors of robustness underscores the need for model-level strategies and systemic evaluations like RoboBEV.
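
The corruption-augmented training mentioned above can be sketched, under heavy simplification, as a random per-sample transform. The Python example below is illustrative only: it covers three of the eight corruption types (Dark, Color Quantization, Camera Crash), operates on tiny single-channel images stored as nested lists, represents a dropped camera as an all-zero image, and uses hypothetical severity tables. None of these parameter values are taken from the benchmark.

```python
import random

def darken(image, scale):
    """Scale every pixel intensity down by `scale` (0 < scale <= 1)."""
    return [[int(p * scale) for p in row] for row in image]

def color_quantize(image, bits):
    """Keep only the top `bits` bits of each 8-bit pixel value."""
    shift = 8 - bits
    return [[(p >> shift) << shift for p in row] for row in image]

def camera_crash(views, num_dropped, rng):
    """Black out `num_dropped` randomly chosen camera views (assumed behavior)."""
    dropped = set(rng.sample(range(len(views)), num_dropped))
    return [
        [[0] * len(row) for row in view] if i in dropped else view
        for i, view in enumerate(views)
    ]

# Hypothetical severity tables (easy, moderate, hard); not the benchmark's values.
DARK_SCALE = [0.6, 0.4, 0.2]
QUANT_BITS = [5, 4, 3]
DROPPED_CAMERAS = [2, 4, 5]

def augment(views, rng, p=0.5):
    """With probability p, apply one random corruption at a random severity."""
    if rng.random() >= p:
        return views
    level = rng.randrange(3)
    kind = rng.choice(["dark", "quant", "crash"])
    if kind == "dark":
        return [darken(v, DARK_SCALE[level]) for v in views]
    if kind == "quant":
        return [color_quantize(v, QUANT_BITS[level]) for v in views]
    return camera_crash(views, DROPPED_CAMERAS[level], rng)

# Six tiny single-channel "camera views" stand in for one multi-camera sample.
rng = random.Random(0)
sample = [[[200, 100], [50, 255]] for _ in range(6)]
augmented = augment(sample, rng, p=1.0)
```

In a real training pipeline such a transform would sit in the data loader, so that each batch mixes clean and corrupted samples.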

Conclusion

The paper introduces RoboBEV, a comprehensive benchmark (nuScenes-C) for evaluating the robustness of BEV perception systems under eight corruption types, temporal failures, and complete sensor failures in multi-modal settings. Evaluating 33 models across detection, map segmentation, depth estimation, and semantic occupancy reveals a strong link between clean and corrupted performance, and highlights the benefits of depth-free BEV transformations, robust pre-training, and longer temporal fusion. The authors further validate the synthetic corruptions’ realism and show that corruption-augmented training and a tailored CLIP-based two-stage adaptation can significantly enhance robustness, particularly under challenging conditions like Dark, Fog, and Snow. Future work should expand coverage of real-world OOD scenarios, develop fusion strategies resilient to missing or corrupted modalities, explore principled temporal fusion to avoid error accumulation, and design training objectives that better align feature representations across corruption types.

Limitations

The benchmark’s synthetic corruptions, while diverse and validated, cannot cover the full spectrum of real-world OOD conditions and complexities. Analyses focus on coarse-grained architectural and training factors (e.g., depth usage, backbone choice, pre-training, temporal fusion) rather than fine-grained module designs, leaving trade-offs between detailed architectural choices underexplored. The multi-modal sensor failure simulations retain part of the LiDAR FOV, because models collapse entirely when all points are removed, so they may not capture all field conditions. Moreover, negative transfer from corrupted modalities and error accumulation in temporal fusion highlight areas requiring specialized robustness modules.
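
For concreteness, the partial LiDAR retention used when simulating LiDAR failure (keeping only points within a frontal [-45°, 45°] FOV) might look like the Python sketch below. The coordinate convention (+x forward, +y left) and the function name are assumptions made for illustration; the actual sensor frame and implementation may differ.

```python
import math

def keep_frontal_points(points, half_fov_deg=45.0):
    """Keep LiDAR points whose azimuth lies within [-half_fov_deg, half_fov_deg].

    Assumes each point is (x, y, z) with +x pointing forward and +y to the left.
    """
    kept = []
    for x, y, z in points:
        azimuth = math.degrees(math.atan2(y, x))
        if -half_fov_deg <= azimuth <= half_fov_deg:
            kept.append((x, y, z))
    return kept

points = [
    (10.0, 0.0, 0.0),    # straight ahead: kept
    (10.0, 5.0, 0.5),    # about 26.6 degrees to the left: kept
    (0.0, 10.0, 0.0),    # 90 degrees to the left: dropped
    (-10.0, 0.0, 0.0),   # directly behind: dropped
    (5.0, -10.0, 1.0),   # about 63.4 degrees to the right: dropped
]
frontal = keep_frontal_points(points)
```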
