Deep learning-based robust positioning for all-weather autonomous driving

Engineering and Technology

Y. Almalioglu, M. Turan, et al.

Dive into the innovative world of autonomous vehicle technology with groundbreaking research by Yasin Almalioglu, Mehmet Turan, Niki Trigoni, and Andrew Markham. This study introduces a robust, deep learning-based method for ego-motion estimation under adverse weather conditions, integrating visual and radar data to enhance safety and reliability in all environments.

Introduction
Autonomous vehicles promise improved safety, reduced congestion and enhanced mobility, but their large-scale deployment is hindered by reliability and safety concerns, especially under adverse weather. Precise localization is foundational for downstream AV functions such as prediction and motion planning. GNSS-based localization can be unreliable in urban canyons and lacks orientation accuracy; thus, ego-motion estimation (odometry) from onboard sensors is a critical complementary solution. However, cameras and lidars degrade in rain, fog, snow, low illumination, and glare, while millimetre-wave radars are comparatively weather-immune but provide noisier, lower-resolution measurements. The research question addressed is whether a self-supervised, geometry-aware, multimodal deep learning approach that fuses camera, lidar and radar can deliver robust, generalizable ego-motion and depth estimation across day/night and adverse weather, while remaining modular and interpretable. The paper introduces GRAMME, a sensor-agnostic framework designed to exploit complementary sensor strengths and mitigate their weaknesses to enable all-weather autonomous driving.
Literature Review
Prior AV perception and localization research has relied heavily on camera and lidar sensors and large public datasets, but accurate ground-truth labels are costly and often unavailable, especially under adverse conditions. Self-supervised monocular and stereo depth/pose methods leverage view reconstruction as supervision but are vulnerable to photometric violations, occlusions and low-texture regions; their generalization under weather shifts is limited. Millimetre-wave radar offers robustness to illumination and airborne obscurants due to longer wavelengths, yet traditional radar odometry methods (geometry-based or supervised learning) struggle with sparsity, speckle, ghost artefacts and coarser spatial resolution. Existing datasets (e.g., Oxford RobotCar, Oxford Radar RobotCar, RADIATE) provide multiple sensing modalities across varied conditions, enabling study of multimodal learning. The literature indicates a need for: (1) self-supervised training without reliance on scarce ground truth, (2) multimodal fusion that respects geometry and sensor idiosyncrasies, (3) mechanisms to mask unreliable regions to maintain reconstruction consistency, and (4) interpretability tools to understand learned features.
Methodology
Overview: GRAMME (Geometry-aware multimodal ego-motion estimation) is a self-supervised deep learning framework that estimates per-pixel depth (for cameras) and the vehicle's 6-DoF ego-motion by reconstructing views across time from multiple sensors (camera, lidar, radar). It is modular (each modality can operate independently at training and inference time), sensor-agnostic (supporting varying resolutions, beamwidths and fields of view), and employs late fusion of modality-specific motion predictions.

Architecture and modules:
- Camera module: DepthNet (a UNet with skip connections) predicts dense per-pixel depth from a single RGB frame. VisionNet (a ResNet18 encoder with fully connected regressors) predicts the relative pose between consecutive camera frames. A spatial transformer performs differentiable perspective warping to reconstruct the target view from source views. A second-order spatial smoothness regularizer encourages piecewise-planar depth in low-texture and occluded regions, and a left-right consistency term enables use with stereo cameras.
- Range module: RangeNet (a ResNet18-based feature encoder with fully connected regressors) predicts the relative pose from pairs of consecutive range frames (lidar or radar) represented in bird's-eye view (Cartesian) form. MaskNet (a UNet-style network with skip connections and a transposed-convolution decoder) predicts consistency masks that identify reliable regions in each range frame, mitigating artefacts such as radar speckle and ghost returns, lidar ground reflections, and fog droplets. Differentiable bilinear sampling reconstructs the target range view; a masked intensity loss penalizes reconstruction error, while a cross-entropy regularizer prevents trivial all-zero masks.
- Multimodal fusion: A late-fusion module (FusionNet) takes the unaligned pose predictions from VisionNet and RangeNet, together with attention-weighted features, and outputs a fused pose via an MLP. An attention submodule produces importance weights (0–1 via softmax) over the concatenated features, enabling dynamic weighting of radar/lidar features (see the sketch at the end of this subsection).
- Spatial transformer: Implements differentiable view synthesis for both the camera (perspective projection with intrinsics) and the range sensors (inverse warping in BEV via a Rodrigues-based SE(3) transform and bilinear sampling), providing the self-supervised training signal.

Masking: GRAMME combines learned masks (from MaskNet) with geometric masks (derived from motion consistency, near-identical frames and dynamic-object handling) to exclude unreliable or occluded regions from the reconstruction losses of all modalities.

Training objective: The total loss sums the modality-specific reconstruction losses and the mask regularization: L = λ_l L_range(M, I_s, I_t) + λ_d L_camera(D, T, M) + λ_m L_mask(M). A cross-modal fusion loss is also applied by using the fused pose in the range reconstruction, improving robustness and cross-modal information flow. Fixed weights used in the experiments are λ_l = 2, λ_d = 30 and λ_m = 1. Optimization uses Adam (learning rate 1e-4, L2 weight decay 1e-5) with batch size 16 for 50–200 epochs, with early stopping on the validation loss (patience 5). Data augmentation includes random in-plane rotations of lidar/radar scans in [-10°, 10°].

Datasets and preprocessing: Experiments use Oxford Radar RobotCar (ORR) and Oxford RobotCar (camera/lidar only) for the primary evaluation and RADIATE for cross-dataset generalization. Radar frames are converted from polar to Cartesian coordinates, and lidar point clouds are projected to BEV intensity maps. ORR provides ground-truth trajectories from GPS/INS fused with visual odometry and loop closures, and lidar depth ground truth from merged point clouds.
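The sketch below illustrates the attention-based late fusion and the weighted training objective described above. It is a minimal, illustrative reading of the paper's description, not the published GRAMME implementation: the layer sizes, feature dimensions and names such as feat_dim, hidden, LateFusion and total_loss are assumptions; only the λ values and the softmax-attention-plus-MLP structure come from the text.

```python
# Illustrative sketch only; dimensions and names are assumptions, not the GRAMME code.
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    """Attention-weighted late fusion of modality-specific pose features:
    softmax attention over the concatenated camera/range features,
    followed by an MLP that regresses a fused 6-DoF pose."""
    def __init__(self, feat_dim=256, hidden=128):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(2 * feat_dim, 2 * feat_dim),
            nn.Softmax(dim=-1),          # importance weights in (0, 1)
        )
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 6),        # 3 translation + 3 rotation parameters
        )

    def forward(self, camera_feat, range_feat):
        fused = torch.cat([camera_feat, range_feat], dim=-1)
        weighted = self.attention(fused) * fused   # dynamic re-weighting of features
        return self.mlp(weighted)

def total_loss(l_range, l_camera, l_mask, lam_l=2.0, lam_d=30.0, lam_m=1.0):
    """L = λ_l·L_range + λ_d·L_camera + λ_m·L_mask, with the fixed weights
    reported in the paper (λ_l = 2, λ_d = 30, λ_m = 1)."""
    return lam_l * l_range + lam_d * l_camera + lam_m * l_mask

# Example: fuse per-modality features for a batch of 4 frame pairs -> (4, 6) poses.
pose = LateFusion()(torch.randn(4, 256), torch.randn(4, 256))
```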
Dataset splits per fold: 80% train, 10% validation, 10% test, with fivefold cross-validation and a challenging cross-condition protocol: models are trained on day sequences only and tested on night, rain, fog and snow.

Evaluation protocols:
- Ego-motion: KITTI odometry metrics, reporting the average relative translation error (%) and rotation error (deg/100 m) over sub-sequences of 100–800 m, with spatial cross-validation as in prior work.
- Depth: standard error metrics (AbsRel, SqRel, RMSE, RMSElog) and accuracy thresholds (δ < 1.25, 1.25², 1.25³). Monocular predictions are median-scaled to resolve scale ambiguity, and errors are capped at 60 m for comparability (see the sketch at the end of this section).

Implementation details: PyTorch-based, with ResNet18 encoders unless otherwise stated; ablations with MobileNet and VGG16 evaluate accuracy-latency trade-offs. Inference is benchmarked on an NVIDIA GTX 1080Ti; training uses dual RTX 3090 GPUs.
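As a complement to the evaluation protocol above, here is a minimal sketch of the standard monocular depth metrics with median scaling and the 60 m cap. The function name depth_metrics and its arguments are illustrative assumptions, not taken from the authors' evaluation code.

```python
# Illustrative depth-evaluation sketch; names are ours, not from the GRAMME codebase.
import numpy as np

def depth_metrics(gt, pred, cap=60.0, min_depth=1e-3):
    """Median-scale a monocular prediction, cap depths at `cap` metres and
    compute AbsRel, SqRel, RMSE, RMSElog and the δ < 1.25^k accuracies."""
    mask = (gt > min_depth) & (gt < cap)
    gt, pred = gt[mask], pred[mask]

    pred = pred * (np.median(gt) / np.median(pred))   # resolve monocular scale ambiguity
    pred = np.clip(pred, min_depth, cap)

    abs_rel = np.mean(np.abs(gt - pred) / gt)
    sq_rel = np.mean(((gt - pred) ** 2) / gt)
    rmse = np.sqrt(np.mean((gt - pred) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))

    ratio = np.maximum(gt / pred, pred / gt)
    a1, a2, a3 = [np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)]
    return abs_rel, sq_rel, rmse, rmse_log, a1, a2, a3
```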
Key Findings
All-weather robustness and generalization:
- Models trained only on day sequences generalize to night, rain, fog and snow without fine-tuning. Camera-only self-supervised models degrade substantially under adverse conditions, while adding lidar or radar markedly improves robustness.

Depth prediction improvements (monocular evaluation only):
- Across all adverse conditions, fusion-trained GRAMME models outperform camera-only baselines. Example AbsRel (lower is better; Monodepth2 stereo vs GRAMME S+R vs GRAMME S+L): Day 0.252 vs 0.228 vs 0.229; Night 0.251 vs 0.238 vs 0.237; Fog 0.260 vs 0.232 vs 0.243; Snow 0.245 vs 0.230 vs 0.234. Similar gains are observed in RMSE and δ-accuracy.
- Radar-camera training yields greater immunity to precipitation, while lidar-camera training performs relatively better under poor illumination.

Ego-motion accuracy and benefits of fusion:
- Fusion substantially reduces translational and rotational errors versus camera-only models across day, night, rain, fog and snow. Representative mean relative errors (translation % / rotation deg/100 m):
- GRAMME (Radar & Camera): 1.98 / 0.51, improving over radar-only GRAMME (2.49 / 0.61) and prior radar odometry (e.g., Barnes Dual Cart 2.78 / 0.85; Hong et al. 3.11 / 0.90).
- GRAMME (Lidar & Camera): 0.90 / 0.23, outperforming lidar-only GRAMME (1.06 / 0.31) and competitive lidar baselines.
- The masks learned per modality effectively identify and exclude unreliable regions (e.g., camera glare and occlusions; lidar ground reflections and fog-induced false returns; radar speckle and ghosts), improving the supervisory signal and final performance.

Self-supervision vs supervision:
- Self-supervised training yields better cross-condition generalization than models trained with ground-truth poses; supervised models overfit input-to-label mappings rather than the underlying geometry.

Data efficiency and complexity:
- Fusion models achieve strong performance with ≥50% of the training data across conditions, whereas camera-only models need about 75% to reach comparable baselines. With very limited data (25%), fusion can underperform because of its increased model complexity.

Interpretability:
- SHAP-based analyses show that camera-only models have scattered high-importance regions and focus on artefacts under adverse conditions, whereas fusion-trained models concentrate on semantically and geometrically consistent structures (vehicles, road boundaries, static objects), even when tested with camera input alone.

Runtime and architecture:
- ResNet18 encoders offer a favorable accuracy-latency trade-off versus MobileNet (faster, less accurate) and VGG16 (slightly more accurate in the monocular setting but roughly 4x slower). Multimodal branches increase latency but remain compatible with real-time constraints on consumer GPUs.

Sensor-specific observations:
- Radar provides superior robustness in precipitation, while lidar suffers notably in dense fog. GRAMME is sensor-agnostic and benefits from diverse sensor resolutions and beamwidths.
Discussion
The findings demonstrate that a geometry-aware, self-supervised multimodal approach addresses core localization challenges for AVs. By leveraging complementary sensing (camera richness, lidar granularity, radar weather immunity) and masking unreliable regions, GRAMME maintains reconstruction consistency and learns generalizable geometric cues. The modular design yields independent, uncorrelated failure modes and preserves gains even when some modalities are absent at test time, enhancing robustness toward minimal-risk conditions. Compared to supervised counterparts, self-supervision improves cross-condition generalization by optimizing for geometric consistency rather than dataset-specific labels. The improved ego-motion accuracy and depth estimation under day/night and adverse weather directly support the downstream AV stack (prediction, planning), enabling safer operation when GNSS is degraded. Interpretability analyses (SHAP) indicate that multimodal training guides the model to focus on semantically meaningful structures, fostering trust and aiding validation. Data efficiency experiments suggest that multimodal self-supervision can achieve competitive performance without exhaustive labeled datasets, though fusion architectures require adequate data to fully realize benefits.
Conclusion
GRAMME introduces a modular, geometry-aware, self-supervised fusion framework for ego-motion and depth estimation that operates robustly across day/night and adverse weather. It fuses camera, lidar and radar via differentiable view reconstruction and attention-based late fusion, with learned and geometric masks to filter unreliable regions. Experiments on the Oxford RobotCar and Oxford Radar RobotCar datasets, with cross-dataset evaluation on RADIATE, show state-of-the-art performance and strong cross-condition generalization, surpassing prior radar odometry and improving over camera-only self-supervised baselines in depth and motion accuracy. The approach is interpretable, sensor-agnostic, and data-efficient relative to its complexity. Future work includes incorporating the range sensors' signal-to-noise ratio into masking, exploiting radar Doppler for better dynamic/static discrimination, and extending GRAMME to continual and lifelong learning for collaborative, continuously improving AV localization.
Limitations
Public AV datasets do not comprehensively cover extreme weather (e.g., heavy downpours, large snowfalls), limiting stress-testing of generalization. Radar imaging remains relatively sparse and low-resolution; higher-resolution 3D radar would likely yield further gains. Lidar is vulnerable in dense fog owing to attenuation and backscatter, which can degrade fusion if not adequately masked. Fusion architectures are more data-hungry: with very limited training data (e.g., 25%), performance can drop below that of simpler camera-only models. Finally, although self-supervision improves generalization, it depends on static-scene and photometric-consistency assumptions that can be violated by dynamic objects and severe sensor artefacts; this necessitates robust masking and may still leave residual errors.