Low-latency automotive vision with event cameras

D. Gehrig and D. Scaramuzza

Daniel Gehrig and Davide Scaramuzza present a hybrid object detector for advanced driver-assistance systems (ADAS) that combines the low latency of event cameras with the accuracy of RGB cameras, reducing bandwidth and computation without sacrificing detection performance.
Introduction

Frame-based RGB sensors face a bandwidth–latency trade-off: higher frame rates reduce perceptual latency but require higher bandwidth, whereas lower rates save bandwidth but increase blind time, risking missed scene dynamics. In automotive safety, typical ADAS cameras running at 30–45 fps have blind times of 22–33 ms between frames, which can be critical at high speeds or under uncertainty (occlusions, poor lighting, adverse weather). Increasing frame rates inflates already massive data volumes. Event cameras offer microsecond temporal resolution, high dynamic range, and sparse, low-power data, promising low-latency perception. However, event-only methods struggle with slowly varying signals, and many approaches densify events for CNNs, incurring redundant computation and latency. The study proposes a hybrid detector that combines low-rate images with asynchronous event processing to reduce both perceptual and computational latency, aiming to achieve high-rate, accurate detections with low bandwidth and improved safety in inter-frame blind times.
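The blind-time figures quoted above follow directly from the frame interval. The snippet below is a minimal sketch of that arithmetic; the function names and the 108 km/h example speed are illustrative assumptions, not values from the paper.

```python
def blind_time_ms(fps: float) -> float:
    """Time between two consecutive frames, in milliseconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def distance_travelled_m(speed_kmh: float, fps: float) -> float:
    """Distance a vehicle covers during one inter-frame interval, in metres."""
    speed_m_per_s = speed_kmh / 3.6
    return speed_m_per_s * blind_time_ms(fps) / 1000.0


# Frame rates quoted in the text for typical ADAS cameras.
for fps in (30, 45):
    print(f"{fps} fps -> {blind_time_ms(fps):.1f} ms blind time")

# Hypothetical example speed (assumption): 108 km/h = 30 m/s.
print(f"{distance_travelled_m(108, 30):.2f} m travelled per frame at 30 fps")
```

At the assumed speed, a 30-fps camera leaves roughly one metre of travel unobserved between frames, which is the gap the event stream is meant to fill.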

Literature Review

Event-based vision provides advantages in latency, dynamic range, and sparsity, as documented in surveys and in the work introducing the foundational sensors. Yet, existing event-based algorithms either (1) convert events into dense frame-like representations for CNNs, enabling accuracy but causing redundant computation and higher latency/power (e.g., dense feedforward/recurrent methods for detection, optical flow, video reconstruction), or (2) use sparse/asynchronous methods that are efficient but often less accurate (e.g., AEGNN and other asynchronous sparse CNNs). Recurrent dense methods (e.g., RED, ASTMNet, MatrixLSTM) can improve accuracy but at substantial computational cost. Spiking neural networks on automotive event data remain less accurate with current training strategies. Fusion methods that stack events into histograms with standard backbones (Inception+SSD, Events+YOLOv3/YOLOX) can be effective but tend to reprocess data repeatedly and often use bidirectional feature sharing, which increases computation. This work builds on asynchronous graph-based processing and residual/spline-based graph convolutions to improve both efficiency and accuracy, and adopts a YOLOX-style detection head for robust detection while maintaining sparsity.
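To make the "dense frame-like representation" concrete, here is a minimal sketch of stacking events into a two-channel polarity histogram. The event layout `(x, y, t, p)`, the function names, and the tiny sensor size are assumptions for illustration; the cited methods use their own representations.

```python
from typing import List, Sequence, Tuple

# Assumed event layout: (x, y, timestamp in seconds, polarity 0 or 1).
Event = Tuple[int, int, float, int]


def event_histogram(events: Sequence[Event], width: int, height: int) -> List[List[List[int]]]:
    """Count events per pixel and polarity: hist[polarity][y][x]."""
    hist = [[[0] * width for _ in range(height)] for _ in range(2)]
    for x, y, _t, p in events:
        if p in (0, 1) and 0 <= x < width and 0 <= y < height:
            hist[p][y][x] += 1
    return hist


def occupancy(hist: List[List[List[int]]]) -> float:
    """Fraction of histogram cells that received at least one event."""
    cells = [v for channel in hist for row in channel for v in row]
    return sum(1 for v in cells if v > 0) / len(cells)


events = [(1, 2, 0.001, 1), (1, 2, 0.002, 1), (3, 0, 0.003, 0)]
hist = event_histogram(events, width=4, height=3)
print(f"non-empty cells: {occupancy(hist):.1%}")
```

A dense CNN still processes every cell of such a histogram, including the empty ones, which is the redundancy that sparse and asynchronous methods try to avoid.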

Methodology

The proposed system, Deep Asynchronous GNN (DAGr), is a hybrid object detector combining a conventional CNN for low-rate images with an asynchronous graph neural network (GNN) for high-rate event streams.

Architecture and data flow: (1) Image processing: At each arriving image, a CNN (e.g., ResNet-18/34/50 backbone) computes dense features. These features are shared unidirectionally with the event GNN via skip connections; the GNN uses image features but does not feed back to the CNN. (2) Event processing: Incoming events are organized as spatio-temporal graphs using an efficient CUDA implementation. The GNN processes this graph through a sequence of graph convolution and pooling layers with residual connections to enable deep, efficient training. A specialized voxel grid max-pooling layer aggressively reduces node counts early to cap computation. The detection head mirrors YOLOX but replaces standard convolutions with graph convolutions; an efficient variant of spline convolution precomputes lookup tables to reduce compute.

Asynchrony and recursion: The model is first trained in a batched, synchronous setting: given an image I_t and the subsequent 50 ms of events, it is trained to predict detections at the next frame. After training, the identical-weight model is converted to an asynchronous form with recursive update rules. For each new event, each GNN layer maintains and updates its prior graph structure and activations, propagating only localized changes. Efficiency strategies include: (a) limiting computation to messages between nodes whose features or positions changed; (b) pruning non-informative updates at early max-pooling stages to stop propagation; and (c) using directed event graphs (edges only forward in time) to reduce update spread.

Models and training: Four model scales (nano/small/medium/large) vary channel counts (32/64/92/128) in deeper blocks and detection heads. Power is estimated by counting multiply-accumulate (MAC) operations and multiplying by 1.69 pJ per MAC.

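The energy model described above is a single multiplication. A minimal sketch follows; the function names and the one-million-MAC example are assumptions for illustration, and only the 1.69 pJ constant comes from the text.

```python
ENERGY_PER_MAC_J = 1.69e-12  # 1.69 pJ per MAC, as stated in the text


def energy_per_event_uj(macs_per_event: float) -> float:
    """Estimated energy per event in microjoules for a given MAC count."""
    return macs_per_event * ENERGY_PER_MAC_J * 1e6


def macs_from_energy_uj(energy_uj: float) -> float:
    """Inverse mapping: MAC count implied by an energy figure in microjoules."""
    return energy_uj * 1e-6 / ENERGY_PER_MAC_J


# Hypothetical workload (assumption): one million MACs per event.
print(f"{energy_per_event_uj(1_000_000):.2f} uJ per event")
```

This estimate counts arithmetic only; it says nothing about memory traffic or wall-clock runtime on a particular device.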
Evaluation protocols include purely event-based object detection (Gen1) and image+event fusion on DSEC-Detection, where all methods see one image and the following 50 ms of events to predict labels at 50 ms. Inter-frame performance is assessed by varying temporal offsets between frames, including comparisons to image-only baselines with constant or linear extrapolation of detections.
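The two image-only baselines can be sketched as follows. This is a simplified illustration that assumes each box has already been matched to the same object in the previous frame; how detections are associated across frames is not described here, and the box layout and function names are assumptions.

```python
from typing import Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def constant_extrapolation(box_curr: Box, dt_ms: float) -> Box:
    """Hold the last detection unchanged for any offset dt_ms after the frame."""
    return box_curr


def linear_extrapolation(box_prev: Box, box_curr: Box,
                         frame_interval_ms: float, dt_ms: float) -> Box:
    """Continue the per-coordinate motion observed between two frames."""
    scale = dt_ms / frame_interval_ms
    return tuple(c + (c - p) * scale for p, c in zip(box_prev, box_curr))


prev_box = (0.0, 0.0, 10.0, 10.0)
curr_box = (5.0, 0.0, 15.0, 10.0)  # object moved 5 px to the right in 50 ms
print(constant_extrapolation(curr_box, dt_ms=25.0))
print(linear_extrapolation(prev_box, curr_box, frame_interval_ms=50.0, dt_ms=25.0))
```

Both baselines rely only on past frames, so any change in motion inside the blind time goes unnoticed until the next image arrives.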

Key Findings
  • Latency–bandwidth equivalence: A 20-fps RGB camera paired with an event camera achieves perceptual latency comparable to a 5,000-fps camera with bandwidth comparable to a 45-fps camera, maintaining accuracy. The hybrid setup delivers ~0.2 ms perceptual latency with only ~4% more bandwidth than a 45-fps sensor and ~41% of the bandwidth of a 120-fps camera, while matching or exceeding accuracy.
  • Event-only detection (Gen1):
      • DAGr-L achieves 32.1 mAP, outperforming MatrixLSTM by 1.1 mAP with ~120× fewer FLOPs; it also outperforms feedforward dense methods Events+RRC (30.7 mAP), Inception+SSD (30.1 mAP), and Events+YOLOv3 (31.2 mAP). Compared to a spiking DenseNet, DAGr shows +13.1 mAP.
      • DAGr-S achieves the best computational efficiency among sparse methods, using ~13% of the MFLOPs per event of runner-up AEGNN and +14.1 mAP over AEGNN.
      • DAGr-N is 3.8× more efficient than AEGNN and still +10 mAP higher.
      • Power: the smallest model requires only 1.93 µJ per event, the lowest among compared methods.
  • Event-only detection (N-Caltech101):
      • DAGr-S: 70.2 mAP, +5.9 mAP over AsyNet, and lower computation than AEGNN.
      • DAGr-L: 73.2 mAP (highest score).
      • DAGr-N: 2.28 MFLOPs per event (3.25× lower than AEGNN) with 3.4% higher mAP.
  • Image+event fusion (DSEC-Detection):
      • With ResNet-18 backbone, DAGr reaches 37.6 mAP, exceeding Inception+SSD (18.4) and Events+YOLOv3 (28.7). Events+YOLOX attains 40.2 mAP on the same backbone, likely due to bidirectional feature sharing.
      • Scaling to ResNet-50 increases DAGr to 41.9 mAP, surpassing Events+YOLOX while remaining vastly more efficient.
      • Compute: DAGr uses ~0.03% of the computation of Events+YOLOX; power ~5.42 µJ per event. Using directed edges cuts compute by 91% with only ~2% mAP drop.
  • Inter-frame detection stability and accuracy: DAGr’s mAP slightly increases over 0–50 ms with events (+0.7 mAP by 50 ms), indicating robust use of incremental event information; Events+YOLOX starts lower (34.7) and varies more (up to 42.5 by 50 ms), suggesting overfitting to the 50 ms window. Image-only YOLOX without events degrades by 10.8 mAP (constant extrapolation) and 6.4 mAP (linear extrapolation) over 50 ms.
  • Bandwidth–performance trade-off: Across automotive cameras (30–120 fps), the hybrid approach yields higher worst-case and average mAP than image-only baselines at similar or lower bandwidth. It exceeds YOLOX on Bosch MPC3 (45 fps) by 2.6 mAP with only 4% more data (64.9 vs 62.3 Mb/s) and outperforms the 120-fps IMX224 by 0.2 mAP with 41% of its bandwidth.
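The bandwidth percentages in the last bullet can be checked with a few lines of arithmetic. The sketch below uses the two data rates quoted above; the function names are illustrative, and the implied 120-fps data rate rests on the assumption that the 41% figure refers to the same 64.9 Mb/s hybrid stream.

```python
def relative_overhead_percent(hybrid_mbps: float, baseline_mbps: float) -> float:
    """Extra bandwidth of the hybrid setup relative to a baseline, in percent."""
    return (hybrid_mbps / baseline_mbps - 1.0) * 100.0


def implied_baseline_mbps(hybrid_mbps: float, fraction_of_baseline: float) -> float:
    """Baseline data rate implied when the hybrid uses a given fraction of it."""
    return hybrid_mbps / fraction_of_baseline


# Data rates quoted in the text: hybrid 64.9 Mb/s vs. 45-fps camera 62.3 Mb/s.
overhead = relative_overhead_percent(64.9, 62.3)
print(f"hybrid overhead vs. 45-fps camera: {overhead:.1f}%")

# Assumption: the 41% figure is measured against the same 64.9 Mb/s stream.
implied = implied_baseline_mbps(64.9, 0.41)
print(f"implied 120-fps data rate: {implied:.1f} Mb/s")
```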
Discussion

The proposed DAGr system addresses the bandwidth–latency trade-off by streaming sparse event data through an asynchronous GNN while leveraging rich contextual features from low-rate images. This design provides high-rate detections during inter-frame blind times, improving worst-case and average performance relative to image-only systems. Architectural choices—residual graph layers, early voxel grid max-pooling, efficient spline convolutions, directed graphs, and recursive localized updates—enable four orders of magnitude lower computational complexity than dense event processing approaches and substantial power savings, while deeper networks deliver accuracy surpassing prior sparse/asynchronous models. Fusion with images stabilizes performance across temporal offsets and improves localization (raising recall at high IoU). Compared to high-frame-rate cameras, the hybrid achieves comparable or better accuracy with far lower bandwidth, enhancing practical deployability in automotive perception. Remaining runtime gaps suggest that hardware-software co-design (e.g., spiking or specialized accelerators) could further realize the theoretical efficiency gains.
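The benefit of directed graphs for localized updates can be illustrated with a toy model. The sketch below is not the authors' CUDA implementation: it builds edges with a brute-force search, ignores pooling, and uses hypothetical parameters (`radius_px`, `window_s`). It only counts which nodes would need recomputation when the newest event is inserted.

```python
from typing import Dict, List, Sequence, Set, Tuple

Event = Tuple[int, int, float]  # assumed layout: (x, y, timestamp in seconds)


def build_directed_edges(events: Sequence[Event], radius_px: int,
                         window_s: float) -> List[Tuple[int, int]]:
    """Edges (src, dst) from earlier to later events; events sorted by time."""
    edges = []
    for dst, (xd, yd, td) in enumerate(events):
        for src in range(dst):
            xs, ys, ts = events[src]
            close = abs(xd - xs) <= radius_px and abs(yd - ys) <= radius_px
            if close and td - ts <= window_s:
                edges.append((src, dst))
    return edges


def affected_nodes(edges: Sequence[Tuple[int, int]], new_node: int,
                   num_layers: int, directed: bool) -> Set[int]:
    """Nodes whose activations change after num_layers of message passing."""
    receivers: Dict[int, Set[int]] = {}
    for src, dst in edges:
        receivers.setdefault(src, set()).add(dst)
        if not directed:  # undirected: messages also flow back in time
            receivers.setdefault(dst, set()).add(src)
    changed = {new_node}
    for _ in range(num_layers):
        changed |= {n for c in changed for n in receivers.get(c, ())}
    return changed


events = [(10, 10, 0.000), (11, 10, 0.001), (12, 10, 0.002), (13, 10, 0.003)]
edges = build_directed_edges(events, radius_px=1, window_s=0.01)
print(affected_nodes(edges, new_node=3, num_layers=2, directed=True))
print(affected_nodes(edges, new_node=3, num_layers=2, directed=False))
```

With edges pointing only forward in time, the newest event has no outgoing edges, so its insertion touches only its own node; in the undirected case the set of changed nodes grows with every layer.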

Conclusion

This work introduces DAGr, a hybrid event–frame object detector that combines low-rate image features with an asynchronous GNN over event graphs to deliver low-latency, low-bandwidth, high-rate detections. It achieves (i) event-only state-of-the-art efficiency with competitive accuracy, (ii) image+event fusion that surpasses or rivals dense baselines while being orders of magnitude more efficient, and (iii) inter-frame robustness that provides certifiable, earlier snapshots of dynamic objects. Practically, a 20-fps RGB camera plus an event camera attains the perceptual latency of a 5,000-fps camera at bandwidth comparable to a 45-fps setup, without sacrificing detection performance. Future research could integrate additional modalities (e.g., LiDAR) to supply strong priors and enable shallower, more efficient models, and pursue specialized hardware (e.g., spiking accelerators) and optimized implementations to translate computational savings into further wall-clock speedups.

Limitations

Although DAGr achieves orders-of-magnitude reductions in theoretical computation and power versus dense methods, current implementations translate to about 3.7× runtime improvements, leaving room for further system-level speedups via optimized software/hardware. The approach relies on a fixed training window (50 ms) and evaluation with interpolated ground truth over subsets without appearing/disappearing objects, so broader validation under diverse temporal windows and object dynamics would strengthen generalizability. More generally, event cameras are less sensitive to slowly varying intensity changes; while the hybrid mitigates this via image features, performance in scenes with minimal motion cues may depend on the image stream quality.
