A platform-agnostic deep reinforcement learning framework for effective Sim2Real transfer towards autonomous driving
D. Li and O. Okhrin
Dianzhao Li and Ostap Okhrin present a deep reinforcement learning (DRL) framework for autonomous driving that addresses the Sim2Real transfer challenge. By training lane-following and overtaking policies in simulation and decoupling perception from control, the approach carries over to the real world with consistent performance.
Introduction
The study addresses the challenge of transferring DRL policies for autonomous driving from simulation to the real world, where differences in sensing, visuals, and dynamics create a Sim2Real gap. While DRL has shown strong capabilities in complex, dynamic decision-making tasks, most Sim2Real driving studies focus on single subtasks like lane following, with little work on overtaking or combined lane following and overtaking in real-world settings. The paper proposes a framework that separates a platform-dependent perception module from a universal DRL control module to abstract away platform heterogeneity and reduce visual redundancy, enabling robust transfer across simulators and to real robot platforms. The goal is to train a single agent capable of lane following and safe overtaking that generalizes across environments and operates effectively in real-world conditions.
Literature Review
Prior DRL research has addressed individual driving subtasks including car following, lane keeping, lane changing, overtaking, and collision avoidance. Sim2Real transfer techniques such as domain randomization, domain adaptation, knowledge distillation, meta-RL, and robust RL have narrowed the simulation–reality gap, enabling real-world lane following and collision avoidance. However, existing works largely focus on single tasks (typically lane following) and do not address overtaking or joint lane following plus overtaking with real-world deployment. This gap motivates a framework that generalizes across platforms and tasks.
Methodology
Framework: The approach comprises (i) a platform-dependent perception module that extracts task-relevant affordances (e.g., lateral displacement and orientation deviation relative to lane) and (ii) a universal DRL control module trained in simulation and transferred across platforms. The DRL agent is LSTM-based to capture temporal dependencies essential for overtaking.
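The separation can be pictured as a thin interface between the two modules. Below is a minimal Python sketch of that interface; the class and method names are illustrative assumptions, not the authors' code.

```python
from dataclasses import dataclass

@dataclass
class Affordances:
    """Task-relevant state, abstracted away from any specific platform."""
    lateral_displacement: float   # d: signed offset from the right-lane center [m]
    orientation_deviation: float  # theta: heading error w.r.t. the lane direction [rad]

class PerceptionModule:
    """Platform-dependent: maps raw camera images to affordances."""
    def estimate(self, image) -> Affordances:
        raise NotImplementedError  # one implementation per simulator/robot

class ControlModule:
    """Universal: an LSTM-based DRL policy that consumes affordances only."""
    def act(self, aff: Affordances, speed: float, hidden_state=None):
        # Returns (speed command, steering angle) plus the updated LSTM hidden
        # state, which carries the temporal context needed for overtaking.
        raise NotImplementedError
```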
Perception module: A multi-step image pipeline compensates for illumination by running k-means clustering on sampled pixels, matching clusters to expected colors (red/yellow/white/gray), and fitting an affine RGB transformation for color balance. Edges are detected with the Canny detector, colors via HSV thresholding, and line segments via the probabilistic Hough transform. Detected features are reprojected to the world frame using camera calibration. A nonlinear, non-parametric histogram filter then estimates the lateral displacement d and angle offset θ relative to the right-lane center. These affordances feed the control module. In real-world tests, ground-truth values for error metrics are unavailable, so perception is quantitatively assessed in simulation via RMSE.
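The edge- and line-extraction steps can be sketched with standard OpenCV calls, as below; the HSV thresholds and Hough parameters are placeholders, not the paper's calibrated values.

```python
import cv2
import numpy as np

def extract_line_segments(frame_bgr: np.ndarray) -> np.ndarray:
    """Canny edges + HSV color masks + probabilistic Hough transform."""
    edges = cv2.Canny(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY), 80, 200)

    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    white = cv2.inRange(hsv, (0, 0, 150), (180, 60, 255))    # white lane line
    yellow = cv2.inRange(hsv, (20, 80, 80), (35, 255, 255))  # yellow center line
    color_mask = cv2.dilate(cv2.bitwise_or(white, yellow), np.ones((3, 3), np.uint8))

    # Keep only edges lying on lane-colored pixels, then fit line segments.
    masked = cv2.bitwise_and(edges, color_mask)
    segments = cv2.HoughLinesP(masked, rho=1, theta=np.pi / 180, threshold=20,
                               minLineLength=10, maxLineGap=5)
    return segments if segments is not None else np.empty((0, 1, 4), dtype=np.int32)
```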
Control module and states: During training, the agent observes the lateral displacement and angle offset from the perception module, the vehicle speed, and Vector Field Guidance (VFG) states, plus a conditional time-to-collision and a proximity flag for overtaking; all observations are normalized. The action space comprises speed and steering angle. During evaluation, the VFG states are replaced by perceptual substitutes.
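As a rough illustration, the training-time observation might be assembled as follows; the field order and normalization constants are assumptions made for this sketch.

```python
import numpy as np

def build_observation(d, theta, speed, vfg_course, ttc=None, obstacle_near=False,
                      d_max=0.25, v_max=0.5, ttc_max=5.0):
    """Normalized observation vector; ttc applies only when an obstacle is tracked."""
    ttc = ttc if ttc is not None else ttc_max
    return np.array([
        np.clip(d / d_max, -1.0, 1.0),      # lateral displacement
        theta / np.pi,                      # angle offset
        speed / v_max,                      # vehicle speed
        vfg_course / np.pi,                 # VFG state (training only)
        np.clip(ttc / ttc_max, 0.0, 1.0),   # conditional time-to-collision
        float(obstacle_near),               # proximity flag for overtaking
    ], dtype=np.float32)
```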
Vector Field Guidance (VFG): Provides desired course angle based on cross-track error relative to a target path, aiding stable lane following during training.
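A standard straight-line vector-field guidance law is sketched below; the paper's exact parameterization may differ, so chi_inf and k here are illustrative.

```python
import math

def vfg_desired_course(path_course, cross_track_error, chi_inf=math.pi / 2, k=4.0):
    """Desired course angle that decays toward the path course as the
    cross-track error shrinks; chi_inf is the approach angle far from the
    path and k sets how aggressively the field steers the vehicle back."""
    return path_course - chi_inf * (2.0 / math.pi) * math.atan(k * cross_track_error)
```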
Reward: The reward balances safety and performance: a -1 penalty for collisions or boundary violations; otherwise a weighted sum of lane-following (cross-track error), velocity/efficiency (and overtaking incentives), and heading components with weights w_c=0.3, w_v=0.6, w_a=0.1.
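Only the weights and the -1 terminal penalty are stated above; the per-term shaping functions in the following sketch are assumptions.

```python
import math

def reward(collision, off_road, cross_track_err, speed, heading_err,
           w_c=0.3, w_v=0.6, w_a=0.1, d_max=0.25, v_max=0.5):
    if collision or off_road:
        return -1.0  # safety penalty replaces the weighted sum
    r_lane = 1.0 - min(abs(cross_track_err) / d_max, 1.0)  # lane-following term
    r_vel = speed / v_max                                  # velocity/efficiency term
    r_head = 1.0 - min(abs(heading_err) / math.pi, 1.0)    # heading term
    return w_c * r_lane + w_v * r_vel + w_a * r_head
```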
Baselines: PID-family controllers (a simulation PD controller with privileged lane-center information; a real-world PI controller using perception outputs), a human baseline in simulation (25 participants who received training and were scored on their best of six attempts), and two DRL baselines: (a) an end-to-end (E2E) image-based DRL agent (compact CNN) and (b) a CNN perception module predicting lateral offset and angle offset, feeding an LSTM-SAC control module.
Environments and hardware: Training in ROS+Gazebo; validation in Gym-Duckietown with ROS wrappers; real-world tests with Duckiebots DB21 (Jetson Nano) and DB19 (Raspberry Pi 3B). Domain randomization and noise (Gaussian on images and controls) improve robustness. Real-world tests vary lane color, width, lighting, and platform.
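The Gaussian image and control noise can be reproduced with a few lines of NumPy; the standard deviations below are placeholders rather than the tuned values.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_image(img, sigma=8.0):
    """Additive Gaussian pixel noise, clipped back to valid intensities."""
    noisy = img.astype(np.float32) + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def perturb_action(action, sigma=0.02, low=-1.0, high=1.0):
    """Additive Gaussian control noise, clipped to the action bounds."""
    return np.clip(np.asarray(action) + rng.normal(0.0, sigma, np.shape(action)),
                   low, high)
```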
Training: Algorithms include TD3, SAC, LSTM-TD3, and LSTM-SAC. Agents trained for ~1.5M timesteps (per seed) with 10 seeds on an RTX 3080 (~40 hours per 1M steps). LSTM-based agents converge; feedforward TD3/SAC fail on overtaking due to partial observability, highlighting the importance of recurrence.
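The recurrence these results point to can be as simple as an LSTM layer ahead of the policy head. A minimal PyTorch sketch follows; the layer sizes and overall architecture are illustrative, not the authors' exact network.

```python
import torch
import torch.nn as nn

class LSTMActor(nn.Module):
    """Recurrent actor of the kind used by LSTM-SAC/LSTM-TD3."""
    def __init__(self, obs_dim=6, act_dim=2, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, act_dim), nn.Tanh())

    def forward(self, obs_seq, hc=None):
        # obs_seq: (batch, time, obs_dim). The carried hidden state hc is what
        # lets the agent act under the partial observability of overtaking.
        out, hc = self.lstm(obs_seq, hc)
        return self.head(out), hc
```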
Evaluation protocol: Lane following in Gym-Duckietown on five maps (Normal 1, Normal 2, Plus, Zig-Zag, V track) with 100 episodes, randomized starts; metrics include survival time, distance, lateral deviation, orientation deviation, and infractions; a composite score aggregates metrics. Overtaking evaluated on three maps with success rate plus lane metrics. Real-world evaluations on a circular map and five additional maps; metrics include lateral/orientation deviation (from perception), average speed, and infractions; overtaking scenarios include static obstacles and dynamic overtaking of a PID-led slower vehicle.
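The per-episode metrics named above could be collected with a helper like the hypothetical one below; the paper's composite-score formula is not reproduced here.

```python
import numpy as np

def episode_metrics(lat_devs, orient_devs, distance_m, survival_time_s, infractions):
    """Aggregate raw per-step deviations into the reported episode metrics."""
    return {
        "survival_time_s": survival_time_s,
        "distance_m": distance_m,
        "mean_lateral_dev_m": float(np.mean(np.abs(lat_devs))),
        "mean_orientation_dev_rad": float(np.mean(np.abs(orient_devs))),
        "infractions": infractions,
    }
```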
Key Findings
Simulation lane following: Across five Gym-Duckietown maps and 100-episode evaluations, the DRL agent in fast mode achieved the highest final scores and traveled distances on every map. Relative to a near-optimal PID baseline (with privileged state), the DRL agent drove approximately 50–70% faster while maintaining competitive or better deviations and infractions; the slow-mode DRL agent outperformed PID on lateral and orientation deviations at similar speeds. The DRL agent also outperformed the best human baseline (best of six attempts) in speed, deviations, and infractions.
Simulation overtaking: On Normal 1, Normal 2, and Zig-Zag maps, overtaking success rates were 94.74%, 94.44%, and 90.91%, respectively (10-episode medians). Lateral and orientation deviations remained comparable to pure lane following; infractions were higher due to left-lane use during overtaking. The agent returned to the right lane after overtakes.
Real-world lane following: On DB21 hardware, the proposed DRL agent exhibited smooth trajectories and consistent performance on inner and outer rings across five maps, outperforming PID in average speed and infractions. In controlled tests with three DB21 vehicles on a circular track, the DRL agent achieved up to a 65% higher average speed than PID while maintaining robust lane keeping. E2E DRL and CNN-DRL baselines that performed comparably in simulation failed to transfer robustly: the E2E agent exhibited aggressive maneuvers, the CNN-based agent could not recover from left-lane excursions, and both accumulated numerous infractions under real-world variations.
Real-world overtaking: The DRL agent successfully performed overtakes around multiple static obstacles and dynamically overtook a slower PID-led vehicle, then returned to lane following. It demonstrated recovery from unfavorable post-overtake states, re-entering the lane without intervention.
Perception accuracy and Sim2Real gap: In simulation, perception RMSE for lateral deviation ranged 0.046–0.067 m (road width 0.23 m) and orientation deviation RMSE 0.548–0.836 rad across maps, yet the DRL agent remained robust to these inaccuracies. Appearance gap quantified by FID between simulated and real camera images was 198.82, indicating substantial visual differences. Despite content and appearance gaps, the agent maintained strong performance, whereas E2E and CNN baselines degraded notably under real-world variations (lane color/width).
Discussion
The findings demonstrate that separating a platform-dependent perception module from a universal LSTM-based DRL control module enables robust transfer from simulation to real robots for lane following and overtaking. The abstraction of task-relevant affordances reduces platform heterogeneity and visual redundancy, allowing a policy trained on a simple map in Gazebo to generalize to diverse Gym-Duckietown tracks and real-world tracks with minimal parameter adjustments. Despite imperfect perception (notable RMSE) and substantial Sim2Real appearance and content gaps, the agent outperformed PID (even when PID had privileged access in simulation) and human baselines in simulation, and surpassed PID in real-world speed and reliability. Recurrence (LSTM) was essential to handle partial observability in overtaking. The results underline the framework’s resilience across varying lane colors, widths, lighting, and hardware, addressing the core research question of effective Sim2Real transfer for combined lane following and overtaking.
Conclusion
This work introduces a platform-agnostic DRL framework that decouples perception and control, enabling reliable Sim2Real transfer for lane following and overtaking. Trained in a simple simulated environment, the agent generalizes across simulators and real-world platforms, outperforming PID and human baselines in simulation and exceeding PID performance in real-world speed while maintaining safe operation. The study quantifies appearance and content gaps and shows robustness to perception noise and domain shifts. Future work includes modeling platform dynamics more explicitly, incorporating recurrent architectures within perception to mitigate latency and noise, developing more realistic cooperative overtaking with multi-agent RL, and leveraging insights from the E2E agent to further refine the modular framework.
Limitations
- Vehicle dynamics are treated as black boxes; the control module issues high-level commands that may not translate optimally across different robotic systems.
- Perception and control noise, including image transfer latency, can impact performance; the current perception module lacks temporal modeling.
- The overtaking setup assumes a constant slow leader speed, which is unrealistic; cooperative multi-agent interactions were not modeled.
- Perception accuracy is limited (notable RMSE in lateral and orientation estimates), especially given narrow lane widths.
- PID comparisons in simulation used privileged information, and human evaluations used best-of attempts; although these choices are discussed, comparisons across approaches are not fully like-for-like.