
Transportation
Deep reinforcement learning for decision making of autonomous vehicle in non-lane-based traffic environments
Y. Fei, L. Xing, et al.
Using toll plaza diverging areas as a case study, the authors build a microscopic simulation of realistic human-driven trajectories and propose a deep reinforcement learning–based lateral motion strategy for autonomous vehicles with tailored state and reward functions. Simulations show reduced single-vehicle diverging time; moderate AV penetration improves efficiency and safety, while excessive penetration harms operations. Research conducted by Yi Fei, Lu Xing, Lan Yao, Zhizhi Yang, and Yujie Zhang.
Introduction
Most vehicle accidents are caused by human error, and human drivers exhibit unstable decision-making under varying environmental and personal factors. Autonomous vehicles (AVs), with superior perception and faster response, are expected to enhance traffic efficiency and safety, and decision-making is a critical capability. Lateral motion decisions are more complex than longitudinal ones because they involve interactions with more surrounding vehicles.

Existing studies on AV lateral decision-making predominantly focus on lane-based road sections, leveraging lane markings to structure perception and interaction. In contrast, non-lane-based areas (e.g., intersections, diverging and merging zones) involve irregular geometries, signals, and variable queues; vehicles there have longitudinal destinations but few lateral constraints, so conflicts can arise from any angle. Key challenges include: (1) constructing high-fidelity training environments for AVs that accurately reflect the destination-driven, weakly constrained behaviors of surrounding human-driven vehicles (HVs), since traditional microscopic simulations remain lane-based and cannot capture realistic interactions; and (2) designing AV perception states and reward functions that reflect non-lane-based environmental features to ensure safe and efficient lateral motion decisions.

Deep reinforcement learning (DRL) combines deep learning's feature extraction with reinforcement learning's interaction-based optimization, avoiding reliance on labeled data and enabling safer, more stable decisions in complex environments. This paper studies the diverging area of traditional mainline toll plazas (TMTPs) in China, a widening zone without lane markings between ETC and MTC lanes where vehicles must diverge to appropriate toll lanes over a limited distance; the setting exhibits destination-driven behavior, weak constraints, and high lateral motion demand. Prior simulations (SUMO, VISSIM) impose lane-based rules and simplify lateral motion, underestimating its impact on traffic. Moreover, although mixed traffic will persist for a prolonged period during AV adoption, most studies validate only single-AV scenarios without examining effects under varying AV penetration.

This study aims to develop a DRL-based, safe, and efficient lateral motion decision method for AVs in toll plaza diverging areas and to analyze its impacts on mixed traffic flow. Main contributions: (1) propose a micro-simulation approach that reproduces weakly constrained HV motions to provide realistic AV training interactions; (2) design DRL state and reward functions tailored to diverging areas, guiding AVs to select shorter queues and penalizing aggressive lateral motions; (3) investigate how varying penetration of self-efficient AVs affects efficiency and safety in the diverging area.
Literature Review
Lateral motion decision models include rule-based (e.g., Gipps, MOBIL, utility-based) and learning-based approaches. Rule-based models are interpretable but limited in scenario coverage for complex situations, motivating the shift to DRL-based models with improved generalization. DRL designs vary across input state variables, reward formulations (balancing efficiency, safety, comfort, macro/micro objectives), and model structures (constraints for safer learning, integration with trajectory planning). Among DRL methods, DQN is widely used for discrete actions (left/right/keep lane), with extensions DDQN and Dueling DQN mitigating overestimation and improving value decomposition. Studies demonstrate DRL’s superior performance over rule-based baselines in safety, efficiency, and sometimes fuel economy. DRL is applied to both high-level decisions and low-level control (coordinating longitudinal and lateral motions, multi-agent interactions, continuous control via DDPG). However, existing DRL research largely targets lane-based scenarios, often with idealized assumptions (static or constant-speed surrounding vehicles), lacking realistic dynamic interactions with HVs and generalization to non-lane-based areas. The broader impact of AV lateral strategies on mixed flow dynamics remains underexplored.
Methodology
The study formulates AV lateral motion decision-making in toll plaza diverging areas as a Markov decision process (MDP) and evaluates DQN, DDQN, and Dueling DQN for discrete selection of target toll lanes. Reinforcement learning maximizes the discounted cumulative reward with discount factor γ; deep networks approximate Q-values, and training is stabilized via experience replay and target networks. DDQN separates action selection from action evaluation to reduce overestimation; Dueling DQN decomposes Q into state-value and advantage components.

DRL settings: The action space comprises the feasible toll lanes matching the vehicle's ETC/MTC type and the current lane arrangement; selection is discrete per lane. The state space includes: (1) subject vehicle (SV) kinematics (x(t), y(t), v_x(t), v_y(t), a_x(t)); (2) the surrounding environment, represented by a rectangular influence area aligned with the SV's direction and divided into left-front, right-front, left-behind, and right-behind binary occupancy indicators (A1–A4); (3) path-based, lane-specific variables: the queue length Q_j(t) of toll lane j, the longitudinal distance L_j(t) the SV can travel along path j before encountering a vehicle or the toll lane, and β_j(t), the ratio of lateral to longitudinal distance required to reach lane j (penalizing aggressive lateral moves). The reward function is R = R1 + R2 + R3 + R4: R1 = −α1·L(t) incentivizes reducing the longitudinal distance to the toll lanes; R2 = α2(Q_pre − Q_new) rewards selecting lanes with shorter queues; R3 penalizes collisions (m3 if a collision occurs, 0 otherwise); R4 = −α3·β penalizes lateral motion that is aggressive relative to longitudinal progress.

Simulation platform: A two-dimensional microscopic simulation tailored to non-lane-based diverging areas implements a Perception–Decision–Action framework. Perception layer: vehicles (AVs and HVs) independently sense their own state, surrounding occupancy (A1–A4), and path-based variables (Qc, Lc, βc); AVs rely solely on onboard sensors (no V2X). Diverging paths are modeled as smooth cubic polynomials f(x) = ax^3 + bx^2 + cx + d defined by the current and previous positions and points on lane medians; vehicles within 0.8 m of a path line are considered on that path; L_j is measured to the nearest vehicle along the path, or to the toll lane if the path is empty. Decision layer: AV target-lane selection uses the DRL policy, while HV decisions are modeled by a calibrated multilayer perceptron. Action layer: car-following along selected paths uses an improved Full Velocity Difference (FVD) model suited to weak lane discipline, a_SV(t) = α{V[Δx_SV(t)] − V_SV(t)} − λ1·dθ/dt + λ2·dφ/dt, where θ (visual angle) and φ (offset angle) account for lateral offsets, and the optimized velocity is V[Δx_SV(t)] = V1 + V2·tanh(C1·b_n/tan φ_n + C2) with V1 = 6.75, V2 = 7.91, C1 = 0.13, C2 = 1.57. The leading vehicle is the one with the shortest longitudinal distance within ±1.5 m of the SV's path centerline; vehicles are 5 m long and 1.6 m wide. A virtual lead vehicle is placed at the target toll lane when none is present.
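To make the reward design concrete, the sketch below combines the four terms described above. It is a minimal illustration, not the authors' implementation: the weights α1, α2, α3 and the collision penalty m3 are placeholders, since the calibrated values are not reported in this summary.

```python
# Minimal sketch of the diverging-area reward R = R1 + R2 + R3 + R4.
# Coefficient values are illustrative placeholders, not the paper's calibrated ones.
ALPHA1, ALPHA2, ALPHA3 = 0.1, 0.5, 1.0   # assumed weights
M3 = -100.0                               # assumed collision penalty

def diverging_reward(long_dist_to_lane, queue_prev, queue_new,
                     collided, lat_over_long_ratio):
    """Reward for one AV decision step in the toll plaza diverging area.

    long_dist_to_lane    -- L(t): remaining longitudinal distance to the chosen toll lane
    queue_prev/queue_new -- queue length of the previously / newly selected lane
    collided             -- True if the SV collided during the step
    lat_over_long_ratio  -- beta: lateral distance / longitudinal distance to reach the lane
    """
    r1 = -ALPHA1 * long_dist_to_lane          # encourage progress toward the toll lanes
    r2 = ALPHA2 * (queue_prev - queue_new)    # reward switching to a shorter queue
    r3 = M3 if collided else 0.0              # heavy penalty on collision
    r4 = -ALPHA3 * lat_over_long_ratio        # discourage aggressive lateral motion
    return r1 + r2 + r3 + r4
```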
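The cubic diverging paths from the perception layer can similarly be illustrated. The sketch below assumes the path is defined by exactly four anchor points (previous and current SV positions plus two points on the target lane's median) and simply solves the resulting linear system with NumPy; it is an assumed construction, not the paper's exact procedure.

```python
import numpy as np

def fit_diverging_path(prev_pos, curr_pos, median_pt1, median_pt2):
    """Fit a cubic path f(x) = a*x**3 + b*x**2 + c*x + d through four anchor points.

    Each argument is an (x, y) pair: the SV's previous and current positions and
    two points on the target toll lane's median (assumed four-point construction).
    Returns the coefficients (a, b, c, d).
    """
    pts = np.array([prev_pos, curr_pos, median_pt1, median_pt2], dtype=float)
    x, y = pts[:, 0], pts[:, 1]
    # Vandermonde system: [x^3 x^2 x 1] @ [a b c d]^T = y
    A = np.vander(x, 4)                 # columns are x^3, x^2, x, 1
    a, b, c, d = np.linalg.solve(A, y)  # exact fit through the four points
    return a, b, c, d

# Example: path from the SV toward a toll lane median roughly 60 m downstream
coeffs = fit_diverging_path((0.0, 0.0), (5.0, 0.2), (40.0, 3.0), (60.0, 3.75))
```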
Study site and training environment: Real trajectories were collected by UAV (4K, 30 fps) at a TMTP diverging area on the G55 freeway in Changsha, China (145 m diverging length; 5 ETC lanes on the left, 3 MTC lanes on the right). The Python-based platform includes visualization, computation, and logging modules. Traffic flow is set to 1,500 veh/h with roughly twice as many ETC as MTC vehicles; the mainline has 3 lanes of 3.75 m each, and ETC/MTC toll lanes are 5 m wide. Initial speeds follow N(14.7, 3) m/s for ETC and N(12, 3) m/s for MTC; departure ratios across the main lanes are 1:2:1 for ETC and 1:2:4 for MTC. ETC vehicles pass the toll at 20 km/h; MTC vehicles stop for 15–20 s at 15 m inside the MTC lane. Both AVs and HVs may be ETC or MTC. The AV decision interval is 1 s, with action updates every 0.1 s. Training uses an ε-greedy policy (ε threshold 0.9), learning rate 0.001, replay memory of 20,000, batch size 128, γ = 0.98, hidden layers of 64 neurons each, and target-network updates every 500 steps; ETC AVs are trained for 20,000 episodes and MTC AVs for 15,000 episodes. Hardware and software: i7-12700H (2.30 GHz), 32 GB RAM, RTX 3070 Ti; Python 3.8, TensorFlow 2.6.0.
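The network and training settings can be made concrete with the following TensorFlow/Keras sketch of a Dueling DQN head using the reported hyperparameters (64-neuron hidden layers, learning rate 0.001, γ = 0.98). The state dimension, number of candidate toll lanes, number of hidden layers, optimizer, and loss are assumptions for illustration, and the replay/target-update loop is omitted; this is a sketch, not the authors' code.

```python
import tensorflow as tf
from tensorflow.keras import layers

STATE_DIM = 14      # assumed size of the SV + occupancy + path-variable state vector
N_LANES = 5         # assumed number of feasible toll lanes for an ETC vehicle
GAMMA = 0.98        # discount factor reported in the paper
LEARNING_RATE = 1e-3

def build_dueling_dqn():
    """Dueling DQN head: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    state_in = layers.Input(shape=(STATE_DIM,))
    h = layers.Dense(64, activation="relu")(state_in)   # hidden layers of 64 neurons
    h = layers.Dense(64, activation="relu")(h)
    value = layers.Dense(1)(h)                           # state-value stream V(s)
    advantage = layers.Dense(N_LANES)(h)                 # advantage stream A(s, a)
    q_values = value + advantage - tf.reduce_mean(advantage, axis=1, keepdims=True)
    model = tf.keras.Model(inputs=state_in, outputs=q_values)
    model.compile(optimizer=tf.keras.optimizers.Adam(LEARNING_RATE),
                  loss=tf.keras.losses.Huber())          # assumed optimizer and loss
    return model

online_net = build_dueling_dqn()
target_net = build_dueling_dqn()
target_net.set_weights(online_net.get_weights())   # target network synced every 500 steps
```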
Key Findings
Single-AV strategy performance: Across multiple training trials, DDQN and Dueling DQN learned faster than DQN, and Dueling DQN exhibited the most stable training. Because MTC vehicles traverse longer distances within the diverging area, their cumulative rewards during training were lower than those of ETC vehicles. Under varying hourly traffic flows (1,250, 1,500, 1,750, and 2,000 veh/h), AVs trained with the proposed DRL strategies achieved higher average diverging speeds than HVs for both ETC and MTC vehicle types, with tighter speed distributions and higher lower bounds, demonstrating the robustness of the lateral motion strategy across traffic levels. AVs trained via Dueling DQN attained the highest overall average speeds with lower variance than DDQN-trained AVs.

Mixed-flow penetration impacts: Using the self-efficient Dueling DQN strategy for AVs, increasing AV penetration from 0% to 100% yielded non-monotonic effects.

Efficiency: For ETC vehicles, average travel time decreased as AV penetration rose up to 70%, remained stable in the 70–80% range, and increased beyond 80%. MTC vehicles showed a similar trend with an earlier turning point (50–60%); when penetration exceeded 60%, average travel time increased slightly and then stabilized, likely because fewer accessible lanes limit the improvement potential.

Safety (PET conflicts): As penetration increased from 0% to 50%, conflict frequency in both PET intervals, [0,1] s and (1,2] s, decreased, reflecting more balanced lane selections than HVs' tendency to prefer lanes requiring shorter lateral distances. Beyond 50%, conflicts in the (1,2] s interval increased noticeably, and above 80% severe conflicts in the [0,1] s interval rose markedly. The reported conflict counts by penetration rate were:

AV penetration (%):        0      10      20      30      40      50      60      70      80      90     100
PET [0,1] s conflicts: 27745   26698   25772   25106   24451   24145   24310   24885   25332   25980   26075
PET (1,2] s conflicts: 22420   21895   21235   21003   20495   19997   20105   20295   20608   20998   21406

Overall, moderate AV penetration improves efficiency and safety, whereas excessive penetration of self-efficient AVs intensifies competition for limited road space and degrades operations.
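For context, post-encroachment time (PET) is a standard surrogate safety measure: the gap between the moment the first vehicle leaves a conflict point and the moment the second vehicle reaches it. The sketch below, a simplified illustration rather than the paper's evaluation code, bins conflicts into the two severity intervals used above.

```python
def pet_severity(first_exit_time, second_arrival_time):
    """Classify a conflict by post-encroachment time (PET).

    first_exit_time     -- time the first vehicle leaves the shared conflict point (s)
    second_arrival_time -- time the second vehicle reaches that point (s)
    Returns 'severe' for PET in [0, 1] s, 'moderate' for (1, 2] s, else None.
    """
    pet = second_arrival_time - first_exit_time
    if 0.0 <= pet <= 1.0:
        return "severe"      # counted in the [0, 1] s interval
    if 1.0 < pet <= 2.0:
        return "moderate"    # counted in the (1, 2] s interval
    return None              # PET > 2 s: not counted as a conflict here

# Example: the second vehicle arrives 0.8 s after the first leaves -> severe conflict
assert pet_severity(12.4, 13.2) == "severe"
```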
Discussion
The study addresses the challenge of AV decision-making in non-lane-based environments by providing a realistic microscopic simulation of destination-driven HV trajectories and tailoring DRL state and reward functions to the diverging area’s characteristics. The resulting self-efficient lateral motion strategies enable AVs to choose target toll lanes that balance queue length and required lateral maneuvering, improving single-vehicle efficiency relative to HVs under a range of traffic demands. The penetration analysis reveals system-level dynamics: modest adoption of self-efficient AVs promotes better throughput and fewer conflicts through more balanced lane selection, but high penetration leads to resource competition and increased crossing conflicts, reflecting the complex interactions in non-lane-based areas where lateral freedom can create multi-angle conflict points. These findings suggest that while self-efficient decision-making enhances individual performance, system-wide benefits depend on penetration and may require cooperative or system-aware strategies to sustain efficiency and safety at higher AV shares.
Conclusion
A DRL-based self-efficient lateral motion strategy for AVs was developed for toll plaza diverging areas, supported by a new microscopic simulation platform that reproduces weak-constraint HV behaviors to furnish realistic training interactions. Tailored state and reward designs guided AVs to select lanes effectively while penalizing aggressive lateral movements. Trained AVs achieved higher travel efficiency than HVs, and Dueling DQN provided the most stable and effective performance. Mixed-flow evaluations demonstrated that moderate AV penetration improves overall efficiency and safety, whereas excessive penetration of self-efficient AVs degrades both due to intensified competition for limited road resources. Future work should: (1) incorporate cooperative multi-AV strategies (e.g., information sharing and motion coordination) to enhance system-level safety and efficiency; (2) integrate longitudinal motion optimization (speed and acceleration) with lateral decisions to deliver comprehensive control for comfort and efficiency; and (3) expand calibration and validation across diverse non-lane-based geometries to assess generalizability.
Limitations
The simulation platform and model calibration are based on a single toll plaza diverging site; detailed calibration and accuracy are deferred to an accompanying manuscript, limiting immediate generalizability to other geometries without re-calibration. AVs rely solely on onboard sensing (no V2X) in this study, potentially constraining performance in scenarios where connectivity could aid decision-making. The DRL framework focuses on discrete target toll lane selection, with motion execution governed by rule-based car-following rather than end-to-end continuous control. The work primarily optimizes lateral decisions without co-optimizing longitudinal speed and acceleration, which may affect comfort and efficiency. While penetration effects are analyzed, cooperative behaviors among multiple AVs are not modeled; self-efficient strategies at high penetration lead to degraded system performance.