Deep Reinforcement Learning-Based Dynamic Pricing for Parking Solutions

V. Bui, S. Zarrabian, et al.

Discover how a groundbreaking deep reinforcement learning-based dynamic pricing model can revolutionize parking utilization and alleviate urban traffic congestion. This innovative research was conducted by a team from Multimedia University, demonstrating the power of real-time data to adjust prices and enhance parking efficiency.

Introduction
The study addresses increasing urban vehicle ownership and its negative impacts, including time spent searching for parking (about 25 minutes per day in some urban areas), increased fuel consumption, CO2 emissions, and congestion. Smart parking systems using real-time data (sensors, ticketing/e-payment) enable dynamic pricing to improve utilisation and revenue. The research proposes a deep reinforcement learning-based dynamic pricing (DRL-DP) model to regulate parking prices based on vehicle volume and occupancy, without requiring labeled data. The model aims to sequentially select prices to reduce peak-hour congestion, distribute demand to off-peak times (e.g., via discounts/cashback), and maximise vendor revenue and space utilisation. The paper outlines related work, the proposed method, experiments, and future directions.
Literature Review
Two strands are reviewed:
- Dynamic pricing: Prior work in e-commerce and competitive markets explored time-based pricing, market segmentation, RL for multi-agent pricing, genetic algorithms for bargaining, and game-theoretic models addressing demand uncertainty, strategic consumers, and regret bounds. Most studies focus on financial and e-commerce contexts, with limited integration of AI beyond online retail platforms.
- Smart parking: Reviews of smart parking solutions categorize research into data gathering, system implementation, and service diffusion. Prior systems include reservation and pricing frameworks (e.g., iParker), crowdsourced matching and pricing (ParkForU), event-driven space allocation, ridesharing/carpooling matching, dynamic pricing tied to arrival time, zone-based policies to encourage short/mid-term stays, and data-driven availability prediction (with data quality and quantity issues). Some studies model macroscopic pricing decisions and reservation-based dynamic pricing to maximize revenue.
The gap identified is combining dynamic pricing with RL to regulate arrivals and occupancy using real-time data and forecasting within competitive multi-operator environments.
Methodology
The proposed DRL-DP framework models a competitive parking market in which a target parking operator (the player) competes with nearby operators (opponents). Key components:
- Environment and MDP: The environment is a sequential decision-making process formulated as a (partially observable) Markov Decision Process. The state includes the time step and the occupancy of the player and the opponent; the action is a discount applied to current pricing; rewards consider occupancy rate, revenue, and vehicle flow regulation relative to target smoothed arrivals.
- Data and forecasting: Real in/out flows were collected via IoT devices on barrier gates with ANPR cameras at two Kuala Lumpur locations (Location A, capacity 165; Location B, capacity 950) over roughly six months (Sep 2020–Jan 2021). SARIMAX forecasts hourly arrivals from historical occupancy (capturing seasonality and exogenous factors) for both player and opponent (a sketch of this step follows the list). Vehicle stay-time distributions (mostly 1–5 hours) are extracted to compute fees.
- Vehicle flow regulation: A simple moving average (SMA) smooths historical arrival rates to define target flows; rewards increase as regulated flows approach the SMA targets, dispersing demand from peak to off-peak periods.
- Modules:
  1) Parking Operator: capacity, location, and pricing policy.
  2) Driver (commuter, frequent, or intermittent): has preferences over price and distance (normalized 0–1) and an aversion to high occupancy; decides among operators based on a preference score that combines distance, price (computed by a pricing engine per policy: arrival-time-dependent pricing, usage-aware/progressive pricing, or flat rate), and occupancy (an illustrative scoring sketch follows the list).
  3) Grid: a 2D map with locations, road information, and distances.
  4) Pricing engine: generates new prices per scheme and applied discount, consuming availability and forecast information.
  5) Reward module: computes hourly and episodic rewards (normalized) from occupancy, revenue, and vehicle-flow-regulation outcomes.
- RL agent and learning: Q-learning with epsilon-greedy exploration learns the discount actions. Episodes last one week (seven days), with the global reward reset per episode; training repeats for 3000 episodes. Rewards are aggregated hourly and over period sections.
- Deep learning model: A DNN serves as the function approximator for action selection from normalized inputs: the time step and the occupancy of player and opponent. Architecture: input array [time, occupancy_player, occupancy_opponent]; three hidden layers of 32 ReLU units; an output layer with 5 units corresponding to the discrete actions in the action space (a sketch follows the list). The DNN is updated every 200 episodes; epsilon encourages exploration.
- Pricing policies and action spaces: Simulated policies include Uniform Arrivals (time-of-day flat fees) and Progressive Pricing (piecewise hourly fees). Action spaces comprise five discrete discount/markup levels, e.g., [0, −5, −10, −15, −20], [−10, −5, 0, 5, 10], or [−20, −10, 0, 10, 20] (currency units), enabling different adjustment magnitudes.
- Assumptions: Weekly episodes; operators consider utilisation, driver satisfaction, and revenue; the pricing engine computes fees from arrival time and stay time; market demand is derived from past flows; rewards are normalized over period sections to form the global reward.
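To make the forecasting and flow-regulation steps concrete, here is a minimal Python sketch, assuming hourly arrival counts are available as a pandas Series and using the SARIMAX implementation from statsmodels. The model orders, the 24-hour seasonal period, and the smoothing window are illustrative assumptions, not the values fitted in the paper.

```python
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX


def forecast_hourly_arrivals(arrivals: pd.Series, horizon: int = 24) -> pd.Series:
    """Fit SARIMAX on historical hourly arrivals and forecast the next `horizon` hours."""
    model = SARIMAX(
        arrivals,
        order=(1, 0, 1),               # (p, d, q): assumed, not reported in the paper
        seasonal_order=(1, 1, 1, 24),  # daily seasonality for hourly data
    )
    fitted = model.fit(disp=False)
    return fitted.forecast(steps=horizon)


def target_flow_sma(arrivals: pd.Series, window: int = 24) -> pd.Series:
    """Simple moving average of historical arrivals; the vehicle-flow-regulation
    reward grows as the regulated flows approach this smoothed target."""
    return arrivals.rolling(window=window, min_periods=1).mean()
```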
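The driver module's choice rule can be illustrated as follows. The paper combines normalized distance, price, and occupancy into a preference score, but the exact functional form is not reproduced here, so a simple weighted sum with hypothetical equal weights is assumed.

```python
from dataclasses import dataclass


@dataclass
class OperatorOffer:
    price: float      # fee from the pricing engine, normalized to [0, 1]
    distance: float   # distance to the driver's destination, normalized to [0, 1]
    occupancy: float  # current occupancy rate in [0, 1]


def preference_score(offer: OperatorOffer,
                     w_price: float = 1.0,
                     w_distance: float = 1.0,
                     w_occupancy: float = 1.0) -> float:
    """Lower price, shorter distance, and lower occupancy all raise the score.
    The weights and the linear form are hypothetical placeholders."""
    return -(w_price * offer.price
             + w_distance * offer.distance
             + w_occupancy * offer.occupancy)


def choose_operator(offers: dict[str, OperatorOffer]) -> str:
    """Each simulated driver picks the operator with the highest preference score."""
    return max(offers, key=lambda name: preference_score(offers[name]))
```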
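The described function approximator (inputs [time, occupancy_player, occupancy_opponent], three hidden layers of 32 ReLU units, five outputs matching the discrete action space) maps to the Keras sketch below. The action space shown is one of those listed above; the optimizer and loss are assumptions, and training details such as the 200-episode update schedule are omitted.

```python
import numpy as np
import tensorflow as tf

# One of the discount action spaces described in the paper (currency units).
ACTIONS = [0, -5, -10, -15, -20]


def build_q_network(n_actions: int = len(ACTIONS)) -> tf.keras.Model:
    """DNN mapping [time, occupancy_player, occupancy_opponent] to one Q-value per action."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(3,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(n_actions),
    ])
    model.compile(optimizer="adam", loss="mse")  # optimizer/loss are assumptions
    return model


def select_discount(model: tf.keras.Model, state: np.ndarray, epsilon: float) -> int:
    """Epsilon-greedy selection: explore a random discount with probability epsilon,
    otherwise exploit the current Q-value estimates."""
    if np.random.rand() < epsilon:
        action_index = np.random.randint(len(ACTIONS))
    else:
        q_values = model.predict(state[np.newaxis, :], verbose=0)
        action_index = int(np.argmax(q_values[0]))
    return ACTIONS[action_index]
```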
Key Findings
- The DRL-DP approach can learn effective dynamic pricing policies in a competitive multi-operator environment using real data and SARIMAX forecasts, improving revenue, occupancy, and smoothing of vehicle arrivals depending on the reward design.
- Convergence: Average reward fluctuates substantially in early episodes (roughly episodes 1–100) and tends to converge towards an optimal value around 2500 episodes as exploration decays and exploitation stabilizes. Training runs for 3000 episodes per experiment.
- Occupancy-only reward: Quickly achieves high apparent accuracy in early episodes because always discounting fills capacity; however, this sacrifices revenue and leads to trivial policies in which discounts dominate regardless of financial outcomes.
- Revenue-only reward: Generally increases rewards across most experiments; where the player already has a low-priced (highly competitive) scheme and uses an action space with large steps (e.g., [−20, −10, 0, 10, 20]), the agent struggles to improve revenue further, since discounts reduce income while markups erode the competitive advantage.
- Vehicle Flow Regulation (VFR) reward: Harder to optimize than occupancy or revenue. When the action space only allows discounts (e.g., [0, −5, −10, −15, −20]) and the player is already cheaper, the agent cannot decrease arrivals to meet smoothing targets; rewards can improve in more balanced competitive settings.
- Unified reward (occupancy + revenue + VFR): Accuracy is lower due to competing objectives, but the agent still achieves higher overall rewards during training, indicating feasible multi-objective trade-offs.
- Action space sensitivity: Wider adjustment ranges (e.g., [−20, −10, 0, 10, 20]) increase reward variance and can shift driver preferences more strongly; with uniform arrival rates, narrower, more precise action steps are preferable for fine-grained optimization.
- Data and context: The two Kuala Lumpur locations (capacities 165 and 950) show weekday peaks and weekend troughs; vehicle stay times are mostly 1–5 hours, which influences pricing and revenue calculations.
Discussion
The findings indicate that reinforcement learning can effectively regulate parking prices to manage demand and revenue when reward design aligns with operational goals. Exploration helps avoid short-term myopic choices and supports convergence to better long-term policies. The effect of action-space granularity is pronounced: coarse adjustments can destabilize learning or prevent nuanced competition management, especially when the player is already price-competitive. For revenue goals, low-priced vendors may require permissible price increases in the action set to escape low-revenue equilibria; otherwise, discounts only increase occupancy without boosting revenue. Vehicle flow regulation is inherently more challenging due to competitive asymmetries and limited levers (e.g., only discounts), but multi-objective training shows the feasibility of balancing utilisation, revenue, and congestion mitigation. Overall, the DRL-DP framework demonstrates the capacity to shift demand away from peaks, increase off-peak utilisation, and improve profitability under realistic constraints and competition.
Conclusion
The paper proposes and evaluates a deep reinforcement learning-based dynamic pricing framework for off-street parking, integrating SARIMAX arrival forecasting, a driver choice model, and a modular environment (grid, pricing engine, reward). Simulations using real data from two urban parking facilities show that DRL-DP can learn policies that enhance revenue, adjust occupancy, and smooth demand over time, particularly when action spaces and reward functions are well-aligned with objectives. Future research should incorporate on-street competition, building/facility effects, and richer human factors in driver preferences. Further, modelling the relationship between road vehicle volume and parking arrivals and enabling more flexible, context-adaptive environment parameters could improve realism and the robustness of learned policies.
Limitations
- The model does not account for on-street parking, which can significantly affect willingness to use off-street facilities due to price differentials.
- Facility/building-specific effects (e.g., a tenant or shop destination within an opponent's building) are not modeled and may bias driver choices.
- Human factors beyond price, distance, and occupancy (e.g., habitual preferences, safety, amenities, and land-use type such as tourism/office/residential) are simplified.
- Competitive asymmetries and limited action spaces (e.g., discount-only) can limit the agent's ability to regulate flows or improve revenue.
- Partial observability: Operators lack access to full traffic information and true arrival rates; reliance on forecasts introduces uncertainty.
- Results are simulation-based; generalizability to broader urban contexts requires further validation with diverse datasets and real-world trials.