Reinforcement Learning Based Topology Control for UAV Networks

Engineering and Technology

T. Yoo, S. Lee, et al.

This research, conducted by Taehoon Yoo, Sangmin Lee, Kyeonghyun Yoo, and Hwangnam Kim, introduces a cutting-edge reinforcement learning system for optimizing UAV network connectivity. By focusing on UAV positions, interference, and energy use, this study reshapes network structure efficiently using DDPG, showcasing significant effectiveness across various topologies.
Introduction

Recent advances in UAV technology enable diverse civilian applications, including ad hoc communication infrastructure in areas without networks or in disasters. Swarming multiple UAVs can quickly provide low-cost networks, but designing an efficient topology under dynamic conditions is critical for scalability and performance. Conventional control often connects every UAV to all neighbors within range, which is robust but highly inefficient due to excessive overhead and power consumption, and prior power-control approaches can suffer from the scalability limits of a centralized controller. The authors explore a Minimum Spanning Tree (MST)-based framework: each UAV partitions its surrounding 3D space into subspaces, builds an MST per subspace, and aggregates them into a forest that reduces power consumption, maximizes throughput, and preserves connectivity. Two issues arise: optimally choosing the number of subspaces (which varies with time and local conditions), and expediting that decision. The authors use reinforcement learning (DDPG) to learn the optimal partitioning in a continuous action space and introduce a variable-step learning mechanism to shorten training time, especially upon new UAV deployments.

Contributions:
• Construct a UAV network topology leveraging MST to optimize energy consumption and network throughput while maintaining connectivity.
• Learn the optimal topology via RL (DDPG) in a continuous action space for spatial partitioning.
• Reduce RL training time by adaptively adjusting the number and size of learning steps.

The remainder of the paper is organized as follows: Section 2 surveys related work on UAV formation, DDPG, and the ADAM optimizer; Section 3 details the system design; Section 4 presents experiments; Section 5 concludes.

Literature Review

Related work spans UAV formation, RL algorithms (DDPG), and optimizers (ADAM). UAV formation and deployment affect network topology; studies address collision avoidance, multi-UAV flight information sharing, surveillance, and precise localization using deep learning, UWB, and RTK-GPS. Energy efficiency has been approached via fuel optimization, genetic algorithms for reconfiguration, coalition formation games, and data aggregation strategies; MST-based methods are particularly relevant for energy-efficient connectivity. DDPG extends DQN to continuous action domains and has proven effective in robotics with sparse rewards, autonomous driving, and power-grid voltage control, making it well suited to continuous 3D spatial partitioning. ADAM is a popular optimizer in RL; enhancements include Nesterov momentum, and its convergence guarantees, local-minima behavior, and variants have been studied. Together, this body of work motivates applying DDPG for continuous topology control and ADAM-based training with modifications for faster convergence.

Methodology

The system design comprises three modules: (1) network topology control, (2) reinforcement learning (RL) with DDPG, and (3) step optimization for training efficiency.

1) Network topology control: Each UAV updates neighbor positions via periodic beacons, computes distances, partitions its transmission range (3D space) into multiple subspaces, and selects the nearest neighbor within each subspace to form links. Within each partition, an MST-based topology is constructed; aggregating the partitions yields a forest that removes unnecessary links and overhead while preserving connectivity. The state metrics extracted from the resulting topology are the average hop count (T_hop), the average link power/energy consumption (T_power, computed from distances and radio parameters), and the average node degree (T_degree).

2) Reinforcement learning (DDPG): The action is the number of sectors (subspaces) into which a UAV partitions its 3D neighborhood. The state is S_TOPO,t = [T_hop,t, T_power,t, T_degree,t]. The agent (one per UAV) uses an actor network μ(S|θ_μ) to output deterministic actions with exploration noise; a critic Q(S, A|θ_Q) evaluates the action value. Experiences (S_t, A_t, R_t, S_{t+1}) are stored in a replay buffer. Target values are computed as y_TOPO,t = R_TOPO,t + γ Q(S_{t+1}, μ(S_{t+1}|θ_μ)|θ_Q). The critic minimizes L_TOPO = (1/N) Σ_t (y_TOPO,t − Q(S_t, A_t|θ_Q))². The actor is updated via the sampled policy gradient ∇_{θ_μ} J ≈ (1/N) Σ_t ∇_A Q(S, A|θ_Q)|_{A=μ(S_t)} ∇_{θ_μ} μ(S_t|θ_μ). Target networks are updated softly with rate τ: θ_Q′ ← τ θ_Q + (1 − τ) θ_Q′ and θ_μ′ ← τ θ_μ + (1 − τ) θ_μ′.

3) Step optimization: Because the number of training steps required varies with the topology, the method adaptively adjusts episode length and action step size. Once a step-count threshold H_done is exceeded, the average reward F_avg over recent steps is computed; if F_avg exceeds H_learn (near the upper reward bound), the episode terminates early. Otherwise, the action step size A_size is increased by A_change and training continues. This reduces training time and increases responsiveness to topology changes, particularly when new UAVs join.

Environment and RL setup: The simulator uses a 1000 m cubic space, carrier frequency 2.5 GHz, transmit power 20 dBm, antenna gain 2.5 dBi, and receive threshold −70 dBm. Training initially runs 100 episodes with up to 1000 steps each (before step optimization is applied). The discount factor is γ = 0.99, the soft-update rate τ = 0.005, the actor learning rate 0.001, the critic learning rate 0.002, the replay buffer size 50,000, and the minibatch size 64. Reward: if the topology disconnects, the agent receives R_disconnect (no reward; it is penalized and explores another action). If the action exceeds A_border (20), the negative reward R_out is assigned. Otherwise, R_TOPO,t = H_hop·T_hop,t + H_power·T_power,t + H_degree·T_degree,t, with the weights chosen to balance connectivity, energy, and overhead. Actions influence the number of one-hop links and thus the hop, power, and degree metrics, enabling closed-loop learning of optimal partitioning.
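The per-UAV link-selection step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it partitions the neighborhood by azimuth angle only (the paper partitions full 3D space) and keeps the nearest neighbor per sector; the function name and signature are hypothetical.

```python
import math

def partition_and_link(self_pos, neighbors, n_sectors):
    """Pick the nearest neighbor in each angular sector around this UAV.

    self_pos: (x, y, z) of this UAV; neighbors: list of (x, y, z).
    n_sectors: the action chosen by the RL agent (number of subspaces).
    Sectoring by azimuth only is a simplification of the paper's 3D partition.
    Returns the positions of the neighbors this UAV links to.
    """
    best = {}  # sector index -> (distance, neighbor position)
    for n in neighbors:
        dx, dy, dz = (n[i] - self_pos[i] for i in range(3))
        dist = math.sqrt(dx * dx + dy * dy + dz * dz)
        azimuth = math.atan2(dy, dx) % (2 * math.pi)
        sector = int(azimuth / (2 * math.pi / n_sectors))
        # Keep only the closest neighbor seen so far in this sector.
        if sector not in best or dist < best[sector][0]:
            best[sector] = (dist, n)
    return [n for _, n in best.values()]
```

Keeping one short link per sector is what prunes the dense "connect to everyone in range" topology down toward an MST-like forest.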
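The DDPG target and soft-update equations can be written down directly. The sketch below uses plain NumPy arrays in place of actual network parameters; `next_q` stands for the target critic's estimate at the target actor's next action, which in a full implementation would come from the networks themselves.

```python
import numpy as np

def critic_targets(rewards, next_q, gamma=0.99):
    """y_TOPO,t = R_TOPO,t + gamma * Q(S_{t+1}, mu(S_{t+1}|theta_mu)|theta_Q)
    for a sampled minibatch (next_q is the target networks' estimate)."""
    return rewards + gamma * next_q

def soft_update(theta_target, theta, tau=0.005):
    """theta' <- tau * theta + (1 - tau) * theta', the DDPG soft target update."""
    return tau * theta + (1 - tau) * theta_target
```

With the paper's τ = 0.005, the target networks track the learned networks slowly, which stabilizes the bootstrapped critic targets.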
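The step-optimization rule (terminate early when F_avg exceeds H_learn, otherwise grow A_size by A_change) can be sketched as a per-step check. The averaging window length and the (done, step_size) return convention here are assumptions, not from the paper.

```python
def maybe_stop_or_grow(step, rewards, h_done, h_learn, a_size, a_change, window=50):
    """Variable-step control, called after each training step.

    Once more than h_done steps have run, compare the recent average reward
    F_avg to the threshold h_learn: end the episode early if learning has
    converged, otherwise enlarge the action step size by a_change.
    Returns (episode_done, new_step_size).
    """
    if step <= h_done:
        return False, a_size                 # keep training with the current step size
    f_avg = sum(rewards[-window:]) / min(window, len(rewards))
    if f_avg > h_learn:
        return True, a_size                  # near the reward upper bound: stop early
    return False, a_size + a_change          # not converged: take larger action steps
```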
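The three-case reward can likewise be expressed in a few lines. The weight and penalty values below are placeholders for illustration; the paper specifies A_border = 20 but not the numeric weights.

```python
def topology_reward(t_hop, t_power, t_degree, connected, action,
                    h_hop=-1.0, h_power=-1.0, h_degree=-1.0,
                    a_border=20, r_disconnect=-100.0, r_out=-100.0):
    """R_TOPO,t = H_hop*T_hop,t + H_power*T_power,t + H_degree*T_degree,t,
    with the paper's special cases for disconnection and out-of-range actions.
    Weight and penalty magnitudes here are assumed, not from the paper."""
    if not connected:
        return r_disconnect          # disconnected topology: penalize and re-explore
    if action > a_border:
        return r_out                 # action above A_border: negative reward
    return h_hop * t_hop + h_power * t_power + h_degree * t_degree
```

Negative weights on hop count, power, and degree make the agent prefer topologies that are short-path, low-power, and sparse, while the disconnection penalty enforces connectivity.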

Key Findings
• Typical formations (30 UAVs in sphere, cube, and pyramid shapes): After training, unnecessary links were pruned and stable MST-based connectivity was maintained. Subspaces per node stabilized at roughly 7–10; episodic rewards converged to near-optimal values before 10 episodes in many cases, with brief exploration-caused dips around episode 50.
• Multiple-node analysis: Five sampled nodes in a 30-UAV network showed learned sector counts averaging 7–10; high rewards were reached after about 60 episodes despite positional changes each episode.
• Arbitrary formations and scalability (10, 20, 30, 40, 50 UAVs): For 10 UAVs, the optimal number of subspaces was about 12 (with a broader search range of 4–18); for 20 or more UAVs, it was about 6–10. Reward convergence occurred before 10 episodes for fleets of 30 or more and by about 40 episodes for fleets of 20 or fewer. Post-training topologies maintained connectivity without isolated UAVs and with substantially fewer links than full connections.
• Step function optimization: Variable-step algorithms substantially reduced the number of training steps per episode: about 80% when changing the number of steps, 85% when changing step size, and about 86% when combining both, while achieving episodic rewards comparable to or higher than the baseline fixed-step DDPG. As learning converged, fewer steps were needed per episode, increasing responsiveness.
• Simulation/visual validation: OpenCV/MATLAB-based simulations showed consistent results across formations; a video demonstration confirmed that optimal topologies were maintained as formations changed; AirSim scenarios with 10 UAVs in urban-like environments showed the method sustaining optimal connectivity during flight.
Discussion

The study tackles the core challenge of scalable, energy-efficient, and throughput-aware topology control for UAV swarms by combining MST-based structural pruning with RL-driven spatial partitioning. Results indicate that learning the number of 3D subspaces per UAV effectively governs link density and distribution, improving energy efficiency (shorter links, reduced overhead) while maintaining full connectivity. Rapid convergence (often <10 episodes for larger fleets) and substantial reductions in training steps via step optimization demonstrate practicality for dynamic, real-time operations where UAV positions change frequently. Scalability is evidenced up to 50 UAVs with consistent performance; the method adapts sector counts based on density (more sectors for sparse, fewer for dense). The approach is relevant to UAV-assisted networks for disaster relief and urban missions, as it reduces unnecessary links and supports multi-hop connectivity without centralized control. The AirSim validation further suggests robustness in realistic environments, reinforcing feasibility for deployment and paving the way for integration with more advanced RL architectures.

Conclusion

The paper presents an RL-based topology control system for UAV networks that partitions each UAV’s 3D neighborhood and forms MST-based links per partition. Using DDPG to learn the number of partitions yields adaptive, efficient topologies that balance energy consumption, throughput, and connectivity. A step optimization mechanism further accelerates training by dynamically adjusting step count and action step size, enhancing responsiveness to topology changes. Simulations across typical and random formations, with fleet sizes up to 50 UAVs, show rapid convergence, sustained connectivity, and significantly reduced training steps. Future work includes exploring alternative or combined deep RL algorithms, richer state/action/reward designs, optimal blending of multiple learning strategies, and more sophisticated variable-step modeling to further reduce training time and improve adaptability.

Limitations

The paper does not explicitly enumerate limitations. Validation is primarily simulation-based (Python/MATLAB, OpenCV) and in the AirSim simulator; results from real-world multi-UAV flight experiments are not reported. Quantitative network-layer metrics (e.g., throughput, latency) are discussed qualitatively via reward components rather than detailed empirical measurements; sensitivity to reward weights (H_hop, H_power, H_degree) and hyperparameters (e.g., A_border, step optimization thresholds) is not deeply analyzed.
