Introduction
Thermal soaring, a technique used by birds and glider pilots, exploits rising columns of warm air (updrafts) to achieve efficient long-distance flight, even in challenging wind conditions. This makes it a compelling model system for studying motion control, and how it is learned, in both animals and autonomous systems. Thermal soaring is a rich dynamical problem with non-trivial constraints, yet it involves relatively few control parameters compared to other forms of locomotion. It is also comparatively simple to model (there are no contact forces, and reliable aerodynamic models exist) and increasingly accessible experimentally (through bird flight data and affordable AI-controlled gliders), further enhancing its suitability as a model problem. While rule-based algorithms have achieved autonomous thermal soaring, they are limited in their ability to discover novel solutions or to advance our understanding of motion control as a learning problem. Machine learning, particularly reinforcement learning (RL), offers a powerful alternative: an RL agent learns to maximize a reward function through interaction with its environment, acquiring a policy that maps observed states to actions. This allows exploration and discovery of strategies that may surpass human-designed ones. Previous RL applications to thermal soaring, while successful, used simpler architectures and environments. This study uses deep reinforcement learning (deep-RL) with deep neural networks (NNs) to train an agent to soar in a simulated environment with challenging horizontal winds, a more realistic scenario than previously studied. The goal is to investigate the learning process, identify potential bottlenecks, and compare the learned behavior with that of real soaring birds, specifically vultures.
Literature Review
Recent studies have implemented autonomous thermal soaring using a variety of methods. Rule-based algorithms, while effective, rely on pre-programmed rules and state estimators, which limits their capacity for novel solutions. Machine learning, specifically reinforcement learning (RL), offers a more flexible approach. Early applications of RL to thermal soaring used simple architectures, such as lookup tables trained with the SARSA algorithm, in environments without significant wind. While effective there, these methods are likely to perform poorly under more challenging conditions, such as strong horizontal winds. Deep neural networks (DNNs), coupled with advanced RL algorithms such as actor-critic and policy-gradient methods, have shown promise in complex motion-control tasks. Studies have demonstrated successful application of deep-RL to autonomous thermal soaring in both simulated and real-world glider settings, exploring individual thermal exploitation, cross-country soaring strategies that balance local exploitation against global way-point navigation, and the exploration-exploitation dilemma. However, a deeper analysis of the learning process itself, including potential learning bottlenecks, the robustness of learned policies, and comparison with natural soaring behavior, remains an open problem. This study aims to address these questions using a simulation-based deep-RL approach.
Methodology
This research employed a simulation-based deep-RL system to train a neural-network agent to perform thermal soaring under horizontal wind. The simulation reduced the glider dynamics to three translational degrees of freedom (x, y, z), representing the glider as a point mass subject to lift, drag, and side forces, computed from a simplified aerodynamic model as functions of airspeed, angle of attack, and sideslip angle. The atmospheric model included a constant horizontal wind (u) in the +x direction and a thermal updraft with a radial profile built from a combination of Gedeon's and Lenschow's models. The agent's state comprised glider speed, climb rate, bank angle, angle of attack, wind speed, and the angle (θ) between the glider's velocity and the wind, together with a memory buffer of previous states; its actions were changes in bank angle and angle of attack. The reward primarily incentivized climb rate, with additional penalties for instability and for distance from the thermal center. Curriculum learning was used, gradually increasing the maximum horizontal wind speed during training. The deep deterministic policy gradient (DDPG) algorithm, an actor-critic method, trained the networks, with hyperparameter optimization used to select the training settings. To analyze the learned policy and the network itself, trajectory analysis and clustering of neural activations were employed. Performance was evaluated with a newly defined soaring-efficiency metric (η), quantifying the fraction of the available updraft that the glider exploits. Robustness was assessed under various wind speeds, thermal parameters, and sensor-noise levels. Finally, the learned policy was compared with data from free-ranging vultures to identify similarities in soaring technique.
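The exact forms of the combined Gedeon-Lenschow updraft and of the shaped reward are not given above, so the following is a minimal Python sketch under assumptions: a bell-shaped radial updraft in the spirit of Gedeon's model, and a reward that combines climb rate with hypothetical penalty weights (`k_stab`, `k_dist`) for instability and distance from the thermal center. All function names, coefficients, and numerical values are illustrative, not the study's.

```python
import numpy as np

def updraft(r, w_core=3.0, r_thermal=60.0):
    """Bell-shaped radial updraft profile in the spirit of Gedeon's model;
    w_core [m/s] and r_thermal [m] are illustrative, not the study's values."""
    s = (r / r_thermal) ** 2
    return w_core * np.exp(-s) * (1.0 - s)

def shaped_reward(climb_rate, roll_rate, dist_to_core, k_stab=0.1, k_dist=0.01):
    """Climb-rate reward with hypothetical penalty weights for instability
    (proxied here by roll rate) and for distance from the thermal center."""
    return climb_rate - k_stab * abs(roll_rate) - k_dist * dist_to_core

# Example step: glider 40 m from the core, climbing at 1.2 m/s, rolling slowly.
r = 40.0
print(f"local updraft: {updraft(r):.2f} m/s")
print(f"shaped reward: {shaped_reward(1.2, 0.05, r):.2f}")
```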
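The wind-speed curriculum can likewise be illustrated with a simple per-episode schedule; the linear ramp below is a hypothetical sketch, since the study's actual schedule and limits are not specified here.

```python
import numpy as np

def max_wind(episode, n_warmup=500, u_final=8.0):
    """Hypothetical curriculum: the wind-speed cap grows linearly from 0 to
    u_final [m/s] over the first n_warmup training episodes, then stays fixed."""
    return u_final * min(1.0, episode / n_warmup)

rng = np.random.default_rng(0)
for episode in range(1000):
    u = rng.uniform(0.0, max_wind(episode))  # horizontal wind drawn for this episode
    # env.reset(wind_speed=u); collect one episode and update the DDPG agent ...
```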
Key Findings
The study revealed several key findings:

1. **Learning Bottlenecks:** Thermal soaring presented at least two learning bottlenecks: achieving stable flight and staying close to the thermal center. Reward shaping, by adding penalties for instability and for distance from the thermal, effectively mitigated these bottlenecks and enabled efficient soaring.
2. **State and Action Representation:** Analysis of different state representations revealed the crucial role of wind speed (u) and of the angle (θ) between glider velocity and wind in efficient soaring. A memory buffer of approximately 5 s was necessary for optimal performance, and controlling both bank angle and angle of attack was essential for stable thermalling in strong winds.
3. **Robustness:** The learned policy was robust across a range of thermal parameters and sensor-noise levels. Sensitivity analysis indicated that the angle (θ) was the most sensitive sensor, highlighting its importance for efficient soaring.
4. **Neural Network Analysis:** Clustering of neural activation patterns revealed distinct functional clusters associated with specific phases of the circling flight. These clusters became more defined with training and correlated with the angle (θ), mirroring observed differences in soaring behavior between young and adult vultures.
5. **Comparison with Vultures:** Real vulture flight data showed striking similarities in thermalling trajectories, circling radii, and the distribution of the angle (θ), supporting the biological plausibility of the learned policy. The efficiency metric (η) allowed a quantitative comparison between the RL agent and the vultures: for *u* = 3 m/s, omitting the first 20 s of each trajectory, the mean *v<sub>z</sub>* reached 0.54 m/s (maximum 0.67 m/s) and the mean η reached 0.88 (maximum 0.96). One possible form of η is sketched below.
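The precise definition of η is not reproduced in this summary; one plausible reading, sketched below, treats it as the achieved climb rate divided by the updraft strength available at the glider's position, averaged over the trajectory after discarding an initial transient (here 20 s, matching the evaluation above). The function name and the definition itself are assumptions, not the study's exact formula.

```python
import numpy as np

def soaring_efficiency(v_z, w_available, t, t_skip=20.0):
    """Assumed form of eta: mean fraction of the locally available updraft
    converted into climb rate, after discarding an initial transient."""
    mask = t >= t_skip
    eta = v_z[mask] / np.maximum(w_available[mask], 1e-6)
    return float(np.clip(eta, 0.0, 1.0).mean())

# Toy trajectory: 60 s at 1 Hz with a hypothetical uniform 3 m/s updraft.
t = np.arange(60.0)
w_avail = np.full_like(t, 3.0)
v_z = np.full_like(t, 2.6)
print(f"eta = {soaring_efficiency(v_z, w_avail, t):.2f}")   # ~0.87
```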
Discussion
The results provide insight into how a complex motion-control task like thermal soaring is learned. The identification of learning bottlenecks emphasizes the importance of reward shaping and curriculum learning when training agents on complex tasks. The importance of wind speed (u) and the angle (θ) in the agent's state highlights the role of environmental sensing in efficient soaring. The robustness analysis underscores the learned policy's capacity to generalize to unseen conditions, and the neural-network analysis offers a window into the internal representation and functional organization of the policy. The close correspondence between the simulated agent's behavior and the flight patterns of real vultures suggests that the learned policy captures key aspects of the natural soaring strategy. Although the penalty based on distance from the thermal center is only applicable in simulation, pre-training agents with this penalty may improve their ability to estimate thermal position in the real world, helping to bridge the simulation-to-reality gap. Future work could explore more complex environments, including multiple thermals, variable wind conditions, and more detailed glider models; integrating additional sensory modalities, such as vision, could also enhance performance.
Conclusion
This study demonstrates the successful application of deep reinforcement learning to autonomous thermal soaring under challenging wind conditions. The newly defined efficiency metric, together with the analysis of learning bottlenecks and neural activation patterns, provided new insight into the learning process and the nature of the underlying control policy. The close correspondence between the agent's behavior and that of real vultures validates the model and suggests applications in improving autonomous UAVs and in furthering our understanding of animal behavior. Future research could extend the model to more complex environments and incorporate additional sensory inputs to enhance robustness and adaptability.
Limitations
The study is based on a simplified simulation model of glider dynamics and atmospheric conditions. The simplified thermal model may not fully capture the complexity of real-world thermals. The reliance on a reward penalty for distance from the thermal center, while helpful during training, is not directly applicable in real-world scenarios where this information is not readily available. The analysis of the neural network focused on clustering of activation patterns; further investigations using more sophisticated interpretation methods could provide more detailed insights. The comparison to vulture data, while insightful, involved some approximations in estimating wind speed and bank angle from the available data.