
Revealing principles of autonomous thermal soaring in windy conditions using vulture-inspired deep reinforcement-learning
Y. Flato, R. Harel, et al.
Thermal soaring, used by birds and gliders to gain altitude in updrafts of warm air, serves here as a model problem for studying motion-control learning. In this study, Yoav Flato, Roi Harel, Aviv Tamar, Ran Nathan, and Tsevi Beatus train deep reinforcement-learning agents to soar in horizontal wind, introduce an efficiency metric for thermalling performance, examine how the trained networks' function evolves over learning, and compare the learned behavior with that of free-ranging vultures.
Introduction
Thermal soaring allows birds and gliders to exploit localized convective updrafts for efficient flight, even under challenging horizontal wind conditions. As a model for motion control, it combines rich dynamics and constraints with relatively few control parameters and is increasingly accessible for empirical study. Prior autonomous soaring approaches largely relied on rule-based control and state estimation, which, while effective, limit discovery of novel strategies and insights into learning-based control. Reinforcement learning (RL) offers a framework where an agent learns a policy to maximize rewards from observations through exploration, potentially surpassing human-designed heuristics. This study aims to use deep RL to learn autonomous thermal soaring under realistic horizontal winds, investigate the learning process and its bottlenecks, quantify robustness with a new efficiency metric, analyze the internal neural network representations underlying behavior, and compare learned strategies with the thermalling behavior of free-ranging vultures.
Literature Review
Early autonomous thermalling used rule-based controllers with state estimation of thermal centers and glider energy state (e.g., Allen, Andersson, Depenbusch, Tabor and others), effective but limited in exploring new strategies. Initial RL applications included tabular SARSA agents controlling bank angle based on discretized state (vertical speed and bank) in low-wind settings (Wharington; Reddy et al., including a real glider). With deep neural networks and modern policy-gradient/actor-critic methods, deep RL has achieved complex motion control and has been applied to soaring: Novati et al. demonstrated deep-RL control for gliding and landing in complex flows; Notter et al. implemented deep-RL with LSTM for both locating/exploiting thermals with winds and hierarchical cross-country decision-making; Cui et al. addressed exploration–exploitation with energy-aware deep RL in simulation. These successes motivate using deep RL not only for control performance but also to probe fundamental questions of learning structure, sensing/actuation importance, robustness, interpretability of neural policies, and comparisons to animal behavior.
Methodology
The problem is formulated as a continuous-control RL task in which a simulated glider with vulture-like parameters acts to maximize climb rate inside a thermal that drifts with a horizontal wind.

Environment and dynamics: a simplified 3-DOF point-mass aircraft model flies through a spatially varying wind field (uniform horizontal wind plus thermal updraft). State variables include speed V, climb rate v_z, bank angle σ, angle of attack α, angle to the wind θ, and wind speed u; a short memory buffer of the previous 8 one-second states is appended to the observation. Actions are changes in bank angle and angle of attack per 1 s control step: Δσ ∈ [−15°, 15°] with σ ∈ [−50°, 50°], and Δα ∈ [−10°, 10°] with α ∈ [−30°, 30°]. The reward is primarily v_z plus shaping terms: P_center = −d/50 penalizes the horizontal distance d from the thermal center, P_stab penalizes early termination due to instability (spin-out) in proportion to the remaining episode time, and a crash incurs a penalty of −1000. The thermal model follows Gedeon/Lenschow/Bencatel, with an updraft profile w(r, z) characterized by w* = 5 m/s and z* = 2000 m; the thermal drifts uniformly with the horizontal wind u, so the overall flow is v(x, y, z, t) = (u, 0, w(r(t), z))^T with r(t) = √[(x − ut)² + y²].

Training: the agent is trained with DDPG (an actor–critic, policy-gradient algorithm) implemented in stable-baselines3 (PyTorch). Actor and critic are feedforward networks with two hidden layers of 200 neurons each. Hyperparameters (algorithm choice DDPG vs. PPO, learning rates, layer sizes) were optimized with BOHB. Each episode lasts 200 s with 1 s control intervals; initial conditions are randomized in orientation and (x, y), with z = 500 m, V = 15 m/s, and glide angle −5°. Curriculum learning on the wind speed u was crucial: training proceeds with u in [0–2], then [0–4], then [0–6] m/s.

Evaluation: 100 runs per wind speed u with random initial conditions, computing time in thermal, mean v_z, and efficiency η = (v_z^agent − v_z^baseline)/(v_z^optimal − v_z^baseline), where v_z^optimal is an analytically derived upper bound for circular thermalling given the model (0.72 m/s for the nominal parameters) and v_z^baseline is the best achievable sink rate without an updraft (−0.75 m/s). Robustness tests varied the thermal parameters (w*, z*), added Gaussian sensor noise to each state variable individually, and introduced environmental wind gusts (u varying on a 20 s timescale).

Interpretability: hidden-layer activations of the trained networks were concatenated at each timestep and clustered with k-means; the number of clusters k was chosen by the knee method on the sum of squared errors (SSE). Agents at different training stages ("young", "intermediate", "expert") were analyzed to relate clusters to flight phases (θ).

Vulture comparison: from GPS data of 20 vultures (1 Hz, >4 million samples), 243 thermalling trajectories (≥100 s each; 9.4 h in total) recorded in August–September were selected. Wind magnitude and direction and bank angle were estimated from the data, and the thermal parameters (z*, w*) used to train a vulture-like agent (m = 7.75 kg, S = 0.87 m²) were chosen to match the observed mean v_z and thermalling radius. The thermalling radius was estimated from trajectory geometry using θ crossings.
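To make the environment and reward structure concrete, here is a minimal Python sketch of the drifting-thermal flow field, the shaped reward, and the efficiency metric η. The drift term r(t), the shaping constants, and the η bounds follow the description above; the radial and vertical shape of the updraft (Gaussian bell, hypothetical core radius R_CORE, Lenschow-style height scaling) and the function names are illustrative assumptions, not the paper's exact Gedeon/Lenschow/Bencatel form.

```python
import numpy as np

W_STAR = 5.0      # characteristic updraft strength [m/s]
Z_STAR = 2000.0   # characteristic thermal height [m]
R_CORE = 80.0     # illustrative core radius [m]; not taken from the paper


def updraft(x, y, z, t, u):
    """Vertical wind at (x, y, z) for a thermal drifting downwind at speed u."""
    r = np.hypot(x - u * t, y)                      # r(t) = sqrt((x - u t)^2 + y^2)
    height = (z / Z_STAR) ** (1.0 / 3.0) * max(1.0 - 1.1 * z / Z_STAR, 0.0)
    radial = np.exp(-(r / R_CORE) ** 2)             # hypothetical bell-shaped falloff
    return W_STAR * height * radial


def flow(x, y, z, t, u):
    """Total flow v = (u, 0, w(r(t), z))^T: uniform drift plus the updraft."""
    return np.array([u, 0.0, updraft(x, y, z, t, u)])


def shaped_reward(v_z, dist_to_center, terminated_early, time_left, crashed):
    """Reward = climb rate + centering and stability shaping terms."""
    r = v_z
    r += -dist_to_center / 50.0                     # P_center = -d / 50
    if terminated_early:
        r += -1.0 * time_left                       # P_stab; unit coefficient is illustrative
    if crashed:
        r += -1000.0
    return r


def efficiency(mean_v_z, v_z_baseline=-0.75, v_z_optimal=0.72):
    """eta = (v_z_agent - v_z_baseline) / (v_z_optimal - v_z_baseline)."""
    return (mean_v_z - v_z_baseline) / (v_z_optimal - v_z_baseline)
```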
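The training loop can likewise be sketched with stable-baselines3's DDPG and the wind-speed curriculum described above. The environment class ThermalSoaringEnv, its constructor arguments, and the per-stage timestep budgets are hypothetical placeholders; only the two 200-neuron hidden layers, the 200 s / 1 s episode structure, and the [0–2], [0–4], [0–6] m/s curriculum stages come from the paper.

```python
from stable_baselines3 import DDPG

# Hypothetical Gymnasium-style environment standing in for the paper's
# simulator; its module, name, and arguments are assumptions.
from thermal_soaring_env import ThermalSoaringEnv  # placeholder import


def make_env(max_wind):
    # 200 s episodes with 1 s control steps; horizontal wind u drawn
    # uniformly from [0, max_wind] m/s at the start of each episode.
    return ThermalSoaringEnv(max_wind=max_wind, episode_length=200.0, dt=1.0)


# Actor and critic: feedforward networks with two hidden layers of 200 neurons.
model = DDPG(
    "MlpPolicy",
    make_env(max_wind=2.0),                    # curriculum stage 1: u in [0, 2] m/s
    policy_kwargs=dict(net_arch=[200, 200]),
    verbose=1,
)

# Curriculum over wind speed: widen the sampled wind range in stages.
# Per-stage timestep budgets are illustrative, not the paper's values.
for max_wind, steps in [(2.0, 200_000), (4.0, 200_000), (6.0, 200_000)]:
    model.set_env(make_env(max_wind))
    model.learn(total_timesteps=steps, reset_num_timesteps=False)

model.save("soaring_ddpg")
```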
Key Findings
- Successful autonomous thermalling under horizontal winds of u = 0–6 m/s using deep RL with curriculum learning. Trajectories become trochoidal in the xy-plane, with agents circling the drifting thermal while tracking it downwind; the typical distance to the thermal center is 10–50 m (mean ~25 m). The agent tightens its turn radius near θ ≈ ±180° to track the drift.
- Vertical speed generation depends on wind phase: strongest vertical acceleration for θ in approximately −45° to 90° (flying against the wind), converting headwind into climb; variability in θ and α increases with u (e.g., std(θ) from 6° at u=0 to 12° at u=5 m/s).
- Performance metrics: The agent spends most of its time within the thermal (≤80 m from the center) for u = 0–6 m/s. Efficiency η > 0 across all u, peaking around u = 1–3 m/s with η ≈ 0.8. The highest average v_z ≈ 0.42 m/s occurs at u = 3 m/s; excluding the first 20 s yields mean v_z ≈ 0.54 m/s (max 0.67 m/s) and mean η ≈ 0.88 (max 0.96). Bounds: v_z^optimal ≈ 0.72 m/s; v_z^baseline ≈ −0.75 m/s.
- Learning bottlenecks identified via reward shaping: (1) Stabilization—without P_stab, agents crash early when reward is v_z alone; adding P_stab enables stable flight. (2) Thermal centering—adding P_center encourages staying near the thermal; combined v_z + P_stab + P_center yields efficient, stable thermalling. Using v_z + P_center alone fails to stabilize.
- State/action representation insights: Including horizontal wind speed u in the state is crucial for generalization across wind speeds; agents without u can overfit to a single u and perform poorly at other u. Memory of ≈2 s suffices to stay near the thermal, but efficient soaring (high η) needs ≈5 s of memory. Control over Δα in addition to Δσ is necessary for stability and efficiency under strong winds; adding Δβ slowed convergence.
- Robustness: The policy generalizes to varied thermal parameters; for w* ≥ 5 m/s across the tested z* values, η > 70%. Sensor noise degrades performance at different rates: the agent is most sensitive to noise on θ, where normalized noise of ~0.65 (relative to θ's natural variability) reduces performance to baseline, and least sensitive to noise on V (~1.7 is needed). For noise on the sensed wind speed u, σ_noise ≈ 7 m/s (larger than the mean u = 3 m/s) is needed to drive performance to baseline, suggesting the agent effectively averages u over its memory. Environmental gusts (20 s timescale) with σ ≈ 3 m/s reduce performance to baseline.
- Interpretability: Hidden-layer activations cluster into four functional modes that align with flight phases (θ). This phase-specific clustering becomes more distinct as agents mature from "young" to "expert", with clusters mapping to headwind, tailwind, and crosswind segments plus a brief early search phase (see the clustering sketch after this list).
- Animal comparison: The learned policy resembles the soaring behavior of vultures. Distributions of thermalling radii have similar means (~30 m), with the agent's distribution narrower owing to the regularity of the simulation. Both the agent and the vultures show a peak in θ near 180° (tailwind), and mean speed, climb rate, and bank angle are comparable.
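As a rough illustration of the interpretability analysis referenced in the bullet above, the following sketch clusters per-timestep activation vectors with k-means and picks k by a knee rule on the SSE curve. The knee detector shown (farthest point from the line joining the curve's endpoints) is one common implementation assumed here, not necessarily the paper's exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans


def sse_curve(activations, k_max=10, seed=0):
    """k-means SSE (inertia) for k = 1..k_max on per-timestep activation vectors."""
    ks = np.arange(1, k_max + 1)
    sse = np.array([
        KMeans(n_clusters=k, n_init=10, random_state=seed).fit(activations).inertia_
        for k in ks
    ])
    return ks, sse


def knee(ks, sse):
    """Elbow of the SSE curve: the point farthest from the line joining its endpoints."""
    x = (ks - ks[0]) / (ks[-1] - ks[0])      # normalize both axes to [0, 1]
    y = (sse - sse[0]) / (sse[-1] - sse[0])
    dist = np.abs(x - y) / np.sqrt(2.0)      # distance to the diagonal through the endpoints
    return int(ks[np.argmax(dist)])


# Usage (activations: array of shape [n_timesteps, n_hidden_units], built by
# concatenating the trained network's hidden-layer activations at each step):
# ks, sse = sse_curve(activations)
# k_best = knee(ks, sse)
# labels = KMeans(n_clusters=k_best, n_init=10).fit_predict(activations)
# The labels can then be compared against the angle-to-wind phase theta per timestep.
```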
Discussion
The study demonstrates that deep RL can learn efficient thermal soaring under realistic horizontal winds by addressing the inherent structure of the task. Two sequential bottlenecks—stable flight and centering on the thermal—must be overcome; reward shaping and curriculum learning enable progression through these stages. The newly defined efficiency metric η quantifies how effectively the agent exploits available updraft, facilitating comparisons across conditions and with animal behavior. Analysis of state and action representations highlights the necessity of wind information and adequate short-term memory for non-Markovian inference of thermal position, while control over angle-of-attack is essential in strong winds. Robustness analyses delineate sensitivity to sensor noise (especially θ) and environmental gusts, informing priorities for sensing in UAVs and illuminating potential sensory limits in birds. Interpreting neural activations reveals functional clusters aligned with flight phases, suggesting an emergent modular structure in the learned controller analogous to behavioral modes seen in adult vultures, which also show θ-dependent climb performance. Comparisons with vulture data affirm shared thermalling strategies (radius, phase preferences), supporting the relevance of the learned policy to natural soaring and offering a framework for studying learning in animal motion control.
Conclusion
This work introduces a vulture-inspired deep-RL framework that learns robust thermal soaring under horizontal winds, identifies key learning bottlenecks, proposes an efficiency metric η for performance assessment, and reveals interpretable neural activation clusters linked to flight phases. The trained agents achieve near-optimal exploitation of thermals across a range of winds and generalize to varying thermal structures while maintaining robustness to moderate sensor noise and gusts. The approach bridges engineered and biological systems: learned behavior mirrors vulture thermalling patterns, and interpretability analyses offer a pathway to distill explicit control rules (e.g., decision-tree extraction from clusters). Future directions include reducing the sim-to-real gap (e.g., pre-training with centering penalties, then domain adaptation), incorporating richer atmospheric models (leaning thermals, turbulence, multiple thermals), exploring onboard estimation of wind and phase using minimal sensors, long-term studies of learning and maturation in birds, and translating clustered neural policies into verifiable, interpretable controllers for UAVs.
Limitations
- Reward shaping using distance to the thermal center (P_center) requires knowledge unavailable in real-world training; it is practical in simulation and may serve for pre-training only.
- The flight dynamics use a simplified 3-DOF point-mass model and quasi-steady aerodynamics; high-fidelity unsteady effects and control surface dynamics are abstracted.
- Horizontal wind is modeled as uniform within each episode; although a leaning chimney model is available, most results use uniform drift without leaning or turbulence.
- Single-thermal environment during training and evaluation may not capture multi-thermal interactions or complex convective fields.
- Sensor noise and gust models are simplified (Gaussian, fixed timescale), and some reported noise normalization is relative to empirical variability rather than absolute sensor specs.
- Network architectures are relatively small feedforward models; memory is implemented via state history rather than recurrent networks, which may limit temporal inference in more complex settings.