Embodied intelligence via learning and evolution

Computer Science

A. Gupta, S. Savarese, et al.

This research by Agrim Gupta, Silvio Savarese, Surya Ganguli, and Li Fei-Fei introduces Deep Evolutionary Reinforcement Learning (DERL), a framework that evolves agent morphologies for complex tasks. It reveals connections between environmental complexity, morphological intelligence, and the speed of learning, showing how more stable, energy-efficient designs emerge from challenging settings.
Introduction

The paper explores how learning within lifetimes and evolution over generations interact to produce diverse morphologies that embody intelligence. While biological evolution has produced sophisticated forms well adapted to environments, artificial intelligence has largely focused on disembodied cognition. Creating artificial embodied agents that can learn complex control tasks in varied environments is difficult due to the vast morphological search space and the high computational cost of evaluating fitness via lifetime learning. Prior work often restricted morphology or controller complexity, or avoided learning controllers directly from raw observations. The authors aim to uncover principles linking environmental complexity, evolved morphology, and the learnability of control by scaling experiments simultaneously across environmental, morphological, and control complexity. They propose Deep Evolutionary Reinforcement Learning (DERL), which combines Darwinian evolution over morphologies with reinforcement learning of controllers from low-level egocentric sensory inputs. Using DERL, they investigate whether environmental complexity fosters morphological intelligence (morphologies that enable faster and better learning of new tasks), whether a morphological Baldwin effect emerges (evolution selecting faster learners without explicit selection for learning speed), and what mechanistic properties (e.g., stability and energy efficiency) underpin these relations.

Literature Review

Prior evolutionary robotics studies often employed generational evolution with synchronized population replacement, which scales poorly when each fitness evaluation requires costly lifetime learning. To manage this computational burden, previous work limited morphological search spaces or optimized controllers for fixed, hand-designed morphologies; avoided direct learning from raw sensory observations; used low-parameter, hand-designed controllers; learned surrogate models to predict fitness; or transmitted controller parameters across generations (Lamarckian approaches). Many studies focused on simple locomotion on flat terrain with few degrees of freedom or simplified body plans (e.g., cuboids), and work in soft robotics often relied on predetermined, non-learned actuation patterns. Recent work in neural architecture search motivates asynchronous, tournament-based, steady-state evolution as a more scalable alternative. Studying how evolved morphologies support learned control in complex environments, in turn, requires rigid-body design spaces more expressive than prior hand-designed ones.

Methodology

Framework: Deep Evolutionary Reinforcement Learning (DERL) decouples evolution and learning in an asynchronous, distributed setup. Each run starts with P=576 unique morphologies whose controllers are trained via reinforcement learning (RL) from low-level egocentric proprioceptive and exteroceptive observations. Fitness is the average final reward after lifetime learning. Asynchronous steady-state evolution proceeds via repeated 4-way tournaments: the fittest parent’s morphology is mutated to produce a child, which undergoes lifetime learning from scratch; only morphological information is inherited. Population size is maintained by aging: only the most recent P agents remain alive. Computation is parallelized over 288 workers × 4 CPUs (total 1152 CPUs), enabling search and training of ~4000 morphologies (≈10 generations) with 5 million environment interactions per morphology in under 16 hours.
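A minimal, single-process Python sketch may help fix the control flow of this loop. Here `mutate` and `train_and_evaluate` are hypothetical stand-ins for UNIMAL mutation and PPO lifetime learning, and the real system dispatches tournaments asynchronously across workers rather than sequentially:

```python
import random

def derl_loop(init_morphologies, mutate, train_and_evaluate,
              pop_size=576, tournament_size=4, total_agents=4000):
    # Each record is (birth_order, morphology, fitness), where fitness
    # is the average final reward after lifetime RL from scratch.
    history = [(i, m, train_and_evaluate(m))
               for i, m in enumerate(init_morphologies)]
    while len(history) < total_agents:
        # Aging: only the most recent pop_size agents are alive.
        alive = history[-pop_size:]
        # 4-way tournament: the fittest of four random agents reproduces.
        parent = max(random.sample(alive, tournament_size),
                     key=lambda rec: rec[2])
        # Only morphology is inherited; the child's controller is trained
        # from scratch (Darwinian, not Lamarckian, transmission).
        child = mutate(parent[1])
        history.append((len(history), child, train_and_evaluate(child)))
    return history
```

In the distributed version, each worker runs its own tournament and appends results to a shared population store; this steady-state design, rather than synchronized generational replacement, is what allows evaluation-heavy lifetime learning to scale.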

Design space (UNIMAL): Morphologies are directly encoded as a kinematic tree of 3D rigid parts (head: sphere; limbs: cylinders) connected by motorized hinge joints in MuJoCo. Bilateral symmetry is enforced so centers of mass lie on the sagittal plane. Three mutation classes are used: (1) topology: grow or delete limbs; (2) limb parameters: length, radius, density; (3) joint properties: degrees of freedom (axes), joint angle limits, gear ratios. New limbs grow downward, avoid self-intersections, and preserve symmetry. All limbs share a common density that can be mutated globally. The discrete parameterization yields an expressive space (~10^18 morphologies with <10 limbs) while keeping evaluation tractable.
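To make the three mutation classes concrete, here is an illustrative Python sketch over a hypothetical dict encoding of a morphology. The parameter grids, field names, and the omission of limb deletion and symmetry bookkeeping are assumptions for brevity, not the paper's implementation:

```python
import copy
import random

# Illustrative discrete parameter grids (not the paper's exact values).
LIMB_LENGTHS = [0.2, 0.3, 0.4]
GEAR_RATIOS = [150, 200, 250, 300]
DENSITIES = [500, 750, 1000]

def mutate(morph):
    child = copy.deepcopy(morph)
    op = random.choice(["topology", "limb_params", "joint_props", "density"])
    if op == "topology" and len(child["limbs"]) < 10:
        # Grow a limb below an existing one; the real operator also adds
        # a mirrored copy when needed to preserve bilateral symmetry and
        # rejects self-intersecting geometries.
        parent = random.choice(list(child["limbs"]))
        child["limbs"][f"{parent}/limb{len(child['limbs'])}"] = {
            "length": random.choice(LIMB_LENGTHS),
            "gear": random.choice(GEAR_RATIOS),
        }
    elif op == "limb_params":
        limb = random.choice(list(child["limbs"].values()))
        limb["length"] = random.choice(LIMB_LENGTHS)
    elif op == "joint_props":
        limb = random.choice(list(child["limbs"].values()))
        limb["gear"] = random.choice(GEAR_RATIOS)
    else:
        # All limbs share one density, mutated globally.
        child["density"] = random.choice(DENSITIES)
    return child
```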

Environments: Three evolutionary environments of increasing complexity:

  • Flat terrain (FT): maximize forward displacement in a 150×150 m arena.
  • Variable terrain (VT): like FT but each episode samples stochastic obstacle sequences (hills, steps, rubble) interleaved with flat segments in a 100×100 m arena.
  • Manipulation in variable terrain (MVT): as VT but in a 60×40 m arena; the objective is to push a 0.2 m cube from a random start to a goal location farther along the forward direction.

These environments increase sub-task demands from pure locomotion (FT) to obstacle traversal (VT) to mobile manipulation combined with locomotion (MVT).

Reinforcement learning: Controllers are stochastic policies trained with Proximal Policy Optimization (PPO). Observations include morphology-dependent proprioception (joint states; head-mounted velocity, acceleration, gyroscope; touch sensors) and task-specific exteroception (e.g., egocentric goal position; object pose/velocity; local heightmap sampled on a non-uniform grid when terrain varies). The policy and critic networks are MLPs: per-modality two-layer encoders (64,64), concatenation, a 64-d shared layer, then linear outputs (Gaussian policy parameters; value). ~250k parameters; tanh activations. Early termination occurs upon falls (head height drops below 50% initial). Reward functions: FT/VT use r_t = v_x − 0.001||a||^2 (tournament selection based only on v_x). MVT uses approaching-object and object-to-goal geodesic progress terms (weights 100 each) minus 0.001||a||^2, plus sparse rewards near object and near goal.
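As a rough PyTorch sketch of the architecture and the FT/VT reward described above; modality dimensions are placeholders, and the actual implementation keeps separate policy and critic networks rather than the shared trunk used here for brevity:

```python
import torch
import torch.nn as nn

class UnimalActorCritic(nn.Module):
    def __init__(self, modality_dims, act_dim):
        super().__init__()
        # One two-layer (64, 64) tanh encoder per observation modality
        # (proprioception, task-specific exteroception, ...).
        self.encoders = nn.ModuleList([
            nn.Sequential(nn.Linear(d, 64), nn.Tanh(),
                          nn.Linear(64, 64), nn.Tanh())
            for d in modality_dims])
        # Encoded modalities are concatenated into a 64-d shared layer.
        self.shared = nn.Sequential(
            nn.Linear(64 * len(modality_dims), 64), nn.Tanh())
        self.mu = nn.Linear(64, act_dim)           # Gaussian policy mean
        self.log_std = nn.Parameter(torch.zeros(act_dim))
        self.value = nn.Linear(64, 1)              # critic head

    def forward(self, obs_list):
        z = torch.cat([enc(o) for enc, o in zip(self.encoders, obs_list)],
                      dim=-1)
        h = self.shared(z)
        return self.mu(h), self.log_std.exp(), self.value(h)

def ft_vt_reward(v_x, action):
    # FT/VT reward from the text: r_t = v_x - 0.001 * ||a||^2.
    return v_x - 1e-3 * float((action ** 2).sum())
```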

Evaluation of morphological intelligence: After evolution, the 10 best morphologies per environment (selected robustly from the top 100 agents across seeds and runs) are trained from scratch on 8 test tasks spanning agility (patrol, point navigation, obstacle, exploration), stability (escape, incline), and manipulation (push box incline, manipulate ball). Each morphology-task pair is trained with 5 random seeds for 5×10^6 interactions (and also for 1×10^6 interactions to assess learning speed). Energy efficiency is measured via Cost of Work (COW) = E/(M g r), where E is the sum of absolute joint torques over the episode, M the agent's mass, g gravitational acceleration, and r the reward. Passive stability is assessed by standing with no control: if head height after 400 simulation steps remains above 50% of its initial value, the morphology is deemed passively stable.
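Both metrics reduce to a few lines; a sketch under the definitions above, with hypothetical argument conventions (per-step torque vectors, final reward):

```python
G = 9.81  # gravitational acceleration (m/s^2)

def cost_of_work(joint_torques, mass, reward):
    # COW = E / (M * g * r), with E the summed absolute joint torques
    # over the episode; lower COW means more energy-efficient behavior.
    energy = sum(abs(t) for step in joint_torques for t in step)
    return energy / (mass * G * reward)

def is_passively_stable(head_heights, init_height, horizon=400):
    # Passively stable: with zero control applied, head height after
    # 400 simulation steps stays above 50% of its initial value.
    return head_heights[horizon] > 0.5 * init_height
```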

Scale and runs: Experiments span 3 evolutionary runs per environment. Phylogenetic trees and Muller plots track lineage diversity, beneficial mutations (fitness increase thresholds), and abundance over evolutionary time.

Key Findings
  • DERL successfully evolves diverse morphologies and learns controllers across all three environments (FT, VT, MVT). Average population fitness improves by ~3× in FT/VT and ~2× in MVT over evolution; morphological evolution substantially increases the fitness of the best morphologies.
  • Diversity is maintained: asynchronous, small-tournament, aging-based evolution allows multiple lineages with lower initial fitness to accumulate beneficial mutations and contribute high-fitness descendants, yielding a variety of bipeds, tripeds, quadrupeds with/without arms.
  • Environmental complexity fosters morphological intelligence: On the 8-task evaluation suite, morphologies evolved in MVT outperform those evolved in FT on 7/8 tasks when trained for 5×10^6 interactions, and VT morphologies outperform FT on most agility/stability tasks. With only 1×10^6 interactions, MVT/VT morphologies significantly outperform FT across all tasks, indicating faster learning.
  • Morphological Baldwin effect: Despite no explicit selection for learning speed (fitness is measured only at the end of lifetime learning), the number of iterations needed to reach a criterion fitness (the 75th percentile of the initial population's fitness) decreases rapidly over generations for top-100 lineages; within ~10 generations, average learning time is roughly halved (a sketch of this time-to-criterion metric follows this list). Example learning curves show later-generation agents both reaching higher final fitness and matching earlier generations' final fitness in roughly one-fifth of the training time.
  • Mechanistic basis: Evolution selects morphologies that are more energy-efficient (lower COW) and more passively stable. COW decreases over generations without explicit energy penalties in selection; body mass tends to increase, suggesting improved exploitation of passive dynamics rather than trivial mass reduction. The fraction of passively stable morphologies increases over time and is higher in VT/MVT than FT, indicating stronger selection pressure for stability in more complex environments.
  • Energy efficiency correlates with performance and learning speed: Across generations and environments, COW correlates negatively with final fitness and positively with the number of iterations needed to reach criterion fitness; in other words, more energy-efficient morphologies both perform better and learn faster.
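The time-to-criterion measurement behind the Baldwin-effect bullet above can be sketched as follows; the helper and its inputs are illustrative, not the paper's code:

```python
import numpy as np

def iterations_to_criterion(learning_curve, criterion):
    # First training iteration at which fitness reaches the criterion
    # (the 75th percentile of the initial population's final fitness);
    # None if the agent never reaches it within its lifetime.
    hits = np.nonzero(np.asarray(learning_curve) >= criterion)[0]
    return int(hits[0]) if hits.size else None
```
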
Discussion

The study addresses how environmental complexity influences evolved morphology and the learnability of control. By evaluating morphological intelligence as the extent to which a morphology facilitates learning on diverse downstream tasks, the authors show that evolution in complex environments (VT, MVT) produces morphologies that learn both better and faster than those evolved on flat terrain. The emergence of a morphological Baldwin effect demonstrates that selection can shift learning-dependent performance into genetically encoded structural traits, reducing learning time across generations even without explicit selection for learning speed. Analyses suggest a mechanistic underpinning: morphologies that enhance passive stability and energy efficiency simplify control and improve sample efficiency. Thus, environmental complexity implicitly imposes selection pressures favoring traits that both raise performance and accelerate learning, providing a pathway for embodied agents to generalize better to new tasks.

Conclusion

This work introduces DERL, a scalable framework that jointly evolves morphology and learns control from low-level sensory inputs in complex environments, and UNIMAL, an expressive design space for rigid-body morphologies. Large-scale simulations reveal that complex environments foster morphological intelligence, that a strong morphological Baldwin effect can arise without explicit selection for learning speed, and that energy efficiency and passive stability likely underpin these effects. The findings suggest that scaling environmental complexity can yield embodied agents that learn multiple tasks quickly and robustly by leveraging passive body-environment dynamics. Future directions include designing more open-ended, physically realistic, and multi-agent evolutionary environments, expanding and systematizing evaluation suites for human-relevant behaviors, relaxing symmetry and limb-count constraints, and leveraging Baldwinian transfer to improve sample efficiency and generalization.

Limitations
  • Simulation and sensing simplifications: Agents receive engineered egocentric observations (e.g., heightmaps, object poses) rather than raw perceptual inputs like images; transferring to real-world would require perception modules and exploration strategies. MuJoCo parameters approximate realism but remain simulated.
  • Design space constraints: Bilateral symmetry is enforced; parameter sets are discrete; a maximum of 10 limbs keeps learning tractable; and joints cannot rotate a limb about its own axis. These choices limit the possible morphologies and gaits.
  • Controller architecture and training budget: Relatively shallow MLP policies (~250k parameters) and fixed 5M interaction budgets per morphology; results may change with larger models or budgets.
  • Reward design and termination: Simple, morphology-agnostic rewards and early termination on falls may bias learned gaits.
  • Stochasticity and seeds: RL outcomes can vary with random seeds; authors mitigate via multiple runs and reseeding, but variance remains a factor.
  • Environment scope: Three single-agent environments; broader, more open-ended, and multi-agent settings remain to be explored for generalization and robustness.