High-speed aerial grasping using a soft drone with onboard perception

Engineering and Technology


S. Ubellacker, A. Ray, et al.

Discover the groundbreaking research by Samuel Ubellacker, Aaron Ray, James M. Bern, Jared Strader, and Luca Carlone! This paper introduces a soft aerial manipulator with an onboard perception system that enables high-speed grasping of various objects at speeds up to 2.0 m/s. This technology showcases agility and robust object interaction in both indoor and outdoor environments.

~3 min • Beginner • English
Introduction
Quadrotors can fly fast and carry payloads, but interacting with the environment (manipulation and grasping) remains challenging due to contact dynamics and the need to perceive task-relevant features without external infrastructure. Existing high-speed aerial grasping efforts often use rigid manipulators that require precise positioning and endure large reaction forces, limiting performance; many rely on motion capture or strong assumptions about target shape, or operate slowly. The research question is how to achieve high-speed, versatile, and robust aerial grasping using only onboard sensing and computation. The authors hypothesize that combining a soft, compliant gripper with onboard perception, planning, and adaptive control can mitigate reaction forces, tolerate pose errors, and decouple flight from contact dynamics, enabling agile, vision-based aerial grasping. They present a hybrid system: a rigid quadrotor plus a passively closed, tendon-driven soft foam gripper; an onboard perception pipeline for known objects; minimum-snap grasp trajectory planning; adaptive control to handle added mass and aerodynamics; and FEM-based gripper control. The goal is deployable, generalizable aerial manipulation in diverse environments without motion capture.
Literature Review
Prior work explored grippers and aerial manipulators for drones and helicopters, including perching mechanisms and contact-based tasks. High-speed grasping demonstrations (e.g., up to 3 m/s) have required motion capture for both drone and target and used rigid grippers designed for specific light objects, sometimes avoiding ground contact by suspending targets. Soft aerial grippers have been used dynamically but with external localization and limited object diversity and speed (~0.2 m/s). Vision-based systems have used segmentation and point clouds with offboard computation and motion capture for state, rigid multi-link grippers requiring precise CAD-based grasp points for careful stationary grasps, or onboard vision with rigid grippers limited to stationary targets. Some methods assume cylindrical targets or rely on motion capture and show low speeds. Overall, no prior work combined onboard drone localization and target pose estimation, grasping with meaningful forward velocity, and fully onboard computation with generality across objects and environments. The authors position soft grasping as a way to reduce precision requirements and reaction forces via morphological computation, drawing inspiration from biological systems that blend rigid and soft tissues.
Methodology
System overview: The quadrotor performs fully onboard perception using two Intel RealSense sensors: a D455 RGB-D camera for target pose estimation and a T265 stereo fisheye module for visual-inertial odometry (VIO). The estimated target and drone poses in a global frame are used to plan a minimum-snap polynomial trajectory passing through a grasp point and a terminal hover. An adaptive controller tracks the trajectory, compensating for added mass and aerodynamics post-grasp. A finite-element-method (FEM)-based controller sets soft gripper tendon inputs to balance target visibility and grasp robustness. Onboard computation runs on an NVIDIA Jetson Xavier NX; low-level flight control runs on a Pixhawk 4 Mini; VIO runs on the T265.

Soft gripper design and FEM modeling/control: The gripper has four passively closed, tendon-driven foam fingers (Smooth-On FlexFoam-iT! X) actuated by motorized winches; loosening tendons lets elasticity close the fingers rapidly with low power. This design reduces reaction forces during ground contact and tolerates positioning errors, conforming to target geometry. Each finger has one DOF but can assume complex shapes via tendon routing. The gripper body is modeled as an FEM mesh with cables as unilateral springs; given motor angles u, a statically stable shape x(u) is found by minimizing total energy. A nested control optimization chooses u to maximize a grasp objective (area enclosed by fingertips and target centroid) while enforcing camera field-of-view (FOV) constraints to avoid occluding the D455 (front) and T265 (rear) at different phases. Three gripper states are used: target observation (both FOV constraints active), pre-grasp (rear FOV only to allow front fingers to open for capture), and post-grasp (fingers return to passive closed state). Due to computational limits, tendon policies are precomputed offline using a tuned relative offset approximation for pre-grasp, applied online via a state machine.
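The energy-minimization idea behind the gripper model can be illustrated with a toy one-finger version: for a fixed motor angle u, the statically stable shape minimizes elastic energy plus a unilateral cable energy. The chain geometry, stiffnesses, and linear cable-length model below are illustrative assumptions, not the paper's actual FEM mesh or parameters.

```python
import numpy as np
from scipy.optimize import minimize

# Toy one-finger stand-in for the FEM model: the finger is a chain of
# joints whose rest pose is curled (passively closed). The tendon is routed
# so that curling lengthens its path; winding the winch shortens the
# tendon's free length, pulling the finger open against foam elasticity.
# The cable is a unilateral spring: it resists stretch but cannot push.
N = 4
THETA_REST = np.full(N, 0.9)   # rad; curled/closed rest pose (assumed)
K_ELASTIC = 1.0                # foam bending stiffness (assumed)
K_CABLE = 50.0                 # tendon stiffness (assumed)

def cable_length(theta):
    # Simple proxy: tendon path length grows linearly with total curl.
    return 1.0 + 0.3 * np.sum(theta)

def total_energy(theta, free_length):
    elastic = 0.5 * K_ELASTIC * np.sum((theta - THETA_REST) ** 2)
    stretch = max(0.0, cable_length(theta) - free_length)
    return elastic + 0.5 * K_CABLE * stretch ** 2

def static_shape(motor_angle, spool_radius=0.05):
    # Statically stable shape x(u): minimize total energy for a fixed input.
    slack = cable_length(THETA_REST) + 0.1        # tendon slack when closed
    free_length = slack - spool_radius * motor_angle
    return minimize(total_energy, THETA_REST, args=(free_length,)).x

closed = static_shape(motor_angle=0.0)  # slack tendon: finger stays at rest
opened = static_shape(motor_angle=5.0)  # wound tendon: joints straighten
```

The nested control layer described above would then search over the motor angle to maximize a grasp objective on the resulting shape, subject to FOV constraints; here only the inner static-equilibrium solve is sketched.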
Perception pipeline: A ResNet-18-based keypoint detector (trt_pose, TensorRT, ~14 Hz) is trained via a semi-automated annotation process using a calibration board to recover camera poses and project a small set of manual keypoint labels across many images. Detected 2D keypoints are back-projected to 3D using D455 depth and robustly registered to known CAD keypoints with TEASER++ (truncated least squares), producing a camera-relative 6-DoF pose; combining with the drone VIO gives a global target pose. A fixed-lag smoother (GTSAM, iSAM2) fuses recent pose estimates under a nearly constant-velocity motion model to reduce noise and estimate target linear and angular velocities. For most grasps on flat ground, only yaw is used in planning.

Trajectory planning and control: Grasp trajectories are minimum-snap polynomials (fixed-time) aligned with the target grasp axis, starting from a hover, passing a grasp point (target position plus offset for finger geometry), and ending at a terminal point away from the target. For moving targets, the polynomial is expressed in the target frame and updated online, with the adaptive controller compensating induced disturbances. The low-level controller uses geometric adaptive tracking on SE(3) with an online disturbance estimator to handle added mass, thrust occlusion by the object, ground effect, wind, and modeling uncertainty. Post-grasp, a brief feedforward vertical acceleration impulse can be added (outdoor tests) to counteract disturbances from ground contact/grass.

Hardware: Custom carbon fiber quadrotor (RTF-inspired) with Yuneec motors/props, Pixhawk 4 Mini (custom PX4), 4S LiPo (~3 minutes flight, ~3 grasp trajectories), front-mounted D455 pitched 35° down, rear T265 with vibration damping and partial lens masking to meet FOV constraints, Jetson Xavier NX (USB 3.0 to cameras, UART to Pixhawk).
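The geometric core of the perception pipeline, back-projecting 2D keypoints through the pinhole model and registering them to known model keypoints, can be sketched as follows. The paper uses TEASER++ with truncated least squares for outlier-robust registration; the plain least-squares Kabsch solution below is a simplified, non-robust stand-in, and the camera intrinsics in the example are placeholder values.

```python
import numpy as np

# Back-project detected 2D keypoints to camera-frame 3D using per-pixel
# depth (pinhole model), then register them to the known model keypoints
# to recover a camera-relative 6-DoF object pose.

def back_project(uv, depth, fx, fy, cx, cy):
    """Pixels (N,2) with depths (N,) -> camera-frame points (N,3)."""
    x = (uv[:, 0] - cx) * depth / fx
    y = (uv[:, 1] - cy) * depth / fy
    return np.stack([x, y, depth], axis=1)

def register_rigid(model, scene):
    """Kabsch: find R, t minimizing sum ||R @ model_i + t - scene_i||^2."""
    mu_m, mu_s = model.mean(axis=0), scene.mean(axis=0)
    H = (model - mu_m).T @ (scene - mu_s)      # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_s - R @ mu_m
    return R, t                                # camera-relative object pose
```

Composing the recovered (R, t) with the drone's VIO pose then yields the global target pose that the planner consumes, as described above.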
Gripper electronics include an ATMega32U4 PCB, two DRV8434 motor drivers actuating four 12 V DC gearmotors (31:1), and voltage regulation. The gripper weighs ~544 g; each finger ~54 g. Energy per grasp ~2.072 J; the passive-closed state draws minimal power. Mechanical simplicity aids manufacturability and robustness.

Experimental protocol: For stationary targets, the drone takes off to a randomized start pose with target visible, aligns heading to a predefined grasp direction, and plans when ~0.9 m in front/above; it then flies the trajectory, freezing target pose updates after trajectory start and relying on VIO for self-state. Success is defined as retaining the object until landing. Motion capture was used for baselines and for error analysis (not for control during onboard-vision runs). For moving targets, motion capture provided states for both drone and target to focus on dynamics/planning limits.
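The fixed-time minimum-snap grasp segment from the planning subsection can be sketched per axis: with position, velocity, acceleration, and jerk fixed at both endpoints, the snap-minimizing trajectory is the unique degree-7 polynomial meeting those eight constraints, so planning reduces to one 8x8 linear solve per axis. The single-segment setup below (hover to grasp point, arriving at the grasp speed with zero acceleration and jerk) and its boundary values are illustrative simplifications of the paper's hover, grasp-point, terminal-point plan.

```python
import numpy as np

def deriv_row(t, k):
    # Constraint row: k-th derivative of sum_n a_n t^n, evaluated at t.
    row = np.zeros(8)
    for n in range(k, 8):
        c = 1.0
        for j in range(k):
            c *= n - j
        row[n] = c * t ** (n - k)
    return row

def min_snap_segment(p0, v0, p1, v1, T):
    # Stack derivative constraints at t=0 and t=T; acceleration and jerk
    # are pinned to zero at both ends in this simplified sketch.
    A = np.array([deriv_row(0.0, k) for k in range(4)] +
                 [deriv_row(T, k) for k in range(4)])
    b = np.array([p0, v0, 0.0, 0.0, p1, v1, 0.0, 0.0])
    return np.linalg.solve(A, b)          # coefficients a_0 .. a_7

def polyval(coeffs, t):
    return sum(a * t ** n for n, a in enumerate(coeffs))

# Example: x-axis plan from hover at 0 m to a grasp point 2 m ahead,
# arriving at 2 m/s after 3 s (illustrative values).
coeffs = min_snap_segment(0.0, 0.0, 2.0, 2.0, 3.0)
```

For a moving target, the same solve can be carried out in the target frame and refreshed online, as the planning subsection describes.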
Key Findings
Stationary targets (indoor, 0.5 m/s planned grasp speed): With fully onboard vision, success rates over 10 trials: med-kit 9/10, cardboard box 6/10, two-liter bottle 10/10. A motion-capture baseline performs similarly (slightly worse for med-kit, better for box, equal for bottle), indicating onboard perception suffices at this speed.

Outdoor grasps (0.5 m/s): Med-kit success 8/10 using a brief post-grasp vertical acceleration impulse to counteract disturbances from grass and wind; to the authors' knowledge, this is the first dynamic aerial manipulation outdoors with fully onboard vision.

High-speed grasps (two-liter bottle): Ten trials each at desired forward speeds of 1.25, 2.0, and 3.0 m/s (vision-based). Actual grasp speeds (motion capture): 1.04, 1.58, and 2.15 m/s, respectively. The system achieved fully vision-based grasps at over 2.0 m/s with success 3/10, the fastest fully vision-based aerial grasp reported. Tracking errors grow with speed mainly along the longitudinal axis; lateral and vertical errors remain small. VIO drift increases with speed: at commanded 0.5, 1.25, 2.0, and 3.0 m/s, drift ≈ 2.00, 2.69, 4.39, and 7.08 cm, respectively.

Pose estimation and tracking errors: Refined target pose errors at planning time: translation ≈ 4–5 cm; rotation < 10°; yaw ≈ 2–6°. Pre-grasp tracking errors ≈ 5 cm in position and 0.05 m/s in velocity; ground effect induces a vertical bias partly compensated by the adaptive controller; heavier objects increase post-grasp vertical error, which decays as adaptation learns.

Moving targets (motion capture used for state): Quadruped with med-kit: target forward speeds ~0.15 m/s (slow) and ~0.3 m/s (fast), commanded relative grasp speed 0.1 m/s. Success rates: 10/10 (slow), 6/10 (fast); failures at the higher speed are associated with increased lateral/vertical tracking errors.
Turntable with two-liter bottle: target tangential speed ≈ 0.08 m/s; relative grasp speeds of 0.5, 1.0, and 1.5 m/s achieved high success across 10 trials each; errors increase with speed, predominantly longitudinal, with some lateral misalignment due to rotational motion at higher speeds.

Payload and grasp mechanics: The soft gripper can hold up to 2 kg in fully enveloping grasps; pinching grasps have lower capacity (the cardboard box fails consistently at 250 g). Thrust occlusion and added mass impose a practical limit observed around 250 g mass and 546 cm² top-face area. Object geometry and surface friction significantly affect success (bottle 60 g, cylindrical; med-kit 148 g, concave edges; box 115 g, tall flat walls).
Discussion
The work demonstrates a deployable approach to high-speed, versatile aerial manipulation by combining a soft, passively closing gripper with onboard perception, planning, and adaptive control. Compliance dampens reaction forces at contact, tolerates pose and timing errors, and weakens coupling between contact dynamics and flight, enabling successful grasps even with non-negligible estimation and tracking errors. Fully onboard vision removes dependence on motion-capture infrastructure, broadening applicability to real-world tasks such as package pickup/delivery, emergency supply distribution, and warehouse automation. The gripper’s mechanical simplicity enhances manufacturability, robustness, and repairability, while FEM-based modeling/control provides a principled way to configure soft fingers under visibility and grasp objectives, and is generalizable to different gripper geometries tailored to target morphology. Error analyses show that pose estimation and VIO drift are within tolerances for successful grasps at moderate speeds, and that longitudinal tracking errors at higher speeds are less detrimental due to how grasp triggering is tied to estimated current state rather than desired trajectory. Moving target experiments (with motion capture) indicate feasibility at meaningful relative speeds and inform where perception and synchronization must improve for fully vision-based moving grasps.
Conclusion
This paper introduces a soft-drone system that achieves dynamic, high-speed, and versatile aerial grasping using only onboard sensing and computation for stationary targets, and demonstrates motion-capture-aided grasps for moving targets. Core contributions include: a passively closed, tendon-driven soft foam gripper enabling fast, robust closure and tolerance to positioning errors; a fully onboard keypoint-driven perception pipeline with robust registration and fixed-lag smoothing; integrated minimum-snap grasp trajectory planning and geometric adaptive control compensating for added mass and aerodynamics; and an FEM-based gripper controller optimizing visibility and grasp robustness across grasp phases. Experiments across three objects and 180 flights show high success rates indoors and outdoors, and vision-based grasps at over 2.0 m/s—the fastest reported. Future work includes removing the assumption of known object instances via category-level keypoints and pose estimation, scaling the drone to increase control authority and payload capacity, and achieving fully vision-based grasps on moving targets by improving motion prediction and addressing timing synchronization between state estimates and images.
Limitations
The perception pipeline assumes prior knowledge of target geometry and visual features (known CAD and instance-specific keypoints), limiting generalization to unseen objects; category-level methods are not yet integrated. Fully vision-based grasps of moving targets are not demonstrated due to compounded errors from targets leaving FOV, motion blur, and timing offsets between state and image timestamps; motion capture was required for moving-target experiments. FEM-based gripper control objectives are optimized offline with a tuned relative offset due to computational constraints, introducing approximation error. The platform has a relatively low thrust-to-weight ratio and short flight time (~3 minutes), limiting control authority at high speeds and payload capacity; thrust occlusion by grasped objects further reduces margin. Performance depends on object mass, geometry, and surface friction; pinching grasps have substantially lower payload limits. VIO can degrade under vibration (mitigated via damping) and exhibits increased drift at higher speeds; the perception pipeline’s depth-based occlusion handling is limited by D455 accuracy, and training assumed non-occluded top-face keypoints. Indoor experiments used weak magnets to prevent target displacement from downwash, which may not be available in all real-world settings.