Introduction
While quadrotors offer high speed, agility, and payload capacity, their interaction with the environment remains limited: most applications focus on sensing for navigation, inspection, or videography. Many tasks, however, require physical interaction, which demands handling complex contact dynamics and perceiving environmental features. Recent work has explored perching mechanisms, contact-based inspection, and even tree eDNA collection, all involving environmental contact but not manipulation. This research expands quadrotor capabilities by enabling object pickup and transport without external motion capture, broadening the scope of achievable tasks. Building such a system requires integrating mechanical design, perception, motion planning, control, and manipulation. Significant prior work exists in each area individually, including grasping mechanisms, aerial manipulator control, and vision-based pose estimation, but none directly addresses high-speed vision-based aerial manipulation. The challenges include the precise positioning required by rigid manipulators, adaptation to the changed dynamics of a grasped object, and the need for onboard perception for general applicability. Existing high-speed aerial grasping solutions rely on motion capture or fail to generalize across environments and targets.
Literature Review
Existing literature demonstrates promising advances in individual subsystems for aerial grasping platforms, but no prior system achieves high-speed vision-based aerial manipulation. Some studies achieve high-speed grasping (up to 3 m/s) but rely on motion capture for both drone and target state estimation and are designed for specific objects. Others use soft aerial grippers but require external localization. Several works explore vision-based aerial manipulation, but mitigate error by using motion capture for drone localization, flying at slow speeds, or making restrictive assumptions about objects. Works that use segmentation networks and point clouds for pose estimation often rely on offboard computation and motion capture, and those employing rigid grippers require careful positioning and are limited to stationary grasps. In summary, prior art lacks the combination of vision-based drone and target pose estimation, handling of targets approached with forward velocity, and fully onboard computation. The authors propose that soft grasping mechanisms, inspired by biological systems, can improve aerial grasping by passively conforming to objects, reducing the need for precise positioning, and weakening the coupling between flight and manipulation dynamics.
Methodology
This research combines a rigid quadrotor platform with a soft robotic gripper. The gripper features passively closed, tendon-actuated foam fingers that enable fast closure, compensate for positioning error, conform to object morphology, and damp reaction forces at contact. The gripper is modeled with a finite element mesh in which the actuation cables act as unilateral springs; a finite-element-based controller selects finger configurations by minimizing an objective that balances target visibility (avoiding camera occlusion) against grasp robustness across the grasp phases (target observation, pre-grasp, and post-grasp).

A real-time, fully onboard perception pipeline uses two cameras (a RealSense D455 for object pose estimation and a RealSense T265 for visual-inertial odometry (VIO)) and an onboard GPU (Jetson Xavier NX). A deep-neural-network keypoint detector, trained with a semi-automated data collection tool, detects semantic keypoints on the target; robust registration (TEASER++) aligns the detected keypoints to the object's CAD model to produce a global pose estimate, which a fixed-lag smoother refines. VIO processing runs on the RealSense T265 itself, so the entire pipeline operates onboard.

A minimum-snap polynomial trajectory planner generates grasp trajectories, which are tracked by an adaptive flight controller that compensates for the added mass and aerodynamic effects of the grasped object.
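Minimum-snap planning is a standard technique in quadrotor trajectory generation: minimizing the integral of squared snap forces the optimal trajectory to be a degree-7 polynomial per segment, fixed by its boundary conditions. The sketch below is a minimal single-axis, single-segment version, not the authors' implementation; the function name and the choice of zero boundary acceleration and jerk are illustrative assumptions.

```python
import numpy as np

def min_snap_segment(p0, pf, T, v0=0.0, vf=0.0):
    """Minimum-snap 1-D segment: a degree-7 polynomial p(t) on [0, T].

    Minimizing the integral of squared snap gives p^(8)(t) = 0, so the
    optimum is a 7th-order polynomial pinned down by 8 boundary
    conditions: position, velocity, acceleration, and jerk at both
    ends (acceleration and jerk are set to zero here for simplicity).
    """
    def deriv_row(t, order):
        # Row of the constraint matrix: d^order/dt^order of [1, t, ..., t^7].
        row = np.zeros(8)
        for i in range(order, 8):
            coef = 1.0
            for k in range(order):
                coef *= (i - k)
            row[i] = coef * t ** (i - order)
        return row

    A = np.array([deriv_row(0.0, d) for d in range(4)] +
                 [deriv_row(T, d) for d in range(4)])
    b = np.array([p0, v0, 0.0, 0.0, pf, vf, 0.0, 0.0])
    return np.linalg.solve(A, b)  # coefficients c0..c7, lowest degree first

# Illustrative numbers: a 2 m approach over 4 s, ending with 0.5 m/s
# of forward velocity at the moment of grasp.
c = min_snap_segment(0.0, 2.0, 4.0, v0=0.0, vf=0.5)
ts = np.linspace(0.0, 4.0, 50)
pos = np.polyval(c[::-1], ts)  # np.polyval wants highest degree first
```

In practice the full planner stitches several such segments (approach, grasp, retreat) per axis and shares boundary derivatives between them; the same linear-solve structure carries over.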
Key Findings
Experiments involved grasping three objects (a med-kit, a two-liter bottle, and a cardboard box) in indoor and outdoor environments across 180 flight tests. With fully onboard vision at 0.5 m/s, the system achieved success rates of 9/10, 6/10, and 10/10, respectively; performance with motion capture was comparable. Grasp success was influenced by object mass, morphology, and dimensions: the system could hold up to a 2 kg target with fully enveloping grasps, but payload dropped significantly for pinching grasps, which depend on mass and friction coefficient, and a 250 g target with a 546 cm² top surface area marked the system's limit. Error analysis against motion-capture ground truth showed maximum pose estimation errors of 5 cm in translation and under 10 degrees in rotation (under 6 degrees in yaw). VIO drift was around 2 cm at 0.5 m/s. Pre-grasp trajectory tracking errors were approximately 5 cm in position and 0.05 m/s in velocity; post-grasp, errors increased in the vertical direction and gradually decreased as the controller adapted. In a grasp-speed study, the desired speed for the two-liter bottle was increased to 1.25, 2, and 3 m/s. Tracking errors grew with speed: at 2 m/s the success rate was 3/10, reported as the fastest vision-based aerial grasp to date, and at 3 m/s VIO drift significantly degraded performance. Outdoor experiments at 0.5 m/s achieved 8/10 successful grasps, demonstrating robustness to environmental challenges. Experiments with moving targets (a med-kit on a quadruped robot and a two-liter bottle on a turntable), using motion capture for localization, showed successful grasps, with success rates decreasing at higher relative speeds; tracking errors grew primarily in the longitudinal direction for the quadruped and the lateral direction for the turntable.
Discussion
This work advances aerial manipulation by combining a soft gripper, an agile quadrotor, and a state-of-the-art onboard perception system. The compliant fingers are key to achieving high-speed grasps despite errors in pose estimation and trajectory tracking, and the gripper's simple design enhances manufacturability and repairability. The fully onboard perception system, unlike motion-capture-dependent alternatives, makes the approach viable for real-world tasks such as emergency supply distribution, package delivery, and warehouse automation. The FEM-based modeling and control of the soft fingers proved effective for reliable grasps. Future work includes generalizing the perception pipeline to broader object categories, increasing payload capacity by scaling up the drone, and achieving fully vision-based grasps of moving targets; the latter requires accurate target motion prediction and handling of the timing offset between image capture and pose estimation.
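To make the timing-offset problem concrete: a pose estimate describes where the target was when the image was captured, not where it will be when the grasp is executed. A minimal sketch of one common remedy is below, a constant-velocity extrapolation from the image timestamp to the planning time; this is an illustrative assumption, not the paper's prediction model, and all names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class TargetEstimate:
    position: tuple  # (x, y, z) in meters, valid at the image timestamp
    velocity: tuple  # (vx, vy, vz) in m/s, e.g. finite-differenced poses
    stamp: float     # time the image was captured, in seconds

def predict_target(est: TargetEstimate, t_plan: float) -> tuple:
    """Extrapolate the target position to planning time t_plan.

    Between image capture and trajectory execution, the perception
    pipeline's latency plus the planning horizon elapses; a
    constant-velocity model is the simplest way to compensate.
    """
    dt = t_plan - est.stamp
    return tuple(p + v * dt for p, v in zip(est.position, est.velocity))

# Target observed at t=0, moving at 0.5 m/s in x; plan for t=0.2 s
# (e.g. 80 ms of pipeline latency plus a 120 ms execution horizon).
est = TargetEstimate(position=(1.0, 0.0, 0.2),
                     velocity=(0.5, 0.0, 0.0),
                     stamp=0.0)
goal = predict_target(est, t_plan=0.2)
```

A constant-velocity model degrades quickly for maneuvering targets, which is why the authors identify accurate motion prediction as an open challenge.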
Conclusion
This research significantly advances high-speed, versatile aerial manipulation through the integration of a soft gripper and onboard perception. The system demonstrates successful high-speed grasps in diverse environments, exceeding the capabilities of existing methods. Future research will focus on handling more diverse objects, increasing payload capacity, and achieving fully vision-based grasping of moving targets.
Limitations
The current pose estimation pipeline assumes prior knowledge of the target's geometry and visual features. The drone's payload capacity limits the size and weight of graspable objects. Fully vision-based grasps of moving targets remain a challenge due to motion prediction errors and timing offsets. The outdoor success rate is lower than indoors due to factors such as wind and uneven terrain. The semi-automated keypoint annotation tool requires manual initial labeling of keypoints, which slows the addition of training data for new objects.