Introduction
Human-machine interaction (HMI) systems are crucial for bridging the physical and digital worlds, particularly in the metaverse. Natural user interfaces are highly desirable, yet while non-forceful interactions such as hand gestures have been studied extensively with a range of sensing technologies (IMUs, EMG, strain sensors, video recording, triboelectric sensors), forceful interactions, specifically human manipulation of objects, remain far less explored. This gap limits applications such as virtual reality (VR), telemedicine, robotics, and the development of robust AI models that understand real-world interactions. Previous research on forceful manipulation has largely focused on semantic recognition and spatial localization, predicting object category and position, but has fallen short of capturing the complete hand-object state, especially during interactions with deformable objects. The challenge lies in accurately recording and tracking the complex interplay between the hand and the object's geometry, particularly when deformations are partially or fully occluded by the hand itself. Addressing this requires a system that records visual-tactile sensory data and estimates fine-grained hand-object states, with tactile perception prioritized for analyzing deformations within the contact area and visual perception used to estimate the overall object state.
Literature Review
Existing research on tactile sensing for object manipulation has explored various approaches to the challenges of high-density sensing, stretchability, and strain interference. High-density tactile arrays are crucial for capturing detailed contact information, and integrating these arrays into wearable gloves using textile techniques is a common approach. Stretchable interfaces are essential for conformal contact with deformable objects, but their inherent stretchability introduces strain interference that degrades the accuracy of force measurements. Previous strain-insensitivity methods have relied on structural strategies (stretchable geometric structures, stress isolation structures, negative Poisson's ratio structures) and material strategies (strain redundancy, localized microcracking, nanofiber network encapsulation). These 'source-protection' approaches concentrate on reducing strain interference at the sensor input but lack quantitative assessment of the residual interference and any adaptive correction capability. Visual-tactile joint learning has also been investigated, often combining camera-based tactile sensors, which provide high-resolution local geometry, with visual images for hand and object pose estimation. However, these models are typically limited to static settings and do not consider the temporal consistency of hand movements and object deformation. The current work aims to overcome these limitations by combining high-density stretchable tactile sensing with a robust visual-tactile learning framework.
Methodology
This research introduces ViTaM (Visual-Tactile Manipulation), a system designed to capture forceful human manipulation of deformable objects. The system consists of a high-density, stretchable tactile glove and a 3D camera. The tactile glove, a key innovation, features 1152 sensing channels distributed across the palm and fingers, enabling high-resolution force recording at 13 Hz.

A central element of the glove design is its active strain interference suppression method, which uses a dual-input approach: positive and negative strain-sensing membranes exhibit opposite resistance changes under strain, and by monitoring these opposing responses the system can quantitatively detect and suppress strain interference, improving force measurement accuracy. The positive- and negative-effect membranes are fabricated from carbon nanotubes (CNTs) embedded in natural latex substrates, with different CNT weight ratios producing the opposite stretching-resistive effects. The membranes are integrated with conductive fabric wires through a fully woven wiring method that eliminates adhesives and improves reliability and conformality. Suppression operates as a closed-loop adaptive process: once strain is detected through simultaneous rising edges in the output voltages of the positive- and negative-effect membranes, the system applies a local-domain curve interpolation to dynamically adjust the force estimation curve to the detected strain.

The 3D camera provides visual observations covering the entire manipulation process. A visual-tactile joint learning framework processes both data streams: two separate neural network branches encode the visual and tactile information, and a temporal cross-attention module fuses these features while enforcing temporal consistency, allowing the model to reconstruct both the fine-grained surface deformation and the complete object geometry. The model is trained on 7680 samples generated in RFUniverse, a finite element method-based simulation environment, and tested on real-world data. The evaluation includes qualitative and quantitative analyses of object reconstruction accuracy for both deformable and rigid objects, comparing the proposed approach to vision-only methods and to a previous visual-tactile method built on a gel-based optical tactile sensor (VTacO).
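To make the closed-loop suppression step more concrete, the sketch below shows one way the rising-edge detection and local-domain curve interpolation could be realized for a single sensing channel. It is a minimal illustration under assumed names, thresholds, and calibration-table formats, not the authors' implementation.

```python
import numpy as np

# Illustrative single-channel sketch of the closed-loop strain suppression idea:
# function names, thresholds, and calibration-table formats are assumptions.

def rising_edge(v: np.ndarray, threshold: float = 0.05) -> np.ndarray:
    """Mark frames where the channel voltage rises by more than `threshold`."""
    dv = np.diff(v, prepend=v[0])
    return dv > threshold

def detect_strain(v_pos: np.ndarray, v_neg: np.ndarray) -> np.ndarray:
    """Flag strain when the positive- and negative-effect membranes rise together."""
    return rising_edge(v_pos) & rising_edge(v_neg)

def correct_force(raw: np.ndarray, strain: np.ndarray, strain_mask: np.ndarray,
                  calib_strains: np.ndarray, calib_curves: np.ndarray) -> np.ndarray:
    """
    Assumed form of local-domain curve interpolation: blend the two pre-recorded
    force-response curves that bracket the estimated strain level, then re-map
    the raw reading through the blended curve.  calib_curves[k] is a lookup
    table of (raw output, calibrated force) pairs recorded at calib_strains[k],
    with the raw-output column monotonically increasing.
    """
    corrected = raw.copy()
    for t in np.where(strain_mask)[0]:
        i = int(np.clip(np.searchsorted(calib_strains, strain[t]),
                        1, len(calib_strains) - 1))
        w = (strain[t] - calib_strains[i - 1]) / (calib_strains[i] - calib_strains[i - 1])
        curve = (1.0 - w) * calib_curves[i - 1] + w * calib_curves[i]
        corrected[t] = np.interp(raw[t], curve[:, 0], curve[:, 1])
    return corrected
```

In the real glove this logic would run per channel across the 1152-channel array and per frame at 13 Hz; the strain level passed to `correct_force` is assumed here to be estimated from the differential response of the two membranes.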
Key Findings
The developed stretchable tactile glove demonstrated high accuracy in force measurement, achieving 97.6% accuracy after the active strain interference suppression method was applied, a 45.3% improvement over uncalibrated measurements. The visual-tactile joint learning framework reconstructed hand-object states with an average reconstruction error of 1.8 cm across diverse objects (24 objects from 6 categories, including both deformable and rigid objects). The system captured fine-grained details even in areas occluded by the hand and accurately represented the dynamic deformation of deformable objects. Comparisons with vision-only methods and with VTacO showed significant improvements in reconstruction accuracy; for example, the chamfer distance for reconstructing a sponge was 0.467 cm, a 36% improvement over VTacO. The temporal transformer module further improved performance by exploiting the consistency of inter-frame features and force differences. The system performed robustly on tasks involving deformable objects, such as kneading plasticine and pinching a sponge. Analysis of correlations between tactile sensing blocks revealed heightened synergistic effects between the fingers and the palm after strain interference correction, highlighting the importance of the calibration step. Experiments also showed that the proposed method effectively reduces the influence of hand pose variation on pressure readings. The full pipeline runs at an average inference speed of 3–5 frames per second on an Nvidia RTX 4090 GPU.
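As a point of reference for the reconstruction numbers above, the chamfer distance between a reconstructed and a ground-truth point cloud can be computed as in the sketch below. This uses the common symmetric mean nearest-neighbour formulation; the paper's exact normalization may differ.

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(pred: np.ndarray, gt: np.ndarray) -> float:
    """
    Symmetric chamfer distance between point clouds pred (N, 3) and gt (M, 3),
    expressed in the same units as the coordinates (e.g. cm).
    """
    d_pred = cKDTree(gt).query(pred)[0]  # nearest ground-truth point for each predicted point
    d_gt = cKDTree(pred).query(gt)[0]    # nearest predicted point for each ground-truth point
    return float(d_pred.mean() + d_gt.mean())
```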
Discussion
The findings demonstrate the successful integration of a highly accurate, stretchable tactile glove with a robust visual-tactile learning framework, addressing critical limitations in capturing forceful interactions with deformable objects. The active strain interference suppression method is a significant advancement in tactile sensor technology, achieving higher accuracy than previous passive approaches. The visual-tactile joint learning framework effectively leverages both visual and tactile information to achieve highly accurate object reconstruction, surpassing the performance of vision-only methods and a previous state-of-the-art visual-tactile approach. The results highlight the importance of incorporating tactile data, particularly with accurate strain compensation, for understanding complex hand-object interactions. This work significantly advances the field of HMI by providing a more comprehensive and accurate method for capturing and understanding human manipulation, paving the way for more advanced applications in VR, telemedicine, and robotics.
Conclusion
This research presents ViTaM, a novel visual-tactile system that significantly advances the capabilities of capturing and understanding forceful human-object interactions, particularly with deformable objects. The high-accuracy stretchable tactile glove and the integrated visual-tactile learning framework provide a robust solution for reconstructing hand-object states with high fidelity. Future work will focus on integrating ViTaM into larger robotic systems, allowing for more seamless and intuitive interaction between humans and robots in diverse environments. Furthermore, exploring the application of this technology for a deeper understanding of human behaviors and for enhancing the dexterity of robot manipulation in general scenarios is a promising direction for future research.
Limitations
While the ViTaM system demonstrates significant advancements, several limitations exist. The current prototype of the tactile glove has a limited number of sensing channels (1152), which could be further increased to capture even finer details of hand-object interactions. The system's performance is dependent on the quality of the visual input, and challenges may arise in scenarios with poor lighting or significant occlusion. The computational cost of the visual-tactile learning framework might limit real-time performance for very high-resolution sensors or complex manipulation tasks. Further investigation is needed to assess the system's robustness across a broader range of object materials, shapes, and manipulation tasks.