Introduction
Human-machine interaction (HMI) systems are crucial for bridging the physical and digital worlds, particularly in the metaverse. Natural user interfaces are highly desirable, yet while non-forceful interactions such as hand gestures have been studied extensively with a range of sensing technologies (IMUs, EMG, strain sensors, video recording, triboelectric sensors), forceful interactions, specifically human manipulation of objects, remain far less explored. This gap limits applications such as virtual reality (VR), telemedicine, robotics, and the development of robust AI models that understand real-world interactions. Previous research on forceful manipulation has largely focused on semantic recognition and spatial localization, predicting object category and position, but has fallen short of capturing the complete hand-object state, especially during interactions with deformable objects. The challenge lies in accurately recording and tracking the complex interplay between the hand and the object's geometry, particularly when deformations are partially or fully occluded by the hand itself. Addressing this requires a system that records visual-tactile sensory data and estimates fine-grained hand-object states, with tactile perception prioritized for analyzing deformations within the contact area and visual perception used to estimate the overall object state.
Literature Review
Existing research on tactile sensing for object manipulation has explored various approaches to the challenges of high-density sensing, stretchability, and strain interference. High-density tactile arrays are crucial for capturing detailed contact information, and integrating these arrays into wearable gloves using textile techniques is a common approach. Stretchable interfaces are essential for conformal contact with deformable objects, but their inherent stretchability introduces strain interference that degrades the accuracy of force measurements. Previous strain-insensitivity methods have relied on structural strategies (stretchable geometric structures, stress isolation structures, negative Poisson's ratio structures) and material strategies (strain redundancy, localized microcracking, nanofiber network encapsulation). These 'source-protection' approaches concentrate on reducing strain interference at the sensor input but lack quantitative assessment of the residual interference and any adaptive correction capability. Visual-tactile joint learning has also been investigated, often combining camera-based tactile sensors, which provide high-resolution local geometry, with visual images for hand and object pose estimation. However, these models are typically limited to static settings and do not consider the temporal consistency of hand movements and object deformation. The current work aims to overcome these limitations by combining high-density stretchable tactile sensing with a robust visual-tactile learning framework.
Methodology
This research introduces ViTaM (Visual-Tactile Manipulation), a system designed to capture forceful human manipulation of deformable objects. The system consists of a high-density, stretchable tactile glove and a 3D camera. The tactile glove, a key innovation, features 1152 sensing channels distributed across the palm and fingers, enabling high-resolution force recording at 13 Hz.

A central element of the glove design is its active strain interference suppression method, which uses a dual-input approach: positive and negative strain-sensing membranes exhibit opposite resistance changes under strain, and by monitoring these opposing responses the system can quantitatively detect and suppress strain interference, improving force measurement accuracy. The positive- and negative-effect membranes are fabricated from carbon nanotubes (CNTs) embedded in natural latex substrates, with different CNT weight ratios producing the opposite stretching-resistive effects. The membranes are integrated with conductive fabric wires through a fully woven wiring method that eliminates adhesives and improves reliability and conformality. Suppression operates as a closed-loop adaptive process: once strain is detected through simultaneous rising edges in the output voltages of the positive- and negative-effect membranes, the system applies a local-domain curve interpolation to dynamically adjust the force estimation curve to the detected strain.

The 3D camera provides visual observations covering the entire manipulation process. A visual-tactile joint learning framework processes both data streams: two separate neural network branches encode the visual and tactile information, and a temporal cross-attention module fuses these features while enforcing temporal consistency, allowing the model to reconstruct both the fine-grained surface deformation and the complete object geometry. The model is trained on 7680 samples generated in RFUniverse, a finite element method-based simulation environment, and tested on real-world data. The evaluation includes qualitative and quantitative analyses of object reconstruction accuracy for both deformable and rigid objects, comparing the proposed approach to vision-only methods and to a previous visual-tactile method built on a gel-based optical tactile sensor (VTacO).
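To make the closed-loop suppression step more concrete, the sketch below shows one way the rising-edge detection and local-domain curve interpolation could be realized for a single sensing channel. It is a minimal illustration under assumed names, thresholds, and calibration-table formats, not the authors' implementation.

```python
import numpy as np

# Illustrative single-channel sketch of the closed-loop strain suppression idea:
# function names, thresholds, and calibration-table formats are assumptions.

def rising_edge(v: np.ndarray, threshold: float = 0.05) -> np.ndarray:
    """Mark frames where the channel voltage rises by more than `threshold`."""
    dv = np.diff(v, prepend=v[0])
    return dv > threshold

def detect_strain(v_pos: np.ndarray, v_neg: np.ndarray) -> np.ndarray:
    """Flag strain when the positive- and negative-effect membranes rise together."""
    return rising_edge(v_pos) & rising_edge(v_neg)

def correct_force(raw: np.ndarray, strain: np.ndarray, strain_mask: np.ndarray,
                  calib_strains: np.ndarray, calib_curves: np.ndarray) -> np.ndarray:
    """
    Assumed form of local-domain curve interpolation: blend the two pre-recorded
    force-response curves that bracket the estimated strain level, then re-map
    the raw reading through the blended curve.  calib_curves[k] is a lookup
    table of (raw output, calibrated force) pairs recorded at calib_strains[k],
    with the raw-output column monotonically increasing.
    """
    corrected = raw.copy()
    for t in np.where(strain_mask)[0]:
        i = int(np.clip(np.searchsorted(calib_strains, strain[t]),
                        1, len(calib_strains) - 1))
        w = (strain[t] - calib_strains[i - 1]) / (calib_strains[i] - calib_strains[i - 1])
        curve = (1.0 - w) * calib_curves[i - 1] + w * calib_curves[i]
        corrected[t] = np.interp(raw[t], curve[:, 0], curve[:, 1])
    return corrected
```

In the real glove this logic would run per channel across the 1152-channel array and per frame at 13 Hz; the strain level passed to `correct_force` is assumed here to be estimated from the differential response of the two membranes.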
Key Findings
The developed stretchable tactile glove demonstrated high accuracy in force measurement, achieving 97.6% accuracy after the active strain interference suppression method was applied, a 45.3% improvement over uncalibrated measurements. The visual-tactile joint learning framework reconstructed hand-object states with an average reconstruction error of 1.8 cm across diverse objects (24 objects from 6 categories, including both deformable and rigid objects). The system captured fine-grained details even in areas occluded by the hand and accurately represented the dynamic deformation of deformable objects. Comparisons with vision-only methods and with VTacO showed significant improvements in reconstruction accuracy; for example, the chamfer distance for reconstructing a sponge was 0.467 cm, a 36% improvement over VTacO. The temporal transformer module further improved performance by exploiting the consistency of inter-frame features and force differences. The system performed robustly on tasks involving deformable objects, such as kneading plasticine and pinching a sponge. Analysis of correlations between tactile sensing blocks revealed heightened synergistic effects between the fingers and the palm after strain interference correction, highlighting the importance of the calibration step. Experiments also showed that the proposed method effectively reduces the influence of hand pose variation on pressure readings. The full pipeline runs at an average inference speed of 3–5 frames per second on an Nvidia RTX 4090 GPU.
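As a point of reference for the reconstruction numbers above, the chamfer distance between a reconstructed and a ground-truth point cloud can be computed as in the sketch below. This uses the common symmetric mean nearest-neighbour formulation; the paper's exact normalization may differ.

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(pred: np.ndarray, gt: np.ndarray) -> float:
    """
    Symmetric chamfer distance between point clouds pred (N, 3) and gt (M, 3),
    expressed in the same units as the coordinates (e.g. cm).
    """
    d_pred = cKDTree(gt).query(pred)[0]  # nearest ground-truth point for each predicted point
    d_gt = cKDTree(pred).query(gt)[0]    # nearest predicted point for each ground-truth point
    return float(d_pred.mean() + d_gt.mean())
```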
Discussion
The findings demonstrate the successful integration of a highly accurate, stretchable tactile glove with a robust visual-tactile learning framework, addressing critical limitations in capturing forceful interactions with deformable objects. The active strain interference suppression method is a significant advancement in tactile sensor technology, achieving higher accuracy than previous passive approaches. The visual-tactile joint learning framework effectively leverages both visual and tactile information to achieve highly accurate object reconstruction, surpassing the performance of vision-only methods and a previous state-of-the-art visual-tactile approach. The results highlight the importance of incorporating tactile data, particularly with accurate strain compensation, for understanding complex hand-object interactions. This work significantly advances the field of HMI by providing a more comprehensive and accurate method for capturing and understanding human manipulation, paving the way for more advanced applications in VR, telemedicine, and robotics.
Conclusion
This research presents ViTaM, a novel visual-tactile system that significantly advances the capabilities of capturing and understanding forceful human-object interactions, particularly with deformable objects. The high-accuracy stretchable tactile glove and the integrated visual-tactile learning framework provide a robust solution for reconstructing hand-object states with high fidelity. Future work will focus on integrating ViTaM into larger robotic systems, allowing for more seamless and intuitive interaction between humans and robots in diverse environments. Furthermore, exploring the application of this technology for a deeper understanding of human behaviors and for enhancing the dexterity of robot manipulation in general scenarios is a promising direction for future research.
Limitations
While the ViTaM system demonstrates significant advancements, several limitations exist. The current prototype of the tactile glove has a limited number of sensing channels (1152), which could be further increased to capture even finer details of hand-object interactions. The system's performance is dependent on the quality of the visual input, and challenges may arise in scenarios with poor lighting or significant occlusion. The computational cost of the visual-tactile learning framework might limit real-time performance for very high-resolution sensors or complex manipulation tasks. Further investigation is needed to assess the system's robustness across a broader range of object materials, shapes, and manipulation tasks.