NeuralFeels with neural fields: Visuotactile perception for in-hand manipulation
S. Suresh
Ground-truth shape and pose: Real-world object meshes are acquired with a commercial dual-camera infrared scanner (Revopoint POP 3; ~0.05 mm precision). Objects are scanned on a turntable; low-texture objects are marked with dotted stickers, and hole-filling is applied to unseen regions. Some meshes are sourced from the YCB and ContactDB sets. Pseudo ground-truth object pose in the real world is obtained by feeding synchronized RGB-D streams from three cameras (front-left, back-right, top-down; overlapping FoVs) into the pose pipeline with the known shape; the tracker runs at 0.5 Hz, and the initial orientation is manually annotated.
Tactile transformer (data, model, training): The architecture is based on the Dense Prediction Transformer (DPT), with a ViT backbone (pretrained ViT-small, embedding dim 384, patch size 16) and a convolutional decoder that produces dense depth. Training data: 10K simulated tactile interactions on the surfaces of 40 YCB objects (TACTO renderer). Optimization: Adam (batch size 100) with a mean-squared-error depth reconstruction loss; train/val/test split 60/20/20. Data augmentation: composite simulated renderings over real DIGIT backgrounds from 25 sensors; add pose noise (rotation/translation and surface-normal direction); vary indentation depth; randomize LED lighting (position/direction/intensity); add Gaussian RGB noise (σ=7 px); and apply random flips, crops, and rotations. Reported test error: 0.042 mm on average over simulated images.
Visual depth segmentation: The Segment Anything Model (SAM) is prompted with positive and negative prompts; over-segmentation is pruned by selecting the mask whose area is closest to the expected area from simulation. At t=0, fingertip occlusion is disambiguated with a distance heuristic that decides whether the fingertips are in front of or behind the object. Robot finger pixels are used as negative prompts to reduce false positives.
Neural field (shape) optimizer: Instant-NGP SDF with Adam (lr=2e-4, weight decay=1e-6). Positional encoding via hash table size 2^19; 3-layer MLP, width 64. Initialize with uniform weights and 500 shape iterations on first keyframe K0. For evaluation, freeze network and query a 200^3 feature grid within a 150 mm cube centered at initial pose x0; discard ray samples outside the bounding box; extract meshes via marching cubes and color by averaging colored object point clouds with a Gaussian kernel.
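The evaluation-time query region described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `query_grid` and `inside_box` are hypothetical, coordinates are assumed to be in metres, and the frozen-network query and marching-cubes steps are left out.

```python
import numpy as np

SIDE_M = 0.150      # 150 mm cube from the text, expressed in metres (assumed unit)
RESOLUTION = 200    # 200^3 grid from the text

def query_grid(center, side=SIDE_M, resolution=RESOLUTION):
    """Return (resolution^3, 3) query points filling a cube centered at `center`."""
    half = side / 2.0
    axis = np.linspace(-half, half, resolution)
    xx, yy, zz = np.meshgrid(axis, axis, axis, indexing="ij")
    offsets = np.stack([xx, yy, zz], axis=-1).reshape(-1, 3)
    return offsets + np.asarray(center, dtype=float)

def inside_box(points, center, side=SIDE_M, tol=1e-9):
    """Boolean mask of points inside the cube; samples outside would be discarded."""
    delta = np.abs(np.asarray(points, dtype=float) - np.asarray(center, dtype=float))
    return np.all(delta <= side / 2.0 + tol, axis=1)
```

At the full 200^3 resolution the grid holds 8 million points, so a real pipeline would likely query the network in batches; a smaller `resolution` is convenient for quick checks.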
Pose optimizer: Vectorized SE(3) pose graph optimizer in Theseus with 20 Levenberg–Marquardt iterations (step size 1.0). Keyframe window size n=3. Run 2 pose iterations per shape iteration. Loss weights: wsdf=0.01, wreg=0.01, wicp=1.0. The optimizer aligns neural SDF renderings to predicted tactile depth by sampling 3D points from measured depthmaps, treating vision-based touch as another perspective camera.
Keyframing: At each timestep t, two checks decide whether the visuotactile frames are added to the keyframe set K: (1) Information Gain Check: if the average SDF rendering loss against the current depth exceeds d_thresh (d_thresh=0.01 m), add to K. (2) Image Time Check: if the time since the last keyframe exceeds t_max (t_max=0.2 s), force-add to K. Otherwise, the frames are used in the backend optimizer at time t and then discarded, which avoids redundancy while keeping recent information.
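The two keyframing checks can be sketched as a small selector. This is an illustrative reading of the description, not the released implementation: the class name `KeyframeSelector`, its fields, and the string reasons it returns are all hypothetical. The thresholds are the values quoted above.

```python
from dataclasses import dataclass, field

D_THRESH_M = 0.01   # information-gain threshold on the average rendering loss (from the text)
T_MAX_S = 0.2       # maximum allowed time between keyframes (from the text)

@dataclass
class KeyframeSelector:
    d_thresh: float = D_THRESH_M
    t_max: float = T_MAX_S
    last_keyframe_time: float = float("-inf")   # forces the first frame into K
    keyframes: list = field(default_factory=list)

    def update(self, t, frame, avg_render_loss):
        """Return why the frame was kept, or None if it is used once and discarded."""
        if avg_render_loss > self.d_thresh:
            reason = "information_gain"
        elif t - self.last_keyframe_time > self.t_max:
            reason = "time"
        else:
            return None  # transient frame: optimized at time t, not stored
        self.keyframes.append(frame)
        self.last_keyframe_time = t
        return reason
```

Initializing `last_keyframe_time` to negative infinity is a design choice of this sketch so that the very first frame always becomes a keyframe.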
Compute/timings: Results reported by replaying trials at 1 Hz. System can run pose optimizer at 10 Hz and full backend at 5 Hz. Rotation policy runs at 2 Hz, limiting large object motions between updates. Experiments on Nvidia RTX 4090; aggregate evaluation on V100 cluster.
In-hand rotation policy: Train in simulation with access to physical-property embedding z_t (position, size, mass, friction, CoM). Policy maps joint angles q_t and z_t to PD targets a∈R^16; optimized with PPO across parallel environments. Reward: r = r_rot + λ_pose r_pose + λ_linvel r_linvel + λ_work r_work + λ_torque r_torque, where r_rot = clamp(ω·k, r_min, r_max), r_pose = ||q − q_init||^2, r_torque = −||τ||^2_2, r_work = −τ^T q̇, r_linvel = −||v̇||^2_2. Deployment uses an estimator to infer z_t from proprioceptive history; policy is trained to operate with DIGIT sensors on distal ends, relying on friction and maintaining elastomer contact. Training objects: spheres and cylinders with multiple aspect ratios around z-axis (length:diameter {0.8:1 to 1:1.2}). Time-to-fall metric (>20 s considered success) used for evaluation. Real-world note: higher control gains increase contact indentation but reduce rotation stability; more elastic sensors (e.g., GelSight) could increase contact but have durability issues.
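The reward terms listed above can be written out as a short NumPy sketch. It follows the symbols and signs exactly as they appear in the text (including r_pose written without a leading minus sign, so whether it acts as a penalty is left to the sign of λ_pose). The function name, the `weights` dictionary, and all numeric values in the usage are hypothetical; the source does not give the λ coefficients or the clamp bounds.

```python
import numpy as np

def rotation_reward(omega, k, q, q_init, tau, q_dot, v_dot, weights, r_min, r_max):
    """Combine the reward terms as written in the text; `weights` holds the lambdas."""
    omega, k = np.asarray(omega, float), np.asarray(k, float)
    q, q_init = np.asarray(q, float), np.asarray(q_init, float)
    tau, q_dot = np.asarray(tau, float), np.asarray(q_dot, float)
    v_dot = np.asarray(v_dot, float)

    r_rot = float(np.clip(np.dot(omega, k), r_min, r_max))  # clamp(omega . k, r_min, r_max)
    r_pose = float(np.sum((q - q_init) ** 2))               # ||q - q_init||^2, sign as in the text
    r_torque = -float(np.sum(tau ** 2))                     # -||tau||_2^2
    r_work = -float(np.dot(tau, q_dot))                     # -tau^T q_dot
    r_linvel = -float(np.sum(v_dot ** 2))                   # -||v_dot||_2^2, symbol as in the text

    return (r_rot
            + weights["pose"] * r_pose
            + weights["linvel"] * r_linvel
            + weights["work"] * r_work
            + weights["torque"] * r_torque)
```

Keeping the coefficients in a dictionary makes it easy to switch individual terms off (weight 0) when inspecting how each one shapes the behaviour.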
Role and ablations of touch: Each final mesh vertex is assigned to its nearest measurement (vision or touch), or marked as hallucinated if it lies more than 5 mm from any measurement, to illustrate modality coverage. Touch accelerates shape completion and improves precision; tactile features (edges/patches) align with neural renders in both sim and real. Binary-contact approximation (used for comparison): tactile images are thresholded to detect contact, downsampled to 2×2, and their depths set to the maximum detected depth. Resolution ablation: tactile images are downsampled by 2×, 4×, and 8× (240×320 → 30×40) to assess the effect on SLAM performance; higher resolution is preferred, and the binary-contact approximation degrades performance sharply.
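The two tactile ablations can be illustrated on a depth map with NumPy. This is one plausible reading of the description, under stated assumptions: block averaging is assumed for the downsampling (the text does not name the interpolation), the contact threshold value is hypothetical, depth is assumed to be zero where there is no contact, and both function names are invented for this sketch.

```python
import numpy as np

def downsample(depth, factor):
    """Block-average an (H, W) depth map by an integer factor (assumed method)."""
    h, w = depth.shape
    if h % factor or w % factor:
        raise ValueError("image size must be divisible by the factor")
    blocks = depth.reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))

def binary_contact(depth, contact_thresh):
    """Collapse a depth map to a 2x2 image carrying only 'contact or not'.

    If any pixel exceeds the threshold, every cell is set to the maximum
    detected depth; otherwise the result is all zeros.
    """
    in_contact = depth > contact_thresh
    if not in_contact.any():
        return np.zeros((2, 2), dtype=depth.dtype)
    return np.full((2, 2), depth[in_contact].max(), dtype=depth.dtype)
```

For a 240×320 input, `downsample(depth, 8)` gives the 30×40 image used at the coarsest ablation level, while `binary_contact` discards all spatial detail of the contact patch.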
Additional analyses: Camera viewpoint study with three real-world RGB-D cameras shows closer viewpoints (front-left/back-right) yield lower pose error than a farther top-down camera. Class-specific metrics show symmetric objects (e.g., peach, pear) challenge depth-based pose; large objects (pepper grinder) suffer partial visibility for shape completeness; smaller objects may yield higher F-scores with 5 mm thresholds. Shape prior quality ablation: coarser voxelization (0.5 mm → 10 mm) increases pose drift, motivating good priors. Pose drift: initial higher drift reduces as shape model improves; lack of loop closures leads to gradual accumulation. Removing neural SDF loss (ICP + regularizer only) yields large pose errors and poor final shapes due to lack of frame-to-model constraints and ICP sensitivity to overlap and local minima.
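For readers unfamiliar with the shape metric, the following sketch computes a thresholded F-score between two point sets using the commonly used definition (precision and recall of points within a distance τ, combined by harmonic mean). It is assumed, not confirmed by the text, that the reported F-scores follow this definition; the function name and the brute-force distance computation are choices of this sketch and only suit small point sets.

```python
import numpy as np

def f_score(pred, gt, tau=0.005):
    """Return (F, precision, recall) for point sets in metres with threshold tau.

    precision: fraction of reconstructed points within tau of the ground truth.
    recall:    fraction of ground-truth points within tau of the reconstruction.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    # Pairwise distances, shape (len(pred), len(gt)); O(N*M) memory.
    dists = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    precision = float((dists.min(axis=1) < tau).mean())
    recall = float((dists.min(axis=0) < tau).mean())
    if precision + recall == 0.0:
        return 0.0, precision, recall
    return 2.0 * precision * recall / (precision + recall), precision, recall
```

Under this definition, recall tracks how much of the object has been covered (shape completion) and precision tracks how accurate the reconstructed surface is, which matches how the two quantities are discussed in the text.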
- Policy performance: Across 8 FeelSight objects in simulation (5000 trials/object with randomized parameters), average success rate is 73.87 ± 17.48% (time-to-fall > 20 s). Object-wise examples: Rubik’s cube 92.78%, Peach 91.16%, Rubber duck 86.34%, Large dice 84.32%, Lego block 77.14%, Elephant 63.14%, Potted meat can 53.68%, Pear 42.42%.
- DIGIT sensing vs binary contact (5-trial averages): Pose error (lower is better): Simulation (Rubik’s cube) 3.07 ± 0.61 mm vs 4.46 ± 1.66 mm; Real (Bell pepper) 4.71 ± 0.68 mm vs 6.05 ± 0.97 mm. Shape (F-score higher is better): Simulation 0.98 ± 0.00 vs 0.67 ± 0.01; Real 0.76 ± 0.02 vs 0.55 ± 0.06 (p < 0.001).
- Neural SDF loss importance (5 real Rubik’s cube trials): Ours 5.862 mm pose error and 0.883 F-score vs ICP SLAM 13.486 mm and 0.545 (p < 0.001).
- Touch accelerates and improves shape quality: Over five Rubik’s cube trials, adding touch yields faster shape completion (recall) and improved precision over time in both sim and real; tactile depth optimization refines accuracy.
- Tactile feature alignment: Predicted tactile edges and patches are consistent with neural field renderings in sim and real, enabling effective pose updates from tactile depth.
- High-resolution touch benefits: Progressive downsampling (240×320→30×40) degrades pose and shape metrics; binary-contact baseline shows sharp performance drop. The pixel-sampling loss benefits from higher-resolution inputs.
- Viewpoint effects: Closer cameras (front-left ~27 cm; back-right ~28 cm) produce lower pose errors than a farther top-down camera (~49 cm), which suffers from worse depth and segmentation.
- Pose drift dynamics: Initial drift is higher due to unknown shape; as the SDF improves, drift stabilizes. Without loop closures, small errors accumulate over time.
- Shape prior quality: Coarser SDF voxelization increases pose drift, underscoring the value of high-quality shape priors for tracking.
- Modality coverage: Vision provides broad coverage; touch contributes significant local detail; neural SDFs can plausibly hallucinate unobserved regions based on nearby information.
- Compute: Real-time-capable components demonstrated (pose optimizer ~10 Hz, backend ~5 Hz), with results reported at 1 Hz playback; policy at 2 Hz keeps inter-step motions moderate.
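The aggregate figure in the policy-performance item can be reproduced from the eight object-wise success rates. The short check below uses only the numbers quoted there; it shows that the reported spread matches the population standard deviation across objects rather than the sample standard deviation.

```python
import statistics

# Object-wise success rates (%) as listed in the policy-performance item.
success_rate = {
    "Rubik's cube": 92.78,
    "Peach": 91.16,
    "Rubber duck": 86.34,
    "Large dice": 84.32,
    "Lego block": 77.14,
    "Elephant": 63.14,
    "Potted meat can": 53.68,
    "Pear": 42.42,
}

values = list(success_rate.values())
mean = statistics.fmean(values)
pop_std = statistics.pstdev(values)     # divides by N
sample_std = statistics.stdev(values)   # divides by N - 1

print(f"mean = {mean:.2f}%")                       # mean = 73.87%
print(f"population std = {pop_std:.2f}%")          # population std = 17.48%
print(f"sample std = {sample_std:.2f}%")
```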
The supplementary results demonstrate that integrating vision-based tactile sensing with neural SDF modeling leads to faster and more accurate object reconstruction during in-hand manipulation, which in turn improves pose tracking. Tactile depth from a high-resolution DIGIT sensor extends the effective field-of-view at contact, complementing incomplete or noisy visual depth, particularly under occlusions. The frame-to-model neural SDF loss is critical; without it, ICP-based alignment suffers from local minima, low overlap, and degraded shape, leading to larger pose errors. Ablations confirm that finer tactile image resolution yields better optimization signals than binary contact, and that camera proximity and segmentation reliability materially affect tracking accuracy. Improved shape priors reduce pose drift, evidencing the tight coupling between reconstruction fidelity and tracking. Collectively, these findings support the hypothesis that visuotactile fusion via neural fields enhances SLAM robustness for in-hand manipulation in both simulation and real-world deployments.
This work presents a visuotactile SLAM framework that fuses RGB-D and vision-based tactile depth within a neural SDF to jointly estimate object shape and pose during in-hand manipulation. The approach leverages a tactile transformer for depth prediction, an Instant-NGP SDF for shape, and a Theseus-based SE(3) optimizer with a frame-to-model neural loss. Results show that tactile inputs accelerate shape completion and improve precision, reduce pose errors versus binary/contact-only sensing, and substantially outperform ICP-based baselines. Policy-driven exploration enables rich contact interactions that benefit reconstruction. Future directions include: integrating loop-closure mechanisms to curb long-horizon drift; improving segmentation under challenging appearances and robot-induced occlusions; enhancing sim-to-real transfer for contact richness (e.g., controller tuning or alternative elastomer sensors) without compromising stability; and extending to fine manipulation tasks requiring even higher tactile resolution (e.g., insertions).
- Sim-to-real gap in tactile contacts: real contact patches are sparser due to controller rate and PD gains; more aggressive settings increase indentation but destabilize rotation.
- Sensor trade-offs: more elastic tactile sensors (e.g., GelSight) could improve contact but face durability and consistency issues over long experiments.
- Viewpoint and perception sensitivity: farther camera viewpoints and unreliable segmentation (e.g., top-down camera) increase pose errors; segmentation can fail on complex/textureless objects or when object resembles the robot hand.
- Object geometry challenges: symmetric objects (peach, pear) increase pose ambiguity; large/elongated objects (pepper grinder) have partial visibility; finger gait may not provide full tactile coverage.
- Lack of long-term loop closures leads to gradual pose drift accumulation.
- ICP limitations without frame-to-model constraints result in degraded tracking and shape; careful tuning and sufficient overlap are otherwise required.
- Binary-contact or low-resolution tactile inputs significantly reduce SLAM performance.
- Real-world cube rotation is harder due to hand morphology despite high simulation success, indicating morphology-driven transfer limitations.

