Capturing forceful interactions with deformable objects during manipulation benefits applications such as virtual reality, telemedicine, and robotics. This paper presents a visual-tactile recording and tracking system for manipulation, featuring a stretchable tactile glove with 1152 force-sensing channels and a visual-tactile joint learning framework that estimates dynamic hand-object states. An active suppression method based on symmetric response detection and adaptive calibration improves force measurement accuracy by 45.3% (to 97.6%). The learning framework processes visual-tactile sequences and reconstructs hand-object states, achieving an average reconstruction error of 1.8 cm across 24 objects from 6 categories.