Decoding lip language using triboelectric sensors with deep learning

Engineering and Technology


Y. Lu, H. Tian, et al.

Discover an innovative lip-language decoding system developed by Yijia Lu, Han Tian, Jia Cheng, Fei Zhu, Bin Liu, Shanshan Wei, Linhong Ji, and Zhong Lin Wang. Utilizing self-powered triboelectric sensors and a cutting-edge dilated recurrent neural network model, this research achieves an impressive 94.5% accuracy, facilitating barrier-free communication for the voiceless.

~3 min • Beginner • English
Introduction
The study addresses the challenge of decoding lip language: silent, mouth-movement-based communication used by individuals with vocal cord, laryngeal, or lingual injuries, in covert conversations, and in noisy environments. While convenient for the speaker, lip language is difficult for others to interpret; most people can recognize only a few words by lip reading. Existing silent speech interfaces include magnet-, vision-, ultrasound-, inaudible-acoustic-, and surface-electromyography-based solutions. Vision-based methods are sensitive to head pose, lighting, and occlusions, and fail when the mouth is covered (e.g., by masks during pandemics). The authors propose a lip-language decoding system that uses non-invasive, contact triboelectric sensors to measure muscle movements around the mouth directly, overcoming the limitations of the visual modality. The system aims to provide barrier-free communication for people with speech disorders by capturing lip muscle motion with flexible, self-powered sensors and decoding the signals with deep learning. The study evaluates sensor design and characteristics, compares lip-motion with audio signals under multiple conditions (silent versus voiced speech, speaking speed, and inter-speaker variation), and develops a dilated RNN with prototype learning to robustly classify lip-motion patterns from limited data.
Literature Review
Prior work on inaudible communication includes extensive research on sign language recognition using inertial and stretchable sensor arrays and computer vision, as well as silent speech interfaces using magnetic implants/sensors, vision-based lipreading (e.g., LipNet), ultrasound articulatory imaging, inaudible acoustic sensing, and sEMG-based methods. Deep learning has significantly improved recognition performance across these domains. However, visual lipreading is impacted by pose, lighting, and occlusion (e.g., masks). Triboelectric nanogenerators (TENGs) have emerged for self-powered sensing in human motion detection, HCI, respiratory and sound sensing, tactile/pressure sensing, wearables, and AI-enabled interfaces, offering low-cost, flexible, power-free operation suitable for capturing facial muscle movements. This work situates itself at the intersection of SSIs and TENG-based wearable sensing, proposing contact triboelectric sensors for robust lip-motion capture coupled with deep learning classification.
Methodology
System design: The lip-language decoding system (LLDS) comprises flexible triboelectric sensors positioned at key mouth muscle junctions (orbicularis oris, depressor anguli oris, risorius, zygomaticus, buccinator), a mask for positioning, fixation, and pretension, readout electronics, and a neural network classifier. The sensors capture lip muscle movement during speech; the signals are acquired and processed, and a trained model decodes them into text or speech output.

Sensor structure and working principle: The sensors operate in contact-separation TENG mode with double electrodes. A polyurethane sponge spacer (with a central rectangular hole) separates nylon and PVC films, each backed by a copper electrode and packaged in polyimide to isolate the device from skin and sweat. When the mouth opens, compression of the sensor reduces the charge induced on the electrodes, driving current from the PVC-side electrode to the nylon-side electrode; when the mouth closes and the sensor is released, the current reverses. Each open-close cycle therefore produces a characteristic bidirectional current corresponding to lip motion.

Electrical characterization: Using a linear motor and a dynamometer, the sensors were characterized for open-circuit voltage and short-circuit current across applied forces (1-5 N), frequencies (1-5 Hz), areas (side lengths of 20-50 mm at 2 mm thickness), thicknesses (1-5 mm at 20 mm side length), and series/parallel configurations. Load curves (external resistances of 10^7-10^11 Ω), maximum output power and voltage, and durability (2000 press-release cycles at 5 N, 2 Hz) were measured. Comparison of tribo-pair materials (paper, PET, PTFE, PI, PVC, and FEP against nylon) guided selection; PVC was chosen for its toughness and output performance. The effect of artificial sweat was also assessed.

Data collection and signal analysis: Lip-motion signals were recorded for vowels, words, and phrases, with synchronized audio captured for comparison. Preprocessing included low-pass filtering at 20 Hz to remove power-frequency interference and amplitude normalization for waveform comparison (a short sketch follows). The effects of speaking speed (very slow to fast) and inter-speaker variability were evaluated, and timing offsets between lip-motion onsets and audio onsets were quantified for words within phrases. Signal decomposition and assembly across characters, words, and phrases were analyzed to compare the efficiency of continuous speech with isolated words.
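A minimal sketch of the preprocessing step described above, assuming a sampled lip-motion waveform. The 20 Hz low-pass cutoff comes from the text; the sampling rate, the fourth-order Butterworth design, and the peak-based normalization are illustrative assumptions, not details taken from the paper.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess_lip_signal(raw, fs=1000.0, cutoff_hz=20.0, order=4):
    """Low-pass filter and amplitude-normalize a raw lip-motion waveform.

    The 20 Hz cutoff follows the paper's description; fs (sampling rate)
    and the 4th-order Butterworth design are assumptions for illustration.
    """
    # Zero-phase Butterworth low-pass to suppress power-frequency interference
    b, a = butter(order, cutoff_hz / (fs / 2.0), btype="low")
    filtered = filtfilt(b, a, raw)

    # Normalize amplitude so waveforms from different trials are comparable
    peak = np.max(np.abs(filtered))
    return filtered / peak if peak > 0 else filtered

# Example: clean a synthetic trace sampled at 1 kHz
t = np.linspace(0, 2, 2000)
raw = np.sin(2 * np.pi * 3 * t) + 0.3 * np.sin(2 * np.pi * 50 * t)  # 3 Hz motion + 50 Hz hum
clean = preprocess_lip_signal(raw)
```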
Deep learning classifier: A multilayer dilated recurrent neural network (RNN) with gated recurrent units (GRUs) served as the feature extractor. Two classifiers were compared: (1) a softmax classification layer and (2) prototype learning, in which category prototypes are learned in feature space and samples are assigned by Euclidean distance. The architecture is a four-layer dilated RNN with 50 neurons per layer; dilation factors increase with depth to capture long-term dependencies with fewer parameters (a minimal sketch appears at the end of this section).

Dataset and training: 20 categories (fruit words) with 100 samples per category (80 for training, 20 for testing), trained for 500 epochs. Two-dimensional feature projections were visualized to assess interclass separability and intraclass compactness. Small-sample experiments reduced the training data to 80%, 60%, 40%, and 20% of the original to test robustness, and the effects of hyperparameters (number of layers and neurons) were explored.

Applications and hardware: Three demonstrations were built:
- personal identity verification (PIV) for door unlocking via lip-motion recognition (host versus guest);
- directional control of a toy car using four commands (go forwards, go backwards, left, right);
- lip-motion-to-speech conversion for voice-off users.
The mask was fabricated from nonwoven fabric with the lip area exposed; two sensors at the mouth corners were connected in series. The gate hardware used an STM32 microcontroller and an SG90 servo; the car used a BT08B Bluetooth module with an Arduino UNO controlling a DC motor. Data were acquired via an NI 9215 module and a Keithley 6514 electrometer, LabVIEW handled real-time processing, and audio was captured with a microphone.
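A minimal PyTorch sketch consistent with the classifier described above: four dilated GRU layers of 50 units feeding a prototype classifier that assigns samples by Euclidean distance. The specific dilation factors (1, 2, 4, 8), the sub-sequence interleaving used to realize dilation, the choice of the final time step as the feature vector, and the distance-based cross-entropy loss are illustrative assumptions rather than details from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def dilated_gru(gru, x, dilation):
    """Run a shared GRU over `dilation` interleaved sub-sequences of x (B, T, C)."""
    B, T, _ = x.shape
    pad = (-T) % dilation
    if pad:
        x = F.pad(x, (0, 0, 0, pad))          # pad the time dimension at the end
    T_p = x.shape[1]
    out = torch.zeros(B, T_p, gru.hidden_size, device=x.device)
    for i in range(dilation):
        sub = x[:, i::dilation, :]            # every `dilation`-th step, offset i
        out[:, i::dilation, :], _ = gru(sub)  # shared weights across sub-sequences
    return out[:, :T, :]

class DilatedGRUPrototypeNet(nn.Module):
    def __init__(self, in_features=1, hidden=50, num_classes=20,
                 dilations=(1, 2, 4, 8)):       # dilation factors are assumed
        super().__init__()
        self.dilations = dilations
        dims = [in_features] + [hidden] * len(dilations)
        self.grus = nn.ModuleList(
            nn.GRU(dims[i], hidden, batch_first=True) for i in range(len(dilations))
        )
        # One learnable prototype vector per class, living in feature space
        self.prototypes = nn.Parameter(torch.randn(num_classes, hidden))

    def forward(self, x):                       # x: (batch, time, in_features)
        for gru, d in zip(self.grus, self.dilations):
            x = dilated_gru(gru, x, d)
        feat = x[:, -1, :]                      # last-step hidden state as the feature
        dists = torch.cdist(feat, self.prototypes) ** 2
        return -dists                           # negative squared distances act as logits

# Usage: distance-based cross-entropy pulls features toward their class prototype
model = DilatedGRUPrototypeNet()
logits = model(torch.randn(8, 200, 1))          # 8 sequences, 200 time steps, 1 channel
loss = F.cross_entropy(logits, torch.randint(0, 20, (8,)))
```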
Key Findings
- Sensor performance:
  - Sensitivity: voltage sensitivity of ~0.376 V/N at 1 Hz for applied forces of 1 N and above.
  - Frequency effect (5 N, 1-5 Hz): current increased from 1.88 nA to 9.66 nA; voltage remained stable at ~0.89-0.91 V (fluctuation <0.02 V).
  - Area effect (5 N, 1 Hz; side length 20-50 mm; thickness 2 mm): voltage increased from 1.10 V to 3.08 V; current from 1.52 nA to 4.39 nA.
  - Thickness effect (5 N, 1 Hz; 1-5 mm): voltage and current increased up to 2 mm thickness, then plateaued.
  - Series vs parallel: the series configuration produced a stronger voltage owing to the effectively larger area.
  - Load characteristics: maximum output power of 3.50 nW at ~97.1 MΩ, implying an internal resistance of ~100 MΩ; maximum voltage varied with load over 10^7-10^11 Ω.
  - Durability: minimal attenuation over 2000 cycles at 5 N, 2 Hz (voltage 1.40 V to 1.38 V).
  - Artificial sweat had little effect on output.
- Lip-motion vs audio:
  - Silent and voiced speech produced highly consistent lip-motion waveforms; vocalization did not distort the signal shape.
  - Data volume: lip-motion signals occupied ~12 KB versus ~576 KB for sound (a 1/48 ratio), reducing acquisition, storage, and transmission load.
  - Speaking speed: waveform shapes were preserved from very slow to fast speech, while relative amplitudes changed (e.g., the normalized peak for "meet" decreased from 0.6725 to 0.0879 as speed increased).
  - Inter-speaker variability: participants showed distinct waveform traits and speaking speeds; down-troughs appeared for some speakers only, indicating potential for identity recognition.
  - Timing: lip-motion onset preceded audio by ~18-417 ms, with larger advances for open-sound initials (/zh/, /k/), smaller advances for nasals (/m/), and individual variation.
  - Phrase efficiency: continuous phrase reading was shorter than concatenated single-word utterances: "zhi ma kai men" took 1.648 s continuously versus 2.398 s as single words (31.3% faster); "open sesame" took 1.10 s versus 1.286 s (14.5% faster). A short arithmetic check follows this list.
- Deep learning recognition:
  - With 20 categories of 100 samples each (80/20 split), the dilated RNN with a softmax layer achieved 91.75% test accuracy; prototype learning achieved 94.50% and converged faster.
  - Small-sample setting: with only 20% of the training data, prototype learning reached 85.23% test accuracy versus 31.46% for softmax (~2.7x higher), demonstrating robustness to limited data.
  - Feature-space visualization showed greater interclass separability and intraclass compactness with prototype learning.
  - Confusion matrix: overall accuracy of 95%; 45% of words were recognized at 100% accuracy, 80% of words above 90%, and all words above 80%. Some confusions occurred between similar waveforms (e.g., Berry vs Date/Olive at ~10%).
- Applications: successful demonstrations of personal identity verification for door unlocking, robust within-speaker command recognition for toy-car control (distinct time- and frequency-domain patterns), and lip-to-speech conversion enabling voice-off communication.
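A quick arithmetic check of the ratios reported above, reproduced from the stated raw numbers (a sanity check only, no new data):

```python
# Phrase-reading speedups: (single-word time - continuous time) / single-word time
zhi_ma_kai_men = (2.398 - 1.648) / 2.398   # ≈ 0.313 → 31.3% faster
open_sesame = (1.286 - 1.100) / 1.286      # ≈ 0.145 → 14.5% faster

# Data-volume ratio: lip-motion signal size vs synchronized audio
data_ratio = 12 / 576                      # = 1/48

# Small-sample advantage of prototype learning over softmax at 20% training data
accuracy_ratio = 85.23 / 31.46             # ≈ 2.7x

print(f"{zhi_ma_kai_men:.1%}, {open_sesame:.1%}, 1/{round(1 / data_ratio)}, {accuracy_ratio:.1f}x")
```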
Discussion
The study demonstrates that flexible, self-powered triboelectric sensors can reliably capture lip muscle movements in contact mode, providing robust signals even under conditions that impair vision-based lipreading (e.g., occluding masks). Electrical characterization confirms stable operation across realistic forces and frequencies, with good durability and negligible sweat interference, supporting wearable use. Comparative analyses show that silent and voiced speech produce equivalent lip-motion waveforms, enabling communication for individuals with laryngeal or vocal cord impairments. The observation that lip motion precedes audible sound by tens to hundreds of milliseconds offers opportunities for anticipatory decoding. The proposed dilated RNN with prototype learning effectively classifies lip-motion time series with limited labeled data, outperforming a softmax approach and maintaining accuracy in small-sample regimes. The demonstrated applications, namely identity verification, device control, and lip-to-speech conversion, highlight practical feasibility and the potential for personalization (leveraging inter-speaker signal differences) and assistive communication. Overall, the findings address the core challenge of decoding lip language via a low-cost, wearable sensing modality coupled with data-efficient deep learning, expanding the capabilities of silent speech interfaces (SSIs) and human-machine interfaces (HMIs).
Conclusion
This work introduces a lip-language decoding system integrating flexible, self-powered triboelectric sensors with a dilated RNN based on prototype learning. The sensors provide stable, durable, low-resource signals directly tied to lip muscle movements, and the classifier attains high accuracy (94.5%) across 20 classes with limited samples, retaining robustness under reduced training data. The system validates that silent and voiced speech yield consistent lip-motion signatures, enabling voice-off communication. Demonstrations of identity verification, toy-car control, and lip-to-speech conversion underscore applicability for assistive technologies and human–machine interfaces. Future directions include expanding vocabularies and sentence-level decoding, improving sensor placement/arrays for richer spatial resolution, cross-speaker adaptation and personalization, integration into lightweight wearable platforms, and real-time, on-device inference for broader deployment in rehabilitation, HMI, robotics, security, and VR.
Limitations
- Dataset scope: primary classification experiments used 20 word classes (fruit names) with 100 samples each; broader vocabularies and continuous sentence-level recognition were not comprehensively evaluated.
- Inter-speaker variability: substantial individual differences in lip-motion patterns suggest a need for speaker adaptation/personalization; cross-speaker generalization performance was not exhaustively reported.
- Sensor placement and area constraints: optimal sensor area is constrained by facial anatomy; placement and mounting via the mask may affect consistency across sessions and users.
- Confusable classes: words with similar lip-motion waveforms (e.g., Berry vs Olive/Date) led to non-negligible confusions, indicating room for sensor or model improvements.
- Environmental/usage conditions: while sweat insensitivity was shown, long-term wear, motion artifacts (e.g., head movement), and real-world, mobile scenarios were not fully characterized.