Introduction
The lack of effective clinical treatments for speech disorders such as aphasia and dysarthria fuels research into efficient non-acoustic communication, and silent speech interfaces (SSIs) offer a promising solution. Current methodologies focus primarily on surface electromyography (sEMG), but sEMG scales poorly to large vocabularies because of its low signal-to-noise ratio (SNR) and interelectrode correlation, and sEMG systems often degrade under sweat and sebum. Visual monitoring offers high spatial resolution but is limited by environmental factors and inefficient data processing. Other biosignal-based SSIs face their own challenges: electroencephalography (EEG) signals are attenuated before they reach the scalp, and electrocorticography (ECoG) is invasive. Facial strain mapping with epidermal sensors offers an alternative, but previous strain-gauge-based SSIs typically used stretchable organic materials, which suffer from device-to-device variation and poor long-term stability. Inorganic materials such as single-crystalline semiconductors provide higher reliability and faster response times, and the piezoresistive effect gives them significantly higher gauge factors than metal-based sensors. This study proposes a novel SSI that combines single-crystalline silicon strain gauges with a 3D convolutional deep learning algorithm to address the limitations of existing SSIs.
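The gauge-factor advantage claimed above can be made concrete. The gauge factor is defined as GF = (ΔR/R)/ε, the fractional resistance change per unit strain. A minimal sketch, with illustrative numbers (not values measured in this study): metal foil gauges typically have GF near 2, while doped single-crystalline silicon can reach GF on the order of 100.

```python
def gauge_factor(delta_r_over_r: float, strain: float) -> float:
    """Gauge factor GF = (ΔR/R) / ε: fractional resistance change per unit strain."""
    return delta_r_over_r / strain

# Illustrative (hypothetical) readings at 0.1% strain:
# a metal gauge shifts resistance by ~0.2%, a doped silicon gauge by ~10%.
gf_metal = gauge_factor(0.002, 0.001)
gf_silicon = gauge_factor(0.10, 0.001)
```

At equal strain, the silicon gauge in this sketch produces a signal roughly fifty times larger, which is why piezoresistive semiconductors can trade electrode area for sensitivity.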
Literature Review
Existing silent speech interfaces (SSIs) can be broadly categorized into visual and non-visual methods. Visual methods rely on optical cameras to capture facial movements, offering high spatial resolution but struggling with dynamic environments and light variations. Non-visual methods use biosignals such as EEG, ECoG, and sEMG. While EEG and ECoG carry rich brain-activity information, EEG signals are attenuated by the skull and scalp, limiting word differentiation, and ECoG requires an invasive craniotomy. sEMG, a non-invasive method, measures electrical activity in facial muscles but is limited by low SNR, interelectrode correlation, and susceptibility to environmental factors such as sweat. Strain mapping with epidermal sensors offers advantages, but most previous work relied on stretchable organic materials with limited repeatability and long-term stability. This study seeks to leverage the high reliability and fast response time of inorganic semiconductor-based strain gauges for improved SSI performance.
Methodology
This study developed an SSI using ultrathin (<8 μm) single-crystalline silicon strain gauges with a serpentine design to achieve stretchability. Boron doping (5 × 10¹⁹ cm⁻³) minimized temperature-related resistance changes while maintaining a high piezoresistive coefficient. The high Young's modulus of silicon contributed to a fast strain relaxation time. Two perpendicularly placed strain gauges (<0.1 mm²) within each sensor captured biaxial strain information. Four biaxial sensors were placed around the mouth at locations identified by an auxiliary vision-recognition experiment as undergoing the largest areal changes during silent speech. Double-sided encapsulation protected the sensors from sweat and sebum. Data for 100 words (100 repetitions each from two participants), selected from the Lip Reading in the Wild (LRW) dataset, were collected. A 3D convolutional neural network (CNN) was trained on these data, leveraging both spatial and temporal information, and five-fold cross-validation was performed to evaluate the model's generalization ability. Performance was compared against an sEMG-based SSI using electrodes of identical dimensions, and against other classifiers (a correlation-based method and a support vector machine, SVM). t-SNE was employed to visualize the high-dimensional features learned by the deep learning model, and R-CAM highlighted the signal regions most influential in classification.
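The five-fold evaluation described above can be sketched as an index-splitting routine. The array shapes and names below are assumptions for illustration (100 words × 100 repetitions, with each sample an 8-channel strain trace from four biaxial sensors), not the authors' code:

```python
import numpy as np

def five_fold_splits(n_samples: int, n_folds: int = 5, seed: int = 0):
    """Yield (train_idx, val_idx) index pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)          # shuffle once, then partition
    folds = np.array_split(idx, n_folds)
    for k in range(n_folds):
        val_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        yield train_idx, val_idx

# Assumed dataset size: 100 words × 100 repetitions = 10,000 samples,
# each of shape (channels=8, time) before being fed to the 3D CNN.
n_samples = 100 * 100
splits = list(five_fold_splits(n_samples))
```

Each fold holds out 20% of the samples for validation, and the reported accuracy is the average across the five held-out folds.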
Key Findings
The study demonstrated the feasibility of using ultrathin, crystalline-silicon-based strain gauges for silent speech recognition. Finite element analysis (FEA) validated the design of the biaxial strain sensor, demonstrating independent sensing along two orthogonal directions. Cyclic stretching tests verified the high reliability of the sensors, showing negligible resistance changes after 50,000 cycles of 30% stretching. The silicon-based strain gauges exhibited significantly higher sensitivity than metal-based gauges of the same structure. The 3D CNN model achieved 87.53% word recognition accuracy on average across five folds for a vocabulary of 100 words, significantly outperforming an sEMG-based system (42.60%) with comparable electrode dimensions, and accuracy increased with the number of sensors. t-SNE analysis showed that words with similar pronunciations clustered closely, while R-CAM visualized the signal regions the model relied on for classification. Even on unseen data with slightly mismatched sensor locations, and across different subjects, the model performed well, reaching up to 88% accuracy after transfer learning.
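The independence of the two orthogonal gauges reported above implies that the in-plane biaxial strain can be recovered from the paired resistance readings by inverting a small sensitivity matrix. A sketch with hypothetical sensitivities (the near-zero off-diagonal terms reflect the independence the FEA demonstrated; the numbers themselves are illustrative, not from the paper):

```python
import numpy as np

# Hypothetical sensitivity matrix: diagonal = longitudinal gauge factor,
# off-diagonal = small transverse response. Values are illustrative only.
S = np.array([[100.0,   1.0],
              [  1.0, 100.0]])

def recover_biaxial_strain(dr_over_r: np.ndarray) -> np.ndarray:
    """Solve [ΔR/R]_x,y = S @ [εx, εy] for the two in-plane strains."""
    return np.linalg.solve(S, dr_over_r)

# Example: 1% strain along x only. Forward-model the gauge readings,
# then invert to check the strain comes back.
readings = S @ np.array([0.01, 0.0])
strains = recover_biaxial_strain(readings)
```

When the off-diagonal terms are negligible, each gauge reads out one strain axis directly, which is what makes the compact two-gauge sensor layout practical.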
Discussion
The high accuracy achieved by the proposed SSI demonstrates the potential of using single-crystalline silicon-based strain gauges for silent speech recognition. The significant performance improvement over the sEMG-based system with identical sensor dimensions highlights the advantages of the strain gauge approach in terms of scalability and accuracy. The superior sensitivity of the silicon-based gauge, compared to a metal-based counterpart, validates the material choice. The use of a 3D convolutional deep learning model effectively captures spatiotemporal features of the strain data, leading to improved classification. The model’s ability to generalize well across different subjects and slight variations in sensor placement suggests robustness and practicality. The study successfully addresses limitations of prior SSI approaches by combining the advantages of high-gauge-factor inorganic materials with a powerful deep learning model.
Conclusion
This research successfully demonstrated a highly accurate and scalable silent speech interface using ultrathin crystalline-silicon strain gauges and a 3D convolutional deep learning algorithm. The system’s performance surpasses existing sEMG-based methods, especially for larger vocabularies. Future research could explore miniaturizing the sensors further, integrating them into more comfortable and unobtrusive wearable devices, and expanding the vocabulary to encompass a wider range of phonemes for potentially more natural and nuanced communication.
Limitations
While the study achieved high accuracy, certain limitations exist. The study involved a limited number of participants, potentially affecting the generalizability of the findings. The accuracy of the model may be impacted by individual differences in mouth movement patterns and the specific location of sensor placement. Future studies with a larger, more diverse participant group are needed to fully assess the system’s robustness. Additionally, testing in more realistic and less controlled environments would be necessary to demonstrate the system's resilience to various environmental factors.