Ultrathin crystalline-silicon-based strain gauges with deep learning algorithms for silent speech interfaces

Engineering and Technology

T. Kim, Y. Shin, et al.

This study from Yonsei University presents a silent speech interface built on ultrathin crystalline-silicon strain sensors paired with deep learning, achieving an average classification accuracy of 87.53% across 100 silently uttered words.

Introduction
The study addresses the challenge of enabling silent speech recognition (SSR) for nonacoustic communication, which is particularly beneficial for individuals with speech impairments such as aphasia or dysarthria. Existing approaches include visual monitoring of facial movements and nonvisual biosignal capture via electroencephalography (EEG), electrocorticography (ECoG), and surface electromyography (sEMG). Visual methods can offer high spatial resolution but are impractical in dynamic, real-world environments because they depend on camera viewpoint, lighting, and background conditions. EEG provides brain-activity information but suffers from attenuation through the skull and scalp, limiting discrimination among many words; ECoG improves the signal-to-noise ratio (SNR) but is invasive; sEMG is noninvasive but is limited by low spatial resolution, interelectrode interference, sweat/sebum artifacts, and skin irritation, restricting long-term use and scalability to larger vocabularies. Facial strain mapping with epidermal sensors is a promising alternative; prior efforts often used stretchable organic strain sensors that conform to skin but suffer from device-to-device variation and poor long-term stability, which undermines the repeatability needed for deep learning-based classification. Inorganic strain gauges (metal and semiconductor) offer higher reliability and fast response; however, metal gauges have low gauge factors (~2), while semiconductor gauges (e.g., silicon) exploit the piezoresistive effect for much higher gauge factors (~100). This work proposes a silent speech interface (SSI) based on single-crystalline silicon nanomembrane (SiNM) strain gauges combined with deep learning to overcome these limitations, aiming to capture biaxial facial strain robustly and classify a large vocabulary of silently uttered words with high accuracy and scalability.
Literature Review
The paper reviews SSR methods: vision-based lip reading and biosignal-based systems (EEG, ECoG, sEMG). Limitations noted include EEG’s attenuation and ECoG’s invasiveness; sEMG’s low SNR, crosstalk due to interelectrode spacing, susceptibility to perspiration artifacts, and skin irritation hinder scalability and long-term use. Prior facial strain sensing studies explored strain gauges for expression and speech recognition using stretchable organic materials, which provide conformal, stretchable interfaces but suffer from high device-to-device variability and poor long-term stability, undermining deep learning repeatability. Inorganic gauges (metal, semiconductor) offer reliability and fast strain relaxation. Metal gauges primarily change resistance via geometric effects (low gauge factor ~2). Semiconductor gauges exploit the piezoresistive effect with orders-of-magnitude higher resistance changes, enabling high gauge factors (~100). The authors contrast these approaches and position Si-based piezoresistive gauges as a robust, sensitive alternative for SSR. Comparisons with recent strain-gauge-based silent communication systems are noted (Supplementary Table 1).
Methodology
Hardware and sensor design: The system employs ultrathin (<8 μm total thickness) stretchable epidermal sensors, each embedding two single-crystalline silicon nanomembrane (SiNM) piezoresistive strain gauges (~300 nm thick) arranged perpendicularly to capture biaxial strain. Metal interconnects are Au (250 nm)/Cr (5 nm). The substrate and encapsulation are spin-coated polyimide double layers (~3.4 μm each). Stretchability is achieved via mesh and serpentine designs that accommodate facial skin deformation, and the two orthogonal gauges independently quantify orthogonal strain components. The silicon is boron-doped (5×10^19 cm^-3) to minimize temperature-induced resistance changes while retaining a high piezoresistive coefficient. The gauge factor is G = (ΔR/R)/(ΔL/L) = 1 + 2ν + πE, where ν is Poisson's ratio, π the longitudinal piezoresistive coefficient, and E Young's modulus; in metals only the geometric term 1 + 2ν contributes (G ≈ 2), whereas the piezoresistive term πE dominates in doped silicon (G ≈ 100). The relaxation time follows τ = η/E, so the high elastic modulus of silicon contributes to rapid response (a numerical check of these relations appears after this section).

Sensor placement: A heuristic vision-based area-feature study (R-CAM on partitioned facial regions) identified key regions around the lower lip with large areal changes during silent speech. Four sensor sites (S1–S4) were selected below and lateral to the mouth, placed symmetrically because facial motion is nearly symmetric.

Characterization: Uniaxial tensile tests up to 30% strain were performed along the x and y axes. Finite element analysis (FEA) of the strain distribution confirmed that the gauge aligned with the stretch direction experiences the dominant strain, while the orthogonal gauge shows only a minor response. Electrical tests showed stepwise increases in ΔR/R with increasing applied strain, demonstrating independent sensing of the biaxial components. Reliability was evaluated by cyclic stretching (30% strain, 10 mm/s) up to 50,000 cycles on an elastomer mimicking the modulus of skin, showing negligible resistance drift. A metal-based gauge of identical structure was fabricated for comparison; the SiNM gauge's sensitivity exceeded the metal gauge's by ~42.7× at 10% strain, ~28.9× at 20%, and ~20.8× at 30%.

Data acquisition: Four biaxial sensors (eight channels in total) were attached near the subject's mouth. During silent utterances, resistance changes over time were recorded via a DAQ system. Signals from the eight channels were organized into 2×4 heatmaps per time step and concatenated along time to form spatiotemporal 3D inputs; each sample captured a 2 s sequence at 300 frames per second (see the reshaping sketch below).

Dataset: Two subjects silently uttered 100 words (selected from the LRW/LRW-1000 benchmarks) with 100 repetitions per word in total, yielding 100 recordings per word pooled across the two subjects (the per-subject distribution is noted in the cross-validation description). Data were split into five folds for cross-validation.

Deep learning model: A 3D convolutional neural network (3D-CNN) with seven 3D convolutional layers and three fully connected layers encodes spatiotemporal features. Kernel sizes were (3,3,3) with padding (1,1,1) and stride (1,1,1), except Conv3, which used kernel (3,1,3), padding (1,0,1), and stride (2,1,2) for downsampling. Instance normalization and ReLU were applied after each convolution; no pooling was used, to preserve localized spatial information. The final convolutional output was flattened and fed to the FC layers for 100-class classification. Training used cross-entropy loss and the Adam optimizer, with five-fold cross-validation for evaluation (a model sketch follows below).

Explainability: Relevance-weighted class activation maps (R-CAM) highlighted discriminative temporal-channel regions, and t-SNE visualized the learned feature embeddings.
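As a quick numerical sanity check of the gauge-factor relation, the sketch below plugs representative literature values for p-type silicon into G = 1 + 2ν + πE. The specific values of ν, π, and E are illustrative assumptions, not numbers reported in the paper:

```python
# Gauge factor G = 1 + 2*nu + pi_l * E for a piezoresistive material.
# The geometric term (1 + 2*nu) dominates in metals (G ~ 2); the
# piezoresistive term pi_l * E dominates in doped silicon (G ~ 100).

nu_si = 0.27    # Poisson's ratio of silicon (illustrative value)
pi_l = 72e-11   # longitudinal piezoresistive coefficient, 1/Pa (p-type Si, illustrative)
E_si = 170e9    # Young's modulus of silicon, Pa (illustrative)

G_metal = 1 + 2 * 0.3                # metal gauge: geometric effect only
G_si = 1 + 2 * nu_si + pi_l * E_si   # silicon gauge: piezoresistive term dominates

print(f"metal gauge factor   ~ {G_metal:.1f}")  # ~1.6, consistent with G ~ 2
print(f"silicon gauge factor ~ {G_si:.0f}")     # ~120, i.e., order 100 as cited
```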
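The data-pipeline description (eight channels arranged as a 2×4 heatmap per time step, 2 s windows at 300 frames per second) implies input volumes of roughly 600×2×4. A minimal NumPy sketch of that reshaping follows; the raw-record layout and the channel-to-pixel mapping are assumptions:

```python
import numpy as np

FPS = 300                   # sampling rate reported for the strain recordings
WINDOW_S = 2                # 2-second utterance window
N_FRAMES = FPS * WINDOW_S   # 600 frames per sample

# Hypothetical raw record: one row per frame, one column per channel
# (4 sensor sites x 2 orthogonal gauges = 8 channels of dR/R traces).
raw = np.random.randn(N_FRAMES, 8)   # stand-in for DAQ output

# Arrange the 8 channels into a 2x4 spatial heatmap per frame and stack
# frames along the depth axis, giving a 3D spatiotemporal input volume.
volume = raw.reshape(N_FRAMES, 2, 4)   # (time, rows, cols); mapping assumed
volume = volume[np.newaxis, ...]       # add a channel dim for Conv3d input
print(volume.shape)                    # (1, 600, 2, 4)
```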
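Below is a minimal PyTorch sketch of the described 3D-CNN. The seven-convolution structure, kernel/padding/stride scheme, instance normalization, ReLU, absence of pooling, cross-entropy loss, and Adam optimizer follow the text; the channel widths, FC-layer sizes, and learning rate are assumptions:

```python
import torch
import torch.nn as nn

class StrainSSI3DCNN(nn.Module):
    """Sketch of the paper's 3D-CNN: seven Conv3d layers (InstanceNorm +
    ReLU, no pooling) followed by three fully connected layers for
    100-way word classification. Channel widths and FC sizes are assumed;
    only the kernel/stride/padding scheme comes from the text."""

    def __init__(self, n_classes: int = 100):
        super().__init__()
        widths = [1, 8, 16, 32, 32, 64, 64, 64]  # assumed channel progression
        layers = []
        for i in range(7):
            if i == 2:  # Conv3: the single downsampling layer
                k, p, s = (3, 1, 3), (1, 0, 1), (2, 1, 2)
            else:
                k, p, s = (3, 3, 3), (1, 1, 1), (1, 1, 1)
            layers += [
                nn.Conv3d(widths[i], widths[i + 1], kernel_size=k, padding=p, stride=s),
                nn.InstanceNorm3d(widths[i + 1]),
                nn.ReLU(inplace=True),
            ]
        self.features = nn.Sequential(*layers)
        # Input (1, 600, 2, 4) -> (64, 300, 2, 2) after the strided Conv3.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 300 * 2 * 2, 512), nn.ReLU(inplace=True),
            nn.Linear(512, 256), nn.ReLU(inplace=True),
            nn.Linear(256, n_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# One training step with cross-entropy loss and Adam, as in the paper
# (batch data here is random stand-in; the learning rate is assumed).
model = StrainSSI3DCNN()
x = torch.randn(4, 1, 600, 2, 4)    # batch of 4 strain volumes
y = torch.randint(0, 100, (4,))     # word-class labels
optim = torch.optim.Adam(model.parameters(), lr=1e-4)
loss = nn.CrossEntropyLoss()(model(x), y)
loss.backward()
optim.step()
```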
sEMG control experiments: Three sizes of epidermal sEMG electrodes were fabricated (exposed areas ~0.1, ~5.5, and ~22.3 mm^2), encapsulated elsewhere by polyimide. For SNR tests, a pair of two-channel electrodes and a commercial reference electrode were placed on the buccinator and the posterior mastoid, respectively. Signals were sampled at 1 kHz and preprocessed with a commercial EMG module (filters and amplifier) before the DAQ. SNR increased with electrode size owing to reduced surface impedance (1.517 for small, 5.964 for medium, and 8.378 for large electrodes; an SNR computation is sketched below). For the word-recognition comparison, four pairs of small electrodes (the same unit-cell scale as the strain gauge) were placed on the buccinator, levator anguli oris, depressor anguli oris, and the anterior belly of the digastric. The same five-fold cross-validation and 3D-CNN pipeline were used for classification of the 100 words. Additional analyses: the effects of training-set size and channel count on accuracy were evaluated, the 3D-CNN was compared against correlation-based and SVM classifiers, and transfer learning was used to adapt to unseen data (sensor-placement and subject variation).
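The reported SNR values (1.517, 5.964, 8.378) appear to be linear ratios, but the excerpt does not give the formula used. One common definition, sketched here as an assumption, is the ratio of RMS amplitude during articulation to RMS amplitude at rest:

```python
import numpy as np

FS = 1_000  # sEMG sampling rate reported in the paper (1 kHz)

def emg_snr(active: np.ndarray, rest: np.ndarray) -> float:
    """Linear SNR as RMS(signal during articulation) / RMS(baseline).
    This definition is an assumption; the paper does not spell one out."""
    rms = lambda x: np.sqrt(np.mean(np.square(x)))
    return rms(active) / rms(rest)

# Stand-in traces: 1 s of baseline noise and 1 s of muscle activity.
rng = np.random.default_rng(0)
rest = rng.normal(0.0, 1.0, FS)
active = rng.normal(0.0, 6.0, FS)  # larger electrode -> stronger pickup
print(f"SNR ~ {emg_snr(active, rest):.2f}")  # order of the reported 1.5-8.4
```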
Key Findings
- The SiNM-based SSI achieved high-accuracy silent word recognition across 100 words using eight channels: five-fold cross-validation accuracies ranged from 80.10% to 91.55%, averaging 87.53%.
- With a fixed validation set (FOLD 5), increasing the number of training examples per word improved accuracy from 23.70% (10 cases) to 87.50% (80 cases). The 3D-CNN outperformed SVM by at least 15% when ≥20 training examples were used.
- Accuracy scaled with the number of channels: from 49.87% (2 channels) to 87.53% (8 channels).
- The proposed 3D-CNN surpassed alternative classifiers: correlation model average accuracy 10.26% and SVM 76.30%, vs. up to 91.55% for the 3D-CNN on a fold.
- Transfer learning to unseen data (with sensor-placement and subject domain shifts) improved accuracy substantially, up to 88%, indicating meaningful, generalizable features despite placement variability.
- Hardware performance: SiNM gauges showed markedly higher sensitivity than identical-geometry metal gauges: ~42.7× (10% strain), ~28.9× (20%), ~20.8× (30%). Operation was reliable over 50,000 cycles at 30% strain with negligible resistance change.
- sEMG comparison: Using small electrodes matched in area to the strain-gauge unit cell, average recognition accuracy was 42.60% (100-word task; 46 datasets from Subject A, 54 from Subject B). t-SNE embeddings of sEMG features were scattered, indicating lower separability. SNRs increased with electrode size (small 1.517; medium 5.964; large 8.378), highlighting sEMG's scalability limits due to SNR and spatial-resolution constraints.
Discussion
The results demonstrate that ultrathin SiNM biaxial strain gauges combined with a 3D-CNN can robustly capture spatiotemporal facial strain patterns associated with silent articulation, enabling accurate classification over a large 100-word vocabulary. High gauge factor and rapid response of doped single-crystalline silicon, together with stable mesh/serpentine mechanics and double-sided encapsulation, yield reliable, high-SNR signals conducive to deep learning. Explainability analyses (R-CAM) show the model focuses on characteristic periods and channels (e.g., sensor S2 around specific time points), suggesting learning of meaningful articulatory dynamics rather than overfitting. t-SNE visualizations reveal well-clustered embeddings per word class; even phonetically similar words can be distinguished via subtle muscle movement differences. Scaling studies show accuracy improves with more channels and training data, indicating potential for even larger vocabularies or finer units (phonemes) by increasing sensor density. Compared to identically sized sEMG electrodes, the strain-based system provides far superior performance, highlighting its scalability advantage where sEMG suffers from low SNR and interelectrode interference at small scales. Transfer learning results indicate resilience to domain shifts (sensor placement, subject variability), supporting real-world applicability with minimal adaptation.
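The excerpt confirms transfer learning was used for unseen placements and subjects but does not detail the protocol. A standard recipe, sketched below using the StrainSSI3DCNN class from the Methodology sketch, is to freeze the convolutional backbone and fine-tune only the fully connected head on a small adaptation set; this is one common approach, not necessarily the authors' exact method:

```python
import torch
import torch.nn as nn

# Hypothetical adaptation to a new sensor placement or subject: freeze the
# pretrained feature extractor, fine-tune only the classifier head.
model = StrainSSI3DCNN()  # sketch model defined in the Methodology section
# model.load_state_dict(torch.load("pretrained.pt"))  # pretrained weights

for p in model.features.parameters():
    p.requires_grad = False  # keep the learned strain features fixed

optim = torch.optim.Adam(model.classifier.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

x_new = torch.randn(8, 1, 600, 2, 4)   # small batch from the new domain
y_new = torch.randint(0, 100, (8,))    # word labels
optim.zero_grad()
loss = criterion(model(x_new), y_new)
loss.backward()
optim.step()                           # one adaptation step
```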
Conclusion
This work introduces a high-performance silent speech interface leveraging ultrathin, biaxial SiNM piezoresistive strain gauges and a tailored 3D-CNN, achieving state-of-the-art recognition accuracy (average 87.53%) on a challenging 100-word set with only eight channels. The sensors offer high sensitivity, stability over 50,000 strain cycles, and superior performance relative to identically structured metal gauges. Compared with equivalently sized sEMG electrodes, the proposed system yields markedly higher accuracy (87.53% vs. 42.60% for sEMG), underscoring its scalability and suitability for dense sensor arrays. The approach paves the way for high-spatiotemporal-resolution SSI platforms capable of scaling toward phoneme-level recognition and larger vocabularies. Future directions include increasing channel counts for finer articulatory resolution, expanding to more subjects and diverse languages, robust domain adaptation across sessions and placements, and long-term, real-world validation including users with speech impairments.
Limitations
- Participant and dataset scope: Data were collected from only two subjects, which may limit generalizability across broader populations.
- Domain sensitivity: Accuracy on unseen datasets can drop due to sensor-placement mismatches and subject dependency; transfer learning mitigates but does not eliminate this sensitivity.
- Vocabulary and task constraints: While large for strain-based SSI (100 words), the vocabulary is still limited compared with continuous speech or open-vocabulary tasks.
- sEMG comparison conditions: sEMG electrodes were downscaled to match the strain-gauge unit-cell size to assess scalability; while fair for a scalability comparison, this may underrepresent the sEMG performance achievable with larger electrodes or alternative placements.