Engineering and Technology
Encoding of multi-modal emotional information via personalized skin-integrated wireless facial interface
J. P. Lee, H. Jang, et al.
The study addresses the challenge of accurately detecting and encoding emotional information, which is inherently abstract, complex, and personalized. Existing single-modal approaches (facial images, speech, text) suffer from environmental constraints and ambiguity. To overcome these limitations, the authors propose a personalized skin-integrated facial interface (PSiFI) capable of simultaneous multi-modal sensing of facial strain (non-verbal expression) and vocal vibration (verbal expression), robust to environmental factors. The purpose is to enable reliable, real-time emotion recognition and encoding of comprehensive emotional context for human–machine interaction, using a self-powered triboelectric sensor platform, wireless data processing, and machine learning (CNN with transfer learning) for personalized adaptation.
Prior work on emotion recognition used modalities such as facial expression analysis (image-based), speech processing, and text-based sentiment/emotion detection. These approaches face challenges including sensitivity to lighting, occlusions, noise, and linguistic ambiguity. Physiological-signal-based methods (EEG, EMG, ECG, GSR) offer robustness to environment but often require bulky equipment limiting everyday use. Flexible skin-integrated devices have shown promise for real-time detection (facial strain, speech, gestures) but typically rely on resistive strain sensors needing external power, operating at low frequency ranges, and offering single-modality correlation. Recent multi-modal emotion recognition efforts combine modalities to extract higher-level embedded information and overcome individual modality weaknesses. Triboelectric nanogenerators (TENGs) provide self-powered sensing with broad design flexibility, supporting wearable, personalized, multi-modal systems.
Device and system design: The PSiFI is a personalized, transparent, stretchable facial mask that integrates triboelectric sensors (TES) for two modalities: facial strain (non-verbal) and vocal cord vibration (verbal). A personalized fit is achieved by reconstructing the user's face in 3D from scanned photos and designing a custom mask layout that matches the facial contours, with sensing sites at the glabella, eye, nose, lip, and chin, plus one site on the neck for vocal cord vibration.
Sensors: Each TES operates in single-electrode mode and consists of two layers: a PEDOT:PSS-coated PDMS stretchable electrode and a PDMS dielectric layer. The electrode is fabricated by a semi-curing method (coating the PEDOT:PSS dispersion before the PDMS is fully cured) to achieve high transparency and conductivity. The dielectric for the strain unit is nanostructured PDMS with a nanowire surface, produced by inductively coupled plasma reactive ion etching (ICP-RIE) with Au nanoparticle masks, which enlarges the triboelectric contact area. The dielectric for the vibration unit is hole-patterned PDMS that serves as a diaphragm; its acoustic holes ventilate air, flatten the frequency response, and reduce stiffness. The layers are separated by double-sided tape at the ends to maintain consistent operation.
Materials: PDMS (10:1 base:curing agent), PEDOT:PSS dispersion modified with DMSO and IPA, Au nanoparticles for etch masks. Films: electrode PDMS cast and semi-cured at 90 °C for 5 min before PEDOT:PSS anchoring; dielectric nanowires etched with Ar/O2/CF4 plasma (15/10/30 sccm) for 300 s under 400 W ICP and 100 W bias; hole arrays laser-cut in 100 µm PDMS (min hole diameter 500 µm).
Assembly and attachment: Stretchable conductors (1 cm × 1 cm) are connected via FFC and medical silicone tape; dielectric films (50 µm) are placed atop as the triboelectric interface. Devices are mounted to skin with biocompatible ultrathin transparent medical tape (Tegaderm), ensuring conformal contact and long wear without irritation.
Wireless data processing: A compact transmitter circuit, powered by a portable battery, acquires the multi-channel triboelectric signals and transmits them wirelessly to a receiver board connected to a laptop for storage and processing. Signals from the strain sensors capture distinct facial expression patterns; signals from the vibration sensor capture spoken syllables and short sentences. Vocal signals are converted to frequency-domain representations with a fast Fourier transform (FFT) for pattern recognition.
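The FFT step can be illustrated with a short sketch. This is not the authors' pipeline: the sampling rate, the Hann window, and the synthetic 120 Hz test tone (chosen to fall in the male fundamental range quoted in this summary) are all assumptions made for the example.

```python
import numpy as np

def magnitude_spectrum(signal, sample_rate_hz):
    """Return (frequencies_hz, magnitudes) for a real-valued 1D signal."""
    signal = np.asarray(signal, dtype=float)
    # Remove the DC offset and apply a Hann window to limit spectral leakage.
    windowed = (signal - signal.mean()) * np.hanning(signal.size)
    magnitudes = np.abs(np.fft.rfft(windowed))
    freqs_hz = np.fft.rfftfreq(signal.size, d=1.0 / sample_rate_hz)
    return freqs_hz, magnitudes

# Synthetic stand-in for a vibration-sensor trace (assumed values, not measured data).
sample_rate_hz = 2000
t = np.arange(2000) / sample_rate_hz  # one second of samples
trace = np.sin(2 * np.pi * 120 * t) + 0.3 * np.sin(2 * np.pi * 240 * t)

freqs_hz, mags = magnitude_spectrum(trace, sample_rate_hz)
peak_hz = freqs_hz[np.argmax(mags)]  # dominant frequency of the trace
```

A spectrum like `mags` is a 1D vector; turning it into the 2D input that an image classifier expects (for example, by stacking spectra of successive time windows) is a further design choice this sketch leaves open.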
Machine learning: Separate classifiers are trained for non-verbal (facial strain) and verbal (vocal vibration) inputs due to low interdependence of concurrent modalities. A 1D-CNN is used for facial expression classification and a 2D-CNN for speech classification (post-FFT images). Data augmentation (jittering, scaling, time-warping, magnitude-warping) is applied to expand training sets 8-fold. Pre-training protocol: one participant repeats each expression 20 times (verbal and non-verbal), collecting 100 patterns per expression; 70 patterns are used for training (then augmented) and 30 for testing. Transfer learning: pre-trained parameters are fine-tuned with new users’ data, requiring only 10 repetitions per expression to achieve personalized real-time classification.
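The four augmentation types named above can be sketched as follows. The noise levels, the piecewise-linear warping curves, the single-channel pattern length of 200 samples, and the way the 8-fold expansion is reached (two variants from each of the four transforms) are assumptions for illustration; the summary does not state how the authors implemented or combined them.

```python
import numpy as np

rng = np.random.default_rng(0)

def jitter(x, sigma=0.03):
    """Add small Gaussian noise to every sample."""
    return x + rng.normal(0.0, sigma, size=x.shape)

def scale(x, sigma=0.1):
    """Multiply the whole pattern by one random factor near 1."""
    return x * rng.normal(1.0, sigma)

def _smooth_curve(n, sigma=0.2, knots=4):
    """Random piecewise-linear curve around 1.0 (assumed stand-in for a spline)."""
    knot_pos = np.linspace(0, n - 1, knots + 2)
    knot_val = rng.normal(1.0, sigma, size=knots + 2)
    return np.interp(np.arange(n), knot_pos, knot_val)

def magnitude_warp(x, sigma=0.2):
    """Multiply the pattern by a slowly varying random envelope."""
    return x * _smooth_curve(x.size, sigma)

def time_warp(x, sigma=0.2):
    """Resample the pattern along a smoothly distorted, monotonic time axis."""
    n = x.size
    speed = np.clip(_smooth_curve(n, sigma), 0.1, None)  # keep speeds positive
    warped = np.cumsum(speed)
    warped = (warped - warped[0]) / (warped[-1] - warped[0]) * (n - 1)
    return np.interp(np.arange(n), warped, x)

def augment_8x(patterns):
    """Produce 8 variants per pattern: 2 from each of the 4 transforms."""
    transforms = (jitter, scale, time_warp, magnitude_warp)
    return np.stack([f(p) for p in patterns for f in transforms for _ in range(2)])

# Placeholder data: 70 training patterns of 200 samples each (random, not real signals).
train = rng.normal(size=(70, 200))
augmented = augment_8x(train)
```

Under this reading, the 70 training patterns per expression become 70 × 8 = 560 augmented patterns, while the 30 test patterns stay untouched.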
Characterization: Strain unit tested with a linear motor applying cyclic loading; open-circuit voltages recorded (Keithley 6514). Response time evaluated; frequency performance assessed at constant strain (40%) across 0.5–3 Hz; durability tested over 3000 cycles at 40% strain. Vocal unit characterized with different open ratios (OR = 5%, 10%, 20%), support thicknesses, and hole counts; outputs measured on an oscilloscope (2.5 GS/s) across frequencies including typical adult fundamental ranges (male: 100–150 Hz; female: 200–250 Hz). Sensitivity to vibration acceleration (0.1–1.0 g) quantified; effects of structure on output analyzed.
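One common way to quantify sensitivity to vibration acceleration is the slope of a straight-line fit of output voltage against acceleration. The sketch below assumes that definition; the summary does not say how the authors computed it. The voltage arrays are synthetic, generated from the sensitivities reported in this summary (about 5.78 V/g with holes and 2.07 V/g without) purely to show the calculation.

```python
import numpy as np

def sensitivity_v_per_g(accel_g, voltage_v):
    """Slope of a least-squares line through (acceleration, voltage) points."""
    slope, _intercept = np.polyfit(accel_g, voltage_v, 1)
    return slope

# Acceleration sweep over the characterized range of 0.1 to 1.0 g.
accel_g = np.linspace(0.1, 1.0, 10)

# Synthetic, perfectly linear outputs (illustrative only, not measured data).
with_holes_v = 5.78 * accel_g
without_holes_v = 2.07 * accel_g

s_holes = sensitivity_v_per_g(accel_g, with_holes_v)
s_plain = sensitivity_v_per_g(accel_g, without_holes_v)
improvement = s_holes / s_plain  # ratio of the two fitted sensitivities
```

With these inputs the ratio comes out near 2.8, which matches the improvement factor quoted for the hole-patterned diaphragm.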
- Developed a personalized, skin-integrated, transparent, stretchable, self-powered facial interface (PSiFI) that simultaneously senses facial strain and vocal vibrations using triboelectric sensors.
- Strain sensing unit: First demonstration (to the authors' knowledge) of bidirectional triboelectric strain sensing (buckling and stretching), with linear responses and sensitivities of approximately 5 mV (buckling, up to ~50% strain) and 3 mV (stretching, up to ~90% strain). Response time is below 20 ms, with negligible latency. Output remains stable at a constant 40% strain across 0.5–3 Hz. Mechanical durability is maintained over 3000 cycles at 40% strain, which exceeds typical facial strain requirements.
- Vibration sensing unit: Acoustic hole patterning (optimized OR ≈ 10%) produced the widest flat frequency bandwidth and best output performance covering typical vocal fundamental ranges (male ~100–150 Hz; female ~200–250 Hz). Structural effects: increased support thickness reduced output (larger gap lowers effective contact), while increasing hole number (at fixed OR) enhanced diaphragm deflection and output. Sensitivity to vibration acceleration reached ~5.78 V/g with holes versus ~2.07 V/g without holes (~2.8× improvement).
- Wireless multi-channel data acquisition enabled distinct signal patterns for facial expressions (happiness, surprise, disgust, anger, sadness) and spoken content (syllables such as “A”, “B”, “C”, and sentence “I love you”). Vocal signals benefited from FFT-based frequency-domain features.
- Machine learning with data augmentation and transfer learning achieved real-time, personalized classification: combined verbal/non-verbal classification accuracy ~93.3% without obstacles; ~80.0% accuracy even when the user wore a facial mask.
- Application demonstration: A VR-based digital concierge (Unity) reacted to users’ emotional speech and expressions, providing services across scenarios (smart home mood feedback, office keyword search/interpretation, entertainment content curation), illustrating practical HMI integration.
The PSiFI addresses core limitations of emotion recognition by integrating multi-modal, self-powered sensing directly on the skin, thereby reducing susceptibility to environmental factors (lighting, occlusion, noise) and enabling barrier-free communication. Bidirectional strain sensing and optimized vibration sensing provide rich, complementary emotional cues. The wireless architecture and rapid-adapting CNNs (via transfer learning and augmentation) allow personalized, real-time emotion recognition without extensive data collection per user. These capabilities facilitate encoding emotional information into application-specific parameters, enhancing human–machine interaction in contexts such as VR, smart environments, and assistive systems. The demonstrated VR concierge highlights the system’s potential to deliver user-oriented services tuned to affective states, advancing practical adoption of emotion-aware interfaces.
This work presents a personalized, skin-integrated, wireless facial interface that uses self-powered, multi-modal sensing of facial strain and vocal vibration to encode emotional information in real time. Key contributions include the first bidirectional triboelectric strain sensor for facial expressions, optimized acoustic hole-patterned vibration sensors for speech, a compact wireless data pipeline, and a CNN-based framework that uses data augmentation and transfer learning for rapid personalization. The system reaches about 93.3% classification accuracy without obstacles and about 80.0% when the user wears a facial mask, and it demonstrates a practical HMI application in VR through a digital concierge. Future research could add modalities (e.g., additional physiological signals), validate the system across larger and more diverse populations, improve generalization to spontaneous, in-the-wild interactions, and integrate context-aware models to further enrich affective computing.
- User-specific calibration is required; transfer learning reduces but does not eliminate per-user data collection (10 repetitions per expression).
- Dataset size per user and number of participants appear limited; generalizability across demographics, languages, and spontaneous emotions needs broader validation.
- Evaluation focuses on selected expressions and short speech content; performance in complex, continuous conversation and diverse emotional states remains to be tested.
- Lab-based characterization (controlled frequencies/accelerations) may not capture all real-world variability (motion artifacts, sweat, long-term wear).
- The VR concierge demonstrations are proof-of-concept; comprehensive user studies on usability, comfort, and long-term reliability are needed.