Encoding of multi-modal emotional information via personalized skin-integrated wireless facial interface

Engineering and Technology

J. P. Lee, H. Jang, et al.

The personalized skin-integrated facial interface (PSIFI), developed by Jin Pyo Lee and colleagues, uses flexible sensors and machine learning to identify emotional expressions in real time, even when the wearer's face is covered by a mask. The authors demonstrate the system in interactive virtual-reality environments.
Introduction

The paper addresses the challenge of decoding and encoding human emotional information, which is abstract, complex, and personalized. Traditional emotion recognition methods based on facial images, speech audio, or text are hindered by lighting, noise, occlusions (e.g., masks), and linguistic ambiguity. To overcome these limitations, the authors aim to exploit multi-modal sensing that combines verbal (speech) and non-verbal (facial expression) cues using a wearable, skin-integrated interface. The research goal is to develop a personalized, self-powered, transparent and stretchable facial mask (PSIFI) that captures facial strains and vocal-cord vibrations and to encode these signals in real time using machine learning, enabling robust emotion recognition and downstream applications such as VR-based interactions.

Literature Review

The authors review four strands of prior work: (1) Physiological-signal-based emotion recognition using EEG, EMG, ECG, and GSR, which can be robust to environmental factors but typically requires bulky equipment. (2) Flexible, skin-integrated wearables that enable real-time monitoring of facial expressions, speech, gestures, and physiological signals; resistive strain sensors have been used for facial expression detection, but they need external power, operate only at low frequencies, and often provide single-modality, one-to-one mappings. (3) Prior work that largely focuses on single-modal data, which limits resilience and the capture of high-level emotional context. (4) Multimodal emotion recognition that combines modalities (e.g., EEG and eye tracking) to improve robustness and extract higher-level features. The review motivates the need for a highly wearable, personalized, multimodal interface that captures data reliably across contexts and is supported by classification techniques that adapt to individuals.

Methodology

System design: The PSIFI system integrates multimodal triboelectric sensors (TES) for facial strain and vocal-cord vibration sensing, a wireless data-processing circuit, and deep-learning classifiers. Personalized fit is achieved via 3D face reconstruction from scanned images to fabricate a conformal facial mask.

Sensors and materials: TES operate in single-electrode triboelectric mode. Layers: (i) Electrode layer: PEDOT:PSS-coated PDMS prepared by a semi-curing process to embed conductive polymers, yielding stretchable, transparent electrodes with improved optical, mechanical, and electrical performance. (ii) Dielectric layer: PDMS with nanostructured surface (via ICP-RIE using Au nanoparticle masks with Ar/O2/CF4 plasma) for the strain sensor; and hole-patterned PDMS diaphragms (laser-punched, smallest hole 500 µm) for the vibration sensor to act as acoustic holes that ventilate air, flatten frequency response, and reduce stiffness. Layers are separated at the ends by double-sided tape spacers for consistent operation cycles. All components are stretchable and skin-friendly.

Sensor placement: Strain sensors are placed at glabella, eye, nose, lip, and chin to capture facial muscle activations; a vibration sensor is placed at the vocal cord area for speech-related vibrations.

Working mechanisms: Strain unit uses triboelectric potential between PDMS (negative) and PEDOT:PSS electrode (positive). Bidirectional sensing is achieved by contact-area changes under stretching and buckling. Vibration unit uses diaphragm deflection; open ratio (OR) of holes balances diaphragm deflection and effective triboelectric contact area.

Characterization: Strain unit tested under cyclic stretch/buckle using a linear motor; sensitivity S = ΔV/Δε. Buckling: linear response up to ~50% strain with sensitivity ~5 mV per unit strain; stretching: linear up to ~90% with sensitivity ~3 mV per unit strain. Response time <20 ms with no apparent latency; frequency robustness from 0.5–3 Hz at 40% strain; mechanical durability over 3000 cycles at 40% strain without apparent degradation.
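
The sensitivity definition S = ΔV/Δε can be illustrated with a minimal sketch that estimates S as the least-squares slope of output voltage against strain over the linear range. The function name and the data points are hypothetical; the points are synthetic values chosen to match the reported buckling sensitivity of about 5 mV per unit strain, not measurements from the paper.

```python
def sensitivity(strains, voltages):
    """Least-squares slope of voltage vs. strain, i.e. S = dV/d(strain)."""
    n = len(strains)
    if n < 2 or n != len(voltages):
        raise ValueError("need two or more paired samples")
    mean_e = sum(strains) / n
    mean_v = sum(voltages) / n
    sxx = sum((e - mean_e) ** 2 for e in strains)
    if sxx == 0:
        raise ValueError("strain values must not all be equal")
    sxy = sum((e - mean_e) * (v - mean_v) for e, v in zip(strains, voltages))
    return sxy / sxx


# Synthetic, perfectly linear data (assumption): strain as a fraction
# (0.5 == 50%), voltage in mV, slope fixed at 5 mV per unit strain.
strains = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
voltages_mv = [5.0 * e for e in strains]

s_buckle = sensitivity(strains, voltages_mv)
print(f"estimated sensitivity: {s_buckle:.2f} mV per unit strain")
```

With real recordings, the fit would be restricted to the linear range reported for each mode (up to ~50% for buckling, up to ~90% for stretching).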

Vibration unit optimization: Compared OR = 5, 10, 20; OR ≈ 10 provided best output and widest flat frequency bandwidth, covering typical male (100–150 Hz) and female (200–250 Hz) fundamental voice ranges. At 100 Hz, hole-patterned sensors showed sensitivity 5.78 V/g, ~2.8× higher than pristine (no holes). Increasing support thickness reduces output (larger gap, smaller effective contact). For the same OR, more (smaller) holes increase diaphragm deflection, enhancing output. Typical parameters: PDMS diaphragm thickness 50 µm; baseline array 32 holes.
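
The statement that more holes at the same OR means smaller holes can be checked with a short geometric sketch. It rests on assumptions not stated in the summary: OR is taken to be the fraction of the diaphragm area removed by the holes (so OR = 10 is read as 10%), the holes are equal circles, and the diaphragm area of 100 mm² is a placeholder value.

```python
import math


def hole_diameter_um(open_ratio, diaphragm_area_mm2, n_holes):
    """Diameter in micrometres of each of n equal circular holes
    whose combined area is open_ratio * diaphragm area."""
    if not 0 < open_ratio < 1:
        raise ValueError("open_ratio must be a fraction between 0 and 1")
    if n_holes < 1 or diaphragm_area_mm2 <= 0:
        raise ValueError("need a positive area and one or more holes")
    area_per_hole_mm2 = open_ratio * diaphragm_area_mm2 / n_holes
    diameter_mm = 2.0 * math.sqrt(area_per_hole_mm2 / math.pi)
    return diameter_mm * 1000.0


AREA_MM2 = 100.0  # hypothetical diaphragm area, not from the paper

d32 = hole_diameter_um(0.10, AREA_MM2, 32)  # baseline array of 32 holes
d64 = hole_diameter_um(0.10, AREA_MM2, 64)  # same OR, twice as many holes
print(f"32 holes: {d32:.0f} um each, 64 holes: {d64:.0f} um each")
```

Under these assumptions, doubling the hole count at fixed OR shrinks each hole diameter by a factor of √2, so the 500 µm minimum hole size limits how far the hole count can be raised.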

Wireless data pipeline: A compact transmitter board powered by a small Li-ion battery acquires multi-channel TES signals and wirelessly sends them to a receiver board connected to a laptop for storage and processing.

Signal processing and ML: Facial strain signals (time-domain) and vocal vibration signals (transformed to frequency-domain via FFT) are separately classified. Models: 1D CNN for non-verbal facial strain; 2D CNN for verbal speech. Data augmentation (jittering, scaling, time-warping, magnitude-warping) enhances training. Pre-training: For each expression, 100 signal patterns acquired from a participant repeating 20 times; 70 used for training (augmented 8×) and 30 for testing. Transfer learning: Pre-trained weights adapted to new users with only 10 repetitions per expression for rapid personalization and real-time classification. Wireless real-time classification is demonstrated.
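
The four augmentation operations named above can be sketched in plain Python as follows. This is an illustrative version, not the authors' implementation: the noise levels, the number of warp knots, the piecewise-linear warp curves, and the way eight variants are drawn per recording are all assumptions.

```python
import math
import random


def jitter(signal, sigma=0.03, rng=random):
    """Add independent Gaussian noise to every sample."""
    return [x + rng.gauss(0.0, sigma) for x in signal]


def scale(signal, sigma=0.1, rng=random):
    """Multiply the whole signal by one random factor near 1."""
    factor = rng.gauss(1.0, sigma)
    return [x * factor for x in signal]


def _smooth_curve(n, knots, sigma, rng):
    """Piecewise-linear curve of length n through random knots near 1."""
    values = [rng.gauss(1.0, sigma) for _ in range(knots)]
    curve = []
    for i in range(n):
        pos = i * (knots - 1) / (n - 1)
        lo = min(int(pos), knots - 2)
        frac = pos - lo
        curve.append(values[lo] * (1 - frac) + values[lo + 1] * frac)
    return curve


def magnitude_warp(signal, sigma=0.2, knots=4, rng=random):
    """Multiply the signal by a smooth, slowly varying gain curve."""
    curve = _smooth_curve(len(signal), knots, sigma, rng)
    return [x * c for x, c in zip(signal, curve)]


def time_warp(signal, sigma=0.2, knots=4, rng=random):
    """Resample the signal along a smoothly distorted time axis."""
    n = len(signal)
    speed = [max(c, 0.05) for c in _smooth_curve(n, knots, sigma, rng)]
    cumulative = [0.0]
    for s in speed[1:]:
        cumulative.append(cumulative[-1] + s)
    total = cumulative[-1]
    warped = []
    for t in cumulative:
        pos = t / total * (n - 1)  # warped position, kept in [0, n-1]
        lo = min(int(pos), n - 2)
        frac = pos - lo
        warped.append(signal[lo] * (1 - frac) + signal[lo + 1] * frac)
    return warped


def augment(signal, copies=8, seed=0):
    """Return `copies` variants, cycling through the four operations."""
    if len(signal) < 2:
        raise ValueError("signal needs two or more samples")
    rng = random.Random(seed)
    ops = [jitter, scale, magnitude_warp, time_warp]
    return [ops[i % len(ops)](signal, rng=rng) for i in range(copies)]


# Synthetic stand-in for one recorded strain pattern (assumption).
signal = [math.sin(2 * math.pi * i / 50) for i in range(100)]
variants = augment(signal)
print(len(variants), "augmented variants of length", len(variants[0]))
```

Each variant keeps the original length, so the augmented set can be fed to a fixed-input classifier without resampling.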

VR application: Integration with Unity3D via wireless serial (Blueinno2) and Python (PySerial). User and concierge avatars (readyplayer.me) interact in VR scenarios (home, office, theater) responding to recognized emotional speech and expressions.
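
The Python side of such a serial bridge might look like the sketch below. The line format (`expression,speech` terminated by a newline), the function names, and the example labels are hypothetical, since the summary does not describe the actual protocol. The sender accepts any object with a `write(bytes)` method, which a PySerial `serial.Serial` port provides, so an in-memory buffer stands in for the hardware here.

```python
import io


def encode_event(expression, speech):
    """Pack one recognized expression and speech label into a
    newline-terminated UTF-8 line (hypothetical message format)."""
    for field in (expression, speech):
        if "," in field or "\n" in field:
            raise ValueError("labels must not contain ',' or newlines")
    return f"{expression},{speech}\n".encode("utf-8")


def send_event(port, expression, speech):
    """Write one encoded event to a port-like object exposing write()."""
    port.write(encode_event(expression, speech))


# Stand-in for an open serial port; the labels are placeholder values.
buffer = io.BytesIO()
send_event(buffer, "happy", "hello")
print(buffer.getvalue())
```

A line-based text format keeps the receiving side simple, since it can read one line per event and split on the comma.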

Key Findings
  • Developed a personalized, transparent, stretchable, self-powered facial mask (PSIFI) integrating triboelectric strain and vibration sensors with wireless data transmission and ML classifiers for real-time, multimodal emotion recognition.
  • Strain sensor: bidirectional (buckling and stretching) triboelectric strain sensing demonstrated for facial expressions with sensitivity ~5 mV per unit strain (buckling, linear up to ~50%) and ~3 mV per unit strain (stretching, linear up to ~90%); response time <20 ms; consistent output from 0.5–3 Hz at 40% strain; durable over 3000 cycles at 40% strain.
  • Vibration sensor: hole-patterned PDMS diaphragm optimized at OR ≈ 10 achieved highest output and widest flat frequency bandwidth covering typical human fundamental voice frequencies; sensitivity 5.78 V/g at 100 Hz (~2.8× higher than pristine without holes). Increased support thickness reduced output; increased hole count (at same OR) improved output.
  • Machine learning with data augmentation and transfer learning enabled rapid personalization (10 repetitions per expression) and real-time classification.
  • Real-time multimodal classification showed high accuracy: 93.3% without obstruction. With obstruction (e.g., wearing a facial mask), the reported accuracy is inconsistent: 80.0% in the text but 86.66% in the figure snapshots.
  • Demonstrated a VR digital concierge that adapts services (smart home mood feedback, office keyword assistance, entertainment content curation) based on recognized emotional context.
Discussion

By combining triboelectric strain sensing of facial muscle movements with triboelectric vibration sensing of vocal-cord activity, the PSIFI captures complementary non-verbal and verbal cues, mitigating limitations of camera- and microphone-only systems under variable lighting, noise, and occlusions. The self-powered nature of TES and the personalized, conformal, transparent mask improve comfort and signal fidelity. The bidirectional strain sensing, fast response, and durability support accurate, real-time detection of facial expressions, while the optimized hole-patterned vibration sensor captures speech-relevant vibrations across typical voice frequencies with enhanced sensitivity. Machine learning with data augmentation and transfer learning addresses inter-user variability, enabling quick adaptation to new users with minimal calibration and supporting high real-time accuracy, even when partially obstructed by a mask. The integration into a VR digital concierge showcases how encoded emotional information can drive personalized human–machine interactions across daily-life scenarios.

Conclusion

The study presents a personalized, skin-integrated, wireless, self-powered multimodal interface (PSIFI) that encodes human emotional information by fusing facial strain and vocal-cord vibration signals. Key contributions include: (1) a transparent, stretchable, triboelectric-based bidirectional strain sensor and optimized vibration sensor; (2) a wireless acquisition system enabling real-time operation; (3) a machine-learning framework with data augmentation and transfer learning for rapid personalization; and (4) a VR digital concierge demonstration leveraging emotional context for interactive services. Future work could expand modality coverage (additional physiological signals), increase user population and datasets for broader generalization, refine on-device/edge inference, and explore more complex, continuous affective states and real-world deployments beyond laboratory and VR settings.

Limitations
  • Participant scale and dataset diversity appear limited; pre-training relied on one participant with subsequent rapid personalization for others, which may constrain generalizability across populations (age, gender, skin types, accents).
  • Calibration is still required for new users (10 repetitions per expression), which, while lightweight, imposes a setup step for deployment.
  • Emotion classes and speech content were limited (five basic expressions; a small set of syllables/phrases), not covering continuous or nuanced affective states.
  • Performance under diverse real-world conditions (sweat, motion artifacts, prolonged wear, varying adhesives/skin conditions) was not extensively evaluated; triboelectric outputs may vary with environmental and skin factors.
  • While sensors are self-powered, the wireless system uses a battery-powered transmitter; fully autonomous long-term operation and power management were not characterized.
  • Reported mask-obstructed accuracy varied between figure (86.66%) and text (80.0%), indicating variability and the need for more systematic benchmarking.