
Real-time deep learning-assisted mechano-acoustic system for respiratory diagnosis and multifunctional classification
H. K. Lee, S. U. Park, et al.
Discover an innovative approach to real-time respiratory pattern monitoring using a wireless mechano-acoustic sensor integrated with a deep learning system. This groundbreaking research conducted by Hee Kyu Lee, Sang Uk Park, Sunga Kong, Heyin Ryu, Hyun Bin Kim, Sang Hoon Lee, Danbee Kang, Sun Hye Shin, Ki Jun Yu, Juhee Cho, Joohoon Kang, Il Yong Chun, Hye Yun Park, and Sang Min Won could revolutionize COPD diagnosis, providing a promising alternative to traditional spirometry.
Introduction
The study addresses the need for robust, real-time, noninvasive monitoring and classification of respiratory-related signals, including COPD severity, using epidermal mechano-acoustic sensing at the laryngeal area coupled with deep learning. Existing clinical diagnostics like spirometry, while standard for COPD, require trained staff, clinical settings, and strong patient cooperation, contributing to underutilization and reduced diagnosis rates. Flexible epidermal sensors and digital processing can capture high-resolution physiological signals with improved coupling and SNR, but prior systems often relied on microphones susceptible to environmental noise and lacked real-time operation. The purpose is to develop and validate a wireless, flexible, skin-mounted accelerometer-based mechano-acoustic system with a mobile interface and a server-side multi-modal deep learning pipeline to classify spoken phrases, user and gender characteristics, and COPD severity in real time.
Literature Review
Prior work has demonstrated that mechano-acoustic and acoustic sensing of vocal fold oscillations can reflect respiration, phonation, and cardiac activity, typically processed in the frequency domain. Microphone-based COPD studies extracted acoustic parameters (e.g., jitter, shimmer) to distinguish patient characteristics, but these approaches are noise-prone, lack real-time monitoring, and often have limited operational durations. Long-term monitoring has shown systematic changes in COPD patients’ outputs relative to short-term observations, underscoring the need for stable attachment and comfort. Mechano-acoustic approaches are less susceptible to ambient noise than microphones and can maintain high SNR via conformal skin contact. Emerging deep learning methods for speech and biomedical acoustics motivate multi-feature spectral representations (mel spectrum, MFCC, chroma) and CNN-based models for improved classification.
Methodology
Hardware and device: A flexible, epidermally mounted mechano-acoustic sensor is built on a flexible PCB housing active and passive components, including a triaxial accelerometer and a Bluetooth Low Energy (BLE) SoC. The sensor is encapsulated in PDMS (elastic modulus ~1.32 MPa) to ensure conformal contact and durability during repeated bending. A polypropylene film (elastic modulus ~1.3 GPa) enhances structural integrity and reduces skin strain during removal. Medical-grade silicone/acrylic adhesive provides stable, biocompatible skin attachment with adhesion strength ~90 N m⁻¹ and peel strength ~60 N m⁻¹ after 24 h. Power is supplied by a 3.3 V CR2032 coin cell (240 mAh). The system samples the accelerometer at 1 kHz (z-axis SAADC channel; accelerometer bandwidth ~500 Hz) and streams data wirelessly via BLE to a tablet application and then to a server.
Wireless pipeline and UI: The analog sensor output is routed through the SoC's GPIO to the on-chip ADC, digitized, and relayed via BLE to a tablet app that provides real-time visualization and buffering. The app forwards raw data to a server through a cloud database, enabling two-way communication: the server retrieves data for processing and returns classification results to the app. The end-to-end round trip (capture to classification result) takes about 1 s with minimal packet loss.
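To make the capture path concrete, the following is a minimal Python sketch of the tablet-side BLE subscription using the bleak library; the device address, the Nordic UART-style characteristic UUID, and the int16 packet format are illustrative assumptions, not details reported in the paper.

```python
import asyncio
import struct

from bleak import BleakClient  # cross-platform BLE client

DEVICE_ADDRESS = "AA:BB:CC:DD:EE:FF"                   # hypothetical sensor MAC
TX_CHAR_UUID = "6e400003-b5a3-f393-e0a9-e50e24dcca9e"  # Nordic UART TX (assumed)

samples = []  # buffered z-axis accelerometer samples

def on_notify(_char, data: bytearray):
    # Assume each notification packs little-endian int16 z-axis samples.
    samples.extend(struct.unpack(f"<{len(data) // 2}h", data))

async def main():
    async with BleakClient(DEVICE_ADDRESS) as client:
        await client.start_notify(TX_CHAR_UUID, on_notify)
        await asyncio.sleep(11.0)  # one ~11 s word-task window
        await client.stop_notify(TX_CHAR_UUID)
    # At this point the app would push `samples` to the cloud database and
    # poll for the server's classification result (two-way communication).

asyncio.run(main())
```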
Power/thermal: With a 240 mAh cell, the device operates ~160 h in BLE advertising and ~46 h in continuous streaming. Over 24 h of transmission, surface temperature stabilized at ~1.5 °C above baseline (~25.7 °C), posing minimal skin risk.
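As a quick consistency check on these figures, the implied average current draw follows directly from capacity divided by runtime (a back-of-the-envelope estimate that ignores battery derating and cutoff voltage):

```python
# Implied average current draw from the stated 240 mAh capacity:
capacity_mah = 240
print(f"advertising: {capacity_mah / 160:.2f} mA")  # ~1.50 mA average
print(f"streaming:   {capacity_mah / 46:.2f} mA")   # ~5.22 mA average
```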
Signal preprocessing: Server-side processing includes amplitude normalization, FFT-based spectral conversion, power spectral density estimation, scaling by signal variance, smoothing (moving average), and adaptive band-pass selection around spectral formants via bilateral sweeps from the peak to determine cutoffs. Motion artifacts are characterized predominantly at low frequencies; filters are tuned accordingly. Feature representations include spectrograms, mel spectrograms, MFCCs, and chroma.
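A minimal NumPy/SciPy sketch of this preprocessing chain is shown below; the Welch segment length, the peak-fraction threshold for the bilateral sweep, the filter order, and the librosa frame parameters are illustrative assumptions rather than values reported in the paper.

```python
import numpy as np
from scipy import signal
import librosa

FS = 1000  # 1 kHz accelerometer sampling rate

def adaptive_bandpass(x: np.ndarray, frac: float = 0.1) -> np.ndarray:
    x = x / np.max(np.abs(x))                        # amplitude normalization
    f, psd = signal.welch(x, fs=FS, nperseg=1024)    # FFT-based PSD estimate
    psd = psd / np.var(x)                            # scale by signal variance
    psd = np.convolve(psd, np.ones(5) / 5, "same")   # moving-average smoothing
    peak = int(np.argmax(psd))                       # dominant spectral formant
    thresh = frac * psd[peak]
    lo, hi = peak, peak
    while lo > 0 and psd[lo] > thresh:               # bilateral sweep: left
        lo -= 1
    while hi < len(psd) - 1 and psd[hi] > thresh:    # bilateral sweep: right
        hi += 1
    f_lo, f_hi = max(f[lo], 1.0), min(f[hi], FS / 2 - 1)
    sos = signal.butter(4, [f_lo, f_hi], "bandpass", fs=FS, output="sos")
    return signal.sosfiltfilt(sos, x)                # zero-phase band-pass

def spectral_features(x: np.ndarray):
    # Feature representations fed to the classifiers (frame sizes assumed).
    mel = librosa.feature.melspectrogram(y=x, sr=FS, n_fft=512, hop_length=128)
    mfcc = librosa.feature.mfcc(y=x, sr=FS, n_mfcc=20)
    chroma = librosa.feature.chroma_stft(y=x, sr=FS, n_fft=512)
    return mel, mfcc, chroma
```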
Classification models:
- Phrase (word) classification: A CNN with three convolutional layers (channels 32→64→128) processes 2D spectral inputs of size ~122×65. Dataset: 400 samples per phrase; split 8:1:1 for train/val/test. Achieved ~95% validation accuracy.
- User classification: Spectral density profiles vary by individual (e.g., laryngeal differences). A similar CNN pipeline yields ~95% validation accuracy; test performance corroborated via confusion matrix.
- Gender classification: Based on known vocal frequency ranges (males ~150–250 Hz; females ~200–400 Hz), training on balanced data from two male and two female participants achieved accuracy and recall comparable to phrase classification; confusion matrix showed clear separation.
- COPD severity classification: Multi-modal CNN architecture concatenates learned representations from mel spectrum (2D CNN), MFCC (1D CNN), and chroma (1D CNN). Training for 200 epochs with 8:1:1 splits. Single-feature models underfit (validation accuracy as low as ~61%; MFCC-only ~69%). Combining two features improved performance, and the final three-feature fusion model achieved high validation accuracy for binary classification of FEV1 ≥ 60% vs < 60% of predicted.
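A hedged PyTorch sketch of the three-branch fusion described in the bullet above follows; the channel widths, kernel sizes, and pooling choices are illustrative assumptions, with only the overall mel + MFCC + chroma fusion and the binary output following the paper's description.

```python
import torch
import torch.nn as nn

class FusionCOPDNet(nn.Module):
    def __init__(self, n_mfcc=20, n_chroma=12, n_classes=2):
        super().__init__()
        self.mel2d = nn.Sequential(                  # 2D CNN: mel spectrogram
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())   # -> (B, 64)
        self.mfcc1d = nn.Sequential(                 # 1D CNN: MFCC over time
            nn.Conv1d(n_mfcc, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten())   # -> (B, 32)
        self.chroma1d = nn.Sequential(               # 1D CNN: chroma over time
            nn.Conv1d(n_chroma, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten())   # -> (B, 32)
        self.head = nn.Linear(64 + 32 + 32, n_classes)  # fused classifier

    def forward(self, mel, mfcc, chroma):
        z = torch.cat([self.mel2d(mel), self.mfcc1d(mfcc),
                       self.chroma1d(chroma)], dim=1)  # concatenated embeddings
        return self.head(z)  # logits: FEV1 >= 60% vs < 60% predicted

# Smoke test with assumed input shapes (batch 2; mel ~122x65 as reported):
model = FusionCOPDNet()
out = model(torch.randn(2, 1, 122, 65),
            torch.randn(2, 20, 65), torch.randn(2, 12, 65))
```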
Data and cohorts: For COPD severity, participants were split by an FEV1 threshold of 60% predicted: n=21 (≥60%) and n=29 (<60%). Spirometry values were blinded during mechano-acoustic data collection, and no significant differences in age, sex, smoking status, or BMI were observed between groups (per Supplementary Table 1). The system also collected word/user/gender data; buffer intervals were 200 ms for word data and 25 ms for COPD data, and word-task transmission windows comprised ~4000 samples (~11 s) relayed via the cloud database.
Mechanical and reliability tests: Cyclic bending (radius ~5.5 cm, 1000 cycles) and indentation were conducted; adhesion strength tests compared several tapes and silicone sheets. Power consumption measured via Nordic PPK II; temperature monitored with IR camera over 24 h.
Key Findings
- The flexible, epidermal mechano-acoustic sensor with BLE streaming enables real-time, robust capture of laryngeal vibrations at a 1 kHz sampling rate (~500 Hz bandwidth) with high SNR and reduced susceptibility to environmental noise compared to microphones.
- End-to-end closed-loop latency from capture to server-side classification result is ~1 s with minimal packet loss.
- Operational endurance: ~160 h (advertising) and ~46 h (streaming) on a 240 mAh CR2032; temperature rise over 24 h ~1.5 °C, indicating safe thermal profile.
- Adhesion provides stable long-term attachment (adhesion ~90 N m⁻¹; peel ~60 N m⁻¹ after 24 h), supporting repeatable measurements.
- Phrase classification: CNN achieved ~95% validation accuracy across 10 selected phrases (8:1:1 split), with confusion matrix confirming high performance.
- User classification: Validation accuracy ~95%, consistent with distinct user-specific spectral patterns.
- Gender classification: Accuracy and recall comparable to phrase classification, aided by distinct frequency band distributions between males and females; confusion matrix shows clear separation.
- COPD severity classification: Multi-modal CNN fusing mel spectrum, MFCC, and chroma achieved high accuracy (reported 95% on validation/test plots) for binary FEV1 threshold (≥60% vs <60% predicted) on cohorts n=21 and n=29, outperforming single-feature models (as low as 61–69% validation accuracy).
- Motion artifacts concentrated in lower frequencies can be mitigated by adaptive band-pass filtering informed by scaled PSD and formant-centered cutoffs, improving downstream classification performance.
Discussion
The integrated mechano-acoustic and deep learning system addresses the limitations of traditional microphone-based and clinic-bound assessments by enabling robust, real-time, on-body monitoring with low latency and strong noise immunity. By capturing skin-coupled laryngeal vibrations at a 1 kHz sampling rate, the system preserves salient spectral-temporal features for multiple tasks: speech phrase recognition, user and gender identification, and clinically relevant COPD severity classification. The multi-modal feature fusion (mel, MFCC, chroma) overcomes the underfitting observed with single-feature models and leverages complementary information to improve classification accuracy. The ability to differentiate COPD severity with high accuracy (via FEV1 thresholding) suggests potential as a screening or monitoring tool to complement spirometry, particularly in settings where spirometry is underutilized. Additionally, device endurance, thermal safety, and stable adhesion support long-term and ambulatory use, which is valuable for capturing longitudinal changes in respiratory status. The findings demonstrate that skin-mounted accelerometer-based mechano-acoustic sensing can match the performance of microphone-based recordings under low-noise conditions while remaining markedly more robust in noisy environments, making it suitable for real-world deployment.
Conclusion
This work introduces a wireless, flexible, epidermal mechano-acoustic sensing platform integrated with a real-time, server-connected deep learning pipeline capable of multifunctional classification. Contributions include: (1) a robust skin-interfaced sensor with strong adhesion and safe thermal characteristics; (2) a low-latency mobile-cloud interface enabling real-time inference; (3) tailored preprocessing and adaptive spectral filtering; (4) CNN-based models achieving high accuracies for phrase, user, and gender classification; and (5) a multi-modal fusion network that classifies COPD severity (FEV1 ≥ 60% vs < 60% predicted) with high accuracy, outperforming single-feature baselines. Future work should expand datasets (including diverse demographics and disease severities), validate against clinical gold standards across multiple centers, explore additional respiratory and speech-related conditions, and investigate on-device inference for offline or edge scenarios.
Limitations
- Dataset size and composition are limited, particularly for COPD (binary FEV1 groups n=21 and n=29), which may constrain generalizability and necessitate external validation.
- Single-feature models showed underfitting, indicating sensitivity to feature selection and the need for richer predictors; performance gains relied on multimodal fusion.
- While comparisons to microphones suggest better noise robustness, controlled head-to-head benchmarks across varied acoustic environments are limited in the reported experiments.
- The study notes that some features may not directly diagnose specific speech or pulmonary diseases; broader clinical validation is required for diagnostic claims beyond severity stratification.
- Certain implementation details (e.g., reliance on server/cloud connectivity) may limit use in low-connectivity environments; on-device inference was not demonstrated.
- Adhesive performance and comfort were characterized, but very long-term wear and effects across diverse skin types and activities warrant further evaluation.