Automatic diagnosis of COVID-19 related respiratory diseases from speech

Medicine and Health

K. Shekhar, N. B. Chittaragi, et al.

Discover how Kushan Shekhar, Nagaratna B Chittaragi, and Shashidhar G Koolagudi are harnessing speech processing to identify COVID-19 related illnesses. Their system leverages cough and breath samples, achieving an accuracy of 84%. This research paves the way for speech-based analysis to support healthcare professionals in combating respiratory diseases.
Introduction

The study investigates whether respiratory audio signals (cough and breath) can be used to automatically diagnose COVID-19-related illnesses. Motivated by the need for scalable, contactless, and rapid screening during the COVID-19 pandemic, when patient volume can overwhelm healthcare capacity, the authors explore machine learning methods on crowdsourced respiratory sounds to distinguish healthy from affected individuals. They evaluate standard spectral/prosodic features and excitation source features, and assess classification performance using shallow models. The work aims to determine the effectiveness of standard audio features for non-speech respiratory sounds, classify different coughs and breaths, analyze the utility of excitation source features, and examine the feasibility of ML-based automatic diagnosis systems for respiratory diseases.

Literature Review

Early efforts during the pandemic examined audio-based COVID-19 detection from crowdsourced datasets. Brown et al. (2020) used handcrafted spectral/temporal features (e.g., MFCCs, spectral centroid, roll-off, ZCR) and transfer-learned VGGish embeddings on cough/breath data with classifiers such as SVM, logistic regression, and gradient boosting. Bagad et al. (2020) evaluated larger-scale hospital-collected datasets, focusing on cough and using ResNet-18 alongside handcrafted features. Hassan et al. (2020) studied cough, breath, and voice from 80 subjects using spectral features with RNNs, reporting promising preliminary results. Related work also explores paralinguistic challenges (e.g., INTERSPEECH 2021), pertussis diagnosis from cough, and broader ML in healthcare. The reviewed literature indicates feasibility of audio-based diagnosis, frequent use of spectral features/MFCCs, and mixed results depending on data type, features, and classifiers. The present work extends this by incorporating excitation source features (e.g., GCIs, instantaneous frequency) rarely used in medical audio analysis.

Methodology

Data: A crowdsourced respiratory sound dataset collected by the University of Cambridge via web and mobile apps (release as of May 22, 2020). The study uses cough and breath samples separately and combined.

Features: Two feature sets were constructed.

  • Feature set-1 (standard audio features): spectral centroid, spectral rolloff, zero-crossing rate, root mean square energy (RMSE), MFCCs, delta MFCCs (first derivative), and delta-delta MFCCs (second derivative). Based on preliminary tests, 20 MFCC coefficients (and corresponding deltas) were used.
  • Feature set-2 (feature set-1 + excitation source features): added instantaneous frequency and statistics of glottal closure instants (minimum, mean, maximum, standard deviation).

MFCC extraction procedure: pre-emphasis; framing (20 ms, 10 ms shift) with Hamming window; DFT; Mel filterbank; log Mel spectrum; DCT to obtain MFCCs; plus delta and delta-delta.

Excitation source features: GCIs derived via Zero Frequency Filtering (ZFF). The signal is passed through two ideal resonators at 0 Hz, and trend removal then yields the zero-frequency-filtered signal (ZFS); GCIs correspond to its negative-to-positive zero crossings. The instantaneous fundamental period T0 is obtained from successive GCIs, with F0 = 1/T0. The method was implemented in MATLAB (custom processing to derive the ZFS, convolution-based trend removal, 20 ms frame length). Instantaneous frequency and GCI statistics were extracted, exported to CSV, and merged with the Python-extracted features.
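The ZFF recipe can be sketched in a few lines of NumPy. This follows the standard formulation (Murty and Yegnanarayana style), not the authors' MATLAB code; the 20 ms trend-removal window matches the paper, while the triple trend removal and impulse-train test signal are assumptions:

```python
import numpy as np

def zff_gci(s, sr, win_ms=20.0):
    """Sketch of Zero Frequency Filtering for GCI extraction.

    Returns GCI sample indices and instantaneous F0 values computed
    from successive GCI intervals (F0 = 1/T0).
    """
    x = np.diff(s, prepend=s[0])             # remove slow bias before filtering
    y = x.astype(np.float64)
    for _ in range(4):                       # two ideal 0 Hz resonators
        y = np.cumsum(y)                     # (each resonator ~ double integrator)
    w = max(3, int(sr * win_ms / 1000) | 1)  # odd trend-removal window (samples)
    kernel = np.ones(w) / w
    for _ in range(3):                       # iterated local-mean (trend) removal
        y = y - np.convolve(y, kernel, mode="same")
    # GCIs: negative-to-positive zero crossings of the ZFF signal
    gci = np.flatnonzero((y[:-1] < 0) & (y[1:] >= 0)) + 1
    t0 = np.diff(gci) / sr                   # instantaneous pitch periods
    f0 = 1.0 / t0 if len(t0) else np.array([])
    return gci, f0

# Toy check: a 100 Hz impulse train (glottal-pulse-like excitation)
sr = 8000
s = np.zeros(sr)
s[::sr // 100] = 1.0
gci, f0 = zff_gci(s, sr)
print(len(gci), float(np.median(f0)))  # median F0 should land near 100 Hz
```

From `gci` and `f0`, the paper's statistics (min, mean, max, standard deviation of GCIs) reduce to `np.min`/`np.mean`/`np.max`/`np.std` calls.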

Experimental setup: Feature extraction with Librosa (default SR 22050 Hz, frame size 2048 samples ~93 ms and hop 512 samples ~23 ms for standard features), storage in CSV, data handling with pandas/CSV. Classifiers: logistic regression, SVM with RBF kernel, and SVM with linear kernel implemented via scikit-learn. Data split 80/20 train-test; due to potential multiple samples per user, k-fold cross-validation (2-, 5-, 10-fold) was used, with 5-fold selected for main experiments due to best performance. Performance metrics reported: accuracy, AUC, precision, recall.
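The classification and cross-validation setup above can be sketched with scikit-learn. The feature matrix here is a random stand-in for the real CSV of extracted features (dimensions and labels are illustrative only); the classifier, fold count, and metrics follow the paper:

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Random stand-in for the real feature CSV: 200 recordings x 10 features,
# binary healthy/affected label driven by the first feature plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# SVM with RBF kernel, 5-fold CV, scored with the paper's four metrics
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scores = cross_validate(clf, X, y, cv=5,
                        scoring=["accuracy", "roc_auc", "precision", "recall"])
for metric in ["accuracy", "roc_auc", "precision", "recall"]:
    print(metric, round(scores[f"test_{metric}"].mean(), 3))
```

Swapping `SVC(kernel="rbf")` for `SVC(kernel="linear")` or `LogisticRegression()` reproduces the other two classifier configurations.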

Key Findings
  • MFCC-only baselines:
    • Cough: Using 20 MFCCs with SVM (RBF) performed best: accuracy 78.88%, AUC 79.68%, precision 80.85%, recall 75.21% (13 MFCCs performed notably worse).
    • Breath: SVM (RBF) with 20 MFCCs achieved accuracy 76.35%, AUC 74.38%, precision 79.17%, recall 61.25%. Breath results were similar for 13 vs 20 MFCCs under SVM.
  • Feature set-1 (standard audio features: MFCCs + deltas + spectral features):
    • Cough: SVM (RBF) best with accuracy 80.11%, AUC 79.87%, precision 78.23%, recall 77.57%.
    • Breath: Logistic regression and SVMs yielded comparable results. Best recall with SVM (linear): accuracy 77.24%, AUC 78.17%, precision 74.21%, recall 79.76%.
    • Combined (Cough + Breath): Best overall performance with SVM (RBF): accuracy 83.84%, AUC 83.93%, precision 88.41%, recall 77.22%. Logistic regression also strong: accuracy 81.41%, AUC 81.23%.
  • Feature set-2 (feature set-1 + excitation source features):
    • Breath showed consistent improvements across classifiers:
      • Logistic regression: accuracy 82.93%, AUC 81.51%, precision 81.92%, recall 74.88%.
      • SVM (linear): accuracy 81.38%, AUC 83.01%, precision 79.46%, recall 83.28% (highest AUC/recall for breath).
      • SVM (RBF): accuracy 80.51%, AUC 81.67%, precision 83.86%, recall 76.19%.
    • Cough: Mixed effects; logistic regression improved (accuracy 80.54%, AUC 80.92%), but SVM (RBF) decreased (accuracy 77.17%, AUC 77.02%).
    • Combined (Cough + Breath): No improvement over feature set-1; best with SVM (RBF) accuracy 82.17%, AUC 82.24%, precision 86.02%, recall 74.35% (lower than 83.84%/83.93% with feature set-1).
  • Overall best result: Using standard audio features (feature set-1) with combined cough+breath and SVM (RBF) achieved accuracy 83.84% and AUC 83.93% (rounded as 84%/84% in the narrative).

Discussion

The findings demonstrate that respiratory audio can support automatic screening for COVID-19–related conditions. Standard spectral/prosodic features (MFCCs, deltas, spectral centroid/rolloff, ZCR, RMSE) are effective, especially when combining cough and breath inputs. The complementary information in cough and breath yields superior performance over either modality alone, addressing the research objective of distinguishing healthy vs affected individuals from respiratory sounds.

Excitation source features (GCIs and instantaneous frequency) contributed positively for breath (notably boosting AUC and recall with SVM linear and logistic regression) but did not improve the best overall combined-data performance and sometimes degraded cough performance under SVM RBF. This suggests source-related cues may be more salient in breathing patterns than in cough transients for this dataset/configuration. SVM with RBF consistently provided strong performance, particularly with the standard feature set and combined data, indicating non-linear boundaries between classes in the feature space.

These results support the feasibility of ML-based, speech-signal-driven pre-screening tools to alleviate clinical workload, while highlighting the importance of multimodal respiratory inputs and careful feature selection.

Conclusion

The study proposes and evaluates an automatic diagnosis system for COVID-19-related respiratory diseases using cough and breath audio. Two feature sets were examined: (1) standard audio features (spectral/prosodic, MFCCs and derivatives) and (2) standard features augmented with excitation source features (GCIs and instantaneous frequency). Using shallow classifiers (logistic regression, SVM linear, SVM RBF), the best overall performance was achieved with standard audio features and combined cough+breath inputs using SVM RBF (accuracy ≈84%, AUC ≈84%), outperforming a referenced baseline.

Future work includes: collecting larger and more diverse datasets to enable deep learning; incorporating additional excitation-source features (e.g., strength/energy of excitation, residual MFCCs); adding speech/voice samples alongside cough and breath; and adopting subject-disjoint evaluation protocols to better assess generalization.

Limitations
  • Dataset size and composition were limited (release as of May 22, 2020), restricting the use of deep learning and potentially limiting generalizability.
  • Only cough and breath samples were available; speech/voice samples were not included in experiments.
  • The 80/20 split with k-fold CV may not ensure subject-level disjointness, risking data leakage if multiple samples per user appear across folds.
  • Excitation source features were limited to GCIs and instantaneous frequency; broader source features were not evaluated.