Medicine and Health
Sounds of COVID-19: exploring realistic performance of audio-based digital testing
J. Han, T. Xia, et al.
Since December 2019, COVID-19 has caused hundreds of millions of infections and millions of deaths, highlighting the need for fast, affordable, and scalable testing beyond RT-PCR and CT imaging, which face logistical, cost, and sensitivity limitations in real-world deployment. Recent machine learning advances suggest that respiratory sounds (cough, breathing, voice) may contain discriminative signatures for COVID-19 detection. Prior work reported AUC-ROC values from 0.70 to 0.99 across models, and a physiological rationale exists: COVID-19 may affect vocal fold motion and respiratory acoustics. However, many studies are at risk of bias and methodological flaws: small or unrepresentative cohorts, potential overfitting and data leakage (e.g., sample-level splits that allow the same user in train and test), lack of comparison with other respiratory conditions, and unclear reporting. This paper asks two questions: (1) What is the realistic performance of an audio-based COVID-19 prediction model under unbiased, participant-independent evaluation? (2) How do biases (age, gender, language) and methodological choices (e.g., random sample-level splits) affect performance? The authors crowdsourced a large dataset via a mobile/web app, controlled for language (English) and demographics during model development, and adhered to TRIPOD guidelines to provide transparent, clinically relevant reporting and analysis. Contributions include: large-scale real-world data collection and careful cohort selection; systematic evaluation across demographics, prevalence scenarios, and health/smoking subgroups; and explicit demonstrations of how biased data and unrealistic splits inflate performance.
Multiple AI systems have been proposed for COVID-19 detection using imaging and audio. Systematic reviews report wide AUC-ROC ranges (0.70–0.99) for COVID-19 prediction models. Prior audio-based work indicates that respiratory and voice signals may carry COVID-19 signatures, with some studies claiming very high accuracy. Nonetheless, concerns include very small cohorts (e.g., 19–88 subjects in several studies), risk of overfitting with complex models, data leakage from using the same participants across train and test folds in cross-validation, and limited comparison to other respiratory diseases or confounders. Calls for more rigorous methodological standards emphasize transparent cohort diagrams, user-independent splits, and analyses of potential confounders such as demographics and language.
Data collection: A multilingual app (web, Android, iOS; launched April 2020) collected demographics, medical history, current symptoms (11 types), self-reported COVID-19 test results, and audio: three voluntary coughs, three to five breathing sounds, and three readings of a fixed sentence. As of April 26, 2021, 36,364 participants had contributed 75,201 samples. Exclusions: samples without test results (61,615), tests older than 14 days relative to the recording (110), non-English audio (8,102; to avoid language confounding in the main analysis), and low-quality or incomplete recordings (134). Final dataset: 5,240 samples from 2,478 English-speaking participants (514 positive; 1,964 negative).

Study design and splits: From the 2,478 participants, an unbiased training/validation cohort of 800 participants (400 positive, 400 negative) was constructed with balanced age and gender distributions across splits. The remaining participants formed a testing pool, from which a balanced test set of 200 participants (100 positive/100 negative) with varied demographics was held out. Additional evaluation groups were sampled from the testing pool to simulate varying prevalence levels (5%, 10%, 20%) and to analyze subgroups by medical history (e.g., asthma, high blood pressure), smoking status (never, ex-, current), and hospitalisation status. To study bias and unrealistic evaluations, additional data configurations were created: (a) sample-level random splits allowing samples from the same participant in both train and test; (b) gender-biased splits (negative group 85% female); (c) age-biased splits with negatives either all >39 or all <39; (d) language-biased splits created by replacing some English speakers with Italian speakers and over-representing positive Italians in training.

Model architecture: A CNN with three input branches (breathing, cough, voice) processes 0.96 s non-overlapping log-mel spectrogram segments (64 mel bins; 25 ms window; 10 ms hop; 125–7,500 Hz). Each modality uses VGGish (pre-trained on AudioSet) to extract a 128-D embedding per frame, followed by average pooling to obtain one latent vector per recording. The three modality vectors are concatenated and passed to a binary classifier with two dense layers (hidden size 96 with ReLU; output size 2 with Softmax).

Training: Audio was resampled to 16 kHz mono, trimmed of leading/trailing silence, and normalized. Optimization used Adam with binary cross-entropy, batch size 1, differential learning rates (1e-6 for VGGish, 1e-5 for the dense layers), learning rate decay (0.9), L2 regularization (1e-6), and a two-phase training procedure (model selection around the 15th epoch on validation, then fine-tuning).

Evaluation: Participant-independent testing; metrics included ROC-AUC, sensitivity, and specificity with two-sided 95% CIs estimated via 1,000-sample bootstrap. Ethics approval was obtained (University of Cambridge CS Dept., ID #722).
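For concreteness, here is a minimal sketch of the audio front-end described above, assuming librosa for loading and feature extraction; the function name, trimming defaults, and normalization step are illustrative rather than the authors' exact pipeline.

```python
# Sketch of the log-mel front-end (assumed librosa-based; not the authors' code).
import numpy as np
import librosa

SR = 16000                 # audio resampled to 16 kHz mono
N_MELS = 64                # mel bins
WIN = int(0.025 * SR)      # 25 ms analysis window
HOP = int(0.010 * SR)      # 10 ms hop
FMIN, FMAX = 125, 7500     # band limits in Hz

def logmel_segments(path):
    """Load a recording, trim silence, and cut it into 0.96 s log-mel patches."""
    y, _ = librosa.load(path, sr=SR, mono=True)
    y, _ = librosa.effects.trim(y)                  # drop leading/trailing silence
    y = y / (np.max(np.abs(y)) + 1e-8)              # peak-normalize
    mel = librosa.feature.melspectrogram(
        y=y, sr=SR, n_fft=WIN, hop_length=HOP,
        n_mels=N_MELS, fmin=FMIN, fmax=FMAX)
    logmel = np.log(mel + 1e-6)                     # log compression
    frames = 96                                     # 96 frames x 10 ms = 0.96 s
    n_seg = logmel.shape[1] // frames               # non-overlapping segments
    return np.stack([logmel[:, i * frames:(i + 1) * frames]
                     for i in range(n_seg)])        # shape (n_seg, 64, 96)
```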
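The fusion architecture and optimizer settings can be sketched in PyTorch as follows. Here `vggish` stands for any pre-trained VGGish backbone that maps a batch of 0.96 s log-mel patches to 128-D embeddings (e.g., a port such as torchvggish), and the class name is hypothetical.

```python
# Sketch of the three-branch fusion model and training setup described above.
import torch
import torch.nn as nn

class CovidAudioNet(nn.Module):  # hypothetical name
    def __init__(self, vggish: nn.Module):
        super().__init__()
        self.vggish = vggish                          # pre-trained backbone
        self.classifier = nn.Sequential(
            nn.Linear(3 * 128, 96), nn.ReLU(),        # hidden size 96
            nn.Linear(96, 2),                         # two logits; Softmax at inference
        )

    def embed(self, segments: torch.Tensor) -> torch.Tensor:
        emb = self.vggish(segments)                   # (n_seg, 128) frame embeddings
        return emb.mean(dim=0)                        # average pooling over segments

    def forward(self, breath, cough, voice):
        z = torch.cat([self.embed(breath), self.embed(cough), self.embed(voice)])
        return self.classifier(z)                     # logits for one participant

model = CovidAudioNet(vggish)                         # `vggish` loaded elsewhere
optimizer = torch.optim.Adam(
    [{"params": model.vggish.parameters(), "lr": 1e-6},       # slow backbone tuning
     {"params": model.classifier.parameters(), "lr": 1e-5}],  # faster classifier head
    weight_decay=1e-6)                                        # L2 regularization
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)  # LR decay
```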
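The difference between the paper's participant-independent protocol and the leaky sample-level alternative comes down to how the split is grouped; a minimal scikit-learn sketch with synthetic participant IDs:

```python
# Contrast of split protocols (assumed scikit-learn sketch; IDs are synthetic).
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, train_test_split

rng = np.random.default_rng(0)
n_samples = 5240
user_ids = rng.integers(0, 2478, size=n_samples)   # ~2 samples per participant
idx = np.arange(n_samples)

# Unrealistic: sample-level split; the same participant can land in both sets.
tr_s, te_s = train_test_split(idx, test_size=0.2, random_state=0)
leaked = set(user_ids[tr_s]) & set(user_ids[te_s])
print(f"{len(leaked)} participants appear in BOTH train and test")

# Realistic: group-level split; no participant crosses the boundary.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
tr_g, te_g = next(gss.split(idx, groups=user_ids))
assert set(user_ids[tr_g]).isdisjoint(user_ids[te_g])
```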
Primary performance (balanced, participant-independent test set; 100 positive/100 negative):
• All three modalities combined: ROC-AUC 0.71 (95% CI 0.65–0.77), sensitivity 0.65 (0.58–0.72), specificity 0.69 (0.62–0.76).
• Cough alone: AUC 0.66 (0.60–0.71), sensitivity 0.59 (0.51–0.66), specificity 0.66 (0.58–0.73).
• Breathing alone: AUC 0.62 (0.56–0.68), sensitivity 0.64 (0.56–0.71), specificity 0.56 (0.48–0.63).
• Voice alone: AUC 0.61 (0.55–0.67), sensitivity 0.57 (0.49–0.64), specificity 0.60 (0.52–0.67).

Demographic subgroups (combined modalities):
• Gender: male (58/52) AUC 0.71 (0.63–0.78), sensitivity 0.59 (0.49–0.68), specificity 0.74 (0.63–0.83); female (42/46) AUC 0.73 (0.65–0.80), sensitivity 0.71 (0.61–0.81), specificity 0.65 (0.55–0.75).
• Age: 16–39 (55/54) AUC 0.65 (0.56–0.73), sensitivity 0.57 (0.46–0.68), specificity 0.65 (0.55–0.75); 40–59 (36/34) AUC 0.76 (0.67–0.85), sensitivity 0.72 (0.61–0.82), specificity 0.68 (0.55–0.81); 60+ (4/6) AUC 0.91 (0.77–1.00), sensitivity 0.88 (0.60–1.00), specificity 0.88 (0.69–1.00) (small N).

Symptom status (combined modalities):
• Asymptomatic (18/73): AUC 0.75 (0.60–0.88), sensitivity 0.50 (0.25–0.76), specificity 0.85 (0.77–0.92).
• Symptomatic (144/89): AUC 0.66 (0.59–0.73), sensitivity 0.67 (0.59–0.74), specificity 0.56 (0.45–0.66).

Prevalence scenarios (sampled from the testing pool; positives/negatives):
• 5% (80/1,520): AUC 0.71 (0.66–0.75), sensitivity 0.65 (0.57–0.73), specificity 0.65 (0.63–0.67).
• 10% (100/900): AUC 0.69 (0.65–0.74), sensitivity 0.65 (0.57–0.72), specificity 0.65 (0.63–0.68).
• 20% (100/400): AUC 0.69 (0.65–0.74), sensitivity 0.65 (0.58–0.72), specificity 0.63 (0.59–0.67).

Medical history, smoking, and hospitalisation subgroups:
• Asthma (10/142): AUC 0.59 (0.42–0.77), sensitivity 0.33 (0.07–0.64), specificity 0.62 (0.55–0.68).
• High blood pressure (6/90): AUC 0.71 (0.56–0.85), sensitivity 0.54 (0.25–0.82), specificity 0.69 (0.62–0.76).
• No medical history (97/1,517): AUC 0.69 (0.65–0.74), sensitivity 0.64 (0.57–0.72), specificity 0.65 (0.63–0.67).
• Smoking: never (55/840) AUC 0.73 (0.67–0.79), sensitivity 0.70 (0.61–0.80), specificity 0.66 (0.63–0.68); ex- (17/285) AUC 0.71 (0.62–0.80), sensitivity 0.64 (0.47–0.82), specificity 0.67 (0.62–0.71); current (22/257) AUC 0.57 (0.46–0.68), sensitivity 0.47 (0.31–0.66), specificity 0.63 (0.58–0.68).
• Hospitalisation: yes (3/6) AUC 0.83 (0.45–1.00), sensitivity 1.00, specificity 0.67 (0.25–1.00); no (97/1,517) AUC 0.69 (0.65–0.74), sensitivity 0.64 (0.57–0.72), specificity 0.65 (0.63–0.67).

Bias and unrealistic evaluation effects:
• Sample-level random splits (same-participant data across train and test) inflated performance versus user-independent splits: sensitivity 0.84 (0.75–0.92) and specificity 0.78 (0.68–0.87) for seen participants; overall AUCs in biased/unrealistic settings could reach up to 0.90 (95% CI 0.85–0.95).
• Gender bias (85% of negatives female) yielded skewed metrics in the female subgroup: sensitivity 0.23 (0.14–0.33) versus specificity 0.93 (0.90–0.97), indicating the model learned gender confounds.
• Age bias in negatives (all >39 or all <39) led to subgroup-specific inflated specificity consistent with the biased training distribution.
• Language bias (over-representation of positive Italian speakers) led to sensitivity as low as 0.25 (0.15–0.36) in the English subgroup and near-zero specificity in the Italian subgroup; the voice modality was most affected, cough less so.
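The 95% CIs quoted above can be reproduced in spirit with a simple nonparametric bootstrap over the test set; this sketch assumes scikit-learn, and the function name is illustrative.

```python
# Minimal 1,000-resample bootstrap CI for ROC-AUC (sketch, not the authors' code).
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc(y_true, y_score, n_boot=1000, seed=0):
    """Point estimate plus a two-sided 95% bootstrap CI for ROC-AUC."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))   # resample with replacement
        if len(np.unique(y_true[idx])) < 2:               # AUC needs both classes
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(aucs, [2.5, 97.5])             # 95% percentile interval
    return roc_auc_score(y_true, y_score), (lo, hi)
```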
Threshold analysis: For asymptomatic screening, lowering the decision threshold can increase sensitivity and Youden Index; for symptomatic cohorts, slightly increasing the threshold can improve specificity (demonstrated on test data; further calibration needed for clinical deployment).
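As a sketch of this tuning, the Youden Index (sensitivity + specificity − 1) can be maximized over the ROC curve; shifting the chosen threshold downward favors sensitivity (asymptomatic screening), upward favors specificity (symptomatic cohorts). Variable names here are illustrative.

```python
# Threshold selection via the Youden Index (assumed scikit-learn sketch).
import numpy as np
from sklearn.metrics import roc_curve

def youden_threshold(y_true, y_score):
    """Return the decision threshold maximizing sensitivity + specificity - 1."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    j = tpr - fpr                    # Youden Index at each candidate threshold
    return thresholds[np.argmax(j)]
```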
The study demonstrates that an audio-based, multi-modal (cough, breathing, voice) deep learning model achieves realistic, participant-independent COVID-19 detection performance around ROC-AUC 0.7 across diverse demographics and at prevalence levels closer to real-world settings. Performance is relatively consistent across gender and age groups (noting small-N uncertainty for 60+), and the model does not appear to misclassify based on common comorbidities (asthma, high blood pressure) or smoking status, with predicted probabilities for negatives showing no significant subgroup differences. These findings support potential utility for pre-screening, especially for non-hospitalised cases, with decision thresholds tunable for different use contexts (e.g., maximizing sensitivity for asymptomatic screening vs. specificity for symptomatic users). Crucially, the work quantifies how methodological choices and data biases can substantially inflate results: random sample-level splits that leak user identity, as well as demographic or language imbalances, can produce over-optimistic metrics and skewed sensitivity/specificity in subgroups, undermining clinical applicability. By enforcing user-independent splits and controlling demographics and language (English-only) during model development, the study provides a realistic performance baseline. These insights guide responsible deployment and evaluation of digital audio-based testing tools, emphasizing robust cohort design, transparent reporting (TRIPOD), and careful consideration of confounding variables and threshold calibration for different operational goals.
A deep learning approach using cough, breathing, and voice recordings can realistically detect COVID-19 with ROC-AUC ≈0.71 under participant-independent evaluation on a large, crowdsourced dataset. The multi-modal model outperforms single modalities and exhibits stable performance across demographics and varying prevalence, with robustness to certain comorbidities and smoking status. The study explicitly demonstrates that unrealistic experimental designs and biased datasets can significantly overestimate performance (AUC up to 0.90) and distort subgroup metrics, underscoring the need for unbiased cohort construction and user-independent validation. As an affordable, non-invasive screening tool, audio-based models could support at-home pre-screening and resource prioritization. Future work should include external clinical validation, calibration and uncertainty estimation, exploration of accent and broader language effects, evaluation against other respiratory infections (e.g., influenza), integration with additional biosignals and self-reported symptoms, and systematic threshold tuning for different deployment scenarios.
• Ground truth is based on self-reported COVID-19 test results; crowdsourced labels may contain noise.
• A temporal mismatch of up to 14 days between test and audio could mislabel cases; RT-PCR itself has imperfect sensitivity/specificity.
• The study focuses on English speakers to control language confounding; effects of accents and multilingual deployment remain untested here.
• Limited representation of elderly participants reduces certainty for 60+ subgroup estimates.
• Not evaluated against a broader set of specific respiratory infections (e.g., influenza, rhinovirus) due to low prevalence and ground-truth challenges.
• Potential confounders (e.g., race/ethnicity) not fully controlled.
• Threshold adjustment analyses are proof-of-concept on the test set; comprehensive calibration for clinical criteria remains future work.
• Lack of external clinical validation; generalizability to clinical settings requires further studies.