The unreliability of crackles: insights from a breath sound study using physicians and artificial intelligence

C. Huang, C. Chen, et al.

In this prospective study, Chun-Hsiang Huang, Chi-Hsin Chen, Jing-Tong Tzeng, An-Yan Chang, Cheng-Yi Fan, Chih-Wei Sung, Chi-Chun Lee, and Edward Pei-Chuan Huang examine how reliably physicians and artificial intelligence identify crackles and wheezes in breath sounds. Both performed well for wheezes, but crackle identification proved markedly less reliable, raising important questions for medical decision-making.

Introduction
The study addresses the longstanding variability and subjectivity in respiratory auscultation, which affects clinical decision-making. While auscultation is non-invasive, real-time, and inexpensive, inter-observer agreement is limited and heavily dependent on experience and training. Advances such as digital stethoscopes, spectrogram visualization, and machine learning have improved breath sound acquisition and analysis. Wheezes are typically high-pitched, musical, and longer in duration, whereas crackles are brief, discontinuous, and lower-frequency, potentially making them harder to perceive and classify. The research question is whether the intrinsic characteristics of different adventitious sounds (wheezes versus crackles) contribute to differences in human and AI identification performance. The purpose is to compare physicians and AI in identifying breath sounds and to assess the reliability of these sounds for clinical use, with particular interest in the robustness of deep learning models to different sound characteristics. The study is important for informing the role of auscultation and AI tools in acute care settings and for understanding which breath sounds can reliably guide clinical decisions.
Literature Review
Prior work indicates auscultation’s reproducibility can be acceptable but inter-observer variability remains problematic and linked to clinician experience and preferences. Meta-analyses suggest limited diagnostic accuracy of conventional lung auscultation and recommend complementary imaging modalities. Technological advances, including digital stethoscopes and spectrogram visualization, have improved detection of high-frequency sounds like wheezes and enhanced inter-rater agreement. Machine learning approaches have increased objectivity and performance in respiratory sound analysis, with better performance often reported for wheeze detection than crackle detection. Studies highlight the need for standardized terminology, improved training, and advanced equipment for classifying crackles, which are transient and susceptible to masking by background breath noise. Comparable AI successes in other auscultatory domains (e.g., heart murmurs) suggest potential but also reveal that sound type characteristics critically influence classification difficulty.
Methodology
Design and setting: Cross-sectional comparative study in the emergency department (ED) of National Taiwan University Hospital Hsinchu Branch, a tertiary medical center, from January 2021 to February 2022.
Participants: Non-trauma patients aged >20 years presenting to the ED. Exclusions: pregnant individuals, out-of-hospital cardiac arrest, transfers from other facilities, or discharge against medical advice.
Data acquisition: Breath sounds were recorded with a CarDiLaRT Electronic Stethoscope D5101 (0–8000 Hz; emphasis 20–1000 Hz). Recordings were 10 seconds each from four chest sites across both lungs (upper: midclavicular line, 2nd intercostal space; lower: anterior axillary line at the inferior scapular rim). Audio was exported as 16-bit WAV, resampled to 16 kHz, and converted to mel-spectrograms. No post-processing or filtering was applied prior to labeling.
Labeling process: Each recording was labeled once by one of five physicians (pretraining yielded κ=0.7 on demonstration recordings) using five categories: normal, wheezing, crackles, unknown, and no breath sounds (definitions were provided for each). Six AI models were built: five “AI doctor” models, each trained on one physician’s labels, and one All-data AI model trained on all labels. Each recording was then labeled by the five AI doctor models and the All-data AI model. A label was flagged as doubtful when the original physician’s label, the All-data AI model’s label, and the majority output of the five AI doctor models disagreed. Doubtful recordings were re-assessed by two additional physicians, and final labels were assigned by majority vote to serve as the gold standard.
Outcomes and analysis: The primary outcome was the efficacy of the initial labeling physician and the All-data AI model in identifying different breath sounds. Performance metrics included sensitivity, specificity, and AUROC with 95% CIs (Clopper-Pearson). Categorical variables were summarized as counts and percentages.
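The doubtful-label screening and majority-vote relabeling described above can be sketched as a simple voting rule. This is a minimal illustration only, not the study's actual code; the function names and data representation are hypothetical assumptions.

```python
from collections import Counter

def majority(labels):
    """Return the most common label, or None when there is a tie at the top."""
    (top, n_top), *rest = Counter(labels).most_common()
    if rest and rest[0][1] == n_top:
        return None  # no clear majority
    return top

def is_doubtful(physician_label, all_data_ai_label, ai_doctor_labels):
    """Flag a recording as doubtful when the original physician's label,
    the All-data AI model's label, and the majority output of the five
    AI-doctor models do not all agree."""
    return not (physician_label == all_data_ai_label == majority(ai_doctor_labels))

def gold_standard(original_label, reassessor_labels):
    """Final label for a doubtful recording: majority vote over the
    original label plus the two re-assessing physicians' labels."""
    return majority([original_label] + list(reassessor_labels))
```

For example, a recording labeled "crackles" by the physician but "normal" by the All-data AI would be flagged as doubtful and sent to the two additional physicians, whose votes then settle the gold-standard label.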
Key Findings
- Labeled dataset: 11,532 breath sound files; 579 doubtful labels detected. After re-labeling and exclusion, 305 recordings had gold-standard labels for performance evaluation.
- Wheezing identification (physicians vs. All-data AI): sensitivity 89.5% vs. 86.6%; specificity 96.4% vs. 95.2%. Both demonstrated high AUROC, with no significant difference reported between physician and AI.
- Crackle identification (physicians vs. All-data AI): sensitivity 93.9% vs. 80.3%; specificity 56.6% vs. 65.9%. Despite relatively high sensitivity, specificity was low, and AUROC was lower for crackles than for wheezes.
- Normal breath sounds were frequently misclassified as crackles, contributing to the low specificity for crackles.
- Overall, wheeze detection was robust for both humans and AI, whereas crackle detection remained challenging and less reliable for clinical decision-making.
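The sensitivity/specificity figures above come with exact (Clopper-Pearson) 95% confidence intervals. A minimal sketch of that computation, assuming SciPy is available, is shown below; the confusion-matrix counts in the example are illustrative only, not the study's actual data.

```python
from scipy.stats import beta

def clopper_pearson(k, n, alpha=0.05):
    """Exact (Clopper-Pearson) confidence interval for a binomial
    proportion k/n, via quantiles of the beta distribution."""
    lo = 0.0 if k == 0 else beta.ppf(alpha / 2, k, n - k + 1)
    hi = 1.0 if k == n else beta.ppf(1 - alpha / 2, k + 1, n - k)
    return lo, hi

def sens_spec(tp, fn, tn, fp):
    """Sensitivity and specificity, each with an exact 95% CI."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return (sens, clopper_pearson(tp, tp + fn)), (spec, clopper_pearson(tn, tn + fp))

# Illustrative counts (NOT from the paper): 60 wheeze-positive, 240 negative.
(sens, sens_ci), (spec, spec_ci) = sens_spec(tp=52, fn=8, tn=228, fp=12)
```

The exact interval is preferred over the normal approximation here because per-class positive counts can be small, where the approximation is unreliable.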
Discussion
The findings indicate that the acoustic properties of breath sounds strongly influence both human and AI classification. Wheezes, being high-pitched and longer in duration, are acoustically salient and easier to recognize, yielding high sensitivity and specificity for both physicians and AI. In contrast, crackles are brief and discontinuous, making them harder to perceptually isolate and more susceptible to masking by background breath noise. This leads to frequent misclassification, particularly confusion with normal breath sounds, producing low specificity and lower AUROC. The study’s dual examination framework (physician and AI) increased label robustness and highlighted shared limitations across human and machine perception. Despite leveraging multiple deep learning models and different training sets, AI performance for crackles lagged behind wheezes, mirroring human difficulty. These results emphasize that clinical decisions based on crackles should be cautious and corroborated with further testing (e.g., imaging, additional clinical assessment). They also underscore the importance of standardized definitions, training, improved acquisition hardware, and potentially enhanced signal processing and model architectures to better capture crackle characteristics.
Conclusion
Physicians and deep learning models both perform substantially better in identifying wheezes than crackles or normal breath sounds. Given the low specificity and reduced AUROC for crackle detection, medical decisions predicated on crackles should be made cautiously and validated through additional examinations. Future work should prioritize methods to distinguish crackles from normal breath sounds, including standardized terminology, larger and more diverse labeled datasets, improved preprocessing to manage noise, and model strategies tailored to transient acoustic events.
Limitations
- Single-center ED setting with adult, non-trauma patients limits generalizability.
- Each recording was initially labeled by a single physician; inter-observer variability remains a factor despite pretraining and re-assessment of doubtful cases.
- Only a subset (305 recordings) achieved gold-standard labels after discrepancy resolution, potentially constraining performance estimates.
- Breath sounds were recorded with clinical fidelity, including ambient noise, which, while realistic, may degrade crackle detection and affect performance metrics.
- Definitions of wheezing and crackles are inherently subjective, potentially influencing labeling and model training.