Relative importance of speech and voice features in the classification of schizophrenia and depression

M. Berardi, K. Brosch, et al.

This study by Mark Berardi and colleagues explores how speech and voice features can differentiate individuals with depression, individuals with schizophrenia, and healthy controls. The findings identify interpretable acoustic features that distinguish these groups and correlate with symptom severity, suggesting promising avenues for speech-based assessment.

Introduction
The study addresses whether interpretable, hypothesis-driven acoustic speech and voice features can accurately classify schizophrenia spectrum disorder (SSD) and major depressive disorder (MDD) versus healthy controls (HC), and how key features relate to symptom severity. Prior attempts to find reproducible biomarkers for psychiatric disorders have struggled, but nonverbal acoustic features (prosody, temporal features, voice quality) show promise. SSD and MDD both show altered speech patterns (e.g., reduced speech rate, increased pause duration), with potential distinctions linked to positive symptoms (e.g., formal thought disorder). Traditional null-hypothesis significance testing is limited for high-dimensional acoustic data; interpretable machine learning (IML) offers scalable, explainable alternatives. The authors hypothesized that (1) MDD and SSD speech samples can be accurately classified from HC using interpretable features, and (2) important features correlate with depression and positive/negative symptom severity.
Literature Review
Prior work shows atypical acoustic measures in SSD and MDD, including prosodic, spectral (e.g., mel-frequency cepstral coefficients, MFCCs), and temporal features such as speech rate and pauses. Meta-analyses in SSD report a decreased proportion of spoken time, decreased speech rate, and increased pause duration, with correlations to psychopathology (alogia, positive/negative symptoms). Similar reductions in speech rate and increases in pause duration are seen in MDD. Machine learning studies have classified SSD or MDD versus HC with accuracies around 72–92%, but few include both diagnoses together. Most ML work has focused on depression, often using large, hard-to-interpret feature sets. Interpretable ML approaches, such as permutation feature importance, are recommended to combine scalability with explanatory insight. Articulatory coordination features (ACFs) have been informative for depression and potentially schizophrenia, while voice pathology measures, such as smoothed cepstral peak prominence (CPPs) and the low-to-high spectral ratio (LHR), have been underexplored in psychiatric speech analytics.
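The temporal and pause measures recurring in this literature can be illustrated with a minimal sketch. The code below is a toy NumPy example, not the authors' pipeline: it derives pause duration statistics and pauses per minute (PPM) from a synthetic frame-level voicing track, with a hypothetical 200 ms minimum pause threshold.

```python
# Toy sketch (not the paper's extraction code): pause-based features from a
# synthetic voicing track. Real pipelines threshold an audio energy contour;
# here a boolean array (True = speech present, one frame per 10 ms) stands in.
import numpy as np

def pause_features(voiced, frame_s=0.01):
    """Return pause duration mean/SD (s) and pauses per minute (PPM)."""
    edges = np.diff(voiced.astype(int))
    starts = np.flatnonzero(edges == -1) + 1      # speech -> silence
    ends = np.flatnonzero(edges == 1) + 1         # silence -> speech
    if not voiced[0]:
        starts = np.r_[0, starts]                 # recording opens in silence
    if not voiced[-1]:
        ends = np.r_[ends, len(voiced)]           # recording ends in silence
    durs = (ends - starts) * frame_s              # pause durations in seconds
    durs = durs[durs >= 0.2]                      # hypothetical 200 ms threshold
    total_min = len(voiced) * frame_s / 60.0
    return {"pause_dur_mean": float(durs.mean()) if len(durs) else 0.0,
            "pause_dur_sd": float(durs.std()) if len(durs) else 0.0,
            "ppm": len(durs) / total_min}         # pauses per minute

# 60 s toy track: continuous speech with a 0.5 s pause every 5 s
voiced = np.ones(6000, dtype=bool)
for k in range(12):
    voiced[k * 500:k * 500 + 50] = False
feats = pause_features(voiced)
print(feats)   # 12 pauses of 0.5 s -> ppm = 12
```

Increased pause duration and reduced PPM-style measures are exactly the kind of interpretable features the reviewed studies report as altered in SSD and MDD.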
Methodology
Design: Cross-sectional, proof-of-principle study using interpretable ML on speech recordings from a picture description task.

Participants: n=60 (20 SSD, 20 MDD, 20 HC) drawn from the Marburg/Münster Affective Disorders Cohort Study. Each participant contributed one session. MDD and HC participants were 1:1 matched to the SSD group on age and sex. Exclusion criteria: current or past substance dependency, traumatic brain injury, neurological disease, verbal IQ < 80; HC were excluded if a first-degree relative had a psychiatric disorder. Diagnoses were made via the German SKID-I; interrater reliability ICC > 0.86. Ethics approval and informed consent were obtained.

Task and recording: Picture description using TAT images 1, 2, 4, and 6. Participants described each picture for 3 minutes (four samples per participant). Audio was recorded with an Olympus WS-853 at ~35 cm mouth-to-microphone distance.

Preprocessing: Each recording was segmented into four samples. Examiner speech and excessive noise were manually removed. One SSD participant was replaced due to excessive noise.

Feature extraction: Sample-level acoustic features largely following Low et al. (2020):
- Tempo: speech rate, articulation rate, talking rate.
- Pauses: pause duration, pause duration SD, pause rate, pauses per minute (PPM).
- Prosodic intonation: f0 SD, kurtosis, skewness.
- Prosodic stress: intensity SD, kurtosis, skewness, energy velocity.
- Spectral: MFCC1–MFCC13 means.
- Articulatory coordination: ACF1, ACF2, ACF3 (vocal-tract-variable-based features).
- Voice quality (novel for this application): CPPs mean, SD, kurtosis, skewness; LHR mean, SD, kurtosis, skewness.

Modeling: Pairwise binary SVM models for HC vs SSD, HC vs MDD, and SSD vs MDD with polynomial kernels (degrees 1, 2, 3). Five-fold cross-validation stratified by class and participant. Box constraints were tuned via Bayesian optimization (expected improvement); kernel scales were set automatically. Implemented in MATLAB R2021b. Performance metrics (accuracy, precision, recall, F1) were averaged across folds.
Feature importance: Post hoc permutation feature importance. For each feature, the trained model was evaluated on data with that feature's values permuted, with importance defined as the resulting drop in accuracy, averaged over 20 permutations. The entire pipeline (including randomized CV splits and importance computation) was repeated 100 times to average metrics and importances.

Statistics: For the top 25% most important features, group differences (HC vs MDD, HC vs SSD) were assessed via ANOVA (alpha=0.05) with Bonferroni correction; normality and homoscedasticity were checked (Levene's test, with Welch ANOVA or Mann–Whitney U as needed). Two-tailed Pearson correlations were computed between important features and clinical scales (HAM-D, SANS, SAPS, and the alogia, flat affect, and FTD subscales), Bonferroni corrected.
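The permutation importance procedure described above can be sketched directly: shuffle one feature column at a time on held-out data and record the mean drop in accuracy. This is an illustrative NumPy/scikit-learn sketch with synthetic data (the study used MATLAB; the function name and 80/80 split here are assumptions), with 20 permutations per feature as in the paper.

```python
# Sketch of permutation feature importance: importance of feature j = mean
# accuracy drop when column j of the held-out data is randomly shuffled.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def permutation_importance(model, X_test, y_test, n_repeats=20, seed=0):
    rng = np.random.default_rng(seed)
    base = accuracy_score(y_test, model.predict(X_test))
    importances = np.zeros(X_test.shape[1])
    for j in range(X_test.shape[1]):
        drops = []
        for _ in range(n_repeats):
            Xp = X_test.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])   # break feature-target link
            drops.append(base - accuracy_score(y_test, model.predict(Xp)))
        importances[j] = np.mean(drops)            # mean accuracy drop
    return importances

# toy usage: only feature 0 carries signal, so it should dominate
rng = np.random.default_rng(1)
X = rng.normal(size=(160, 5))
y = (X[:, 0] + 0.1 * rng.normal(size=160) > 0).astype(int)
model = SVC(kernel="poly", degree=3).fit(X[:80], y[:80])
fi = permutation_importance(model, X[80:], y[80:])
print(fi.round(3))
```

Averaging these importances over many randomized CV repetitions, as the authors do, stabilizes the ranking from which the top 25% of features is taken.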
Key Findings
- Classification performance (5-fold CV, averaged):
  - Linear SVM: accuracy 0.793 (HC vs SSD), 0.736 (HC vs MDD), 0.653 (SSD vs MDD).
  - Degree-2 polynomial SVM: accuracy 0.933 (HC vs SSD), 0.900 (HC vs MDD), 0.916 (SSD vs MDD).
  - Degree-3 polynomial SVM (best): accuracy 0.947 (HC vs SSD), 0.920 (HC vs MDD), 0.932 (SSD vs MDD); precision up to 0.965; recall up to 0.933; F1 up to 0.949.
- Top 25% important features across models (degree-3 kernel): ACF2, ACF1, intensity kurtosis, MFCC1, PPM, CPPs skewness, f0 SD, LHR SD, LHR. Features uniquely important to one model: CPPs SD (HC vs SSD), talking rate (HC vs MDD), MFCC2 (SSD vs MDD).
- Group differences (percent change vs HC; ANOVA):
  - Significant in both MDD and SSD vs HC: ACF2 (MDD −4.7%, p=0.008; SSD −4.1%, p=0.044), ACF1 (MDD +3.8%, p=0.010; SSD +3.6%, p=0.033), MFCC1 (MDD −2.6%, p=0.008; SSD −3.1%, p=0.001), PPM (MDD −16.3%, p<0.001; SSD −17.2%, p<0.001), talking rate (MDD −8.0%, p=0.017; SSD −10.2%, p=0.003), CPPs SD (MDD −10.3%, p<0.001; SSD −7.7%, p=0.001).
  - Significant only in MDD: intensity kurtosis (+29.7%, p=0.006), CPPs skewness (+345.4%, p=0.022), LHR (−5.8%, p=0.010).
  - Significant only in SSD: f0 SD (−14.5%, p=0.011), LHR SD (−6.7%, p=0.007).
  - MFCC2 was not significant in either group.
- Correlations with symptoms (Bonferroni-corrected): In SSD, LHR SD correlated with HAM-D (r=0.50) and SAPS (r=0.40); CPPs skewness with SANS (r=0.47) and SAPS FTD (r=−0.42); intensity kurtosis negatively with SAPS (r=−0.39); MFCC1 positively with SAPS FTD (r=0.46); PPM negatively with SAPS FTD (r=−0.42). These patterns suggest links between voice/speech variability and positive/negative symptoms.
- Overall, articulatory coordination (ACF1/ACF2), pauses per minute (PPM), and variability measures (intensity kurtosis, f0 SD, CPPs measures, LHR metrics) emerged as most informative. SSD showed more pronounced reductions in variability (f0 SD, LHR SD) than MDD.
Discussion
Findings demonstrate that a compact, interpretable feature set can robustly classify SSD and MDD from HC with high accuracy, supporting speech as a potential biomarker. The use of structured, extemporaneous picture descriptions likely preserved clinically relevant speech alterations while maintaining comparability, possibly contributing to high performance. Articulatory coordination features (ACF1/ACF2) consistently distinguished both patient groups from HC and showed relationships suggestive of competing influences of positive (e.g., formal thought disorder) and negative (e.g., alogia) symptoms. Reduced PPM and talking rate align with psychomotor slowing seen in both disorders. Variability features (f0 SD, intensity kurtosis, CPPs skewness, LHR SD) captured monotone and dysphonic aspects of speech; SSD demonstrated greater reductions in variability, indicating more pronounced aprosody. Inclusion of voice pathology measures (CPPs, LHR) provided added discriminability, especially between SSD and MDD, highlighting physiological voice quality degradation as relevant. The correlations with symptom scales, particularly in SSD, partially validate the clinical relevance of these acoustic markers and point toward specific symptom dimensions (e.g., FTD) reflected in speech behavior.
Conclusion
Speech recordings from a picture description task, analyzed with interpretable acoustic and voice features, can accurately classify SSD and MDD versus HC. Permutation-derived feature importance highlighted articulatory coordination and speech/voice variability as key contributors, suggesting potential speech biomarkers reflecting psychomotor slowing, alogia, flat affect, and formal thought disorder. Future work should expand to multi-class models, incorporate multimodal data, examine symptom severity and subtypes, include diverse speech tasks and languages, and validate in larger, independent cohorts to enable clinical translation and monitoring.
Limitations
- Clinical heterogeneity: participants came from inpatient and outpatient settings and various illness stages (acute vs chronic), which can differentially affect speech parameters.
- Group differences in education level may confound verbal performance and acoustic processing.
- A single speech task (picture description) limits generalizability; other tasks (e.g., sustained vowels, varied linguistic complexity) may reveal additional abnormalities.
- A single language limits cross-linguistic generalization; voice patterns can vary by language.
- Medication use was not controlled; antidepressants and antipsychotics were differentially distributed and can affect voice/speech (e.g., CPPs changes, f0 SD increases with medication).
- Modest sample size (n=60; 240 samples) and lack of an independent hold-out test set reduce generalizability and may inflate performance estimates.