Introduction
Distinguishing between major depressive disorder (MDD) and schizophrenia spectrum disorder (SSD) remains a challenge. While transdiagnostic and multivariate approaches have been explored, reproducible biomarkers remain elusive. Recent research highlights the potential of speech features as objective, reproducible, and efficient biomarkers for these disorders. Speech production involves the complex coordination of numerous muscles and neurobiological processes, making acoustic analysis a valuable tool for detecting abnormalities.

Prior studies have identified atypical acoustic measurements in both MDD and SSD, including variations in prosody, voice quality, spectral features, and temporal aspects. Meta-analyses have reported effects such as decreased spoken time, reduced speech rate, and increased pause duration in individuals with schizophrenia, often correlating with clinical ratings. Similar findings, such as decreased speech rate and increased pauses, have been observed in MDD. While the two disorders share features associated with negative symptoms, distinctions may arise from differences in positive symptoms such as formal thought disorder.

Most previous research employed null-hypothesis significance testing (NHST), which is limited in handling complex, multivariate data. Machine learning (ML) approaches offer a more robust alternative, enabling analysis of the intricate relationships between speech patterns and psychiatric disorders. Previous ML studies have shown promising results in classifying patients with MDD and SSD from healthy controls, but these often relied on large, difficult-to-interpret feature sets. Interpretable machine learning (IML) combines the advantages of NHST with the computational power of ML, facilitating insight into the underlying mechanisms. Permutation feature importance is a model-agnostic IML approach that identifies the features most crucial to classification by measuring how much a model's performance degrades when a feature's values are randomly shuffled. This study uses IML to investigate speech acoustics as objective classifiers for depression and schizophrenia, aiming to identify important features and assess their correlation with symptom severity.
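To make the permutation idea concrete, below is a minimal sketch of model-agnostic permutation feature importance using scikit-learn. The data, labels, and feature count are synthetic placeholders for illustration, not the study's dataset or exact setup.

```python
# Minimal sketch: model-agnostic permutation feature importance with scikit-learn.
# Data and labels are synthetic placeholders, not the study's dataset.
import numpy as np
from sklearn.svm import SVC
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.normal(size=(120, 5))                  # 120 samples, 5 acoustic features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # synthetic binary labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = SVC(kernel="poly", degree=3).fit(X_train, y_train)

# Shuffle each feature column and measure the drop in held-out accuracy;
# a large drop means the model relied heavily on that feature.
result = permutation_importance(model, X_test, y_test, n_repeats=30, random_state=0)
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f} ± {result.importances_std[i]:.3f}")
```

Because the score drop is computed on held-out data with the trained model untouched, this procedure works with any classifier, which is what makes it model-agnostic.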
Literature Review
Previous research has explored speech acoustics as potential biomarkers for MDD and SSD, with studies reporting atypical acoustic measurements such as changes in prosody, voice quality, spectral features, and temporal aspects in both disorders. Meta-analyses have revealed consistent findings, such as decreased speech rate and increased pause duration in both MDD and SSD, although distinctions exist due to differences in positive symptoms. Most previous work employed null-hypothesis significance testing, which is limited in handling complex, multivariate data. Machine learning (ML) approaches offer a promising alternative, with studies demonstrating accurate classification of MDD and SSD from healthy controls. However, previous ML applications often relied on numerous abstract features, making interpretation challenging. Interpretable machine learning (IML), employing methods like permutation feature importance, promises to reveal which speech features matter and how they relate to symptom severity. The present study builds on this foundation, aiming for greater interpretability and a more nuanced understanding of the relationship between speech features and specific psychiatric symptoms.
Methodology
Participants (20 with SSD, 20 with MDD, and 20 healthy controls (HC)) were selected from the Marburg/Münster Affective Disorders Cohort Study, with groups matched for age and sex. Exclusion criteria included substance abuse, traumatic brain injury, neurological diseases, and low verbal IQ. Diagnoses were assessed using the German version of the Structured Clinical Interview for DSM-IV (SKID-I) and psychopathological scales. A picture description task based on the Thematic Apperception Test (TAT) was used to elicit spontaneous speech, yielding four speech samples per participant. Speech samples were segmented, and examiner speech and excessive noise were manually removed.

Feature extraction covered speech tempo, pause, prosodic intonation, prosodic stress, and speech spectrum features, along with pauses per minute (PPM), articulation coordination features (ACFs), and vocal quality features based on cepstral peak prominence (CPPs) and the low-to-high ratio (LHR). Three pairwise classification models (HC vs. SSD, HC vs. MDD, SSD vs. MDD) were trained using Support Vector Machines (SVMs) with three polynomial kernels (linear, 2-degree, and 3-degree), applying five-fold cross-validation with Bayesian hyperparameter optimization. Permutation feature importance was calculated to determine the relative importance of each feature for classification. Statistical relationships between the top 25% most important features and the three groups were assessed using ANOVA (or Mann-Whitney U tests if assumptions were not met), and Pearson correlations were calculated between these features and symptom severity scores (HAM-D, SANS, SAPS, and subscales).
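A condensed sketch of this classification setup appears below: a scaled polynomial-kernel SVM tuned by Bayesian search under five-fold cross-validation. The use of scikit-optimize's BayesSearchCV, the search ranges, and the feature count are assumptions for illustration, not the authors' exact configuration.

```python
# Sketch of one pairwise model: polynomial-kernel SVM, 5-fold CV, Bayesian
# hyperparameter search. BayesSearchCV from scikit-optimize is an assumed
# stand-in for the paper's optimizer; data are synthetic.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from skopt import BayesSearchCV
from skopt.space import Real, Categorical

rng = np.random.RandomState(1)
X = rng.normal(size=(40, 26))        # e.g., 20 HC vs. 20 MDD, illustrative feature count
y = np.array([0] * 20 + [1] * 20)    # labels for one of the three pairwise models

pipe = make_pipeline(StandardScaler(), SVC(kernel="poly"))
search = BayesSearchCV(
    pipe,
    {
        "svc__C": Real(1e-3, 1e3, prior="log-uniform"),
        "svc__degree": Categorical([1, 2, 3]),  # linear, 2-degree, 3-degree kernels
    },
    cv=5,        # five-fold cross-validation
    n_iter=25,
    random_state=1,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

In this arrangement each candidate hyperparameter set is scored by cross-validated accuracy, so kernel degree is selected alongside the regularization strength rather than being fixed in advance.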
Key Findings
The SVM models with 2-degree and 3-degree polynomial kernels demonstrated high accuracy in classifying the three groups (accuracy >0.90). The most important features across all models, ranked by their importance in the 3-degree polynomial SVM, included articulation coordination features (ACF2, ACF1), intensity kurtosis, MFCC1, PPM, CPPs skewness, f0 SD, LHR SD, and LHR. Several features showed statistically significant differences between the patient groups and HC: ACF2, ACF1, MFCC1, PPM, talking rate, and CPPs SD differed significantly in both MDD and SSD compared to HC; intensity kurtosis, CPPs skewness, and LHR differed significantly in MDD compared to HC; and f0 SD and LHR SD differed significantly in SSD compared to HC. Moderate correlations were observed between some features and symptom severity scores in SSD: LHR SD with HAM-D and SAPS; CPPs skewness with SANS and SAPS FTD; intensity kurtosis with SAPS; and MFCC1 and PPM with SAPS FTD. The important features generally reflect articulation coordination, speech variability, and the number of pauses, suggesting that these aspects differ between HC and the clinical groups.
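As a rough illustration of this statistical follow-up, the sketch below runs a group comparison with a normality check (ANOVA, falling back to Mann-Whitney U) and a Pearson correlation against a severity scale. All data are synthetic, and the feature and scale names in the comments are examples only.

```python
# Sketch of the follow-up statistics on a top-ranked feature: group comparison
# with a normality check, then a correlation with symptom severity.
# All values are synthetic placeholders.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
hc = rng.normal(0.0, 1.0, 20)    # e.g., PPM values for healthy controls
mdd = rng.normal(0.8, 1.0, 20)   # ...for the MDD group
ssd = rng.normal(1.0, 1.2, 20)   # ...for the SSD group

# Check normality per group; if violated, fall back to a non-parametric
# pairwise test (one pair shown here for brevity).
if all(stats.shapiro(g).pvalue > 0.05 for g in (hc, mdd, ssd)):
    stat, p = stats.f_oneway(hc, mdd, ssd)
else:
    stat, p = stats.mannwhitneyu(hc, ssd)
print(f"group test: statistic={stat:.2f}, p={p:.4f}")

# Correlate the feature with symptom severity (e.g., SAPS scores in SSD).
saps = rng.normal(20, 5, 20)     # placeholder severity ratings
r, p = stats.pearsonr(ssd, saps)
print(f"Pearson r={r:.2f}, p={p:.4f}")
```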
Discussion
The high classification accuracy of the SVM models supports the hypothesis that speech features can effectively discriminate between healthy controls and patients with MDD and SSD. The identified important features (articulation coordination, speech variability, and pause frequency) reflect psychomotor slowing, alogia, and flat affect, all related to the core symptoms of these disorders. The moderate correlations between features and symptom severity scores in SSD further substantiate the link between speech patterns and symptom expression. The inclusion of vocal quality features, such as those derived from CPPs and LHR, offers a novel avenue for improving classification accuracy and for understanding voice quality changes in these disorders. These findings have implications for developing objective, easily obtainable biomarkers for MDD and SSD. Future research can build on these results with multimodal data and more fine-grained symptom measures to further develop such biomarkers.
Conclusion
This study demonstrates the potential of speech and voice features as objective biomarkers for MDD and SSD. Interpretable machine learning models achieved high accuracy in classifying these disorders, and key features reflecting articulation coordination, speech variability, and pause frequency showed distinct group differences and moderate correlations with symptom severity. Future research should pursue multi-class classification, symptom severity prediction, diagnostic subtypes, and multimodal approaches to enable more refined diagnostic tools and monitoring of symptom change.
Limitations
The study's limitations include the relatively small sample size, potential confounders such as education level, and the use of a single speech task and language. The cross-sectional design and the potential impact of medication also limit the generalizability of the findings. Larger, more diverse samples with multiple speech tasks and longitudinal data are needed for confirmation and broader application.