
Psychology
Using machine learning of computerized vocal expression to measure blunted vocal affect and alogia
A. S. Cohen, C. R. Cox, et al.
In this study, Alex S. Cohen and colleagues show that machine learning applied to a large acoustic feature set can model clinically rated blunted vocal affect and alogia with high accuracy. The findings reveal correlations with cognitive performance and social functioning, support digital phenotyping for serious mental illness, and raise questions about what clinical ratings of vocal expression in schizophrenia actually capture.
~3 min • Beginner • English
Introduction
Blunted vocal affect (BvA; reduced prosody) and alogia (reduced verbal production) are diagnostic features of schizophrenia and occur across depressive, post-traumatic, neurocognitive, and neurodegenerative disorders. They are typically rated clinically during interviews, relate to poor functional outcomes, and lack an established biological etiology or effective treatments. Despite decades of work, prior computerized acoustic analyses show weak and inconsistent convergence with clinical ratings, likely due to small, narrow feature sets and insufficient consideration of speaking context. This study asks whether machine learning applied to a large and diverse acoustic feature set can accurately model clinically rated BvA and alogia from brief speech samples, whether accuracy depends on speaking task, how ML-predicted scores relate to demographics, diagnosis, other symptoms, cognition, and social functioning, and which acoustic features are most predictive. The overarching goal is to evaluate digital phenotyping of negative symptoms as an efficient, objective assessment approach.
Literature Review
Previous studies using small acoustic feature sets (often 2–10 features) found weak or null associations between objective speech measures and clinically rated negative symptoms, with large heterogeneity across studies. Meta-analyses reported relatively small effects for most features (pause duration excepted) compared with the large differences observed via clinical ratings. Context effects (task demands, cultural/environmental factors) and limited feature sets may contribute to this poor convergence. Human vocal expression can be quantified with thousands of features; the INTERSPEECH emotion-recognition challenges, for example, use feature sets numbering in the thousands. Larger, conceptually diverse feature sets and task-specific modeling may therefore improve validity.
Methodology
Design and participants: Pooled archived data from two studies of stable outpatients with serious mental illness (SMI) meeting federal criteria, N=121 (57 in Study 1; 64 in Study 2). Diagnoses included schizophrenia (~63%; n=76), major depressive disorder (n=18), bipolar disorder (n=20), and other SMI (n=7). Mean age was 41.88 years (SD=10.95; range 18–63); 61% were male; ~51% were Caucasian and 48% African-American. Exclusion criteria included neurological disorders and recent substance dependence. The studies were IRB-approved, and written informed consent was obtained.

Measures: Clinical symptoms were assessed with the Expanded Brief Psychiatric Rating Scale (BPRS) and the Scales for the Assessment of Positive and Negative Symptoms (SAPS/SANS). Global SANS BvA and global alogia ratings served as the ML criteria. Cognitive functioning was measured with the RBANS global index; social functioning with the Social Functioning Scale (available for Study 1).

Speaking tasks: Two tasks were recorded under standardized conditions. Picture Task (Study 1): participants described their thoughts and feelings about 40 IAPS images (20 s per image across two sessions). Free Recall Task (Studies 1 and 2): 60 s descriptions of daily routines, hobbies, living situation, or autobiographical memories. Research assistants read instructions aloud and remained silent during recording.

Acoustic feature extraction: Two complementary toolsets yielded a large feature set. CANS (Computerized Assessment of Affect from Natural Speech) provided 68 macroscopic speech features (e.g., number of utterances, pause length, intonation, emphasis). GeMAPS provided 88 psychophysically grounded features (e.g., MFCCs, spectral and formant measures); a minimal extraction sketch appears at the end of this section. Recordings with fewer than three utterances were excluded. Four "conceptually critical" features were identified a priori: for BvA, intonation (within-utterance SD of F0) and emphasis (within-utterance SD of intensity); for alogia, mean pause time and number of utterances.

Data preprocessing: All features were normalized and Winsorized at ±3.5 SD.

Machine learning: LASSO-regularized regression with 10-fold cross-validation; cases were stratified to balance positive and negative classes in each fold. Binary training labels were defined from SANS ratings: "moderate or greater" = positive; "absent" = negative; "mild" or "questionable" cases were excluded during model building to sharpen class separation. After training, models generated continuous predicted scores (0 to 1) for all recordings, including the previously excluded cases. Subsampling-based stability selection informed feature importance and robustness. A sketch of this pipeline also appears at the end of this section.

Analytic plan: (1) Build models predicting BvA and alogia clinical ratings from acoustic features across tasks; (2) evaluate whether model performance varies by speaking task and test cross-task generalization; (3) examine associations of ML-predicted and clinically rated scores with demographics, diagnosis, other symptoms, cognition, and social functioning, using correlations and regressions; (4) identify key features via model weights and stability selection, and test whether conceptually critical features add explanatory power for functioning beyond ML-predicted and clinical ratings.
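The paper attributes its 88 psychophysically grounded features to GeMAPS; that feature count matches the extended GeMAPS configuration shipped with openSMILE. As a minimal sketch only, the snippet below uses audEERING's `opensmile` Python wrapper to pull those functionals from a recording; the wrapper, feature-set version, and file path are illustrative assumptions, not the authors' documented pipeline.

```python
# Minimal sketch: extract eGeMAPS functionals with the `opensmile` package.
# Assumption: the 88-feature GeMAPS set in the paper corresponds to the
# extended GeMAPS (eGeMAPSv02) configuration; the path is hypothetical.
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,      # 88 functionals per recording
    feature_level=opensmile.FeatureLevel.Functionals,
)

features = smile.process_file("speech_sample.wav")    # one row of 88 features
print(features.shape)                                 # (1, 88)
```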
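And here is a minimal sketch of the preprocessing and modeling steps described above, on synthetic data. Scikit-learn's L1-penalized logistic regression stands in for the authors' LASSO implementation, and all array names, dimensions, and parameter values are illustrative assumptions.

```python
# Sketch of the described pipeline: Winsorize at +/-3.5 SD, build binary
# labels from SANS ratings, fit an L1 (LASSO-style) model with stratified
# 10-fold CV, then score ALL recordings with continuous 0-1 predictions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

def winsorize_z(X, limit=3.5):
    """Z-score each feature, then clip values beyond +/-`limit` SD."""
    z = (X - X.mean(axis=0)) / X.std(axis=0)
    return np.clip(z, -limit, limit)

rng = np.random.default_rng(0)
features = rng.normal(size=(121, 156))   # synthetic: 121 recordings x (68 CANS + 88 GeMAPS)
sans = rng.integers(0, 6, size=121)      # synthetic global SANS rating, 0 (absent) to 5 (severe)

X = winsorize_z(features)
pos = sans >= 3                          # "moderate or greater" -> positive class
neg = sans == 0                          # "absent" -> negative class
keep = pos | neg                         # mild/questionable cases excluded from model building
y = pos[keep].astype(int)

clf = LogisticRegression(penalty="l1", solver="liblinear")
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
acc = cross_val_score(clf, X[keep], y, cv=cv)
print(f"10-fold CV accuracy: {acc.mean():.2f}")

clf.fit(X[keep], y)
predicted = clf.predict_proba(X)[:, 1]   # continuous 0-1 scores for all recordings
```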
Key Findings
Model accuracy: Average training accuracy was high (BvA 0.90; alogia 0.95); test-set accuracies were similar (within ~5% of training) and well above chance (0.50). Cross-validated performance, reported as hit rate / correct rejection / accuracy:

BvA, all tasks: training 0.74 / 0.95 / 0.90; test 0.65 / 0.92 / 0.85.
BvA, Picture Task: training 0.84 / 0.97 / 0.94; test 0.70 / 0.93 / 0.87.
BvA, Free Speech: training 0.88 / 0.99 / 0.96; test 0.60 / 0.93 / 0.85.
Alogia, all tasks: training 0.75 / 0.98 / 0.95; test 0.66 / 0.96 / 0.92.
Alogia, Picture Task: training 0.89 / 1.00 / 0.99; test 0.76 / 0.99 / 0.96.
Alogia, Free Speech: training 0.96 / 0.98 / 0.97; test 0.82 / 0.92 / 0.89.

Task specificity: Models trained on one task generalized poorly to the other (adjusted accuracies ~0.50–0.63); e.g., the alogia model trained on the Picture Task dropped from 0.99 (training) to 0.50 when applied to Free Speech.

Convergence with clinical ratings: ML-predicted scores correlated strongly with clinical ratings: BvA r=0.73 (p<0.001) and alogia r=0.57 (p<0.001). Predicted BvA and predicted alogia were modestly related (r=0.25, p<0.01); clinical BvA and alogia correlated at r=0.46 (p<0.01).

Associations with symptoms and functioning (Table 2 highlights; asterisks mark correlations flagged as significant in the source table): Predicted BvA correlated with BPRS Negative (r=0.63*) and SANS Blunted Affect (r=0.62*); predicted alogia correlated with SANS Alogia (r=0.57*). Predicted scores were not significantly associated with agitation, positive or bizarre symptoms, or affect factors, suggesting specificity to negative symptoms. More severe predicted BvA was linked to poorer cognition (r≈-0.29, p<0.10) and poorer social functioning (r≈-0.28); predicted alogia related to poorer social functioning (r=-0.31*). In regressions entering both, ML-predicted and clinical ratings were largely redundant: neither added unique variance in functioning beyond the other.

Diagnostic group differences: The schizophrenia group showed higher ML-predicted and clinically rated BvA than the bipolar (mania) and depression groups (BvA Fs≈4.30–4.33, p=0.04); effects for alogia were not significant. Effect sizes for ML-predicted BvA: schizophrenia vs. mania d=0.50; vs. depression d=0.79.

Demographics: There were no significant BvA differences by gender or race for predicted or clinical ratings. Predicted alogia was higher in men than in women (t=4.41, p<0.01, d=1.20); clinical alogia showed a trend in the same direction (t=1.77, p=0.08, d=0.44). Age was not significantly related to predicted or clinical scores.

Key acoustic features (stability selection): For alogia, StddevUnvoicedSegmentLength (variability in pause length, conceptually related to alogia). For BvA, mfcc2_sma3_stddevNorm (variability of the second MFCC) and logRelF0-H1-A3_sma3nz_amean (a spectral-tilt proxy). For both models, F2frequency_sma3nz_amean (mean second-formant frequency). Notably, the most stable features were often not those deemed conceptually critical in the symptoms' operational definitions (e.g., F0 and intensity variability for BvA; pause length and utterance count for alogia).
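The stability selection behind these feature rankings refits the sparse model on many random subsamples and ranks features by how often they receive nonzero weights. A minimal sketch on synthetic data, with scikit-learn's L1 logistic regression again standing in for the authors' implementation and the subsample fraction, round count, and penalty strength as illustrative assumptions:

```python
# Stability-selection sketch: tally how often each feature survives L1
# shrinkage across subsample refits; high-frequency features are "stable".
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 156))                 # synthetic recordings x features
y = rng.integers(0, 2, size=100)                # synthetic binary symptom labels

n_rounds, frac = 200, 0.5
counts = np.zeros(X.shape[1])
for _ in range(n_rounds):
    idx = rng.choice(len(y), size=int(frac * len(y)), replace=False)
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
    clf.fit(X[idx], y[idx])
    counts += (clf.coef_.ravel() != 0)          # nonzero weight = feature selected

selection_freq = counts / n_rounds
top = np.argsort(selection_freq)[::-1][:10]     # ten most stable feature indices
print(top, selection_freq[top])
```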
Additional analyses: Conceptually critical features explained unique variance in cognition beyond the ML-predicted and clinical ratings: Pause Mean (for alogia) and Emphasis (for BvA) significantly predicted cognitive functioning after controlling for demographics and symptom measures.
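The unique-variance claim above is a standard hierarchical (nested) regression: fit the model without the critical feature, add it, and F-test the increment. A hedged sketch with statsmodels on synthetic data; the variable names are placeholders, not the study's:

```python
# Nested-regression sketch: does a conceptually critical feature (e.g., mean
# pause time) add variance in cognition beyond ML-predicted and clinical
# ratings? All data below are synthetic; names are illustrative assumptions.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)
n = 121
df = pd.DataFrame({
    "cognition": rng.normal(size=n),         # e.g., RBANS global index
    "ml_score": rng.uniform(size=n),         # ML-predicted symptom score (0-1)
    "clinical": rng.integers(0, 6, size=n),  # global SANS rating
    "pause_mean": rng.normal(size=n),        # conceptually critical feature
})

reduced = smf.ols("cognition ~ ml_score + clinical", data=df).fit()
full = smf.ols("cognition ~ ml_score + clinical + pause_mean", data=df).fit()

print(anova_lm(reduced, full))               # F-test for the increment
print(f"Delta R^2 = {full.rsquared - reduced.rsquared:.3f}")
```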
Discussion
Findings demonstrate that machine learning applied to a large, diverse acoustic feature set can accurately predict clinically rated blunted vocal affect and alogia from brief speech samples, supporting the feasibility of digital phenotyping for negative symptoms. Accounting for speaking context improved performance, and cross-task generalization was poor, indicating task specificity and the need to model context explicitly. ML-predicted scores paralleled clinical ratings in their associations with negative symptom dimensions and functional outcomes and did not introduce additional demographic biases beyond those in clinical ratings. Importantly, the most stable predictive features were primarily spectral and formant-based (e.g., MFCCs, F2) rather than the features typically emphasized in psychopathology research (e.g., pitch and intensity variability, pause counts). This suggests clinicians may intuitively weigh broader spectral aspects of speech or, alternatively, that clinical ratings imperfectly capture theoretically critical aspects of BvA and alogia. Conceptually critical features such as pause mean and emphasis contributed uniquely to cognitive functioning, implying that operational definitions of these symptoms capture functionally relevant information that may be underweighted in clinical ratings and ML models trained on them. Overall, the study highlights both the promise and challenges of establishing ground truth for computational psychiatry: whether to target clinician ratings, theory-driven constructs, or functional outcomes, and the potential for integrating multiple criteria using model-of-models approaches.
Conclusion
This study provides proof-of-concept that negative symptoms—blunted vocal affect and alogia—can be digitally phenotyped from brief speech using ML with high accuracy, especially when tailoring models to specific speaking tasks. ML-predicted scores align closely with clinical ratings and reproduce expected links to negative symptom severity and functioning. Spectral and formant features emerged as key predictors, while theoretically central features (intonation, emphasis, pause metrics) contributed uniquely to cognition, suggesting refinements to symptom operationalization and model targets may be warranted. Future work should: expand to more naturalistic and varied speaking contexts; validate on independent and diverse cohorts; consider integrating acoustic features with linguistic/content analyses; clarify ground-truth criteria (ratings vs theory vs functional outcomes); and evaluate clinical utility for monitoring treatment response and relapse risk.
Limitations
Primary versus secondary negative symptoms and chronicity could not be disentangled; models may perform differently across subtypes or secondary causes (e.g., depression, medication effects). Speaking tasks were constrained, and free-speech samples were relatively few, limiting generalizability; cross-validation did not extend to external datasets, leaving residual overfitting concerns. Extreme levels of negative symptom severity were underrepresented. Medication effects were not controlled. The sample came from a limited geographic region with limited linguistic and cultural variability, and demographic biases in the clinical criteria can propagate into ML models trained on them. Acoustic analysis alone omits other communication channels (e.g., lexical content, discourse coherence) that may be informative. Finally, models were task-specific with poor cross-task generalization, underscoring context dependence.