
Automatic language analysis identifies and predicts schizophrenia in first-episode of psychosis
A. Figueroa-Barra, D. D. Aguila, et al.
This study by Alicia Figueroa-Barra and colleagues shows that automated language analysis can distinguish healthy controls from patients with first-episode psychosis and chronic schizophrenia in Spanish-language clinical interviews. It also predicts which first-episode psychosis patients later receive a schizophrenia diagnosis, reaching 77.5% accuracy with a combined feature set driven mainly by semantic coherence.
Introduction
The study addresses whether automated language analysis can (1) distinguish healthy controls (HC), first-episode psychosis (FEP), and chronic schizophrenia (SZ) in Spanish-speaking clinical interviews, and (2) prospectively predict which FEP patients convert to SZ. Schizophrenia is a severe, chronic disorder with language and communication abnormalities, but most automated language biomarker work has focused on English and often remains at pilot stage. The authors propose that three language domains—verbal fluency, verbal productivity, and semantic coherence—contain diagnostic and prognostic information that can be operationalized automatically from unstructured, ecological psychiatric interviews in Spanish. The purpose is to evaluate 30 linguistic features across these domains for cross-sectional classification and longitudinal prediction, and to compare their utility relative to clinical and demographic information. The importance lies in developing scalable, language-aware tools that may aid earlier, more accessible psychosis assessment beyond resource-intensive modalities.
Literature Review
Prior work shows that people with SZ exhibit impairments in fluency (e.g., increased pauses), productivity (poverty of speech), and coherence (topic shifts, referential inconsistencies). Verbal fluency studies (mostly English) link pauses and hesitations to thought disorder and negative symptoms; Spanish studies also report longer pauses in FEP and correlations with negative symptoms. Verbal productivity measures (e.g., word counts, type-token ratio, pronoun use) differentiate HC from SZ and predict risk in clinical high-risk (CHR) populations; pronouns and deictics have shown predictive value in English. Semantic coherence has been quantified via automated methods (e.g., latent semantic analysis, LSA) and manual indices (e.g., Communication Disturbance Index), with evidence that coherence abnormalities predict psychosis. Multilingual research indicates potential generalizability: Spanish reports of coherence loss and candidate features; Italian semi-automated clustering; Dutch transformer models for psychiatric classification; Portuguese graph-based syntactic coherence predicting diagnosis and longitudinal outcomes. Gaps remain regarding SZ onset prediction in non-English contexts and the comparative contribution of linguistic versus clinical features, motivating the present Spanish-language study focused on verbal fluency (VF), verbal productivity (VP), and semantic coherence (SC).
Methodology
Design: Cross-sectional and 18-month longitudinal study using automated language analysis.
Participants: 133 Spanish-speaking subjects: HC (n=49), FEP (n=40), chronic SZ (n=44). HCs were Chilean Spanish speakers without self-reported psychiatric disorders/substance abuse. SZ diagnoses confirmed by three psychiatrists using DSM-IV SCID/PANSS. FEP defined as up to two years after first psychotic episode. At follow-up, 28/40 FEP converted to SZ; 12 transitioned to other non-schizophrenic psychoses.
Clinical measures: PANSS positive, negative, general, and total scores; demographics (age, gender, education), illness duration, family history of psychosis.
Data collection: Unstructured clinical interviews (patients) and open-ended interviews (HC). Audio durations: HC 32–83 min (mean 53.5±10.2); patient interviews 5–102 min (mean 28.6±16.5). Manual transcription performed.
Preprocessing: Audio denoising; pauses defined as silences >2 s between speech segments. Text preprocessing removed punctuation, phonetic marks, fillers/onomatopoeias, and stop words; lemmatization applied. Stop word list expanded with 73 Chilean expressions. Words embedded with word2vec trained on Chilean Spanish to capture semantics.
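The pause definition and text cleanup described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the segment timestamps, the function names, and the tiny stop word set are assumptions made for the example, and lemmatization is left out.

```python
import string

PAUSE_THRESHOLD_S = 2.0  # from the text: a pause is a silence longer than 2 s


def count_pauses(segments, threshold=PAUSE_THRESHOLD_S):
    """Count silences longer than `threshold` seconds between consecutive segments.

    `segments` is a list of (start, end) times in seconds, sorted by start time.
    """
    pauses = 0
    for (_, prev_end), (next_start, _) in zip(segments, segments[1:]):
        if next_start - prev_end > threshold:
            pauses += 1
    return pauses


# Illustrative stop words only (assumption); the study's expanded list,
# including its 73 Chilean expressions, is not reproduced here.
STOP_WORDS = {"a", "el", "la", "de", "mi", "eh"}


def clean_tokens(text, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation, and drop stop words (no lemmatization)."""
    table = str.maketrans("", "", string.punctuation + "¿¡")
    return [t for t in text.lower().translate(table).split() if t not in stop_words]


# Hypothetical speech segments: gaps of 0.5 s, 2.5 s and 2.0 s.
segments = [(0.0, 3.1), (3.6, 7.0), (9.5, 12.0), (14.0, 15.2)]
n_pauses = count_pauses(segments)
tokens = clean_tokens("Eh, fui a la casa de mi hermana el lunes.")
print(n_pauses, tokens)
```

Only the 2.5 s gap counts here, because the threshold is strict: a silence of exactly 2 s is not a pause under the ">2 s" definition.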
Features: 30 total language features
- Verbal fluency (4): number of pauses >2 s; question–answer pairs per time; total words per time; different words per time.
- Verbal productivity (20): lexical volume (total and different words per answer/time), type–token ratios (TTR at multiple spans: 250/500/750/1000 words), mean word length (syllables/word), counts of determiners and pronouns (total and unique), normalized by responses and averaged per response.
- Semantic coherence (6): cosine similarity between adjacent segments (question–answer and sliding windows of 5–6 words) using summed word vectors; assessed as minimum and mean coherence values.
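To make two of the feature families above concrete, the sketch below computes a type-token ratio over a fixed span and a window-based coherence score from summed word vectors. It rests on assumptions not stated in the source: windows are taken as consecutive, non-overlapping blocks of `span` tokens, the toy two-dimensional embeddings stand in for the word2vec model, and all function names are hypothetical.

```python
import math


def sum_vectors(words, embeddings):
    """Sum the vectors of the words that have an embedding."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for word in words:
        vec = embeddings.get(word)
        if vec is not None:
            total = [a + b for a, b in zip(total, vec)]
    return total


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def window_coherence(tokens, embeddings, span=5):
    """Similarity between each pair of adjacent, non-overlapping token windows."""
    windows = [tokens[i:i + span] for i in range(0, len(tokens) - span + 1, span)]
    vectors = [sum_vectors(w, embeddings) for w in windows]
    return [cosine(a, b) for a, b in zip(vectors, vectors[1:])]


def ttr(tokens, span):
    """Type-token ratio over the first `span` tokens."""
    head = tokens[:span]
    return len(set(head)) / len(head) if head else 0.0


# Toy embeddings (assumption): home-related words on one axis, animals on the other.
embeddings = {
    "casa": [1.0, 0.0], "hogar": [1.0, 0.0], "familia": [1.0, 0.0],
    "madre": [1.0, 0.0], "perro": [0.0, 1.0], "gato": [0.0, 1.0],
}
tokens = ["casa", "hogar", "familia", "madre", "perro", "gato"]

sims = window_coherence(tokens, embeddings, span=2)  # span=2 keeps the toy example short
mean_coherence = sum(sims) / len(sims)
min_coherence = min(sims)
ttr_4 = ttr(["casa", "casa", "hogar", "madre"], span=4)
print(sims, mean_coherence, min_coherence, ttr_4)
```

In this toy run the first two windows stay on topic and the third switches topic, so the minimum captures the single sharpest break while the mean summarizes the whole response.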
Statistics and modeling: Normality assessed via Shapiro–Wilk; group differences via Mann–Whitney U (pairwise) and Kruskal–Wallis (three-group). Correlations between features and PANSS computed (Pearson), with Bonferroni correction (k=30). Variable importance and selection via correlation screening and random forest rankings. Classification: Random forest for cross-sectional pairwise classifications (HC vs SZ; HC vs FEP; HC vs FEP+SZ) using top 10 decorrelated features. Longitudinal prediction (FEP converters vs non-converters) evaluated using feature sets: demographics only; PANSS only; PANSS+demographics; language only; language+demographics; PANSS+language; and an overall top-10 mixed set.
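One plausible reading of "top 10 decorrelated features" is a greedy pass over an importance ranking that skips any feature too strongly correlated with one already kept. The sketch below illustrates that idea only. The 0.8 correlation cutoff, the feature names, and the toy values are assumptions, and in the study the ranking came from random forest importances rather than a hand-written list.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def select_decorrelated(ranked, columns, k=10, max_abs_r=0.8):
    """Keep features in ranked order, skipping any with |r| > max_abs_r to a kept one."""
    kept = []
    for name in ranked:
        if all(abs(pearson(columns[name], columns[other])) <= max_abs_r for other in kept):
            kept.append(name)
        if len(kept) == k:
            break
    return kept


# Toy feature values for five hypothetical speakers.
columns = {
    "mean_coherence": [0.9, 0.2, 0.7, 0.3, 0.6],
    "ttr_500": [0.50, 0.55, 0.60, 0.65, 0.70],
    "ttr_1000": [0.40, 0.45, 0.50, 0.55, 0.60],  # moves in lockstep with ttr_500
    "pauses": [2, 3, 9, 4, 8],
}
# Assumed importance ranking, most important first.
ranked = ["mean_coherence", "ttr_500", "ttr_1000", "pauses"]

selected = select_decorrelated(ranked, columns, k=3)
print(selected)
```

Here `ttr_1000` is dropped because it is almost perfectly correlated with the higher-ranked `ttr_500`, which mirrors the TTR500/TTR1000 grouping reported in the correlation analysis.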
Code and data: Analyses in Python; code available at https://github.com/busmangit/nlpezq. Data access upon request due to privacy.
Key Findings
Sample and symptom summaries:
- N=133: HC=49; FEP=40; SZ=44. PANSS total: FEP 117.7±14.1; SZ 145.1±13.1 (P<0.01). Age and education differed across groups (P<0.01).
- Follow-up: 28/40 FEP converted to SZ; 12 transitioned to other non-schizophrenic psychoses (50% mood disorders).
Feature correlations and structure:
- Nine clusters of correlated linguistic variables identified (Pearson). Expected groupings observed (e.g., TTR500 with TTR1000).
- Significant correlations with negative symptoms (uncorrected):
• FEP: possessive pronouns (r=0.38, P=0.0153); minimum cosine similarity at 6-word span (r=0.33, P=0.0427).
• SZ: demonstrative pronouns (r=−0.49, P=0.007); relative pronouns (r=−0.30, P=0.0455); question–answer pairs per time (r=−0.40, P=0.0065); different words per time (r=0.30, P=0.0464); TTR500 (r=0.32, P=0.0343); pauses near-significant (r=−0.29, P=0.0503). After Bonferroni (k=30), only the negative PANSS association with demonstrative pronouns remained significant.
Cross-sectional classification (top 10 ranked, decorrelated features; random forest):
- HC vs SZ: 80.97% accuracy.
- HC vs FEP+SZ: 85.93% accuracy.
- HC vs FEP: 91.11% accuracy.
Language marker group differences:
- Verbal fluency: all 4 markers differed across groups (P<0.001), including pauses >2 s and speech rate proxies (e.g., QA pairs/time, words/time).
- Verbal productivity: raw and normalized lexical volume (unique/total words, TTR) distinguished groups; the newly proposed marker, mean word length (syllables/word), also differentiated groups.
- Syntactic markers: differences in determiners and pronouns; indefinite pronouns and determiners significantly different (P<0.001 and P<0.01 respectively); personal and interrogative pronouns near-significant (P<0.01).
- Semantic coherence: mean cosine similarity over 5–6 word windows had strong discriminative power; mean coherence outperformed minimum coherence.
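The three-group comparisons behind these differences use the Kruskal-Wallis test named in the methodology. As a hedged illustration of how such a P value arises, the sketch below implements the test for exactly three groups in plain Python. With three groups the statistic is referred to a chi-square distribution with 2 degrees of freedom, whose upper tail is exp(-H/2). The pause counts are invented for the example, and the function names are hypothetical.

```python
import math
from collections import Counter


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_3(g1, g2, g3):
    """Kruskal-Wallis H (tie-corrected) and chi-square P value for three groups."""
    groups = [g1, g2, g3]
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    weighted, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        weighted += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12 / (n * (n + 1)) * weighted - 3 * (n + 1)
    correction = 1 - sum(t ** 3 - t for t in Counter(pooled).values()) / (n ** 3 - n)
    if correction == 0:  # every value identical: no evidence of a difference
        return 0.0, 1.0
    h /= correction
    return h, math.exp(-h / 2)  # chi-square upper tail with 2 degrees of freedom


# Invented pause counts per speaker, for illustration only.
hc, fep, sz = [1, 2, 3], [4, 5, 6], [7, 8, 9]
h_stat, p_value = kruskal_wallis_3(hc, fep, sz)
print(round(h_stat, 3), round(p_value, 4))
```

The chi-square approximation is rough for groups this small; it is more reasonable at the study's group sizes (49, 40, and 44).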
Longitudinal prediction (FEP converters vs non-converters):
- Demographics only: 43.33% accuracy.
- PANSS only: 67.5% accuracy.
- PANSS + demographics: 65.83% accuracy.
- Language only: 75.83% accuracy.
- PANSS + language: improved over PANSS alone; category-wise contributions: semantic coherence 77% > verbal productivity 64% > fluency 56% for FEP prediction.
- Combined (top 10 mixed features): 77.5% accuracy, driven mainly by semantic coherence features; PANSS total ranked 4th among top predictors.
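Because 28 of the 40 FEP patients converted, one way to read these accuracies is against a majority-class baseline, that is, a rule that always predicts conversion. The short calculation below is an added reading aid rather than an analysis from the study; it uses only the counts and accuracies reported above, and the variable names are hypothetical.

```python
converters, non_converters = 28, 12
total = converters + non_converters

# Accuracy of always predicting the larger class ("converts to SZ").
majority_baseline = max(converters, non_converters) / total

# Accuracies as reported in the list above.
reported = {
    "demographics only": 0.4333,
    "PANSS only": 0.675,
    "PANSS + demographics": 0.6583,
    "language only": 0.7583,
    "combined top 10": 0.775,
}

above_baseline = [name for name, acc in reported.items() if acc > majority_baseline]
print(f"majority baseline: {majority_baseline:.1%}")
for name, acc in reported.items():
    print(f"{name}: {acc:.2%} ({acc - majority_baseline:+.2%} vs baseline)")
```

Under this reading, only the language-only and combined feature sets exceed the 70% baseline. How the reported accuracies were averaged (for example across cross-validation folds) is not stated in this summary, so the comparison is indicative only.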
Error patterns and diagnostics:
- Classifiers overestimated conversion when using demographics alone; inclusion of language improved case-level agreement with reference outcomes.
- Misclassifications concentrated among non-converters with affective disorders, reflecting overlap of formal thought disorder features early in illness.
Comparative context:
- Achieved accuracies are comparable to some neuroimaging/EEG biomarker studies but with lower cost and greater clinical feasibility.
Discussion
The findings demonstrate that automated language features extracted from unstructured Spanish clinical interviews can robustly discriminate HC from FEP and SZ and predict conversion from FEP to SZ. This addresses both research aims: (1) cross-sectional group discrimination with high accuracy using a small, decorrelated feature set, and (2) longitudinal prediction where language features alone outperform specialized clinical ratings (PANSS) and demographics, and a combined feature set performs best. The results generalize prior English-centric work to Spanish, supporting cross-language invariance for many linguistic biomarkers while noting potential language-specific nuances (e.g., word length distributions, stopword sets). Semantic coherence emerged as the most informative dimension for both diagnosis and prognosis, suggesting that downstream discourse effectiveness integrates impairments across fluency and productivity. Associations between negative symptoms and select language features further link linguistic markers with clinical profiles. Clinically, language analysis offers an accessible, scalable screening adjunct relative to resource-intensive neuroimaging or EEG, with the potential to support early detection and monitoring in diverse and low-resource settings.
Conclusion
Automated analysis of verbal fluency, productivity, and semantic coherence from Spanish clinical interviews can (1) distinguish HC from FEP and chronic SZ with high accuracy and (2) predict SZ conversion among FEP, with semantic coherence features contributing most strongly. Many linguistic phenomena appear broadly language-invariant, though syntactic markers (e.g., determiners, pronouns) may require language-specific handling. The approach achieves results comparable to some neuroimaging/EEG methods while being simpler to deploy clinically. Future work should expand to larger, more balanced samples, incorporate fully automated speech-to-text pipelines to avoid manual transcription, test generalization across dialects/languages, refine syntactic/semantic feature extraction, and integrate multimodal clinical data for improved differential diagnosis (e.g., distinguishing affective psychoses).
Limitations
- Controls were exclusively Chilean Spanish speakers; comorbidities such as substance use were self-reported.
- Groups differed on demographic variables (age, education), introducing potential confounds.
- No record of refusals at recruitment.
- The chosen predictive model (random forest) is relatively simple: it is broadly interpretable but may underperform more complex models.
- Limited sample size with class imbalance in longitudinal analysis increases risk of overfitting and affects generalizability.