Deconstructing heterogeneity in schizophrenia through language: a semi-automated linguistic analysis and data-driven clustering approach

V. Bambini, F. Frau, et al.

This study by Valentina Bambini and colleagues examines the speech of individuals with schizophrenia and identifies two distinct linguistic profiles. One group shows higher fluency and a richer psychological lexicon, while the other shows a broader vocabulary but lower fluency. The findings underline the complexity of language disturbances in schizophrenia and the value of data-driven approaches for describing them.

Introduction
The study addresses how to deconstruct clinical heterogeneity in chronic schizophrenia using automated language analysis. Language and communication impairments are core features of schizophrenia: they affect speech, grammar, lexicon, discourse, and pragmatics, and are tied to cognition, positive and negative symptoms, and everyday functioning. While computational language methods have been used mainly for diagnosis and prognosis in at-risk and first-episode populations, less is known about the language profiles of chronic patients. Given the heterogeneity of chronic schizophrenia and the prior success of clustering approaches for symptoms and cognition, the authors hypothesize that multi-level, semi-automated linguistic features can identify distinct subgroups whose language profiles relate to psychopathology, cognition, and functioning.
Literature Review
Prior computational studies using speech and linguistic features report high accuracy in distinguishing individuals with schizophrenia from controls and relatives, in differential diagnosis, and in predicting psychosis onset and outcomes. In chronic schizophrenia, automated speech and lexical features (e.g., articulation rate, pauses, mean length of utterance, type-token ratio) relate to negative symptoms and brain white matter integrity. LIWC-derived features have linked word counts and emotional word use to symptoms and functioning. Other findings include fewer words per sentence and increased self-referential pronouns, without consistent differences in lexical variety. However, most work correlates single features with clinical variables; machine-driven clustering of linguistic heterogeneity in chronic patients has been largely neglected. Clustering has already been used to decompose heterogeneity in negative symptoms, cognition, and sociocognition, and the resulting clusters have been related to outcomes and treatment response, which suggests that language-based clustering could reveal clinically meaningful subtypes.
Methodology
Design: Cross-sectional study combining semi-automated multi-level linguistic analysis with unsupervised clustering, followed by between-cluster comparisons on clinical, cognitive, sociocognitive, and functional measures.
Sample: N=67 adults with DSM-5 schizophrenia, Italian native speakers, recruited at IRCCS San Raffaele Scientific Institute (Milan). Mean age 39.75±11.04 years; education 11.94±2.72 years; illness duration 15.60±10.70 years; age of onset 24.28±6.36 years. All on antipsychotics ≥3 months (61 atypical/6 typical), mean CPZ-equivalent 440.98±200.57 mg/day. Exclusions: severe TBI/neurological disorder, intellectual disability, recent substance abuse (≤6 months), severe psychotic exacerbation (≤3 months). Ethics approved; informed consent obtained.
Assessment: Psychopathology via PANSS (Positive, Negative, General totals; Disorganization composite). Neurocognition via Italian BACS (VM, DS, TMT, SF, SC, ToL; adjusted subscores and equivalent total). Social cognition via ToM Picture Sequencing Task (Sequencing, Questionnaire, total). Functioning via Quality of Life Scale (QLS; Interpersonal Relations, Instrumental Role, Personal Autonomy, total). Assessments conducted by trained clinicians.
Speech elicitation and processing: Semi-structured autobiographical interview (APACS Interview task; ~5–6 min). Audio recorded (44.1 kHz), total participant speech ~4 h 29 min. Transcription with CLAN; interviewer turns, non-verbal vocalizations, false starts removed. Utterance segmentation per CLAN criteria, checked with PRAAT for pauses/intonation. Inter-rater reliability on 25% subsample: words ICC=0.99 (95% CI .99–1), utterances ICC=0.80 (95% CI .51–.91).
Linguistic features: Nine measures across four domains:
- Lexical richness: type-token ratio (NLTK); mean lexical frequency from COLFIS (via R).
- Fluency: mean length of utterance (LIWC words-per-sentence based on segmented utterances), mean gap duration (between question and answer), mean silent+filled pause duration, pause-to-word ratio (PRAAT extraction; filled pauses manually checked).
- Personal pronouns: percentage (LIWC, Italian dictionary).
- Psychological lexicon: percentages of affective words and cognitive mechanisms words (LIWC).
Statistical analysis: Linguistic features z-standardized; PCA with varimax rotation; components with eigenvalue >1 retained. PC scores used in k-means clustering; number of clusters chosen by silhouette method (starting from 25 random centroids). Validation via Linear Discriminant Analysis (LDA): random-split (75/25, 50/50, 25/75; 50 iterations each) and leave-one-out cross-validation.
Between-cluster comparisons: t-tests for demographics, medication, PANSS, QLS, BACS (total and subscores), ToM PST; normality by visual inspection; homoskedasticity via F-test; FDR-adjusted p-values where appropriate. Exploratory within-cluster correlations between linguistic PCs and BACS/ToM subscores.
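The clustering step of the statistical analysis can be illustrated with a minimal sketch. It is not the authors' code: the summary does not say which software was used for clustering, the PCA with varimax rotation is skipped here (features are clustered directly after z-standardization), the function names are hypothetical, and the eight two-feature rows are made-up values standing in for mean utterance length and mean pause duration. The sketch only shows the general idea of running k-means from 25 random starts and choosing the number of clusters by average silhouette width.

```python
import math
import random


def zscore_columns(rows):
    """Standardize each column to mean 0 and (population) SD 1."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def kmeans(points, k, seed, n_iter=100):
    """Lloyd's algorithm from one random start; returns (labels, inertia)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    inertia = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, inertia


def mean_silhouette(points, labels):
    """Average of (b - a) / max(a, b); singleton clusters contribute 0."""
    clusters = set(labels)
    if len(clusters) < 2:
        return -1.0
    total = 0.0
    for i, p in enumerate(points):
        dists = {}
        for c in clusters:
            others = [math.dist(p, q) for j, q in enumerate(points) if labels[j] == c and j != i]
            dists[c] = sum(others) / len(others) if others else None
        a = dists[labels[i]]
        if a is None:
            continue
        b = min(d for c, d in dists.items() if c != labels[i])
        total += (b - a) / max(a, b)
    return total / len(points)


def choose_k(points, candidates=(2, 3, 4), n_starts=25):
    """For each k keep the lowest-inertia run, then pick the k with the best silhouette."""
    best = None
    for k in candidates:
        runs = [kmeans(points, k, seed) for seed in range(n_starts)]
        labels, _ = min(runs, key=lambda r: r[1])
        score = mean_silhouette(points, labels)
        if best is None or score > best[0]:
            best = (score, k, labels)
    return best


# Made-up rows: [mean utterance length in words, mean pause duration in seconds]
toy = [
    [12.1, 0.42], [11.4, 0.50], [12.8, 0.39], [11.9, 0.47],
    [6.2, 1.10], [5.8, 1.25], [6.9, 0.98], [6.4, 1.18],
]
z = zscore_columns(toy)
score, k, labels = choose_k(z)
print(k, labels, round(score, 2))
```

On real data the input rows would be the rotated principal component scores rather than raw features, and a tested library implementation would be preferable to this hand-rolled version.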
Key Findings
PCA identified 4 components (variance explained: 39.80%, 13.52%, 12.23%, 11.30%; total 76.85%). Loadings:
- PC1 captured lexical richness and fluency (type-token ratio negative; lexical frequency and mean length of utterance positive; gap and pause durations negative).
- PC2 captured pausing-related fluency (pause duration and pause-to-word ratio negative).
- PC3 captured frequency of personal pronouns (positive).
- PC4 captured psychological lexicon (affective and cognitive mechanisms words both loaded negatively, so a higher PC4 score corresponds to lower psychological lexicon frequency).
Clustering: K-means yielded two clusters (average silhouette width 0.46). Cluster sizes: Cluster 1 n=47; Cluster 2 n=20. Profiles:
- Cluster 1: higher fluency (longer utterances, shorter/fewer pauses and gaps), higher frequency of personal pronouns, lower lexical richness (lower type-token ratio, higher lexical frequency), higher psychological lexicon.
- Cluster 2: lower fluency (shorter utterances, longer/more pauses and gaps), lower frequency of personal pronouns, higher lexical richness (higher type-token ratio, lower lexical frequency), lower psychological lexicon.
Validation: LDA random-split accuracies (training / testing):
- 75% train: 0.96±0.02 / 0.95±0.05
- 50% train: 0.97±0.02 / 0.93±0.03
- 25% train: 0.98±0.03 / 0.89±0.07
Leave-one-out cross-validation correctly assigned 94% of participants.
Between-cluster comparisons: No differences in age, education, illness duration, age of onset, antipsychotic class, CPZ-equivalent dose, BACS total, or ToM PST total (ps ≥ 0.102).
Functioning (QLS), Cluster 1 vs Cluster 2: Cluster 1 scored higher on three of four measures.
- Interpersonal Relations: 20.91±6.08 vs 14.40±5.83; t(64)=4.05; p<0.001
- Personal Autonomy: 28.96±7.04 vs 18.80±8.01; t(64)=5.17; p<0.001
- Total: 54.70±14.08 vs 36.05±14.42; t(64)=4.91; p<0.001
- Instrumental Role (not significant): 4.83±5.45 vs 2.85±4.94; t(64)=1.39; p=0.169
Psychopathology (PANSS), Cluster 1 vs Cluster 2: Cluster 1 scored lower on all four measures.
- Positive: 16.23±3.76 vs 18.70±4.52; t(65)=-2.31; p=0.024
- Negative: 19.72±4.71 vs 23.50±3.95; t(65)=-3.14; p=0.010
- General: 37.15±6.62 vs 41.55±4.94; t(65)=-2.67; p=0.019
- Disorganization: 20.23±5.10 vs 23.45±4.19; t(65)=-2.49; p=0.021
Cognitive subscores (BACS) and ToM PST subscores did not differ significantly (ps ≥ 0.172).
Exploratory correlations: In Cluster 2, stronger associations emerged between linguistic PCs and cognition: PC1 with BACS Verbal Memory and Tower of London, and PC3 with BACS Digit Sequencing and Tower of London (p<0.05). No significant PC–cognition correlations in Cluster 1.
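As a rough consistency check on the reported PANSS comparisons, the t statistics can be recomputed from the summary statistics alone. The sketch below makes several assumptions: the "±" values are standard deviations, the test is an equal-variance (pooled) Student's t-test, and the group sizes are the full cluster sizes of 47 and 20, which matches the reported 65 degrees of freedom. The helper name `pooled_t` is hypothetical and not taken from the paper.

```python
import math


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Student's t for two independent groups from summary statistics,
    assuming equal variances; returns (t, degrees of freedom)."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se, df


# PANSS Positive and General: Cluster 1 (assumed n=47) vs Cluster 2 (assumed n=20)
t_pos, df_pos = pooled_t(16.23, 3.76, 47, 18.70, 4.52, 20)
t_gen, df_gen = pooled_t(37.15, 6.62, 47, 41.55, 4.94, 20)
print(round(t_pos, 2), df_pos)
print(round(t_gen, 2), df_gen)
```

Under these assumptions the recomputed values fall within rounding distance of the reported t(65)=-2.31 and t(65)=-2.67. The QLS tests report 64 degrees of freedom, which suggests one participant is missing from those comparisons; the summary does not say from which cluster, so they are not recomputed here.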
Discussion
A multidimensional semi-automated language assessment uncovered two distinct linguistic profiles in chronic schizophrenia. These profiles are not simple high/low severity categories but reflect different configurations: a more fluent, pronoun- and psychological-lexicon–rich yet lexically less varied profile (Cluster 1) versus a less fluent, pronoun- and psychological-lexicon–poor yet lexically more varied profile (Cluster 2). The profiles relate meaningfully to clinical status: Cluster 1 shows lower overall PANSS symptom severity (including disorganization) and better quality of life, particularly in interpersonal relations and autonomy, whereas Cluster 2 shows worse clinical and functional outcomes. The counterintuitive inverse relation between fluency and lexical variety is explained by verbosity with lexical repetition in more fluent speakers versus rarer word use amid reduced output in less fluent speakers. Language–cognition coupling differed by subgroup: in Cluster 2, poorer fluency/lexical and pronoun metrics related to weaker memory, working memory, and planning, suggesting tighter dependence between language and cognition in more impaired individuals; in Cluster 1, language and cognition appeared relatively independent. Overall, global linguistic profiling from speech metrics to semantic categories better maps to clinical heterogeneity than single measures and may inform individualized assessment and monitoring.
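The inverse relation between fluency and lexical variety follows partly from how the type-token ratio behaves: longer output that reuses the same words lowers the ratio, while very short output can score high simply because few words repeat. The sketch below illustrates this with two made-up English samples that are not study data. It uses a simplified whitespace tokenizer rather than the NLTK-based processing mentioned in the methodology, and the function name is hypothetical.

```python
def type_token_ratio(text):
    """Unique word forms divided by total words, after lowercasing
    and splitting on whitespace (a deliberately crude tokenizer)."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens)


# Made-up samples: a fluent but repetitive speaker vs. a terse one.
verbose = "i went to the shop and then i went to the park and then i went home"
terse = "bought bread yesterday"

ttr_verbose = type_token_ratio(verbose)  # 9 unique forms out of 17 tokens
ttr_terse = type_token_ratio(terse)      # 3 unique forms out of 3 tokens
print(ttr_verbose, ttr_terse)
```

Because the ratio depends on sample length, comparisons across speakers with very different amounts of speech need to be read with that in mind.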
Conclusion
The study demonstrates that combining semi-automated, multi-level linguistic analysis with data-driven clustering identifies clinically meaningful subgroups in chronic schizophrenia. A fluent but lexically repetitive, psychologically richer profile aligns with lower symptoms and better functioning, while a dysfluent, pronoun/psychological-lexicon–reduced but lexically varied profile aligns with higher symptom burden and poorer functioning. These insights support integrating automated language profiling into clinical practice to tailor language-focused interventions and to monitor disease course and treatment response, potentially serving as objective endpoints in trials. Future work should validate these clusters in larger, diverse, and independent cohorts, extend to more ecological and longer speech tasks including higher-order discourse/pragmatics, and perform longitudinal studies to assess stability and prognostic value.
Limitations
- Single-site, modest sample; need replication in larger, independent, and more diverse cohorts (e.g., consortium-based) to enhance generalizability and cross-cultural applicability.
- Cross-sectional design; lack of longitudinal data to test stability of linguistic profiles and prognostic/monitoring utility.
- Speech elicitation limited to a brief semi-structured autobiographical interview; more ecological, longer, and emotionally evocative tasks may capture broader communicative behaviors. Practical constraints may be mitigated by combining elicitation tasks with corpus-based approaches.
- Language profiling focused on speech and lexical features; omission of higher-order discourse and pragmatic measures that are also impaired in schizophrenia.
- Social cognition assessed with a single ToM task; other domains (e.g., emotion recognition) were not evaluated and might relate differently to language features.
- Data sharing restrictions due to privacy concerns limit public availability of datasets.