Introduction
Language disorders are a core feature of schizophrenia, affecting over 70% of individuals. These disturbances can manifest across various linguistic levels, from basic speech characteristics and grammar to complex pragmatic interpretation. Specific impairments include altered pausing and prosody, reduced grammatical skills, diminished lexical richness, and semantic processing deficits. These difficulties correlate strongly with cognitive deficits and are linked to both positive (formal thought disorder, disorganization) and negative (poverty of speech) symptoms, ultimately impairing daily functioning, particularly community integration and social interaction. Computational approaches allow quick and detailed quantitative linguistic analysis. They have proved successful in distinguishing individuals with schizophrenia from controls, in predicting psychosis onset in at-risk individuals, and in predicting diagnostic outcomes in first-episode psychosis. However, automated methods have been applied less often to chronic schizophrenia, despite the significant heterogeneity within this population. Previous studies using computational methods have explored the association between automatically extracted linguistic features and psychopathological, functional, or cognitive characteristics. The present study aimed to extend semi-automated linguistic approaches to the clinical heterogeneity of schizophrenia by combining computational linguistic methods with data-driven clustering techniques. The objectives were to identify subgroups of individuals with chronic schizophrenia based on automatically extracted linguistic features and to determine whether these linguistic subtypes are associated with differences in psychopathological, daily functioning, cognitive, and sociocognitive aspects.
Literature Review
Existing research highlights the prevalence and multifaceted nature of language disorders in schizophrenia, which affect various aspects of language processing. Studies have demonstrated associations between specific linguistic features (e.g., reduced fluency, altered lexical richness, impaired semantic processing) and clinical symptoms (positive and negative), cognitive function, and daily living. Computational linguistics has emerged as a valuable tool, achieving high accuracy in differentiating schizophrenia patients from healthy controls and in predicting psychosis onset and long-term outcomes. However, investigations using automated methods in chronic schizophrenia patients remain limited. While some studies have established correlations between specific linguistic features and clinical characteristics, a comprehensive, data-driven exploration of heterogeneity within chronic schizophrenia populations is lacking. This gap matters because these patients exhibit extensive heterogeneity across clinical domains. Previous research has successfully employed machine learning clustering to delineate subgroups of chronic schizophrenia patients based on negative symptom patterns, cognitive deficits, and functional outcomes, demonstrating that distinct subgroups may have unique treatment responses and functional trajectories. The present study builds on this foundation by integrating a multidimensional language analysis with data-driven clustering to identify subtypes of chronic schizophrenia patients based on their linguistic profiles.
Methodology
This study included 67 individuals diagnosed with schizophrenia according to DSM-5 criteria. Participants were native Italian speakers; individuals with severe neurological conditions, intellectual disabilities, substance abuse, or severe psychotic exacerbation were excluded. Assessments covered psychopathology (Positive and Negative Syndrome Scale, PANSS), neurocognition (Brief Assessment of Cognition in Schizophrenia, BACS), social cognition (Theory of Mind Picture Sequencing Task, ToM PST), and daily functioning (Quality of Life Scale, QLS). A semi-structured autobiographical interview from the APACS (Assessment of Pragmatic Abilities and Cognitive Substrates) test was administered to collect speech samples. These samples were transcribed using CLAN software, and linguistic features were then extracted automatically using Praat, RStudio, NLTK, and LIWC. The features covered lexical richness (type-token ratio, lexical frequency), fluency (mean length of utterance, pause durations, pause-to-word ratio), frequency of personal pronouns, and psychological lexicon (affective words, words related to cognitive mechanisms). A Principal Component Analysis (PCA) was conducted to reduce data dimensionality, identifying four meaningful principal components (PCs) that captured different dimensions of language. K-means clustering, an unsupervised machine-learning algorithm, was used to group participants into distinct clusters based on their PC scores. The optimal number of clusters was determined using the silhouette method. The stability of the resulting cluster solution was validated using Linear Discriminant Analysis (LDA) with both random split samples and leave-one-out cross-validation procedures. Finally, t-tests compared the resulting clusters across demographic, clinical, cognitive, sociocognitive, and functional measures. Additional analyses explored correlations between cognitive and sociocognitive subscores and the linguistic-based PCs within each cluster to further investigate relationships between language and other cognitive domains.
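As a rough illustration of the clustering step described above, the following Python sketch runs k-means for several candidate numbers of clusters and keeps the one with the highest mean silhouette width. It is a minimal sketch under stated assumptions, not the study's pipeline: the points are synthetic stand-ins for participants' PC scores, all function names are hypothetical, and the deterministic farthest-first seeding is an assumption made only to keep the example reproducible (statistical packages typically use random restarts instead).

```python
import math


def init_centroids(points, k):
    """Farthest-first seeding (an assumption made here for determinism)."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Pick the point whose nearest chosen centroid is farthest away.
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns one cluster label (0..k-1) per point."""
    centroids = init_centroids(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def mean_silhouette(points, labels):
    """Average of (b - a) / max(a, b) over all points."""
    widths = []
    for i, p in enumerate(points):
        # Collect distances from p to every other point, grouped by cluster.
        by_cluster = {}
        for j, q in enumerate(points):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            widths.append(0.0)  # singleton cluster or a single cluster: width 0
            continue
        a = sum(own) / len(own)  # mean distance within p's own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other cluster
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)


# Synthetic stand-ins for scores on two principal components (not study data).
group_a = [(float(x), float(y)) for x in range(3) for y in range(3)]
group_b = [(x + 10.0, y + 10.0) for x, y in group_a]
points = group_a + group_b

# Silhouette method: score each candidate k and keep the best one.
scores = {k: mean_silhouette(points, kmeans(points, k)) for k in range(2, 5)}
best_k = max(scores, key=scores.get)
labels_best = kmeans(points, best_k)
print(best_k, round(scores[best_k], 3))
```

On this toy data the two well-separated groups are recovered with k = 2. With real PC scores the silhouette curve is usually much flatter, so the chosen k should be read together with a stability check.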
Key Findings
The PCA revealed four principal components representing aspects of lexical richness, fluency, personal pronoun usage, and psychological lexicon. K-means clustering identified two distinct clusters of participants based on these linguistic features. Cluster 1 (n=47) showed higher fluency, more personal pronouns, lower lexical richness, and greater use of the psychological lexicon. Cluster 2 (n=20) exhibited lower fluency, fewer personal pronouns, higher lexical richness, and less frequent use of the psychological lexicon. LDA validation confirmed the robustness of the two-cluster solution, showing high classification accuracy across various training-testing partitions. There were no significant differences between clusters in age, education, illness duration, or age of onset. However, Cluster 1 showed significantly lower scores on the PANSS (positive, negative, general scales, and disorganization) and higher scores on the QLS (interpersonal relations, personal autonomy, and total score), indicating less severe symptomatology and better quality of life. Additional analysis of BACS and ToM PST subscores revealed no overall difference between clusters, but showed a stronger correlation between linguistic profile and cognitive abilities (verbal memory, planning, working memory) within Cluster 2. No significant correlations were found between linguistic PCs and ToM PST subscores in either cluster.
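The leave-one-out check mentioned above can be illustrated with a short sketch. The study used LDA; the example below swaps in a nearest-centroid classifier, a simpler stand-in that ignores the pooled within-class covariance LDA would estimate, purely to show the hold-one-out loop. The data are synthetic and all function names are hypothetical.

```python
import math


def centroid(rows):
    """Component-wise mean of a list of equal-length points."""
    return tuple(sum(col) / len(rows) for col in zip(*rows))


def predict_nearest_centroid(train_x, train_y, x):
    """Assign x to the class whose training centroid is closest."""
    centroids = {
        c: centroid([p for p, lab in zip(train_x, train_y) if lab == c])
        for c in set(train_y)
    }
    return min(centroids, key=lambda c: math.dist(x, centroids[c]))


def leave_one_out_accuracy(xs, ys):
    """Hold out each point once, refit on the rest, and count correct predictions."""
    hits = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        hits += predict_nearest_centroid(train_x, train_y, xs[i]) == ys[i]
    return hits / len(xs)


# Synthetic PC scores with cluster labels 1 and 2 (not study data).
cluster_1 = [(float(x), float(y)) for x in range(3) for y in range(3)]
cluster_2 = [(x + 10.0, y + 10.0) for x, y in cluster_1]
xs = cluster_1 + cluster_2
ys = [1] * len(cluster_1) + [2] * len(cluster_2)

loo_accuracy = leave_one_out_accuracy(xs, ys)
print(loo_accuracy)
```

High held-out accuracy on cluster labels indicates that the partition is reproducible from the features. It does not by itself show that the clusters are clinically meaningful, which is why the comparison on external measures matters.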
Discussion
The findings demonstrate that a multidimensional, automated language analysis, combined with data-driven clustering, successfully identifies distinct linguistic profiles in individuals with chronic schizophrenia. These profiles are associated with meaningful differences in psychopathology and functional outcomes. Cluster 1, characterized by greater fluency and increased use of psychological lexicon, exhibited better clinical outcomes and quality of life. Cluster 2, demonstrating reduced fluency and lower use of the psychological lexicon, presented more severe symptoms and poorer functioning. The lack of overall differences in cognition and social cognition between clusters suggests that linguistic profiles are not simply indicators of overall severity but rather reflect distinct patterns of impairment. The stronger correlation between linguistic features and cognition in Cluster 2 highlights the potential for complex interactions between language and cognitive abilities in schizophrenia. These findings underscore the importance of comprehensive, multidimensional language assessment in schizophrenia, going beyond simple fluency measures to capture subtle variations in language profiles that are associated with clinical heterogeneity.
Conclusion
This study demonstrates the utility of combining semi-automated linguistic analysis with data-driven clustering to identify distinct linguistic profiles in chronic schizophrenia patients, associated with differences in psychopathology, functioning, and the relationship between language and cognition. Future research should focus on replicating these findings in larger, independent samples, incorporating longitudinal data to assess the stability of linguistic profiles over time, and exploring the application of this approach in personalized treatment strategies and clinical trial endpoints.
Limitations
While the study employed robust validation techniques for the cluster analysis, future research is needed to replicate these findings in independent samples. The use of a semi-structured interview, while practical, may not fully capture the breadth of an individual's communication style compared to more ecologically valid settings. The relatively small sample size and homogeneous demographic characteristics of the sample may limit the generalizability of the findings to more diverse populations. Finally, inclusion of higher-order language processing aspects (discourse, pragmatics) in future studies could provide a more comprehensive linguistic profile.