Introduction
Schizophrenia (SZ) is a severe neurodevelopmental psychotic disorder that affects emotional, behavioral, sensory, psychomotor, and cognitive functions. Diagnosis often requires extensive clinical follow-up and is particularly challenging in teenagers. Research therefore focuses on identifying biomarkers to improve diagnosis, and language biomarkers are of special interest because they offer insight into thought processes. Individuals with SZ typically exhibit impaired communication in fluency, verbal productivity, and coherence. Existing studies focus primarily on English speakers and employ diverse methodologies. Recent research explores automated assessment of English-language speech to classify healthy controls (HC) versus SZ, but this work remains at the pilot stage: cross-language variation, the discriminative and predictive capabilities of language features, and the choice of clinically relevant tasks are not yet well understood. This study uses unstructured psychiatric interviews to investigate three aspects of language (fluency, verbal productivity, and coherence) to differentiate HC, first-episode psychosis (FEP), and chronic SZ in Spanish-speaking individuals.
Literature Review
Verbal fluency (VF) encompasses discursive continuity, including pauses and hesitations. Studies link pauses with formal thought disorders and neurocognitive impairment. Automated pause assessment has shown promise in classifying HC vs SZ in English speakers. Verbal productivity (VP) involves the number of words and sentences produced. Low VP, or poverty of speech, is a characteristic of SZ. Automated VP analysis techniques are being explored as predictors in at-risk populations. Semantic coherence (SC) concerns the logical organization of meaning. In SZ, abrupt topic changes and inaccurate word use affect coherence. Manual and automated approaches exist to assess SC, with latent semantic analysis (LSA) showing potential in predicting psychosis. While several studies address schizophrenia in languages other than English (Spanish, Italian, Dutch, Portuguese), few predict SZ onset based on non-English language analysis. This study aims to automate language biomarker analysis of VF, VP, and SC in unstructured Spanish-speaking interviews to distinguish between HC, FEP, and SZ, and to predict SZ conversion in FEP patients.
Methodology
One hundred and thirty-three interviews (HC = 49; FEP = 40; chronic SZ = 44) were recorded and transcribed. HCs were Spanish-speaking individuals without psychiatric disorders or substance abuse. SZ diagnosis was confirmed using the DSM-IV structured clinical interview, and the Positive and Negative Syndrome Scale (PANSS) was used to measure symptom severity. FEP was defined as up to two years after the first psychotic episode. Thirty language features (4 verbal fluency, 20 verbal productivity, 6 semantic coherence) were extracted from the transcripts. Pauses were detected as temporal separations exceeding 2 seconds between speech segments, after noise reduction. Text processing consisted of removing punctuation, phonetic transcriptions, and stop words, followed by lemmatization. Word2vec was used to encode words as vectors for improved classifier performance. Verbal fluency features included pause counts and question-answer pairs per unit of time. Verbal productivity features encompassed lexical volume, type-token ratio (TTR), and word length. Semantic coherence was assessed using cosine similarity between sentence vectors. The Shapiro-Wilk test was used to check for normality, and the Mann-Whitney U and Kruskal-Wallis tests were used to compare groups. Random forest analysis was used for variable ranking and classification. The cross-sectional analysis aimed to distinguish between groups, while the longitudinal analysis aimed to predict SZ conversion in FEP.
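Three of the feature types described above (pause counts, TTR, and cosine-based coherence) can be illustrated with a minimal Python sketch. It is not the study's implementation. The function names and the toy embeddings are hypothetical. Averaging word vectors into a sentence vector and comparing adjacent sentences are assumptions made here for illustration; the summary states only that cosine similarity between sentence vectors was used. The sketch also omits noise reduction, stop-word removal, and lemmatization.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

PAUSE_THRESHOLD_S = 2.0  # gaps longer than 2 seconds count as pauses (from the text)


def count_pauses(segments: Sequence[Tuple[float, float]],
                 threshold: float = PAUSE_THRESHOLD_S) -> int:
    """Count gaps between consecutive (start, end) speech segments above threshold."""
    ordered = sorted(segments)
    return sum(
        1
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
        if next_start - prev_end > threshold
    )


def type_token_ratio(tokens: Sequence[str]) -> float:
    """Distinct tokens divided by total tokens; 0.0 for an empty sequence."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sentence_vector(tokens: Sequence[str],
                    embeddings: Dict[str, List[float]]) -> List[float]:
    """Assumption: a sentence vector is the mean of its known word vectors."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return []
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


def min_adjacent_coherence(sentences: Sequence[Sequence[str]],
                           embeddings: Dict[str, List[float]]) -> Optional[float]:
    """Minimum cosine similarity between adjacent sentence vectors, or None."""
    vectors = [sentence_vector(s, embeddings) for s in sentences]
    vectors = [v for v in vectors if v]  # skip sentences with no known words
    sims = [cosine_similarity(a, b) for a, b in zip(vectors, vectors[1:])]
    return min(sims) if sims else None


# Toy data, hypothetical and for illustration only
toy_segments = [(0.0, 1.5), (2.0, 4.0), (6.5, 8.0), (9.9, 11.0)]
toy_tokens = ["la", "casa", "es", "la", "casa"]
toy_embeddings = {"casa": [1.0, 0.0], "perro": [0.0, 1.0]}
toy_sentences = [["casa"], ["casa"], ["perro"]]

n_pauses = count_pauses(toy_segments)
ttr = type_token_ratio(toy_tokens)
min_coherence = min_adjacent_coherence(toy_sentences, toy_embeddings)
```

In practice the embeddings would come from a trained Word2vec model rather than a hand-written dictionary, and the resulting feature values would be passed to the classifier alongside the other features.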
Key Findings
Cross-sectional analysis using the top ten language features achieved 80.97% accuracy (HC vs SZ), 85.93% (HC vs FEP + SZ), and 91.11% (HC vs FEP). Longitudinal analysis, incorporating demographics, PANSS, and language features, yielded 77.5% accuracy in predicting SZ conversion in FEP patients. Semantic coherence features were particularly important in both analyses. In FEP, possessive pronouns and minimum cosine similarity were significantly correlated with negative symptoms. In SZ, demonstrative and relative pronouns, question-answer pairs per time, different words per time, and TTR500 were significantly correlated. Language-only analysis provided 75.83% accuracy in predicting SZ conversion, outperforming demographic information (43.33%) and PANSS information (67.5%). Analysis of feature contributions indicated that semantic coherence was more informative than VF or VP for both group classification and FEP diagnosis prediction. Affective disorders were the most difficult to differentiate from SZ.
Discussion
This study demonstrates the potential of automated language analysis for classifying and predicting SZ in Spanish-speaking populations. The accuracies obtained in the cross-sectional analyses (80.97% to 91.11%) and in the longitudinal analysis (77.5%) support the use of language biomarkers for SZ diagnosis and prognosis. The importance of semantic coherence highlights the impact of disrupted meaning processing in SZ. The findings suggest that language analysis could offer a readily accessible and cost-effective alternative to neuroimaging techniques for SZ assessment in clinical settings, especially in resource-limited environments. Further research should explore the integration of language analysis with other diagnostic tools and extend the approach to other languages and cultural contexts.
Conclusion
This study demonstrates the effectiveness of automated language analysis in identifying and predicting schizophrenia, particularly in first-episode psychosis patients, within a Spanish-speaking population. The results highlight the importance of semantic coherence as a key indicator. Future research could explore larger, more diverse samples, investigate other language features, and evaluate the integration of this method with existing clinical tools to improve diagnostic accuracy and early intervention strategies.
Limitations
The study's limitations include a homogeneous HC group consisting of Chilean Spanish speakers, potential demographic biases between groups, a lack of data on refusals to participate, the use of a relatively simple classification method, a limited sample size that may lead to overfitting, and unbalanced classes in the longitudinal analysis. Comorbidities such as substance abuse were based on self-report, which may introduce inaccuracies.