Natural language processing methods are sensitive to sub-clinical linguistic differences in schizophrenia spectrum disorders

Psychology

S. X. Tang, R. Kriz, et al.

This research by Sunny X. Tang and colleagues examines how natural language processing (NLP) can uncover subtle linguistic differences in individuals with schizophrenia spectrum disorders (SSD). By analyzing speech samples, the study reveals patterns in word usage and coherence that discriminate groups more accurately than traditional clinical ratings, highlighting NLP's value in detecting sub-clinical language disturbances in SSD.

Introduction
Language disturbance has long been recognized as a hallmark of psychosis, ranging from marked disorganization in threshold schizophrenia spectrum disorders (SSD) to less pronounced phenotypes among subthreshold psychosis-spectrum conditions. Speech is the observable surface phenomenon that partly reveals the thoughts of the inner mind; thus, thought disorder in psychosis has often been equated with speech disturbance. Novel digital phenotyping and NLP methods offer opportunities to capture speech in ecologically valid settings and to quantify objective parameters reflecting underlying thought disturbance, potentially advancing assessment, monitoring, and research into the psychosis disease process. Prior work has used varied NLP techniques to characterize language phenotypes in psychosis with mixed success, and many studies have relied on older NLP methods that do not incorporate contextual information. In this exploratory study, the authors applied multiple NLP methods to characterize speech changes in SSD at three levels: individual words, parts-of-speech (POS), and sentence-level coherence, comparing them to a clinical gold standard, the Scale for the Assessment of Thought, Language and Communication (TLC). They leveraged BERT, a state-of-the-art contextual embedding model, hypothesizing that speech from individuals with SSD would show abnormalities at each level and that NLP-based measures would discriminate SSD from healthy control (HC) language better than clinical ratings. The sample was not enriched for overt thought/language disorder, so as to reflect the broader range of presentations in SSD; this led to unexpected findings, such as an increase in incomplete words, and highlighted methodological pitfalls detailed in the paper.
Literature Review
A variety of NLP techniques have been used to characterize language phenotypes in psychosis, especially SSD. Early work using Latent Semantic Analysis quantified decreased coherence in SSD speech, predicting human ratings and discriminating SSD from controls with 80–82% accuracy. Subsequent studies modeling decreased linguistic cohesion also found significant group differences, and decreased semantic coherence may predict conversion to psychotic disorders among at-risk youth with 70–80% accuracy. Other approaches include graph analysis, semantic density measures, and automated metaphor detection. NLP measures of semantic cohesion have explained variance in neurocognition beyond clinician ratings, and automated speech recognition and scoring in smartphone-based verbal memory tests have predicted human ratings. Multi-level analyses (generic features, word identity, dictionaries, n-grams) have discriminated SSD from controls with around 74% accuracy. However, some studies report negative or conflicting results: emotion word usage has been similar between SSD and controls despite affective flattening, and cohesion was not significantly reduced in first-episode psychosis versus controls in one study. Acoustic and linguistic measures have correlated poorly with clinician ratings of negative symptoms, and NLP measures have generally struggled to track negative symptoms such as affective flattening and alogia. Finally, many psychiatric NLP studies have used older, non-contextual embeddings (e.g., GloVe, Word2Vec) that assign a single vector to each word regardless of context and therefore cannot distinguish, for example, the "bank" of a river from a financial "bank." This study addresses these gaps by applying BERT, which encodes bidirectional, sentence-level context, to a psychiatric sample.
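To make this contrast concrete, here is a minimal, illustrative sketch (not code from the study) of how a contextual model assigns different vectors to the same word in different sentences, whereas a static embedding such as GloVe would assign one vector regardless of context. The transformers library, the bert-base-uncased checkpoint, and the example sentences are illustrative assumptions.

```python
# Illustrative sketch (not from the study): contextual vs. static word vectors.
# Assumes the HuggingFace `transformers` library and the public
# "bert-base-uncased" checkpoint; the study's exact model/config may differ.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").eval()

def word_vector(sentence: str, word: str) -> torch.Tensor:
    """Return BERT's contextual vector for the first occurrence of `word`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]        # (n_tokens, 768)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    return hidden[tokens.index(word)]

v_river = word_vector("she sat on the bank of the river", "bank")
v_money = word_vector("he deposited cash at the bank", "bank")

# A static embedding would give identical vectors here; BERT's contextual
# vectors differ, so their cosine similarity is below 1.
print(torch.cosine_similarity(v_river, v_money, dim=0).item())
```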
Methodology
Sample: Two cohorts were recruited at the University of Pennsylvania with written informed consent and IRB approval. SSD participants were stable outpatients diagnosed with schizophrenia or schizoaffective disorder (DSM-IV) via semi-structured interviews and consensus case conferences; individuals with intellectual disability were excluded. HC participants had no major psychiatric disorders. Recordings were open-ended interviews: Cohort 1 participants spoke about themselves; Cohort 2 participants recounted positive or neutral memories. Total recording durations (participant + interviewer + pauses) were shorter in Cohort 1 than Cohort 2, but participant word counts did not differ. SSD participants had lower education than HC; education was covaried in analyses. Transcriptions were produced verbatim by human annotators, including non-verbal vocalizations (e.g., laughter) and disfluencies (e.g., "um"), with names and dates de-identified. Transcriber v1.5.2 was used for Cohort 2 and standard word-processing software for Cohort 1.

Clinical assessments: A blinded psychiatrist rated recordings and transcripts using the Scale for the Assessment of Thought, Language and Communication (TLC), covering 18 items, a global severity rating (0–4), and a summation score computed per the published formula. The modal global severity score was 0 in both groups, reflecting generally mild or absent language disorder among SSD participants.

NLP analyses of individual words and parts-of-speech (POS): Individual word usage was compared between SSD and HC. Pronouns were aggregated by type (e.g., first-person singular: I, my, me). Incomplete words (mid-word stoppages) were counted in aggregate. Word usage in SSD vs HC was quantified as log-transformed odds ratios weighted with an informative Dirichlet prior, which regularizes the estimates so that rare words do not dominate. POS tagging used spaCy (en_core_web_sm); token counts per POS category were computed per 100 words and compared between groups.

Sentence-level analyses using BERT: Two BERT-based methods assessed sentence-level differences. (1) Next-sentence predictability: interviews were sentence-split using NLTK at intuitive punctuation points marked by transcribers. Sentence pairs were extracted (with the second sentence always from the participant), and BERT predicted the likelihood that the second sentence follows the first. Weighting by sentence length did not change the results. Notably, initial attempts using spaCy sentence splitting produced erroneous splits at incomplete, filler, and repeated words (which are more frequent in SSD), biasing next-sentence predictability downward for SSD; final analyses therefore used transcriber-informed segmentation. (2) Embedding approach: dialogue turns were defined from one interviewer prompt to the next. A single embedding e_t was computed for each interviewer turn by averaging word-level BERT embeddings. For each subsequent participant sentence s, a sentence-level embedding e_s was computed, and the mean difference between e_t and e_s was calculated, approximating tangentiality/derailment as distance from the prompt. Linear regressions were fit on sentence-wise embedding distances to compare intercepts and slopes between groups. Analyses focused on sentence-to-sentence coherence; within-sentence incoherence was not examined.
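The following sketch illustrates how the two sentence-level measures described above could be computed; it is not the authors' code. The bert-base-uncased checkpoint, the use of cosine distance for the prompt-to-response comparison, and the linear fit via np.polyfit are illustrative assumptions.

```python
# Illustrative sketch of the two BERT-based sentence-level measures.
# Assumptions: bert-base-uncased checkpoint, cosine distance, np.polyfit.
import numpy as np
import torch
from transformers import BertTokenizer, BertModel, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased").eval()
nsp_model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased").eval()

def next_sentence_probability(first: str, second: str) -> float:
    """(1) Probability that `second` plausibly follows `first`."""
    enc = tokenizer(first, second, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = nsp_model(**enc).logits            # (1, 2): [is_next, not_next]
    return torch.softmax(logits, dim=1)[0, 0].item()

def mean_embedding(text: str) -> np.ndarray:
    """Average BERT token embeddings into one vector for a turn or sentence."""
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**enc).last_hidden_state[0]  # (n_tokens, 768)
    return hidden.mean(dim=0).numpy()

def response_trajectory(prompt: str, sentences: list[str]) -> tuple[float, float]:
    """(2) Distance of each response sentence from the interviewer prompt,
    summarized as the intercept and slope of a linear fit over sentence index."""
    e_t = mean_embedding(prompt)
    dists = []
    for e_s in (mean_embedding(s) for s in sentences):
        cos = np.dot(e_t, e_s) / (np.linalg.norm(e_t) * np.linalg.norm(e_s))
        dists.append(1.0 - cos)                       # cosine distance (assumed)
    slope, intercept = np.polyfit(np.arange(len(dists)), dists, deg=1)
    return intercept, slope
```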
Statistical analysis: Normality testing (Shapiro–Wilk) indicated non-normal TLC global and sum scores; three SSD outliers with high TLC scores were identified via boxplot. Group differences for TLC measures used the Wilcoxon rank-sum test (two-sided alpha=0.05); other measures met parametric assumptions. ANCOVA models compared POS frequencies, covarying education, cohort, age, sex, and race; effect sizes were Cohen's d. Mean BERT next-sentence probability per participant was modeled via linear regression with group as predictor, covarying cohort, education, and demographics. Multiple comparisons were not corrected, reflecting the exploratory nature of the study.

Classification: Three naive Bayes models (scikit-learn) with leave-one-out cross-validation were trained: (a) TLC ratings only, (b) linguistic features only (log-odds ratios, BERT scores, POS counts per 100 words, number of incomplete words), and (c) linguistic features plus TLC sum. Missing values were imputed (SimpleImputer) and features standardized (StandardScaler). Models were also run with education level included to assess its effect. Fivefold cross-validation confirmed the same patterns with higher accuracy.
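A minimal sketch of this classification setup follows, assuming Gaussian naive Bayes as the variant and using placeholder data; the study's exact feature matrix and preprocessing details may differ.

```python
# Illustrative sketch of the classification setup described above.
# Assumptions: GaussianNB as the naive Bayes variant; `X` (participants x
# features) and `y` (0 = HC, 1 = SSD) are placeholder arrays.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.metrics import roc_auc_score, accuracy_score

X = np.random.rand(40, 10)             # placeholder feature matrix
y = np.repeat([0, 1], 20)              # placeholder group labels

# Impute missing values, standardize features, then fit naive Bayes.
clf = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler(), GaussianNB())

# Leave-one-out cross-validation: each participant is held out once.
proba = cross_val_predict(clf, X, y, cv=LeaveOneOut(), method="predict_proba")[:, 1]
print("AUC:", roc_auc_score(y, proba))
print("Accuracy:", accuracy_score(y, proba > 0.5))
```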
Key Findings
- Clinical TLC ratings: Clinically significant language disorder was present in 4 SSD and 0 HC participants; three SSD with TLC global ≥2 were outliers. There were no significant group differences in TLC global or total scores. Largest effect sizes: poverty of content of speech (Cohen's d=0.70) and illogicality (d=0.51). TLC global: HC 0.0±0.0 vs SSD 0.5±1.0, p=0.13, d=0.56; TLC total: HC 0.9±1.7 vs SSD 4.4±9.2, p=0.10, d=0.46.
- Recording characteristics: Mean sentence length was shorter in SSD (14.4±4.3 words) than HC (17.5±3.1), p=0.04, d=0.81. Word counts were similar (HC 1748.8±448.0 vs SSD 1782.3±908.2, p=0.92).
- Individual words: SSD speech was associated with more first-person singular and second-person pronouns and the filler "uh"; HC speech with more first-person plural pronouns, the filler "um," and laughter. SSD produced more incomplete words, which alone discriminated SSD from HC with AUC=0.88 and accuracy=90% (see the sketch after this list). Word usage alone (leave-one-out CV) yielded AUC=0.80, accuracy=76%.
- Parts-of-speech (per 100 words; ANCOVA covarying education, cohort, age, sex, race): SSD produced fewer adverbs (p=0.001, d=1.66), adjectives (p=0.03, d=0.82), and determiners (p=0.03, d=0.83), but more pronouns (p=0.03, d=-0.71). Other POS differences, including interjections overall, were not significant. Excluding the 3 SSD TLC outliers, adverbs (p<0.001), adjectives (p=0.04), and determiners (p=0.03) remained significant; pronouns were trend-level (p=0.09).
- Sentence-level BERT next-sentence predictability: No significant group effect (HC 0.96±0.03 vs SSD 0.94±0.04; linear model beta for group=0.01, p=0.28). Weighting by sentence length or excluding sentences <5 words did not change this null result.
- BERT embedding distance trajectories (interviewer prompt to participant responses): SSD responses began farther from prompts and diverged further with additional sentences. SSD intercept=0.260 (95% CI [0.257, 0.263]) vs HC intercept=0.247 (95% CI [0.242, 0.252]); the SSD slope=6.6e-4 (95% CI [2.6e-4, 1.1e-3]) was significantly greater than 0, while the HC slope=1.5e-5 (95% CI [-6.2e-4, 6.5e-4]) did not differ from 0. Results were robust to excluding the 3 SSD outliers.
- Classification (naive Bayes, leave-one-out CV): Clinical-only model AUC=0.58, accuracy=68%; NLP-only model AUC=0.91, accuracy=87%; NLP+clinical model AUC=0.86, accuracy=81%. Including education level did not materially change accuracies (NLP-only 87%, NLP+TLC 77%, TLC-only 68%). Fivefold CV confirmed similar patterns with higher accuracy.
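For intuition on how a single count such as incomplete words can yield an AUC, the count itself can serve as the ranking score; the sketch below uses synthetic placeholder numbers, not the study's data.

```python
# Illustrative only: scoring a single per-participant count with an AUC.
# The counts below are synthetic placeholders, not the study's data.
from sklearn.metrics import roc_auc_score

incomplete_word_counts = [0, 1, 0, 2, 1, 6, 4, 9, 3, 7]   # hypothetical counts
group_labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]             # 0 = HC, 1 = SSD

# Higher counts rank SSD above HC, so the raw count serves as the score.
print(roc_auc_score(group_labels, incomplete_word_counts))
```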
Discussion
Applying multi-level NLP analyses to speech from SSD and HC revealed subclinical linguistic differences not captured by blinded TLC ratings. SSD speech showed greater pronoun use and incomplete words, with fewer adverbs and determiners. Word-level analyses highlighted increased first-person singular pronouns in SSD, versus first-person plural in HC; SSD also favored the filler “uh,” while HC favored “um.” These findings align partially with prior reports of altered pronoun usage in SSD and at-risk populations and may reflect differences in self-concept, social cognition, or metacognition. The “um” vs “uh” asymmetry, not commonly analyzed in prior psychiatric NLP work, parallels findings in autism spectrum disorder and could indicate differences in delay signaling or planning, though mechanisms remain speculative. The marked increase in incomplete (partial) words among SSD is, to the authors’ knowledge, novel; this feature alone strongly discriminated groups (AUC 0.88) and may have been overlooked historically due to transcription practices or listener insensitivity to disfluencies. At the sentence level, BERT-based embedding trajectories suggested increased tangentiality among SSD: responses started further from interviewer prompts and drifted further away, whereas HC remained stable. In contrast, mean next-sentence predictability did not differ when sentences were properly segmented, underscoring methodological sensitivity to sentence parsing in disfluent speech. Overall, NLP-derived measures outperformed clinical TLC ratings for group discrimination, supporting the potential of computational language markers as sensitive indicators of subtle thought/language disturbance in SSD. These findings contribute to growing evidence that NLP features can relate to cognitive and social deficits and may serve as objective biomarkers, while also highlighting technical considerations (e.g., training corpora, segmentation) critical for psychiatric speech analysis.
Conclusion
This exploratory study is the first to apply BERT-based contextual methods to a psychiatric speech sample to assess sentence-level coherence alongside word- and POS-level features. Despite low clinician-rated language disorder, NLP features sensitively distinguished SSD from HC, with the NLP-only model achieving high discriminative performance. Key contributions include identifying robust POS shifts, differential pronoun and filler usage, and a novel increase in incomplete words among SSD, as well as evidence of increased tangentiality via embedding trajectories. These results suggest NLP measures may provide clinically relevant biomarkers of language disturbance in SSD. Future work should: replicate findings in larger, more diverse samples; develop and validate speech-specific NLP tools and shared clinical speech corpora; examine within-sentence coherence; assess relationships between NLP features, symptom dimensions, cognition, and medication effects; and evaluate prognostic utility (e.g., course, treatment response).
Limitations
- Small sample size and heterogeneity across cohorts; the study was powered primarily to detect large effects, increasing the risk of Type II errors for smaller effects.
- Exploratory analyses without correction for multiple comparisons, raising the potential for overfitting and limiting generalizability.
- Low prevalence of clinically evident thought disorder limited correlation analyses between TLC and NLP measures.
- Technical sensitivities: automated sentence parsing can bias coherence metrics (e.g., next-sentence predictability) in the presence of disfluencies, and many NLP tools are trained on written text rather than speech.
- Analyses focused on sentence-to-sentence coherence; within-sentence incoherence was not examined.
- Potential confounding by medication effects on language production; consistent medication data were not available.
- Education level differed between groups (accounted for statistically), but residual confounding is possible.