
Natural Language Processing markers in first episode psychosis and people at clinical high-risk
S. E. Morgan, K. Diederen, et al.
This study by Sarah E. Morgan and colleagues tests whether twelve automated NLP markers can differentiate individuals at clinical high risk for psychosis, first-episode patients, and healthy controls. It reports group differences in speech coherence, on-topic content, and graph connectivity, and shows that the method used to elicit speech affects how well the groups can be separated.
Introduction
Psychotic disorders typically emerge after adolescence, often preceded by a clinical high-risk (CHR-P) phase. Although several clinical, cognitive, neuroimaging, and peripheral blood measures associate with transition risk, there is a need for accurate, non-invasive, and clinically translatable predictors to enable targeted preventive interventions. Formal Thought Disorder (FTD), reflected in disorganized or incoherent speech, is a core feature of psychosis. Automated methods now quantify speech disorganization from transcripts, moving beyond subjective, labor-intensive ratings. Prior studies have used semantic coherence (e.g., LSA-based), tangentiality, referential cohesion, and speech graph connectivity to distinguish psychosis and even predict transition in CHR-P. However, it remains unclear which NLP measures best capture psychosis-related abnormalities, how these measures relate to each other, and which speech elicitation tasks are optimal. This study evaluates 12 NLP measures to differentiate CHR-P, first episode psychosis (FEP), and healthy controls using TAT picture descriptions; examines inter-measure relationships; tests whether measures differentiate converters within CHR-P; and compares task effects (TAT, DCT story retell, and free speech) on discriminability.
Literature Review
Prior work has quantified incoherence via semantic techniques such as LSA and modern embeddings to measure semantic coherence, tangentiality, and similarity across utterances. Reduced coherence and increased variance, reduced possessive pronouns, and low semantic density have predicted transition with ~80–90% accuracy in CHR-P cohorts. Graph-theoretic analyses of speech have shown reduced connectivity in schizophrenia and predicted diagnosis months in advance. Additional evidence points to abnormalities in referential cohesion and pronoun use in schizophrenia. Tasks influence detectability; for instance, dream reports predicted psychosis better than waking activity descriptions. Despite these advances, prior studies often employed limited metric sets and varied tasks, leaving gaps regarding the most informative measures, their inter-relationships, and optimal elicitation strategies.
Methodology
Design and participants: Three groups were recruited: CHR-P (n=25; OASIS service; CAARMS criteria), FEP (n=16), and healthy controls (n=13) matched for age and sex. All participants were fluent in English and gave consent; ethics approval was obtained. CHR-P subjects were followed clinically for ~7 years; 8/25 transitioned to psychosis (persistent frank psychotic symptoms for >1 week). Assessments included the Thought and Language Index (TLI) on TAT speech, PANSS (subset), WRAT IQ, WAIS digit span, and years of education.
Speech elicitation tasks: Primary analyses used Thematic Apperception Test (TAT): 8 standardized pictures; 1-minute description per picture with standardized prompts as needed. Additional tasks: Discourse Comprehension Test (DCT) story retelling (6 stories), and 10-minute free speech interview on chosen topics with prompts. Not all participants had data for all tasks (TAT N=52 after exclusions; DCT N=51; free speech N=52); some had missing individual excerpts.
NLP measures (12 total):
- Basic: total words (Nword), total sentences (Nsent), mean sentence length (Nword/Nsent).
- Semantic coherence: mean cosine similarity between adjacent sentence embeddings. Preprocessing removed stop and filler words; word2vec Google News word embeddings were combined into sentence embeddings with SIF.
- Tangentiality: slope of cosine similarity over time between each response sentence and an a priori stimulus description (TAT picture descriptions; DCT original stories). More negative slope indicates drifting off-topic. Not computed for free speech.
- On-topic score: mean cosine similarity of each sentence to the a priori stimulus description (captures average on-topic alignment across the response). Not computed for free speech.
- Repetition: maximum cosine similarity between any sentence pair (proxy for repetitiveness).
- Ambiguous pronouns: count of instances where the first reference to an entity was a third-person pronoun (coreference via a neural model), reflecting unresolved/late-resolved references.
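- Speech graphs: directed graphs with unique words as nodes and edges linking sequential words. Connectivity metrics: the number of nodes in the largest connected component (LCC) and in the largest strongly connected component (LSC), plus versions of each normalized against randomized graphs (LCCr, LSCr), computed with SpeechGraphs software.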
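The embedding-based and graph-based measures listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: sentence embeddings are taken as given (plain lists of floats; the toy vectors below are made up), sentence index stands in for time in the tangentiality slope, and raw LCC and LSC are computed by simple graph search rather than with the SpeechGraphs software. All function names are hypothetical, and normalization against randomized graphs (LCCr, LSCr) is not shown.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean(xs):
    return sum(xs) / len(xs)

def semantic_coherence(sents):
    """Mean similarity between adjacent sentence embeddings (needs >= 2 sentences)."""
    return mean([cosine(a, b) for a, b in zip(sents, sents[1:])])

def on_topic(sents, stimulus):
    """Mean similarity of each sentence to the stimulus description embedding."""
    return mean([cosine(s, stimulus) for s in sents])

def tangentiality(sents, stimulus):
    """Least-squares slope of similarity-to-stimulus against sentence index.
    Assumption: sentence index is used as the time axis."""
    sims = [cosine(s, stimulus) for s in sents]
    x_bar, y_bar = (len(sims) - 1) / 2, mean(sims)
    num = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(sims))
    den = sum((x - x_bar) ** 2 for x in range(len(sims)))
    return num / den

def repetition(sents):
    """Maximum similarity between any pair of sentences."""
    return max(cosine(a, b) for a, b in combinations(sents, 2))

def reachable(start, edges):
    """Set of nodes reachable from start by following edges (depth-first search)."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in edges.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def graph_connectivity(words):
    """Return (LCC, LSC) node counts for a non-empty word sequence."""
    fwd = {w: set() for w in words}
    bwd = {w: set() for w in words}
    for a, b in zip(words, words[1:]):
        fwd[a].add(b)   # edge from each word to the word that follows it
        bwd[b].add(a)   # reversed edge, used to find nodes that can reach a word
    undirected = {w: fwd[w] | bwd[w] for w in fwd}
    lcc = max(len(reachable(w, undirected)) for w in undirected)
    # A strongly connected component is the set of nodes mutually reachable with w.
    lsc = max(len(reachable(w, fwd) & reachable(w, bwd)) for w in fwd)
    return lcc, lsc

# Toy data: the response drifts away from the stimulus direction.
stimulus = [1.0, 0.0]
sents = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
words = "the dog saw the cat and the cat ran away".split()
```

In this toy case the tangentiality slope is negative because each successive sentence is less similar to the stimulus. A graph built from one uninterrupted word sequence is always connected when edge direction is ignored, so LCC equals the number of unique words here, while LSC counts only words that lie on a loop back to each other.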
Analysis: For each task, measures were computed per excerpt and averaged per subject. Normality was tested (Shapiro–Wilk). Group differences used two-sided Mann–Whitney U-tests. Inter-measure associations used linear regression controlling for group. Inaudible segments per word were counted and compared; sensitivity analyses controlled for inaudibles using GAMLSS with gamma distribution. GAMLSS models also controlled for IQ, years of education, and digit span; multiplicative effects on the mean (λ) reported. Multiple comparisons were FDR-corrected where specified.
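The group comparison and multiple-comparison steps described above can be illustrated as follows. This is a hedged sketch rather than the study's analysis code: it assumes the Z values come from the normal approximation to the Mann-Whitney U statistic (without tie or continuity correction), and it assumes the Benjamini-Hochberg procedure for FDR correction, since the summary does not name the specific method. Function names and the example numbers are hypothetical.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def mann_whitney_z(x, y):
    """Two-sided Mann-Whitney test via the normal approximation.
    Returns (z, p); z > 0 means values in x tend to be larger than in y.
    Assumption: no tie or continuity correction is applied to the variance."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)
    adjusted = [0.0] * m
    running = 1.0
    for pos, i in enumerate(order):
        rank = m - pos  # rank of this p-value in ascending order
        running = min(running, pvals[i] * m / rank)
        adjusted[i] = running
    return adjusted

# Hypothetical per-subject coherence scores for two small groups.
controls = [0.62, 0.58, 0.66, 0.60]
patients = [0.51, 0.55, 0.49, 0.57]
z, p = mann_whitney_z(controls, patients)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```

With groups as small as those in this study, an exact test may behave differently from the normal approximation, so the sketch is best read as showing where the Z and corrected P values come from rather than as a replacement for a statistics package.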
Key Findings
Primary TAT results: Compared to controls, FEP patients produced fewer words, shorter sentences, more sentences, lower semantic coherence (Z=3.2, P=0.001), and markedly lower on-topic scores (Z=3.5, P=0.001); tangentiality was not significant. Speech graph connectivity (e.g., LSC, LSCr) was reduced in FEP. Ambiguous pronoun count and repetition were not significantly different.
CHR-P vs controls (TAT): On-topic and semantic coherence were reduced; no significant differences from FEP on these measures. CHR-P showed higher LCC, LCCr, and LSCr than FEP but did not differ from controls on these graph metrics.
Excluding medicated participants did not qualitatively change the results, except that the FEP vs control difference in word count lost significance. Among CHR-P, after controlling for IQ, converters and non-converters differed only in LSC (T=2.8, P=0.011) and LSCr (T=3.1, P=0.005); without IQ control, these differences were not evident.
Prompts: FEP required more prompts than controls (Z=2.6, P=0.0084) and CHR-P (Z=2.3, P=0.020).
Inaudibles: CHR-P had more inaudible segments per word than controls for TAT (Z=2.2, P=0.029), but FEP did not differ from controls or CHR-P. Controlling for inaudibles did not alter the significance of the identified group differences.
Inter-measure relationships: Graph metrics strongly correlated among themselves. LSC negatively associated with repetition and positively with on-topic. No significant association between graph connectivity and semantic coherence. Coherence correlated positively with number of words, sentence length, and on-topic, and negatively with number of sentences.
Associations with clinical/cognitive measures: After FDR correction, higher TLI negative scores were associated with fewer words and lower graph connectivity (e.g., LCC, LSC; PFDR ≤ 0.014). No significant associations with IQ or years of education remained after correction. Controlling for IQ/education attenuated some group differences (e.g., FEP vs control differences in number of words, number of sentences, and LSCr were no longer significant), but differences in sentence length, coherence, on-topic score, and several graph measures remained. Digit span correlated with coherence (TAT) and with number of sentences, on-topic score, and ambiguous pronouns (DCT). Controlling for digit span: for DCT, no NLP group differences remained; for TAT, on-topic and graph connectivity differences remained.
Task comparisons:
- DCT story retell: Replicated TAT findings for FEP vs controls: reduced coherence (Z=2.2, P=0.026), on-topic (Z=3.5, P=0.001), fewer words, shorter sentences, reduced LCC/LSC/normalized metrics. FEP vs CHR-P also showed reductions in most of these (except some graph normalizations). Ambiguous pronouns were increased in FEP vs controls (Z=3.2, P=0.001); CHR-P did not differ from either group on ambiguity.
- Free speech: Only number of sentences was increased in FEP vs both controls and CHR-P. No significant differences in coherence, on-topic (not computed), repetition (saturated at 1 for several long excerpts), or pronouns. Nonetheless, graph metrics (LCC, LCCr, LSCr) were lower in FEP vs CHR-P. Inaudibles were fewer in free speech than TAT/DCT; DCT had more inaudibles for FEP vs controls (Z=2.0, P=0.047).
Discussion
Multiple automated NLP measures discriminated psychosis-related groups. Semantic coherence and speech graph connectivity were reduced in FEP vs controls; CHR-P showed intermediate or selective alterations (reduced coherence/on-topic vs controls; higher graph connectivity than FEP). The novel on-topic score consistently outperformed tangentiality, suggesting average topical alignment is a more sensitive marker than temporal drift. Inter-measure analyses indicated limited redundancy: graph connectivity did not track semantic coherence, implying complementary dimensions of speech disorganization. Some effects were robust to IQ and education, though certain graph measures attenuated after covariate control, aligning with known relations of graph metrics to cognition and education. Task choice mattered: structured elicitation (TAT, DCT) yielded stronger group separations than free speech, likely reflecting task-specific cognitive demands and availability of stimulus references for on-topic calculations. Speech graph connectivity uniquely differentiated CHR-P converters when adjusting for IQ, underscoring potential prognostic value. Together, results support combining diverse, computationally inexpensive NLP markers to enhance assessment and prediction in early psychosis.
Conclusion
Automated NLP markers of transcribed speech can quantify formal thought disorder and distinguish FEP, CHR-P, and controls. On-topic score, semantic coherence, and speech graph connectivity were particularly informative, with complementary information across measures. Structured elicitation tasks (TAT, DCT) outperformed free speech for detecting abnormalities. Speech graph metrics showed potential to differentiate CHR-P converters when accounting for IQ. Future work should validate these findings in larger, diverse cohorts, explore multimarker combinations and machine learning models for individual-level prediction, refine pronoun/coreference measures with improved models, and systematically assess task-dependent cognitive influences.
Limitations
Key limitations include a modest sample size (N=54; CHR-P converters n=8), with risk of type II error and limited generalizability; a focus on FEP without a chronic psychosis comparison; group differences in potential confounders (antipsychotic use, IQ, education, digit span, number of prompts); task order that was not randomized; attenuation of some NLP differences after controlling for IQ, education, or digit span (notably for DCT); higher rates of inaudibles in some conditions; the inability to compute certain measures (on-topic, tangentiality) for free speech; potential inaccuracies in coreference resolution affecting ambiguous pronoun counts; and the use of group-level analyses rather than individual-level predictive modeling.