"Ask" or "Inquire": operationalizing speech formality in psychosis and its risk states using etymology

M. Cotter, A. McGowan, et al.

In this study, Matthew Cotter, Alessia McGowan, and colleagues examine how individuals with psychotic symptoms use language differently. Relative to healthy controls, those with recent-onset psychosis used proportionately more Germanic-origin words and fewer of the more formal Old French-origin words, with implications for understanding speech and role functioning in mental health.

Introduction

Disordered communication characterizes psychosis and is linked to functional deficits. Prior work in healthy speakers suggests that Latinate words (often borrowed via Old French) signal greater formality than Germanic-origin words, and literary authors exploit this etymological contrast to convey perceived formality or intellect. This study presents the first analysis of speech etymology content in a clinical cohort. The study authors developed a pipeline to quantify the proportions of Germanic-origin versus Old French-origin words in transcripts of open-ended interviews with individuals with recent-onset psychosis (ROP), individuals at clinical high risk (CHR), and healthy controls (HC). They hypothesized that, relative to HC, ROP speech would contain proportionately more Germanic and fewer Old French words, alongside reduced lexical diversity and lower perplexity (rarity), reflecting difficulty modulating formality and retrieving uncommon words. They also examined associations with demographic and clinical covariates (sex, age, site, education, race, antipsychotic use, socioeconomic status via maternal education, and IQ where available).

Literature Review

Among healthy speakers, Latinate-origin words are used to convey formality, while Germanic-origin words convey informality or sincerity. Studies have quantified these patterns in spoken language and classical literature, showing systematic stylistic manipulation of etymological content to signal formality or intellect. Prior clinical NLP research has linked lexical diversity measures (e.g., Honoré's Statistic) to poverty of speech in schizophrenia, but etymology has not been studied in clinical cohorts.

Methodology

Study design: Two language production studies spanning the schizophrenia spectrum were analyzed: a single-site study in New York (2016–2023; R01MH107558) and a multisite study (2018–2022; R01MH115332) in New York (USA), Melbourne (Australia), and Toronto (Canada).

Participants: 92 ROP (symptom onset within 5 years), 144 CHR, and 173 HC with similar demographics. Exclusions included risk to self or others, medical or neurological disorders affecting language, IQ < 70, and, for HC only, any DSM Axis I diagnosis. Diagnoses were established via SCID-5; CHR status was assessed with SIPS (North America) or CAARMS (Australia).

Assessments: Demographics (age, sex, race), medication (antipsychotic yes/no), symptom scales (PANSS for ROP; SIPS/CAARMS for CHR; HC received site-appropriate assessments), functioning (Global Functioning: Role and Social, GF-R/GF-S), and IQ (WASI-2 subtests), collected in R01MH115332.

Data collection: Approximately 30-minute open-ended interviews (starter prompt: "How have things been going for you lately?") were recorded and transcribed by TranscribeMe!.

NLP preprocessing: Transcripts were lowercased and lemmatized with Stanza; special characters and punctuation were removed.

Etymology extraction: Lemmas were queried against Etymonline and a Wiktionary-derived database to identify Germanic or Old French origins. To avoid bias from structural function words (predominantly Germanic), analysis was restricted to nouns, verbs, adjectives, and adverbs. Only lemmas with exclusively Germanic or exclusively Old French origins were counted; lemmas of mixed origin or of neither origin were excluded. Etymology proportions were computed as the fraction of content-word lemmas with Germanic or Old French origin.

Lexical diversity: Honoré's Statistic was computed on whole transcripts without part-of-speech filtering.

Perplexity (rarity): First-order perplexity was estimated using lemma frequencies from Google N-grams; transcript perplexity was computed as the log-product (sum of log-probabilities) across content-word lemmas.
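
The three lexical measures described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the etymology lookup `TOY_ETYMOLOGY` and all function names are hypothetical, the input is assumed to be already lemmatized and tagged with Universal POS labels, the denominator is assumed to be all content-word tokens, and Honoré's Statistic uses the standard textbook form R = 100 · ln(N) / (1 − V1/V). The summary does not specify how unknown lemmas or normalization were handled for perplexity, so the sketch simply sums log-probabilities and skips lemmas missing from the frequency table.

```python
import math
from collections import Counter

# Universal POS tags for nouns, verbs, adjectives, and adverbs.
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}

# Hypothetical toy lookup for illustration only; the study queried
# Etymonline and a Wiktionary-derived database instead.
TOY_ETYMOLOGY = {
    "ask": {"Germanic"},
    "begin": {"Germanic"},
    "inquire": {"Old French"},
    "war": {"Germanic", "Old French"},  # mixed origin, so not counted
}

def etymology_proportions(tagged_lemmas, lookup):
    """Fraction of content-word tokens that are exclusively Germanic
    or exclusively Old French (denominator choice is an assumption)."""
    content = [lemma for lemma, pos in tagged_lemmas if pos in CONTENT_POS]
    if not content:
        return {"Germanic": 0.0, "Old French": 0.0}
    counts = Counter()
    for lemma in content:
        origins = lookup.get(lemma, set())
        if len(origins) == 1:  # skip mixed-origin and unknown lemmas
            counts[next(iter(origins))] += 1
    return {k: counts[k] / len(content) for k in ("Germanic", "Old French")}

def honore_statistic(lemmas):
    """Honoré's R = 100 * ln(N) / (1 - V1 / V), with N tokens,
    V distinct types, and V1 types that occur exactly once."""
    freq = Counter(lemmas)
    n_tokens, n_types = len(lemmas), len(freq)
    n_hapax = sum(1 for c in freq.values() if c == 1)
    if n_types == 0 or n_hapax == n_types:
        raise ValueError("undefined when every type occurs exactly once")
    return 100 * math.log(n_tokens) / (1 - n_hapax / n_types)

def unigram_log_prob(lemmas, probs):
    """Sum of log unigram probabilities; lemmas absent from the
    (hypothetical) frequency table are skipped."""
    return sum(math.log(probs[lemma]) for lemma in lemmas if lemma in probs)

# A tiny pre-tagged transcript of (lemma, UPOS) pairs.
transcript = [
    ("i", "PRON"), ("ask", "VERB"), ("and", "CCONJ"), ("inquire", "VERB"),
    ("then", "ADV"), ("begin", "VERB"), ("war", "NOUN"), ("ask", "VERB"),
]
props = etymology_proportions(transcript, TOY_ETYMOLOGY)
diversity = honore_statistic([lemma for lemma, _ in transcript])
print(props, diversity)
```

Restricting the etymology count to content words while computing Honoré's Statistic on every token mirrors the distinction drawn in the description above; whether proportions were computed over tokens or over distinct lemmas is not stated, so the token-based choice here is only one reasonable reading.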

Statistical analysis: Potential covariates (sex, age, site, education, race, socioeconomic status via maternal education, IQ) were tested sequentially for associations with the lexical variables. Categorical covariates were tested with t-tests/ANOVAs and adjusted by HC median differences; continuous covariates were tested with Spearman correlations and adjusted by linear regression residualization from HC data. Antipsychotic use was tested in CHR and ROP only. Group differences across HC/CHR/ROP were examined via ANOVA with pairwise tests; significant findings were also tested within sites. Associations among etymology, lexical diversity, and perplexity were assessed via Pearson correlations on covariate-adjusted variables. Multiple linear regression models predicted uncorrected Germanic and Old French proportions from cohort, lexical diversity, perplexity, sex, age, education, site, and race; standardized coefficients were compared.

Clinical relevance: Spearman correlations tested relations of etymology proportions with positive and negative symptoms (SIPS/CAARMS for CHR; PANSS for ROP) and with GF-R and GF-S in the combined clinical cohorts. A Bonferroni-corrected alpha of 0.0025 was applied across models and clinical correlations.

Etymology source: The authors compared Etymonline- and Wiktionary-derived proportions and chose Wiktionary for subsequent analyses because the two were nearly identical and Wiktionary offers broader multilingual potential.
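
The residualization step for continuous covariates can be illustrated as follows. This is a hedged sketch of one plausible reading, in which a line is fitted to healthy-control data only and its prediction is subtracted from every participant's value; the function names and toy numbers are hypothetical, and the authors' exact procedure may differ. The last helper shows the Bonferroni arithmetic: a threshold of 0.0025 is what a family-wise alpha of 0.05 gives when divided across 20 tests, but that test count is an inference and is not stated in this summary.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least-squares slope and intercept for one predictor."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def residualize_from_hc(values, covariate, is_hc):
    """Fit value ~ covariate in healthy controls only, then subtract
    the fitted prediction from every participant's value."""
    hc_x = [c for c, hc in zip(covariate, is_hc) if hc]
    hc_y = [v for v, hc in zip(values, is_hc) if hc]
    slope, intercept = fit_line(hc_x, hc_y)
    return [v - (intercept + slope * c) for v, c in zip(values, covariate)]

def bonferroni_alpha(family_alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return family_alpha / n_tests

# Toy data (hypothetical): three controls and one clinical participant.
germanic = [0.70, 0.71, 0.72, 0.74]
age = [20, 30, 40, 30]
is_hc = [True, True, True, False]

adjusted = residualize_from_hc(germanic, age, is_hc)
threshold = bonferroni_alpha(0.05, 20)  # 20 tests is an assumption
print(adjusted, threshold)
```

Fitting on controls alone keeps any covariate effect that is specific to the clinical groups in the adjusted values, which is presumably why the adjustment was anchored to HC data rather than to the pooled sample.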

Key Findings
  • Etymology source agreement: Etymonline- and Wiktionary-derived etymology proportions were nearly identical across transcripts (Germanic r = 0.99, p < 0.05; Old French r = 0.99, p < 0.05). Wiktionary-derived data were used henceforth.
  • Group differences (after covariate adjustment for sex, age, site, education, race, and IQ where available): HC used fewer Germanic and more Old French words than ROP, as hypothesized; CHR showed a similar pattern to ROP. HC had greater lexical diversity than either clinical group; perplexity (rarity) did not differ across groups. Patterns held in the IQ-known sub-cohort and were largely replicated within sites (with a trend-level Germanic difference in New York).
  • Descriptive statistics (Table 2, adjusted means; full cohort, mean (SD)):
      - Proportion Germanic: HC 0.701 (0.026), CHR 0.724 (0.032), ROP 0.724 (0.038).
      - Proportion Old French: HC 0.190 (0.058), CHR 0.172 (0.023), ROP 0.173 (0.030).
      - Honoré's Statistic: HC 0.692 (12), CHR 0.618 (14), ROP 0.666 (15).
      - Perplexity: HC 11.2 (5.2), CHR 10.7 (5.5), ROP 11.2 (5.7).
    Significant ANOVA/t-tests indicated HC differed from CHR and ROP in etymology proportions and lexical diversity (p < 0.001); CHR and ROP did not differ materially apart from a few noted comparisons.
  • Correlation matrix (covariate-adjusted variables; Table 3): Germanic vs Old French: r = -0.88. Germanic with lexical diversity: r = -0.34; Germanic with perplexity: r = -0.37. Old French with lexical diversity: r = 0.33; Old French with perplexity: r = 0.24. Lexical diversity with perplexity: r = 0.20. All p < 0.05.
  • Multiple regression (Table 4): Adjusted R² = 0.388 for the Germanic model; 0.326 for the Old French model. Clinical cohort, lexical diversity, and perplexity significantly contributed to both models (corrected p < 0.05). Site correlated with Old French; education correlated with Germanic. Standardized coefficients indicated clinical cohort (particularly ROP vs HC) was the largest contributor to etymology variance.
  • Clinical relevance: Role functioning (GF-R) in the combined clinical cohort was associated with greater formality: higher Old French use (rho = 0.26, p < 0.01) and lower Germanic use (rho = -0.19, p < 0.05); these survived Bonferroni correction. Social functioning (GF-S) showed no association. Etymology proportions were not associated with positive or negative symptoms after correction. Role functioning associations were not explained by lexical diversity or perplexity and were consistent within recruitment sites.

Discussion

The findings indicate that individuals with psychotic symptoms (ROP and CHR) use proportionately more Germanic-origin and fewer Old French-origin words than healthy controls during open-ended interviews, consistent with decreased speech formality. These group differences persisted after adjusting for demographic and cognitive covariates, suggesting that etymological composition captures a dimension of formality not reducible to lexical diversity or word rarity (perplexity). Although etymological proportions were moderately related to lexical diversity and perplexity, multivariate models showed independent contributions of diagnosis, diversity, and rarity to etymology, indicating partially distinct constructs. Importantly, reduced formality (more Germanic, less Old French) related specifically to poorer role functioning but not to social functioning, in line with the idea that a formal communication style may matter more in academic and work contexts than in social interactions. Educational attainment was related to etymology patterns, whereas maternal education (the SES proxy) was not, implying that exposure to formal vocabulary in schooling may shape speech formality. Cross-site analyses suggested generalizability across American, Canadian, and Australian English. Overall, speech etymology appears to operationalize formality in clinical speech and may help explain functional impairments in psychosis beyond general lexical complexity measures.

Conclusion

This is the first clinical study of speech etymology content. Proportions of Germanic and Old French word use differ between psychosis-spectrum individuals and healthy controls, with variance partly attributable to lexical diversity and perplexity but also reflecting an independent dimension of speech formality. Reduced formality is associated with lower role functioning. Etymology-based measures may help operationalize formality in clinical language, inform mechanisms underlying functional deficits, and guide interventions to improve role functioning. Future work should test generalizability across languages and cohorts, integrate cognitive measures, and leverage large-scale multilingual datasets (e.g., AMP SCZ) to examine cross-linguistic etymology-formality relationships and clinical correlates.

Limitations

Key limitations include: limited availability of IQ and broader cognitive data (MCCB) due to COVID-19 disruptions and non-concurrent assessments; lack of detailed antipsychotic dosage information (particularly in ROP); absence of data on native language, language spoken at home, and migration history despite English fluency, which may influence vocabulary and formality associations; and incomplete explanatory power of the models (approximately one-third of variance explained), indicating that unmeasured factors (e.g., working memory, processing speed) might affect etymology patterns. Generalizability to non-English languages is uncertain; English’s unique etymological history may facilitate formality indexing in a way that does not directly translate to other languages. Future studies should collect comprehensive cognitive batteries, antipsychotic dosing, and detailed language background; assess specificity relative to other clinical groups (e.g., depression, anxiety); and examine multilingual etymology using resources like Wiktionary and cohorts such as AMP SCZ.
