Psychology

Are there distinct levels of language comprehension in autistic individuals – cluster analysis

A. Vyshedskiy, R. Venkatesh, et al.

This study by Andrey Vyshedskiy, Rohan Venkatesh, and Edward Khokhlovich identifies three distinct levels of language comprehension in a sample of more than 31,000 autistic individuals. By moving beyond the traditional verbal/nonverbal classification, the findings offer a more nuanced view of communication abilities and highlight the link between language comprehension and effective therapy.

Introduction

Autism Spectrum Disorder (ASD) involves impairments in social communication and restricted, repetitive behaviors. Language deficits are often central, and receptive (comprehension) abilities are commonly more impaired than expressive language. Despite this, communication level is often described informally along a verbal continuum only (nonverbal, minimally verbal, verbal), leaving comprehension unclassified. The authors propose an objective classification of language comprehension level to improve the characterization of communication skills and to guide therapy. They used a parent-reported language comprehension instrument, the Mental Synthesis Evaluation Checklist (MSEC), which is designed to target higher aspects of language and was administered through an app widely used by individuals with ASD. Using unsupervised hierarchical clustering and principal component analysis (PCA), they aimed to determine which language comprehension abilities co-occur and whether autistic individuals can be grouped into distinct comprehension phenotypes. The hypothesis was that ASD participants can be assigned to subgroups (phenotypes) based on language comprehension abilities, improving classification beyond verbal level alone.

Literature Review

The MSEC has been widely used and validated in large ASD cohorts. Prior work showed that the MSEC can distinguish children with ASD from typically developing children as early as age 2 (N=29,138), with sensitivity/specificity improving with age: 0.71/0.88 at age 2; 0.92/0.95 at 3; 0.91/1.0 at 4; 0.93/0.97 at 5; 0.93/1.0 at 6; and 0.95/1.0 at 7.

Multiple longitudinal app-based studies reported the following associations:

- Less passive video/TV watching: 1.4-fold greater improvement in MSEC comprehension (p=0.0128), despite a 1.3-fold reduction in expressive language (p=0.0719).
- A therapeutic language intervention: 2.2-fold greater improvement in MSEC comprehension (p<0.0001) vs. 1.4-fold in expressive language (p=0.0144).
- Pretend play: 1.9-fold faster MSEC comprehension improvement (p<0.0001) vs. 1.4-fold in expressive language (p<0.0001).
- Longer joint engagement: 1.4-fold faster MSEC comprehension development (p=0.0019).
- Absence of seizures: 1.5-fold faster MSEC comprehension improvement (p<0.0001).
- Gluten-free diet: 1.5-fold faster MSEC comprehension improvement (p<0.0001), but no change in expressive language (p=0.5918).
- Meat/eggs and vegetable intake: 1.6-fold (p<0.0001) and 1.5-fold (p<0.0001) faster MSEC comprehension improvement, respectively, with smaller effects on expressive language.

Prior ASD clustering literature has used unsupervised methods to parse diagnostic heterogeneity based on core symptoms, but not on language comprehension. Historically, language phenotypes were defined via expressive language rather than comprehension. Evidence also shows that receptive language often diverges from expressive language in ASD.
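The sensitivity/specificity pairs quoted above follow the standard definitions: sensitivity is the fraction of ASD cases correctly flagged, and specificity is the fraction of typically developing children correctly cleared. The sketch below shows the arithmetic only. The function name and the confusion-matrix counts are hypothetical, chosen so the result matches the reported age-2 values; they are not the counts from the cited work.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts.

    tp: ASD cases correctly classified as ASD
    fn: ASD cases missed
    tn: typically developing children correctly classified
    fp: typically developing children misclassified as ASD
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Hypothetical counts (NOT from the study) per 100 children in each group.
sens, spec = sensitivity_specificity(tp=71, fn=29, tn=88, fp=12)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```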

Methodology

Design and data source: Parent-reported assessments were collected via a free language therapy app launched in September 2015. Caregivers consented to anonymized analysis and completed the Autism Treatment Evaluation Checklist (ATEC) and the Mental Synthesis Evaluation Checklist (MSEC).

Inclusion criteria: parent-reported ASD diagnosis; absence of seizures; absence of serious/moderate sleep problems; age 4–21 years (ensuring exposure to the assessed concepts and avoiding age-related linguistic decline). For participants with multiple entries, the last evaluation was used. Final sample: N=31,845 (mean age 6.5±2.6 years; 78% male).

Language comprehension measures: Fourteen comprehension items derived from ATEC and MSEC assessed the following abilities: (1) knowing name; (2) responding to 'No/Stop'; (3) following commands; (4) understanding color/size modifiers; (5) multiple modifiers; (6) size superlatives; (7) numbers; (8) spatial prepositions; (9) verb tenses; (10) flexible syntax (changes in word order change the meaning); (11) possessive pronouns; (12) explanations beyond immediate context; (13) simple stories; and (14) elaborate fairytales. Responses were coded as very true (0), somewhat true (1), or not true (2); lower scores indicate better comprehension.

Additional measures: ATEC subscales for expressive language, sociability, sensory awareness, and health were used for external validation. Verbal level was defined from the ATEC expressive language subscale (verbal: uses sentences with 4+ words; nonverbal: not able to use 2 words at a time; otherwise minimally verbal).

Analytic approach: Unsupervised hierarchical clustering (Ward’s method, Euclidean distance) was applied to (a) the 14 language comprehension items to detect co-acquisition clusters, and (b) participant profiles to identify language-comprehension phenotypes. Robustness checks included alternative linkage and distance methods (Ward.D/Ward.D2, Average, Complete, McQuitty; Manhattan, Minkowski) and stratification by age group (4–6, 6–12, 12–21) and time point (first vs. last evaluation). PCA was used to visualize the separation among clusters. A two-dimensional heatmap (R pheatmap) related participant clusters to item clusters. Statistical comparisons across phenotypes for external measures used two-sample t tests (p<0.0001). Data and code are available upon reasonable request.
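To make the two clustering passes concrete, here is a minimal sketch of Ward agglomerative clustering in pure Python, applied once to participant profiles and once to the transposed matrix (items). It is an illustration under stated assumptions, not the authors' code: the summary indicates the analysis was done in R, the function names `ward_cost` and `ward_clusters` are hypothetical, and the six participant profiles are synthetic, using the 0/1/2 response coding described above. The merge cost uses squared Euclidean distance between centroids, which is the usual form of Ward's criterion.

```python
from itertools import combinations


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a["centroid"], b["centroid"]))
    return a["n"] * b["n"] / (a["n"] + b["n"]) * sq_dist


def ward_clusters(points, k):
    """Greedily merge points into k clusters; returns lists of row indices."""
    clusters = [{"members": [i], "n": 1, "centroid": list(p)}
                for i, p in enumerate(points)]
    while len(clusters) > k:
        # Pick the pair of clusters whose merge is cheapest under Ward's criterion.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]))
        a, b = clusters[i], clusters[j]
        n = a["n"] + b["n"]
        merged = {
            "members": a["members"] + b["members"],
            "n": n,
            # Size-weighted mean of the two centroids.
            "centroid": [(a["n"] * x + b["n"] * y) / n
                         for x, y in zip(a["centroid"], b["centroid"])],
        }
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return [sorted(c["members"]) for c in clusters]


# Synthetic data (NOT study data): 6 participants x 14 items,
# coded 0 = very true, 1 = somewhat true, 2 = not true.
participants = [
    [0] * 14,                    # all items understood
    [0] * 13 + [1],              # fairytales only "somewhat true"
    [0] * 7 + [2] * 7,           # command + modifier items only
    [0] * 6 + [1] + [2] * 7,     # same, numbers only "somewhat true"
    [0] * 3 + [2] * 11,          # command items only
    [0] * 2 + [1] + [2] * 11,    # same, commands only "somewhat true"
]
# Transpose so each row is one item scored across all participants.
items = [list(col) for col in zip(*participants)]

print("participant clusters:", ward_clusters(participants, 3))
print("item clusters:", ward_clusters(items, 3))
```

This naive version re-evaluates every pair of clusters at each step, so it is only suitable for toy inputs; a sample of 31,845 participants would call for an optimized library implementation.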

Key Findings

- Three distinct clusters of language comprehension abilities emerged consistently: (1) Command language: knows name, responds to 'No/Stop', follows some commands (items 1–3); (2) Modifier language: understands simple color/size modifiers, multiple modifiers, size superlatives, numbers (items 4–7); (3) Syntactic language: spatial prepositions, verb tenses, flexible syntax, possessive pronouns, explanations, simple stories, elaborate fairytales (items 8–14). PCA supported separation (for items, PC1 explained 44.9% of the variance; PC2 11.7%). Correlation analysis showed stronger within-cluster correlations. The three-cluster solution was more stable and had greater inter-cluster distances than alternative two- or five-cluster solutions.
- Participant clustering (N=31,845) yielded three robust language-comprehension phenotypes: Syntactic Language Phenotype (17%, n=5,298), Modifier Language Phenotype (43%, n=13,782), and Command Language Phenotype (40%, n=12,765). PCA indicated reasonable separation (participants: PC1=48.2% of the variance; PC2=9.1%).
- Heatmap alignment showed that participant clusters map onto the three ability clusters: the syntactic phenotype showed predominantly “very true” across all items; the modifier phenotype showed strengths in command and modifier items but not syntactic items; the command phenotype showed strengths limited to command items.
- External validity: Phenotypes differed significantly in expressive language, sociability, sensory awareness, and health (all t tests p<0.0001), with lower (better) scores in the syntactic phenotype.
- ASD severity distribution (parent-reported): Syntactic: mild 79%, moderate 16%, severe 5%; Modifier: mild 63%, moderate 27%, severe 10%; Command: mild 39%, moderate 33%, severe 28%.
- Verbal level distribution: Syntactic: verbal 67.4%, minimally verbal 26.6%, nonverbal 6.0%; Modifier: verbal 31.9%, minimally verbal 53.1%, nonverbal 15.0%; Command: verbal 8.9%, minimally verbal 42.5%, nonverbal 48.6%. In aggregate, nonverbal/minimally verbal participants made up 91% of the command, 68% of the modifier, and 33% of the syntactic phenotype.
- No significant age differences between clusters (mean ages ~6.3–6.6 years), suggesting that phenotypic differences are not age-driven.
- Stability: The three-cluster structure was stable across clustering methods, seeds, age strata, and time points (first vs. last evaluation).
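The aggregated figures above can be re-derived from the reported counts and percentages. The short check below is only an arithmetic sketch: the numbers are copied from the findings, while the variable names are illustrative.

```python
# Reported phenotype sizes and total sample size.
N = 31845
sizes = {"syntactic": 5298, "modifier": 13782, "command": 12765}

# Reported verbal-level distribution (percent) within each phenotype.
verbal_level = {
    "syntactic": {"verbal": 67.4, "minimally verbal": 26.6, "nonverbal": 6.0},
    "modifier": {"verbal": 31.9, "minimally verbal": 53.1, "nonverbal": 15.0},
    "command": {"verbal": 8.9, "minimally verbal": 42.5, "nonverbal": 48.6},
}

# Share of the whole sample in each phenotype, rounded to a whole percent.
shares = {name: round(100 * n / N) for name, n in sizes.items()}

# Combined nonverbal + minimally verbal share within each phenotype.
low_verbal = {
    name: round(dist["minimally verbal"] + dist["nonverbal"])
    for name, dist in verbal_level.items()
}

print("phenotype shares (%):", shares)
print("nonverbal or minimally verbal (%):", low_verbal)
```

The output shows that a third of the syntactic phenotype is nonverbal or minimally verbal, which is the dissociation between comprehension and expressive language that the findings emphasize.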

Discussion

Findings support the hypothesis that language comprehension in autistic individuals organizes into three distinct, co-acquired clusters (command, modifier, and syntactic) and that individuals correspondingly fall into three comprehension phenotypes. The robust separation across analytic methods, ages, and time points indicates that these are meaningful dimensions rather than artifacts. The phenotypes correlated with independent measures of symptom severity (expressive language, sociability, sensory awareness, health) and with parent-reported ASD level, reinforcing their clinical relevance. Associations with verbal level were strong but not absolute, underscoring that receptive and expressive language are dissociable in ASD. The persistence of the three-cluster pattern in older subsets argues against simple differences in exposure to linguistic concepts as an explanation and suggests distinct underlying regulatory mechanisms for command, modifier, and syntactic comprehension abilities. Clinically, augmenting the traditional one-dimensional verbal classification with a comprehension level (command, modifier, or syntactic) can yield a more accurate depiction of communication abilities and guide interventions to target receptive language specifically, potentially improving outcomes.

Conclusion

This large-scale, data-driven study identifies three stable language comprehension phenotypes in ASD (command, modifier, and syntactic), mirroring three clusters of co-acquired abilities. The phenotypes are clinically meaningful: they correlate with broader symptom measures and ASD severity, yet overlap only partially with verbal level. The work supports adopting a two-dimensional characterization of communication ability that incorporates both verbal output and comprehension phenotype. Future directions include validating these phenotypes in additional and older populations, elucidating the underlying neurocognitive mechanisms, and developing age-appropriate clinician- and parent-report assessments of syntactic comprehension for younger children (2.5–4.5 years). Using phenotype-based assessments to monitor acquisition could help tailor language therapies to receptive skills.

Limitations

- Reliance on parent reports, which may introduce bias; however, prior work indicates reasonable agreement with clinician assessments.
- Predominantly young cohort; phenotypes may be transitional in development, so confirmation in older individuals is needed.
- Lack of data on hearing or other sensory impairments and intellectual disability, which can affect language development; these could not be controlled or analyzed.
- App-based convenience sample may limit generalizability.
- Although cultural/linguistic exposure differences were considered, and analyses in older age groups supported stability, residual exposure-related confounding cannot be entirely ruled out.
