Measuring linguistic complexity in Chinese: An information-theoretic approach

X. Liu, F. Li, et al.

This study by Xun Liu, Feng Li, and Wei Xiao applies an information-theoretic approach, based on Kolmogorov complexity, to the measurement of Chinese linguistic complexity. Using a corpus of roughly 58 million characters, it reports significant correlations with established complexity metrics and relates the results to morpheme richness and topic prominence. It also compares Chinese with nine European languages and examines Chinese L1 and L2 writing at different proficiency levels.

Introduction

Linguistic complexity concerns the number of components and their interconnections in a language system and the cognitive difficulty for users. It is central to linguistic description, typology, language development assessment, and L2 teaching, yet measuring it accurately across tiers (lexicon, syntax, morphology, collocation) is challenging. Prior work largely focuses on English and European languages; Chinese has unique properties—logographic writing, analytic morphology with many morphemes realized as standalone characters, and distinctive word order patterns—necessitating tailored approaches. Existing Chinese metrics either require extensive feature tagging (risking subjectivity), suffer from redundancy and collinearity when measuring macro-level complexity, or lack cross-linguistic transferability. This study adopts Kolmogorov complexity, an information-theoretic measure approximated via compression, to evaluate Chinese at morphological, syntactic, and overall levels. Research questions: (1) Is Kolmogorov complexity reliable for Chinese, i.e., how does it correlate with established Chinese complexity metrics? (2) Is it valid for Chinese, i.e., what Chinese language features does it capture?

Literature Review

Kolmogorov complexity, originating from information theory, defines the complexity of a string as the length of its shortest description; in practice it is approximated via compression (e.g., Lempel–Ziv/gzip). In linguistics, it has been used to measure structural surface redundancy at morphological, syntactic, and overall tiers. Early typological work showed that languages intuitively regarded as morphologically simple (e.g., English) tend to have lower morphological but higher syntactic complexity, while morphologically rich languages (e.g., Finnish) show the opposite. Compression-based approaches have also aided language family classification and have been applied to register variation, diachronic change, and SLA, outperforming traditional metrics in some settings. Despite these broad applications, most targets are European languages, with little systematic evaluation in Chinese. Prior Chinese complexity research used metrics such as T-units, topic chains, lexical diversity, and specialized features; however, these suffer from redundancy, heavy annotation requirements, and limited cross-linguistic comparability. Thus, a holistic, objective, and transferable approach like Kolmogorov complexity warrants testing for Chinese.
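The idea of approximating Kolmogorov complexity by compression can be illustrated with a minimal Python sketch. It is not the study's code (the authors used adapted R scripts); the function names `gzip_bytes` and `compression_ratio` and the two toy strings are assumptions made purely for illustration.

```python
import gzip
import random


def gzip_bytes(text: str) -> int:
    """Size in bytes of the gzip-compressed UTF-8 encoding of `text`."""
    return len(gzip.compress(text.encode("utf-8")))


def compression_ratio(text: str) -> float:
    """Compressed size divided by raw size; lower means more redundancy."""
    return gzip_bytes(text) / len(text.encode("utf-8"))


# Hypothetical toy inputs: a highly repetitive string and a varied one
# of the same length (seeded so the example is reproducible).
repetitive = "abcd" * 500
rng = random.Random(42)
varied = "".join(rng.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(2000))

if __name__ == "__main__":
    print("repetitive:", compression_ratio(repetitive))
    print("varied:    ", compression_ratio(varied))
```

The repetitive string compresses to a much smaller fraction of its raw size than the varied one, which is the sense in which compressed size serves as a proxy for structural redundancy.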

Methodology

Data: 480 Chinese PhD theses (2016–2020) from CNKI, spanning philosophy, economics, law, education, arts, history, management, and fine arts. Cleaning removed non-Chinese characters and most punctuation except sentence-final full stops (other sentence terminators normalized to full stops). Final corpus size: 58,409,544 characters (mean 121,686.55 per thesis).

Kolmogorov complexity metrics: Approximated using gzip-based compression with adapted R scripts.

  • Morphological complexity (MCS): random deletion of 10% of Chinese characters (morphological distortion), compress original and distorted texts, compute ratio of compressed sizes (with negative sign for visualization); iterate 1000 times and average.
  • Syntactic complexity (SCS): random deletion of 10% of words (syntactic distortion), gzip compress original and distorted texts, compute ratio; iterate 1000 times and average.
  • Overall complexity (AOCS): regress compressed size on original size across samples; use residual (ε) as adjusted overall complexity.

Comparison metrics: 18 established metrics across four tiers, computed using L2C-Rater, CTAP for Chinese, and CLRA.

  • Character: CHAR_RTTR
  • Lexicon: LEXICAL_RTTR, LEXICAL_MATTR, LEXICAL_MTLD, LEXICAL_VARIATION, LEXICAL_DENSITY
  • Syntax: MLS, MLC, MLTU, C/S, C/T, MEAN_TREE_DEPTH, MAX_TREE_DEPTH
  • Collocation: COLL_RTTR, UNIQUE_RTTR, GENERAL_RTTR, BIGRAM_RTTR, DEP_RTTR

Analysis: Used R 4.2.1 and gzip 1.2.4. Shapiro–Wilk indicated non-normality for all but MEAN_TREE_DEPTH; hence Spearman rank correlations were used for associations. Mann–Whitney U tests compared two pseudo groups (random equal split) to assess internal heterogeneity. Visualization employed corrplot and GGally packages.
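The distortion-and-compression procedure for MCS, SCS, and AOCS might look roughly like the Python sketch below. It is an illustration under stated assumptions, not the authors' adapted R scripts: the input is assumed to be a pre-segmented list of words, texts are joined without spaces, the ratio is taken as distorted size over original size (the summary does not say which is the numerator), and the regression is ordinary least squares. All function names are hypothetical.

```python
import gzip
import random
from statistics import mean


def gz_size(text):
    """Size in bytes of the gzip-compressed UTF-8 encoding of `text`."""
    return len(gzip.compress(text.encode("utf-8")))


def drop_fraction(units, fraction, rng):
    """Return `units` with a random `fraction` of items deleted, order kept."""
    n_drop = round(len(units) * fraction)
    dropped = set(rng.sample(range(len(units)), n_drop))
    return [u for i, u in enumerate(units) if i not in dropped]


def distortion_scores(words, fraction=0.10, iterations=1000, seed=0):
    """Return (MCS, SCS) averaged over `iterations` random distortions.

    Assumption: each score is distorted compressed size / original
    compressed size, with MCS negated as the summary describes.
    """
    rng = random.Random(seed)
    text = "".join(words)          # assumption: no spaces between words
    original = gz_size(text)
    chars = list(text)
    morph, synt = [], []
    for _ in range(iterations):
        m = gz_size("".join(drop_fraction(chars, fraction, rng)))  # delete characters
        s = gz_size("".join(drop_fraction(words, fraction, rng)))  # delete words
        morph.append(-m / original)
        synt.append(s / original)
    return mean(morph), mean(synt)


def overall_residuals(original_sizes, compressed_sizes):
    """Residuals of a least-squares fit of compressed size on original size."""
    mx, my = mean(original_sizes), mean(compressed_sizes)
    sxx = sum((x - mx) ** 2 for x in original_sizes)
    sxy = sum((x - mx) * (y - my)
              for x, y in zip(original_sizes, compressed_sizes))
    slope = sxy / sxx
    intercept = my - slope * mx
    return [y - (intercept + slope * x)
            for x, y in zip(original_sizes, compressed_sizes)]


if __name__ == "__main__":
    # Hypothetical toy "document": a short segmented sentence repeated.
    toy_words = ["语言", "复杂", "度", "研究", "。"] * 40
    print(distortion_scores(toy_words, iterations=20, seed=1))
```

A positive residual from `overall_residuals` marks a text that compresses worse than its length predicts, which is how the adjusted overall score removes the effect of text length.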

Key Findings

Internal consistency: No significant differences between pseudo groups for MCS, SCS, or AOCS (all p > 0.05). MCS and AOCS strongly positively correlated (r = 0.72, p < 0.05); SCS strongly negatively correlated with MCS (r = -0.96, p < 0.05) and moderately negatively with AOCS (r = -0.63, p < 0.05).

Descriptive stats:

  • MCS: min -1.057, max -0.997, mean -1.013 (SD 0.013)
  • SCS: min 0.898, max 0.937, mean 0.911 (SD 0.007)
  • AOCS: min -589.485, max 517.246, mean -2.230 (SD 174.05)

Correlations with established metrics:

  • MCS: positive with CHAR_RTTR (r = 0.55), LEXICAL_RTTR (0.80), LEXICAL_MATTR (0.76), LEXICAL_MTLD (0.79), LEXICAL_VARIATION (0.54); negative with LEXICAL_DENSITY (-0.33). All p < 0.05.
  • SCS: positive with MLS (0.55), MLC (0.69), MLTU (0.72), MEAN_TREE_DEPTH (0.60), MAX_TREE_DEPTH (0.28); non-significant with C/S (0.07) and C/T (0.07). Collocations: strong negative with COLL_RTTR (-0.77), UNIQUE_RTTR (-0.81), GENERAL_RTTR (-0.76), BIGRAM_RTTR (-0.81), DEP_RTTR (-0.78). All p < 0.05 unless noted.
  • AOCS: generally positive with character, lexical, and collocation metrics; negative with most syntactic metrics, mirroring MCS/SCS patterns.

Validity checks: Cross-language comparison (vs. nine European languages) positioned Chinese at high MCS and low SCS, distinct from others. L2 comparison showed higher-level Chinese L2 close to L1, lower-level L2 more distant, indicating sensitivity to proficiency.
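The correlations reported in this section are Spearman rank correlations, which amount to a Pearson correlation computed on ranks. The pure-Python sketch below shows that computation with average ranks for ties; it is an illustration only (the study used R), the function names are hypothetical, and the values in the demo are made-up toy numbers rather than data from the paper.

```python
from statistics import mean


def average_ranks(values):
    """1-based ranks of `values`; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Made-up toy scores for five hypothetical texts.
    metric_a = [3.1, 2.4, 5.0, 4.2, 1.7]
    metric_b = [0.9, 0.8, 1.6, 1.1, 0.5]
    print(spearman(metric_a, metric_b))
```

Because only ranks enter the calculation, the coefficient is unaffected by the non-normal score distributions that motivated the choice of Spearman over Pearson.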

Discussion

The findings address both research questions: Kolmogorov complexity metrics are reliable for Chinese (strong inter-metric consistency and predictable correlations with established metrics) and valid (they capture known Chinese traits). The strong MCS–AOCS link and the MCS–SCS trade-off reflect the classic morphology–syntax compensation (Hockett’s dictum; Principle of Least Effort): what is not signaled morphologically is managed syntactically. For Chinese, MCS aligns with character and lexical richness/diversity, consistent with the large morpheme inventory and flexible compounding; its weak negative correlation with lexical density (-0.33) indicates that it does not track content-word proportion. SCS aligns with length- and depth-based syntactic complexity (longer sentences/clauses, deeper dependency trees), but not with clause ratio metrics (C/S, C/T), likely because compression taps structural (ir)regularities rather than counts of subordination. The negative SCS–collocation correlations suggest that richer, more varied collocations reflect freer word order and thus lower syntactic rigidity as measured by compression. AOCS overlaps partially with lexical richness and shows consistent patterns with SCS and collocation metrics, supporting its role as a global measure. Cross-linguistically, Chinese’s high MCS and low SCS reflect morpheme richness and topic-prominent, flexible word order; the L2 results show proficiency sensitivity, with advanced L2 approximating L1 profiles.

Conclusion

Kolmogorov complexity provides a succinct, objective, and holistic system (morphological, syntactic, overall) to quantify Chinese linguistic complexity. It exhibits high reliability (consistent internal and external correlations) and validity (captures hallmark features of Chinese and distinguishes L1/L2 proficiency levels; differentiates Chinese from European languages). Contributions include extending the generalizability of Kolmogorov complexity to a major non-alphabetic, analytic language and refining Chinese complexity measurement. Future work should: (1) refine morphological distortion units (e.g., character strokes) to reduce overlap with syntax; (2) expand to fine-grained syntactic/phrase-level metrics and compare with compression-based measures; (3) extend the framework to additional tiers (semantic, pragmatic) and broaden applications in quantitative Chinese linguistics and L2 assessment.

Limitations

Morphological distortion used Chinese characters as the unit, which potentially overlaps with syntactic distortion because many Chinese words consist of a single character. Fine-grained measures (e.g., complex noun phrase ratios) were not included in the comparisons, reflecting the coarse-grained nature of compression-based metrics. The validity comparisons used corpora differing in genre, themes, and lengths, which limits strict comparability. The current framework covers morphological, syntactic, and overall tiers; extensions to semantic and pragmatic tiers remain for future work.
