Regional varieties and diachronic changes in Chinese political discourse
R. Hou, C. Huang, et al.
The study investigates how language, viewed as a complex adaptive system, adapts over time (diachronic change) and across regions (synchronic variation). Drawing on the Menzerath-Altmann (MA) law—which posits that larger linguistic constructs tend to have shorter constituents—the authors model relationships at two immediate constituency levels (sentence–clause and clause–word) in Chinese political speeches from Hong Kong (HK) and the People’s Republic of China (PRC). They treat the MA parameters a and b as system-level characteristics that may remain stable within a given system but differ across systems and time. The paper poses two research questions: (RQ1) whether the MA law can model both regional variations and diachronic changes in PRC and HK political speeches; and (RQ2) whether the fitted parameters can differentiate variations and changes across time and region. The motivation includes the need for a unified model capturing interactions between language change and variation, and the potential to relate linguistic patterns to broader social dynamics.
The paper reviews synchronic and diachronic research on political discourse: synchronic analyses of stylistic differences (e.g., Savoy 2018; Yu 2013; Randour et al. 2020) and diachronic studies of ideology, language change, and register (e.g., Burgers & Ahrens 2020; Jiang et al. 2020; Kubát & Čech 2016). It discusses the interplay between variation and change, noting that variation can lead to change and vice versa (Thomason 1997; Xu et al. 2022; Zampieri & Nakov 2021). The authors then summarize quantitative laws in linguistics, focusing on MA law applications across levels (sentence–clause, clause–word) and on related models (Zipf’s law). Prior work validates MA across languages and levels (Köhler 1982, 1984, 1989; Buk & Rovenchak 2008; Benešová 2016; Xu & He 2020; Jiang & Ma 2020; Berdicevskis 2021; Mačutek et al. 2021; Tuldava 1995; Hou et al. 2017, 2019a, 2019b, 2020a, 2020b; Chen & Liu 2022), including in Chinese. These studies suggest that intermediate levels may exist between the standard units, that MA fits are sensitive to style and register, and that lower-level correlations may be more informative. This background motivates testing whether MA can jointly capture diachronic change and synchronic variation and how different constituency levels reflect these effects.
Corpora: The HKBU Corpus of Political Speeches was used. The dataset includes: (a) HK policy addresses by Colonial Governors (1984–1996) and SAR Chief Executives (1997–2014); (b) PRC Premier’s Reports on the Work of the Government (1984–2013). Three comparable periods were sampled (~5-year spans about a decade apart): HK (1984–1988, 1997–2001, 2010–2014); PRC (1984–1988, 1997–2001, 2009–2013). Analyses focus on HK speeches written in Chinese to compare with PRC Chinese.
Units and measurements: In written Chinese, sentences are operationally defined by terminal punctuation (periods, exclamation marks, question marks). Sentence length is measured as the number of clauses. Clauses are identified by internal punctuation (commas, semicolons, colons); a sentence with no internal punctuation has the minimum length of 1 clause. Clause length is the number of words in the clause; word length is the number of syllables, which in formal written Chinese is, by general consensus, equal to the number of characters. Word segmentation follows the HKBU corpus.
Handling headlines: Many HK one-clause “sentences” are headlines composed of phrases, not full sentences. Treating these as sentences violates MA at the sentence–clause level. The authors therefore treat HK headlines as stand-alone clauses, not sentences, when fitting sentence–clause relations; PRC headlines are mostly full sentences and are kept as sentences.
Model and fitting: The MA standard form y = a x^b was fitted for two immediate constituency relations: sentence–clause (y = average clause length; x = sentence length in clauses) and clause–word (y = average word length; x = clause length in words). To stabilize estimation, the model was linearized: ln(y) = ln(a) + b ln(x). Goodness-of-fit was assessed with R² (≥0.9 very good; >0.8 good; >0.75 acceptable). Texts were represented by their fitted parameters a and b and plotted in 2D; t-tests and hierarchical clustering (Euclidean distance, Ward’s method) assessed separability.
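The segmentation rules and the log-linear fit described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names (`split_sentences`, `split_clauses`, `fit_ma`, `grade_fit`), the exact punctuation sets, and the choice to compute R² on the log-transformed values are assumptions, and the "poor" label for R² ≤ 0.75 is added here only for completeness.

```python
import math
import re

# Assumed punctuation sets (full-width plus ASCII variants); the corpus may differ.
TERMINAL = "。！？!?"
INTERNAL = "，；：,;:"


def split_sentences(text):
    """Split text into sentences at terminal punctuation."""
    parts = re.split(f"[{re.escape(TERMINAL)}]+", text)
    return [p.strip() for p in parts if p.strip()]


def split_clauses(sentence):
    """Split one sentence into clauses at internal punctuation.

    A sentence without internal punctuation yields exactly one clause.
    """
    parts = re.split(f"[{re.escape(INTERNAL)}]+", sentence)
    return [p.strip() for p in parts if p.strip()]


def fit_ma(xs, ys):
    """Fit y = a * x**b by least squares on ln(y) = ln(a) + b * ln(x).

    Returns (a, b, r2), where r2 is computed in log space (an assumption).
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    b = sxy / sxx
    ln_a = mean_y - b * mean_x
    ss_res = sum((v - (ln_a + b * u)) ** 2 for u, v in zip(lx, ly))
    ss_tot = sum((v - mean_y) ** 2 for v in ly)
    return math.exp(ln_a), b, 1.0 - ss_res / ss_tot


def grade_fit(r2):
    """Map R² (as a fraction) to the thresholds quoted in the text."""
    if r2 >= 0.9:
        return "very good"
    if r2 > 0.8:
        return "good"
    if r2 > 0.75:
        return "acceptable"
    return "poor"  # label not from the paper


# Synthetic points generated from one reported parameter pair, not corpus data.
lengths = list(range(2, 9))
means = [10.896 * x ** -0.223 for x in lengths]
a, b, r2 = fit_ma(lengths, means)
print(round(a, 3), round(b, 3), grade_fit(r2))
```

On real data, `xs` would be the observed sentence lengths (in clauses) or clause lengths (in words), and `ys` the corresponding mean constituent lengths.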
Data size: Across PRC and HK, counts were 16,677 sentences, 44,743 clauses, and 333,273 word tokens (Appendix).
Sentence–clause level:
- HK results (after treating headlines as clauses, not sentences) conform to MA with good to very good fits:
  • 1984–1988: a = 10.581, b = −0.237, R² = 87.69%
  • 1997–2001: a = 9.707, b = −0.225, R² = 89.59%
  • 2010–2014: a = 10.361, b = −0.227, R² = 93.76%
  Average clause length decreases with sentence length (2–8 clauses), consistent with MA once headlines are excluded as sentences.
- PRC results conform to MA with very good fits and show clear diachronic change:
  • 1984–1988: a = 10.896, b = −0.223, R² = 97.83%
  • 1997–2001: a = 8.623, b = −0.191, R² = 93.82%
  • 2009–2013: a = 8.307, b = −0.122, R² = 92.74%
  Parameter a decreases and b increases over time, indicating weaker decrease of average clause length with longer sentences in later PRC periods. T-tests show significant differences in a between PRC period 1 vs periods 2–3 (means 11.092 vs 8.559; p ≪ 0.05) and in b between period 3 and periods 1–2. The first-period PRC speeches are closer to HK; later PRC periods diverge.
- Additional per-year fits: The poor fit for PRC 1985 (R² ≈ 53%) is likely due to the small number of sentences (n = 272). Some third-period PRC speeches (2010, 2012, 2013) show lower per-text R² (≈62–68%), possibly reflecting an increasingly colloquial style.
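To make the effect of b concrete, the sketch below plugs the period-level parameters listed above into y = a x^b and compares the predicted mean clause length of an eight-clause sentence with that of a two-clause sentence. The helper names and the choice of 2 and 8 clauses (the range mentioned for HK) are illustrative assumptions; the ratio depends only on b, since y(8)/y(2) = 4^b.

```python
# Period-level sentence–clause parameters (a, b) as listed in the summary above.
SENTENCE_CLAUSE = {
    "HK 1984-1988": (10.581, -0.237),
    "HK 1997-2001": (9.707, -0.225),
    "HK 2010-2014": (10.361, -0.227),
    "PRC 1984-1988": (10.896, -0.223),
    "PRC 1997-2001": (8.623, -0.191),
    "PRC 2009-2013": (8.307, -0.122),
}


def predicted_clause_length(a, b, n_clauses):
    """Mean clause length predicted by y = a * x**b."""
    return a * n_clauses ** b


def shrink_ratio(b, short=2, long=8):
    """y(long) / y(short) = (long / short)**b; parameter a cancels out."""
    return (long / short) ** b


for label, (a, b) in SENTENCE_CLAUSE.items():
    y2 = predicted_clause_length(a, b, 2)
    y8 = predicted_clause_length(a, b, 8)
    print(f"{label}: y(2)={y2:.2f}  y(8)={y8:.2f}  ratio={shrink_ratio(b):.3f}")
```

Under these fitted curves, mean clause length in an eight-clause sentence is roughly 27% shorter than in a two-clause sentence for PRC 1984–1988 but only about 16% shorter for PRC 2009–2013, while the three HK ratios stay within one to two percentage points of each other.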
Clause–word level:
- Both regions and all periods generally conform to MA with good fits:
  HK:
  • 1984–1988: a = 2.022, b = −0.088, R² = 83.52%
  • 1997–2001: a = 2.122, b = −0.098, R² = 87.96%
  • 2010–2014: a = 2.128, b = −0.085, R² = 88.81%
  PRC:
  • 1984–1988: a = 2.303, b = −0.132, R² = 85.00%
  • 1997–2001: a = 2.291, b = −0.114, R² = 89.38%
  • 2009–2013: a = 2.521, b = −0.136, R² = 82.18%
- PRC has a larger a (longer average words for the same clause length) and a more negative b (stronger decrease of average word length with longer clauses) than HK. An outlier in PRC 2009–2013 at 3-word clauses reflects frequent 1-character words in address and time expressions.
- Per-speech fits: Two weaker clause–word fits are noted (PRC 2013, R² ≈ 63.75%; HK 1986, R² ≈ 64.62%).
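The combination of a larger a and a more negative b in PRC can be illustrated by evaluating both regions' fitted curves at the same clause length. The sketch below uses the period-level parameters listed above; the pairing of HK 2010–2014 with PRC 2009–2013 as the "third" period, the helper names, and the clause lengths 1 and 10 are illustrative assumptions rather than values taken from the paper.

```python
# Period-level clause–word parameters (a, b) as listed in the summary above.
CLAUSE_WORD = {
    "first": {"HK": (2.022, -0.088), "PRC": (2.303, -0.132)},
    "second": {"HK": (2.122, -0.098), "PRC": (2.291, -0.114)},
    "third": {"HK": (2.128, -0.085), "PRC": (2.521, -0.136)},
}


def predicted_word_length(a, b, n_words):
    """Mean word length (in characters) predicted by y = a * x**b."""
    return a * n_words ** b


def regional_gap(period, n_words):
    """Predicted PRC mean word length minus HK mean word length."""
    hk_a, hk_b = CLAUSE_WORD[period]["HK"]
    prc_a, prc_b = CLAUSE_WORD[period]["PRC"]
    return (predicted_word_length(prc_a, prc_b, n_words)
            - predicted_word_length(hk_a, hk_b, n_words))


for period in CLAUSE_WORD:
    print(period,
          f"gap at 1 word: {regional_gap(period, 1):.3f}",
          f"gap at 10 words: {regional_gap(period, 10):.3f}")
```

In all three periods the predicted PRC words remain longer than the HK ones at both clause lengths, but the gap narrows as clauses lengthen, which is what a more negative b implies.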
Separation of varieties and periods:
- Sentence–clause parameters: HK shows little diachronic change; PRC shows significant change, with later periods diverging from HK; the earliest PRC period is closest to HK. Later PRC b values suggest weakening clause linkage, possibly due to colloquialization.
- Clause–word parameters: Clear regional separation; t-tests show significant group differences in a and b between PRC and HK. Hierarchical clustering largely separates regions, with a small mixed cluster (2 HK, 5 PRC speeches) indicating partial overlap.
- Overall, parameters a and b can differentiate both regional variation and diachronic change; the clause–word level appears especially sensitive. A negative correlation between a and b is observed.
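The separability analysis can be sketched with a small agglomerative clustering that uses Ward's criterion on Euclidean distances, plus a Pearson correlation between a and b. Note the assumptions: the paper clusters per-speech parameters, which are not reproduced in this summary, so the six period-level clause–word pairs stand in as toy input; the features are left unstandardized; and `ward_clusters` and `pearson` are hypothetical helper names, not the authors' implementation.

```python
import math

# Period-level clause–word parameters (a, b); per-speech values are not available here.
POINTS = {
    "HK 1984-1988": (2.022, -0.088),
    "HK 1997-2001": (2.122, -0.098),
    "HK 2010-2014": (2.128, -0.085),
    "PRC 1984-1988": (2.303, -0.132),
    "PRC 1997-2001": (2.291, -0.114),
    "PRC 2009-2013": (2.521, -0.136),
}


def ward_clusters(points, k):
    """Merge clusters until k remain, minimizing Ward's increase in
    within-cluster sum of squares: n_i * n_j / (n_i + n_j) * ||c_i - c_j||**2."""
    # Each cluster is (member labels, centroid, size).
    clusters = [([label], list(xy), 1) for label, xy in points.items()]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                _, ci, ni = clusters[i]
                _, cj, nj = clusters[j]
                dist2 = sum((u - v) ** 2 for u, v in zip(ci, cj))
                cost = ni * nj / (ni + nj) * dist2
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        mi, ci, ni = clusters[i]
        mj, cj, nj = clusters[j]
        centroid = [(ni * u + nj * v) / (ni + nj) for u, v in zip(ci, cj)]
        merged = (mi + mj, centroid, ni + nj)
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return [sorted(members) for members, _, _ in clusters]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


groups = ward_clusters(POINTS, 2)
r = pearson([a for a, _ in POINTS.values()], [b for _, b in POINTS.values()])
print(groups)
print(f"correlation between a and b: {r:.2f}")
```

With these six toy points, a two-cluster cut puts the three HK periods in one group and the three PRC periods in the other, and the correlation between a and b is strongly negative. Because a spans a much wider numeric range than b, the unstandardized distance is driven mostly by a; standardizing the features first would be a reasonable variation.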
Findings address RQ1 by showing that the MA law models both regional variation and diachronic change: the sentence–clause relation captures strong diachronic shifts in PRC (a↓, b↑ over time) and minimal change in HK, while the clause–word relation robustly captures both regional differences and temporal trends. For RQ2, the fitted parameters (a, b) distinguish the two regional varieties and periods: earliest PRC aligns closer to HK; later PRC diverges, indicating compounding effects where rapid PRC diachronic change increases inter-variety distance. The rise of b toward zero in later PRC implies weaker clause linkage consistent with colloquialization. At the clause–word level, larger a and more negative b in PRC suggest longer words but stronger pressure toward shortening as clauses lengthen, possibly reflecting disyllabization and efficiency pressures. The convergence of evidence across levels supports viewing language as a self-organizing complex system wherein parameters encode equilibrium states and adaptations. Different constituency levels capture different aspects of systemic change, with clause–word relations providing greater sensitivity than sentence–clause relations.
The study is the first to model, on a comparable dataset, both diachronic changes and synchronic regional variation in Chinese political discourse using the MA law. Key contributions include: (1) demonstrating that MA parameters a and b characterize complex system equilibria capable of distinguishing PRC vs HK varieties and tracking PRC diachronic changes; (2) showing that proper treatment of HK headlines (as clauses, not sentences) restores MA compliance at the sentence–clause level; (3) identifying stronger diachronic change in PRC than HK, leading to increasing divergence over time; (4) evidencing that clause–word correlations are highly informative and often more sensitive than sentence–clause correlations. Future work could test whether shared diachronic tendencies generalize to other languages/varieties, further examine the observed negative linear relation between a and b, and refine intermediate-unit modeling between clause and word when appropriate.
Some fits are suboptimal: the PRC 1985 sentence–clause fit (R² ≈ 53%) is likely weak because of a small sample (272 sentences), and the weaker per-speech fits for PRC 2010, 2012, and 2013 suggest that colloquialization affects the sentence–clause MA relation. Two clause–word fits are relatively low (PRC 2013, HK 1986; R² ≈ 63–65%). The study notes that in some languages an intermediate unit may exist between clause and word; if such a unit were relevant in Chinese, it could affect clause–word modeling. HK headline formatting required special handling (treating headlines as clauses) to satisfy MA expectations at the sentence–clause level.