Linguistics and Languages
Dated phylogeny suggests early Neolithic origin of Sino-Tibetan languages
H. Zhang, T. Ji, et al.
Sino-Tibetan is the world’s second-largest language family (around 500 languages, ~1.4 billion speakers), spanning East and South Asia. Complex migration histories and contact with neighboring families have made it difficult for traditional comparative linguistics to resolve its origins and internal relationships. Competing hypotheses include: (1) a Northern China (Yellow River) origin associated with early farming dispersals, placing Sinitic as the primary branch near the root; (2) an Eastern Himalayan origin among highly diverse forager populations, predicting a rake-like topology and close Sinitic–Bodish affinity; and (3) a Sichuan origin with a deep north–south split. Bayesian phylogenetic methods applied to basic vocabulary cognates allow formal tests of these scenarios and are relatively robust to borrowing. Two recent Bayesian studies supported Sinitic near the root but differed in divergence timing and some internal relationships. This study reconstructs a dated phylogeny using a larger, more diverse lexical dataset (131 languages, 110 basic concepts) with historical calibrations to assess the outgroup, divergence times, and internal topology, and to evaluate how language diversification relates to Neolithic demographic and environmental changes in East Asia.
Traditional and recent proposals differ on the homeland and topology of Sino-Tibetan. The Northern China early farming dispersal hypothesis links the family’s origin to millet agriculture in the Yellow River basin and places Sinitic as the primary branch. Alternatives posit an Eastern Himalayan or Sichuan origin, with predictions that include a rake-like topology, a close Sinitic–Bodish relationship, and Kuki-Karbi as a potential outgroup. Two independent Bayesian studies (2019) using different datasets agreed on a primary Sinitic branch but estimated different root ages (~5.9 vs ~7.2 kya) and disagreed on subgroup relationships. These discrepancies motivated a replication with broader sampling and independent data to test the robustness of key phylogenetic features and refine timing estimates.
Data: 131 Sino-Tibetan languages with cognate judgments for 110 basic vocabulary items (Swadesh 100 plus 10 added concepts), taken from the Tower of Babel database (Peiros & Starostin). Bai was excluded due to high borrowing, and a duplicate Southern Chinese variety was removed. Loanwords flagged in the source were excluded. Multistate cognate classes per concept were converted to binary presence/absence characters across languages, yielding 1,726 binary cognate sets.
Geography: Language coordinates came from the World Language Mapping System and Ethnologue; distributions were plotted with Python (Plotly).
Inference framework: Bayesian MCMC in BEAST2 on the binary cognate matrix. Three models of cognate evolution were compared: (i) an asymmetric 0↔1 continuous-time Markov model (m1p), (ii) m1p with gamma-distributed rate heterogeneity, and (iii) m1p with a binary covarion (CV) component. Transition rates had exponential priors (mean=10).
Tree prior: fossilized birth–death, to accommodate time-structured data including extinct taxa and sampled ancestors, with a uniform prior on [0, 1] for the sampling proportion.
Clock models: strict and uncorrelated relaxed lognormal clocks (µ=1.0, σ=0.1), with the relaxed clock preferred.
Calibrations: Six calibrations from historical records: extinction (last-attested) times for Old Chinese, Padam, and Shaiyang, and three internal node maxima set earlier than the earliest written attestations of distinct descendant groups (e.g., MRCA constraints for Lolo-Burmese, Pumi, Naxi). Calibrations were based solely on empirical records, not on traditional linguistic theories.
Model selection: Stepping-stone sampling (100 steps; 1,000,000 samples/step) was used to estimate log marginal likelihoods. The relaxed-clock m1p+CV model had the best fit (log Bayes factors in Table S2) and was used for final inferences.
MCMC: At least five independent chains were run, each with 5,000,000 burn-in iterations followed by 50,000,000 sampling iterations, sampling every 50,000 iterations.
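The recoding of multistate cognate classes into binary presence/absence characters, mentioned in the data description, can be illustrated with a minimal Python sketch. The toy concepts, language names, and the `binarize` helper are hypothetical, and coding missing or excluded entries as "?" is an assumption made here for illustration, not a detail confirmed by the study.

```python
# Hypothetical toy data: concept -> {language: cognate class label}.
# None marks a missing entry or an excluded loanword (an assumption of this sketch).
cognates = {
    "hand": {"LangA": 1, "LangB": 1, "LangC": 2},
    "sun": {"LangA": 1, "LangB": 2, "LangC": None},
}


def binarize(cognates):
    """Turn each (concept, cognate class) pair into one binary character."""
    languages = sorted({lang for classes in cognates.values() for lang in classes})
    columns = []
    matrix = {lang: [] for lang in languages}
    for concept in sorted(cognates):
        classes = cognates[concept]
        # One binary column per cognate class attested for this concept.
        labels = sorted({c for c in classes.values() if c is not None})
        for label in labels:
            columns.append((concept, label))
            for lang in languages:
                value = classes.get(lang)
                if value is None:
                    matrix[lang].append("?")  # unknown state
                else:
                    matrix[lang].append("1" if value == label else "0")
    return columns, {lang: "".join(states) for lang, states in matrix.items()}


columns, rows = binarize(cognates)
for lang, states in rows.items():
    print(lang, states)
```

Each concept contributes as many binary columns as it has cognate classes, which is how 110 concepts can expand into a much larger number of binary cognate sets.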
Posterior summary: A posterior sample of n=900 trees (low autocorrelation, convergence diagnostics satisfied) was summarized as a Maximum Clade Credibility tree with TreeAnnotator v2.6.0.
Outgroup tests: Initial analyses without monophyly constraints identified Sinitic as the most probable outgroup (posterior 80.13%), with the next candidate (Sinitic+Sal+Tani+Kiranti+Kho-Bwa) at 14.32%. Constraining Sinitic as the outgroup improved fit (Bayes factor 20.18). Subsequent analyses fixed Sinitic as the outgroup; no further monophyly constraints were imposed.
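Posterior percentages for a grouping are, in a Bayesian tree sample, the share of sampled trees in which that grouping appears. The sketch below shows this counting on an invented four-tree sample. The tree representation (each tree as a set of taxon sets) and the `clade_support` helper are assumptions for illustration, not the study's pipeline.

```python
def clade_support(trees, clade):
    """Fraction of sampled trees that contain `clade` as a monophyletic group."""
    target = frozenset(clade)
    hits = sum(1 for tree in trees if target in tree)
    return hits / len(trees)


# Hypothetical toy posterior: each tree is the set of clades (taxon sets) it contains.
trees = [
    {frozenset({"A", "B"}), frozenset({"A", "B", "C"})},
    {frozenset({"A", "B"}), frozenset({"A", "B", "D"})},
    {frozenset({"B", "C"}), frozenset({"A", "B", "C"})},
    {frozenset({"A", "B"}), frozenset({"C", "D"})},
]

print(clade_support(trees, {"A", "B"}))       # 0.75
print(clade_support(trees, {"A", "B", "C"}))  # 0.5
```

In practice the clade sets would be extracted from the sampled trees with a phylogenetics library; that step is left out here to keep the counting logic visible.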
- The dated phylogeny supports a primary split between Sinitic and Tibeto-Burman clades, replicating previous findings, with Sinitic as the outgroup (unconstrained posterior 80.13%; constrained model preferred, BF=20.18).
- Mean root age (initial divergence) estimated at 7,983 years BP (95% HPD: 4,778–11,285 BP), earlier than previous estimates (~5,871 and ~7,184 BP), and coincident with the onset of millet-based agriculture (ca. 8,100–7,700 BP) and regional climatic shifts to warm-wet conditions.
- Strong posterior support (>95%) for ten subgroups: Lolo-Burmese, Qiangic, Bodish, Naga, Kuki-Karbi, Karenic, Kho-Bwa, Sal, Tani, and Kiranti.
- Kiranti forms a distinct subgroup (posterior=0.92) and is unlikely to share an immediate ancestor with Magar, Kham, and Chepang (posterior=0.02).
- Little support for a Tani–Idu–Taraon subgroup (posterior=0.32).
- Lolo-Burmese and Qiangic are closely related (posterior=0.95); weak support for a Bodish + Lolo-Qiangic subclade (posterior=0.40).
- Sal languages (Bodo, Konyak, Jingpo) form a coherent subgroup, but there is little evidence that Sal is a separate primary branch from all other Tibeto-Burman.
- No support for Tibeto-Dulong, Tani-Idu, or Tibeto-Gyalrongic as independent subgroups.
- No evidence for a Kuki-Karbi outgroup or a Sino-Bodish lower-level subgroup; topologies predicted by the Eastern Himalayan and Sichuan origin hypotheses are not supported.
- Chronological inferences align with archaeological and genetic evidence: early Bodish presence ca. 5,000 BP; Bodish diversification ca. 3,600 BP; proto-Tibeto-Burman demographic expansions ca. 4.2–7.5 kya.
- The earlier root age suggests initial divergence among growing populations of incipient agriculturalists rather than a single large-scale migration of specialized farmers.
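The root age above is reported as a mean with a 95% highest posterior density (HPD) interval, that is, the narrowest interval containing 95% of the posterior samples. A minimal sketch of that calculation follows. The `hpd_interval` helper and the ten sample values are invented for illustration and are not the study's posterior.

```python
import math


def hpd_interval(samples, mass=0.95):
    """Narrowest interval that contains a fraction `mass` of the samples."""
    xs = sorted(samples)
    n = len(xs)
    k = max(1, math.ceil(mass * n))  # number of samples the interval must cover
    # Slide a window of k consecutive sorted samples and keep the narrowest one.
    best = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[best], xs[best + k - 1]


# Invented toy ages in years BP; a real analysis would use the MCMC samples.
toy_ages = [4800, 6900, 7400, 7800, 8000, 8300, 8700, 9500, 10400, 11200]

print(hpd_interval(toy_ages, mass=0.5))  # (7400, 8700)
print(sum(toy_ages) / len(toy_ages))     # posterior mean of the toy sample
```

Unlike an equal-tailed interval, the HPD interval is not forced to be symmetric around the mean, which matters for skewed age posteriors.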
The analyses address the origins and internal relationships of Sino-Tibetan by leveraging a larger, more diverse lexical dataset and historical calibrations within a Bayesian framework. The inferred early Neolithic root age (~8 kya) and primary Sinitic–Tibeto-Burman split support an origin in northern China, consistent with prior studies but earlier than their estimates. The timing coincides with the onset of millet agriculture and a transition to warmer, wetter Holocene conditions, as well as archaeological signals of sedentism and social complexity, implying that language diversification likely began among expanding, increasingly sedentary communities before extensive demic dispersals. Topologically, the study rejects predictions from the Eastern Himalayan and Sichuan origin hypotheses (e.g., a Sino-Bodish subgroup or Kuki-Karbi as outgroup) and provides strong support for multiple well-defined subgroups, while recognizing limited resolution among some Tibeto-Burman clades, which is expected given the broad sampling of languages distantly related to Sinitic. The results suggest a mosaic evolutionary history with multiple waves of movement and varying degrees of demic and cultural diffusion, rather than a single major Neolithic migration. The Himalayan region’s rugged geography and social structures likely promoted parallel evolution and high ethnolinguistic diversity, aligning with the finding that Himalayan subgroups evolved largely independently despite proximity. Overall, the findings refine the temporal framework for Sino-Tibetan diversification, align linguistic phylogeny with archaeological and genetic evidence, and encourage a nuanced view of the early farming dispersal in East Asia, in which agricultural practices and demography matured over millennia before sustained expansions.
This study provides a replicated, time-calibrated phylogeny of the Sino-Tibetan family using an expanded language sample and empirically grounded historical calibrations. It corroborates an early bifurcation into Sinitic and Tibeto-Burman, supports ten major subgroups, and dates the family’s initial divergence to ~8,000 BP, which is earlier than previous estimates and contemporaneous with early millet agriculture and Holocene climatic shifts. These results favor a nuanced early farming dispersal model in East Asia, emphasizing gradual demographic growth and mixed demic–cultural processes. Future research should: (1) refine calibrations and expand lexical and structural datasets, especially for underrepresented Himalayan and peripheral groups; (2) integrate linguistic, archaeological, and genomic data in joint models; and (3) apply the reconstructed phylogeny to comparative studies of cultural evolution (e.g., kinship, subsistence), particularly in the linguistically diverse Himalayan region.
- Branching order among several Tibeto-Burman clades shows low resolution, reflecting the inherent uncertainty of deep-time reconstruction and the broad sampling of taxa distantly related to Sinitic.
- Calibrations rely on historical records of last attestation and earliest mentions, which can introduce uncertainty in node age constraints.
- Borrowing was filtered using source annotations, but undetected horizontal transmission may persist and affect signal.
- Exclusion of Bai and one redundant Sinitic variety may influence overall topology for certain regions.
- Model choice (binary covarion, relaxed clock, fossilized birth–death) captures key processes but remains an approximation; alternative modeling assumptions could shift some estimates.
- The interpretation of the root age as reflecting subgroup formation rather than migration events underscores the need for caution when linking linguistic divergence directly to archaeological dispersal episodes.

