Inferring language dispersal patterns with velocity field estimation

S. Yang, X. Sun, et al.

Sizhe Yang, Xiaoru Sun, Li Jin, and Menghan Zhang present language velocity field estimation (LVF), a method for tracing the paths of language spread across time and space. Their research links language dispersal to agricultural expansions over the last 10,000 years.

Introduction

The study addresses how to reconstruct spatial evolution and dispersal patterns of languages in a way that accommodates both vertical divergence and horizontal contact (e.g., borrowing, areal diffusion). Over the last 10,000 years, demic diffusions and cultural spreads accompanied agricultural intensification and language dispersals. While genetic and archaeological advances have illuminated demic and cultural dynamics, aligning linguistic evidence—particularly spatial dynamics—with these fields remains challenging. Phylogeographic approaches project phylogenetic trees into geographic space, but family-tree models capture only vertical divergence and often fail in contexts with significant contact. The research introduces a phylogeny-free method, Language Velocity Field estimation (LVF), to infer dispersal trajectories and centres directly from linguistic trait dynamics coupled with geography. The study validates LVF on simulated benchmarks and applies it to four agricultural language families/groups (Indo-European, Sino-Tibetan, Bantu, Arawak) to test alignment with known demic and cultural diffusions, and compares LVF to phylogeographic and other baseline methods.

Literature Review

Phylogeographic approaches commonly used in linguistic spatial reconstructions involve: (1) building a language phylogenetic tree from linguistic traits to represent vertical divergence and (2) projecting this tree into geographic space, leveraging correlations between linguistic relatedness and geographic proximity. However, family-tree models neglect horizontal contact such as borrowing and areal diffusion, a pervasive phenomenon across lexicon, grammar, and phonology, especially in multilingual areas. Recent advances in velocity field estimation in other domains (atmospheric circulation, RNA velocity, human mobility, cultural diffusion) offer a framework for modelling continuous trajectories without strict tree constraints. Linguistic phylogenetic studies have provided temporal evidence for language evolution and examined hypotheses like farming/language dispersal, but spatial evolution remains difficult, partly because interdisciplinary evidence is fragmented and alignment is hard. The paper situates LVF as an alternative that can incorporate both divergence and contact and thus complements or extends phylogeographic methods when tree-likeness is weak.

Methodology

Data and preprocessing: Public lexical datasets for four language families/groups were used: Indo-European (103 languages, 5,995 binary-coded cognate traits), Sino-Tibetan (109 languages, 949 traits), Bantu (420 languages, 3,859 traits), and Arawak (60 languages, 693 traits). Each language has geographic coordinates. Traits with >75% missing values were removed; remaining missing values were imputed via mode-value imputation (robustness assessed via cosine similarity and Procrustes analysis). Binary-coded traits were converted to continuous state frequencies in [0,1] using k-nearest neighbours (k=10) over geographically nearest languages, computing the local frequency of state 1 per trait.
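
The k-nearest-neighbour conversion described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the haversine distance, and the choice to count each language among its own k neighbours are all assumptions.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

def knn_state_frequencies(traits, coords, k=10):
    """Local frequency of state 1 per trait.

    traits: (n_languages, n_traits) array of 0/1 values (already imputed).
    coords: (n_languages, 2) array of (latitude, longitude) in degrees.
    Assumption: each language is included in its own neighbourhood.
    """
    traits = np.asarray(traits, dtype=float)
    coords = np.asarray(coords, dtype=float)
    lat, lon = coords[:, 0], coords[:, 1]
    dist = haversine_km(lat[:, None], lon[:, None], lat[None, :], lon[None, :])
    k = min(k, len(traits))
    nearest = np.argsort(dist, axis=1)[:, :k]   # indices of the k closest languages
    return traits[nearest].mean(axis=1)         # shape (n_languages, n_traits)
```

Each output value lies in [0,1], so a trait present in 7 of a language's 10 nearest neighbours is recorded as 0.7 for that language.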

LVF step 1: estimating trait-dynamics velocity in linguistic (PC) space. Linguistic relatedness among languages was represented by Euclidean distances in a two-dimensional PCA space derived from binary traits (PC1, PC2). A dynamic model, inspired by the Abrams–Strogatz framework and akin to a covarion model, was specified to reconstruct past states of each trait for each language. It assumes (i) traits undergo multiple state transitions with heterogeneous rates, (ii) neighbouring languages influence trait variation (contact/competition), and (iii) states have sociolinguistic prestige (inheritance probability). Prestige parameters were estimated under a Poisson-process substitution analogy (λ=1 by default; state equilibrium π set to observed global trait-state frequencies). Using the analytical solution, past state frequencies x_i^l(-m) were reconstructed for each trait i in language l at reconstruction time m (default m=1). High-dimensional velocity vectors per language were computed as the difference between present and reconstructed past state frequencies divided by m. These velocities were projected into the two-dimensional PC space by mapping present and past states through the PCA loadings, yielding a velocity field in PC space that visualises diachronic trajectories shaping observed relatedness.
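
The last two operations of this step, the finite-difference velocity and its projection through the PCA loadings, can be sketched as below. The reconstruction of past frequencies by the dynamic model is not reproduced here, so `x_past` is simply taken as an input. The function names and the SVD-based PCA are illustrative assumptions.

```python
import numpy as np

def fit_pca(binary_traits, n_components=2):
    """PCA on the binary trait matrix via SVD; returns column means and loadings."""
    X = np.asarray(binary_traits, dtype=float)
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    loadings = vt[:n_components].T            # shape (n_traits, n_components)
    return mean, loadings

def pc_velocity(x_present, x_past, mean, loadings, m=1.0):
    """Velocity in trait space and in PC space.

    x_present, x_past: (n_languages, n_traits) state frequencies in [0, 1].
    x_past is assumed to come from the dynamic model (not implemented here).
    """
    x_present = np.asarray(x_present, dtype=float)
    x_past = np.asarray(x_past, dtype=float)
    v_high = (x_present - x_past) / m         # high-dimensional velocity
    pc_present = (x_present - mean) @ loadings
    pc_past = (x_past - mean) @ loadings
    v_pc = (pc_present - pc_past) / m         # velocity in the 2-D PC space
    return v_high, pc_present, v_pc
```

Because the projection is linear, `v_pc` equals `v_high @ loadings`: mapping the present and past states separately and projecting the velocity directly give the same vectors.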

LVF step 2: projecting to geographic space. A kernel projection mapped PC-space velocities to geographic vectors for each language. For each language l, its geographic velocity V_Geo^l is a weighted sum over nearby languages in PC space, with weights P_lj reflecting correlations between V_PC^l and PC differences (PC_l − PC_j), scaled by geographic proximity. The resulting geographic velocity field indicates directions from which languages diffused to their current locations. Spatial smoothing adjusted vector lengths (not directions) to match PC-space magnitudes and incorporate local kernel-weighted averaging. Grid smoothing interpolated vectors onto regular geographic grids for uniform visualisation and to cover areas with sparse sampling.
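
One way to picture the kernel projection is the correlation-weighted scheme used for velocity embeddings in other fields. The sketch below is an assumption-heavy illustration rather than the paper's kernel: the exponential weighting, the `bandwidth` and `n_neighbors` parameters, the planar treatment of coordinates, and the subtraction of the uniform-weight average are all hypothetical choices. Geographic proximity enters only through unit displacement vectors, and the smoothing steps are omitted.

```python
import numpy as np

def geographic_velocity(pc, v_pc, geo, n_neighbors=10, bandwidth=0.1):
    """Project PC-space velocities onto geographic directions.

    pc, v_pc: (n_languages, 2) positions and velocities in PC space.
    geo: (n_languages, 2) planar coordinates, e.g. (lon, lat) (assumption).
    """
    pc, v_pc, geo = (np.asarray(a, dtype=float) for a in (pc, v_pc, geo))
    n = len(pc)
    k = min(n_neighbors, n - 1)
    v_geo = np.zeros_like(geo)
    for l in range(n):
        d_pc = np.linalg.norm(pc - pc[l], axis=1)
        d_pc[l] = np.inf                       # exclude the language itself
        nbrs = np.argsort(d_pc)[:k]            # nearest languages in PC space
        diff_pc = pc[l] - pc[nbrs]             # PC_l - PC_j, as in the text
        diff_geo = geo[l] - geo[nbrs]
        # Cosine correlation between the velocity and each PC difference.
        num = diff_pc @ v_pc[l]
        den = np.linalg.norm(diff_pc, axis=1) * np.linalg.norm(v_pc[l])
        cos = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
        w = np.exp(cos / bandwidth)
        w /= w.sum()                           # weights P_lj sum to 1
        norms = np.linalg.norm(diff_geo, axis=1, keepdims=True)
        unit = np.divide(diff_geo, norms, out=np.zeros_like(diff_geo),
                         where=norms > 0)
        # Weighted direction minus the direction expected under equal weights.
        v_geo[l] = w @ unit - unit.mean(axis=0)
    return v_geo
```

Under this convention a neighbour j whose PC position lies "behind" language l along its velocity receives a high weight, so the resulting arrow points from j's location towards l's location.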

Dispersal centre inference: Based on the grid-smoothed field, candidate centres were identified where surrounding vectors exhibit a radially outward pattern. For each grid point, vectors of nearby grid points were normalised and the average variance across dimensions was computed; the grid point with maximal average variance was taken as the dispersal centre. A radially outward pattern spreads the normalised vectors over all directions, which maximises this variance, whereas a locally uniform flow gives a variance near zero. Uncertainty was assessed via jackknife resampling over languages.
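
The variance criterion can be sketched in a few lines. This is an illustrative version under stated assumptions: the neighbourhood is a fixed `radius` in grid units, the grid point itself is excluded from its own neighbourhood, and the function name is hypothetical.

```python
import numpy as np

def dispersal_centre(grid_xy, grid_v, radius):
    """Return the grid point whose neighbours' unit vectors vary the most.

    grid_xy: (n_points, 2) grid coordinates.
    grid_v:  (n_points, 2) smoothed velocity vectors at those points.
    """
    grid_xy = np.asarray(grid_xy, dtype=float)
    grid_v = np.asarray(grid_v, dtype=float)
    norms = np.linalg.norm(grid_v, axis=1, keepdims=True)
    unit = np.divide(grid_v, norms, out=np.zeros_like(grid_v), where=norms > 0)
    scores = np.full(len(grid_xy), -np.inf)
    for i, p in enumerate(grid_xy):
        near = np.linalg.norm(grid_xy - p, axis=1) <= radius
        near[i] = False                        # assumption: skip the point itself
        if near.sum() >= 2:
            # Variance per dimension, averaged over the two dimensions.
            scores[i] = unit[near].var(axis=0).mean()
    return grid_xy[np.argmax(scores)], scores
```

For unit vectors in two dimensions the score is bounded by 0.5, reached when the neighbouring vectors cancel out on average, as they do around a point from which the field radiates. A jackknife estimate of uncertainty would repeat the whole pipeline with one language left out at a time and collect the resulting centres.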

Validation and comparisons: Simulated validation used 1,000 datasets (20 languages, 306 traits) from Wichmann & Rama (2021), each with known dispersal centre and random-walk geographic diffusion on a given phylogeny. Effectiveness was assessed by testing differences between inferred and true centres in longitude/latitude (Wilcoxon rank-sum). Robustness to parameters (k=2–18, λ=0.1–10, m=1–9) was evaluated via cosine similarity of velocity fields in high- and PC-space. Empirical robustness tested the same parameter ranges on the four families.
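
The robustness check compares a velocity field computed under default parameters with one computed under altered parameters. A minimal sketch of the per-language cosine similarity follows; the function name is an assumption, and how the similarities were aggregated and tested is not reproduced.

```python
import numpy as np

def field_cosine_similarity(v_a, v_b):
    """Cosine similarity between matching rows of two velocity fields.

    v_a, v_b: (n_languages, n_dims) arrays, e.g. fields obtained with
    k=10 and k=18. Rows with a zero-length vector yield NaN.
    """
    v_a = np.asarray(v_a, dtype=float)
    v_b = np.asarray(v_b, dtype=float)
    num = np.sum(v_a * v_b, axis=1)
    den = np.linalg.norm(v_a, axis=1) * np.linalg.norm(v_b, axis=1)
    return np.divide(num, den, out=np.full(len(num), np.nan), where=den > 0)
```

Values near 1 mean the two parameter settings produce arrows pointing the same way, regardless of their lengths.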

Comparative analyses: LVF was compared to a phylogeographic model (BayesTraits geographical model, PhyloG) and to three phylogeny-free baselines: Diversity (maximum local entropy over k-neighbourhood frequencies), Centroid (geometric centroid of language polygon), and Minimal Distance (language with minimal average geographic distance to others). Three relatedness matrices—overall (Manhattan distance on binary traits), PCA-based (Euclidean distance in PC space), and tree-based (pairwise phylogenetic distance)—were correlated via Mantel tests. Tree-likeness was assessed by delta scores.
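
A Mantel test correlates the off-diagonal entries of two distance matrices and obtains a p-value by permuting the languages of one matrix. The sketch below assumes a Pearson correlation and a one-sided test; the function name, the permutation count, and the seed are illustrative and not taken from the study.

```python
import numpy as np

def mantel(d1, d2, n_perm=999, seed=0):
    """Permutation Mantel test between two symmetric distance matrices.

    Returns (r, r_squared, p_value). Assumption: one-sided test for a
    positive association, with Pearson correlation on the upper triangle.
    """
    d1 = np.asarray(d1, dtype=float)
    d2 = np.asarray(d2, dtype=float)
    iu = np.triu_indices_from(d1, k=1)         # upper triangle, no diagonal
    r_obs = np.corrcoef(d1[iu], d2[iu])[0, 1]
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_perm):
        p = rng.permutation(len(d1))
        # Permute rows and columns together so d1 stays a valid matrix.
        r = np.corrcoef(d1[p][:, p][iu], d2[iu])[0, 1]
        hits += int(r >= r_obs)
    return r_obs, r_obs ** 2, (hits + 1) / (n_perm + 1)
```

Here `d1` might be the Manhattan distance on binary traits and `d2` either the Euclidean distance in PC space or the pairwise phylogenetic distance. With 999 permutations the smallest attainable p-value is 0.001.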

Key Findings

Simulations: Across 1,000 synthetic datasets, LVF inferred dispersal centres not significantly different from the true centres in longitude and latitude under default and varied parameters, demonstrating effectiveness. Velocity fields were robust to parameter changes (k, λ, m) with no significant differences by cosine similarity in both high-dimensional and PC spaces.

Empirical dispersal trajectories: LVF reconstructed trajectories for four agricultural language families/groups consistent with prior linguistic, genetic, and archaeological evidence. Indo-European expanded west into Europe and east into the Indian subcontinent; Sino-Tibetan expanded west to the Tibetan Plateau, south to mainland Southeast Asia, and east to coastal China; Bantu expanded east and south across sub-Saharan Africa; Arawak spread from the Amazon basin toward the Caribbean and across lowlands.

Inferred dispersal centres (adjacent to agricultural/Neolithic homelands):

  • Indo-European: Fertile Crescent (supports Anatolian origin over Pontic steppe hypothesis).
  • Sino-Tibetan: Gansu, China (within ranges of Yangshao and/or Majiayao Neolithic cultures; supports Northern/Yellow River origin).
  • Bantu: Southern Cameroon (near eastern Nigeria–western Cameroon agricultural homeland).
  • Arawak: Northern lowlands of Bolivia, upper Madeira River basin (a key homeland of ancient lowland South American agriculture).

Tree-likeness and method performance: Delta scores indicated significant deviation from tree topology for all four empirical datasets (Indo-European: 0.2656, p<0.001; Sino-Tibetan: 0.3324, p<0.001; Bantu: 0.3598, p<0.001; Arawak: 0.4129, p<0.001; baseline simulated one-sided 95% CI [0.1553, 0.1727]). When comparing dispersal centres, LVF and PhyloG showed similar performance in simulations (no significant lon/lat differences; mean absolute differences: Lon 1.55°, 95% CI [5×10⁻³, 3.55]; Lat 0.94°, 95% CI [4×10⁻³, 2.82]). In empirical data, significant differences were found for Sino-Tibetan (Lon diff 6.52°, p=0.020; Lat diff 5.12°, p=0.012) and Arawak (Lat diff 6.58°, p=0.007; Lon diff 3.19°, p=0.058) but not for Indo-European (Lon 1.61°, p=0.158; Lat 2.46°, p=0.069) or Bantu (Lon 0.76°, p=0.443; Lat 0.14°, p=0.878). Differences increased with higher delta scores, indicating greater divergence between LVF and phylogeographic results under stronger horizontal contact.
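
For readers unfamiliar with delta scores, the sketch below computes the quartet-based statistic commonly used to measure tree-likeness: 0 for a perfectly tree-like distance matrix, approaching 1 as the signal becomes less tree-like. It is assumed, not confirmed by this summary, that the study used this definition. The function name is hypothetical, and the significance test against simulated baselines is not reproduced.

```python
from itertools import combinations

import numpy as np

def delta_score(dist):
    """Mean quartet delta score of a symmetric distance matrix.

    For each quartet the three pairwise-sum combinations are sorted as
    m1 >= m2 >= m3 and delta = (m1 - m2) / (m1 - m3), or 0 if m1 == m3.
    """
    d = np.asarray(dist, dtype=float)
    if len(d) < 4:
        raise ValueError("at least four taxa are required")
    values = []
    for x, y, u, v in combinations(range(len(d)), 4):
        m1, m2, m3 = sorted(
            (d[x, y] + d[u, v], d[x, u] + d[y, v], d[x, v] + d[y, u]),
            reverse=True,
        )
        spread = m1 - m3
        values.append((m1 - m2) / spread if spread > 0 else 0.0)
    return float(np.mean(values))
```

On a tree the two largest sums of every quartet are equal, which is why the score is zero there. The loop visits every quartet, so this plain version is only practical for small matrices; a dataset with hundreds of languages would need quartet sampling or vectorisation.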

Explanatory power of relatedness representations (Mantel tests):

  • Simulated: PCA-based R²=0.90 (p=0.001) and tree-based R²=0.93 (p=0.001) both strongly correlated with overall relatedness.
  • Indo-European: PCA-based R²=0.37 (p=0.001); tree-based R²=0.39 (p=0.001).
  • Bantu: PCA-based R²=0.65 (p=0.001); tree-based R²=0.38 (p=0.001).
  • Sino-Tibetan: PCA-based R²=0.44 (p=0.001); tree-based R²=0.05 (p=0.160, ns).
  • Arawak: PCA-based R²=0.53 (p=0.001); tree-based R²=0.09 (p=0.057, ns).

These results indicate that LVF’s PCA-based relatedness better captures empirical relatedness where horizontal contact is strong, aligning more closely with genetic/archaeological dispersal evidence than phylogeography in those cases. LVF centres significantly differed from Diversity, Centroid, and Minimal Distance baselines, highlighting distinct theoretical underpinnings.

Discussion

The LVF framework addresses a central challenge in linguistic phylogeography: representing linguistic relatedness when horizontal contact substantially shapes languages. By modelling diachronic trait change and projecting a velocity field into geography via correlations between linguistic and spatial structure, LVF infers dispersal trajectories and centres without relying on a phylogenetic tree. The empirical reconstructions of Indo-European, Sino-Tibetan, Bantu, and Arawak trajectories and centres largely align with demic and cultural diffusion routes evidenced by ancient DNA and archaeology, indicating that agricultural language spreads tracked Holocene population movements and Neolithic cultural expansions.

Comparisons clarify method choice: where family-tree models aptly capture relatedness, LVF and phylogeographic approaches converge; where tree-likeness is weak (high delta scores), LVF more reliably reconstructs dispersal patterns, better supported by external evidence. The Mantel analyses further show that PCA-based distances can explain empirical relatedness where tree representations falter, particularly for Sino-Tibetan and Arawak. Methodologically, LVF’s phylogeny-free nature and accommodation of contact make it applicable beyond lexicon to structural features (grammar, phonology) that exhibit complex contact-driven evolution.

Overall, LVF complements phylogeography by relaxing tree-topology assumptions, enabling spatial inference under contact-rich histories, and offering a unifying framework to interface linguistic dynamics with genetic and archaeological records.

Conclusion

This study introduces Language Velocity Field estimation (LVF), a phylogeny-free approach to infer language dispersal trajectories and centres from linguistic trait dynamics and geography. Extensive simulations demonstrate effectiveness and robustness across parameters. Applied to Indo-European, Sino-Tibetan, Bantu, and Arawak, LVF reconstructs dispersals congruent with known Holocene demic and cultural spreads and locates centres proximate to agricultural/Neolithic homelands. Comparative analyses show LVF matches phylogeography when tree-likeness is high but outperforms it where horizontal contact undermines tree representations.

Future directions include: estimating dispersal timing alongside spatial paths; correcting biases due to uneven sampling of language locations; extending to diverse linguistic traits (lexicon, grammar, phonology) and to broader interdisciplinary applications linking linguistic, genetic, and archaeological dynamics.

Limitations

LVF depends on correlations between linguistic relatedness (via PCA) and geography and may be less precise than phylogeographic tree-based models when linguistic relatedness is well represented by a family tree. Current implementations focus on spatial patterns without directly estimating temporal parameters (e.g., dispersal timing). Inference may be affected by uneven geographic sampling density and potential biases in k-NN frequency estimation, though smoothing and jackknife procedures mitigate some issues. Parameter choices (k, λ, m) were tested for robustness, but extreme settings or markedly different datasets may require tuning. LVF identifies centres from radially outward vector patterns, which could be sensitive to complex migration histories or barriers not captured in the model.
