Detecting contact in language trees: a Bayesian phylogenetic model with horizontal transfer
N. Neureiter, P. Ranacher, et al.
The paper addresses a key limitation of standard Bayesian phylogenetic models for language evolution: the assumption that languages evolve independently along a tree without horizontal transfer. In reality, language contact leads to borrowing of lexical and structural features, violating tree-like evolution and potentially biasing inferences of topology, rates, and dates. While phylogenetic methods have successfully illuminated many aspects of cultural and linguistic evolution, a principled, probabilistic way to incorporate contact has been lacking. The authors propose contacTrees, a Bayesian model that augments language phylogenies with explicit horizontal transfer (contact) edges, allowing different words (or traits) to follow different trees induced by contact while sharing an underlying language tree. The purpose is to jointly infer the phylogeny, contact events, and borrowed traits, improving model fit and yielding historically interpretable contact histories.
Bayesian phylogenetic inference in linguistics often imports models from biology, such as CTMC substitution processes and coalescent or birth–death priors, with increasing attempts to tailor models to linguistic data (e.g., Dollo-like models for cognates, concerted evolution for sound changes, ordinal models for ordered traits). Yet, assumptions about strictly tree-like evolution remain problematic given language contact. Prior efforts to handle contact include data preprocessing (removing known loanwords), and diagnostics/visualizations for non-tree-like signal (Q residuals, δ scores, TIGER; NeighborNet, SplitsTree; Minimal Lateral Networks), as well as non-Bayesian network reconstructions or parsimony-based extensions (e.g., Willems et al., 2016; Nakhleh et al., 2005) and causal inference of lexical flow (Dellert, 2019). Bayesian approaches allowing borrowing exist for small trees or with borrowing integrated out (Kelly & Nicholls, 2017), limiting interpretability and scalability. In biology, analogous challenges are addressed by the multispecies coalescent and species networks, but network inference is computationally expensive. The ClonalOrigin model explicitly models horizontal transfer in bacteria, but its assumption of contiguous segment conversion is not suitable for unordered linguistic features. contacTrees adapts the ClonalOrigin concept to linguistics, enabling explicit inference of contact edges and borrowed traits within a Bayesian framework compatible with BEAST 2.
Model: contacTrees represents the language family by a rooted, time-calibrated binary tree T (languages at leaves, ancestral languages at internal nodes). A set of contact edges C augments the tree, where each contact edge c = (l1, l2, t) models a borrowing event from donor lineage l1 to receiver lineage l2 at time t. For each edge c and each word (trait) w ∈ W, a binary indicator Z_cw ∈ {0,1} denotes whether w is borrowed along c. For each word w, the combination of T, C, and Z induces a word-specific tree Tw that may deviate from T when borrowing occurs.
Likelihood: Assuming independence across words given their word trees, the data likelihood factorizes as P(X|T,C,Z,θX) = ∏_{w∈W} P(Xw | Tw, θX), where θX includes substitution and branch rate parameters.
Priors: The joint prior over network structures is P(T,C|θT,Γ) = P(C|T,Γ) P(T|θT). The contact prior P(C|T,Γ) is a Poisson process along the branches of T: P(C|T,Γ) = exp(−Γ L) Γ^{|C|} / |C|!, where L is the total tree length and Γ is the expected number of contact edges. Borrowing indicators have independent Bernoulli priors with borrowing probability β: P(Z|C,β) = β^{|Z|} (1−β)^{|W||C| − |Z|}, where |Z| counts the indicators equal to 1.
Posterior: P(T, C, Z, β, Γ, θX, θT | X) ∝ P(X|T,C,Z,θX) P(Z|C,β) P(C|T,Γ) P(T|θT) P(β, Γ, θX, θT).
Inference: The model is implemented in BEAST 2 and sampled by MCMC. Efficient inference requires three groups of specialized proposal operators: (1) tree operators, which adapt standard subtree moves so that contact edges remain valid when branches move; (2) contact-edge operators, which add, remove, or retime an edge, or move its donor or receiver, without altering T; and (3) borrowing-configuration operators, which use Gibbs proposals for Z to improve mixing over naive random-walk updates. Additional details on operator design and implementation are provided in the supplementary materials.
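As a rough illustration of the borrowing-indicator prior, and of what a Gibbs-style update for a single indicator Z_cw could look like, the following Python sketch evaluates log P(Z|C,β) and the conditional probability that one word is borrowed along one edge. The function names, the nested-list layout of Z, and the two log-likelihood inputs (the likelihood of word w on its word tree with and without the borrowing) are assumptions made for this example. The operators actually used by contacTrees are described in the supplementary materials and may differ.

```python
import math


def log_borrowing_prior(Z, beta):
    """log P(Z | C, beta) for independent Bernoulli indicators.

    Z is a list with one row per contact edge; each row holds one 0/1
    entry per word (hypothetical layout chosen for this sketch).
    """
    n_total = sum(len(row) for row in Z)        # |W| * |C|
    n_borrowed = sum(sum(row) for row in Z)     # |Z|, indicators equal to 1
    return (n_borrowed * math.log(beta)
            + (n_total - n_borrowed) * math.log(1.0 - beta))


def gibbs_prob_borrowed(loglik_borrowed, loglik_not_borrowed, beta):
    """P(Z_cw = 1 | everything else) for a single edge c and word w.

    Only the likelihood of word w depends on Z_cw, so the conditional is
    proportional to beta * L(borrowed) versus (1 - beta) * L(not borrowed).
    Computed in log space to avoid underflow.
    """
    a = math.log(beta) + loglik_borrowed
    b = math.log(1.0 - beta) + loglik_not_borrowed
    m = max(a, b)
    return math.exp(a - m) / (math.exp(a - m) + math.exp(b - m))


# Toy example: 2 contact edges, 3 words, two borrowings in total.
Z = [[1, 0, 0],
     [0, 0, 1]]
log_prior = log_borrowing_prior(Z, beta=0.25)
p_equal = gibbs_prob_borrowed(-10.0, -10.0, beta=0.25)
p_ratio3 = gibbs_prob_borrowed(math.log(3.0), 0.0, beta=0.25)
```

With equal likelihoods the conditional falls back to the prior β. A likelihood ratio of 3 in favor of borrowing exactly offsets a prior of β = 0.25 and gives a conditional probability of 0.5.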
Simulation study: 100 replicate datasets were simulated under a Yule tree with 25 languages, an expected 6.0 contact edges, and a per-edge borrowing probability β = 0.25. Each dataset comprised 100 words with 20 binary sites each, evolving on the word trees under a binary CTMC with log-normal relaxed branch rates (σ = 0.3). The clock was calibrated with age priors on three internal nodes. Each dataset was analyzed under (i) contacTrees (CT) and (ii) a conventional no-contact model (noCT; Γ set to 0). Coverage of 95% credible intervals, bias, and topological error (RNNI distance) were assessed.
Case study: 39 Indo-European languages from the Celtic, Germanic, and Romance (CGR) branches were analyzed using IELex (1419 cognates, 206 meaning classes), with curated corrections and additional Medieval Latin data. The tree prior was a birth–death skyline, the substitution model was a binary covarion, and an uncorrelated relaxed clock was calibrated using ancient languages and clade-age priors. The expected number of contact edges in the contact prior was set to Γ = 0.25 and the borrowing probability to β = 0.1 (regularizing to prevent overfitting). Three scenarios were run: CT (full model), noCT (Γ = 0), and noCT-filtered (no contact; known loanwords removed by coding them as constant absence). For interpretability of contact edges, one analysis conditioned on a fixed topology (from Chang et al., 2015) while inferring node heights and contact.
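The coverage assessment used in the simulation study can be sketched as follows. If the 95% credible intervals are well calibrated, the number of replicates whose interval contains the true value should behave like a Binomial(100, 0.95) draw, which gives a reference range for judging an observed count. The function names and the toy intervals below are made up for illustration and are not taken from the paper's analysis scripts.

```python
from math import comb


def coverage_count(intervals, truths):
    """Count replicates whose (low, high) interval contains the true value."""
    return sum(low <= truth <= high
               for (low, high), truth in zip(intervals, truths))


def binom_mass(n, p, lo, hi):
    """P(lo <= K <= hi) for K ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1.0 - p)**(n - k)
               for k in range(lo, hi + 1))


# Toy data: three replicates of some parameter (e.g. root height).
intervals = [(4.0, 6.0), (5.5, 7.0), (3.0, 4.5)]
truths = [5.0, 5.0, 4.0]
covered = coverage_count(intervals, truths)

# Probability that between 91 and 99 of 100 calibrated intervals cover.
mass_91_99 = binom_mass(100, 0.95, 91, 99)
```

In the toy data two of the three intervals cover the truth. For 100 replicates, counts from 91 to 99 together hold more than 95% of the binomial probability mass, so a count outside that range suggests miscalibration.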
Simulation study: Under contacTrees, 94/100 (root height) and 93/100 (clock rate) 95% credible intervals covered the true values, within the expected 91–99% range. Under noCT, coverage dropped to 84/100 (root height) and 89/100 (clock rate), with biases toward lower root heights and higher clock rates. Topology error was lower under CT (mean RNNI ≈ 1.64) than under noCT (≈ 2.35). These results show that ignoring contact induces systematic errors in rates, dates, and to a lesser extent topology, even when borrowing is moderate and concentrated in a small number of contact events.
Case study (CGR): The CT analysis inferred a substantially younger root height (mean ≈ 5384 years BP) and lower clock rate (2.84×10^-5 per year) than noCT (root ≈ 7206 BP; rate 5.97×10^-5) and noCT-filtered (root ≈ 7449.8 BP; rate 4.96×10^-5). Covarion parameters indicated more variable rates under noCT (e.g., α ≈ 0.006) than CT (α ≈ 0.044). A Bayes factor comparing models with versus without contact yielded BF ≈ 1056 in favor of CT, indicating decisive support and arguing against overfitting. contacTrees inferred 32 contact edges: 14 within Germanic, 15 within Romance, 1 within Celtic, and 2 cross-clade (Romance → Brittonic around 350 CE; Gallo-Romance/Old French → Middle English around 640 YBP). ROC analysis comparing predicted loans to IELex annotations showed better-than-chance detection. For English, a posterior probability threshold of 0.33 achieved TPR ≈ 67.9% at FPR ≈ 27.2%; for Welsh_N and Breton_ST, the same threshold yielded TPR ≈ 33.3% and 29.4% at FPR ≈ 10.4% and 10.0%, respectively. Known historical contacts (e.g., Viking Age Norse → English; Norman French → English; Latin/Romance → Brittonic) were recovered. Some Romance edges used Sardinian as donor, plausibly acting as a proxy for conservative/archaic forms.
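The TPR/FPR figures reported for the loanword ROC analysis correspond to single points on a ROC curve. A minimal sketch of how one such point could be computed from per-word posterior borrowing probabilities and reference loan annotations is given below. The function name, the toy values, and the choice of treating a posterior at or above the threshold as a predicted loan are assumptions for illustration, not the authors' evaluation code.

```python
def roc_point(posteriors, is_loan, threshold):
    """Return (TPR, FPR) when words with posterior >= threshold are called loans.

    posteriors: posterior borrowing probability per word
    is_loan:    reference annotation per word (True if annotated as a loan)
    """
    tp = fp = fn = tn = 0
    for p, loan in zip(posteriors, is_loan):
        predicted = p >= threshold
        if predicted and loan:
            tp += 1
        elif predicted and not loan:
            fp += 1
        elif loan:
            fn += 1
        else:
            tn += 1
    # Guard against empty classes in small samples.
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    return tpr, fpr


# Toy example: six words, three of them annotated as loans.
toy_posteriors = [0.90, 0.40, 0.10, 0.35, 0.05, 0.20]
toy_is_loan = [True, True, True, False, False, False]
tpr, fpr = roc_point(toy_posteriors, toy_is_loan, threshold=0.33)
```

In the toy data, two of the three annotated loans and one of the three non-loans exceed the threshold, so the point is (TPR, FPR) = (2/3, 1/3). Sweeping the threshold from 1 down to 0 and collecting these points traces the full ROC curve.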
The findings support the central hypothesis that explicitly modeling contact improves phylogenetic inference for languages. By allowing word-specific histories via contact edges, contacTrees reduces biases in rate and date estimates that arise when borrowings are forced to be explained by parallel innovations and losses under a strict tree. In simulations, CT restores calibration and reduces topological error; in the CGR case, CT yields a younger, better-supported root height consistent with external estimates and a substantially lower rate, enhancing overall fit (high BF). The inferred contact edges align with well-attested historical events and reveal plausible patterns of contact among closely related and geographically proximate languages, while also surfacing latent contact effects and parallel innovations. The approach integrates contact inference into a transparent Bayesian framework, bridging theoretical expectations of non-tree-like linguistic evolution with empirical phylogenetic practice and offering outputs (dated edges, loanword posterior probabilities) that are directly interpretable for historical linguistics and cultural evolution.
contacTrees introduces a Bayesian network model for linguistic phylogenies that jointly infers a language tree, dated horizontal contact edges, and borrowed traits. It improves calibration and mitigates biases in rates and dates seen in conventional models when data contain borrowings. In a CGR Indo-European case study, contacTrees decisively outperformed a no-contact model, recovered known contact events and loanwords, and produced a root age more consistent with independent estimates. The implementation in BEAST 2 and compatibility with standard priors, clocks, and substitution models make the approach broadly applicable. Future research directions include: extending the contact prior to incorporate geography and phylogenetic proximity; allowing feature-specific borrowing propensities (β) to test borrowability hypotheses; applying the model to full Indo-European and other less-documented families; leveraging longer word lists and alternative trait codings (etymon, phonemic, typological) to refine borrowing detection; and integrating inferred contact edges into downstream comparative analyses of grammatical and cultural traits.
Limitations stem primarily from data and modeling assumptions: (1) Form–meaning (Swadesh-style) coding captures only presence/absence of the most common form per meaning, conflating lexical replacement with semantic shifts and making synonyms problematic; this can induce homoplasy and misattribute parallel innovations or retentions as borrowing. (2) Loanword annotations in IELex are incomplete, complicating quantitative validation; ROC results are conservative given unlabelled loans and semantic loans. (3) Some inferred donors (e.g., Sardinian) may act as proxies for conservative forms due to coding choices, potentially obscuring true sources. (4) The case study fixed topology for clarity when interpreting contact, and general inference with both topology and contacts can be computationally demanding. (5) The current contact prior is uninformative with respect to geography or relatedness, allowing geographically implausible edges (e.g., Afrikaans as donor) that a geographically informed prior could discourage. (6) Different data types (etymon, phonemic, typological) may exhibit different contact effects and require tailored substitution models; results may vary accordingly.