Detecting contact in language trees: a Bayesian phylogenetic model with horizontal transfer

N. Neureiter, P. Ranacher, et al.

contacTrees is a Bayesian phylogenetic model, developed by authors from the University of Zurich, that integrates language contact and horizontal transfer into the reconstruction of language evolution. Beyond reconstructing phylogenetic trees, it infers contact events between languages, which is especially useful for less-documented language families.

Introduction

Phylogenetic trees are essential tools for studying language evolution and, more broadly, cultural evolution. Bayesian phylogenetic methods make their assumptions and data explicit, but they share a significant limitation: they cannot account for language contact and the resulting horizontal transfer of linguistic traits (borrowing). Treating language change as independent across lineages contradicts how languages actually evolve, which can lead to inaccurate reconstructions and neglects the rich history of linguistic interaction. This paper addresses this gap by presenting contacTrees, a Bayesian phylogenetic model that incorporates horizontal transfer. The model aims to simultaneously reconstruct the phylogenetic tree of a language family and the contact events between its clades. It builds upon existing Bayesian phylogenetic frameworks but relaxes the strict tree-like assumption by allowing the evolution of individual linguistic traits to deviate from the language tree through borrowing events. This gives a more realistic picture of language evolution and of the interplay between vertical inheritance and horizontal transfer in shaping language families. The development matters most for less-documented language families, where knowledge of historical contact is limited. Because linguistic phylogenies are widely used in cultural evolution studies, modeling contact is also needed for robust interpretations in that field.

Literature Review

Traditional methods in comparative linguistics qualitatively inferred language phylogenies through trait comparison, but recent computational methods, particularly Bayesian phylogenetic inference, have allowed for explicit modeling of evolutionary processes. These methods have been applied to diverse aspects of cultural evolution, such as the evolution of musical instruments, religious beliefs, and political complexity. However, the application of these methods has been based on the assumption of tree-like evolution, neglecting the significant role of language contact and borrowing. Existing methods for addressing horizontal transfer either involved manual exclusion of known loanwords, lacked integration with probabilistic phylogenetic frameworks, or were computationally impractical for large-scale analyses. While some attempts have been made to incorporate borrowing in phylogenetic models (e.g., Kelly and Nicholls, 2017), limitations in computational tractability or the inability to inspect individual borrowing events hindered broader application. The ClonalOrigin model (Didelot et al., 2010), developed for horizontal gene transfer in biology, provided a framework, but required modification to be suitable for the nature of linguistic borrowing events. This paper builds on these prior efforts to develop a more robust and efficient model.

Methodology

The contacTrees model represents the ancestry of a language family using a rooted binary time tree (T), where leaves represent extant languages and internal nodes represent ancestral languages. This language tree is augmented by a set of horizontal contact edges (C), each representing a borrowing event from a donor language to a receiver language at a specific time (t). For each contact edge, binary parameters (Z) indicate which words were borrowed. The model assumes that each word follows its own word tree (Tw), which deviates from the language tree (T) wherever the word was borrowed. The contacTrees likelihood is the product of standard tree likelihoods, one per word, each evaluated on that word's own word tree. The model uses a network prior that combines a tree prior (P(T|θT)) with a contact prior (P(C|T, Γ)). The contact prior is defined by a Poisson process along the tree branches, so the number of contact events varies with the total tree length (L). Which words are borrowed along each contact edge is treated as stochastic, with the borrowing events following a binomial prior. Inference is conducted using Markov chain Monte Carlo (MCMC) methods within the BEAST 2 framework. The MCMC process combines standard and newly developed operators to sample the high-dimensional parameter space efficiently, including tree operators, contact edge operators, and borrowing operators. A new Gibbs operator improves the sampling efficiency of the borrowing parameters. The methodology was tested in a simulation study and applied in a case study on a subset of Indo-European languages.
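To make the contact prior more concrete, the following is a minimal Python sketch of drawing contact edges from a Poisson process along the branches and then drawing, for each edge, which words are borrowed. It is an illustration under stated assumptions, not the contacTrees implementation: the function names, the `contact_rate` and `borrow_prob` parameters, and the dictionary of branch lengths are all hypothetical, and the sketch only places the receiver end of each edge rather than pairing it with a contemporaneous donor branch as the full model would.

```python
import math
import random


def sample_poisson(mean, rng):
    """Draw a Poisson-distributed count with Knuth's method (fine for small means)."""
    limit = math.exp(-mean)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= limit:
            return count
        count += 1


def sample_contact_edges(branch_lengths, contact_rate, n_words, borrow_prob, rng):
    """Sample contact edges and borrowing indicators (hypothetical sketch).

    branch_lengths: dict mapping a branch label to its length in time units.
    contact_rate:   assumed expected number of contact events per unit of tree length.
    borrow_prob:    assumed probability that any one word is borrowed at an edge.
    """
    total_length = sum(branch_lengths.values())
    # Poisson process: the number of events has mean rate * total tree length L.
    n_edges = sample_poisson(contact_rate * total_length, rng)

    branches = list(branch_lengths)
    weights = [branch_lengths[b] for b in branches]
    edges = []
    for _ in range(n_edges):
        # Events are uniform along the tree, so longer branches receive more of them.
        receiver = rng.choices(branches, weights=weights)[0]
        position = rng.uniform(0.0, branch_lengths[receiver])
        # Independent per-word draws give a binomial number of borrowed words.
        borrowed = [rng.random() < borrow_prob for _ in range(n_words)]
        edges.append({"receiver": receiver, "position": position, "borrowed": borrowed})
    return edges


if __name__ == "__main__":
    rng = random.Random(1)
    toy_branches = {"A": 2.0, "B": 2.0, "AB": 1.5, "C": 3.5}
    for edge in sample_contact_edges(toy_branches, 0.3, 5, 0.2, rng):
        print(edge)
```

In this sketch a longer tree yields more contact edges on average, which mirrors the dependence of the contact prior on the total tree length described above. The branch labels and numeric values in the usage example are made up for illustration.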

Key Findings

The simulation study demonstrated that contacTrees accurately reconstructs simulated language family histories, including contact events. It also showed that ignoring contact in phylogenetic reconstructions introduces systematic biases in tree height and clock rate estimates, resulting in underestimation of tree height and overestimation of clock rates. The case study analyzing Celtic, Germanic, and Romance languages revealed that contacTrees provides a statistically superior fit to the data compared to conventional phylogenetic models (Bayes Factor of 1056.24), resulting in a significantly younger reconstructed tree and a lower clock rate. The contacTrees model identified contact edges that correspond to well-known historical language contact events, such as the Norman influence on English and the influence of Latin on Brittonic languages. The model also correctly identified many known loanwords. In addition, it detected contact events and potential borrowings between closely related languages, which points to subtle borrowing that is hard to identify with traditional methods. The ROC curve analysis showed that the model achieves a high true-positive rate (TPR) at an acceptably low false-positive rate (FPR) in identifying known loanwords. The results confirm the importance of explicitly modeling language contact for accurate phylogenetic reconstruction.
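As an illustration of how an ROC analysis of this kind can be computed, the sketch below takes a posterior borrowing probability for each word and a flag marking whether the word is a known loanword, then reports the FPR and TPR at several thresholds. The function name, the input layout, and the toy numbers are assumptions made for this example and are not taken from the study.

```python
def roc_points(posterior_probs, is_known_loan, thresholds):
    """Return (threshold, FPR, TPR) triples for a set of decision thresholds.

    posterior_probs: assumed per-word posterior probabilities of borrowing.
    is_known_loan:   booleans marking words documented as loanwords.
    A word is predicted as borrowed when its probability is >= the threshold.
    """
    n_pos = sum(1 for flag in is_known_loan if flag)
    n_neg = len(is_known_loan) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one known loanword and one non-loanword")

    points = []
    for threshold in thresholds:
        true_pos = sum(
            1 for p, flag in zip(posterior_probs, is_known_loan)
            if p >= threshold and flag
        )
        false_pos = sum(
            1 for p, flag in zip(posterior_probs, is_known_loan)
            if p >= threshold and not flag
        )
        points.append((threshold, false_pos / n_neg, true_pos / n_pos))
    return points


if __name__ == "__main__":
    # Toy values for illustration only.
    probs = [0.9, 0.8, 0.3, 0.1, 0.6]
    known = [True, True, False, False, False]
    for threshold, fpr, tpr in roc_points(probs, known, [0.5, 0.7, 0.95]):
        print(f"threshold={threshold:.2f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Sweeping the threshold from high to low traces the ROC curve. A model that separates loanwords well reaches a high TPR while the FPR is still low.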

Discussion

The results strongly support the integration of horizontal transfer into phylogenetic models for language evolution. Ignoring contact introduces systematic biases, particularly in the estimation of tree height and clock rate. The contacTrees model addresses this limitation by providing a statistically robust and linguistically plausible framework for incorporating borrowing events, leading to more accurate and reliable reconstructions. Its ability to identify both major historical contact events and more subtle borrowing patterns suggests it can reveal previously obscured aspects of language evolution. While the study focused on Indo-European languages, the model's general framework makes it applicable to a wide range of language families, especially those with limited documentation of historical contact. The improved accuracy of tree height estimation has implications for our understanding of language family diversification and spread. The identification of contact edges offers insights into the dynamics of language interaction and cultural exchange. Future research may benefit from expanding the data types beyond basic vocabulary lists to include additional features like etymon traits, phonemic transcriptions, or typological traits.

Conclusion

The contacTrees model advances language phylogenetics by integrating language contact and horizontal transfer within a Bayesian framework. The simulation and case studies demonstrate the model's ability to provide more accurate reconstructions, refine estimates of tree height and clock rate, and identify known and potentially novel instances of language contact. The model's generalizability and integration with existing BEAST 2 packages facilitate future applications to a wider array of language families and data types. Future research might explore improved priors that incorporate geographic proximity and phylogenetic distance to further improve the accuracy of contact event detection, as well as borrowing probabilities that vary across linguistic features. The contacTrees model closes a crucial gap between theoretical and empirical models of cultural evolution, setting the stage for a more comprehensive understanding of the processes shaping the world's languages.

Limitations

The accuracy of contacTrees' results depends on the quality and completeness of the input data. The use of form-meaning traits, while convenient, may limit the detection of certain types of borrowing, particularly those not reflected in basic vocabulary. The model's reliance on pre-defined meaning classes can affect the interpretation of semantic shifts and parallel innovations, as these can be mistaken for borrowing events. The availability and accuracy of historical information about language contact can also influence the results. Furthermore, the computational demands of the model increase with the size of the dataset, potentially limiting its application to very large language families. Despite these limitations, contacTrees offers substantial improvement over previous approaches, and future refinements can address these remaining issues.
