Detecting directional forces in the evolution of grammar: A case study of the English perfect with intransitives across EEBO, COHA, and Google Books

S. Okuda, M. Hosaka, et al.

This research by Shimpei Okuda, Michio Hosaka, and Kazutoshi Sasahara examines the evolutionary forces behind the shift of the English perfect from 'be+PP' to 'have+PP' constructions with intransitive verbs. Using large diachronic corpora and a neural network classifier, the study finds that selection, rather than random drift, dominated this grammatical change.

Introduction
The study investigates whether the historical shift in English perfect auxiliary selection—from earlier be+PP to modern have+PP—was driven by directional selection or by random drift. Situated within cultural evolution research, the work leverages advances in computational methods and big diachronic corpora to quantify grammatical change. Prior work showed aggregate increases in HAVE-perfect forms and decreases in BE-perfect forms, but the causal forces remained debated. This paper focuses specifically on intransitive verbs to avoid confounds with the passive, formulates the problem within an evolutionary framework, and asks if verb-level trajectories show signatures of selection rather than neutral drift.
Literature Review
Auxiliary selection in Indo-European languages usually assigns HAVE to transitives and BE to certain intransitives, with fine-grained patterns such as unaccusatives selecting BE and unergatives selecting HAVE (e.g., Sorace’s work; cross-linguistic evidence in French, German, Dutch). In English, earlier stages exhibited BE-perfect with many intransitives, but HAVE-perfect became dominant in Modern English, with prior corpus studies indicating that HAVE increased from Late Middle English while BE declined markedly in the 19th century. Earlier analyses used relatively small parsed corpora (YCOE, PPCME2, PPCEME, PPCMBE), finding that HAVE rose before BE definitively fell, and identifying linguistic (e.g., modality, aspectual meanings, telicity) and extralinguistic factors (chronology, text type) influencing the change. Methodological work on detecting evolutionary forces (Newberry et al., 2017) suggested that both selection and drift can shape language change, but also highlighted limitations of the Frequency Increment Test (FIT), such as sensitivity to binning and normality assumptions. A neural time series classifier (TSC) trained on Wright–Fisher simulations has been proposed to robustly distinguish selection from drift in cultural time series.
Methodology
Data: Three large-scale sources were combined to cover 1473–2009 (Google Books coverage ends in 2000): (1) EEBO (1473–1700; 7.55M words) for British-related texts; (2) COHA (1810–2009; 400M words) for American English; and (3) Google Books Ngram (1700–2000; 468B tokens) to bridge the gap between them. Because Google Books provides only relative frequencies (no POS/matched tokens) and is orders of magnitude larger, its values were scaled to be comparable with COHA using the overlapping period (1810–2000), then used to fill 1700–1810.
Targets: To avoid conflating perfect be+PP with passives, only intransitive verbs (per LDOCE Online) were analyzed. Verb selection followed a frequency-based pipeline using the COCA frequency list: (1) extract verbs; (2) filter to intransitives (LDOCE); (3) retain those occurring >200 times in each of EEBO, COHA, and Google Books; (4) among those, keep verbs with a be+PP share ≥0.5 in EEBO, so that the transition is observable within the covered period. This yielded 13 verbs (Group A: arrive, bound, come, creep, degenerate, expire, fall, insist, look, rise, stay, tumble, vanish). In addition, six intransitive verbs previously discussed in the literature but not meeting the frequency criteria were added as Group B (ascend, become, depart, descend, escape, go).
Queries and normalisation: Search patterns targeted basic be/have + past participle strings while accounting for historical spelling variants compiled from EEBO evidence. EEBO and COHA yielded raw counts per year for be+PP and have+PP. Google Books returned yearly relative frequencies; these were scaled against COHA's overlapping period to reconstruct comparable time series for 1700–1810.
Binning: Following Newberry et al. (2017), time series were binned so that each bin held a comparable amount of data, using a bin size proportional to log N (N = total counts) and assigning the median year to each bin. This is necessary to approximate the Wright–Fisher population-size assumption.
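The scaling and binning steps described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function names `scale_to_reference` and `log_bins` are hypothetical, the scaling is assumed to be a single constant factor estimated over the overlap years, and "bin size proportional to log N" is read here as "use about ceil(ln N) bins of roughly equal token count", which is one plausible interpretation.

```python
import math
import statistics


def scale_to_reference(rel_freq, ref_counts, overlap_years):
    """Rescale relative frequencies (e.g. Google Books) so that their total
    over the overlap years matches the reference counts (e.g. COHA).
    Assumption: one constant factor is adequate for the whole series."""
    ref_total = sum(ref_counts[y] for y in overlap_years)
    rel_total = sum(rel_freq[y] for y in overlap_years)
    if rel_total <= 0:
        raise ValueError("no relative-frequency mass in the overlap period")
    factor = ref_total / rel_total
    return {year: value * factor for year, value in rel_freq.items()}


def log_bins(yearly_counts):
    """Group yearly (have, be) counts into about ceil(ln N) consecutive bins
    holding similar numbers of tokens. Returns a list of
    (median_year, have_share) pairs, one per non-empty bin."""
    years = sorted(yearly_counts)
    total = sum(have + be for have, be in yearly_counts.values())
    if total <= 0:
        return []
    n_bins = max(1, math.ceil(math.log(total)))
    target = total / n_bins  # tokens aimed for in each bin

    groups, current, running = [], [], 0.0
    for year in years:
        current.append(year)
        running += sum(yearly_counts[year])
        # Close the bin once it is full, keeping the last bin open-ended.
        if running >= target and len(groups) < n_bins - 1:
            groups.append(current)
            current, running = [], 0.0
    if current:
        groups.append(current)

    series = []
    for group in groups:
        have = sum(yearly_counts[y][0] for y in group)
        be = sum(yearly_counts[y][1] for y in group)
        if have + be > 0:
            series.append((statistics.median(group), have / (have + be)))
    return series
```

Scaled Google Books values for 1700–1810 could be concatenated with the EEBO and COHA counts before binning, so that the binned have+PP share forms one continuous trajectory per verb.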
Detection of evolutionary forces: Two methods were considered under a Wright–Fisher null model. The primary inference used a neural time series classifier (TSC; nnfit library) trained on simulations to label trajectories as selection vs drift. For reference, the Frequency Increment Test (FIT) was also computed, but a post-hoc power analysis showed insufficient power (d < 0.8) for all verbs, so FIT results were not used for inference.
Validation: Visual inspection confirmed alignment between Google Books and COHA (1810–2000) and general continuity at corpus boundaries; minor gaps around 1700–1750 were mitigated by binning, with inflection points around 1800, consistent with prior literature.
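To make the Wright–Fisher null model and the FIT statistic concrete, here is a small standard-library sketch. It is not the nnfit classifier and does not reproduce the authors' pipeline; the function names and parameters (`wright_fisher`, `fit_statistic`, `n_pop`, `s`) are illustrative assumptions. The FIT part follows the usual definition: frequency increments are rescaled by sqrt(2 v (1 - v) Δt) and tested against a mean of zero with a one-sample t-statistic. Only the statistic and its degrees of freedom are returned here; turning them into a p-value needs a t-distribution, which is left out.

```python
import math
import random
import statistics


def wright_fisher(n_pop, p0, s, generations, rng):
    """Simulate the frequency of one variant in a population of n_pop tokens.
    s is the selection coefficient; s = 0 gives pure drift."""
    freqs = [p0]
    p = p0
    for _ in range(generations):
        # Selection shifts the expected frequency ...
        p_sel = p * (1 + s) / (p * (1 + s) + (1 - p))
        # ... and binomial resampling of n_pop tokens adds drift.
        k = sum(1 for _ in range(n_pop) if rng.random() < p_sel)
        p = k / n_pop
        freqs.append(p)
    return freqs


def fit_statistic(freqs, times):
    """Frequency Increment Test: t-statistic of the rescaled increments
    against a mean of zero. Returns (t, degrees_of_freedom)."""
    increments = []
    for i in range(1, len(freqs)):
        v_prev, v_next = freqs[i - 1], freqs[i]
        dt = times[i] - times[i - 1]
        # Increments are undefined at the boundaries 0 and 1.
        if 0 < v_prev < 1 and dt > 0:
            scale = math.sqrt(2 * v_prev * (1 - v_prev) * dt)
            increments.append((v_next - v_prev) / scale)
    if len(increments) < 2:
        raise ValueError("need at least two usable increments")
    sd = statistics.stdev(increments)
    if sd == 0:
        raise ValueError("increments have zero variance")
    t = statistics.mean(increments) / (sd / math.sqrt(len(increments)))
    return t, len(increments) - 1


if __name__ == "__main__":
    rng = random.Random(42)
    drift = wright_fisher(n_pop=500, p0=0.2, s=0.0, generations=30, rng=rng)
    selected = wright_fisher(n_pop=500, p0=0.2, s=0.2, generations=30, rng=rng)
    steps = list(range(31))
    print("drift t, df:", fit_statistic(drift, steps))
    print("selection t, df:", fit_statistic(selected, steps))
```

A classifier trained on many such simulated trajectories, labeled by whether s was zero, is the general idea behind the neural TSC; the architecture and training setup are not described in this summary.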
Key Findings
- For Group A (13 high-frequency intransitives), all verbs except bound exhibited a clear historical transition from be+PP dominance (pre-1600) to have+PP dominance, with a sharp rise in have+PP between roughly 1750 and 1800.
- For Group B (6 literature-selected verbs), trajectories similarly showed a marked rise in have+PP.
- Neural TSC classifications (probability of selection): arrive 1.00, come 1.00, creep 1.00, degenerate 1.00, expire 1.00, fall 1.00, insist 1.00, rise 1.00, stay 1.00, vanish 1.00, look 1.00, tumble 1.00, ascend 1.00, become 1.00, depart 1.00, escape 1.00, descend 0.78, bound 0.07, go 0.01. In total, 17 of the 19 verbs were classified as selection: 16 with probability 1.00 and descend with 0.78. Bound and go were not.
- FIT was underpowered for all verbs (post-hoc power d < 0.8) and was not used for conclusions.
- Sensitivity analysis with a milder threshold (≥30 occurrences per dataset) yielded comparable results: 33/36 verbs were classified as selection; the exceptions (meddle, bound, go) are explainable by passive confounds or PP-as-adjective readings, which suppress the rise in have+PP counts.
- Overall, the evidence indicates that selection, not random drift, predominantly drove the be→have transition in perfect auxiliaries with intransitives.
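The tally in the findings above can be checked directly from the reported probabilities. The snippet below is only a bookkeeping sketch: the probabilities are copied from the list, while the 0.5 decision threshold is an assumption made for illustration, since the summary does not state the cutoff the authors used.

```python
# Probability of selection per verb, as reported in the findings list.
selection_prob = {
    "arrive": 1.00, "come": 1.00, "creep": 1.00, "degenerate": 1.00,
    "expire": 1.00, "fall": 1.00, "insist": 1.00, "rise": 1.00,
    "stay": 1.00, "vanish": 1.00, "look": 1.00, "tumble": 1.00,
    "ascend": 1.00, "become": 1.00, "depart": 1.00, "escape": 1.00,
    "descend": 0.78, "bound": 0.07, "go": 0.01,
}

THRESHOLD = 0.5  # assumed cutoff, not taken from the paper

selected = sorted(v for v, p in selection_prob.items() if p >= THRESHOLD)
not_selected = sorted(v for v, p in selection_prob.items() if p < THRESHOLD)
certain = sorted(v for v, p in selection_prob.items() if p == 1.00)

print(len(selection_prob), len(selected), len(certain))  # 19 17 16
print(not_selected)  # ['bound', 'go']
```

Under this cutoff descend (0.78) is the only verb counted as selection without a probability of 1.00.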
Discussion
The results address the central question by showing that most verb-specific trajectories of have+PP frequency exhibit selection-like patterns, with the neural TSC classifying the majority as selection. This suggests that the be→have auxiliary shift in English perfect formation largely reflects directional forces rather than neutral drift. The timing of the frequency increase (circa late 18th to early 19th century) aligns with previous aggregated findings, but here the verb-level dynamics, speeds, and S-shaped transitions become visible. Exceptions (e.g., bound, go) likely reflect confounds such as be+PP passives and PP-as-adjective usages, which artificially dampen the have+PP rise in counts. Compared with prior claims emphasizing drift (e.g., Newberry et al., 2017), this study provides countervailing evidence for selection in this specific grammatical domain and highlights the utility of large-scale, cross-corpus time series aligned through scaling and binned to respect model assumptions.
Conclusion
Using three large-scale corpora aligned across five centuries and a neural time series classifier, the study provides quantitative evidence that selection likely drove the historical replacement of be+PP by have+PP in English perfects with intransitives. It corroborates prior insights on timing while adding verb-level evolutionary trajectories and classifications. Future research should (i) probe the linguistic mechanisms behind HAVE’s rise (e.g., aspectual and modal developments, functional differentiation from BE to avoid passive ambiguity), (ii) account for genre and regional (British vs American) dynamics more explicitly, and (iii) extend the methodology cross-linguistically to test the generality of selection vs drift in perfect auxiliary evolution and related grammatical changes.
Limitations
- Model assumptions: Neural TSC is trained on Wright–Fisher simulations; violations (e.g., varying effective population sizes, mixture of sources) may affect classification, as with other methods including FIT.
- Corpus composition: British and American English are mixed across sources; different regional timelines may bias raw counts, though relative frequencies mitigate some effects.
- Google Books biases: Lack of POS/matched contexts, increasing share of scientific texts over time, and scaling steps introduce uncertainties that may affect frequency estimates and genre-specific variation.
- Passive and adjectival confounds: Some intransitives (e.g., meddle) appear in passive-like usages or PP-adjective constructions, complicating counts intended to isolate perfects.
- Statistical power: FIT lacked sufficient power despite large datasets, limiting cross-method triangulation.
- Thresholding and selection of verbs: Frequency thresholds and initial intransitive classification (LDOCE) constrain coverage and may exclude relevant verbs or include edge cases.