Linguistics and Languages
Synthesizing theories of human language with Bayesian program induction
K. Ellis, A. Albright, et al.
A key aspect of human intelligence is our ability to build theories about the world. This faculty is most clearly manifested in the historical development of science but also occurs in miniature in everyday cognition and during childhood development. The similarities between the process of developing scientific theories and the way that children construct an understanding of the world around them have led to the child-as-scientist metaphor in developmental psychology, which views conceptual changes during development as a form of scientific theory discovery. Thus, a key goal for both artificial intelligence and computational cognitive science is to develop methods to understand—and perhaps even automate—the process of theory discovery.
In this paper, we study the problem of AI-driven theory discovery, using human language as a testbed. We focus primarily on the linguist's construction of language-specific theories and on the synthesis of abstract cross-language meta-theories, but we also propose connections to child language acquisition. The cognitive sciences of language have long drawn an explicit analogy between the working scientist constructing grammars of particular languages and the child learning those languages. Language-specific grammars must be formulated within a common theoretical framework, sometimes called universal grammar. For the linguist, this framework is the target of empirical inquiry; for the child, it comprises the linguistic resources they bring to the task of language acquisition.
Natural language is an ideal domain for studying theory discovery, for several reasons. First, on a practical level, decades of work in linguistics, psycholinguistics, and the other cognitive sciences of language provide diverse raw material for developing and testing models of automated theory discovery: corpora, data sets, and grammars exist for a large variety of typologically distinct languages, giving a rich and varied testbed for benchmarking theory-induction algorithms. Second, children readily acquire language from quantities of data that are modest by the standards of modern artificial intelligence, and working field linguists likewise develop grammars from very small amounts of elicited data. These facts suggest that the child-as-linguist analogy is a productive one, and that inducing theories of language is tractable from sparse data given the right inductive biases. Third, theories of language representation and learning are formulated in computational terms, exposing a suite of formalisms ready to be deployed by AI researchers. These three features of human language (a large number of highly diverse empirical targets, interfaces with cognitive development, and computational formalisms within linguistics) conspire to single out language as an especially suitable target for research in automated theory induction.
Ultimately, the goal of the language sciences is to understand the general representations, processes, and mechanisms that allow people to learn and use language, not merely to catalog and describe particular languages. To capture this framework-level aspect of the theory-induction problem, we adopt the paradigm of Bayesian Program Learning (BPL). A BPL model frames an inductive inference problem, such as theory and grammar induction, as inferring a generative procedure represented as a symbolic program. Treating the observed data as the outputs of such a program, the model uses Bayes' rule to work backward from those outputs to the program that generated them. We embed classic linguistic formalisms within a programming language provided to the BPL learner; it is this inductive bias that allows the model to learn programs capturing a wide diversity of natural-language phenomena. By systematically varying the inductive bias, we can study elements of the induction problem that span multiple languages, and by performing hierarchical Bayesian inference on the programming language itself, we can automatically discover some of these universal trends. But BPL comes at a steep computational cost, so we develop new BPL algorithms that combine techniques from program synthesis with intuitions drawn from how scientists build theories and how children learn languages.
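In symbols (our notation, not the paper's), a BPL learner scores a candidate program ρ against observed data X by inverting the generative direction with Bayes' rule:

```latex
P(\rho \mid X)
  \;=\; \frac{P(X \mid \rho)\, P(\rho)}{\sum_{\rho'} P(X \mid \rho')\, P(\rho')}
  \;\propto\; P(X \mid \rho)\, P(\rho)
```

The prior P(ρ) is where the programming language's inductive bias enters; the likelihood P(X | ρ) asks whether ρ actually reproduces the observed data.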
We focus on theories of natural-language morphophonology, the domain of language governing the interaction of word formation and sound structure. Discovering a language's morphophonology means inferring its stems, prefixes, and suffixes (its morphemes), together with the phonological rules that predict how concatenations of those morphemes are actually pronounced. Thus, acquiring the morphophonology of a language involves solving a basic problem confronting both linguists and children: building theories of the relationship between form and meaning from a collection of utterances, together with aspects of their meanings.
We evaluate our BPL approach on 70 data sets spanning the morphophonology of 58 languages. These data sets come from phonology textbooks: they are linguistically diverse but much simpler than full language learning, comprising tens to hundreds of words at most and typically isolating just a handful of grammatical phenomena. We then shift our focus from linguists to children, showing that the same approach to finding grammatical structure in natural language also captures classic findings from the infant artificial-grammar-learning literature. Finally, by performing hierarchical Bayesian inference across these linguistic data sets, we show that the model can distill universal cross-language patterns and express them in a compact, human-understandable form. Collectively, these findings point the way toward more human-like AI systems for learning theories, and toward systems that learn to learn such theories more effectively over time by refining their inductive biases.
Modeling framework: The system explains a set X of form-meaning pairs (f, m) by inferring a language-specific theory T (phonological rules and morphological composition) and a lexicon L (stems, prefixes, suffixes). Initial inference uses MAP estimation to maximize P(T, L | UG) ∏_{(f,m)∈X} I[f = Phonology(Morphology(m))], where UG encodes higher-level knowledge. Morphology is concatenative, mapping meanings to prefix+stem+suffix via L and then passing the result through K ordered phonological rewrite rules r1, …, rK. The prior P(T, L | UG) favors compact lexica and fewer, simpler rules via a description-length prior.
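Spelled out as a MAP estimate (a sketch in the section's own notation; Morphology depends on L, and Phonology on T):

```latex
(T^{*}, L^{*}) \;=\; \operatorname*{arg\,max}_{T,\,L}\;
  P(T, L \mid \mathrm{UG})
  \prod_{(f,\,m)\,\in\,X}
  \mathbb{1}\!\left[\, f = \mathrm{Phonology}\big(\mathrm{Morphology}(m)\big) \right]
```

The indicator term makes the likelihood all-or-nothing: a candidate theory and lexicon must reproduce every observed surface form exactly.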
Representation of sounds and rules: Phonemes are binary feature vectors (e.g., nasal, sonorant, voice). Phonological rules are SPE-style context-dependent rewrites of the form (focus → structural_change / left_trigger _ right_trigger), with natural-class specifications and an optional Kleene star for unbounded local contexts. Rules are ordered and non-cyclic, corresponding to 2-way rational functions (finite-state transducers), an expressivity argued to be sufficient for morphophonology.
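As a toy illustration of this representation (not the paper's code; the feature inventory, the `apply_rule` helper, and the assimilation rule below are all illustrative assumptions), a rule can be applied in a single simultaneous pass by checking natural-class matches on the focus and its context, then overwriting the specified features:

```python
# Phonemes as binary feature bundles; a tiny illustrative inventory.
PHONEMES = {
    "n": {"nasal": 1, "voice": 1, "coronal": 1, "labial": 0},
    "m": {"nasal": 1, "voice": 1, "coronal": 0, "labial": 1},
    "p": {"nasal": 0, "voice": 0, "coronal": 0, "labial": 1},
    "t": {"nasal": 0, "voice": 0, "coronal": 1, "labial": 0},
    "a": {"nasal": 0, "voice": 1, "coronal": 0, "labial": 0},
}

def matches(phoneme, natural_class):
    """A phoneme matches a natural class if it agrees on every specified feature."""
    return all(PHONEMES[phoneme][f] == v for f, v in natural_class.items())

def apply_rule(word, focus, change, left, right):
    """SPE-style rewrite (focus -> change / left _ right): rewrite every segment
    matching `focus` whose neighbors match the contexts ({} = unrestricted).
    Contexts are checked against the input, i.e. simultaneous application."""
    out = []
    for i, seg in enumerate(word):
        left_ok = not left or (i > 0 and matches(word[i - 1], left))
        right_ok = not right or (i + 1 < len(word) and matches(word[i + 1], right))
        if matches(seg, focus) and left_ok and right_ok:
            target = dict(PHONEMES[seg], **change)  # partial feature update
            seg = next(p for p, fs in PHONEMES.items() if fs == target)
        out.append(seg)
    return out

# Toy nasal place assimilation: a nasal becomes labial before a labial oral stop,
# so underlying /anpa/ surfaces as [ampa].
surface = apply_rule(
    list("anpa"),
    focus={"nasal": 1},
    change={"labial": 1, "coronal": 0},
    left={},
    right={"labial": 1, "nasal": 0},
)
print("".join(surface))  # -> "ampa"
```

The natural-class formulation is what lets one rule generalize across phonemes: the rule above never mentions /n/ or /m/ by name, only feature bundles.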
Bayesian Program Learning: The theory T is encoded as a program drawn from a domain-specific language embodying Universal Grammar constraints. The model embeds classic linguistic formalisms in a program space, enabling symbolic, human-interpretable grammars.
Inference algorithm: Direct search over all programs is intractable, so the approach uses constraint-based program synthesis with SAT solving (Sketch). To scale beyond small grammars, the authors implement an incremental, counterexample-guided synthesis procedure: start with an empty or small theory, identify counterexamples (data not yet explained), and use the solver to find a revised theory that explains the newly batched counterexamples while remaining close (in edit distance) to the previous theory. This repeats, growing the theory progressively. The heuristic forfeits completeness guarantees but yields tractable subproblems and exploits parallelism.
Program synthesis details (Methods): The objective is maximized over rules and lexica consistent with the data, subject to a bound on the number of rules, using Sketch 1.7.5 invoked from Python 2.7. Incremental solving constructs a sequence (X_t, T_t) with X_{t+1} = X_t ∪ counterexamples, where T_{t+1} is found by maximizing fit subject to a small edit distance from T_t. Counterexamples are minibatched by lexeme and ordered by occurrence.
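A schematic sketch of this loop, with `synthesize_edit` and `predict` as hypothetical stand-ins for the Sketch solver calls (the real system batches counterexamples by lexeme and runs many such searches in parallel):

```python
def incremental_synthesis(data, synthesize_edit, predict, batch_size=4, max_edit=2):
    """Grow a theory T by repeatedly explaining small batches of
    counterexamples with bounded edits to the current theory."""
    theory = None           # T_0: start from the empty theory
    explained = []          # X_t: data the solver has been asked to cover
    remaining = list(data)  # (form, meaning) pairs in occurrence order
    while remaining:
        # Counterexamples: observations the current theory mispredicts.
        counterexamples = [(f, m) for (f, m) in remaining
                           if theory is None or predict(theory, m) != f]
        if not counterexamples:
            break  # everything left is already explained
        batch = counterexamples[:batch_size]
        explained += batch  # X_{t+1} = X_t ∪ batch
        # Ask the solver for a nearby theory (edit distance <= max_edit
        # from T_t) that explains everything accumulated so far.
        theory = synthesize_edit(explained, base=theory, max_edit=max_edit)
        remaining = [fm for fm in remaining if fm not in batch]
    return theory
```

Bounding each step to a small edit of the previous theory is what keeps the individual SAT queries tractable, at the price of the completeness guarantees noted above.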
Probabilistic framing for evaluation and few-shot learning: For few-shot prediction and held-out stems, the authors define a likelihood P^lik(X | T, L) = ∑_{L'} P(X | T, L ∪ L') P(L' | UG), marginalizing over stems L' unseen in L, and lower-bound it by taking the most likely underlying stem per row. They relate this to the MAP objective by partitioning the lexicon into affixes and stems. For artificial grammar learning, they treat the training set as a single-column paradigm (one inflection, many stems) and compute P(X_test | X_train) by marginalizing over Pareto-optimal grammars (as found by Sketch's multiobjective optimization), approximating P^lik with the shortest compatible stem under (T, L).
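In the section's notation, the held-out likelihood and its lower bound (tight when each observed row pins down a single underlying stem) are:

```latex
P^{\mathrm{lik}}(X \mid T, L)
  \;=\; \sum_{L'} P(X \mid T, L \cup L')\, P(L' \mid \mathrm{UG})
  \;\ge\; \max_{L'}\, P(X \mid T, L \cup L')\, P(L' \mid \mathrm{UG})
```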
Meta-theory (hierarchical learning): To discover cross-language biases, the authors learn a fragment grammar M (a PCFG over phonological rule syntax) that induces reusable rule schemas. They optimize a variational-particles lower bound, maximizing log P(M) + Σ_d log Σ_{T_d, L_d} P(X^d | T_d, L_d) P(T_d, L_d | M) by alternating between (i) finding the top-k (k = 100) high-probability grammars per dataset under the current M via synthesis, and (ii) updating M by proposing fragments via anti-unification over rule trees and re-estimating PCFG parameters with Inside-Outside (Dirichlet prior, pseudocount 1). P(T, L | M) assumes rules are generated independently by M, and that L is independent of M.
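The quantity being hill-climbed, with the intractable inner sum restricted to a particle set 𝒫_d of the top k = 100 grammars per dataset d (notation ours):

```latex
M^{*} \;=\; \operatorname*{arg\,max}_{M}\;
  \log P(M)
  \;+\; \sum_{d} \log \sum_{(T_d,\, L_d)\, \in\, \mathcal{P}_d}
  P(X^{d} \mid T_d, L_d)\, P(T_d, L_d \mid M)
```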
Allophony problems: When given an allophone substitution mapping, the system must infer rules mapping between underlying and surface allophones; underlying forms are permitted in the lexicon, while alternate allophones must be derived by rule. For example, given an alternation such as Spanish [d]~[ð], the model must decide whether /d/ is underlying and spirantizes to [ð] in certain contexts, or the reverse. The system solves for both directional hypotheses and selects between them via the posterior.
Implementation: Python 2.7 orchestration, Sketch SAT-based synthesis; multi-core parallelization for incremental counterexample-guided search.
Coverage across typologically diverse datasets: Applied to 70 textbook-style morphophonology problems spanning 58 languages (including tonal languages, vowel harmony, assimilation, and epenthesis), the system synthesized human-interpretable grammars and lexica. It exactly matched the full gold lexicon in 60% of benchmarks and correctly explained the majority of the lexicon in 79% of problems. Manual grading on 15 problems showed that rule precision and recall correlate positively with lexicon accuracy; when the lexicon is fully correct, the system rarely introduces spurious rules and typically recovers all true rules.
Efficiency and convergence: With 40 cores and a 24-hour timeout, the full model typically converged within half a day, with an average solution time of 3.6 hours. The incremental, counterexample-guided approach (the "full" model) outperformed synthesizing the entire theory in one shot ("CEGIS batch"), particularly on harder, non-allophony tasks.
Importance of representations: Richer articulatory features outperformed simpler acoustic features, and removing key representational capacities (e.g., the Kleene star, or cross-phoneme feature generalization) drastically reduced the fraction of problems solvable. Compared with SyPhon (2019), which runs fast enough for interactive use but assumes independence between lexicon and rules and disallows non-local rules, this system solved substantially more of the benchmark (SyPhon failed on 76% of it under those constraints), at the cost of longer runtimes.
Qualitative correctness on complex languages: The model captured interacting nonlocal tone rules in Kerewe and multiple vowel harmony processes in Sakha, among other phenomena (also demonstrated for Yowlumne, Serbo-Croatian, Turkish, Hungarian, and Lumasaba). Some partial failures illustrate the model's limits: in those cases, approximately 20% of the data remained unexplained.
Few-shot artificial grammar learning (AGL): The model discriminated between grammars such as ABB, ABA, AAx, AxA, and Pig Latin from a few examples, mirroring infant AGL findings. Syllabic representations improved few-shot generalization, though phoneme-level copying rules sufficed given more examples. Pareto-front analyses showed how additional examples sharpen preferences toward the correct grammar (the appearance of a "kink" near the true rule set). For a Polish problem, the Pareto front surfaced an unintended pattern (a limited vowel inventory) as an alternative analysis even though the MAP solution was correct, demonstrating the diagnostic value of this view.
Meta-theory (cross-language biases): Joint hierarchical inference over 30 training problems produced a fragment-grammar meta-model encoding reusable rule schemas (e.g., vowel harmony, spirantization, nasal place assimilation). Re-solving the hardest problems with this meta-model increased the fraction of the lexicon solved by an average of 31%; on Sakha, solutions improved from 4% to 100%, enabling synthesis of six interacting rules. The 21 discovered schemas are human-interpretable and align with typological tendencies identified by linguists.
The work demonstrates that automated, interpretable theory induction is feasible for core aspects of human morphophonology by integrating Bayesian program learning with constraint-based program synthesis and linguistically motivated representations. As with human linguists, successful inference depends critically on higher-level biases and constraints; here, these are instantiated as a universal-grammar-inspired program space and a learned meta-theory capturing cross-language tendencies. The method bridges natural-language analysis and infant artificial grammar learning: the same inductive machinery explains textbook phonological phenomena and few-shot AGL behavior, supporting the view that common cognitive resources underlie both tasks.
As a computational tool for phonology, the system allows mapping, scoring, and comparing analyses under different objectives and inductive biases across languages, enabling quantitative exploration of extensionally equivalent alternatives (e.g., deletion vs. epenthesis). The Pareto-front view reveals ambiguity in few-shot settings and offers linguistic "debugging" by exposing plausible competing grammars consistent with the data.
The approach underscores the promise of hybrid neuro-symbolic models to combine crisp, systematic productivity (symbolic rules over features) with graded generalizations (distributional tendencies), and suggests that scaling theory induction will benefit from jointly solving many related problems to synthesize over-hypotheses (meta-theories) that both capture common structure and accelerate future learning. The broader agenda aligns with automating aspects of scientific discovery by leveraging program induction, hierarchical Bayesian inference, and reusable conceptual fragments.
This paper introduces a framework that synthesizes human-interpretable morpho-phonological theories by combining Bayesian Program Learning with constraint-based program synthesis over linguistically structured hypothesis spaces. Across 70 datasets from 58 languages, the system recovers much of the correct lexicon and rule behavior, learns few-shot artificial grammars, and discovers a cross-language meta-theory that encodes typological rule schemas and measurably improves performance (average 31% gain on the hardest problems, with complete recovery in challenging cases like Sakha).
Main contributions include: (1) a scalable, incremental counterexample-guided synthesis algorithm for grammar induction; (2) a principled representation of phonological rules as SPE-style rewrite programs over feature structures within a BPL framework; (3) probabilistic evaluation via Pareto-optimal grammars enabling few-shot predictions; and (4) hierarchical meta-theory induction using fragment grammars to capture and reuse cross-language motifs.
Future directions: extend coverage beyond concatenative morphology to nonconcatenative processes (e.g., reduplication), relax simplifying assumptions about fusional affixes and known inflectional classes, integrate richer neural representations within the symbolic framework for graded generalization, improve scalability and interactive runtimes, and broaden application to other domains where child and scientist learning overlap (e.g., intuitive physics, cognitive models). Developing capabilities for proposing novel theoretical constructs and designing discriminative experiments would further advance automated scientific discovery.
Scope and representational limits: The approach assumes concatenative morphology and a fusional framing where inflectional combinations map to single affixes; nonconcatenative processes (e.g., reduplication) are not modeled. It presumes morphemes are categorized (prefix/suffix/stem) and uses ordered, non-cyclic SPE-style rewrite rules (finite-state expressivity), potentially missing phenomena requiring richer formalisms.
Data and supervision: Evaluations are on simplified, textbook-style datasets with tens to hundreds of words, often isolating specific phenomena and exposing inflectional classes in the meanings; this is easier than full language acquisition. Allophony problems rely on provided allophone mappings.
Computation and search: Exact SAT-based synthesis does not scale to large corpora; the incremental counterexample-guided heuristic sacrifices completeness and optimality guarantees. Runtimes (average 3.6 hours with 40 cores) are not interactive compared to faster but less expressive alternatives (e.g., SyPhon), and performance depends strongly on feature representations (simpler features degrade results).
Statistical approximations: Few-shot predictions marginalize over Pareto-optimal grammars rather than the full posterior; likelihoods use lower bounds that are tight when stems are unambiguous. Meta-theory learning assumes independent rule generation and hill-climbing over fragment grammars, which may find suboptimal priors.