Non-Māori-speaking New Zealanders have a Māori proto-lexicon

Y. Oh, S. Todd, et al.

Lifelong exposure to Māori equips non-Māori-speaking New Zealanders with a substantial proto-lexicon and with phonotactic knowledge close to that of fluent speakers. This research, conducted by Y. Oh, S. Todd, C. Beckner, J. Hay, J. King, and J. Needle, examines the implicit language knowledge that develops in everyday environments.

Introduction

The study asks whether adults who are regularly exposed to a language they do not speak develop an implicit proto-lexicon—knowledge of word forms and sub-word units without meanings—and whether they can generalize over this knowledge to form phonotactic judgments. While active second-language learning emphasizes explicit vocabulary acquisition, infants acquire a proto-lexicon implicitly before meanings are established. New Zealand provides an ecologically valid context: most New Zealanders encounter Māori frequently in everyday life yet have small active Māori vocabularies. The authors test whether non-Māori-speaking New Zealanders (NMS) can (1) distinguish Māori words from Māori-like nonwords and (2) use statistical generalizations over their proto-lexicon to rate the well-formedness of novel Māori-like nonwords, potentially approaching the performance of fluent Māori speakers.

Literature Review

Prior work shows infants and adults can use statistical learning to segment speech and build lexical knowledge from distributional cues, with such learning being robust across the lifespan. Adults retain implicit knowledge of artificial languages and can relearn previously known words more easily after attrition. However, much of this literature relies on artificial languages with limited ecological validity. Māori presents a naturalistic case: widespread ambient exposure in New Zealand (songs, greetings, place names, loanwords) coexists with generally small active Māori vocabularies among NMS (about 70–80 words known semantically). Phonotactic well-formedness judgments in prior studies track lexical statistics, often modeled via n-gram probabilities computed from dictionaries, and generalization tends to be over types rather than tokens. Contemporary Māori production shows vowel length distinctions are not always stably produced, suggesting potential simplifications in modeling phonotactics.

Methodology

Experiment 1 (Word identification): An online rating task tested NMS ability to discern Māori words from Māori-like nonwords. Stimuli: 1,000 Māori words spanning five frequency bins (from very frequent to rare) and 1,000 closely matched Māori-like nonwords, with both sets covering a range of phonotactic probability scores and phoneme lengths (3–12 phonemes). Each word was paired with a nonword of similar phonotactic score. Each participant saw 300 randomized stimuli drawn as 30 word–nonword pairs from each frequency bin, without explicit pairing on presentation. Participants rated confidence that an item was a Māori word on a 1–5 scale. Participants: 85 adult NMS after exclusions, most reporting very basic Māori knowledge and low-level ongoing exposure. Analysis: ordinal mixed-effects regression models assessed effects of word/nonword status and phonotactic probability.
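The per-participant sampling described above (30 word–nonword pairs from each of five frequency bins, shown in random order without visible pairing) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `build_stimulus_list`, the record fields, the placeholder item strings, and the assumption of 200 pairs per bin are all hypothetical.

```python
import random

def build_stimulus_list(pairs_by_bin, pairs_per_bin=30, seed=None):
    """Draw word-nonword pairs from each frequency bin, then shuffle all items.

    pairs_by_bin maps a bin label to a list of (word, nonword) tuples.
    Pairing is used only for sampling; the returned list is fully shuffled,
    so the two members of a pair are not presented together.
    """
    rng = random.Random(seed)
    items = []
    for bin_id, pairs in pairs_by_bin.items():
        for word, nonword in rng.sample(pairs, pairs_per_bin):
            items.append({"item": word, "is_word": True, "bin": bin_id})
            items.append({"item": nonword, "is_word": False, "bin": bin_id})
    rng.shuffle(items)
    return items

# Placeholder pool. ASSUMPTION: five equally sized bins of 200 pairs each
# (the summary gives 1,000 pairs in total but not the per-bin counts).
pool = {
    b: [(f"word_{b}_{i}", f"nonword_{b}_{i}") for i in range(200)]
    for b in range(1, 6)
}
trial_list = build_stimulus_list(pool, seed=1)
print(len(trial_list))  # 5 bins x 30 pairs x 2 items
```

With five bins and 30 pairs per bin, each participant receives 150 words and 150 nonwords, which matches the 300 stimuli reported.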

Experiment 2 (Well-formedness ratings): Online rating tasks on Māori-like nonwords. Stimuli: 1,760 nonwords (240–320 per length) spanning phoneme lengths 3–8. Task: rate how good each item would be as a real Māori word (1 = non-Māori-like nonword, 5 = highly Māori-like nonword). Participants: three groups, namely NMS (N = 113; primarily New Zealand English L1, with 12 long-term residents born outside NZ), fluent Māori speakers (MS; N = 40), and non-Māori-speaking Americans (US; N = 94). Phonotactic modeling: trigram phoneme probabilities were computed from various sources to predict ratings, comparing models built over dictionary word types with models built over experiential streams (tokens), and testing assumptions about vowel length (tracked vs. not tracked). Ordinal mixed-effects models compared predictive performance; the best-fitting approach used dictionary type statistics without vowel length distinctions, while retaining sensitivity to the visual presence of macrons.
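To make the type-based trigram scoring concrete, here is a minimal sketch of one way such a phonotactic score could be computed. It is not the authors' implementation: the class name `TrigramModel`, the add-one smoothing, the boundary symbols, the length normalization, and the seven-word toy lexicon are all assumptions made for illustration. It does follow two points from the text: counts are taken over word types (duplicates are collapsed), and vowel length is not tracked (macrons are stripped before counting).

```python
import math
from collections import Counter

# Map long (macron) vowels to short ones, i.e. do not track vowel length.
STRIP_MACRONS = str.maketrans("āēīōū", "aeiou")

def to_phonemes(word, track_length=False):
    """Split a Māori spelling into phonemes; 'ng' and 'wh' are single units."""
    word = word.lower()
    if not track_length:
        word = word.translate(STRIP_MACRONS)
    out, i = [], 0
    while i < len(word):
        step = 2 if word[i:i + 2] in ("ng", "wh") else 1
        out.append(word[i:i + step])
        i += step
    return out

class TrigramModel:
    """Trigram model over word TYPES with add-alpha smoothing (an assumption)."""

    def __init__(self, types, alpha=1.0):
        self.alpha = alpha
        self.tri, self.ctx = Counter(), Counter()
        symbols = set()
        for w in set(types):  # each type counts once, whatever its token frequency
            seq = ["<w>", "<w>"] + to_phonemes(w) + ["</w>"]
            symbols.update(seq[2:])
            for a, b, c in zip(seq, seq[1:], seq[2:]):
                self.tri[(a, b, c)] += 1
                self.ctx[(a, b)] += 1
        self.v = len(symbols)  # number of possible next symbols, incl. end marker

    def score(self, word):
        """Mean log-probability per trigram (length-normalized)."""
        seq = ["<w>", "<w>"] + to_phonemes(word) + ["</w>"]
        logp = 0.0
        for a, b, c in zip(seq, seq[1:], seq[2:]):
            num = self.tri[(a, b, c)] + self.alpha
            den = self.ctx[(a, b)] + self.alpha * self.v
            logp += math.log(num / den)
        return logp / (len(seq) - 2)

# Toy lexicon standing in for a dictionary; far too small for real use.
toy_types = ["aroha", "whare", "kai", "tangata", "mana", "waka", "moana"]
model = TrigramModel(toy_types)
print(model.score("tana") > model.score("ktk"))  # True on this toy lexicon
```

A token-based variant would simply skip the `set(types)` step and count every occurrence in a running text, which is the contrast the model comparison above tests.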

Monte Carlo simulations (word-based): To estimate proto-lexicon size sufficient to capture NMS ratings, dictionaries were subsampled to sizes from 1,000 to 18,000 types using three schemes: unweighted (uniform), frequency-weighted, and N-highest-frequency. For each subsample and participant group (NMS, MS), phonotactic probabilities were recomputed and separate ordinal regression models fitted; model fit was compared via AIC.
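The three subsampling schemes can be sketched as below. This is an illustrative reading rather than the published procedure: the function names, the toy frequency counts, and the choice to draw the frequency-weighted sample without replacement (here via Efraimidis–Spirakis random keys) are assumptions.

```python
import random

def sample_unweighted(freqs, n, rng):
    """Uniform sample of n types, ignoring frequency."""
    return set(rng.sample(sorted(freqs), n))

def sample_frequency_weighted(freqs, n, rng):
    """Sample n distinct types with probability increasing in frequency.

    Uses random keys u ** (1 / weight) and keeps the n largest
    (weighted sampling without replacement; weights must be positive).
    """
    keys = {w: rng.random() ** (1.0 / f) for w, f in freqs.items()}
    return set(sorted(keys, key=keys.get, reverse=True)[:n])

def sample_top_n(freqs, n):
    """Deterministic: the n highest-frequency types (ties broken alphabetically)."""
    return set(sorted(freqs, key=lambda w: (-freqs[w], w))[:n])

# Toy frequency table; the counts are invented for illustration only.
freqs = {"te": 900, "ki": 500, "whare": 40, "moana": 25, "pūkeko": 3, "kererū": 1}
rng = random.Random(0)

uniform_sub = sample_unweighted(freqs, 3, rng)
weighted_sub = sample_frequency_weighted(freqs, 3, rng)
top_sub = sample_top_n(freqs, 3)
print(sorted(top_sub))
```

In a full simulation, each subsample would be used to recompute phonotactic probabilities and refit the regression, repeated many times for the two random schemes; the highest-frequency scheme is deterministic and yields a single subsample per size.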

Morph-based analyses: A fluent Māori speaker segmented dictionary entries into 3,636 morphs (including simplex words and excluding many morphologically complex forms due to compounding). Phonotactic probabilities were computed over morph types, and Monte Carlo sampling (500–3,500 morphs; unweighted, frequency-weighted, N-highest-frequency) assessed how many morphs suffice to predict NMS and MS ratings. Two parsing assumptions were tested: stimuli phonotactically parsed into morphs vs left unparsed. Ordinal mixed-effects regression with AIC comparisons evaluated predictive adequacy.
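Both the word-based and the morph-based analyses compare fitted models by AIC, where a lower value indicates a better trade-off between fit and complexity. The sketch below shows the standard computation, AIC = 2k − 2·ln(L), and the differences from the best model. The model names, log-likelihoods, and parameter counts are placeholders, not values from the study.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 * ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood

def delta_aic(candidates):
    """Return each model's AIC minus the best (lowest) AIC in the set.

    candidates maps a model name to (log_likelihood, n_params).
    """
    scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
    best = min(scores.values())
    return {name: s - best for name, s in scores.items()}

# PLACEHOLDER inputs for illustration only; not results from the paper.
fits = {"model_a": (-1520.0, 12), "model_b": (-1500.0, 12)}
deltas = delta_aic(fits)
print(deltas)  # the model with the higher log-likelihood gets a delta of 0
```

Because the candidate models here share the same number of parameters, the comparison reduces to a comparison of log-likelihoods; the parameter penalty matters only when model structures differ.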

Ethics: Approved by the Human Ethics Committee of the University of Canterbury; informed consent obtained; experiments conducted in accordance with guidelines.

Key Findings

Experiment 1: NMS gave significantly higher ratings to real Māori words than to closely matched Māori-like nonwords across all five frequency bins, including very low-frequency words. Ratings increased with phonotactic probability. Crucially, beyond phonotactic probability, there was a significant main effect of word/nonword status, indicating implicit knowledge of specific items.
Experiment 2: Well-formedness ratings increased with phonotactic probability in all groups. US participants showed only a small sensitivity. NMS and MS exhibited strong, overlapping sensitivities, indicating substantial Māori-specific phonotactic knowledge among NMS approaching that of MS.
Source of phonotactic knowledge: Models using dictionary word-type statistics best predicted ratings, especially when vowel length distinctions were not tracked in phonotactic computations (though participants were sensitive to macrons visually).
Proto-lexicon size (word-based): Monte Carlo simulations showed that a vocabulary of 3,000–6,000 word types, sampled to reflect experiential frequency (frequency-weighted or highest-frequency), predicted NMS and MS ratings as well as or better than the full 18,703-word dictionary; random unweighted samples were insufficient.
Borrowings insufficient: The subset of Māori words and place names borrowed into New Zealand English did not provide adequate phonotactic knowledge to explain NMS ratings, implying a larger Māori-specific proto-lexicon.
Morph-based model: Participants’ ratings were best predicted by phonotactic knowledge based on morph types. Approximately 1,500 of the most common morphs sufficed to adequately predict NMS ratings, outperforming the smallest adequate word-based sets. NMS ratings were better predicted when assuming they attempted to parse stimuli into morphs; MS ratings were better predicted without such parsing, reflecting grammatical parsing constraints.
Exposure effects: Higher self-reported Māori exposure was associated with only a small, non-significant trend toward greater sensitivity to phonotactic probability, suggesting that the prolonged incidental exposure shared by NMS is itself sufficient for acquiring a substantial proto-lexicon.

Discussion

The findings demonstrate that adults can implicitly acquire a substantial proto-lexicon from long-term ambient exposure to a language they do not speak. NMS distinguished real Māori words from well-matched nonwords and used sophisticated, native-like phonotactic knowledge to rate novel nonwords. Their ratings were best modeled by phonotactic statistics computed over word types from a large dictionary and, more effectively, over morph types, indicating generalization over sub-word units learned from exposure. Monte Carlo analyses suggest that the effective proto-lexicon need not encompass the entire dictionary: a frequency-biased subset of roughly 3,000–6,000 words or about 1,500 common morphs can provide sufficient statistical structure. The close match between NMS and MS in phonotactic sensitivity underscores the power of passive statistical learning, while differences in morphological parsing strategies imply deeper differences in grammatical knowledge and parsing constraints. Overall, the study addresses the research question by showing that ambient exposure can yield substantial form-based lexical knowledge in adults, supporting ecologically valid accounts of statistical learning beyond artificial language paradigms.

Conclusion

Ambient exposure enables non-Māori-speaking New Zealanders to acquire a sizable Māori proto-lexicon comprising words and, critically, morphs, allowing them to identify real words above phonotactic expectations and to make phonotactic judgments comparable to fluent speakers. The most explanatory models derive from dictionary type-based statistics without vowel length contrasts and, even more strongly, from morph-based statistics, with about 1,500 frequent morphs sufficient to account for NMS judgments. These results highlight the robustness and ecological validity of adult statistical learning and its role in building proto-lexical knowledge from passive exposure.

Limitations

Not all NMS performance can be attributed solely to a Māori proto-lexicon; some sensitivity may arise during the task, from other language experience, or universal phonotactic tendencies (as suggested by US participants’ small, non-zero sensitivity).
The well-formedness stimuli were designed to match Māori phonotactics but were not constrained by a full grammar of word construction, which may have contributed to the differences in parsing strategies between NMS and MS.
Phonotactic modeling did not track vowel length distinctions, in line with reported instability in production; although this simplification improved model fit, it may not capture all phonological contrasts.
Morphological segmentation relied on a single fluent speaker’s analysis, which may introduce subjective segmentation choices.
The study used written stimuli as proxies for phonotactics; although justified by Māori orthography, results may not fully generalize to auditory processing without further testing.
