What artificial intelligence might teach us about the origin of human language

Computer Science


A. Kilpatrick

Delve into Alexander Kilpatrick's study of how AI research leverages sound symbolism. Machine learning models trained on Pokémon names across languages overpredict perceived threats, a skew that may mirror our instinctive caution. This work examines the intersection of technology and human psychology.

Introduction
The study investigates whether supervised machine learning models trained on sound features in names exhibit a bias toward classifying items into categories associated with greater threat, and what this may reveal about the origins of human language. Traditional views emphasize gesture-first origins of language and arbitrariness of the sound–meaning relationship. However, accumulating evidence shows systematic, probabilistic sound symbolism across languages (for example, high-front [i] associated with smallness and [a] with largeness, and the bouba/kiki shape effect). Within the specialized domain of Pokémonastics—sound symbolism in Pokémon names—sound-symbolic patterns have been found to be stronger for attributes important to in-game survivability. Prior machine learning work using random forests on Pokémon names and on human given names showed above-chance classification accuracy accompanied by a systematic skew in errors: false positives tended to overclassify into categories arguably associated with greater threat (e.g., post-evolution or male). Framed by error management theory (EMT), which posits selection for biases that minimize costly errors under uncertainty, the author hypothesizes that ambiguous vocal signals are more often interpreted as threats, yielding an overrepresentation of false positive errors in threat-relevant classifications. The present study tests this hypothesis using XGBoost models across Chinese, Japanese, and Korean Pokémon name datasets for combat-related (Attack, Defend) and size-related (Height, Weight) attributes.
Literature Review
Evidence for non-arbitrary sound–meaning mappings is long-standing and cross-linguistic. Classic work has shown systematic associations between phonetic features and perceived size, often explained by the frequency code. Cross-linguistic robustness of shape-related symbolism (bouba/kiki) further supports non-arbitrariness. In Pokémonastics, attributes tied to survivability show stronger sound-symbolic cues than those with little game impact. Cross-language analyses found plosive counts correlate with perceived friendship: [p] with higher friendship across languages and [g] with lower friendship where voicing contrasts exist. Supervised ML studies using random forests to classify Pokémon evolution status (pre vs. post) and human given-name gender (female vs. male) achieved above-chance performance. Notably, confusion matrices and feature importance analyses indicated higher error rates for the less-threatening category and feature distributions skewed toward the threatening category. This pattern suggests a potential alignment with EMT: systems may evolve to favor false alarms (false positives) over misses (false negatives) in threat detection. Animal alarm call literature (e.g., vervet monkeys) provides convergent evidence of adaptive biases under uncertainty. These findings motivate testing whether such error skews persist with a different algorithm (XGBoost) and whether they are stronger for combat-related than size-related classifications.
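The error skew at the heart of these studies can be made concrete: given binary predictions, the relevant quantity is the false-positive share of errors, FP / (FP + FN), where the "positive" label is the threat-coded class (e.g., post-evolution, male). A minimal sketch in plain Python, with toy labels invented for illustration:

```python
def fp_error_share(y_true, y_pred, positive=1):
    """Share of a classifier's errors that are false positives,
    where 'positive' is the threat-coded class.
    Returns None if the model made no errors."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    errors = fp + fn
    return fp / errors if errors else None

# Toy example: 3 false positives vs. 1 false negative -> 0.75 FP share,
# i.e., errors skew toward the threat-coded class.
y_true = [0, 0, 0, 1, 1, 1, 0, 1]
y_pred = [1, 1, 1, 0, 1, 1, 0, 1]
print(fp_error_share(y_true, y_pred))  # -> 0.75
```

A share above 0.5 means the model's mistakes lean toward "threat," which is the signature the EMT account predicts.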
Methodology
Data consisted of Pokémon names in Japanese, Korean, and Chinese sourced from an online repository. Names were converted into counts of speech features (e.g., phonemes, tones), producing a sparse feature matrix (many null values) of 898 samples. To mitigate overfitting in tree-based models, 3-fold cross-validation was used: each of three iterations trained on two folds and tested on the remaining fold. The dependent variables were four continuous in-game attributes: Attack and Defend (combat-related) and Height and Weight (size-related). Because higher values should correspond to higher threat, each variable was dichotomized by a median split; median cases were removed, and classes were balanced by randomly omitting samples from the majority class. For each language and dependent variable, XGBoost models were trained and evaluated. Two hypotheses were tested statistically on per-iteration results (3 languages × 4 variables × 3 folds = 36 model runs): H1, that the false positive (FP) error rate would exceed 50% for all models; and H2, that the FP error rate would be higher for combat models (Attack, Defend) than for size models (Height, Weight). To assess H1, intercept-only linear models compared FP rates against 50% chance. To assess H2, simple linear models compared FP rates between combat and size models. Additionally, to probe the alternative explanation that longer names might drive the FP skew ("longer-is-stronger"), simple linear regressions tested relationships between the continuous attributes and name length (count of features, excluding tones for Chinese) across languages, including previously omitted samples. XGBoost hyperparameters followed standard practice for sparse, high-dimensional features, with attention to the number of features considered at each split and cross-validation for robustness. Data were made available at https://tinyurl.com/mw9u43ya.
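The per-language, per-attribute procedure can be sketched end to end: median split with median cases dropped, random downsampling to balance classes, then 3-fold cross-validation with an FP share computed on each held-out fold. The feature matrix and attribute values below are simulated stand-ins for the real name data, and scikit-learn's GradientBoostingClassifier stands in for xgboost.XGBClassifier (which exposes the same fit/predict interface); this is an illustration of the pipeline shape, not the study's code.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the real data: 898 names, sparse
# speech-feature counts, one continuous attribute (e.g., Attack).
X = rng.poisson(0.3, size=(898, 120)).astype(float)
attack = rng.integers(5, 190, size=898).astype(float)

# Median split: drop median cases, code above-median as the high-threat class.
med = np.median(attack)
keep = attack != med
X, y = X[keep], (attack[keep] > med).astype(int)

# Balance classes by randomly omitting samples from the majority class.
n_min = np.bincount(y).min()
idx = np.concatenate([rng.choice(np.flatnonzero(y == c), n_min, replace=False)
                      for c in (0, 1)])
X, y = X[idx], y[idx]

# 3-fold cross-validation: each fold is held out once for testing.
fp_shares = []
for train, test in StratifiedKFold(n_splits=3, shuffle=True,
                                   random_state=0).split(X, y):
    model = GradientBoostingClassifier().fit(X[train], y[train])
    pred = model.predict(X[test])
    fp = np.sum((y[test] == 0) & (pred == 1))  # false positives
    fn = np.sum((y[test] == 1) & (pred == 0))  # false negatives
    fp_shares.append(fp / (fp + fn))

print([round(s, 3) for s in fp_shares])  # one FP share per fold
```

Repeating this for each language and each of the four attributes yields the 36 per-iteration FP shares the hypothesis tests operate on.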
Key Findings
- All XGBoost models achieved accuracy above chance (>50%).
- FP error skew toward the "higher-threat" class was present for combat models across languages, but not consistently for size models.
- Aggregate XGBoost results (Table 4), accuracy / FP%:
  • Attack: Japanese 59.15% / 51.93%; Chinese 57.04% / 57.08%; Korean 56.10% / 55.47%; average 57.43% / 54.83%.
  • Defend: Japanese 59.16% / 55.96%; Chinese 55.66% / 53.35%; Korean 54.07% / 55.64%; average 56.30% / 54.98%.
  • Height: Japanese 63.03% / 48.82%; Chinese 59.27% / 44.17%; Korean 57.77% / 48.77%; average 60.03% / 47.26%.
  • Weight: Japanese 62.33% / 47.97%; Chinese 57.51% / 54.05%; Korean 55.38% / 51.50%; average 58.41% / 51.17%.
- Hypothesis tests (on per-iteration results):
  • H1: Only the Attack model showed FP significantly above 50%: t(8)=2.6, p=0.031. Defend p=0.145; Height p=0.522; Weight p=0.699. Combining the combat variables, FP exceeded chance, t(17)=2.8, p=0.012; the combined size variables were not significant (p=0.757).
  • H2: FP in combat variables (M=55%, SD=7%) was significantly higher than in size variables (M=49%, SD=10%): t(34)=25.55, p<0.001.
- Longer-is-stronger exploration: simple linear regressions revealed significant positive correlations (p<0.05) between name length and Attack, Defend, Height, and Weight in most datasets, except for Chinese Height (p=0.67) and Chinese Weight (p=0.70).
- Prior RF-based studies cited showed similar error skews toward post-evolution and male classifications, with feature importance distributions favoring the threatening category, aligning with the pattern observed here.
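The hypothesis tests above reduce to standard comparisons of per-iteration FP shares. A sketch using scipy with illustrative numbers (not the study's data): a one-sample t-test of combat FP shares against the 50% chance level for H1, a two-sample comparison of combat versus size models for H2, and a simple regression for the longer-is-stronger probe.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Illustrative per-iteration FP shares (NOT the study's values):
# 18 combat runs drawn above 0.5, 18 size runs centred near 0.5.
combat_fp = rng.normal(0.55, 0.07, size=18)
size_fp = rng.normal(0.49, 0.10, size=18)

# H1: is the mean combat FP share above the 50% chance level?
# (An intercept-only linear model on fp - 0.5 yields the same t statistic.)
t1, p1 = stats.ttest_1samp(combat_fp, 0.5)
print(f"H1: t({len(combat_fp) - 1})={t1:.2f}, p={p1:.3f}")

# H2: do combat models show higher FP shares than size models?
# (Equivalent to regressing FP share on a combat/size indicator.)
t2, p2 = stats.ttest_ind(combat_fp, size_fp)
print(f"H2: t={t2:.2f}, p={p2:.3f}")

# Longer-is-stronger probe: regress a continuous attribute on name length
# (simulated so that longer names tend to have higher values).
length = rng.integers(3, 12, size=200)
attack = 5.0 * length + rng.normal(0, 20, size=200)
res = stats.linregress(length, attack)
print(f"slope={res.slope:.2f}, p={res.pvalue:.4f}")
```

With the study's real per-iteration FP shares in place of the simulated arrays, these three calls reproduce the structure of the H1, H2, and name-length analyses.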
Discussion
Findings partially support the central hypothesis: XGBoost models trained on sound features tend to overclassify into threat-relevant categories, particularly for combat attributes (Attack, Defend), consistent with an error management strategy favoring false alarms over misses. This pattern persisted despite using a different algorithm from prior RF studies, suggesting the phenomenon is not algorithm-specific within tree-ensemble methods. The absence of a robust FP skew in size-related models (Height, Weight) may reflect that these attributes have no direct in-game effect, consistent with earlier Pokémonastics findings that sound symbolism is stronger for attributes important to survivability. The alternative explanation that longer names (fewer null features) drive FP bias is not well supported: while name length correlates positively with attributes in most cases, it does not straightforwardly explain why FP skew concentrates in combat variables. The results imply that ambiguous sound-symbolic cues may be preferentially interpreted as signaling threat, a bias that could be adaptive under EMT. If such a bias is an evolutionarily rooted property of vocal communication, it challenges a strictly gesture-first origin of language and suggests that vocal signaling played a meaningful role early in language evolution. Cross-validation producing multiple runs enables traditional statistical evaluation of skew, strengthening inferential claims compared to single-model observations.
Conclusion
Using Pokémon names in Chinese, Japanese, and Korean, XGBoost models classified items into high/low Attack, Defend, Height, and Weight based on sound features. Models related to combat showed significantly elevated false positive rates relative to chance and relative to size models, indicating a bias toward classifying ambiguous inputs as higher threat. Interpreted through error management theory, this bias could be adaptive because false alarms are less costly than misses. Although preliminary and requiring broader validation, the findings suggest that aspects of human sound symbolism may be tuned toward perceiving threat, with implications for theories of language origins that include a substantive vocal component alongside gesture. Future research should extend to natural language corpora, more languages, transparent modeling approaches, and experimental tests with human listeners to link machine patterns to human perception.
Limitations
- The corpus comprises fictional character names (Pokémon), not natural language usage, limiting ecological validity.
- Only three East Asian languages (Chinese, Japanese, Korean) were studied, constraining cross-linguistic generalizability.
- Tree-ensemble models such as XGBoost and random forests are relatively opaque, making causal interpretation of feature–error relationships difficult.
- The study risks anthropomorphizing machine learning by inferring human-like biases from algorithmic behavior.
- Potential confounds such as name length and feature sparsity are explored but not fully resolved.
- The work has not undergone peer review and should be interpreted cautiously.