Linguistics and Languages
Augmenting a colour lexicon
D. Mylonas, S. Caparos, et al.
The study investigates how languages augment their colour lexicons, focusing on the Himba of Namibia who were previously reported to have a 5-term GRUE (green-blue conflated) system. The authors aim to identify mechanisms driving the increase in the number of colour categories, examining whether perceptual universals or cultural/linguistic factors are primary. They revisit the Himba with a computer-based method sampling colours across the full gamut (including desaturated interior), overcoming limitations of previous saturated-only swatch studies. The work also frames the perceptual vs linguistic debate through machine learning—unsupervised (perceptual similarity) vs supervised (linguistic similarity)—to determine which better accounts for effective colour communication and category augmentation.
Classical accounts (Berlin and Kay) proposed an ordered acquisition of basic colour terms rooted in universal physiology and opponent-process mechanisms, with categories emerging via partitioning of a universal space. Alternative approaches include optimal partitioning of perceptual colour space (e.g., Regier et al.), categorical perception shaped by language, and culturally driven hypotheses such as emergence in previously unnamed/inconsistently named regions and loanwords. Empirical reports note changes in lexicons over time (e.g., modern Japanese with a new light-blue term; English augmentation to include turquoise and lilac). Prior cross-cultural studies typically used saturated stimuli and offered single snapshots, limiting understanding of interior/desaturated regions and temporal change. Computationally, unsupervised clustering (e.g., k-means) yields universal categories based on perceptual structure, while supervised models learn language-specific categories from labelled data; prior models often constrained outputs to the 11 basic terms. The authors propose using Rotated Split Trees (RST), an ensemble decision-tree method, to infer indispensable colour terms from labelled data and compare it to unsupervised k-means across datasets and colour spaces.
Participants: 55 native Himba speakers (23 female, 32 male; mean age 27.4 years, range 16–60; mean schooling 1.4 years, 38 with no schooling) from remote villages in north-west Namibia. Participants were compensated with flour. Ethics approval was granted by Goldsmiths, University of London (No. 1390, 04/06/2018).
Stimuli and apparatus: 600 colour samples were presented as 2° discs with a 1-pixel black outline against a neutral grey background (40 cd/m²). The samples comprised 589 approximately uniformly distributed Munsell Renotation colours (restricted to the sRGB gamut) plus 11 achromatics. Sampling followed Billmeyer guidance (variable hue counts with increasing chroma); overlap with the WCS (World Color Survey) surface colours was 91%. Stimuli were shown on two calibrated 10.1" Asus Transformer Mini T102HA displays, with white points near 6816–6907 K and minimal drift (<0.003 in x,y), controlled via PsychoPy 1.84.2.
Procedure: Seated in a tent (~80 cm viewing distance), participants named each stimulus aloud freely (no enforced modifiers or required “don’t know”). Responses were audio-recorded and transcribed by a native-language assistant.
Data processing: Of 33,000 raw responses, unique single-observer terms (0.8%) and “don’t know” responses (665 across 434 samples) were excluded, yielding 32,087 responses. Frequency and modal analyses were conducted; consensus was defined as the peak P(name|colour).
Computational modelling:
- Supervised: Rotated Split Trees (RST) ensemble of B=100 trees. Each tree uses full training data with a random rotation via Householder QR; splits made independently of target variable; leaf predictions aggregated to yield probability distributions over names. RST favours consistent, frequent categories and can discover indispensable terms without predefining category count. Generalisation assessed with leave-one-out cross-validation (LOOCV), training on c−1 samples and predicting the left-out sample, aggregated across all c samples.
- Unsupervised: k-means clustering in CIELAB or sRGB, with k set to the number of observed modal categories (alternative values of k were also tested). To compare clusters with observed categories, centroids were matched using a distance matrix and the Munkres (Hungarian) assignment algorithm. Clustering was repeated 100 times to avoid local optima, and the best solution was selected by minimal mean Euclidean distance in CIELAB. Evaluations were conducted on: (1) the 160 Munsell-surface samples from the 2005 Himba dataset (CIELAB), (2) the current 600-sample dataset (CIELAB), and (3) the same 600 samples expressed in sRGB.
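The supervised procedure described above can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: the function names (`random_rotation`, `build_tree`, `rst_fit`, `rst_predict_proba`, `loocv_accuracy`), the uniform random split rule, the `min_leaf` stopping criterion, and the synthetic three-category data are all assumptions made for the example. It shows a random orthogonal rotation obtained via QR decomposition, splits chosen without consulting the labels, leaf counts averaged into a probability distribution over names, and leave-one-out cross-validation.

```python
import numpy as np

def random_rotation(dim, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q * np.sign(np.diag(r))  # sign fix removes the QR sign ambiguity

def build_tree(X, y, n_classes, rng, min_leaf=5):
    """Split on a random axis at a random threshold (assumed rule); y is never
    consulted when choosing a split. Leaves store label counts."""
    spans = X.max(axis=0) - X.min(axis=0)
    if len(X) <= min_leaf or not np.any(spans > 0):
        return np.bincount(y, minlength=n_classes)
    axis = rng.choice(np.flatnonzero(spans > 0))
    thr = rng.uniform(X[:, axis].min(), X[:, axis].max())
    left = X[:, axis] <= thr
    if left.all():  # guard against a degenerate threshold
        return np.bincount(y, minlength=n_classes)
    return (axis, thr,
            build_tree(X[left], y[left], n_classes, rng, min_leaf),
            build_tree(X[~left], y[~left], n_classes, rng, min_leaf))

def tree_predict(node, x):
    """Follow the splits down to a leaf and return its label proportions."""
    while isinstance(node, tuple):
        axis, thr, left, right = node
        node = left if x[axis] <= thr else right
    return node / node.sum()

def rst_fit(X, y, n_classes, n_trees=100, seed=0):
    """Each tree sees all training data under its own random rotation."""
    rng = np.random.default_rng(seed)
    forest = []
    for _ in range(n_trees):
        R = random_rotation(X.shape[1], rng)
        forest.append((R, build_tree(X @ R, y, n_classes, rng)))
    return forest

def rst_predict_proba(forest, x):
    """Average the leaf distributions over all trees."""
    return np.mean([tree_predict(tree, x @ R) for R, tree in forest], axis=0)

def loocv_accuracy(X, y, n_classes, n_trees=20):
    """Train on c-1 samples, predict the held-out one, repeat for all c."""
    hits = 0
    for i in range(len(X)):
        keep = np.arange(len(X)) != i
        forest = rst_fit(X[keep], y[keep], n_classes, n_trees, seed=i)
        hits += int(np.argmax(rst_predict_proba(forest, X[i])) == y[i])
    return hits / len(X)

# Synthetic stand-in for labelled colour samples (NOT the study's data):
# three well-separated clouds in a CIELAB-like 3-D space.
rng = np.random.default_rng(0)
centres = np.array([[50.0, 60.0, 40.0], [50.0, -50.0, 30.0], [40.0, 10.0, -50.0]])
X = np.vstack([c + 5.0 * rng.standard_normal((20, 3)) for c in centres])
y = np.repeat(np.arange(3), 20)
acc = loocv_accuracy(X, y, n_classes=3)
print(f"LOOCV accuracy: {acc:.2f}")
```

Because the number of names is taken from the labels rather than fixed in advance, a sketch of this kind does not need a predefined category count, which is the property the summary attributes to RST.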
- Expansion of Himba colour lexicon: From previously reported 5 terms (GRUE system) to 7 indispensable categories. Newly identified major categories: GREEN (grine) and BROWN (vinde).
- Frequency and consensus (current dataset, 600 samples): Most frequent terms: serandu (reddish) 19.6% (all 55 speakers), burou (bluish) 19.3% (all 55), grine (greenish) 12.4% (51 speakers), zoozu (blackish) 6.9% (39), dumbu (yellowish) 5.9% (45), vapa (whitish) 5.3% (54), pinge (pinkish) 5.1% (32), zorondu (blackish-2) 3.7% (20), ranje (orange-ish) 3.2% (28), ngara (pale yellowish) 3.2% (33), vinde (brownish) 3.1% (33). No consistent purple term.
- Demographic effects: For stimuli where grine was the most frequent term (n=98), younger participants used grine rather than burou more often than older participants (t(53)=3.1, p<0.003), and schooled participants did so more often than unschooled participants (t(53)=3.2, p=0.002). Among younger participants only, the education difference was not significant (t(29)=1.5, p=0.14). There was no significant male-female difference.
- Surface vs interior: RST on Munsell surface (n=320 grid) with current data identifies six terms (serandu, burou, grine, vapa, zoozu, dumbu) with a large grine area (19.4%). Across full gamut (CIELAB grid, n=5693), RST identifies seven categories: serandu 34.5%, grine 23.2%, burou 22.9%, zoozu 8.1%, dumbu 7.7%, vapa 3.5%, vinde 0.2% (emerges at L*=33–44). No stable grey or purple categories at mid-lightness.
- Temporal change (2005 vs current): Previous GRUE split into distinct GREEN (grine) and BLUE (burou). Boundary chips between GREEN and BLUE now have lower confidence than earlier GRUE peak (boundary Mconf=0.6 vs old GRUE Mconf=0.8; t(14)=3.55, p=0.003). Highest-confidence grine and burou in current data were boundary colours in 2005 data; no evidence of latent separate foci in 2005.
- Supervised vs unsupervised performance:
  - 2005 dataset (160 surface samples, CIELAB): k-means k=5 accuracy 64%; RST LOOCV accuracy 89%; difference significant (McNemar χ²(1, N=160)=32, p<0.01). k-means k=6 accuracy 54%.
  - 2018/2019 dataset (600 samples, CIELAB): k-means k=10 accuracy 40%; RST LOOCV accuracy 93%; difference significant (χ²(1, N=600)=296, p<0.01). k-means k=24 (all terms with freq≥2) accuracy 16%; k=6 (major terms) accuracy 60% (significantly lower than RST; χ²(1, N=600)=170, p<0.01).
  - 2018/2019 dataset (600 samples, sRGB): k-means k=7 accuracy 49%; RST LOOCV accuracy 93%; difference significant (χ²(1, N=600)=243, p<0.01). k-means diverged across spaces (59% in CIELAB vs 49% in sRGB; χ²(1, N=600)=25, p<0.01).
  Overall, supervised learning based on linguistic labels consistently outperformed unsupervised clustering across datasets and colour spaces.
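The accuracy comparison reported above can be illustrated with a minimal sketch. It is not the authors' pipeline: the data are synthetic, the helper names (`kmeans`, `match_to_names`, `mcnemar`) are hypothetical, a leave-one-out nearest-neighbour classifier stands in for the supervised model, and the McNemar statistic is computed without a continuity correction (the summary does not say which variant was used). The sketch clusters the samples with k-means, matches clusters to name categories with the Hungarian algorithm, keeps the best of 100 restarts by mean matched centroid distance, and then tests the paired difference in correctness.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.stats import chi2

def kmeans(X, k, rng, n_iter=50):
    """Plain Lloyd's algorithm; returns (centroids, cluster index per sample)."""
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # an emptied cluster keeps its old centroid
                centroids[j] = X[labels == j].mean(axis=0)
    labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
    return centroids, labels

def match_to_names(centroids, name_centroids):
    """Hungarian (Munkres) assignment of clusters to name categories."""
    cost = np.linalg.norm(centroids[:, None] - name_centroids[None], axis=2)
    rows, cols = linear_sum_assignment(cost)
    mapping = np.empty(len(centroids), dtype=int)
    mapping[rows] = cols
    return mapping, cost[rows, cols].mean()

def mcnemar(correct_a, correct_b):
    """McNemar chi-squared (no continuity correction) on paired correctness."""
    a = np.asarray(correct_a, dtype=bool)
    b = np.asarray(correct_b, dtype=bool)
    n_ab = int(np.sum(a & ~b))  # model A right, model B wrong
    n_ba = int(np.sum(~a & b))  # model A wrong, model B right
    if n_ab + n_ba == 0:
        return 0.0, 1.0
    stat = (n_ab - n_ba) ** 2 / (n_ab + n_ba)
    return stat, float(chi2.sf(stat, df=1))

# Synthetic stand-in data (NOT the study's data): points in a CIELAB-like box,
# with toy "names" defined by boundaries that ignore any cluster structure.
rng = np.random.default_rng(1)
X = rng.uniform([20, -60, -60], [90, 60, 60], size=(200, 3))
y = np.where(X[:, 1] > 20, 0, np.where(X[:, 2] > 0, 1, 2))
k = 3
name_centroids = np.array([X[y == c].mean(axis=0) for c in range(k)])

# Unsupervised: 100 restarts, keep the run with the smallest mean matched distance.
best = None
for _ in range(100):
    centroids, clusters = kmeans(X, k, rng)
    mapping, mean_dist = match_to_names(centroids, name_centroids)
    if best is None or mean_dist < best[0]:
        best = (mean_dist, mapping[clusters])
kmeans_pred = best[1]

# Supervised stand-in: leave-one-out nearest neighbour on the labelled samples.
d = np.linalg.norm(X[:, None] - X[None], axis=2)
np.fill_diagonal(d, np.inf)
nn_pred = y[d.argmin(axis=1)]

kmeans_ok, nn_ok = kmeans_pred == y, nn_pred == y
stat, p = mcnemar(nn_ok, kmeans_ok)
print(f"k-means accuracy {kmeans_ok.mean():.2f}, supervised accuracy {nn_ok.mean():.2f}")
print(f"McNemar chi2(1, N={len(y)}) = {stat:.1f}, p = {p:.3g}")
```

McNemar's test suits this comparison because both models are scored on the same samples, so only the samples on which they disagree carry information about the difference.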
The findings directly address whether perceptual structure or linguistic/cultural forces drive colour category augmentation. The Himba lexicon shows clear augmentation: a split of the former GRUE into GREEN and BLUE and the emergence of BROWN (vinde) in desaturated interior regions, shifting the language from a five-term to a seven-term system. Analyses show no evidence that GREEN and BLUE were latent within the earlier GRUE category; high-confidence GRUE colours in 2005 became boundary colours between GREEN and BLUE in the current data. Machine-learning comparisons demonstrate that models leveraging linguistic labels (supervised/RST) predict human naming with high accuracy across time, stimuli distributions, and colour spaces, whereas perceptually driven unsupervised clustering underperforms and is sensitive to coordinate scaling. This indicates that linguistic similarity and cultural processes, rather than solely perceptual universals, are the critical drivers for effective colour communication and category augmentation. The likely mechanism for GREEN augmentation is cultural transfer via loanwords (e.g., from Herero), supported by the alignment of the GREEN centroid with neighbouring languages and English. Other less frequent terms (e.g., ngara, pinge, ranje) suggest ongoing borrowing and potential future stabilisation. While perceptual commonalities may shape broad cross-language regularities, they are insufficient alone to explain the observed changes; social contact, globalization, and technology likely facilitate rapid shifts in colour lexicons.
This study reports the first documented augmentation of major colour categories (GREEN and BROWN) in a remote population’s lexicon, moving the Himba from a 5-term GRUE system to 7 independent categories. Employing a computer-based, gamut-wide sampling and a supervised RST model, the work identifies indispensable categories across both surface and interior colour spaces and shows that supervised, linguistically grounded learning far outperforms unsupervised perceptual clustering in predicting human naming. The results challenge physiology-only accounts of category origins and emphasize cultural and linguistic mechanisms, notably loanwords, as primary drivers of augmentation. Future research should examine how colour naming functions are learned within communities through cultural interaction, contextual usage, and technological exposure; revisit other preindustrial societies using gamut-wide sampling to uncover interior categories; and track longitudinal changes to disentangle the roles of contact, education, media, and environmental factors in category emergence and stabilisation.
- Machine-learning models are statistical and do not specify underlying cognitive/physiological mechanisms of change.
- Colour differences above ~10 ΔEab units (CIELAB) are less reliable; alternative methods were required to corroborate augmentation beyond centroid distances.
- Sampling density, though high, may still have been insufficient to capture small regions where minor modal terms might exhibit higher consensus.
- Stimuli were constrained to the sRGB gamut; despite careful calibration, display-based presentation may not perfectly match paper swatches, though analyses suggest this did not drive the observed category changes.
- Findings from the Himba may not generalize to all languages or mechanisms of augmentation; loanwords are implicated here but may not account for all cases.