CEFR vocabulary level as a predictor of user interest in English Wiktionary entries

R. Lew and S. Wolfer

Robert Lew and Sascha Wolfer examine whether CEFR vocabulary levels predict user engagement with English Wiktionary entries. They find that entries for lower-level words attract more views, a result with implications for lexicography and language learning.

Introduction

The Common European Framework of Reference for Languages (CEFR) provides widely used proficiency levels (A1–C2) and can-do descriptors that guide curriculum, assessment, and materials design. Although the CEFR is pedagogically focused, vocabulary is a core component, and various independent resources (e.g., the English Vocabulary Profile, EVP) assign CEFR levels to English words. This study examines whether CEFR vocabulary levels align with observable user behavior as reflected in dictionary searches. Using English Wiktionary page views as a proxy for user interest, the authors ask: (1) whether CEFR level predicts how often an English entry is viewed, expecting lower-level items to attract more views (A1 > A2 > B1 > B2 > C1/C2); and (2) whether CEFR level retains predictive power after controlling for other lexical variables (frequency, polysemy, word prevalence, and age of acquisition). The work draws on freely available Wikimedia server logs (2019–2022) and EVP CEFR assignments to assess how well CEFR levels predict user look-ups, with implications for lexicography and language learning.

Literature Review

Prior research using dictionary log files has explored user behavior to improve dictionaries and the relationship between look-ups and corpus frequency. Early studies found only weak positive relationships restricted to the highest-frequency items. Later work, using refined methods and accounting for long-tail distributions, established that dictionary views are positively related to corpus frequency. Polysemy (having multiple senses) has also been found to increase look-ups. Additional factors identified include word prevalence (the proportion of speakers who know a word), which shows a weak negative relationship to look-ups, and age of acquisition (AoA), where later-acquired words tend to be looked up more. These variables are interrelated: more frequent words tend to be earlier acquired, more prevalent, and often more polysemous. In contrast, CEFR vocabulary levels represent expert pedagogical judgments about usefulness for learners at specific proficiency stages. The present study situates CEFR level alongside frequency, polysemy, prevalence, and AoA to evaluate its unique contribution to predicting Wiktionary views.

Methodology

Data sources: English Wiktionary page-view counts were extracted from Wikimedia daily server logs for 2019-01-01 to 2022-12-31 and filtered to English Wiktionary entries. CEFR levels were taken from the English Vocabulary Profile (EVP). When an item had different CEFR levels for different senses, the lowest level was used, on the reasoning that it reflects the most frequent or essential sense, which is likely to drive look-ups. Coverage of EVP items in Wiktionary decreases with higher CEFR levels (highest at A1 and much lower at upper levels). Additional predictors were compiled as follows: lexical frequency from SUBTLEX-US; polysemy from Wiktionary (number of senses extracted programmatically via R); AoA ratings from Kuperman et al. (2012); and word prevalence from Brysbaert et al. (2019).

Modeling: Logged Wiktionary views served as the outcome. CEFR levels were coded with backward-difference contrasts, so that each coefficient compares one level with the level immediately below it; C1 and C2 were conflated into a single level C because coverage at these levels is sparse and fragmentary. Views and corpus frequency were log-transformed to reduce skew and approximate normality, and the continuous predictors (log frequency, AoA, prevalence) were standardized (z-scores) to aid comparability. Model assumptions were checked; collinearity was assessed with VIF and GVIF^(1/(2·df)), and all predictors fell within acceptable ranges.
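
The coding steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names `conflate`, `lowest_level`, `backward_difference_matrix`, and `zscore` are hypothetical, the input numbers are made up, the natural logarithm is assumed (the source does not state the base), and the contrast matrix follows the standard backward-difference definition.

```python
import math
import statistics

# CEFR levels in ascending order; C1 and C2 are conflated to "C" as in the study.
LEVEL_ORDER = ["A1", "A2", "B1", "B2", "C"]


def conflate(level: str) -> str:
    """Map C1 and C2 onto a single level C; leave other levels unchanged."""
    return "C" if level in ("C1", "C2") else level


def lowest_level(sense_levels: list[str]) -> str:
    """Return the lowest CEFR level among a word's sense-level labels."""
    return min((conflate(lv) for lv in sense_levels), key=LEVEL_ORDER.index)


def backward_difference_matrix(k: int) -> list[list[float]]:
    """Build the k x (k-1) backward-difference contrast matrix.

    Column j (1-based) compares level j+1 with level j, so each regression
    coefficient estimates the step between two adjacent levels.
    """
    return [
        [(j / k) if i > j else -(k - j) / k for j in range(1, k)]
        for i in range(1, k + 1)
    ]


def zscore(values: list[float]) -> list[float]:
    """Standardize values to mean 0 and sample standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


# Illustrative, made-up inputs (not data from the study).
contrasts = backward_difference_matrix(len(LEVEL_ORDER))
log_views = [math.log(v) for v in [1200, 450, 90, 30]]  # outcome: logged only
z_log_freq = zscore([math.log(f) for f in [310.0, 42.5, 7.1, 0.9]])  # predictor
```

In each column of `contrasts`, the entries for the two adjacent levels being compared differ by exactly 1 and the column sums to 0. This is why a negative coefficient in such a model reads directly as "fewer logged views at the next-higher level".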

Key Findings
  • Coverage: EVP items matched to English Wiktionary entries showed declining coverage with higher CEFR levels, from nearly 90% at A1 to roughly half at the highest level (C2), reflecting the open-ended nature and formatting peculiarities of upper-level items.
  • Simple CEFR-only model: Regressing logged views on CEFR level (C1/C2 conflated as C) yielded R² = 0.235 (adjusted R² = 0.234). All consecutive level contrasts were significant with negative coefficients, indicating fewer views at higher CEFR levels: A2–A1: −0.728 (SE = 0.049, t = −14.90), B1–A2: −0.318 (SE = 0.040, t = −8.04), B2–B1: −0.341 (SE = 0.033, t = −10.36), C–B2: −0.329 (SE = 0.030, t = −10.99).
  • Multiple regression (CEFR + lexical predictors): With logged views as outcome and standardized predictors, the model achieved R² = 0.503 (adjusted R² = 0.502). Effects: Frequency (β = 0.572, SE = 0.013, t = 43.16, p < 0.001), Polysemy TRUE (β = 0.622, SE = 0.034, t = 18.50, p < 0.001), AoA (β = 0.037, SE = 0.012, t = 2.98, p = 0.003), Prevalence (β = −0.033, SE = 0.009, t = −3.52, p < 0.001). All CEFR step contrasts remained negative, but only the two lowest steps were significant at the 0.05 level: A2–A1 (β = −0.158, SE = 0.039, t = −4.06, p < 0.001) and B1–A2 (β = −0.067, SE = 0.031, t = −2.18, p = 0.029). B2–B1 was marginal (β = −0.051, SE = 0.026, t = −1.94, p = 0.052), and C–B2 was not significant (β = −0.040, SE = 0.025, t = −1.62, p = 0.105).
  • Relative importance (drop-one adjusted R²): Starting from the full model (adj. R² = 0.502), dropping each predictor changed adj. R² by: Frequency −0.161 (adj. R² = 0.341), Polysemy −0.0295 (adj. R² = 0.472), CEFR Level −0.00423 (adj. R² = 0.498), Prevalence −0.000985 (adj. R² = 0.501), AoA −0.000678 (adj. R² = 0.501). Thus, frequency and polysemy were the strongest predictors, with CEFR third, followed by prevalence and AoA. Overall, lower CEFR levels consistently attracted more views, and CEFR contributed unique variance beyond other lexical factors.
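
The figures reported above can be cross-checked with simple arithmetic. The sketch below uses only numbers quoted in this section, and the variable names are illustrative. It relies on one property of backward-difference coding: each coefficient is the difference between two adjacent levels, so summing the four step coefficients gives the estimated gap between level C and level A1 on the logged-views scale. It uses point estimates only and says nothing about the uncertainty of the summed gap.

```python
# Step contrasts (higher level minus lower level) on the logged-views scale,
# copied from the two models reported above.
cefr_only_steps = {"A2-A1": -0.728, "B1-A2": -0.318, "B2-B1": -0.341, "C-B2": -0.329}
full_model_steps = {"A2-A1": -0.158, "B1-A2": -0.067, "B2-B1": -0.051, "C-B2": -0.040}

# Summing adjacent steps gives the estimated C-minus-A1 gap in logged views.
gap_cefr_only = sum(cefr_only_steps.values())
gap_full_model = sum(full_model_steps.values())

# Drop-one comparison: change in adjusted R² when each predictor is removed.
full_adj_r2 = 0.502
drop = {
    "Frequency": -0.161,
    "Polysemy": -0.0295,
    "CEFR Level": -0.00423,
    "Prevalence": -0.000985,
    "AoA": -0.000678,
}
# Adjusted R² values of the reduced models as reported (rounded in the source).
reported_reduced = {
    "Frequency": 0.341,
    "Polysemy": 0.472,
    "CEFR Level": 0.498,
    "Prevalence": 0.501,
    "AoA": 0.501,
}

# Recompute each reduced-model adjusted R² from the full model and the change.
reduced = {name: full_adj_r2 + delta for name, delta in drop.items()}

# Rank predictors by how much adjusted R² falls when they are removed.
ranking = sorted(drop, key=drop.get)

print(round(gap_cefr_only, 3), round(gap_full_model, 3))
print(ranking)
```

On these figures the cumulative A1-to-C gap shrinks from about −1.716 in the CEFR-only model to about −0.316 once the lexical predictors are included. This fits the small drop-one contribution of CEFR level relative to frequency and polysemy.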

Discussion

The findings directly address the research questions. First, CEFR level alone significantly predicts Wiktionary look-ups: lower-level (more basic) words attract more views than higher-level items. Second, CEFR retains independent predictive power after controlling for lexical frequency, polysemy, prevalence, and age of acquisition, indicating it is not merely a proxy for these variables. The strong effects of frequency and polysemy replicate prior work, while the smaller positive effect of AoA and negative effect of prevalence align with earlier findings. CEFR’s unique contribution underscores its pedagogical value for anticipating user interest in lexical items, suggesting that CEFR annotations could help prioritize dictionary coverage and inform language-learning materials, particularly if applied at the sense level. The coverage analysis also highlights practical challenges at higher CEFR levels (e.g., open-endedness, meta-notation), which may impact mapping to dictionary entries and thus observed look-up patterns.

Conclusion

This study demonstrates that CEFR vocabulary level is a meaningful predictor of user interest in English Wiktionary entries: lower-level words are looked up more often, both in a CEFR-only model and when controlling for frequency, polysemy, prevalence, and AoA. CEFR ranks as the third most important predictor after frequency and polysemy. These results support incorporating CEFR information into lexicographic resources—ideally at the sense level—and leveraging CEFR grading to guide dictionary design and educational content. Future work should improve mapping between CEFR entries (including multiword expressions and meta-notations) and dictionary titles, and address the fragmentary nature of upper-level CEFR inventories (C1/C2). Further modeling could use CEFR–lexical variable relationships to refine, supplement, or automate CEFR-level assignments.

Limitations
  • User characteristics unknown: Wikimedia page-view logs contain no demographic or proficiency data, making it impossible to determine what proportion of views come from language learners versus other users (e.g., teachers, writers).
  • CEFR labeling completeness and representativeness: Upper-level (C1/C2) coverage is fragmentary and arguably open-ended, potentially unrepresentative; similar concerns may affect B2. CEFR assignments can be somewhat arbitrary, lack nuance, and vary in specificity.
  • Item formulation and mapping: Many higher-level CEFR entries include placeholders (sb, sth), slashes, and multiword variants that do not map cleanly to Wiktionary entry titles, lowering apparent coverage and complicating automated linkage.

These limitations may affect generalizability and the precision of CEFR–view relationships, especially at higher levels.
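
One way to approach the mapping problem for entries with placeholders and slashes is to generate several candidate titles per entry and test each against the dictionary's title list. The sketch below is a hypothetical illustration, not a procedure described in the source: the function name `candidate_titles`, the example entry, and the expansion of sb and sth to "someone" and "something" are all assumptions.

```python
import itertools

# Assumed expansions for the placeholder abbreviations (not from the source).
PLACEHOLDERS = {"sb": "someone", "sth": "something"}


def candidate_titles(entry: str) -> list[str]:
    """Return possible dictionary titles for a placeholder-style entry.

    Slash-separated alternatives are expanded into separate candidates, and
    a bare form with placeholder-only tokens removed is listed first.
    """
    tokens = entry.split()

    # Expand each token into its slash-separated alternatives.
    options = [
        [PLACEHOLDERS.get(alt, alt) for alt in token.split("/")]
        for token in tokens
    ]
    expanded = [" ".join(combo) for combo in itertools.product(*options)]

    # Bare form: drop tokens that consist only of placeholders.
    bare = " ".join(
        token
        for token in tokens
        if not all(alt in PLACEHOLDERS for alt in token.split("/"))
    )

    # Keep order, skip empty strings and duplicates.
    candidates: list[str] = []
    for title in [bare, *expanded]:
        if title and title not in candidates:
            candidates.append(title)
    return candidates


# Made-up example entry in the placeholder style described above.
print(candidate_titles("look after sb/sth"))
```

Each candidate would still need to be checked against the actual list of entry titles. A sketch like this does nothing for the open-endedness of upper-level inventories.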