The effectiveness of ChatGPT as a lexical tool for English, compared with a bilingual dictionary and a monolingual learner's dictionary

Linguistics and Languages

R. Lew, B. Ptasznik, S. Wolfer

How does ChatGPT compare with traditional dictionaries as a lexical aid? This study by Robert Lew, Bartosz Ptasznik, and Sascha Wolfer finds that the chatbot outperforms both a monolingual learner's dictionary and a bilingual dictionary on production tasks, while the bilingual dictionary retains the edge in reception.

Introduction
The paper examines whether an AI-driven chatbot (ChatGPT) can function as an effective lexical tool compared to traditional dictionaries in language learning. Prior work debated dictionary types (bilingual, monolingual, bilingualized) and later focused on medium (print vs. digital) and navigation. Evidence suggests bilingual resources aid reception by leveraging learners’ L1, while monolingual learners’ dictionaries often hinder production because users must already know the target lexeme. With digital dictionaries dominant and large language models increasingly capable, the study asks: how effective is ChatGPT relative to a leading monolingual learner’s dictionary (LDOCE) and a popular bilingual dictionary (Diki.pl) for receptive (understanding) and productive (writing) tasks involving English phrasal verbs for advanced Polish learners? The purpose is to quantify comparative effectiveness and clarify the roles of AI chatbots versus traditional dictionaries in practical learning contexts.
Literature Review
Research on dictionary type indicates a robust advantage for bilingual information in reception across proficiency levels (Lew, 2004; Thompson, 1987; Wingate, 2002). Earlier conflicting findings (Laufer & Melamed, 1994) likely stemmed from unequal dictionary coverage. Monolingual learners’ dictionaries pose challenges in production because guidance is embedded under specific lemmas, making them less useful when learners do not know which word to search (Lew & Adamska-Sałaciak, 2015). Learners overwhelmingly prefer bilingual dictionaries (Atkins & Varantola, 1998; Baxter, 1980; Tomaszczyk, 1979), though misuse is possible (Abecassis, 2008). Regarding medium, studies comparing print and digital forms produce mixed results (Dziemianko, 2010, 2011, 2012, 2017); digital formats now dominate and offer dynamic, adaptive views. Recent developments in LLMs and AI chatbots have spurred interest in automating lexicographic content and potentially replacing dictionaries in some use cases (De Schryver, 2023; Lew, 2023; Rundell, 2023; Rees & Lew, 2024). This backdrop motivates direct comparison of chatbots and dictionaries as lexical support tools.
Methodology
Aim: To compare the effectiveness of three tools (ChatGPT-3.5, free version; LDOCE online, a monolingual learner's dictionary; Diki.pl, a Polish–English bilingual dictionary) for receptive and productive lexical tasks.

Participants: 166 native Polish students enrolled in a 3-year BA programme in English at a large Polish state university: Year 1 (N=78), Year 2 (N=46), Year 3 (N=42); proficiency B2–C1.

Materials: Two paper-based tests (20 items each; 40 items in total) targeting infrequent English phrasal verbs built on common verbs. In the production test, participants translated 20 Polish sentences into English; the verb of the target phrasal verb was given, with the particle withheld. In the reception test, participants read 20 English sentences with underlined target fragments and translated those fragments into Polish. Source sentences were adapted from CALD, COBUILD, and OALD with minor edits; one sentence was authored on the basis of corpus attestations.

Procedure: Approximately equal groups used LDOCE (N=56), Diki (N=55), or ChatGPT-3.5 (N=55), with all tools accessed on desktop computers in a university computer lab. After a 15-minute instruction and demonstration session, participants had 90 minutes in total for both tests. Each participant was randomized to one tool; the ChatGPT group could interact freely to obtain answers, while the dictionary groups were instructed to use only their assigned dictionary. Two randomized versions of each test controlled for order effects.

Scoring: Production (target phrasal verb): use of the target phrasal verb was scored with iteratively refined regular expressions (regex), manually checked for misclassification. Reception (understanding): understanding was measured through the Polish translations of the target fragments, scored with regex patterns capturing acceptable L1 paraphrases; these patterns were more elaborate than those for production and are provided in the Supplementary Material. Meaning conveyed (production, irrespective of use of the target phrasal verb): all non-blank English responses (N=3639) were deduplicated to 1639 unique responses and graded by GPT-4 under custom instructions on an A–F scale, then dichotomized at the B/C boundary into 'satisfactory' vs 'unsatisfactory'.

Validation: Two human judges independently scored a 10% random sample (N=164). Binary percent agreement: human–human 88%; each human vs GPT-4 82%. Weighted κ: H1–H2 = 0.89; H1–GPT-4 = 0.73; H2–GPT-4 = 0.72. Spearman correlations: 0.88, 0.74, 0.74. Three-way Light's κ = 0.68. All GPT-4 grades were adopted in binary form for the analysis.

Data and analysis: Of 6640 potential observations (166 × 40), 13 contiguous blanks from one participant were treated as omitted, 35 isolated blanks were counted as failures, and the remaining 6627 non-missing responses were analyzed. Mixed-effects logistic regression (lme4::glmer) was used, with model selection via buildmer, the BOBYQA optimizer, dispersion checks (blmeco), residual diagnostics (DHARMa), and effects estimation (effects, ggeffects). Reference levels: Tool = ChatGPT; Year = Year 1. Final models:
- Production (target phrasal verb): Target ~ 1 + Tool + Year + Tool:Year + (1 + Tool | Item) + (1 | Participant)
- Reception (understanding): Target ~ 1 + Tool + (1 + Tool | Item) + (1 | Participant)
- Meaning conveyed (production): Meaning ~ 1 + Tool + Year + (1 | Item) + (1 | Participant)
Type III Wald tests confirmed the significance of the predictors.
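To make the pipeline concrete, here is a minimal R sketch of the three scoring and modelling steps. It is an illustration only: the regex pattern, the example phrasal verb ("while away"), and the simulated data are invented for this sketch, not the authors' materials; only the model formula, the lme4::glmer call, and the BOBYQA optimizer setting follow the description above.

```r
library(lme4)

## 1. Regex scoring of a production response (hypothetical pattern).
## Matches inflected forms of "while ... away", allowing a short
## intervening object before the particle.
pattern <- "\\bwhil(e|es|ed|ing)\\b(\\s+\\S+){0,4}\\s+away\\b"
grepl(pattern,
      c("She whiled the long hours away.", "She wasted the long hours."),
      ignore.case = TRUE)
#> TRUE FALSE

## 2. Dichotomizing GPT-4's A-F grades at the B/C boundary.
grades <- c("A", "B", "C", "D", "F")
satisfactory <- grades %in% c("A", "B")

## 3. Production model: mixed-effects logistic regression with a
## by-item random slope for Tool, fitted with the BOBYQA optimizer.
## Simulated toy data stand in for the study's responses.
set.seed(1)
d <- expand.grid(Participant = factor(1:30), Item = factor(1:20))
tool_of <- rep(c("ChatGPT", "Diki", "LDOCE"), each = 10)   # 10 learners per tool
year_of <- rep(c("Year1", "Year2", "Year3"), times = 10)
d$Tool <- factor(tool_of[d$Participant], levels = c("ChatGPT", "Diki", "LDOCE"))
d$Year <- factor(year_of[d$Participant])
d$Target <- rbinom(nrow(d), 1,
                   plogis(1.7 - 1.0 * (d$Tool == "Diki") - 2.5 * (d$Tool == "LDOCE")))

fit <- glmer(Target ~ 1 + Tool + Year + Tool:Year +
               (1 + Tool | Item) + (1 | Participant),
             data = d, family = binomial,
             control = glmerControl(optimizer = "bobyqa"))
summary(fit)  # fixed effects on the log-odds scale; a singular-fit
              # warning is possible on toy data without true item effects
```

In the study itself, this fixed- and random-effects structure was not specified a priori but selected with buildmer before the final fit.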
Key Findings
Descriptive performance (Table 3):
- Production (target item accuracy): ChatGPT 84.1%; Diki 67.0%; LDOCE 40.0%.
- Meaning conveyed (production): ChatGPT 97.2%; Diki 81.1%; LDOCE 68.9%.
- Reception (target item accuracy): ChatGPT 71.3%; Diki 84.6%; LDOCE 67.4%.

Production model (Table 4):
- Tool effect highly significant (χ² = 84.7, df = 2, p < 0.001). Year not significant overall; Tool×Year interaction significant (χ² = 10.5, df = 4, p = 0.03).
- Odds ratios (vs ChatGPT, Year 1 reference): Diki OR = 0.14 (odds of success roughly 7 times lower than with ChatGPT); LDOCE OR = 0.03 (roughly 30 times lower); both p < 0.001. Year 2 and Year 3 main effects not significant; Year 3 × LDOCE OR ≈ 2.29, p = 0.068 (marginal). Variability was greater across Items than across Participants.

Reception model (Table 5):
- Tool effect significant (χ² = 18.9, df = 2, p < 0.001). Year was excluded from the final model.
- Odds ratios (vs ChatGPT): Diki OR = 1.75 (p = 0.209); LDOCE OR = 0.46 (p = 0.06). Estimated marginal means: Diki 91% [83–95]; ChatGPT 85% [67–94]; LDOCE 72% [61–81]. Diki significantly higher than LDOCE (non-overlapping CIs).

Meaning conveyed model (Table 6):
- Strong Tool effects: vs ChatGPT, Diki OR = 0.11; LDOCE OR = 0.05 (both p < 0.001). Year 3 advantage OR = 1.81, p = 0.001 (no Tool×Year interaction).
- Estimated success rates (averaged over Years): ChatGPT 98% [97–99]; Diki 84% [79–88]; LDOCE 72% [64–78]. By Year (averaged over Tools): Year 1 88% [84–91]; Year 2 88% [84–92]; Year 3 93% [90–95].

Validation of GPT-4 grading for Meaning conveyed:
- Percent agreement (binary): H1–H2 88%; H1–GPT-4 82%; H2–GPT-4 82%.
- Weighted κ: H1–H2 0.89; H1–GPT-4 0.73; H2–GPT-4 0.72. Three-way Light's κ = 0.68.

Overall: ChatGPT outperformed both dictionaries in production (both target-item use and meaning conveyed). In reception, Diki performed best; ChatGPT outperformed LDOCE but not Diki.
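As a brief illustration of where these numbers come from: the odds ratios are the exponentiated fixed-effect coefficients of the logistic models, and the bracketed percentages are model-based predicted probabilities. The sketch below reuses the toy `fit` from the Methodology section, so its outputs will not match the paper's values; `car` and `ggeffects` are real packages, but their use here is an assumed reconstruction of the reported analyses.

```r
library(car)        # Anova(): Type III Wald chi-square tests
library(ggeffects)  # ggpredict(): marginal predicted probabilities

## Exponentiated fixed effects are odds ratios against the reference
## levels (Tool = ChatGPT, Year = Year 1). For example, an OR of 0.14
## means the odds of success are about 1/0.14 ≈ 7 times lower than
## with ChatGPT.
exp(fixef(fit))

## Type III Wald tests for the overall Tool, Year, and Tool:Year effects.
Anova(fit, type = 3)

## Predicted success probabilities per tool with confidence intervals,
## comparable to the bracketed marginal means reported above.
ggpredict(fit, terms = "Tool")
```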
Discussion
The study's results directly address the research question by showing that ChatGPT is a highly effective lexical aid, especially for production tasks where learners must generate idiomatic English or accurately convey meaning. This aligns with ChatGPT's strength in producing natural English. For reception (understanding), ChatGPT lags behind the bilingual dictionary and is only moderately better than the monolingual LDOCE, likely because participants had to supply Polish translations, and ChatGPT's weaker command of languages other than English, Polish included, produces mistranslations. These findings reaffirm prior lexicographic research: bilingual resources leveraging the learner's L1 facilitate understanding more effectively than monolingual paraphrases. Item-level variability outweighed participant-level variability, indicating that specific phrasal verbs differ greatly in difficulty across tools. Advanced students (Year 3) showed better production outcomes overall, but this did not alter the relative ranking of tools for conveying meaning. The results suggest that while traditional dictionaries remain useful, especially bilingual ones for decoding, AI chatbots can surpass them in encoding tasks. As multilingual LLMs improve, the reception gap relative to bilingual dictionaries may narrow.
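The item-versus-participant variability claim corresponds to comparing the random-effect variances of the fitted models. On any lme4 fit, such as the toy `fit` sketched in the Methodology section, this is a single call (values shown come from the toy model, not the paper):

```r
## Random-effect standard deviations: a larger SD for Item than for
## Participant indicates that difficulty varies more across phrasal
## verbs than across individual learners.
VarCorr(fit)
```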
Conclusion
The paper demonstrates that a general-purpose chatbot (ChatGPT) can serve as a viable alternative to traditional dictionaries for advanced learners, with a pronounced advantage in production (both using specified phrasal verbs and conveying meaning without requiring target forms). For reception, a high-quality bilingual dictionary still offers the strongest support, while ChatGPT outperforms a leading monolingual learner’s dictionary. The study contributes empirical evidence comparing AI chatbots to established dictionary types across receptive and productive tasks using controlled materials and rigorous mixed-effects modeling. Future research should examine learning outcomes beyond immediate task success, such as delayed retention and how interactive chatbot use influences vocabulary acquisition. As multilingual models mature, their capacity to support reception in learners’ L1s should be reassessed, while recognizing that for smaller or low-resource languages, dictionaries may remain indispensable.
Limitations
Identified and implied limitations include: (1) ChatGPT’s weaker proficiency in Polish (and non-English languages) likely depressed reception performance; (2) the participant sample was homogeneous (Polish L1, advanced BA students), which may limit generalizability; (3) tasks focused on 40 uncommon English phrasal verbs, potentially constraining scope; (4) only one monolingual (LDOCE) and one bilingual (Diki.pl) dictionary were tested; (5) ChatGPT use was limited to version 3.5 (free) during Jan–Apr 2024; (6) outcomes measured immediate task success, not delayed retention or learning gains; (7) Meaning conveyed scoring used GPT-4 (validated against humans but still an automated rater); and (8) tool performance may evolve rapidly due to updates to models and online dictionaries.