Introduction
Automated Essay Scoring (AES) has seen significant advancements with the rise of machine learning, particularly large language models (LLMs). While LLMs are increasingly used for language assessment, their application to non-native Japanese AES remains under-explored. This study addresses this gap by investigating the potential of LLMs for evaluating the Japanese writing proficiency of non-native speakers. The increasing number of Japanese language learners globally and the rising demand for efficient and cost-effective assessment methods necessitate the exploration of automated solutions. Traditional methods like the Japanese Language Proficiency Test (JLPT), J-CAT, and TTBJ primarily focus on reading, listening, vocabulary, and grammar, neglecting writing proficiency. This study aims to determine if LLMs can accurately and reliably assess various aspects of Japanese writing, thereby offering a valuable tool for educators, employers, and learners themselves. The research seeks to answer two key questions: (a) How accurately do LLM-driven AES and linguistic feature-based AES reflect test takers' actual performance? and (b) What is the impact of prompt engineering on the accuracy of LLM-based AES methods?
Literature Review
Early AES systems relied on conventional machine learning techniques, using pre-defined linguistic features like lexical richness, syntactic complexity, and text cohesion to predict human ratings. Examples include Jess and JWriter for Japanese, which, however, suffer from limitations such as manual feature selection and potential for system manipulation. Deep learning, particularly transformers like BERT and GPT, offers a more powerful approach. BERT, trained on masked language modeling and next-sentence prediction, has shown promise in AES, often combined with LSTM models. GPT, with its generative capabilities, also holds potential but requires careful prompt engineering. The Open-Calm large model (OCLL), a Japanese-specific LLM, represents a further avenue for exploration. Prior research using BERT in non-native Japanese AES demonstrated its superiority over conventional methods, and OCLL showed promise in another study, but both were limited by dataset size and model accessibility. A major challenge lies in prompt engineering: finding effective prompts to elicit accurate and reliable scores from LLMs. Techniques like zero-shot chain of thought (CoT) and Auto-CoT have been proposed to address this challenge.
Methodology
This study employed a dataset of 1400 story-writing essays from the International Corpus of Japanese as a Second Language (I-JAS), representing 12 first languages and six proficiency levels (A1-C2) based on J-CAT scores. Of these, 714 essays were used for reliability testing and 686 for proficiency-level discrimination. Five AES methods were compared: Jess, JWriter, BERT, GPT-4, and OCLL. Writing proficiency was assessed using 16 measures grouped into five criteria: lexical richness (lexical diversity, density, and sophistication), syntactic complexity (mean dependency distance, mean length of clause, etc.), cohesion (semantic similarity), content elaboration (metadiscourse marker rate), and grammatical accuracy (error rate). T-unit analysis was used to segment syntactic units, and lexical diversity was measured with the moving-average type-token ratio (MATTR). The multi-faceted Rasch model was applied to evaluate how effectively the measures distinguish proficiency levels. Prompt engineering was explored with 18 prompt variations for GPT-4, ranging from a baseline prompt containing no additional information, to a prompt incorporating all 16 measures, to prompts each containing a single measure. Annotation reliability was assessed using precision, recall, F-score, and quadratically weighted kappa (QWK), and the accuracy of the AES methods was evaluated using QWK and the proportional reduction of mean squared error (PRMSE).
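As a concrete illustration of two of these metrics, the following Python sketch computes MATTR and QWK from scratch. The window size and the rating-scale bounds are assumptions chosen for the example, not values reported in the study.

```python
import numpy as np

def mattr(tokens, window=50):
    """Moving-average type-token ratio: the mean type-token ratio over all
    fixed-size windows slid across the token sequence.
    The window size of 50 is an illustrative default, not the study's setting."""
    if len(tokens) <= window:
        return len(set(tokens)) / len(tokens)
    ttrs = [len(set(tokens[i:i + window])) / window
            for i in range(len(tokens) - window + 1)]
    return sum(ttrs) / len(ttrs)

def quadratic_weighted_kappa(rater_a, rater_b, min_rating=1, max_rating=6):
    """Quadratically weighted kappa between two sets of ordinal ratings,
    e.g. human vs. LLM scores on an assumed 1-6 (A1-C2) scale."""
    a = np.asarray(rater_a) - min_rating
    b = np.asarray(rater_b) - min_rating
    n = max_rating - min_rating + 1
    # Observed co-occurrence matrix of the two raters' scores.
    observed = np.zeros((n, n))
    for i, j in zip(a, b):
        observed[i, j] += 1
    # Expected matrix from the marginal distributions, scaled to the same total.
    expected = np.outer(np.bincount(a, minlength=n),
                        np.bincount(b, minlength=n)) / len(a)
    # Quadratic disagreement weights grow with the squared distance between ratings.
    weights = np.fromfunction(lambda i, j: (i - j) ** 2 / (n - 1) ** 2, (n, n))
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()
```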
Key Findings
The inter-annotator agreement (human-human) showed high reliability (F-scores ranging from 0.666 to 1.000, QWK from 0.695 to 0.950). LLM annotation also demonstrated high agreement with human annotation (F-scores from 0.706 to 1.000, QWK from 0.807 to 0.962). GPT-4 achieved higher QWK values than BERT and OCLL, indicating superior agreement with human ratings across the assessed writing proficiency criteria (lexical richness, syntactic complexity, content, and grammatical accuracy); notably, GPT-4's agreement with the human raters exceeded the agreement between the human raters themselves. The PRMSE analysis further confirmed GPT-4's advantage over the other LLMs and the conventional methods. In the prompt engineering experiment, GPT-4 with all 16 measures incorporated into the prompt (Model 2) significantly outperformed both the baseline (Model 1) and the single-measure prompts (Models 3-18), achieving the highest PRMSE score (0.711). Analysis of variance (ANOVA) and post-hoc tests showed that several measures exhibited statistically significant differences and linear progressions across proficiency levels, indicating their efficacy in distinguishing learners; syntactic complexity measures were particularly effective discriminators.
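For reference, PRMSE quantifies how much of the variance in the human scores is explained by the automated scores. The sketch below is a simplified version that treats the human score as the prediction target; the full Haberman formulation commonly used in AES research additionally corrects for rater error estimated from double-scored essays, which is omitted here.

```python
import numpy as np

def prmse_simplified(human_scores, system_scores):
    """Simplified PRMSE: 1 - MSE / Var(human).  A value of 1 means the
    system scores reproduce the human scores exactly; 0 means they do
    no better than always predicting the human mean."""
    human = np.asarray(human_scores, dtype=float)
    system = np.asarray(system_scores, dtype=float)
    mse = np.mean((human - system) ** 2)
    return 1.0 - mse / np.var(human)

# Illustrative call with made-up scores (not data from the study):
# prmse_simplified([2, 3, 4, 5, 3], [2.2, 3.1, 3.8, 4.9, 3.4])
```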
Discussion
The findings demonstrate the feasibility and effectiveness of using LLMs, specifically GPT-4, for AES of non-native Japanese writing. GPT-4's advantage over the other LLMs and the traditional methods can be attributed to its larger parameter count and greater capacity for language understanding. The strong correlation between GPT-4 scores and human ratings, which even surpasses inter-rater reliability, suggests that GPT-4 can serve as a reliable tool to assist in, or in certain contexts potentially replace, human assessment. The results also highlight the central role of prompt engineering: including relevant linguistic features and clear instructions in the prompt markedly improves scoring accuracy, so carefully crafted prompts and well-chosen linguistic features are essential for ensuring the validity and reliability of LLM-based AES.
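To make the prompt-engineering point concrete, the following sketch shows one way measure definitions could be embedded in a GPT-4 scoring prompt. The wording, the abbreviated measure list, and the function name are illustrative assumptions, not the prompts used in the study.

```python
# Hypothetical prompt builder; the instructions and measure phrasing are
# assumptions for illustration, not the study's actual prompts.
MEASURES = [
    "lexical diversity (moving-average type-token ratio)",
    "lexical density and sophistication",
    "mean dependency distance and mean length of clause",
    "semantic similarity between adjacent sentences (cohesion)",
    "metadiscourse marker rate (content elaboration)",
    "grammatical error rate",
]

def build_scoring_prompt(essay_text: str) -> str:
    """Assemble a scoring prompt that lists evaluation measures before the
    essay, loosely mirroring (in abbreviated form) an all-measures condition."""
    measure_list = "\n".join(f"- {m}" for m in MEASURES)
    return (
        "You are an experienced rater of Japanese as a second language.\n"
        "Rate the following learner essay on a scale from 1 (A1) to 6 (C2), "
        "taking each of these measures into account:\n"
        f"{measure_list}\n\n"
        f"Essay:\n{essay_text}\n\n"
        "Return only the numeric level."
    )
```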
Conclusion
This study demonstrates the potential of LLMs, particularly GPT-4, for automated essay scoring of non-native Japanese writing. GPT-4 significantly outperforms the other models and the conventional methods, showing higher agreement with human ratings than the human raters show with one another. The study also underscores the vital role of prompt engineering in achieving optimal LLM performance. Future research should focus on refining prompt strategies for diverse language types, extending LLM-based assessment to oral proficiency, and exploring its use in under-resourced languages.
Limitations
The study's reliance on a specific dataset (I-JAS) might limit the generalizability of the findings to other datasets or writing tasks. The specific prompts used might also influence the results, and further research is needed to optimize prompt design. While GPT-4 showed superior performance, it is a computationally expensive model, limiting accessibility for some users.