Applying large language models for automated essay scoring for non-native Japanese

W. Li and H. Liu

Discover how Wenchao Li and Haitao Liu applied large language models such as GPT-4 to automated essay scoring for non-native Japanese writing. Their findings show that GPT-4 aligns more closely with human raters than conventional feature-based methods and other models, pointing toward more reliable and scalable learning and assessment tools.

Introduction
The study investigates whether large language models can reliably support automated essay scoring (AES) for non-native Japanese writing and how prompt design affects LLM performance. Despite growth in AES using machine learning and deep learning, applications to Japanese L2 writing remain limited and challenged by Japanese morpho-syntactic characteristics (agglutination, SOV order, idiomatic expressions). This work compares five systems—two conventional feature-based tools (Jess, JWriter), two LLMs (GPT, BERT), and a Japanese local LLM (Open-Calm large model, OCLL)—on a corpus of learner narratives. Research questions: (a) To what extent do LLM-driven AES and linguistic feature-based AES accurately reflect test takers’ performance as support for human rating? (b) How does prompting influence the accuracy and performance of LLM-based AES? Addressing these questions is important for scalable, consistent, and cost-effective assessment that complements existing tests which under-emphasize writing.
Literature Review
Conventional machine learning AES relies on engineered linguistic features and labeled data, with systems like Intelligent Essay Assessor, e-rater, MyAccess (IntelliMetric), and Bayesian Essay Test Scoring automating scoring but limited by predefined features, manual intervention, and susceptibility to gaming. In Japanese, Jess deducts from a perfect score using newspaper-based features (rhetoric, organization, content via LSI), while JWriter uses linear regression over indices (e.g., sentence length, character counts). Pilot results showed mixed discriminative validity for Jess measures and limited level separation. Deep learning approaches, particularly transformers, improved text representation. BERT-based AES (often combined with LSTM) outperformed conventional features in non-native Japanese; yet BERT faces input length limits and reliance on score-only labels. GPT-3 showed moderate-to-fair AES accuracy in English (TOEFL11) and improved with added linguistic measures. For Japanese, Open-Calm models (GPT-NeoX-based) showed promise; OCLL performed best within its family but requires local deployment and adequate GPU. Prompt engineering (zero-shot CoT, Auto-CoT, Tree of Thoughts) can substantially affect LLM outcomes, motivating systematic prompt design for AES.
Methodology
Dataset: Essays were sourced from the International Corpus of Japanese as a Second Language (I-JAS), in which 1000 participants across 12 L1s completed two PC-based story-writing tasks based on 4-panel illustrations ("Picnic", "The key"). Proficiency was benchmarked using J-CAT and SPOT and mapped to CEFR/RFJLE levels (Beginner, A1–C2). A total of 1400 essays were used: 714 for evaluating annotation reliability and 686 as development data to assess proficiency discrimination. J-CAT served as the primary benchmark.

Measures: Five scoring categories comprising 16 measures were used: (1) Lexical richness (MATTR with a 50-word moving window; lexical density; lexical sophistication, defined as types not in the Japanese Education Vocabulary List Ver 1.0). (2) Syntactic complexity (mean length of clause; verb phrases per T-unit; clauses per T-unit; dependent clauses per T-unit; complex nominals per T-unit; adverbial clauses rate; coordinate phrases rate; mean dependency distance, MDD). (3) Cohesion (synonym overlap/topic; synonym overlap/keywords; word2vec cosine similarity). (4) Content elaboration (metadiscourse marker rate). (5) Grammatical accuracy (grammatical error rate). T-unit criteria for Japanese followed Hunt (1966) and Sakoda & Hosoi (2020). MDD was computed as the average absolute distance between dependents and their governors.

Validation of measures: Multi-faceted Rasch measurement estimated item difficulties, person abilities, and rater severities. All 16 measures showed acceptable Infit MNSQ values (0.76–1.28). Measure separation was 4.02 (reliability 0.891), and person separation reliability was 0.802. Weights for score prediction highlighted high contributions from MATTR (0.0391), MDD (0.0388), complex nominals per T-unit (0.0379), mean length of clause (0.0374), coordinate phrases rate (0.0325), and grammatical error rate (0.0322).

Prompting and systems: The GPT-4 API was used. The primary prompt instructed holistic CEFR A1–C2 scoring and trait scoring, explicitly listing all 16 measures and requesting that each be calculated. Prompt-language experiments (English vs Japanese) showed that GPT-4 produced consistent grades (B1 in the sample) regardless of prompt language, whereas OCLL with Japanese prompts yielded inconsistent outcomes across 10 runs (6 consistent B1; 4 varying between A1 and B2), suggesting that model capacity and parameters, rather than prompt language, drive reliability.

Evaluation metrics: Annotation agreement was measured with precision, recall, F-score, and quadratic weighted kappa (QWK). Scoring agreement and accuracy were measured with QWK, Pearson correlations, and PRMSE, with human-human agreement as the benchmark. Discrimination was assessed via ANOVAs and Bonferroni post hoc tests across aggregated proficiency groups (primary, intermediate, advanced). All analyses were implemented in Python.
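To make the lexical-diversity measure concrete, below is a minimal Python sketch of MATTR with the 50-word moving window described above. The tokenizer is assumed rather than specified here; Japanese text would first need morphological segmentation (e.g., with an analyzer such as MeCab), and the fallback for texts shorter than the window is an assumption of the sketch, not a detail from the study.

```python
def mattr(tokens, window=50):
    """Moving-average type-token ratio (MATTR).

    Averages the type-token ratio over every consecutive window of
    `window` tokens, reducing the text-length sensitivity of plain TTR.
    """
    if not tokens:
        return 0.0
    if len(tokens) < window:
        # Fallback for short texts (assumption, not specified in the paper).
        return len(set(tokens)) / len(tokens)
    ttrs = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ttrs) / len(ttrs)
```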
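Mean dependency distance (MDD), defined above as the average absolute distance between each dependent and its governor, can be sketched in the same spirit. The head indices are assumed to come from a dependency parse (the summary does not name a parser), and the toy example in the docstring is illustrative only.

```python
def mean_dependency_distance(heads):
    """Mean dependency distance (MDD) for one parsed sentence.

    `heads` maps each token's 1-based position to the 1-based position
    of its governor; the root is marked with 0 and is skipped.

    Example: for heads {1: 3, 2: 3, 3: 0} (tokens 1 and 2 both depend
    on token 3, the root), MDD = (|1 - 3| + |2 - 3|) / 2 = 1.5.
    """
    distances = [abs(dep - gov) for dep, gov in heads.items() if gov != 0]
    return sum(distances) / len(distances) if distances else 0.0
```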
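For the prompting setup, the exact prompt wording, model snapshot, and decoding parameters are not reproduced in this summary, so the following is only an illustrative call using the current OpenAI Python SDK: it lists the 16 measures and asks for trait scores plus a holistic CEFR grade, roughly mirroring the measure-rich prompt described above. The measure phrasing and the temperature setting are assumptions.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Measure list reconstructed from the Methodology section above.
MEASURES = [
    "MATTR (50-word moving window)", "lexical density", "lexical sophistication",
    "mean length of clause", "verb phrases per T-unit", "clauses per T-unit",
    "dependent clauses per T-unit", "complex nominals per T-unit",
    "adverbial clauses rate", "coordinate phrases rate", "mean dependency distance",
    "synonym overlap (topic)", "synonym overlap (keywords)",
    "word2vec cosine similarity", "metadiscourse marker rate", "grammatical error rate",
]

def score_essay(essay_text: str) -> str:
    """Ask GPT-4 for trait scores and a holistic CEFR grade over all 16 measures."""
    prompt = (
        "You are rating a Japanese L2 essay. Calculate each of the following "
        "measures, then assign trait scores and a holistic CEFR grade (A1-C2): "
        + "; ".join(MEASURES)
        + ".\n\nEssay:\n" + essay_text
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic decoding; the study's setting is not stated
    )
    return response.choices[0].message.content
```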
Key Findings
- Human-human annotation agreement: F-scores ranged from 1.000 (sentence and word counts) to 0.666 (grammatical errors); QWK spanned 0.950 (sentences/words) to ~0.695 (synonym overlap keywords, grammatical errors), indicating strong but imperfect inter-rater consistency.
- Human-LLM annotation agreement: GPT-4 achieved perfect F-scores (1.000) with humans on basic units (sentences, clauses, T-units, words) and high QWK (e.g., 0.962 for words). The OCLL-human F-score for grammatical error counts was lower (0.706).
- Human-GPT-4 scoring agreement by measure (QWK): ranged from 0.819 (lexical density) to 0.644 (word2vec similarity). Other notable values: MLC 0.812, DCT 0.803, CNT 0.807, ACC 0.794, SOPK 0.805, VPT 0.754, MDD 0.743, MATTR 0.655.
- Human-GPT-4 category correlations (Pearson): lexical richness 0.708; syntactic complexity 0.672; cohesion 0.751; content elaboration 0.722; grammatical accuracy 0.734.
- Proficiency discrimination (ANOVA on 686 essays; a computation sketch appears after this list): seven measures showed linear progression across primary, intermediate, and advanced levels: MATTR (lexical diversity), MDD, MLC, CNT, CPC (syntactic complexity), word2vec cosine similarity (cohesion), and GER (grammatical error rate). Significant adjacent-level differences were observed (e.g., primary vs intermediate for MLC and GER). Lexical sophistication did not differ significantly by level.
- System comparison against human ratings (QWK | PRMSE; the agreement metrics are sketched after this list):
  • GPT-4–human: Lexical 0.665 | 0.711; Syntactic 0.655 | 0.733; Content 0.657 | 0.634; Grammar 0.689 | 0.754.
  • BERT–human: Lexical 0.653 | 0.701; Syntactic 0.652 | 0.746; Content 0.638 | 0.628; Grammar 0.671 | 0.749.
  • OCLL–human: Lexical 0.639 | 0.687; Syntactic 0.648 | 0.699; Content 0.623 | 0.619; Grammar 0.662 | 0.694.
  • Jess–human: Lexical 0.608 | 0.691; Syntactic 0.591 | 0.658; Content 0.518 | 0.601; Grammar 0.655 | 0.689.
  • JWriter–human: Lexical 0.600 | 0.683; Syntactic 0.589 | 0.675; Content 0.521 | 0.584; Grammar 0.661 | 0.732.
  • Human–human: Lexical 0.657 | 0.691; Syntactic 0.639 | 0.745; Content 0.578 | 0.590; Grammar 0.677 | 0.744.
  GPT-4 achieved the highest overall alignment with human ratings, including surpassing human-human agreement for content.
- Prompt strategy (18 models): baseline GPT-4 (no added measures) was fair, while GPT-4 prompted with all 16 measures achieved the best PRMSE (0.681), outperforming any single-measure prompt. Syntactic complexity cues contributed most to the gains, followed by lexical diversity. GPT-4 produced consistent grades across prompt languages, whereas OCLL showed within-model inconsistency across runs.
- Additional observations: OCLL handled certain idiomatic expressions well but overall lagged behind GPT-4, likely due to its smaller parameter count and training scale.
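A brief note on the agreement metrics behind the QWK | PRMSE pairs above: quadratic weighted kappa can be computed with scikit-learn, while PRMSE is shown here in a simple variance-normalised form. The study does not spell out which PRMSE estimator it uses, so treat this as an assumption of the sketch rather than the authors' exact formula.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def qwk(human, system):
    """Quadratic weighted kappa between two vectors of integer score labels."""
    return cohen_kappa_score(human, system, weights="quadratic")

def prmse(human, system):
    """Proportional reduction in mean squared error (simple form; assumption).

    1.0 means the system reproduces the human scores exactly; 0.0 means it
    does no better than always predicting the mean human score.
    """
    human = np.asarray(human, dtype=float)
    system = np.asarray(system, dtype=float)
    mse = np.mean((human - system) ** 2)
    return 1.0 - mse / np.var(human)
```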
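The proficiency-discrimination analysis (one-way ANOVA with Bonferroni post hoc comparisons across primary, intermediate, and advanced groups) could be run along the following lines. The study states only that analyses were implemented in Python, so the use of SciPy and the synthetic data below are assumptions for illustration.

```python
import numpy as np
from scipy import stats

def anova_with_bonferroni(groups, labels, alpha=0.05):
    """One-way ANOVA across groups, then pairwise t-tests with a
    Bonferroni-corrected significance threshold."""
    f_stat, p_value = stats.f_oneway(*groups)
    print(f"ANOVA: F = {f_stat:.3f}, p = {p_value:.4f}")

    pairs = [(i, j) for i in range(len(groups)) for j in range(i + 1, len(groups))]
    corrected_alpha = alpha / len(pairs)  # Bonferroni correction
    for i, j in pairs:
        t, p = stats.ttest_ind(groups[i], groups[j])
        verdict = "significant" if p < corrected_alpha else "n.s."
        print(f"{labels[i]} vs {labels[j]}: t = {t:.3f}, p = {p:.4f} ({verdict})")

# Illustrative only: synthetic per-level values of one measure (e.g., MDD).
rng = np.random.default_rng(0)
levels = ["primary", "intermediate", "advanced"]
groups = [rng.normal(loc, 0.5, size=100) for loc in (2.0, 2.4, 2.9)]
anova_with_bonferroni(groups, levels)
```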
Discussion
Findings address RQ1 by demonstrating that LLM-driven AES, particularly GPT-4 with detailed measure-based prompting, closely aligns with human ratings, sometimes exceeding human–human agreement, and reliably annotates structural units. This suggests LLMs can support or partially automate rating with high consistency, potentially mitigating rater variability and bias through standardized criteria embedded in prompts. For RQ2, prompt engineering significantly improves performance: comprehensive, measure-rich prompts yield better accuracy than baseline or single-measure prompts, with syntactic complexity and lexical diversity being especially informative for Japanese L2 writing quality. The system also discriminates proficiency levels via multiple measures showing linear progression. While the Japanese local model (OCLL) shows promise (e.g., idiomatic handling), its inconsistency and lower agreement underscore the advantage of larger models like GPT-4 and the importance of robust prompt designs.
Conclusion
The study shows GPT-4-based AES outperforms BERT, OCLL, and conventional feature-based tools (Jess, JWriter) across multiple writing proficiency dimensions, achieving strong agreement with human raters and effective proficiency discrimination. Comprehensive measure-informed prompting substantially boosts performance over minimal prompting, highlighting the centrality of prompt design in LLM-based AES. LLM-supported assessment can reduce costs, time, and rater biases while offering consistent feedback for learning. Future work should: (1) refine prompt engineering strategies across languages and task types; (2) extend LLM assessment to oral proficiency and under-resourced languages, leveraging advances in self-supervised speech technologies; and (3) further investigate AES robustness to language- and learner-specific factors (e.g., L1 transfer, idioms) and enhance transparency of trait scoring.
Limitations
Reported constraints include: (1) Model-side limitations such as BERT’s input length constraints and reliance on score-only labels for training; (2) Prompt engineering challenges, as reliability and effectiveness depend on careful design; (3) OCLL deployment constraints (no public server access; requires local GPU such as RTX 3060) and observed inconsistency across runs; (4) Linguistic challenges specific to Japanese (agglutination, SOV order, idiomatic expressions) that complicate AES; (5) Potential biases inherent in human ratings used as benchmarks, though mitigated via standardized prompts and multi-faceted Rasch modeling. The study focuses on written essays from the I-JAS corpus; generalization to other genres, tasks, or modalities (e.g., speech) requires further validation.