Introduction
The integration of artificial intelligence (AI) into education is transforming teaching, learning, and assessment. AI's role in language education, especially English as a foreign language (EFL), is growing, and ChatGPT has gained prominence for its potential in EFL writing assessment. Assessing EFL writing is complex and time-consuming, demanding both expertise and resources, so EFL educators seek assessment methods that are efficient without sacrificing quality. Studies have shown that AI tools such as ChatGPT can automate assessment, provide instant feedback, and personalize learning. What remains less well understood is how effectively ChatGPT scores essays and provides qualitative feedback, and how its versions compare. This study addresses that gap by investigating how ChatGPT versions 3.5 and 4 can aid teachers in assessing EFL essays, focusing on the reliability of scoring and the relevance of feedback.
Literature Review
EFL writing assessment involves both scoring and qualitative feedback. Holistic or analytic scoring by human raters can be inconsistent, affected by factors such as rater experience and adherence to scoring criteria. Reliability is therefore essential to sound assessment, with inter-rater reliability capturing the degree of agreement among raters; numerous studies have examined inter-rater reliability in EFL writing. Qualitative feedback, for its part, should be relevant and actionable, enabling students to improve their writing.

Generalizability (G-) theory offers a robust framework for evaluating and improving assessment reliability, surpassing classical test theory (CTT). G-theory disentangles the multiple sources of variance affecting scores, examining rater variability, task variability, and their interactions. It quantifies the variance contributed by each source and, through decision (D-) studies, helps optimize assessment procedures, for example by projecting how reliability changes with the number of raters (see the sketch following this review). By addressing measurement error, G-theory promotes fairness, ensuring that scores are not unduly influenced by extraneous factors.

The integration of ChatGPT in EFL writing classrooms has shown benefits for both writing ability and assessment processes. Studies suggest ChatGPT can support the writing process by assisting with outlining, revising, editing, and proofreading. Comparisons with human instructors show that ChatGPT sometimes outperforms human teachers in feedback quality, and its impact on student motivation and writing skills has also been found to be positive. However, concerns remain about academic integrity and the need for new pedagogical strategies to optimize AI integration.
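To make this framework concrete, here is a minimal notational sketch of the crossed person-by-rater (p × r) design that underlies the analyses reported below; the notation is standard G-theory convention rather than reproduced from the paper itself.

```latex
% Random-effects model for person p rated by rater r
X_{pr} = \mu + \nu_p + \nu_r + \nu_{pr,e}

% Observed-score variance splits into person, rater, and residual components
\sigma^2(X_{pr}) = \sigma^2_p + \sigma^2_r + \sigma^2_{pr,e}

% Relative generalizability coefficient for a D-study with n_r raters:
% increasing n_r shrinks the error term and raises reliability
E\rho^2 = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_{pr,e}/n_r}
```

The person component reflects true differences in writing ability (the variance an assessment should capture), while the rater and residual components represent measurement error to be minimized.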
Methodology
This study used 30 CET-4 essays written by non-English-major undergraduate students at a Beijing university, rated by four experienced college English teachers and by ChatGPT versions 3.5 and 4. The writing task asked students to choose between taking a job and attending graduate school, in an essay of 120-180 words. The teachers and both ChatGPT versions scored the essays holistically against the CET-4 rubric and provided qualitative feedback on language, content, and organization. Teacher raters received brief online training before assessment, while each ChatGPT version was primed with three sample essays rated against the CET-4 criteria before assessing the 30 essays; data collection with ChatGPT was repeated four times at three-day intervals. G-theory analyses, run in the GENOVA software, examined score variability and reliability: person-by-rater G-studies and D-studies were conducted separately for ChatGPT 3.5, ChatGPT 4, and the teachers. Qualitative feedback was analyzed through a three-level coding scheme (language, content, organization), quantifying relevant feedback statements and identifying major themes.
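GENOVA is the software the study actually used; as an illustration only, the following Python sketch shows how variance components for a crossed person-by-rater design can be estimated from mean squares. The score matrix here is simulated, not the study's data.

```python
import numpy as np

def g_study(scores: np.ndarray):
    """Estimate p x r variance components from a persons-by-raters score matrix.

    A sketch of the expected-mean-squares approach that G-theory software
    such as GENOVA implements for a fully crossed design with one
    observation per cell. Negative component estimates are conventionally
    truncated to zero (omitted here for brevity).
    """
    n_p, n_r = scores.shape
    grand = scores.mean()
    person_means = scores.mean(axis=1)
    rater_means = scores.mean(axis=0)

    # Sums of squares for the two-way crossed design
    ss_p = n_r * np.sum((person_means - grand) ** 2)
    ss_r = n_p * np.sum((rater_means - grand) ** 2)
    ss_res = np.sum((scores - grand) ** 2) - ss_p - ss_r

    # Mean squares
    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_res = ss_res / ((n_p - 1) * (n_r - 1))

    # Variance components from the expected mean squares
    var_res = ms_res                   # sigma^2(pr,e): interaction + error
    var_p = (ms_p - ms_res) / n_r      # sigma^2(p): writing ability
    var_r = (ms_r - ms_res) / n_p      # sigma^2(r): rater severity
    return var_p, var_r, var_res

# Hypothetical 30-essay x 4-rater holistic score matrix (illustrative scale)
rng = np.random.default_rng(0)
ability = rng.normal(9, 2, size=(30, 1))
scores = ability + rng.normal(0, 1, size=(30, 4))

var_p, var_r, var_res = g_study(scores)
g_single = var_p / (var_p + var_res)   # relative G coefficient, one rater
print(f"person={var_p:.2f} rater={var_r:.2f} residual={var_res:.2f} G(1)={g_single:.2f}")
```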
Key Findings
G-studies revealed that the person variance component (students' writing ability) was the largest source of variance for all three rater groups (ChatGPT 3.5, ChatGPT 4, and teachers). The residual variance component, however, was considerably higher for ChatGPT 3.5 (32.88%) than for ChatGPT 4 (10.83%) or the teachers (18.58%), indicating more unexplained variance in ChatGPT 3.5's scoring. The rater variance component was the smallest in all three cases, suggesting consistent scoring across raters. D-studies showed reliability increasing with the number of raters: with a single rater, ChatGPT 3.5 was less reliable than the teachers (G = 0.66 vs. G = 0.80), whereas ChatGPT 4 was more reliable (G = 0.89). Qualitative analysis (Table 3) showed that both ChatGPT versions produced significantly more relevant feedback statements than the teachers across language, content, and organization (mean counts for ChatGPT 4: language 201.25, content 103.5, organization 106.5; for teachers: language 54.5, content 20.75, organization 36.25). The ChatGPT versions also distributed their feedback more evenly across the three aspects, while teachers concentrated on language. Table 4 lists the major feedback themes for each rater group: grammar and syntax errors, sentence structure and complexity, word choice and vocabulary, and spelling and typos (language); lack of depth and specificity, argument clarity and support, and engagement and persuasiveness (content); cohesion and flow, and structural concerns (organization). Specific examples of feedback on a sample essay illustrate the differences in feedback quality and focus.
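The D-study pattern follows directly from the relative G coefficient: holding the single-rater coefficient fixed, adding raters divides the residual error by the number of raters. A small sketch below projects the reported single-rater values to larger rater panels; the projected figures are derived from this formula (algebraically the Spearman-Brown relationship), not taken from the paper's own D-study tables.

```python
def project_g(g_single: float, n_raters: int) -> float:
    """Project the relative G coefficient from one rater to n_raters.

    Equivalent to the D-study formula G(n) = var_p / (var_p + var_res / n),
    rewritten in terms of the single-rater coefficient.
    """
    return n_raters * g_single / (1 + (n_raters - 1) * g_single)

# Single-rater coefficients reported in the study
for label, g1 in [("ChatGPT 3.5", 0.66), ("teachers", 0.80), ("ChatGPT 4", 0.89)]:
    print(label, {n: round(project_g(g1, n), 2) for n in (1, 2, 4)})
```

Under this projection, even ChatGPT 3.5's reliability rises quickly with repeated ratings, which is consistent with the study's design choice of collecting ChatGPT scores four times.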
Discussion
The findings show a clear improvement in ChatGPT's capabilities from version 3.5 to version 4, with version 4 surpassing the human raters in scoring reliability. Both ChatGPT versions provided more comprehensive feedback than the teachers, particularly on content and organization, which aligns with previous research highlighting ChatGPT's potential for enhancing writing processes and providing feedback. The findings suggest that advanced AI can augment EFL assessment, freeing teachers to focus on personalized teaching. The more balanced feedback from ChatGPT, compared with the teachers' concentration on language, is especially noteworthy. Overall, the study supports using AI tools such as ChatGPT 4 to improve EFL writing assessment.
Conclusion
This study demonstrates ChatGPT 4's superior reliability in scoring EFL essays compared with human raters, and shows that both versions provide more comprehensive feedback. These results point to AI's potential to improve the fairness and effectiveness of EFL writing assessment; for educators, ChatGPT can free up time for personalized instruction. Future research should involve larger samples, more diverse student populations, and comparisons with other AI models. Despite its limitations, the study strongly supports ChatGPT's transformative potential in language education.
Limitations
The study's limitations include a small sample size (30 essays) from a single university, limiting generalizability. Using teachers from one institution could introduce bias, and focusing on only two ChatGPT versions might not capture the variability of other AI models. Future research should address these limitations by using larger, more diverse samples and exploring various AI models and configurations.