Evaluating the role of ChatGPT in enhancing EFL writing assessments in classroom settings: A preliminary investigation

J. Li, J. Huang, et al.

This study explores how ChatGPT can enhance English-as-a-foreign-language (EFL) writing assessment. By comparing holistic scores and qualitative feedback from ChatGPT versions 3.5 and 4 with those of college English teachers, the researchers found that both AI tools provide more relevant feedback on language, content, and organization than the teachers. Conducted by Junfei Li, Jinyan Huang, Wenyan Wu, and Patrick B. Whipple, this research may help redefine how EFL writing evaluation is approached in classrooms.

Introduction
The study investigates whether and how ChatGPT can enhance EFL writing assessment in classroom contexts. It addresses the complexity and resource intensity of assessing EFL essays and the need for reliable scoring and relevant qualitative feedback. With AI increasingly used in language education, the paper focuses on comparing ChatGPT versions 3.5 and 4 to college English teachers in two areas: (a) reliability of holistic scores for EFL essays and (b) relevance of qualitative feedback on language, content, and organization. The purpose is to determine ChatGPT’s role in real classroom settings, potentially improving assessment efficiency, fairness, and instructional support.
Literature Review
The review outlines two major components of EFL writing assessment: scoring (holistic or analytic) and qualitative feedback on language, content, and organization. It highlights challenges to inter-rater reliability arising from rater experience, adherence to criteria, and error tolerance, emphasizing the need for reliable measures. Generalizability (G-) theory is presented as superior to classical test theory for disentangling variance sources (persons, raters, tasks) and for optimizing assessment design via D-studies to improve reliability and fairness. A synthesis of recent studies on ChatGPT in EFL writing indicates benefits across outlining, revising, editing, and proofreading; potential to complement or surpass teacher feedback in depth and clarity; and positive effects on motivation and engagement. Concerns remain about academic integrity and the need for ethical guidelines. The review identifies a gap in comparative evidence between ChatGPT 3.5 and 4 in authentic classroom assessment scenarios, particularly regarding scoring reliability and feedback relevance.
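To make the contrast with classical test theory concrete, the following is a minimal sketch, in standard G-theory notation rather than anything taken from the paper itself, of the single-facet person-by-rater (p × r) random design used later in the study and the two coefficients that D-studies report:

\[
X_{pr} = \mu + \nu_p + \nu_r + \nu_{pr,e},
\qquad
\sigma^2(X_{pr}) = \sigma^2_p + \sigma^2_r + \sigma^2_{pr,e}
\]
\[
E\rho^2 = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_{pr,e}/n'_r},
\qquad
\Phi = \frac{\sigma^2_p}{\sigma^2_p + \left(\sigma^2_r + \sigma^2_{pr,e}\right)/n'_r}
\]

Here \sigma^2_p is the person (true ability) variance, \sigma^2_r the rater (leniency/stringency) variance, \sigma^2_{pr,e} the residual, and n'_r the number of raters assumed in a D-study. The G-coefficient E\rho^2 addresses relative (rank-ordering) decisions, while the Phi-coefficient also penalizes rater main effects and so addresses absolute decisions; classical test theory offers no comparable separation of error sources.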
Methodology
Design: A mixed-methods evaluation combining G-theory analyses of score reliability with coding and quantification of qualitative feedback.

Participants and setting: Thirty non-English-major undergraduates (21 females, 9 males) from a Beijing university wrote CET-4 essays during a regular 30-minute class session. Four college English teachers (each with over 10 years of teaching experience and national CET-4 rating experience) and ChatGPT versions 3.5 and 4 served as raters.

Task: An authentic CET-4 prompt requiring a 120–180 word essay choosing between working after graduation or attending graduate school and explaining the reasons.

Rater training and procedures: Teacher raters received brief online training: a review of the CET-4 criteria, rating of three benchmark essays, and discussion. ChatGPT 3.5 and 4 were similarly calibrated on the same three sample essays to align with the CET-4 criteria. Each rater type holistically scored all 30 essays and provided qualitative feedback on language, content, and organization. ChatGPT assessments (both versions) were conducted on four separate occasions at three-day intervals, following the same procedures as the human raters.

Data analysis: G-theory analyses were conducted with GENOVA, which handles balanced designs. Three person-by-rater random-effects G-studies were run: person × ChatGPT 3.5 rater, person × ChatGPT 4 rater, and person × teacher rater, considering only the rater facet so that score variability and reliability could be compared across rater types. D-studies computed G- and Phi-coefficients for scenarios of 1–10 raters (the D-study arithmetic is sketched below). Qualitative feedback was independently color-coded and thematically categorized (language, content, organization) using a three-level coding scheme, then quantified as counts of relevant/effective feedback for descriptive statistics and thematic synthesis.
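The GENOVA output itself is not reproduced here, but the D-study step is simple enough to sketch. The Python snippet below is a minimal illustration under the person-by-rater design described above, not the authors' code; the function and variable names are the writer's own. Given estimated variance components from a p × r G-study, it returns G- and Phi-coefficients for 1 to 10 raters:

def d_study(var_person, var_rater, var_residual, max_raters=10):
    """G- and Phi-coefficients for 1..max_raters raters in a
    single-facet person-by-rater (p x r) random design."""
    results = []
    for n_r in range(1, max_raters + 1):
        rel_error = var_residual / n_r                 # relative error: residual only
        abs_error = (var_rater + var_residual) / n_r   # absolute error: rater effect counts too
        g = var_person / (var_person + rel_error)      # G-coefficient (relative decisions)
        phi = var_person / (var_person + abs_error)    # Phi-coefficient (absolute decisions)
        results.append((n_r, round(g, 2), round(phi, 2)))
    return results

Averaging over more raters shrinks both error terms by a factor of the number of raters, which is why the reported coefficients rise as raters are added.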
Key Findings
Reliability and variance components (G-studies):
- Person (student ability) variance accounted for most of the total variance: ChatGPT 3.5 64.91%, ChatGPT 4 87.63%, teachers 76.59%.
- Residual (unexplained) variance: ChatGPT 3.5 32.88%, ChatGPT 4 10.83%, teachers 18.58%.
- Rater variance (leniency/stringency differences) was minimal: ChatGPT 3.5 2.21%, ChatGPT 4 1.54%, teachers 4.83%, indicating generally consistent scoring across raters within each group.

Reliability coefficients (D-studies):
- With 1 rater per essay: ChatGPT 3.5 G = 0.66, Phi = 0.65; ChatGPT 4 G = 0.89, Phi = 0.88; teachers G = 0.80, Phi = 0.77.
- With 2 raters per essay: ChatGPT 3.5 G = 0.80, Phi = 0.79; ChatGPT 4 G = 0.94, Phi = 0.93; teachers G = 0.89, Phi = 0.87.
- Reliability increased with more raters for all groups; ChatGPT 4 consistently outperformed the teachers, whereas ChatGPT 3.5 lagged behind them (the arithmetic behind these coefficients is checked in the sketch after these findings).

Qualitative feedback quantity and relevance (descriptive):
- Language feedback (total relevant): ChatGPT 3.5 718 (mean 179.5), ChatGPT 4 805 (201.25), teachers 218 (54.5).
- Content feedback (total relevant): ChatGPT 3.5 358 (89.5), ChatGPT 4 414 (103.5), teachers 83 (20.75).
- Organization feedback (total relevant): ChatGPT 3.5 377 (94.25), ChatGPT 4 426 (106.5), teachers 145 (36.25).

Thematic feedback patterns:
- Language: All groups noted grammar/syntax issues, sentence structure/complexity, and word choice/vocabulary; ChatGPT 4 and the teachers also flagged spelling/typos.
- Content: Common issues included lack of depth/specificity, unclear arguments, and limited support; ChatGPT 4 additionally highlighted missing counterarguments and weak persuasiveness.
- Organization: Frequent concerns involved cohesion/flow, transitions, and structural problems (unclear organization, weak or missing conclusions).

Overall: ChatGPT 4 yielded the highest reliability and the most relevant feedback across language, content, and organization; ChatGPT 3.5 provided more relevant feedback than the teachers but had lower scoring reliability than the teachers.
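As a quick plausibility check (not part of the paper's own reporting), the D-study coefficients above follow directly from the variance-component percentages; because the coefficients are ratios, the percentages can stand in for the raw variance components. A minimal sketch in Python:

# Variance-component percentages reported above: (person, rater, residual).
groups = {
    "ChatGPT 3.5": (64.91, 2.21, 32.88),
    "ChatGPT 4":   (87.63, 1.54, 10.83),
    "Teachers":    (76.59, 4.83, 18.58),
}

for name, (p, r, e) in groups.items():
    for n_raters in (1, 2):
        g = p / (p + e / n_raters)           # G-coefficient: residual error only
        phi = p / (p + (r + e) / n_raters)   # Phi-coefficient: rater effect counted too
        print(f"{name}, {n_raters} rater(s): G = {g:.2f}, Phi = {phi:.2f}")

# For example, ChatGPT 4 with one rater: G = 87.63 / (87.63 + 10.83) ≈ 0.89 and
# Phi = 87.63 / (87.63 + 1.54 + 10.83) ≈ 0.88, matching the values reported above.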
Discussion
The study’s findings address the research questions by demonstrating that ChatGPT 4 can produce highly reliable holistic scores for EFL essays, surpassing experienced teacher raters, while ChatGPT 3.5 provides lower reliability than the teachers. Both versions of ChatGPT generated more comprehensive and relevant qualitative feedback than the teachers, distributing attention across language, content, and organization, whereas the teachers emphasized language more heavily. These results suggest that advanced AI models can enhance fairness and consistency in classroom assessments and provide richer feedback to support revision and learning. They also align with prior literature on AI’s capacity to augment assessment and instruction, indicating practical potential for offloading routine assessment tasks to AI so that teachers can focus on individualized pedagogy. Ethical considerations (e.g., academic integrity) and implementation guidelines remain crucial.
Conclusion
This preliminary investigation shows that ChatGPT 4 offers higher reliability than human teachers in the holistic scoring of EFL essays, and that both ChatGPT versions provide more relevant feedback across multiple writing dimensions than the teachers. The study contributes evidence that AI can enhance classroom-based EFL writing assessment by improving scoring consistency and supplying actionable, balanced feedback. Future research should expand samples across institutions and proficiency levels, compare additional AI models and configurations, and examine long-term learning outcomes and ethical practices. Adopting ChatGPT may reduce teacher workload, facilitate personalized learning, and modernize assessment practices when used with clear policies and pedagogical integration.
Limitations
Generalizability is limited by the small sample (30 essays) from a single university and the inclusion of teacher raters from one institution. Only two ChatGPT versions were examined, potentially omitting variability across other AI systems or settings. Although rater calibration was conducted, unmodeled facets may have contributed to residual variance. Findings may not transfer to different tasks, proficiency levels, or cultural contexts.