A large-scale comparison of human-written versus ChatGPT-generated essays

Education

S. Herbold, A. Hautli-Janisz, et al.

This large-scale study by Steffen Herbold, Annette Hautli-Janisz, Ute Heuer, Zlata Kikteva, and Alexander Trautsch finds that ChatGPT generates argumentative essays that are rated more highly than those written by high-school students. The findings highlight significant differences in writing style and point to a pressing need for educators to adapt their teaching methods to incorporate AI tools effectively.

Introduction
The rapid rise of large language models (LLMs) like ChatGPT has sparked widespread debate about their societal impact, particularly on education. While anecdotal evidence and proprietary benchmarks exist, a rigorous, independent assessment of AI-generated content in educational contexts is lacking. This study addresses this gap by systematically comparing human-written and ChatGPT-generated argumentative essays, a genre that forms a core part of high-school curricula. Argumentative essays require students to formulate a position on a controversial topic, gather evidence, and construct a coherent argument. Understanding LLMs' abilities in this domain is important for gauging their capabilities and anticipating the challenges facing educators. Previous research offers individual examples and limited analyses, but this study is the first to systematically compare human and AI-generated argumentative essays across different ChatGPT versions using a large pool of human experts (high-school teachers) for evaluation. The goals are to determine how the quality of AI-generated essays compares to that of human-written ones and to identify the linguistic characteristics that distinguish the two.
Literature Review
Recent research on large-scale Natural Language Generation (NLG) systems, especially ChatGPT, is still emerging. Studies demonstrate that ChatGPT displays some human-like language capabilities, including the reuse of sentence structure, understanding of word meaning, and identification of thematic structure. However, limitations exist; for example, it struggles with low-frequency grammatical constructions. Research has also analyzed the structure of AI-generated language, noting differences in the linguistic devices used compared to humans. The field of linguistic fingerprinting, which identifies the unique linguistic patterns in an individual's writing style, is relevant here because it offers methods to characterize and differentiate human and AI-generated text. Existing evaluations of ChatGPT by OpenAI, while impressive in some areas, suffer from significant limitations, including a conflict of interest, methodological flaws, and a lack of transparency in the assessment process. Other studies have examined ChatGPT's performance in various contexts, including AP classes, physics essays, question-answer scenarios, medical licensing exams, mathematical tasks, and even scientific paper writing. These studies reveal both strengths and weaknesses of the model, highlighting inconsistent performance across domains and tasks and suggesting the need for more robust evaluations.
Methodology
This study used argumentative essay topics from an existing corpus and prompted ChatGPT-3 and ChatGPT-4 to generate essays. The prompt instructed the models to write essays of approximately 200 words on each topic, avoiding structured prompts or few-shot learning to reflect a realistic zero-shot scenario. A dataset of human-written essays from a German high-school essay forum served as the baseline for comparison. The essays (human, ChatGPT-3, ChatGPT-4) were rated by 139 high-school teachers who participated in a training session explaining the study's methodology and rating criteria. The rating rubric assessed seven criteria: topic and completeness, logic and composition, expressiveness and comprehensiveness, language mastery, complexity, vocabulary and text linking, and language constructs. Each teacher evaluated six randomly selected essays, resulting in a total of 658 ratings across 270 essays. The data analysis involved calculating inter-rater reliability (Cronbach's alpha), statistical comparisons of ratings between essay types (Wilcoxon rank-sum tests), and analysis of linguistic characteristics (lexical diversity, syntactic complexity, nominalizations, modals, epistemic markers, and discourse markers). Teachers' self-assessed English skills and confidence in their ratings were treated as potential confounding variables in the correlation analyses.
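To make the two analysis steps concrete, the following short Python sketch (not the authors' code; the ratings, matrix layout, and 0-6 scale values are invented for illustration) shows how Cronbach's alpha and a Wilcoxon rank-sum test of the kind described above could be computed.

import numpy as np
from scipy.stats import ranksums

def cronbach_alpha(scores):
    """Inter-rater reliability; rows = essays, columns = raters (treated as items)."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)        # variance of each rater's scores
    total_var = scores.sum(axis=1).var(ddof=1)    # variance of the summed scores
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Hypothetical ratings: five essays, each scored by three raters.
ratings = np.array([
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 3],
    [4, 4, 5],
])
print(f"Cronbach's alpha = {cronbach_alpha(ratings):.2f}")

# Hypothetical scores on one criterion (0-6 Likert scale) for two essay groups.
human_scores = np.array([2, 3, 3, 4, 2, 3, 4, 3])
chatgpt4_scores = np.array([4, 5, 4, 5, 5, 4, 5, 4])
statistic, p_value = ranksums(human_scores, chatgpt4_scores)
print(f"Wilcoxon rank-sum: statistic={statistic:.2f}, p={p_value:.4f}")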
Key Findings
The results show that AI models generated significantly higher-quality argumentative essays than those written by students across all rating criteria. ChatGPT-4 significantly outperformed ChatGPT-3 in logical structure, language complexity, vocabulary richness, and text linking. The inter-rater reliability was excellent (α > 0.9) for most criteria, indicating consistent evaluation. The average scores for the seven criteria consistently ranked student essays lowest, ChatGPT-3 in the middle, and ChatGPT-4 highest. Effect sizes comparing human versus ChatGPT essays were generally medium to large. Specifically, ChatGPT-4 outperformed human essays by an average of approximately one point on a seven-point Likert scale. Analysis of linguistic characteristics revealed significant differences between human and AI-generated essays. AI models exhibited higher sentence complexity, more nominalizations, and lower use of modal and epistemic constructions compared to human essays. While expected correlations between linguistic features and rating criteria were observed (e.g., complexity and sentence complexity), a counter-intuitive negative correlation was found between logic ratings and discourse markers in the AI-generated texts, potentially due to the rigid structure of the AI-generated essays.
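As a rough illustration of the effect-size and correlation analyses mentioned above, the sketch below (all numbers are hypothetical, and Cohen's d is used here as one common effect-size measure rather than the paper's exact choice) computes a standardized mean difference for a human versus ChatGPT-4 comparison and a Spearman correlation between discourse-marker counts and logic ratings.

import numpy as np
from scipy.stats import spearmanr

def cohens_d(a, b):
    """Standardized mean difference using a pooled standard deviation."""
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (b.mean() - a.mean()) / pooled_sd

# Hypothetical per-essay mean scores on a 0-6 scale (roughly one point apart).
human = np.array([2.5, 3.8, 3.0, 2.2, 3.4, 2.9])
chatgpt4 = np.array([3.6, 4.5, 4.0, 3.2, 4.8, 3.9])
print(f"Cohen's d = {cohens_d(human, chatgpt4):.2f}")

# Hypothetical discourse-marker counts vs. logic ratings for five AI-generated essays.
discourse_markers = np.array([12, 9, 15, 11, 8])
logic_ratings = np.array([3.5, 4.1, 3.0, 3.8, 4.4])
rho, p = spearmanr(discourse_markers, logic_ratings)
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")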
Discussion
The findings address the central research questions by demonstrating that ChatGPT, particularly ChatGPT-4, produces argumentative essays of higher rated quality than those of the student group. The consistent outperformance of the AI models underscores the implications for educational assessment and suggests a need to adapt teaching strategies. The differences in linguistic characteristics show that the AI models create text with distinct structural features, such as highly structured introductions and conclusions, potentially reflecting the models' training data and generation process. The absence of hallucinations in this task may relate to the focus on argumentation rather than factual accuracy. While the study controlled for potential confounding variables, the use of non-native-speaking students and raters might affect the generalizability of the results to native-speaking populations. Although language mastery was a strength of the AI models, the differences in content quality are likely to remain consistent across native and non-native writers. The specific essay topics might not fully represent the diversity of topics encountered in real educational settings, but the results suggest that the observed patterns are robust.
Conclusion
This study demonstrates that LLMs like ChatGPT generate argumentative essays that exceed the quality of those written by high-school students. This necessitates a shift in educational approaches, emphasizing critical thinking and the responsible use of AI tools. Educators need to adapt assessment methods and integrate AI models as valuable learning aids, much as calculators were integrated into mathematics education. Future research should explore how AI affects writing skills and argumentation strategies, expand to native speakers, and investigate diverse AI models. Examining the long-term effects of AI on learning processes and the evolution of AI capabilities is also crucial.
Limitations
Several limitations warrant consideration. The use of non-native speakers in both essay writing and rating might limit the generalizability of findings to native-speaking populations. The selection of essay topics may not represent the full range of topics typically encountered in educational settings. Finally, the rapid pace of development in LLMs means that the specific models studied may not perfectly represent the current state of AI.
Listen, Learn & Level Up
Over 10,000 hours of research content in 25+ fields, available in 12+ languages.
No more digging through PDFs, just hit play and absorb the world's latest research in your language, on your time.
listen to research audio papers with researchbunny