AI, write an essay for me: A large-scale comparison of human-written versus ChatGPT-generated essays

Computer Science

S. Herbold, A. Hautli-Janisz, et al.

This large-scale study by Herbold, Hautli-Janisz, Heuer, Kikteva, and Trautsch reveals that AI tools like ChatGPT can generate argumentative essays of higher quality than their human counterparts. Explore the intriguing linguistic characteristics found in AI writing and how educators can adapt their methods to harness these powerful technologies.

Introduction
The paper addresses the rapid emergence of large-scale natural language generation systems, notably ChatGPT, and their anticipated impact on education and knowledge transfer. Motivated by calls for independent, rigorous evaluation beyond vendor-reported benchmarks, the study investigates the quality of AI-generated argumentative essays compared to human-written essays typical of high-school contexts. The authors formulate three research questions: (RQ1) How good is ChatGPT (GPT-3.5 and GPT-4) at writing argumentative student essays? (RQ2) How do AI-generated essays compare to essays written by humans? (RQ3) What linguistic devices are characteristic of human versus AI-generated content? The purpose is to provide an independent, systematic, and statistically rigorous assessment of essay quality and linguistic characteristics, informing educators and stakeholders about the capabilities and implications of generative AI in educational settings.
Literature Review
The related work surveys foundation models (GPT-3, GPT-J) and instruction-tuned models (InstructGPT/ChatGPT, FLAN family) and notes the lack of transparency regarding GPT-4 training. Prior evaluations show mixed evidence: some linguistic competencies align with human usage (ambiguity resolution, thematic structure), while others reveal weaknesses in low-frequency grammatical constructions and reasoning. Studies comparing human and ChatGPT outputs (e.g., Turing-test style assessments) suggest humans can often detect AI text, yet AI responses are sometimes rated more helpful. Evaluations on academic benchmarks (AP exams, USMLE, mathematics, arithmetic) indicate variable performance, with concerns about data contamination and methodological rigour in proprietary benchmarks. Closest to this work, small-scale essay studies (e.g., physics essays) find ChatGPT can produce high-quality essays but with variability and limited sample sizes precluding strong statistical conclusions. The review also covers linguistic fingerprinting and its applications in authorship identification, educational assessment (readability, cohesion), and automated essay scoring, motivating the linguistic feature analyses employed in this study.
Methodology
Design: Within-subjects study in which each rater (a secondary-school teacher) evaluated six essays blind to authorship (human vs. ChatGPT-3.5 vs. ChatGPT-4).
Data: 90 argumentative topics sourced from Stab and Gurevych (2014), based on essayforum.com (non-native high-school students seeking feedback). For each topic, three essays were available: one human-written student essay and two AI-generated essays (ChatGPT-3.5 and ChatGPT-4, as of March 22, 2023). ChatGPT essays were produced via the browser interface using a zero-shot prompt: “Write an essay with about 200 words on ‘[topic]’.” No few-shot or template guidance was used. Dataset statistics (words per essay / sentences per essay / words per sentence): Human 339.13 / 18.98 / 18.60; ChatGPT-3.5 247.96 / 12.40 / 20.31; ChatGPT-4 253.70 / 13.08 / 19.57.
Raters: 111 German secondary-school teachers (predominantly from grammar schools) who attended a 2-hour online training on ChatGPT. After an initial 45-minute lecture, teachers completed the questionnaire (about 30 minutes) and then continued the training. Participation was voluntary and did not affect certification.
Questionnaire and scoring: Raters self-assessed their English proficiency (CEFR-based) and their confidence for each essay. Each essay was rated on seven criteria using 7-point Likert scales (0–6), following guidelines from the Ministry for Education of Lower Saxony: Topic and completeness; Logic and composition; Expressiveness and comprehensibility; Language mastery; Complexity; Vocabulary and text linking; Language constructs. After rating all six essays, raters saw a summary that included authorship information.
Computational linguistic analysis: The authors computed essay-level linguistic features: lexical diversity (MTLD); syntactic complexity via (a) maximum dependency tree depth and (b) counts of specific clause-related dependency relations; nominalizations (counts of nouns with derivational suffixes such as -ion, -ment, -ance); modals (POS-based) and epistemic markers (lexicon-based, e.g., “I think,” “in my opinion”); and discourse markers (based on the PDTB list, pruned to exclude ambiguous items such as “like,” “for,” “in”).
Statistical analysis: Inter-rater reliability was assessed via Cronbach’s α per criterion and source. Group differences were tested using Wilcoxon tests with Holm–Bonferroni correction (family-wise α=0.05); effect sizes were computed as Cohen’s d. Means and 95% confidence intervals were estimated with the bootstrap; distributions were visualized via violin plots. Potential confounds (rater English proficiency, rating confidence) were examined using Pearson’s r correlations with ratings, as were associations between linguistic features and related rating criteria (e.g., complexity with syntactic complexity/nominalizations; vocabulary with lexical diversity; logic with discourse markers). Implementation used Python (pandas, numpy, pingouin, scipy, seaborn); a minimal sketch of this pipeline follows at the end of this section.
Replication: Data and analysis code are publicly available (https://github.com/sherbold/chatgpt-student-essay-study).
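To illustrate the statistical workflow described above, here is a minimal sketch using the libraries the authors name (pandas, pingouin, scipy). The column names (topic, source, criterion, rating_slot, score, confidence) and the file ratings.csv are hypothetical placeholders; the authors' exact data layout and test variants are in their replication package.

```python
# Minimal sketch of the statistical workflow; hypothetical column names, not the authors' exact code.
import numpy as np
import pandas as pd
import pingouin as pg
from scipy import stats

# Hypothetical long-format table: one row per (essay, source, criterion, rating slot).
ratings = pd.read_csv("ratings.csv")

CRITERIA = ["Topic and completeness", "Logic and composition",
            "Expressiveness and comprehensibility", "Language mastery",
            "Complexity", "Vocabulary and text linking", "Language constructs"]

# 1) Inter-rater reliability: Cronbach's alpha per criterion and source,
#    with essays as rows and rating slots (1st, 2nd, ... rating of an essay) as columns.
for (crit, src), sub in ratings.groupby(["criterion", "source"]):
    wide = sub.pivot_table(index="topic", columns="rating_slot", values="score")
    alpha, ci = pg.cronbach_alpha(data=wide.dropna())  # dropna: keep essays with all slots filled
    print(f"alpha({crit}, {src}) = {alpha:.2f}")

# 2) Pairwise comparison per criterion (here: Human vs. ChatGPT-4):
#    Wilcoxon rank-sum (Mann-Whitney U) tests, Holm correction, Cohen's d effect sizes.
pvals = []
for crit in CRITERIA:
    human = ratings.query("criterion == @crit and source == 'Human'")["score"]
    gpt4 = ratings.query("criterion == @crit and source == 'ChatGPT-4'")["score"]
    stat, p = stats.mannwhitneyu(human, gpt4)
    d = pg.compute_effsize(gpt4, human, eftype="cohen")
    pvals.append(p)
    print(f"{crit}: U = {stat:.1f}, raw p = {p:.4f}, Cohen's d = {d:.2f}")

reject, p_adj = pg.multicomp(pvals, alpha=0.05, method="holm")  # family-wise correction
for crit, p, sig in zip(CRITERIA, p_adj, reject):
    print(f"{crit}: adjusted p = {p:.4f}, significant = {sig}")

# 3) Bootstrap 95% confidence interval for a mean score.
human_scores = ratings.query("source == 'Human'")["score"].to_numpy()
ci = stats.bootstrap((human_scores,), np.mean, confidence_level=0.95).confidence_interval
print(f"Human mean = {human_scores.mean():.2f}, 95% CI = [{ci.low:.2f}, {ci.high:.2f}]")

# 4) Confound check: Pearson correlation between rater confidence and assigned score.
r, p = stats.pearsonr(ratings["confidence"], ratings["score"])
print(f"confidence vs. score: r = {r:.2f}, p = {p:.3f}")
```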
Key Findings
- Sample and reliability: 111 teachers participated, yielding 658 ratings across 270 essays (90 topics × 3 sources). Most essays received 2–3 independent ratings. Inter-rater reliability was excellent across criteria (Cronbach’s α > 0.9) except Language mastery (α = 0.89), indicating robust assessment consistency.
- Overall quality: Clear ordering of mean scores across all criteria: Human < ChatGPT-3.5 < ChatGPT-4. Overall means (M, SD): Human 3.69 (1.26), ChatGPT-3.5 4.36 (1.14), ChatGPT-4 4.68 (1.11) on 0–6 scales.
- Human vs. ChatGPT-3.5: Significant improvements for AI across all criteria with medium to large effect sizes (Cohen’s d ≈ 0.52–1.15). Smallest effects for Expressiveness/comprehensibility and Complexity; largest for Language mastery, indicating fewer language errors in AI outputs.
- Human vs. ChatGPT-4: Larger differences with large to very large effects (d ≈ 0.88–1.43). Again, smallest for Expressiveness/comprehensibility and Complexity; largest for Language mastery. Humans’ mean Language mastery remained moderate (M = 3.90), while the AI models achieved exceptionally high scores (ChatGPT-3.5 M = 5.03; ChatGPT-4 M = 5.25).
- ChatGPT-4 vs. ChatGPT-3.5: ChatGPT-4 had higher mean ratings on all criteria, but significant gains appeared specifically for Logic and composition, Vocabulary and text linking, and Complexity, with small-to-medium effects (d ≈ 0.45–0.50).
- Linguistic characteristics (a feature-extraction sketch follows after this list):
  • Syntactic complexity (clauses) and nominalizations: Higher in AI essays than in human essays; GPT-4 generally highest.
  • Modals and epistemic markers: More frequent in human essays; AI uses fewer, with GPT-4 the lowest (epistemic markers near zero).
  • Lexical diversity (MTLD): Humans > ChatGPT-3.5, but ChatGPT-4 > humans, indicating improved diversity in the newer model.
  • Discourse markers: No significant difference between humans and ChatGPT-3.5; GPT-4 uses significantly fewer discourse markers than both.
  • Syntactic complexity (depth): No significant differences between human vs. ChatGPT-3.5 or between ChatGPT-3.5 vs. ChatGPT-4.
- Correlations (all weak): Complexity ratings correlated positively with syntactic complexity (depth r ≈ 0.16; clauses r ≈ 0.19) and nominalizations (r ≈ 0.22). Vocabulary ratings correlated with lexical diversity (r ≈ 0.12). Logic ratings correlated negatively with discourse markers (r ≈ −0.14), echoing prior findings that overt cohesion markers may not signal higher-quality logic.
- Confounds: Neither rater English proficiency nor self-rated confidence correlated with assigned scores, reducing concerns about these as confounding factors.
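The summary does not name the exact NLP toolchain behind these feature counts. The following is a minimal sketch, assuming spaCy as the parser/tagger, a toy epistemic-marker lexicon, and a simplified one-pass MTLD (the standard measure averages forward and backward passes); it illustrates the kinds of features reported, not the authors' implementation.

```python
# Sketch of essay-level feature extraction; spaCy and the tiny lexicons below are
# illustrative assumptions, not the authors' exact toolchain.
import spacy

nlp = spacy.load("en_core_web_sm")

EPISTEMIC_MARKERS = ["i think", "i believe", "in my opinion"]          # toy lexicon
NOMINAL_SUFFIXES = ("ion", "ions", "ment", "ments", "ance", "ances")   # -ion, -ment, -ance
CLAUSE_DEPS = {"ccomp", "xcomp", "advcl", "acl", "relcl", "csubj"}     # clause-related relations


def dependency_depth(token):
    """Depth of the dependency subtree rooted at `token`."""
    children = list(token.children)
    if not children:
        return 1
    return 1 + max(dependency_depth(child) for child in children)


def mtld_one_pass(tokens, threshold=0.72):
    """Simplified one-directional MTLD; standard MTLD averages a forward and a backward pass."""
    factors, types, count = 0.0, set(), 0
    for tok in tokens:
        count += 1
        types.add(tok)
        if len(types) / count <= threshold:
            factors += 1.0
            types, count = set(), 0
    if count:  # partial factor for the remaining segment
        factors += (1 - len(types) / count) / (1 - threshold)
    return len(tokens) / factors if factors else float("nan")


def essay_features(text):
    doc = nlp(text)
    words = [t.lower_ for t in doc if t.is_alpha]
    lowered = doc.text.lower()
    return {
        "lexical_diversity_mtld": mtld_one_pass(words),
        "max_dep_depth": max(dependency_depth(sent.root) for sent in doc.sents),
        "clause_relations": sum(t.dep_ in CLAUSE_DEPS for t in doc),
        "nominalizations": sum(t.pos_ == "NOUN" and t.lower_.endswith(NOMINAL_SUFFIXES) for t in doc),
        "modals": sum(t.tag_ == "MD" for t in doc),  # Penn Treebank modal tag
        "epistemic_markers": sum(lowered.count(m) for m in EPISTEMIC_MARKERS),
    }


print(essay_features("I think schools should ban homework because it adds little value."))
```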
Discussion
The findings directly address the research questions: (1) ChatGPT produces high-quality argumentative essays, with GPT-4 achieving the highest ratings; (2) AI-generated essays significantly outperform human student essays across all rubric criteria, especially in language mastery; (3) Distinct linguistic fingerprints differentiate AI and human writing. AI essays are more structurally rigid and formal, featuring greater syntactic complexity and nominalizations, and fewer modals/epistemic markers. GPT-4’s higher lexical diversity and reduced use of discourse markers suggest it achieves coherence through paragraph-level organization rather than explicit connectives. These results imply that generative AI can meet and exceed typical high-school essay standards, challenging traditional homework and assessment practices. Hallucinations were not problematic for the argumentative genre studied, as tasks emphasize reasoning over factual recall. Potential data contamination is discussed as unlikely to drive results, given the generic nature of topics and lack of resemblance to the human essay corpus. Overall, the study provides a rigorous, independent benchmark contrasting with less transparent vendor evaluations and highlights evolving linguistic properties between model generations.
Conclusion
The study demonstrates that ChatGPT, especially GPT-4, consistently produces essays rated higher than human student essays across a comprehensive quality rubric, with clear advances in logical structure, complexity, and vocabulary. Given the ease of access to such tools, educational practice must adapt: like calculators in mathematics, AI should be integrated thoughtfully after foundational skills are established. Educators should shift from solely assigning essays to guiding students in critical analysis of AI outputs, reasoning quality, coherence, and correctness. The work contributes an independent, large-scale, statistically sound evaluation of AI-generated essays and their linguistic characteristics, offering insight into model development over time. Future research should examine argumentative structure in more detail, explore generalization to other models, and refine pedagogical strategies leveraging AI for competency-based learning objectives.
Limitations
- Participant and author populations: Neither the essay authors (students) nor the raters (teachers) were native English speakers; however, teacher self-assessed proficiency and rating confidence did not correlate with scores, and inter-rater reliability was high.
- Topic/source representativeness: Human essays were sourced from a single forum and may not represent the broader population of essay topics or student abilities.
- Model scope and configuration: Only ChatGPT-3.5 and GPT-4 were evaluated, via the browser interface, with a single zero-shot prompt (about 200 words). No few-shot prompts, alternative temperatures, or multiple generations per topic were explored (see the replication sketch after this list).
- Generalizability to other models: Results may not transfer to other LLMs (e.g., Bard, Alpaca, Dolly), especially regarding linguistic fingerprints.
- Potential training data contamination: Topics may overlap with model training data; proprietary training details prevent verification. The authors argue the topics are general and the results are unlikely to reflect memorization, but contamination cannot be fully ruled out.
- Measures and correlations: Some expected feature–rating associations were weak; discourse-marker usage correlated negatively with logic ratings, warranting further investigation.
- Genre/task scope: The study focused on short argumentative essays; findings may not extend to other genres (e.g., research reports) or tasks requiring factual precision.
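As a concrete illustration of the configuration limitation above, a replication could vary temperature and draw multiple generations per topic programmatically. The study itself used the browser interface with a single zero-shot prompt, so the sketch below is an assumed extension via the OpenAI Python API; model names and availability change over time, and the example topic is hypothetical.

```python
# Sketch of how a replication might vary sampling settings via the OpenAI API.
# The original study used the ChatGPT browser interface, not this API; model name,
# temperature, and sample count here are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def generate_essays(topic: str, model: str = "gpt-4", temperature: float = 0.7, n: int = 3):
    """Return n independently sampled essays for one topic (the paper generated one per model)."""
    prompt = f"Write an essay with about 200 words on '{topic}'."  # the paper's zero-shot prompt
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        n=n,
    )
    return [choice.message.content for choice in response.choices]


# Illustrative topic, not one of the 90 topics used in the study.
essays = generate_essays("Should students be required to wear school uniforms?")
```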