
Education
A large-scale comparison of human-written versus ChatGPT-generated essays
S. Herbold, A. Hautli-Janisz, et al.
This large-scale study by Steffen Herbold, Annette Hautli-Janisz, Ute Heuer, Zlata Kikteva, and Alexander Trautsch finds that essays generated by ChatGPT are rated higher by expert teacher raters than argumentative essays written by human high-school students. The findings document systematic differences in writing style and point to a pressing need for educators to adapt their teaching methods to incorporate AI tools effectively.
~3 min • Beginner • English
Introduction
The paper examines the societal and educational impact of generative AI by providing the first independent, systematic comparison of AI-generated and human-written argumentative high-school essays. Argumentative essay writing is central to secondary education: students must adopt and defend positions on controversial topics. The study addresses the need for scientifically rigorous evaluation beyond anecdotal reports and vendor-run benchmarks. It evaluates ChatGPT (GPT-3.5 and GPT-4, accessed via the ChatGPT web interface) using a fine-grained scoring rubric covering content and language mastery and a large pool of expert raters (secondary school teachers). The research questions are:
- RQ1: How good is ChatGPT, based on GPT-3.5 and GPT-4, at writing argumentative student essays?
- RQ2: How do AI-generated essays compare to essays written by students?
- RQ3: Which linguistic devices are characteristic of student versus AI-generated content?
The motivation is to understand the capabilities and limits of generative AI for core educational tasks and to inform pedagogical responses.
Literature Review
The related work covers three strands:
- Natural language generation: ChatGPT (based on InstructGPT/GPT-3.5) and GPT-4 are fine-tuned with human feedback and instruction tuning (FLAN-style approaches), which sets them apart from purely self-supervised models such as GPT-3 and GPT-J; public instruction-tuned models (e.g., Alpaca, Dolly) replicate aspects of this pipeline. Studies report mixed linguistic competencies for ChatGPT, with strengths in acceptability judgments and certain structural language tasks but weaknesses in low-frequency constructions and complex embeddings.
- Limitations of OpenAI's own evaluations: Potential conflicts of interest, limited transparency (especially for free-text scoring in AP/GRE tasks), unclear rater agreement, few-shot prompting that may influence outputs, and possible data contamination undermine generalizability, particularly for essay-writing ability.
- Independent scientific assessments: Early studies (mostly preprints) report that ChatGPT-3 can produce high-quality essays in specific domains (e.g., physics) with variability akin to humans; humans can often distinguish AI from human answers but still find AI outputs helpful. Other evaluations reveal limits in course-level performance (e.g., physics coursework), medical exams (near passing), mathematics (below the average student level on graduate tasks), and arithmetic (stronger than many language models but still error-prone), and raise concerns about benchmark contamination through memorization. The linguistic fingerprinting literature shows that computational features can distinguish authors and proficiency levels and support tasks such as readability assessment, cohesion analysis, grading, and user identification, motivating the analysis of stylistic differences between AI and human essays.
Methodology
Data: Ninety argumentative essay topics were taken from Stab and Gurevych's corpus (sourced from Essay Forum), typically at high-school level (grades 11–13). Each topic includes one human-written student essay by a non-native speaker (on average 19 sentences and 388 tokens). For each topic, two AI essays were generated on 22 March 2023 via the ChatGPT web interface: one with GPT-3.5 (ChatGPT-3) and one with GPT-4. The prompt was "Write an essay with about 200 words on '[topic]'", used zero-shot with no additional examples or templates; the shorter word target was chosen to avoid length-based advantages. The final dataset comprises 270 essays (90 human, 90 GPT-3.5, 90 GPT-4). Average lengths (words/essay; sentences/essay; words/sentence): students 339.13; 18.98; 18.60; ChatGPT-3.5 247.96; 12.40; 20.31; ChatGPT-4 253.70; 13.08; 19.57.
Annotation study: 139 secondary school teachers with diverse subject backgrounds (languages, religion/ethics, sciences) registered for an online training; 111 completed the questionnaire. The procedure consisted of a 45-minute lecture on large language models followed by instructions for a 30-minute rating task. Participation was voluntary and anonymous, with consent obtained through participation. In a within-subjects design, each rater assessed six randomly selected essays, and submissions were saved after each essay. The rubric comprised seven criteria rated on a seven-point Likert scale: topic and completeness; logic and composition; expressiveness and comprehensibility; language mastery; complexity; vocabulary and text linking; and language constructs. Raters also reported their self-assessed English proficiency and their confidence in their ratings.
Linguistic features: The following features were computed per essay to analyze stylistic differences: lexical diversity (a validated MTLD-style measure), syntactic complexity (parse tree depth and the number of clauses derived from the dependency parse), nominalizations, modals, epistemic markers (e.g., "I think," "it is believed," "in my opinion," taken from a dialogue argumentation corpus), and discourse markers (the Penn Discourse Treebank list, adjusted to remove ambiguous items). spaCy was used for NLP processing.
Statistical analysis: Inter-rater reliability was measured with Cronbach's alpha per criterion and source. Differences across sources (human vs. GPT-3.5 vs. GPT-4) were tested with two-sided Wilcoxon tests under Holm–Bonferroni correction (family-wise error rate 0.05), and effect sizes were reported as Cohen's d. Means with 95% confidence intervals were estimated via non-parametric bootstrap, and distributions were visualized with violin plots. Pearson correlations were assessed (a) between raters' self-assessed English proficiency or confidence and their ratings, and (b) between linguistic features and the corresponding criteria (e.g., complexity vs. syntactic complexity and nominalizations; logic vs. discourse markers; vocabulary vs. lexical diversity), with two-sided significance tests and Holm–Bonferroni correction for multiple testing. The analysis was implemented in Python (pandas 1.5.3, numpy 1.24.2, pingouin 0.5.3, scipy 1.10.1, seaborn 0.12.2).
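To make the feature computation concrete, below is a minimal sketch of how such per-essay linguistic features could be derived with spaCy. It is not the authors' code: the marker lists are abbreviated placeholders, the nominalization heuristic (suffix matching), clause count (clausal dependency labels), and type-token ratio (as a stand-in for the validated MTLD measure) are common approximations, and all function and variable names are illustrative.

```python
# Sketch: per-essay linguistic features, roughly following the paper's feature set.
# Assumes `pip install spacy` and `python -m spacy download en_core_web_sm`.
import spacy

nlp = spacy.load("en_core_web_sm")

MODALS = {"can", "could", "may", "might", "must", "shall", "should", "will", "would"}
EPISTEMIC_MARKERS = ["i think", "in my opinion", "it is believed"]       # abbreviated list
DISCOURSE_MARKERS = {"however", "therefore", "moreover", "for example"}  # abbreviated PDTB-style list
NOMINAL_SUFFIXES = ("tion", "ment", "ness", "ity")                       # heuristic for nominalizations
CLAUSE_DEPS = {"advcl", "ccomp", "xcomp", "acl", "relcl", "csubj"}       # clausal dependency labels

def tree_depth(token):
    """Depth of a token's subtree in the dependency parse."""
    children = list(token.children)
    return 1 + max((tree_depth(c) for c in children), default=0)

def essay_features(text):
    doc = nlp(text)
    words = [t for t in doc if t.is_alpha]
    lowered = text.lower()
    return {
        "n_words": len(words),
        "n_sentences": sum(1 for _ in doc.sents),
        # Type-token ratio as a simple stand-in for the validated lexical-diversity measure.
        "lexical_diversity": len({t.lower_ for t in words}) / max(len(words), 1),
        "max_parse_depth": max((tree_depth(s.root) for s in doc.sents), default=0),
        "n_clauses": sum(1 for t in doc if t.dep_ in CLAUSE_DEPS),
        "n_nominalizations": sum(1 for t in words if t.lower_.endswith(NOMINAL_SUFFIXES)),
        "n_modals": sum(1 for t in doc if t.lower_ in MODALS and t.tag_ == "MD"),
        "n_epistemic": sum(lowered.count(m) for m in EPISTEMIC_MARKERS),
        "n_discourse_markers": sum(1 for t in doc if t.lower_ in DISCOURSE_MARKERS),
    }

if __name__ == "__main__":
    print(essay_features("In my opinion, education should be free. However, funding it is a challenge."))
```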
Key Findings
- Data and reliability: 111 teachers produced 658 ratings across 270 essays (90 topics × 3 sources). Inter-rater agreement was excellent (Cronbach’s α > 0.9) for all criteria except language mastery (α = 0.89), which was still good.
- Overall performance ordering: Across all seven criteria, mean scores followed a consistent order—Students < ChatGPT-3.5 < ChatGPT-4. Standard deviations were similar across groups (slightly >1), indicating consistent spread.
- Mean ratings (Table 2), reported as humans vs. ChatGPT-3.5 vs. ChatGPT-4:
• Topic and completeness: 3.58 vs. 4.24 vs. 4.54
• Logic and composition: 3.64 vs. 4.29 vs. 4.64
• Expressiveness and comprehensibility: 3.42 vs. 3.90 vs. 4.23
• Language mastery: 3.90 vs. 5.03 vs. 5.25
• Complexity: 3.72 vs. 4.20 vs. 4.60
• Vocabulary and text linking: 3.78 vs. 4.41 vs. 4.81
• Language constructs: 3.80 vs. 4.47 vs. 4.73
• Overall: 3.69 vs. 4.36 vs. 4.68 (Likert 1–7)
- Statistical significance and effect sizes (Table 4; an illustrative code sketch of this testing procedure follows the list):
• Humans vs. ChatGPT-3.5: ChatGPT-3.5 was rated significantly higher on all criteria; Cohen's d ranged from 0.52 to 1.15 (medium to large). The smallest differences were in expressiveness and complexity, the largest in language mastery.
• Humans vs. ChatGPT-4: ChatGPT-4 was rated significantly higher on all criteria; d ranged from 0.88 to 1.43 (large to very large). Again, the smallest differences were in expressiveness and complexity, the largest in language mastery.
• ChatGPT-3.5 vs. ChatGPT-4: ChatGPT-4 was rated significantly higher for logic and composition (d ≈ 0.45), complexity (d ≈ 0.48), and vocabulary and text linking (d ≈ 0.50). The remaining criteria showed higher means for ChatGPT-4, but the differences were not statistically significant after correction.
- Linguistic characteristics (Table 3):
• Humans vs. AI: AI essays exhibited more nominalizations and higher sentence complexity (clauses), while human essays used more modals and epistemic markers. Lexical diversity: Humans > GPT-3.5 but < GPT-4. Discourse markers: no significant difference between humans and GPT-3.5; GPT-4 used significantly fewer discourse markers.
• Significant pairwise differences were observed for most linguistic features with effect sizes ranging from small to very large (d ≈ 0.39–1.93). Syntactic complexity (depth) showed no significant difference for Humans vs. GPT-3.5 and GPT-3.5 vs. GPT-4; discourse markers showed no significant difference for Humans vs. GPT-3.5.
- Correlations between features and ratings: Positive but weak correlations between complexity ratings and syntactic complexity (r = 0.16 depth; r = 0.19 clauses) and nominalizations (r = 0.22). Vocabulary ratings correlated weakly with lexical diversity (r = 0.12). Logic ratings were weakly negatively correlated with discourse markers (r = −0.14), contrary to intuitive expectations but aligning with prior work. All correlations significant but small in magnitude.
- Dataset characteristics (Table 1): Average lengths—Students: 339.13 words, 18.98 sentences, 18.60 words/sentence; ChatGPT-3.5: 247.96 words, 12.40 sentences, 20.31 words/sentence; ChatGPT-4: 253.70 words, 13.08 sentences, 19.57 words/sentence.
- Qualitative observations: AI essays often showed rigid structural patterns (e.g., “In conclusion, …” endings, formulaic openings), suggesting standardized organization distinct from human essays.
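As a companion to these comparisons, the sketch below shows how pairwise tests of rating distributions could be run with the packages named in the Methodology (scipy, pingouin). It is illustrative only, not the authors' code: the ratings are randomly generated stand-ins, the column names are hypothetical, and the choice of the unpaired rank-sum variant (Mann-Whitney U) of the two-sided Wilcoxon test is an assumption; a paired design over topics would use scipy.stats.wilcoxon instead.

```python
# Sketch: pairwise Wilcoxon-type tests with Holm-Bonferroni correction and Cohen's d,
# mirroring the statistical analysis described in the Methodology. Data are made up.
import numpy as np
import pandas as pd
import pingouin as pg
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical per-essay ratings (1-7 Likert) for one criterion, 90 essays per source.
ratings = pd.DataFrame({
    "human": rng.normal(3.7, 1.1, 90).clip(1, 7),
    "gpt35": rng.normal(4.4, 1.1, 90).clip(1, 7),
    "gpt4":  rng.normal(4.7, 1.1, 90).clip(1, 7),
})

pairs = [("human", "gpt35"), ("human", "gpt4"), ("gpt35", "gpt4")]

# Two-sided rank-sum tests for each pair of sources.
pvals = []
for a, b in pairs:
    stat, p = stats.mannwhitneyu(ratings[a], ratings[b], alternative="two-sided")
    pvals.append(p)

# Holm-Bonferroni correction at a family-wise error rate of 0.05.
reject, p_corrected = pg.multicomp(pvals, alpha=0.05, method="holm")

for (a, b), p_raw, p_adj, sig in zip(pairs, pvals, p_corrected, reject):
    d = pg.compute_effsize(ratings[a], ratings[b], eftype="cohen")
    print(f"{a} vs {b}: p={p_raw:.4f} (Holm-adjusted {p_adj:.4f}, "
          f"significant={sig}), Cohen's d={d:.2f}")
```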
Discussion
The study provides clear evidence for RQ1 and RQ2: ChatGPT (GPT-3.5 and GPT-4) writes argumentative essays that human teachers rate higher than human-written student essays across all assessed criteria. GPT-4 exhibits additional gains in logical organization, vocabulary/text linking, and complexity relative to GPT-3.5. For RQ3, the linguistic fingerprint of AI differs from that of human writers: AI uses more nominalizations and shows higher sentence complexity, while humans use more modals and epistemic markers; GPT-4 employs fewer discourse markers than both humans and GPT-3.5. The negative association between discourse markers and logical ratings may reflect that AI achieves coherence through paragraph structuring rather than explicit connective usage. Hallucinations did not materially affect performance in this argumentative genre because topics emphasized reasoning over factual recall; still, subtle hallucinations cannot be fully ruled out. Potential data contamination (topics present in training data) is deemed less problematic here, as tasks rely on general conceptual reasoning rather than memorization of specific solutions. Overall, the findings suggest that generative AI can reliably produce high-quality argumentative essays, posing challenges and opportunities for educational assessment and instruction.
Conclusion
AI models, including ChatGPT-3.5 and especially GPT-4, can produce argumentative essays that receive higher quality ratings than human student essays from expert teacher raters. This compels a rethinking of homework practices and assessment: educators should integrate AI tools as supports—analogous to calculators—after foundational skills are established, emphasizing critical evaluation of AI-generated reasoning and content. The paper contributes an independent, large-scale, statistically rigorous evaluation of AI versus human essays and documents distinct linguistic profiles between them (e.g., reduced discourse markers in GPT-4). Future work should: (1) conduct more systematic analyses of argumentative structures and reasoning quality, (2) examine how exposure to AI-generated style may influence human discourse marker usage, (3) extend evaluations to additional models and native-speaker populations, and (4) track how rapid model iterations affect essay quality and style over time.
Limitations
- Participant and author language background: Both the student essay authors and the teacher raters were non-native English speakers, which limits generalizability, particularly for language-related criteria. Although the student essays likely benefited from native-speaker feedback on the forum, the AI-human gap might be narrower for native-speaker writers.
- External validity: Topics sourced from Essay Forum may not represent the full distribution of essay prompts. Results are specific to high-school-level argumentative essays and may differ for advanced writers/expert domains.
- Model scope: Only ChatGPT-3.5 and GPT-4 (as of March 22, 2023) via the web interface were tested; results may not generalize to other models (e.g., Bard, Alpaca, Dolly) or to updated versions of ChatGPT.
- Evolving systems: Rapid changes to deployed models mean current versions may differ from those evaluated; lack of transparency about updates limits reproducibility over time.
- Potential data contamination: Some topics may have appeared in training data; while less critical for general argumentative tasks, contamination cannot be fully ruled out.
- Hallucinations: Although not prominent for these tasks, subtle hallucinations may have been overlooked in ratings.