AI, write an essay for me: A large-scale comparison of human-written versus ChatGPT-generated essays

Computer Science

S. Herbold, A. Hautli-Janisz, et al.

This large-scale study by Herbold, Hautli-Janisz, Heuer, Kikteva, and Trautsch reveals that AI tools like ChatGPT can generate argumentative essays of higher quality than their human counterparts. Explore the intriguing linguistic characteristics found in AI writing and how educators can adapt their methods to harness these powerful technologies.

Introduction
The paper addresses the rapid emergence of large-scale natural language generation systems, notably ChatGPT, and their anticipated impact on education and knowledge transfer. Motivated by calls for independent, rigorous evaluation beyond vendor-reported benchmarks, the study investigates the quality of AI-generated argumentative essays compared to human-written essays typical of high-school contexts. The authors formulate three research questions: (RQ1) How good is ChatGPT (GPT-3.5 and GPT-4) at writing argumentative student essays? (RQ2) How do AI-generated essays compare to essays written by humans? (RQ3) What linguistic devices are characteristic of human versus AI-generated content? The purpose is to provide an independent, systematic, and statistically rigorous assessment of essay quality and linguistic characteristics, informing educators and stakeholders about the capabilities and implications of generative AI in educational settings.
Literature Review
The related work surveys foundation models (GPT-3, GPT-J) and instruction-tuned models (InstructGPT/ChatGPT, FLAN family) and notes the lack of transparency regarding GPT-4 training. Prior evaluations show mixed evidence: some linguistic competencies align with human usage (ambiguity resolution, thematic structure), while others reveal weaknesses in low-frequency grammatical constructions and reasoning. Studies comparing human and ChatGPT outputs (e.g., Turing-test style assessments) suggest humans can often detect AI text, yet AI responses are sometimes rated more helpful. Evaluations on academic benchmarks (AP exams, USMLE, mathematics, arithmetic) indicate variable performance, with concerns about data contamination and methodological rigour in proprietary benchmarks. Closest to this work, small-scale essay studies (e.g., physics essays) find ChatGPT can produce high-quality essays but with variability and limited sample sizes precluding strong statistical conclusions. The review also covers linguistic fingerprinting and its applications in authorship identification, educational assessment (readability, cohesion), and automated essay scoring, motivating the linguistic feature analyses employed in this study.
Methodology
Design: Within-subjects study in which each rater (a secondary-school teacher) evaluated six essays blind to authorship (human vs. ChatGPT-3.5 vs. ChatGPT-4).
Data: 90 argumentative topics sourced from Stab and Gurevych (2014), based on essayforum.com (non-native high-school students seeking feedback). For each topic, three essays were available: one human-written student essay and two AI-generated essays (ChatGPT-3.5 and ChatGPT-4, as of March 22, 2023). ChatGPT essays were produced via the browser interface using a zero-shot prompt: “Write an essay with about 200 words on ‘[topic]’.” No few-shot or template guidance was used. Dataset statistics (words per essay / sentences per essay / words per sentence): Human 339.13 / 18.98 / 18.60; ChatGPT-3.5 247.96 / 12.40 / 20.31; ChatGPT-4 253.70 / 13.08 / 19.57.
Raters: 111 German secondary-school teachers (predominantly from grammar schools) who attended a 2-hour online training on ChatGPT. After an initial 45-minute lecture, teachers completed the questionnaire (about 30 minutes) and then continued the training. Participation was voluntary and did not affect certification.
Questionnaire and scoring: Raters self-assessed their English proficiency (CEFR-based) and their confidence for each essay. Each essay was rated on seven criteria using 7-point Likert scales (0–6), following guidelines from the Ministry for Education of Lower Saxony: Topic and completeness; Logic and composition; Expressiveness and comprehensibility; Language mastery; Complexity; Vocabulary and text linking; Language constructs. After rating all six essays, raters saw a summary that included authorship information.
Computational linguistic analysis: The authors computed essay-level linguistic features: lexical diversity (MTLD); syntactic complexity via (a) maximum dependency tree depth and (b) counts of specific clause-related dependency relations; nominalizations (counts of nouns with derivational suffixes such as -ion, -ment, -ance); modals (POS-based) and epistemic markers (lexicon-based, e.g., “I think,” “in my opinion”); and discourse markers (based on the PDTB list, pruned to exclude ambiguous items such as “like,” “for,” “in”).
Statistical analysis: Inter-rater reliability was assessed via Cronbach’s α per criterion and source. Group differences were tested using Wilcoxon tests with Holm–Bonferroni correction (family-wise α=0.05); effect sizes were computed as Cohen’s d. Means and 95% confidence intervals were estimated with the bootstrap; distributions were visualized via violin plots. Potential confounds (rater English proficiency, rating confidence) were examined using Pearson’s r correlations with ratings, as were associations between linguistic features and related rating criteria (e.g., complexity with syntactic complexity/nominalizations; vocabulary with lexical diversity; logic with discourse markers). Implementation used Python (pandas, numpy, pingouin, scipy, seaborn); a minimal sketch of this pipeline follows at the end of this section.
Replication: Data and analysis code are publicly available (https://github.com/sherbold/chatgpt-student-essay-study).
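To illustrate the statistical workflow described above, here is a minimal sketch using the libraries the authors name (pandas, pingouin, scipy). The column names (topic, source, criterion, rating_slot, score, confidence) and the file ratings.csv are hypothetical placeholders; the authors' exact data layout and test variants are in their replication package.

```python
# Minimal sketch of the statistical workflow; hypothetical column names, not the authors' exact code.
import numpy as np
import pandas as pd
import pingouin as pg
from scipy import stats

# Hypothetical long-format table: one row per (essay, source, criterion, rating slot).
ratings = pd.read_csv("ratings.csv")

CRITERIA = ["Topic and completeness", "Logic and composition",
            "Expressiveness and comprehensibility", "Language mastery",
            "Complexity", "Vocabulary and text linking", "Language constructs"]

# 1) Inter-rater reliability: Cronbach's alpha per criterion and source,
#    with essays as rows and rating slots (1st, 2nd, ... rating of an essay) as columns.
for (crit, src), sub in ratings.groupby(["criterion", "source"]):
    wide = sub.pivot_table(index="topic", columns="rating_slot", values="score")
    alpha, ci = pg.cronbach_alpha(data=wide.dropna())  # dropna: keep essays with all slots filled
    print(f"alpha({crit}, {src}) = {alpha:.2f}")

# 2) Pairwise comparison per criterion (here: Human vs. ChatGPT-4):
#    Wilcoxon rank-sum (Mann-Whitney U) tests, Holm correction, Cohen's d effect sizes.
pvals = []
for crit in CRITERIA:
    human = ratings.query("criterion == @crit and source == 'Human'")["score"]
    gpt4 = ratings.query("criterion == @crit and source == 'ChatGPT-4'")["score"]
    stat, p = stats.mannwhitneyu(human, gpt4)
    d = pg.compute_effsize(gpt4, human, eftype="cohen")
    pvals.append(p)
    print(f"{crit}: U = {stat:.1f}, raw p = {p:.4f}, Cohen's d = {d:.2f}")

reject, p_adj = pg.multicomp(pvals, alpha=0.05, method="holm")  # family-wise correction
for crit, p, sig in zip(CRITERIA, p_adj, reject):
    print(f"{crit}: adjusted p = {p:.4f}, significant = {sig}")

# 3) Bootstrap 95% confidence interval for a mean score.
human_scores = ratings.query("source == 'Human'")["score"].to_numpy()
ci = stats.bootstrap((human_scores,), np.mean, confidence_level=0.95).confidence_interval
print(f"Human mean = {human_scores.mean():.2f}, 95% CI = [{ci.low:.2f}, {ci.high:.2f}]")

# 4) Confound check: Pearson correlation between rater confidence and assigned score.
r, p = stats.pearsonr(ratings["confidence"], ratings["score"])
print(f"confidence vs. score: r = {r:.2f}, p = {p:.3f}")
```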
Key Findings
- Sample and reliability: 111 teachers participated, yielding 658 ratings across 270 essays (90 topics × 3 sources). Most essays received 2–3 independent ratings. Inter-rater reliability was excellent across criteria (Cronbach’s α > 0.9) except Language mastery (α = 0.89), indicating robust assessment consistency.
- Overall quality: Clear ordering of mean scores across all criteria: Human < ChatGPT-3.5 < ChatGPT-4. Overall means (M, SD): Human 3.69 (1.26), ChatGPT-3.5 4.36 (1.14), ChatGPT-4 4.68 (1.11) on 0–6 scales.
- Human vs. ChatGPT-3.5: Significant improvements for AI across all criteria with medium to large effect sizes (Cohen’s d ≈ 0.52–1.15). Smallest effects for Expressiveness/comprehensibility and Complexity; largest for Language mastery, indicating fewer language errors in AI outputs.
- Human vs. ChatGPT-4: Larger differences with large to very large effects (d ≈ 0.88–1.43). Again, smallest for Expressiveness/comprehensibility and Complexity; largest for Language mastery. Humans’ mean Language mastery remained moderate (M = 3.90), while the AI models achieved exceptionally high scores (ChatGPT-3.5 M = 5.03; ChatGPT-4 M = 5.25).
- ChatGPT-4 vs. ChatGPT-3.5: ChatGPT-4 had higher mean ratings on all criteria, but significant gains appeared specifically for Logic and composition, Vocabulary and text linking, and Complexity, with small-to-medium effects (d ≈ 0.45–0.50).
- Linguistic characteristics (a feature-extraction sketch follows after this list):
  • Syntactic complexity (clauses) and nominalizations: Higher in AI essays than in human essays; GPT-4 generally highest.
  • Modals and epistemic markers: More frequent in human essays; AI uses fewer, with GPT-4 the lowest (epistemic markers near zero).
  • Lexical diversity (MTLD): Humans > ChatGPT-3.5, but ChatGPT-4 > humans, indicating improved diversity in the newer model.
  • Discourse markers: No significant difference between humans and ChatGPT-3.5; GPT-4 uses significantly fewer discourse markers than both.
  • Syntactic complexity (depth): No significant differences between human vs. ChatGPT-3.5 or between ChatGPT-3.5 vs. ChatGPT-4.
- Correlations (all weak): Complexity ratings correlated positively with syntactic complexity (depth r ≈ 0.16; clauses r ≈ 0.19) and nominalizations (r ≈ 0.22). Vocabulary ratings correlated with lexical diversity (r ≈ 0.12). Logic ratings correlated negatively with discourse markers (r ≈ −0.14), echoing prior findings that overt cohesion markers may not signal higher-quality logic.
- Confounds: Neither rater English proficiency nor self-rated confidence correlated with assigned scores, reducing concerns about these as confounding factors.
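The summary does not name the exact NLP toolchain behind these feature counts. The following is a minimal sketch, assuming spaCy as the parser/tagger, a toy epistemic-marker lexicon, and a simplified one-pass MTLD (the standard measure averages forward and backward passes); it illustrates the kinds of features reported, not the authors' implementation.

```python
# Sketch of essay-level feature extraction; spaCy and the tiny lexicons below are
# illustrative assumptions, not the authors' exact toolchain.
import spacy

nlp = spacy.load("en_core_web_sm")

EPISTEMIC_MARKERS = ["i think", "i believe", "in my opinion"]          # toy lexicon
NOMINAL_SUFFIXES = ("ion", "ions", "ment", "ments", "ance", "ances")   # -ion, -ment, -ance
CLAUSE_DEPS = {"ccomp", "xcomp", "advcl", "acl", "relcl", "csubj"}     # clause-related relations


def dependency_depth(token):
    """Depth of the dependency subtree rooted at `token`."""
    children = list(token.children)
    if not children:
        return 1
    return 1 + max(dependency_depth(child) for child in children)


def mtld_one_pass(tokens, threshold=0.72):
    """Simplified one-directional MTLD; standard MTLD averages a forward and a backward pass."""
    factors, types, count = 0.0, set(), 0
    for tok in tokens:
        count += 1
        types.add(tok)
        if len(types) / count <= threshold:
            factors += 1.0
            types, count = set(), 0
    if count:  # partial factor for the remaining segment
        factors += (1 - len(types) / count) / (1 - threshold)
    return len(tokens) / factors if factors else float("nan")


def essay_features(text):
    doc = nlp(text)
    words = [t.lower_ for t in doc if t.is_alpha]
    lowered = doc.text.lower()
    return {
        "lexical_diversity_mtld": mtld_one_pass(words),
        "max_dep_depth": max(dependency_depth(sent.root) for sent in doc.sents),
        "clause_relations": sum(t.dep_ in CLAUSE_DEPS for t in doc),
        "nominalizations": sum(t.pos_ == "NOUN" and t.lower_.endswith(NOMINAL_SUFFIXES) for t in doc),
        "modals": sum(t.tag_ == "MD" for t in doc),  # Penn Treebank modal tag
        "epistemic_markers": sum(lowered.count(m) for m in EPISTEMIC_MARKERS),
    }


print(essay_features("I think schools should ban homework because it adds little value."))
```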
Discussion
The findings directly address the research questions: (1) ChatGPT produces high-quality argumentative essays, with GPT-4 achieving the highest ratings; (2) AI-generated essays significantly outperform human student essays across all rubric criteria, especially in language mastery; (3) Distinct linguistic fingerprints differentiate AI and human writing. AI essays are more structurally rigid and formal, featuring greater syntactic complexity and nominalizations, and fewer modals/epistemic markers. GPT-4’s higher lexical diversity and reduced use of discourse markers suggest it achieves coherence through paragraph-level organization rather than explicit connectives. These results imply that generative AI can meet and exceed typical high-school essay standards, challenging traditional homework and assessment practices. Hallucinations were not problematic for the argumentative genre studied, as tasks emphasize reasoning over factual recall. Potential data contamination is discussed as unlikely to drive results, given the generic nature of topics and lack of resemblance to the human essay corpus. Overall, the study provides a rigorous, independent benchmark contrasting with less transparent vendor evaluations and highlights evolving linguistic properties between model generations.
Conclusion
The study demonstrates that ChatGPT, especially GPT-4, consistently produces essays rated higher than human student essays across a comprehensive quality rubric, with clear advances in logical structure, complexity, and vocabulary. Given the ease of access to such tools, educational practice must adapt: like calculators in mathematics, AI should be integrated thoughtfully after foundational skills are established. Educators should shift from solely assigning essays to guiding students in critical analysis of AI outputs, reasoning quality, coherence, and correctness. The work contributes an independent, large-scale, statistically sound evaluation of AI-generated essays and their linguistic characteristics, offering insight into model development over time. Future research should examine argumentative structure in more detail, explore generalization to other models, and refine pedagogical strategies leveraging AI for competency-based learning objectives.
Limitations
- Participant and author populations: Neither the essay authors (students) nor the raters (teachers) were native English speakers; however, teacher self-assessed proficiency and rating confidence did not correlate with scores, and inter-rater reliability was high.
- Topic/source representativeness: Human essays were sourced from a single forum and may not represent the broader population of essay topics or student abilities.
- Model scope and configuration: Only ChatGPT-3.5 and GPT-4 were evaluated, via the browser interface, with a single zero-shot prompt (about 200 words). No few-shot prompts, alternative temperatures, or multiple generations per topic were explored (see the replication sketch after this list).
- Generalizability to other models: Results may not transfer to other LLMs (e.g., Bard, Alpaca, Dolly), especially regarding linguistic fingerprints.
- Potential training data contamination: Topics may overlap with model training data; proprietary training details prevent verification. The authors argue the topics are general and the results are unlikely to reflect memorization, but contamination cannot be fully ruled out.
- Measures and correlations: Some expected feature–rating associations were weak; discourse-marker usage correlated negatively with logic ratings, warranting further investigation.
- Genre/task scope: The study focused on short argumentative essays; findings may not extend to other genres (e.g., research reports) or tasks requiring factual precision.
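As a concrete illustration of the configuration limitation above, a replication could vary temperature and draw multiple generations per topic programmatically. The study itself used the browser interface with a single zero-shot prompt, so the sketch below is an assumed extension via the OpenAI Python API; model names and availability change over time, and the example topic is hypothetical.

```python
# Sketch of how a replication might vary sampling settings via the OpenAI API.
# The original study used the ChatGPT browser interface, not this API; model name,
# temperature, and sample count here are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def generate_essays(topic: str, model: str = "gpt-4", temperature: float = 0.7, n: int = 3):
    """Return n independently sampled essays for one topic (the paper generated one per model)."""
    prompt = f"Write an essay with about 200 words on '{topic}'."  # the paper's zero-shot prompt
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        n=n,
    )
    return [choice.message.content for choice in response.choices]


# Illustrative topic, not one of the 90 topics used in the study.
essays = generate_essays("Should students be required to wear school uniforms?")
```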