This Paper Was Written with the Help of ChatGPT: Exploring the Consequences of AI-Driven Academic Writing on Scholarly Practices

Computer Science

Hongming Li, Seiyon Lee, Anthony F. Botelho

Recent advances in large language models are blurring the line between human and machine writing. This paper, authored by Hongming Li, Seiyon Lee, and Anthony F. Botelho, presents a case study on using generative models to support academic writing, explores questions of originality and ethics, and shows how open science practices can help address emerging concerns while offering practical guidance for responsible use.

Introduction
The paper addresses the rise of large language models (LLMs) such as ChatGPT and their integration into scholarly writing, highlighting both opportunities and ethical concerns. The authors acknowledge that GPT-4 Turbo was used to help write and revise portions of the paper, framing the work as a case study on the practical, ethical, and methodological implications of LLM-assisted academic writing and as an example of how open science practices might address related concerns. They note risks of brittleness and undetected errors stemming from black-box models, as well as questions about originality and intellectual contribution. The study examines the potential risks of using AI-detection tools to identify LLM-assisted texts and offers guidance on addressing concerns through open science practices. It poses two research questions: (1) How accurately can current AI content detectors identify AI-generated text from LLMs like GPT-4 Turbo? (2) What adjustments does the academic publishing community need to make to promote transparency and explore best practices in response to the rise of generative AI tools such as ChatGPT? The context includes ongoing debates about AI as co-author and the need for transparent policies in publishing.
Literature Review
Background situates the emergence of generative AI and chatbots, from early systems like ALICE to modern assistants (Siri, Watson, Google Assistant) and the paradigm shift brought by GPT-3’s scale and capabilities. Generative models support creative and academic writing and debugging, and can boost efficiency in scientific workflows (e.g., literature reviews, drafting, editing, summarization, and elaboration of technical language). In education, AI is used to optimize learning experiences and support educators. Ethical debates focus on accountability and authorship; many venues require disclosure of AI use and deny authorship to AI. There are calls for codes of practice that ensure ethical standards and human oversight, viewing AI as an augmentative tool. Detecting AI-generated text is challenging: traditional plagiarism tools (e.g., Turnitin) rely on text matching, which fails with generative AI. New AI detectors have emerged with varied approaches, but efforts are fragmented and often rely on human oversight. The literature also points to risks of bias in detectors (e.g., against non-native English writers) and questions the reliability of AI-text detection given models’ increasingly human-like outputs.
Methodology
The study evaluates how well current AI content detectors identify AI-generated or AI-revised abstracts within the Educational Data Mining (EDM) and Learning Analytics and Knowledge (LAK) communities.
Data sources: titles and abstracts from LAK22 (123 works) and EDM2022 (118 works). These proceedings predate ChatGPT’s public release (November 30, 2022), increasing the likelihood that the abstracts were human-authored and reducing the chance that these texts appeared in GPT training data at the time of analysis.
Dataset generation: using GPT-4 Turbo via ChatGPT, five content categories were created:
- H: original human-authored abstracts;
- GPT: new abstracts generated from titles alone, using a prompt to write a 250-word abstract for the relevant venue;
- GPTR: GPT-4 Turbo rewrites/polishes of the human abstracts, using a prompt to act as a professional researcher and return a 250-word revised abstract;
- GPT/H: 50% GPT content followed by 50% human content;
- H/GPT: 50% human content followed by 50% GPT content.
Detector selection: five detectors were chosen based on Google ranking, usability, and API availability, with preference for free or partially free tools: ContentDetector.AI (free, v2 model), ZeroGPT (free/open source), GPTZero (partially free, academic integrations), Originality.ai (subscription, claims high accuracy), and Winston.ai (trial, then subscription).
Evaluation: detectors output probabilistic scores in [0, 1], with higher values indicating the text is more likely AI-authored. Metrics are RMSE (between detector scores and ground truth) and AUC. Ground truth: H = 0, GPT = 1, GPTR = 1. For mixed content (GPT/H, H/GPT), two formulations were analyzed: (a) ground truth 0.5, reflecting the 50/50 composition (with a correctness band of 0.4–0.6), and (b) ground truth 1, treating any AI presence as AI-generated. AUC analyses: (i) GPT>H (AI vs. human); (ii) GPT>GPTR (pure AI vs. AI-revised human); (iii) hierarchical GPT>GPTR>H. An additional analysis for mixed content examined how RMSE varies as the assumed ground truth is swept from 0 to 1, with 95% confidence intervals, to infer where detectors collectively minimize error.
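To make the evaluation concrete, the following is a minimal Python sketch of the two metrics described above. It is not the authors' code: the detector scores and category sizes are hypothetical placeholders, and only the H, GPT, and GPTR categories are shown.

```python
# Minimal sketch of the evaluation metrics (RMSE and AUC), not the authors' code.
# Detector scores are assumed to be probabilities in [0, 1]; values are hypothetical.
import numpy as np
from sklearn.metrics import roc_auc_score


def rmse(scores, truth):
    """Root mean squared error between detector scores and an assumed ground truth."""
    scores = np.asarray(scores, dtype=float)
    return float(np.sqrt(np.mean((scores - truth) ** 2)))


# Hypothetical detector outputs for each content category.
scores = {
    "H":    [0.05, 0.10, 0.02],   # human-authored abstracts, ground truth 0
    "GPT":  [0.97, 0.90, 0.99],   # fully GPT-4 Turbo generated, ground truth 1
    "GPTR": [0.85, 0.70, 0.92],   # GPT-revised human abstracts, ground truth 1
}

print("RMSE H:   ", rmse(scores["H"], 0.0))
print("RMSE GPT: ", rmse(scores["GPT"], 1.0))
print("RMSE GPTR:", rmse(scores["GPTR"], 1.0))

# AUC for the GPT>H comparison: does the detector rank AI-generated
# abstracts above human-authored ones?
y_true = [1] * len(scores["GPT"]) + [0] * len(scores["H"])
y_score = scores["GPT"] + scores["H"]
print("AUC GPT>H:", roc_auc_score(y_true, y_score))
```

The same pattern extends to the GPT>GPTR and hierarchical comparisons by changing which categories supply the positive and negative classes.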
Key Findings
- Mean prediction scores (0–1; higher indicates AI-likely): Originality.ai strongly separates classes. For LAK: H=0.079, GPT=0.955, GPTR=0.941, GPT/H=0.502, H/GPT=0.623. For EDM: H=0.119, GPT=0.975, GPTR=0.993, GPT/H=0.609, H/GPT=0.566. Winston.ai also shows high GPT means (LAK GPT=0.820; EDM GPT=0.759). ContentDetector.AI shows modest separation; ZeroGPT assigns low scores broadly (e.g., EDM GPT=0.003). GPTZero yields mid-range values with a slight bias toward AI.
- RMSE (lower is better): Originality.ai exhibits low RMSE for GPT content (LAK 0.172; EDM 0.101) and relatively low RMSE for H (LAK 0.205; EDM 0.252). Winston.ai shows low RMSE for H (LAK 0.190; EDM 0.254). GPTZero achieves very low RMSE on mixed content under the 0.5 ground truth (e.g., LAK GPT/H=0.048; LAK H/GPT=0.012) but markedly higher error (≈0.5 or more) when mixed content is treated as fully AI. ZeroGPT performs poorly on GPT/GPTR (RMSE ≈0.99).
- AUC: For GPT>H, Winston.ai is near-perfect to perfect (LAK 1.000; EDM 0.991), and Originality.ai is strong (LAK 0.902; EDM 1.000). For GPT>GPTR, GPTZero excels on LAK (0.928) and remains strong on EDM (0.861), while Originality.ai is lower (LAK 0.698; EDM 0.581). Hierarchical GPT>GPTR>H mean AUC is highest for Winston.ai (LAK 0.667; EDM 0.690); Originality.ai is lower (LAK 0.537; EDM 0.280).
- Mixed-content bias: Continuous RMSE analysis over assumed ground truths reveals minima near 0.36 (LAK) and 0.40 (EDM), indicating a systematic tendency across detectors to classify 50/50 mixed texts as more human than balanced (a sweep of this kind is sketched after this list). Table 5 corroborates that most detectors have lower RMSE when mixed content is treated as closer to 0.5 than to 1, with significant performance degradation when the ground truth is forced to 1.
- Overall: Subscription-based detectors (Originality.ai, Winston.ai) generally outperform free tools at identifying AI text, with Originality.ai particularly strong at flagging AI-generated content and Winston.ai highly discriminative for GPT vs. H. Free tools (ContentDetector.AI, ZeroGPT) are better at identifying human text but weak at AI detection. GPTZero shows balanced behavior, with notable strength in differentiating GPT from GPTR.
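The mixed-content sweep referenced above can be illustrated with a short sketch: the assumed ground truth is varied from 0 to 1, and the value that minimizes RMSE over the pooled detector scores is reported. The scores below are hypothetical, and the paper's 95% confidence intervals are omitted here.

```python
# Illustrative sketch of the mixed-content ground-truth sweep; scores are hypothetical.
import numpy as np

# Pooled detector scores for 50/50 mixed abstracts (GPT/H and H/GPT), hypothetical.
mixed_scores = np.array([0.30, 0.45, 0.25, 0.50, 0.38, 0.42])

# Candidate ground truths from 0 to 1 in steps of 0.01.
candidate_truths = np.linspace(0.0, 1.0, 101)

# RMSE of the pooled scores against each candidate ground truth.
rmse_curve = np.sqrt(
    np.mean((mixed_scores[None, :] - candidate_truths[:, None]) ** 2, axis=1)
)

best = candidate_truths[np.argmin(rmse_curve)]
print(f"RMSE-minimizing assumed ground truth: {best:.2f}")
```

Because squared error is minimized at the mean of the scores, a minimum below 0.5 directly reflects detectors scoring 50/50 mixed abstracts closer to the human end of the scale, mirroring the bias reported above.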
Discussion
Findings indicate a performance gap between subscription-based and free AI detectors. Originality.ai and Winston.ai demonstrate strong ability to detect AI-generated content and discriminate GPT vs human, whereas ContentDetector.AI and ZeroGPT are better at recognizing human-authored text but are unreliable for detecting AI-generated text. A key challenge identified is mixed content: detectors tend to underestimate the AI proportion in 50/50 mixes, leaning toward classifying such text as mostly human. This bias is reflected in RMSE minima near ground truths below 0.5 for mixed content and suggests current tools are not well calibrated for partial AI involvement—an increasingly common real-world scenario. Practically, these findings caution against relying solely on AI detection to police LLM use in academic contexts. The work underscores the need for improved detector sensitivity and calibration for mixed authorship and emphasizes transparency and open science practices to address ethical and methodological concerns surrounding LLM-assisted writing. The results directly address the research questions by quantifying current detectors’ accuracy and highlighting the adjustments needed in academic publishing—namely, better guidance on disclosure, nuanced evaluations of AI involvement, and skepticism toward detector outputs used in isolation.
Conclusion
This study contributes a case-based, quantitative evaluation of popular AI content detectors on a controlled dataset derived from LAK22 and EDM2022 titles and abstracts, including fully human, fully GPT-4 Turbo generated, GPT-revised, and mixed composites. It demonstrates that premium detectors can effectively flag fully AI-generated text, while widely used free tools often underperform on AI detection and mixed cases. Crucially, detectors show a systematic bias to classify evenly mixed content as predominantly human. These insights inform publishers, reviewers, and authors on the current limits of AI detection and the importance of transparent disclosure and open science practices. Future research should (1) evaluate a broader set of detectors and modeling approaches over time to track rapid tool evolution; (2) extend beyond abstracts and beyond two conferences to diverse disciplines and text genres; and (3) develop more nuanced definitions and benchmarks for ground truth in mixed-authorship scenarios to better calibrate and assess detectors.
Limitations
The detector set is not exhaustive and reflects a snapshot in a rapidly evolving landscape. Analyses are limited to abstracts from two conferences (LAK22 and EDM2022), which may constrain generalizability across domains and text types. The style and implementation of AI-generated or AI-assisted text vary widely, potentially affecting detector performance. The assumption that human-generated text is an ideal standard is increasingly tenuous as AI assistance becomes commonplace. Ground truth definitions for mixed content are inherently uncertain, necessitating more nuanced approaches.