This Paper Was Written with the Help of ChatGPT: Exploring the Consequences of AI-Driven Academic Writing on Scholarly Practices

Computer Science

H. Li, S. Lee, et al.

This paper explores how large language models reshape academic writing, examining originality, ethical use, and open science remedies, and is itself presented as a case study in generative model assistance. Research conducted by Hongming Li, Seiyon Lee, and Anthony F. Botelho. Listen to the audio for practical guidance on ethical, effective LLM use in scholarly writing.

Introduction
The paper situates the rise of large language models (LLMs) such as ChatGPT within academic writing, noting both opportunities and ethical concerns, including the risk of errors from black-box systems and questions of originality and intellectual contribution. Acknowledging that GPT-4 Turbo was used to revise portions of the paper itself, the authors focus on realistic adoption scenarios rather than prescriptive policies. The study examines the risks of relying on AI-detection tools to police AI-assisted writing and explores how open science practices can mitigate concerns around transparency and ethics. Research questions: (1) How accurately can current AI content detectors identify AI-generated text from LLMs such as GPT-4 Turbo? (2) What adjustments does the academic publishing community need to make to promote transparency and best practices in response to generative AI tools?
Literature Review
Background covers the evolution of chatbots and LLMs, from ALICE to GPT-3 and GPT-4, and their documented capabilities across creative and technical tasks. Generative AI is increasingly used in research workflows (literature review, drafting, editing, summarization, technical language) and in education to improve efficiency for learners and educators. Authorship and accountability debates have led venues (e.g., AIED, ACL) to require disclosure of AI usage and to reject AI as an author because it cannot be held accountable. The push is toward a code of practice in which AI augments human work under human oversight. Detecting AI-generated text is challenging because generative models increasingly mimic human language, and traditional plagiarism tools based on text matching (e.g., Turnitin) are inadequate. Multiple detection approaches exist but remain fragmented, so human oversight stays important. Prior work has scrutinized detector effectiveness and biases, including potential disadvantages for non-native English writers.
Methodology
Data sources: Titles and abstracts from LAK22 (123 works) and EDM2022 (118 works). These venues' submission deadlines in 2021–2022 predate ChatGPT's November 30, 2022 public release, increasing the likelihood that the abstracts are human-authored and were not in the training data of the models tested. The focus is on text only (titles and abstracts) to avoid multimodal confounds.

Dataset generation: Using GPT-4 Turbo via ChatGPT, the authors created five content categories per paper: (H) human-authored abstracts from the proceedings; (GPT) abstracts generated solely from titles using the prompt "With the following paper title, write a 250-word abstract for a {learning analytics | educational data mining} research article: {paper_title}"; (GPTR) revised abstracts in which ChatGPT rewrites the original human abstract using the prompt "I would like you to be a professional educational data mining researcher. Based on the following paper title and abstract, please help me polish the abstract and rewrite it into a 250-word abstract for this research paper. Please only return the revised abstract. Article Title: {paper_title} Abstract: {paper_abstract}"; (GPT/H) mixed abstracts containing 50% GPT text followed by 50% human text; and (H/GPT) mixed abstracts with 50% human text followed by 50% GPT text.

Detectors: Five widely used detectors were selected by search ranking, usability, and API availability, with attention to free or low-cost access: ContentDetector.AI (v2), ZeroGPT (open source), GPTZero (academic use, partially free, API/Canvas integration), Originality.ai (subscription, publisher focus), and Winston.ai (trial credits, subscription thereafter).

Evaluation: Detectors produce probabilistic scores in [0, 1] indicating the likelihood that text is AI-generated. Metrics include RMSE (error against ground truth) and AUC (discriminative ability across thresholds). Ground truth is set to 0 for H and 1 for GPT and GPTR. For mixed content (GPT/H, H/GPT), analyses considered ground truths of 0.5 (reflecting equal composition, with predictions within [0.4, 0.6] treated as correct) and 1 (any AI presence treated as AI). Additional continuous analyses varied the ground truth from 0 to 1 to study RMSE behavior, and mean prediction scores were also examined for bias across content types.
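A minimal sketch of the dataset-generation step, for illustration only: it assumes the OpenAI Python SDK (openai>=1.0) and the model name "gpt-4-turbo", whereas the authors describe working with GPT-4 Turbo through the ChatGPT interface, and the exact rule for splitting the mixed (GPT/H, H/GPT) abstracts is an assumption.

```python
# Sketch of the GPT, GPTR, and mixed abstract conditions described above.
# Assumes the OpenAI Python SDK (openai>=1.0) and the model name "gpt-4-turbo";
# the authors used ChatGPT directly, so these calls are illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def generate_abstract(paper_title: str, field: str = "learning analytics") -> str:
    """GPT condition: abstract written from the title alone."""
    prompt = (
        f"With the following paper title, write a 250-word abstract for a "
        f"{field} research article: {paper_title}"
    )
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


def revise_abstract(paper_title: str, paper_abstract: str) -> str:
    """GPTR condition: the original human abstract rewritten by the model."""
    prompt = (
        "I would like you to be a professional educational data mining researcher. "
        "Based on the following paper title and abstract, please help me polish the "
        "abstract and rewrite it into a 250-word abstract for this research paper. "
        "Please only return the revised abstract. "
        f"Article Title: {paper_title} Abstract: {paper_abstract}"
    )
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


def mix_halves(first: str, second: str) -> str:
    """Mixed conditions: first half of one text followed by the second half of
    the other. Which half of each source is kept is an assumption; the paper
    only specifies a 50/50 composition."""
    first_words, second_words = first.split(), second.split()
    head = first_words[: len(first_words) // 2]
    tail = second_words[len(second_words) // 2 :]
    return " ".join(head + tail)


# GPT/H = GPT text followed by human text; H/GPT = the reverse, e.g.:
# gpt_h = mix_halves(gpt_abstract, human_abstract)
# h_gpt = mix_halves(human_abstract, gpt_abstract)
```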
Key Findings
- Mean prediction scores (Table 2): Originality.ai strongly separates AI from human content (LAK GPT=0.955, GPTR=0.941; EDM GPT=0.975, GPTR=0.993; H≈0.079–0.119). ContentDetector.AI identifies human content reasonably well (H≈0.234–0.246) but is less consistent on AI content (GPT≈0.355–0.391). ZeroGPT yields very low scores for AI content (LAK GPT=0.011; EDM GPT=0.003) and low scores for human content (H≈0.022–0.080), indicating poor AI identification. GPTZero shows a balanced profile with moderate AI-likeness scores (LAK GPT=0.502; EDM GPT=0.517). Winston.ai detects AI relatively well (LAK GPT=0.820; EDM GPT=0.759) but struggles on mixed content.
- RMSE (Table 3): Lower is better. Originality.ai excels on AI-generated content (LAK GPT RMSE=0.172; EDM GPT RMSE=0.101; EDM GPTR RMSE=0.038). GPTZero shows very low RMSE on specific mixed cases (LAK GPT/H=0.048; LAK H/GPT=0.012), aligning closely with the mixed-content ground-truth assumption, but moderate errors elsewhere. ZeroGPT has near-maximal RMSE on GPT content (≈0.993), indicating poor detection. Winston.ai shows relatively low RMSE on human content (LAK H=0.190; EDM H=0.254) but high error on GPTR for LAK (0.870).
- AUC (Table 4): Winston.ai achieves near-perfect discrimination of GPT vs. H (LAK=1.000; EDM=0.991). Originality.ai is strong on GPT>H (LAK=0.902; EDM=1.000) but weaker on GPT>GPTR (LAK=0.698; EDM=0.581). GPTZero performs best on GPT>GPTR for LAK (0.928). Mean AUC for the hierarchical ordering (GPT>GPTR>H) is highest for Winston.ai (LAK=0.667; EDM=0.690).
- Mixed-content analysis (Table 5, Figure 2): When mixed content is assigned a ground truth of 0.5, detectors frequently underestimate the AI proportion. The continuous RMSE analysis shows minima closer to 0.36 (LAK) and 0.40 (EDM) rather than the expected 0.5, indicating a systematic bias toward classifying evenly mixed text as predominantly human.
- Overall: Subscription-based detectors (Originality.ai, Winston.ai) generally outperform free tools at identifying AI-generated content; free tools often identify human text better but are inconsistent on AI content. Across detectors, mixed content remains challenging and is often misclassified as more human-authored than it is.
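To make the evaluation concrete, here is a minimal sketch of the RMSE, pairwise AUC, and continuous ground-truth sweep described in the methodology and findings, using NumPy and scikit-learn. The detector scores below are made-up placeholders, not values from the paper.

```python
# Sketch: scoring detector predictions against the ground-truth assumptions
# (H=0, GPT=GPTR=1, mixed=0.5) and running the continuous ground-truth sweep
# for mixed content. Metric choices follow the paper; arrays are hypothetical.
import numpy as np
from sklearn.metrics import roc_auc_score


def rmse(predictions: np.ndarray, ground_truth: float) -> float:
    """Root-mean-squared error against a single assumed ground-truth value."""
    return float(np.sqrt(np.mean((predictions - ground_truth) ** 2)))


def pairwise_auc(scores_pos: np.ndarray, scores_neg: np.ndarray) -> float:
    """AUC for ranking one content type (e.g., GPT) above another (e.g., H)."""
    labels = np.concatenate(
        [np.ones(len(scores_pos), dtype=int), np.zeros(len(scores_neg), dtype=int)]
    )
    scores = np.concatenate([scores_pos, scores_neg])
    return float(roc_auc_score(labels, scores))


# Placeholder scores for one detector on one venue (illustrative only).
human_scores = np.array([0.05, 0.10, 0.20])   # H abstracts
gpt_scores = np.array([0.90, 0.85, 0.95])     # GPT abstracts
mixed_scores = np.array([0.30, 0.40, 0.35])   # GPT/H or H/GPT abstracts

print("RMSE on H (truth=0):  ", rmse(human_scores, 0.0))
print("RMSE on GPT (truth=1):", rmse(gpt_scores, 1.0))
print("AUC GPT > H:          ", pairwise_auc(gpt_scores, human_scores))

# Continuous sweep: vary the assumed ground truth for mixed content from 0 to 1
# and locate the RMSE minimum; the paper reports minima near 0.36-0.40, i.e.,
# detectors lean toward "human" on evenly mixed text.
truths = np.linspace(0.0, 1.0, 101)
errors = [rmse(mixed_scores, t) for t in truths]
print("RMSE-minimizing ground truth:", truths[int(np.argmin(errors))])
```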
Discussion
Findings indicate notable variability among detectors, with subscription-based tools (Originality.ai, Winston.ai) generally better at identifying AI-generated text, and free tools (ContentDetector.AI, ZeroGPT) more reliable for human text yet weaker for AI. AUC and RMSE analyses highlight that some detectors excel at specific tasks (e.g., GPTZero for distinguishing GPT from GPTR; Winston.ai for GPT vs. H) but lack uniformity across content types. A critical insight is the bias on mixed content: detectors tend to underrate the AI fraction, classifying evenly mixed text as predominantly human. Given these limitations and inconsistencies, the authors caution against using detector outputs as sole evidence to critique or police academic writing. They argue for transparency, ethical disclosure, and open science practices to navigate AI’s role in scholarly work, and for the academic publishing community to develop best-practice guidelines that recognize nuanced AI assistance rather than binary AI/human categorizations.
Conclusion
Limitations
The study examines only five detectors and provides a time-bound snapshot amid rapidly evolving AI/detection technologies. The data are limited to titles and abstracts from two conference proceedings (LAK22, EDM2022), which may constrain generalizability across disciplines or full manuscripts. Ground truth assumptions (e.g., treating GPTR as AI=1; mixed content as 0.5 or 1) are simplifications that may not reflect real-world, nuanced AI involvement. Increasing, often unacknowledged, AI assistance in writing complicates defining a stable human-vs-AI ‘ground truth.’ Broader and longitudinal analyses across diverse academic texts and detector tools are needed.