ChatGPT-3.5 as writing assistance in students’ essays

Ž. Bašić, A. Banovac, et al.

This study by Željana Bašić, Ana Banovac, Ivana Kružić, and Ivan Jerković examines the effectiveness of ChatGPT-3.5 as a writing assistant for master's students. Despite access to the tool, both groups scored an average grade of C, raising questions about the tool's real impact on writing quality for inexperienced users.
Introduction

The study investigates whether using ChatGPT-3.5 as a writing assistant improves students’ argumentative essay performance. Contextualized by ChatGPT’s rapid adoption and prior evidence of AI-generated academic text quality, the authors note ethical and assessment concerns, as well as students’ known challenges with argumentative writing. The research aims to determine if ChatGPT assistance increases essay grades, reduces writing time, and affects text authenticity in a real student setting.

Literature Review

The paper situates ChatGPT within prior work showing its capabilities to generate abstracts, research papers, dissertations, and high-quality essays across disciplines. It discusses ethical concerns (AI authorship, academic integrity) and practical issues (plagiarism detection, evaluation of AI-generated text). A SWOT analysis highlights strengths (natural language generation, personalized feedback), weaknesses (limited topic understanding, bias, lack of higher-order thinking), opportunities (efficiency, accessibility), and threats (academic integrity risks). Literature on students’ argumentative writing indicates persistent difficulties in argumentation and externalization of knowledge, underscoring the need to test AI assistance in this demanding task.

Methodology

Design: A controlled study with second-year master's students (N=18) at the University Department of Forensic Sciences. Students were allocated to two groups matched on study module and GPA. The control group (n=9; GPA 3.92±0.46) wrote essays traditionally; the experimental group (n=9; GPA 3.92±0.57) used ChatGPT-3.5 (version 2.1.0) as assistance.

Task: An argumentative essay titled “The advantages and disadvantages of biometric identification in forensic sciences,” 800–1000 words, in Croatian, formatted with Vancouver citations. Both groups received the scoring methodology and instructions. The experimental group received a brief introduction to ChatGPT and was free to use it in any way (drafting, Q&A, corrections); prompts and responses were not collected.

Procedure: Students had up to 4 hours; the control group was supervised to prevent ChatGPT use. Student identities were coded for anonymity and to avoid grading bias. Two teachers graded using the Schreyer Institute Writing Rubric (mechanics, style, content, format; grades A–D converted to 4–1). Writing time was recorded for each student.

Authorship/originality checks: PlagScan assessed text authenticity. Pairwise similarity across essays was computed in R (textreuse package) using Jaccard similarity. An AI text classifier (OpenAI, 2023) labeled texts as very unlikely, unlikely, unclear, possibly, or likely AI-generated; this tool was chosen because other detectors failed to recognize AI-generated non-English text.

Analysis: Statistical analyses were performed in Excel and R. The final essay score was the average of the four rubric elements. Linear regression tested the effects of group (ChatGPT vs. control), writing duration, module, and GPA on overall essay score; significance threshold P≤0.05.
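The authors computed pairwise similarity in R with the textreuse package; the following Python sketch is an illustrative equivalent, not the authors' code. It shows how Jaccard similarity over word n-gram shingles works; the shingle width n=3 is an assumption for illustration:

```python
def shingles(text, n=3):
    """Split text into overlapping word n-grams (shingles)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b, n=3):
    """Jaccard similarity: shared shingles divided by all shingles."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)
```

Identical texts score 1.0 and fully disjoint texts score 0.0, so the median of 0.002 reported below indicates essays that share almost no phrasing.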

Key Findings
  • Writing duration: ChatGPT group 172.22±31.59 min; control 179.11±31.93 min (similar across groups).
  • Grades: Both groups averaged grade C. Mean essay score was slightly higher in control (2.39±0.71) than ChatGPT (2.00±0.73).
  • Text authenticity: Mean non-authentic text percentage was 11.87%±13.45 in the ChatGPT group vs 9.96%±9.81 in control (slightly higher in ChatGPT).
  • Pairwise similarity: Generally low across all essays, median Jaccard similarity 0.002 (range 0–0.054).
  • AI text classifier: Control group—2 possibly AI-generated, 1 likely, 2 unlikely, 4 unclear. ChatGPT group—3 possibly, 5 likely, 1 unclear.
  • Associations (visualized): Positive association of writing duration and GPA with essay scores; in the ChatGPT group, higher non-authentic text proportion associated with lower essay scores.
  • Regression: Overall model showed moderate positive relationship (R=0.573; P=0.237). No significant predictors: group (P=0.184), writing duration (P=0.669), module (P=0.388), GPA (P=0.532).
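The grade figures above follow the conversion described in Methodology: letter grades A–D mapped to 4–1, with the final essay score taken as the mean of the four rubric elements. A minimal sketch (function and parameter names are illustrative, not from the paper):

```python
# Letter-to-point mapping from the study's rubric conversion (A-D -> 4-1).
GRADE_POINTS = {"A": 4, "B": 3, "C": 2, "D": 1}

def essay_score(mechanics, style, content, fmt):
    """Average the four rubric elements after converting letters to points."""
    letters = (mechanics, style, content, fmt)
    return sum(GRADE_POINTS[g] for g in letters) / len(letters)
```

On this scale a straight-C essay scores 2.0, which matches the group means reported above (2.00 and 2.39).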

Discussion

The findings indicate that ChatGPT assistance did not improve student essay performance: quality scores were slightly lower, writing was not faster, and text authenticity did not improve compared to traditional writing. Potential reasons include over-reliance on the tool, unfamiliarity with ChatGPT, and the model’s limited topic understanding, which may compound students’ existing challenges with argumentation. Although some prior studies with expert users report strong outcomes, those results may reflect skilled prompt design and critical evaluation abilities not present in novice students. Longer writing time correlated with higher scores in both groups, suggesting that integrating AI outputs with personal style and sources can be time-consuming. Slightly higher plagiarism indicators in the ChatGPT group did not translate into high cross-essay similarity, implying varied inputs and uses across students. AI text detection produced false positives in the control group and is less reliable for non-English texts, highlighting a need for improved, language-aware detectors. Overall, in this real-life educational setting, ChatGPT did not confer clear advantages and may distract or hinder inexperienced users.

Conclusion

ChatGPT-3.5, when used by students as a general-purpose writing assistant, did not enhance essay grades, reduce writing time, or improve authenticity relative to traditional writing. The tool’s impact appears dependent on users’ prior knowledge and skill in prompting and critical evaluation. Given detector limitations and the study’s context, concerns about widespread increases in non-original content may be moderated, though vigilance remains important as detection tools evolve. Future research should involve larger and more diverse samples, varied experimental conditions, additional variables related to writing skills, and analyses of prompt quality and user–AI interaction to better understand when and how AI assistance benefits student writing.

Limitations
  • Small sample size (n=9 per group) limits statistical power and generalizability.
  • Language specificity: Essays were written in Croatian, constraining the effectiveness of AI-text detectors and possibly limiting ChatGPT’s performance compared to English.
  • First-time ChatGPT use: Participants’ lack of experience and limited training with AI tools may have negatively affected outcomes.
  • Real-life usage design: Emulates rapid, convenience-driven application, which may not reflect best-practice, guided use; results may differ with experienced users or structured training.