Education
Detecting contract cheating through linguistic fingerprint
M. Kutbi, A. H. Al-hoorie, et al.
The paper addresses the challenge of contract cheating, which increased notably during emergency remote teaching amid COVID-19. Traditional plagiarism detection systems cannot identify contract cheating because the submitted work is original but not authored by the student. The research investigates whether an AI-based method can detect inconsistency in an individual learner’s writing style, leveraging the idea that foreign language learners display consistent, idiosyncratic linguistic patterns (linguistic fingerprints). The central research question is whether a machine learning model can identify inconsistencies indicative of potential contract cheating by comparing an essay against the same student’s prior writing rather than external sources.
The literature highlights drivers of contract cheating, including dissatisfaction with the learning environment, abundant cheating opportunities (e.g., essay mills, online services), and disadvantages faced by learners of other languages. Rising cheating undermines academic credibility and correlates with broader unethical behavior. Traditional technological countermeasures (e.g., proctoring, lockdown browsers, plagiarism detectors) have limitations, particularly for original or AI-generated texts. The rise of AI language models like ChatGPT complicates detection because generated content is novel and can evade similarity-based tools; AI-text detectors also show poor reliability and biases. Analytical/ML approaches relevant to detection are reviewed: logistic regression (simple, interpretable but sensitive to input changes and dimensionality), LightGBM (efficient gradient boosting), and data imbalance methods including Random Oversampling and SMOTE (with pros/cons regarding overfitting vs. synthetic sample generation). These inform the study’s model design to detect stylistic inconsistency rather than text similarity.
Design: Supervised ML pipeline that classifies essay pairs as consistent (same student) or non-consistent (different students), focusing on a learner’s linguistic fingerprint.
Participants: N=150 freshmen (engineering/business) EFL learners in Saudi Arabia (B1–B2 proficiency after a foundation year).
Materials: Descriptive essay prompts on familiar topics (e.g., campus description, weekend activities), with a consistent genre to control genre effects.
Procedure: In-class writing, one essay every two weeks (seven essays per student over a semester), no dictionaries/smartphones, voluntary and non-graded. Ethical approval was obtained.
Data preprocessing: Using Python (e.g., NLTK): remove punctuation, numbers/special characters, and extra spaces/empty lines; lowercase tokens; remove English stop words.
Data transformation: TF-IDF vectorization (the paper’s figure shows 1-gram and 2-gram features; the reported experiments use 1-gram TF-IDF and CountVectorizer variants). The TF-IDF weight of term t in document d is computed as W_dt = tf × log(N/df), where tf is the frequency of t in d, N is the number of documents, and df is the number of documents containing t.
Dataset construction and labeling: 1,050 essays (150 students × 7 essays) were digitized. Because the original essays were unlabeled for authorship pairing, labels were constructed as follows: (a) Non-consistent: pair each essay from Student X with an essay from Student Y. (b) Consistent: split each essay by Student X into two sub-paragraph essays, ensuring both halves are by the same author. Total dataset: 28,302 samples (27,710 non-consistent; 592 consistent), indicating severe class imbalance.
Train-test split and oversampling: Hold-out split with 70% training and 30% testing. Oversampling was applied only to the training data to address the imbalance. Random oversampling duplicated minority-class samples; alternative techniques (SMOTE, ADASYN) were also evaluated. Before oversampling, the training set held 19,811 samples (425 consistent; 19,386 non-consistent); random oversampling raised the consistent class from 425 to 19,386 samples, matching the majority class. The test set contained 8,491 samples (167 consistent; 8,324 non-consistent).
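The preprocessing and TF-IDF steps described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors’ code: the function names, the tiny stop-word list, the three sample sentences, and the use of the natural logarithm are all assumptions (the paper used NLTK’s English stop-word list and does not state the log base here).

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the paper used NLTK's English list).
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "on", "of", "to", "in", "i", "my"}


def preprocess(text):
    """Lowercase, drop punctuation/digits/special characters, remove stop words."""
    letters_only = re.sub(r"[^a-z\s]", " ", text.lower())
    # str.split() with no argument also collapses extra spaces and empty lines.
    return [tok for tok in letters_only.split() if tok not in STOP_WORDS]


def tfidf(docs):
    """Return one {term: weight} dict per document, with W_dt = tf * log(N / df)."""
    n_docs = len(docs)
    # df: number of documents that contain each term at least once.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw count of each term within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Made-up sentences in the spirit of the essay prompts (not data from the study).
essays = [
    "The campus is big, and the library is quiet.",
    "On the weekend I visit my family.",
    "My campus has 2 cafes and a big garden!",
]
tokens = [preprocess(e) for e in essays]
weights = tfidf(tokens)

print(tokens[0])
print(weights[0])
```

Terms that appear in fewer essays (here `library`) receive a larger weight than terms shared across essays (here `campus`). Library vectorizers such as scikit-learn’s `TfidfVectorizer` apply smoothing and normalization by default, so their values would differ from this plain formula.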
Models and tuning: Compared naive Bayes, logistic regression (LR), and LightGBM. Hyperparameter optimization via grid search with cross-validation. LR search space: penalty ∈ {l1, l2, elasticnet, none}; solver ∈ {lbfgs, newton-cg, liblinear, sag, saga}; max_iter ∈ {100, 200, 300, 400}. Best LR params: max_iter=200, penalty='l1', solver='liblinear'. Evaluation: Metrics included accuracy, precision, recall, F1-score; class-wise performance also examined due to imbalance. Experiments were run in Python (Google Colab).
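The parameter names in the search space above match scikit-learn’s `LogisticRegression`; assuming that is the library used, not every solver supports every penalty, so part of the full grid cannot be fitted. The sketch below enumerates the reported grid and filters it with the solver/penalty compatibility listed in the scikit-learn documentation. The variable names are illustrative, and how the authors handled unsupported combinations is not stated.

```python
from itertools import product

# Search space as reported in the text.
PENALTIES = ["l1", "l2", "elasticnet", "none"]
SOLVERS = ["lbfgs", "newton-cg", "liblinear", "sag", "saga"]
MAX_ITERS = [100, 200, 300, 400]

# Penalties each solver supports, per the scikit-learn LogisticRegression docs.
SUPPORTED = {
    "lbfgs": {"l2", "none"},
    "newton-cg": {"l2", "none"},
    "liblinear": {"l1", "l2"},
    "sag": {"l2", "none"},
    "saga": {"l1", "l2", "elasticnet", "none"},
}

# Full Cartesian grid versus the subset a solver can actually fit.
all_combos = list(product(PENALTIES, SOLVERS, MAX_ITERS))
valid_combos = [(p, s, m) for p, s, m in all_combos if p in SUPPORTED[s]]

print(len(all_combos), len(valid_combos))

# The best configuration reported in the text lies in the valid subset.
best = ("l1", "liblinear", 200)
print(best in valid_combos)
```

Under these assumptions, 48 of the 80 combinations are fittable. In practice the `elasticnet` penalty also needs an `l1_ratio` value, and recent scikit-learn releases spell the no-penalty option as `None` rather than the string `'none'`.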
- Without oversampling (preprocessing + CountVectorizer or TF-IDF 1-gram): models achieved high overall accuracy but performed poorly on the consistent class due to imbalance. Best overall: Logistic Regression + Bag-of-Words (CountVectorizer) with accuracy 98.06%, precision 99.86%, recall 98.06%, F1 98.92%. However, per-class results showed very poor performance on the consistent class (precision 3.59%, recall 66.66%, F1 6.81%; non-consistent class F1 99.02%), indicating misleading overall accuracy.
- With oversampling: Applying oversampling substantially improved minority-class performance. Across Random Oversampling, SMOTE, and ADASYN, Logistic Regression consistently outperformed naive Bayes and LightGBM.
- Best overall after oversampling: Logistic Regression + Random Oversampling achieved accuracy 98.03%, precision 98.52%, recall 98.03%, F1 98.24% (reported also as 0.980332/0.985253/0.980332/0.982432 with CountVectorizer in Table 6). Per-class performance for the consistent class improved markedly: accuracy 53.88%, precision 55.62%, recall 46.11%, F1 50.00%; non-consistent class: accuracy 99.15%, precision 98.48%, recall 99.20%, F1 98.84%.
- Hyperparameter tuning (grid search) further improved consistent-class metrics slightly: consistent accuracy 54.63%, precision 56.37%, recall 46.11%, F1 50.42%; non-consistent class remained very strong (accuracy 99.15%, precision 98.69%, recall 99.22%, F1 98.95%).
- Overall, the model detects stylistic inconsistency and can flag potentially contract-cheated essays; balancing the training data mainly improved detection of the minority (consistent) class.
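Why overall accuracy can mislead on this test set can be checked with simple arithmetic. The sketch below takes the reported test-set class sizes (167 consistent; 8,324 non-consistent) and scores a hypothetical baseline that labels every pair non-consistent. The baseline and the helper `per_class_metrics` are illustrative assumptions, not part of the paper.

```python
def per_class_metrics(tp, fp, fn):
    """Precision, recall and F1 for one class from its confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Test-set class sizes reported in the text.
n_consistent, n_non_consistent = 167, 8324
total = n_consistent + n_non_consistent

# Hypothetical baseline: label every pair "non-consistent".
baseline_accuracy = n_non_consistent / total

# Consistent class: nothing is ever predicted consistent, so all 167 are missed.
consistent = per_class_metrics(tp=0, fp=0, fn=n_consistent)

# Non-consistent class: all 8,324 are found, and the 167 consistent pairs
# are wrongly counted among them.
non_consistent = per_class_metrics(tp=n_non_consistent, fp=n_consistent, fn=0)

print(f"accuracy: {baseline_accuracy:.4f}")
print("consistent P/R/F1:", consistent)
print("non-consistent P/R/F1:", tuple(round(v, 4) for v in non_consistent))
```

This trivial baseline already reaches about 98.03% accuracy while finding none of the consistent pairs, which is why the per-class figures carry the real information in a dataset this imbalanced.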
The study demonstrates that focusing on deviations from a student’s own linguistic fingerprint is a viable complementary strategy to traditional plagiarism detection, which cannot capture original but outsourced or AI-generated work. While overall accuracy was high even without balancing, minority-class (consistent) performance was initially poor, underscoring the necessity of handling class imbalance. Random oversampling combined with logistic regression substantially improved consistent-class detection while maintaining strong overall performance. In the broader context of AI (e.g., ChatGPT), where generated texts evade similarity-based tools and current AI-text detectors show limited reliability, individualized stylistic consistency checks offer a practical signal for further review. Ethically, the authors caution that model outputs are indicators rather than proof; flagged essays warrant human follow-up. Privacy and fairness must be safeguarded, and model interpretability is important for trust. A multi-pronged integrity strategy is recommended, combining analytics with education, policy, and cultural interventions.
The paper presents an ML approach that detects potential contract cheating by modeling a learner’s linguistic fingerprint and identifying inconsistencies across a semester’s essays. Using preprocessing, TF-IDF/Bag-of-Words features, class balancing (notably random oversampling), and logistic regression with grid-searched hyperparameters, the model achieved high overall performance and improved detection for the minority (consistent) class. The approach augments, rather than replaces, traditional plagiarism detection by focusing on authorial consistency rather than text similarity. Given the prevalence of AI-generated texts, this method provides a promising additional indicator. Future directions include improving detection models (advanced NLP/ML), longitudinal studies of second-language development over longer periods, assessing applicability to native-language writers across educational levels, and continued attention to ethics, privacy, and interpretability.
- The dataset is highly imbalanced (27,710 non-consistent vs. 592 consistent), necessitating oversampling, which can risk overfitting despite performance gains.
- Data are limited to second-language (EFL) learners and a single genre (descriptive essays) within one institution; generalizability to other populations (e.g., native speakers, different genres, institutions) is uncertain.
- Constructed labeling (pairing across students for non-consistent and splitting essays for consistent) may not capture all real-world cheating scenarios.
- Logistic Regression can be sensitive to input dimensionality and to small changes in the input; additional models/feature engineering could improve robustness.
- Model outputs are indicative, not definitive evidence of cheating; human verification remains necessary.
- AI tools given enough samples of a student’s writing could mimic that student’s style, which would reduce the method’s effectiveness in some cases.
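The overfitting risk noted in the first limitation follows from how random oversampling works: it only repeats existing minority samples and adds no new information. The sketch below is a minimal standard-library version applied to a toy training split; the function name, seed, and toy data are assumptions, and the authors’ actual implementation is not specified here.

```python
import random


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until every class
    has as many samples as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        # Draw the missing samples with replacement from the class itself.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        out_x.extend(xs + extra)
        out_y.extend([y] * target)
    return out_x, out_y


# Toy training split: 3 "consistent" pairs vs 12 "non-consistent" pairs.
train_x = [f"c{i}" for i in range(3)] + [f"n{i}" for i in range(12)]
train_y = ["consistent"] * 3 + ["non-consistent"] * 12

# Balance the training data only; a test split would be left untouched.
bal_x, bal_y = random_oversample(train_x, train_y)

consistent_rows = [x for x, y in zip(bal_x, bal_y) if y == "consistent"]
print(len(bal_x), len(consistent_rows), len(set(consistent_rows)))
```

After balancing, the consistent class has as many rows as the majority class but still only three distinct samples, so a model can end up memorizing those few examples. SMOTE and ADASYN, which the study also evaluated, generate synthetic samples instead of exact copies.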

