AI-assisted Peer Review

A. Checco, L. Bracciale, et al.

This study, conducted by Alessandro Checco, Lorenzo Bracciale, Pierpaolo Loreti, Stephen Pinfield, and Giuseppe Bianchi, delves into the role of AI in enhancing the peer-review process for scientific literature. With increasing submission volumes, the authors propose innovative AI tools aimed at improving efficiency and identifying biases, while also addressing the ethical implications and potential biases of these algorithms.
Introduction

The peer review system faces increasing strain due to growing submission volumes (around 6.1% annual growth since 2013) and rising retraction rates, with an estimated 15 million reviewer hours spent annually on resubmitted rejected manuscripts. Automated tools already assist with plagiarism detection, formatting checks, reviewer matching, and method/statistics assessment, though their reliability and acceptance vary. Beyond efficiency, concerns persist that peer review may reinforce biases (e.g., gender, language, institutional affiliation). This study investigates how AI might assist pre-peer review screening and aspects of peer review, potentially saving reviewer time and uncovering biases, without replacing human judgment. The authors articulate four research questions:
  • RQ1: To what extent can AI approximate human decisions in quality assessment and peer review?
  • RQ2: Can AI reduce the time reviewers spend assessing papers?
  • RQ3: Can AI uncover common biases in the review process?
  • RQ4: What are the ethical implications of such tools?
The study focuses on conference proceedings, training an AI model to predict reviewer outcomes from manuscript text and macroscopic features, and then examining model explainability and ethical considerations.

Literature Review

Peer review operates within a competitive and collaborative research ecosystem and remains central to quality control despite recognized limitations. Variants such as open peer review and soundness-only review in open-access mega-journals shift aspects of evaluation pre- and post-publication. Prior research highlights issues in peer review including opportunistic rejections and acceptance of low-quality work, and experiments with alternative processes (open review, community commenting). Automated Essay Scoring has been used in MOOCs, though contested. Readability assessment and NLP-based quality evaluation show promise but fully automated assessment of complex scholarly documents remains challenging. Socio-cultural and algorithmic biases (e.g., first-impression effects, language, prestige) are documented and risk being perpetuated by AI systems trained on historical data. Studies on first-impression bias and typographic layout suggest superficial presentation can disproportionately influence judgments. Modeling peer assessment and reviewer bias has been attempted in educational contexts, indicating feasibility of partial automation and bias modeling. This literature motivates cautious exploration of AI to assist (not replace) aspects of peer review, while foregrounding risks of bias and opacity.

Methodology

Data collection: The study used three conference datasets totaling about 3,300 submissions with review outcomes. From WCNC 2018 (provided by the general chair) the authors obtained submitted manuscripts, reviewer scores, and editorial decisions. From OpenReview (ICLR 2018 and 2019), they collected submissions with reviewer scores and accept/reject decisions, removing any post-acceptance metadata (e.g., author names/affiliations). Due to weak correlation between final editorial decisions and content in OpenReview, the focus there was on predicting the average reviewer score rather than final acceptance.
Feature extraction: PDFs were converted to text. For WCNC, binary encoding of the top 20,000 most frequent words was used; for OpenReview, TF-IDF of the top 2,000 words was used. Feature augmentation included macroscopic features capturing formatting and readability: text/image ratio, file size, text length, number of pages, and a suite of readability and quality metrics (e.g., Automated Readability Index, Flesch-Kincaid grade, Flesch Reading Ease, Gunning Fog, LIX, Linsear Write, SMOG, Dale–Chall, difficult words ratio, syllable/letter/character counts, sentence count, polysyllabic counts), as listed in Table 2 of the paper.
Neural network design and training: A dense neural network was used with a 32-neuron ReLU layer, dropout, then a 16-neuron ReLU layer, and a final single-node output with sigmoid activation for acceptance classification or score regression. Trainable parameters: ~640,577 (WCNC) and ~64,577 (OpenReview), reflecting different input dimensions. Loss functions: binary cross-entropy for WCNC classification; mean squared error (MSE) for OpenReview regression. Optimizer: SGD with Nesterov momentum. Standard train/test splits were used.
Baselines: a random classifier for WCNC (balanced set implies ~50% F1) and a naive regressor using the median score for OpenReview.
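
The feature construction and network size described above can be illustrated with a minimal sketch. It is not the authors' code: the function names, the regex tokenizer, and the sample sentences are hypothetical, and only two of the listed readability metrics (Automated Readability Index and LIX) are shown, using their standard published formulas with letters-only character counts as a simplification. The parameter count assumes the input width is exactly the vocabulary size (20,000 or 2,000); under that assumption the arithmetic reproduces the quoted figures, but whether the macroscopic features share the same input layer is not stated in this summary.

```python
import re


def tokenize(text):
    """Lowercase alphabetic tokens; a simple stand-in for a real tokenizer."""
    return re.findall(r"[a-z]+", text.lower())


def binary_bow(text, vocabulary):
    """WCNC-style encoding: 1 if the vocabulary word occurs in the text, else 0."""
    present = set(tokenize(text))
    return [1 if word in present else 0 for word in vocabulary]


def readability(text):
    """Automated Readability Index and LIX from word, sentence and letter counts."""
    words = re.findall(r"[A-Za-z]+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words, n_sentences = len(words), len(sentences)
    n_chars = sum(len(w) for w in words)
    n_long = sum(1 for w in words if len(w) > 6)  # LIX: words longer than 6 letters
    ari = 4.71 * n_chars / n_words + 0.5 * n_words / n_sentences - 21.43
    lix = n_words / n_sentences + 100.0 * n_long / n_words
    return {"ari": ari, "lix": lix}


def dense_param_count(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Hypothetical three-word vocabulary, for illustration only.
vector = binary_bow("We propose an efficient approach.",
                    ["approach", "tensor", "efficient"])
scores = readability("The cat sat. The dog ran.")

# Input -> 32 -> 16 -> 1, as described in the text (dropout adds no parameters).
wcnc_params = dense_param_count([20000, 32, 16, 1])        # 640577
openreview_params = dense_param_count([2000, 32, 16, 1])   # 64577
```

The parameter totals are dominated by the first layer, which is why a tenfold smaller vocabulary gives a roughly tenfold smaller model.
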
Model explainability: Local Interpretable Model-agnostic Explanations (LIME) with submodular pick was employed to derive local and representative global explanations of feature contributions for individual documents and across the space.
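
The core idea behind a LIME-style explanation can be sketched as follows, under loud assumptions: this is a simplified proximity-weighted linear surrogate written from scratch, not the LIME library and not the authors' pipeline. It omits the regularisation, feature selection, and submodular pick step of the real method, and `toy_model`, `explain_locally`, and the kernel width are hypothetical choices made for illustration.

```python
import math
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def explain_locally(predict, n_features, n_samples=500, kernel_width=0.75, seed=0):
    """Fit a weighted linear surrogate around the instance with all features present."""
    rng = random.Random(seed)
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        # Perturb the instance by randomly switching features (e.g. words) off.
        mask = [rng.randint(0, 1) for _ in range(n_features)]
        distance = 1.0 - sum(mask) / n_features  # fraction of features removed
        weights.append(math.exp(-(distance ** 2) / kernel_width ** 2))
        rows.append([1.0] + [float(v) for v in mask])  # leading 1.0 = intercept
        targets.append(predict(mask))
    d = n_features + 1
    # Weighted normal equations: (X^T W X) beta = X^T W y.
    xtwx = [[sum(w * r[i] * r[j] for r, w in zip(rows, weights))
             for j in range(d)] for i in range(d)]
    xtwy = [sum(w * r[i] * y for r, w, y in zip(rows, weights, targets))
            for i in range(d)]
    beta = solve_linear_system(xtwx, xtwy)
    return beta[0], beta[1:]


def toy_model(mask):
    """Hypothetical black box: word 0 raises the score, word 1 lowers it, word 2 is ignored."""
    return 0.5 + 0.3 * mask[0] - 0.2 * mask[1]


intercept, contributions = explain_locally(toy_model, n_features=3)
```

Because the toy black box is itself linear, the surrogate recovers its coefficients; for a real neural network the fitted weights are only a local approximation, which is why the local rules call for cautious interpretation.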

Key Findings
  • WCNC acceptance classification: Dense NN achieved Accuracy 74.01%, F1-score 72.30%, Precision 72.45%, Recall 73.19%, outperforming a random baseline (~50% on balanced data).
  • ICLR (OpenReview) reviewer score regression: Dense NN outperformed a naive median regressor with MAE 0.79 vs. 0.96 and MSE 0.90 vs. 1.40. On the test set, 75% of samples had absolute error under 1.2 (10-point scale), with a median absolute error of 0.79.
  • Feature importance patterns: Readability and macroscopic features (e.g., Linsear Write, LIX, Flesch-Kincaid grade, text length, number of pages, file size) and certain word features contributed strongly to predictions. Example influential tokens included terms such as address, approach, conclusion, correlation, efficient, illustrates, previously (WCNC), and tokens like actor, causal, dnn, imagenet, precision, regression, tensor, token, width (OpenReview).
  • LIME explanations revealed context-dependent effects and potential overfitting (e.g., spurious influence of domain-specific words like "decoding"), highlighting both the transparency benefits and caution needed in interpreting local rules.
  • Core empirical result: Despite relying on relatively superficial features (formatting, readability, word distributions), the models correlated strongly with human reviewer outcomes, indicating either that such features proxy for overall quality or that first-impression biases influence reviewer decisions.
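
For readers less familiar with the reported metrics, the sketch below shows how they are typically computed and how a median-score baseline is scored. All numbers in it are made up for illustration and are unrelated to the paper's data; the function names are hypothetical, and MAE is assumed here to mean mean absolute error.

```python
from statistics import mean, median


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = accept)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(pairs),
            "precision": precision, "recall": recall, "f1": f1}


def regression_errors(y_true, y_pred):
    """Mean absolute error and mean squared error."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    return {"mae": mean([abs(e) for e in errors]),
            "mse": mean([e * e for e in errors])}


# Toy accept/reject labels (invented values).
cls = classification_metrics([1, 1, 1, 0, 0, 0, 1, 0],
                             [1, 1, 0, 0, 0, 1, 1, 0])

# Naive baseline: always predict the median of the training scores.
train_scores = [4, 5, 6, 7, 5]
test_scores = [5, 6, 7]
baseline_pred = [median(train_scores)] * len(test_scores)
model_pred = [5.5, 6.0, 6.5]  # invented model outputs

baseline = regression_errors(test_scores, baseline_pred)
model = regression_errors(test_scores, model_pred)
```

A model is only informative if it beats such a baseline on both error measures, which is the comparison the regression finding above reports.
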
Discussion

The ability of AI models trained on superficial textual and formatting features to predict reviewer outcomes suggests two interpretations: (1) these features serve as useful proxies for overall scholarly quality (clarity, rigor reflected in presentation), supporting their use in pre-screening to reduce reviewer load; and/or (2) first-impression biases (e.g., typography, language errors) unduly influence reviewers, potentially disadvantaging well-conducted but poorly presented work. In practice, AI tools could support editors and reviewers by flagging papers for formatting/readability issues, advising authors pre-review, and prioritizing administrative checks. The study’s explainability analysis (via LIME) can surface latent decision rationales and potential biases, aiding transparency and trust. However, risks include perpetuating historical biases and overfitting to domain idiosyncrasies. The authors outline potential AI impact across peer review dimensions (high for formatting, plagiarism, scope; medium for readability and relevance; low for soundness/rigor, novelty, and impact), recommending AI as decision support rather than replacement. Ethical analysis emphasizes avoiding opacity (inscrutable evidence), mitigating bias from misguided evidence and design choices, and guarding against transformative effects that could compromise reviewer autonomy or amplify status-quo biases. Overall, careful design, transparency, and governance are essential for positive deployment.

Conclusion

The study demonstrates that AI models using textual distributions and macroscopic readability/formatting features can often approximate human reviewer outcomes on conference submissions, achieving substantial gains over baselines in both acceptance classification (WCNC) and score regression (ICLR). This indicates strong correlations between superficial presentation features and review results. Such tools could improve efficiency in quality control and peer review workflows and provide insights into reviewer decision processes through explainable AI. The authors stress that AI should assist, not replace, human evaluation, and that ethical considerations are paramount to prevent bias propagation. Future work includes: (1) controlled experiments to study feedback effects on reviewers when exposed to AI signals; (2) incorporating full textual reviews and rebuttals into modeling; (3) expanding perception-related features (e.g., typographic layout) and deeper analysis of language/formatting decision rules; (4) examining disciplinary variation in AI-assisted assessment; and (5) extending methods to grant application reviews, accounting for different document structures and heterogeneity.

Limitations
  • Scope limited to three conference datasets (WCNC 2018; ICLR 2018 and 2019 via OpenReview), which may constrain generalizability across venues and disciplines.
  • Models use predominantly superficial features (readability, formatting, word distributions) and do not capture deep domain-specific content or methodological rigor; textual reviews were not used.
  • Potential overfitting and context-dependent token effects observed; explainability exposed spurious associations (e.g., domain-specific terms).
  • AI tools risk replicating historical human biases present in training data (e.g., linguistic, regional, institutional), and may disadvantage under-represented groups if used uncritically.
  • Ethical concerns include opacity of complex models, bias from design choices, and potential influence on reviewer autonomy when AI signals are presented.
  • Disciplinary norms and document structure variability may limit transferability; the approach is intended for semi-automated assistance, not replacement of expert judgment.