
An open source machine learning framework for efficient and transparent systematic reviews
R. van de Schoot, J. de Bruin, et al.
ASReview, developed by a team of researchers at Utrecht University, applies active learning with machine learning to the screening phase of systematic reviews. The open-source tool prioritizes likely relevant records so reviewers can reach high recall after screening only a fraction of their search results, and it invites community-driven contributions.
Introduction
The volume of scientific publications is rapidly increasing, making comprehensive, transparent, and reproducible evidence synthesis via systematic reviews and meta-analyses increasingly difficult. Screening titles and abstracts is a major bottleneck because only a small fraction of records are relevant, leading to an imbalanced and error-prone manual process. Researchers often narrow searches to keep workloads manageable, risking missed relevant studies. Recent advances in machine learning, particularly active learning within human-in-the-loop frameworks, offer a way to prioritize likely relevant records and reduce screening burden. However, existing tools are often closed-source, lack transparency and data ownership guarantees, provide limited flexibility across diverse review topics, and offer few benchmarking options. The study presents ASReview, an open-source, machine learning–aided active learning pipeline designed to efficiently and transparently prioritize records for screening in systematic reviews, addressing the need for flexible models, reproducibility, and benchmarking.
Literature Review
The paper situates ASReview within prior work on semi-automated screening and active learning for systematic reviews. Prior systems (for example, Abstrackr, Colandr, FASTREAD, Rayyan, RobotAnalyst) implement active learning with various classifiers (commonly SVMs), feature extraction methods (e.g., TF-IDF, Word2Vec, topic models), and query strategies (uncertainty- or certainty-based), but many are closed source, impose restrictive data policies, offer limited algorithmic flexibility, and lack benchmarking capabilities. The literature highlights active learning's potential to reduce workload and improve prioritization, and emphasizes the need for transparent, reproducible workflows and researcher-in-the-loop settings where the primary output is a selected set of records rather than a trained model. The authors also reference guidelines and best practices for systematic reviews (e.g., PRISMA) and previous evaluations of screening tools, motivating an open, extensible framework with benchmark support.
Methodology
The authors describe ASReview, an open-source (Apache 2.0) framework with three modes: (1) Oracle mode for interactive screening; (2) Simulation mode for benchmarking performance on labeled datasets; and (3) Exploration mode for teaching with preloaded datasets.

Workflow: Users perform standard database searches and upload records (RIS, CSV, XLS/XLSX). Mandatory text fields are titles and/or abstracts; other metadata (author, date, DOI, keywords) may be present but are not used in model training, to avoid authority bias. Users provide prior knowledge (at least one relevant and one irrelevant record) and then enter an active learning cycle in which the model iteratively retrains and selects the next record to label until a user-defined stopping criterion is met (a minimal sketch of this cycle appears at the end of this section).

Implemented components:
- Classifiers: Naive Bayes (default), Support Vector Machine, Logistic Regression, Neural Network, Random Forests, LSTM-base, LSTM-pool.
- Feature extraction: TF-IDF (default; optionally with n-grams), Embedding-IDF, Doc2Vec, Sentence-BERT, Embedding LSTM (LSTM-base/pool require Embedding LSTM).
- Query strategies: certainty-based (default), uncertainty-based, random, mixed (e.g., 5% random / 95% certainty-based).
- Balance strategies: dynamic resampling (default), undersampling, full sampling/simple (no balancing).
Defaults were selected for robust performance and computational efficiency on local machines: NB + TF-IDF + dynamic resampling + certainty-based sampling.

Technical considerations: Texts are lowercased; stop-word removal is optional. A document-term matrix is precomputed and cached in a state file to speed up iterations; records are identified by row indices. The system can run locally or on self-hosted servers, so data remain on the user's machine to preserve ownership and confidentiality.

Simulation study methodology: Version 0.7.2 was used to benchmark performance on four labeled datasets. For each dataset, 15 simulation runs were performed, each initialized with one random inclusion and one random exclusion. Performance metrics were Work Saved over Sampling (WSS) at given recall levels (notably WSS@95%) and the proportion of relevant records found after screening the first 10% (RRF10%). Datasets:
(1) Viral metagenomics in livestock: 2,481 records, 120 inclusions (4.84%).
(2) Fault prediction in software engineering: 8,911 records, 104 inclusions (1.2%).
(3) Unsupervised machine learning on longitudinal PTSD symptom trajectories: 5,782 records, 38 inclusions (0.66%).
(4) ACE inhibitor efficacy (TREC 2004 subset): 2,544 records, 41 inclusions (1.6%).

User experience (UX) testing methodology: Iterative UX evaluations included unstructured interviews with experienced reviewers, systematic remote moderated tests (N=11; two cohorts: inexperienced and experienced users), thematic analysis of observations (coded as showstopper/doubtful/superb), and iterative software releases addressing the feedback (e.g., dataset upload, prior knowledge selection, GUI improvements, a project dashboard, documentation updates).
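To make the screening cycle concrete, the following is a minimal sketch of a certainty-based active learning loop built from plain scikit-learn components analogous to the ASReview defaults (TF-IDF features, Naive Bayes classifier, certainty-based querying). It is illustrative only: it is not the ASReview API, the balance strategy and state file are omitted, and the record texts, prior labels, and stopping rule are placeholders.

```python
# Illustrative sketch of a researcher-in-the-loop active learning cycle,
# using scikit-learn components analogous to the ASReview defaults.
# Not the ASReview implementation; texts, labels and stopping rule are placeholders.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["title and abstract of record 1", "record 2 ...", "record 3 ..."]  # uploaded records
labels = {0: 1, 1: 0}  # prior knowledge: at least one relevant (1) and one irrelevant (0)

# Document-term matrix computed once and reused across iterations.
X = TfidfVectorizer(lowercase=True).fit_transform(texts)

def screening_done(labels):
    # Placeholder for a user-defined stopping criterion,
    # e.g. a run of consecutive irrelevant records.
    return len(labels) == len(texts)

while not screening_done(labels):
    # 1. Retrain the classifier on all records labelled so far.
    idx = list(labels)
    clf = MultinomialNB().fit(X[idx], [labels[i] for i in idx])

    # 2. Certainty-based query: pick the unlabelled record most likely to be relevant.
    unlabelled = [i for i in range(len(texts)) if i not in labels]
    scores = clf.predict_proba(X[unlabelled])[:, 1]
    next_record = unlabelled[int(np.argmax(scores))]

    # 3. The researcher (oracle) reads the title/abstract and labels the record;
    #    in simulation mode the known label would be used instead.
    labels[next_record] = int(input(f"Relevant? (1/0) for record {next_record}: "))
```

Because the feature matrix is precomputed and cached (in ASReview, in the project state file), only the classifier is refit at each iteration, which keeps retraining fast enough to run interactively on a local machine.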
Key Findings
- Simulation performance: Across four datasets and 15 runs each (initialized with one random inclusion and one random exclusion), the average WSS at 95% recall was 83%, ranging from 67% to 92%. Practically, 95% of eligible studies were found after screening only 8% to 33% of records. The proportion of relevant records found after screening the first 10% (RRF10%) ranged from 70% to 100%. These results indicate substantial reduction in screening workload relative to random order (the sketch after this list shows how these metrics are computed).
- Usability outcomes: In systematic UX testing (N=11), the overall mean user rating was 7.9/10 (s.d. = 0.9). Inexperienced users: mean 8.0 (s.d. = 1.1, N=6); experienced users: mean 7.8 (s.d. = 0.9, N=5). Feedback led to multiple releases (v0.10, v0.10.1, v0.11) with a redesigned GUI, improved dataset upload and prior knowledge selection, a project dashboard, and expanded documentation.
- Privacy and transparency: ASReview runs locally with data retained on the user's machine, addressing data ownership and confidentiality concerns. The framework is open source, extensible, and supports benchmarking via simulation mode and public datasets/scripts (Zenodo and OSF DOIs provided).
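For reference, both metrics can be computed directly from the order in which a simulation run screens a fully labeled dataset. The sketch below shows one straightforward way to do so; the `order` and `labels` variables and the toy example are illustrative assumptions, not output of ASReview.

```python
# Illustrative computation of WSS@95% and RRF10% from one simulation run.
# `order` is the sequence of record indices as screened; `labels[i]` is 1 if
# record i is an inclusion in the fully labeled dataset, else 0.
import numpy as np

def wss(order, labels, recall=0.95):
    labels = np.asarray(labels)
    found = np.cumsum(labels[order])                  # inclusions found after each screened record
    target = int(np.ceil(recall * labels.sum()))      # inclusions needed to reach the recall level
    n_screened = int(np.argmax(found >= target)) + 1  # records screened to reach that recall
    # Work Saved over Sampling: screening effort avoided, at the given recall,
    # relative to screening the records in random order.
    return recall - n_screened / len(order)

def rrf(order, labels, fraction=0.10):
    labels = np.asarray(labels)
    n = int(np.ceil(fraction * len(order)))
    # Proportion of all relevant records found after screening the first `fraction`.
    return labels[order[:n]].sum() / labels.sum()

# Toy example: 100 records, 5 inclusions, all surfaced first by the model.
labels = np.zeros(100, dtype=int)
labels[[0, 1, 2, 3, 4]] = 1
order = list(range(100))
print(wss(order, labels), rrf(order, labels))  # approximately 0.90 and 1.0
```

In the toy ordering all inclusions surface first, so the run saves about 90% of the work at 95% recall and finds every relevant record within the first 10% screened.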
Discussion
The findings demonstrate that active learning within a researcher-in-the-loop framework can greatly reduce the screening burden for systematic reviews while maintaining high recall, directly addressing the bottleneck of title/abstract screening in large, imbalanced corpora. By enabling discovery of most relevant studies after screening a small fraction of records, ASReview improves efficiency without sacrificing transparency, as all included records are ultimately reviewed by humans and the pipeline is fully open source. Flexibility to choose classifiers, features, query and balance strategies allows adaptation to diverse topics and class prevalences, a limitation in many existing tools. The local, privacy-preserving deployment and benchmark capabilities further support adoption in evidence synthesis workflows. The results suggest that machine learning–aided screening can broaden initial searches (favoring recall), enabling more comprehensive evidence bases without proportional increases in workload. However, estimating application-specific error rates during active learning and extending validated performance beyond screening to other review stages remain open challenges.
Conclusion
The paper introduces ASReview, an open-source, active learning framework that accelerates and makes transparent the screening of titles and abstracts in systematic reviews. It offers flexible combinations of classifiers, feature extraction methods, query and balance strategies, and provides oracle, simulation, and exploration modes. Benchmarks across multiple real-world datasets show large reductions in screening effort at high recall, and UX testing indicates good usability, with iterative improvements informed by user feedback. The authors encourage the community to evaluate and extend the framework, leverage the simulation mode to validate settings for specific applications, and contribute to open benchmark challenges and datasets. Future work includes exploring methods for accurate error estimation during active learning, validating the approach in non-review text domains and full-text screening, integrating with tools for other stages of the review pipeline, and incorporating advances in NLP feature extraction.
Limitations
- Error estimation during active learning: Because the model actively prioritizes likely relevant records, straightforward estimation of error rates without additional labeling is difficult; providing accurate, application-specific error estimates remains an open problem.
- Generalizability to other use cases: While the approach is not limited to systematic reviews, empirical benchmarks in other domains and tasks are lacking.
- Scope limited to screening: ASReview accelerates the screening step only; the broader review workflow (initial search, data extraction, risk of bias assessment, synthesis) remains largely manual and requires integration with other tools to achieve end-to-end automation with quality guarantees.