Interdisciplinary Studies
CORE: A Global Aggregation Service for Open Access Papers
P. Knoth, D. Herrmannova, et al.
The study addresses the challenge of systematically collecting and maintaining a global, up-to-date corpus of open access (OA) scientific literature suitable for large-scale text and data mining (TDM). Scientific publishing is expanding rapidly (about 10% annually and millions of new articles per year), with a substantial and growing share available as OA. However, the literature is fragmented across thousands of heterogeneous repositories, publishers, journals, and databases that lack interoperable protocols, complicating machine access and processing. The purpose of CORE is to aggregate OA research outputs from worldwide data providers and to offer machine-accessible services (API, datasets, synchronisation) to support TDM, discovery, and downstream applications (e.g., plagiarism detection, recommendations). The paper motivates the need for a continuously updated, harmonised, and downloadable full-text dataset; outlines obstacles (interoperability, copyright, scalability, OAI-PMH limitations); and presents CORE’s key innovations: extending metadata protocols to enable content harvesting, a novel microservices-based harvesting approach (CHARS) with a pro-active scheduler for improved scalability and reliability, and an efficient scheduling algorithm to optimise recency and resource utilisation.
The paper situates CORE within the landscape of OA aggregators and scholarly databases. Existing OA aggregation services include BASE, OpenAIRE, Unpaywall, Paperity, and SHARE. BASE primarily harvests metadata via OAI-PMH and exposes an API and dataset; OpenAIRE focuses on policy compliance and provides an API; Paperity harvests from OA journals but does not host full texts; SHARE harvests US repositories; Unpaywall compiles links to free-to-read versions from Crossref but generally does not host full texts. CORE differs by hosting and exposing a very large corpus of full-text OA content, providing rich metadata and multiple machine access methods (API, dataset, FastSync). Beyond OA aggregators, major publication databases include Crossref (DOI metadata, API but no bulk download), Scopus and Web of Science (subscription citation indices), Google Scholar (no programmatic access), Semantic Scholar (API and dataset), Dimensions (analytics), and 1findr (curated indexing without open APIs/datasets). This literature and service review highlights the gap CORE fills: a harmonised, hosted, downloadable full-text OA corpus enabling TDM at scale, contrasted with services that provide only links or metadata or restrict access.
The Methods section details CORE’s harvesting and enrichment infrastructure, designed for scalable, reliable aggregation of OA literature. The CORE Harvesting System (CHARS) employs a microservices architecture with decoupled components: a Scheduler, per-task message Queues, and specialised Workers. The harvesting pipeline comprises: (1) metadata download via OAI-PMH (robust to failures via resumption tokens); (2) metadata extraction and harmonisation into CORE’s internal schema, addressing syntactic and semantic heterogeneity; (3) full-text download using URLs extracted from metadata; (4) information extraction to convert PDFs to plain text and extract semi-structured data (e.g., references); (5) enrichment (online ML tasks such as language and document-type detection; offline periodic enrichment using external research graphs); and (6) indexing to support search, the API, and FastSync.

Scalable infrastructure requirements guiding CHARS include high automation, fail-fast validation, ease of troubleshooting, distributed scalability, no single point of failure, decoupling from user-facing systems, recoverability, and performance observability.

Pro-active harvesting Scheduler: the event-driven scheduler prioritises providers when compute resources are available, maximising ingestion throughput and data recency. A scoring formula selects providers based on Days Not Harvested (DNH), a Repository Days Offset (RDO) threshold, and provider PRIORITY, while limiting the entry queue size via a constant K to maintain responsiveness: SCORE = max(DNH − RDO, 0) × PRIORITY. The highest-scoring provider enters the metadata download queue when capacity permits.

Using OAI-PMH for content harvesting: although OAI-PMH targets metadata exchange, CORE uses it to bootstrap full-text acquisition.
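The scoring rule above can be sketched in a few lines. This is a minimal illustration, not CORE's implementation: the field names (`days_not_harvested`, `rdo`, `priority`), the `pick_next` helper, and the queue-capacity check are assumptions introduced here to make the formula concrete.

```python
# Sketch of the pro-active scheduler's scoring rule: SCORE = max(DNH - RDO, 0) * PRIORITY.
# Field and function names are illustrative assumptions, not CORE's actual code.
from dataclasses import dataclass

@dataclass
class Provider:
    name: str
    days_not_harvested: int  # DNH: days since last successful harvest
    rdo: int                 # RDO: Repository Days Offset threshold
    priority: float          # provider PRIORITY weight

def score(p: Provider) -> float:
    """A provider scores 0 until it has gone RDO days without harvesting."""
    return max(p.days_not_harvested - p.rdo, 0) * p.priority

def pick_next(providers, queue_len, k):
    """Enqueue the highest-scoring provider only while the entry queue
    holds fewer than K items, keeping the scheduler responsive."""
    if queue_len >= k:
        return None
    candidates = [p for p in providers if score(p) > 0]
    return max(candidates, key=score, default=None)

providers = [
    Provider("repoA", days_not_harvested=40, rdo=30, priority=2.0),  # score 20
    Provider("repoB", days_not_harvested=90, rdo=30, priority=0.5),  # score 30
    Provider("repoC", days_not_harvested=10, rdo=30, priority=5.0),  # score 0
]
print(pick_next(providers, queue_len=3, k=10).name)  # repoB
```

Note how the RDO threshold keeps recently harvested providers (repoC) out of contention even when their priority is high, so spare capacity flows to the stalest sources first.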
After excluding records attempted within a retry period (typically six months), CHARS applies heuristics to candidate URLs: accepted file extensions (e.g., prioritise PDFs and repository bitstreams), a same-domain policy (allowing dx.doi.org and hdl.handle.net exceptions; disabled for aggregators), provider-specific URL composition rules, and prioritisation of repository patterns and direct PDF links. A depth-first search with bounded harvesting levels follows links from HTML landing pages to locate the actual full text. Downloads are validated (they must be a valid PDF with a title matching the metadata); oversized or invalid files are rejected. The process stops upon finding the first validated match or upon exhausting the search depth.

Enrichments: online enrichments (single-pass) include article type detection (presentation, thesis, research paper, other) via supervised ML and language identification from full text, with harmonisation to metadata. Offline enrichments (periodic, monthly) map and augment records using external datasets (Crossref, ORCID, PubMed, Unpaywall, etc.). Matching uses DOIs where available; otherwise, fuzzy matching on title, authors, and year. Map-reduce workflows on a Cloudera EDH perform field harmonisation and add persistent identifiers, citation counts, additional OA links, and fields of study.

Data access services: CORE exposes data via a RESTful API, bulk Dataset dumps (ODC-By), and FastSync (ResourceSync-based, optimised for incremental enterprise synchronisation with on-demand resource dumps).
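The bounded depth-first full-text search can be sketched as follows. This is an illustrative reconstruction under stated assumptions: the heuristics (same-domain policy with dx.doi.org/hdl.handle.net exceptions, PDF-first link ordering, first-validated-match stopping rule) follow the text, but the function names and the `fetch`/`validate` callbacks are hypothetical stand-ins for CORE's actual components.

```python
# Illustrative sketch of bounded depth-first full-text discovery from a
# landing page. `fetch` and `validate` are hypothetical callbacks: fetch(url)
# returns (content_type, body, links); validate(body) stands in for the
# check that the download is a valid PDF whose title matches the metadata.
from urllib.parse import urlparse

ALLOWED_HOSTS = {"dx.doi.org", "hdl.handle.net"}  # same-domain policy exceptions

def candidate_ok(url: str, provider_domain: str) -> bool:
    """Accept URLs on the provider's own domain or on allowed resolvers."""
    host = urlparse(url).netloc
    return host.endswith(provider_domain) or host in ALLOWED_HOSTS

def find_full_text(url, provider_domain, fetch, validate, depth=2):
    """Depth-first search bounded by `depth` harvesting levels.
    Returns the first validated PDF URL, or None."""
    if depth < 0 or not candidate_ok(url, provider_domain):
        return None
    content_type, body, links = fetch(url)
    if content_type == "application/pdf":
        return url if validate(body) else None
    # Prioritise direct PDF-looking links before other landing pages.
    for link in sorted(links, key=lambda u: not u.lower().endswith(".pdf")):
        found = find_full_text(link, provider_domain, fetch, validate, depth - 1)
        if found:
            return found
    return None
```

The depth bound and the stop-on-first-match rule keep crawling cost per record small, while the domain policy prevents the search from wandering off the provider's site.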
- Scale and coverage (as of February 2023): 291,151,257 metadata records; 32,812,252 records with hosted full text; 94,521,867 records with abstracts; an estimated 139,000,000 records with links to OA full texts; content from 10,744 data providers across 150 countries; an estimated 118 languages represented. Approximately 11% of records have hosted full text; about 48% link to OA full texts hosted elsewhere. The uncompressed dataset (including PDFs) is about 100 TB; the plain-text-only dataset is ~393 GB compressed (~3.5 TB uncompressed).
- Growth: Transition to CHARS with pro-active scheduling significantly improved data recency and tripled collection size over three years. Monthly growth curves show sustained increases in both metadata-only and full-text records since 2012.
- Data sources: Top providers include Crossref (~110.0M documents), USC Digital Library (~11.0M), CiteSeerX (~6.54M), DOAJ (~4.50M), Gallica (~3.44M), University of Michigan Library Repository (~3.32M), PubMed Central (~2.57M), FigShare (~2.45M), NARCIS (~2.29M), Elsevier (~1.68M).
- Languages and geography: >80% of language-identified documents are in English; top languages include en, es, pt, id, de, hr, ru, ja, fr. Providers span 150 countries, with high counts in Indonesia, US, Japan, Germany, Brazil, UK, Spain, Turkey, Canada, Peru.
- Document types: Predominantly research articles among full-text records; theses also frequent but likely overrepresented in full-text subset; overall corpus expected to be even more article-heavy.
- Disciplines: Sample of 20,758,666 publications shows largest fields include biology, medicine, physics, chemistry, mathematics, computer science, psychology, and engineering, aligning with external studies.
- Adoption and impact: Over 40 million monthly active users; ranked among the top 10 Science and Education websites by SimilarWeb. 4,700 registered API users and 2,880 dataset users; 7,000+ experts have used CORE services. The CORE Recommender is integrated in 70+ repositories (e.g., University of Cambridge, arXiv). CORE Discovery is integrated into 434 repositories, with >5,000 browser extension downloads. The CORE Dashboard is used by 499 institutional repositories across 36 countries. CORE has supported third-party services for plagiarism detection (e.g., Turnitin via FastSync), scholarly search, and discovery systems (Naver, Lean Library, Ontochem).
CORE directly addresses long-standing interoperability and scalability barriers in accessing and processing scholarly OA content for TDM. By hosting validated full texts and providing harmonised metadata, CORE eliminates the need for users to crawl disparate sources or process PDFs themselves, enabling applications such as plagiarism detection, recommender systems, systematic review support, and research trend analysis. Compared to services that only provide URLs to OA copies (e.g., Unpaywall, BASE), CORE provides hosted full texts and machine-ready plain-text dumps, facilitating large-scale analytics. The microservices-based CHARS and pro-active scheduling optimise resource utilisation and data recency across thousands of heterogeneous providers, while heuristics over OAI-PMH metadata enable robust content retrieval despite protocol limitations. Real-world use cases (e.g., Turnitin’s similarity checking via FastSync; integrations with Cambridge and arXiv; discovery and dashboard tools for repositories) demonstrate CORE’s relevance across academia and industry. The platform’s enrichments and linkages to external scholarly graphs further increase utility, making CORE a central OA infrastructure component.
The paper presents CORE as a globally leading OA aggregation service delivering the largest hosted corpus of open access full texts with rich metadata and multiple machine access options. Key contributions include: (1) a scalable, fault-tolerant, microservices-based harvesting pipeline (CHARS); (2) a pro-active scheduling algorithm optimising recency and throughput; (3) pragmatic strategies to use OAI-PMH for full-text harvesting; and (4) online and offline enrichment processes linking CORE to broader scholarly ecosystems. These innovations have enabled rapid growth (291M+ metadata records; 32.8M+ hosted full texts) and broad adoption across research and industry. Future work includes expanding the collection toward comprehensive global OA coverage, adding new ML-powered enrichments (e.g., citation classification), improving deduplication and versioning by implementing CORE Works to link versions under a unique works entity, and deeper integration with external scholarly knowledge graphs to enhance discoverability and analytics.
Despite advances, limitations remain. OAI-PMH’s sequential design, heterogeneous implementations, and lack of standardised full-text linking complicate large-scale content harvesting, necessitating heuristics and workarounds; incremental harvesting and reliability issues persist across implementations. CHARS currently relies on empirically determined worker allocations; optimal dynamic resource allocation remains under investigation (e.g., Petri Net-based models). The retry period to avoid repeated failed downloads can delay acquisition when providers update links. Metadata heterogeneity and restricted machine access policies at some repositories limit achievable full-text coverage relative to estimates of OA availability. Adoption of more suitable protocols (e.g., ResourceSync) by providers would mitigate several constraints.