ENCORE: a practical implementation to improve reproducibility and transparency of computational research

A. H. C. van Kampen, U. Mahamune, et al.

ENCORE is an approach designed by Antoine H. C. van Kampen and colleagues to address the reproducibility crisis in computational research. It combines a standardized project structure, documentation templates, and GitHub integration to improve transparency and reproducibility across projects. The authors also report that researchers' lack of incentives remains a major obstacle.

Introduction

The paper addresses the ongoing challenges of computational reproducibility and transparency across scientific disciplines, with a focus on biomedical research. Despite many guidelines and community efforts, studies are frequently difficult to reproduce because of undocumented manual steps, software and libraries that are unavailable or differ between versions, incomplete documentation, separation of code and data across repositories, and unspecified parameters. The authors argue that tightly integrating data, code, results, and comprehensive documentation into a single, shareable compendium improves reproducibility and transparency. ENCORE (ENhancing COmputational REproducibility) is introduced as a practical, tool-agnostic approach that guides researchers in structuring and documenting computational projects within a standardized file system structure (sFSS), augmented by pre-defined documentation templates, Git/GitHub for version control, and an HTML-based navigator. Its primary purpose is to translate existing reproducibility guidelines into concrete, usable instructions and templates that promote routine adoption, harmonization within research groups, and improved transparency and reproducibility in practice.

Literature Review

The authors review the landscape of reproducibility and transparency initiatives and guidelines across the research lifecycle. They reference concerns over a reproducibility crisis and emphasize transparency as a prerequisite for trust and credibility. Prior work includes FAIR data principles and the use of code repositories (GitHub, GitLab, Bitbucket), with persistent archiving via Zenodo. They note the drawbacks of separating code and data into distinct repositories, which can obscure links to results and necessitate undocumented manual steps. Numerous published guidelines and best practices exist for computational reproducibility, software documentation, and project organization (e.g., Sandve’s Ten Simple Rules, Noble’s project organization guidance, Wilson et al. on scientific computing practices), yet adoption remains inconsistent. The authors compare ENCORE to related project structuring approaches, notably the standardized filesystem layout proposed by Spreckelsen et al., and R-centric compendia like rrtools, highlighting ENCORE’s more detailed, project-oriented compendium, integrated templates translating guidelines into directory-specific README instructions, and explicit use of Git/GitHub. They also discuss standards and ontologies (e.g., FAIR for software, Software Ontology, EDAM, MIASE, SED-ML), and journal policies promoting code/data availability, while noting that many computational studies still fail to reproduce despite such policies.

Methodology

ENCORE is implemented as a practical framework consisting of five components anchored by eight main requirements and four practical principles.

Eight main requirements:

1) A single, self-contained project compendium integrating data, code, results, and documentation in one location; transferable without breaking internal consistency.
2) Facilitate transparency with a standardized structure and deep documentation of concepts, methodology, data, code, and results.
3) Enable reproducibility so an independent peer can execute, understand, and recreate published outcomes.
4) Adhere to published reproducibility guidelines where possible.
5) Enable version control for code and code documentation via Git/GitHub.
6) Facilitate harmonization across researchers and groups with standardized, well-documented practice.
7) Provide a generic, tool- and domain-agnostic approach independent of data type, programming language, and ICT infrastructure.
8) Allow adaptation to different working styles while remaining accessible from any software tool.

Five components:

- Component 1: Standardized File System Structure (sFSS) template. A directory structure with pre-defined files that organizes conceptual information, data (raw/processed/meta), code, settings, notebooks, results, and project documentation (lab journal, literature, presentations, manuscript). Implicit links arise from hierarchy; explicit links are captured in documentation. The structure is flexible (e.g., placing processed data in Data or as results within a computation subdirectory) and supports any OS and toolchain.
- Component 2: Pre-defined files. Markdown-based README templates in each directory specify minimum required documentation and instructions translated from published guidelines. Root files include O_PROJECT.md (project details) and O_GETTINGSTARTED (saved as HTML for Navigator). The LabJournal template is used to log conceptual background, computational approaches, decisions, meeting/email summaries, ideas, and to-do lists, with pointers to relevant directories/files.
- Component 3: GitHub repository. Git/GitHub is used for version control of code, notebooks, settings, and code documentation. Only the Processing subdirectories (and software environment specifications) are under version control; data and results remain in the compendium to keep it self-contained and avoid GitHub storage/size constraints. .gitignore templates are provided.
- Component 4: sFSS Navigator. A Python tool that scans the sFSS and generates a Navigate.html web page with configurable panels: expandable directory tree, file content viewer, project information, and a Getting Started panel. It guides recipients to key data, code, and results. Executables for Windows, macOS (Intel/Apple silicon), and shell scripts for Linux are provided. Configuration via Navigation.conf.
- Component 5: User documentation. A Step-by-Step ENCORE Guide explains instantiating a project from the template, setting up GitHub, and using the Navigator. Automation scripts are provided to set up projects and domain-specific templates (e.g., AIRR-seq), with ongoing additions.

Project instantiation: Clone the sFSS template from the ENCORE GitHub, create and link a project-specific GitHub repo to the Processing directory, select preferred formats of templates, remove non-recipient files before sharing, and start documentation. Setup typically takes under 30 minutes; an automation script can initialize projects.

Use with remote/HPC: Projects can be synchronized via cloud storage and partially transferred to HPC for intensive computations (e.g., only the simulation branch), then results transferred back. Standard data transfer tools (curl, rclone; sFTP/SCP) are applicable.

Internal evaluations: ENCORE evolved from v1.0 (Oct 2020) to v4.0 through group-wide adoption, iterative evaluations, and refinements. A 2022 evaluation assigned nine ENCORE projects to group members not involved in those projects to assess adherence to sFSS, documentation level, and reproducibility by attempting to reproduce selected results. Findings informed subsequent revisions (3.1–4.0), improved documentation consistency, and the creation/enhancement of the Navigator.
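
To make the idea behind the sFSS Navigator concrete, the following is a minimal sketch of a script that scans a project directory and writes an HTML page with an expandable directory tree. It is not the ENCORE Navigator itself: the function names `render_tree` and `write_navigator`, the example directory names, and the use of HTML `<details>` elements are assumptions made for illustration, and the sketch omits the file viewer, project information, and Getting Started panels described above.

```python
import html
import tempfile
from pathlib import Path


def render_tree(directory: Path, skip: frozenset = frozenset()) -> str:
    """Return nested <details> elements mirroring the directory tree.

    Hypothetical helper; names in `skip` are left out of the listing.
    """
    parts = [f"<details open><summary>{html.escape(directory.name)}/</summary><ul>"]
    # Directories first, then files, each group sorted alphabetically.
    entries = sorted(directory.iterdir(), key=lambda p: (p.is_file(), p.name.lower()))
    for entry in entries:
        if entry.name in skip:
            continue
        if entry.is_dir():
            parts.append(f"<li>{render_tree(entry, skip)}</li>")
        else:
            parts.append(f"<li>{html.escape(entry.name)}</li>")
    parts.append("</ul></details>")
    return "".join(parts)


def write_navigator(root: Path, out_name: str = "Navigate.html") -> Path:
    """Write a single-page, expandable overview of `root` and return its path."""
    # Skip the output file so a re-run does not list the page inside itself.
    body = render_tree(root, skip=frozenset({out_name}))
    page = (
        "<!DOCTYPE html><html><head><meta charset='utf-8'>"
        f"<title>{html.escape(root.name)}</title></head>"
        f"<body>{body}</body></html>"
    )
    out = root / out_name
    out.write_text(page, encoding="utf-8")
    return out


if __name__ == "__main__":
    # Build a tiny throwaway project (directory names are illustrative only).
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "MyProject"
        (root / "Data").mkdir(parents=True)
        (root / "Processing").mkdir()
        (root / "Processing" / "analysis.py").write_text("print('hello')\n")
        print(write_navigator(root).read_text(encoding="utf-8"))
```

Generating a static page keeps the overview viewable in any browser without a server, which fits the requirement that a compendium stays self-contained when transferred.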

Key Findings

- Adoption and harmonization: ENCORE became mandatory in the group for all new projects, leading to standardized organization. As of reporting, over 20 ENCORE research projects and over 50 ENCORE projects for external data analysis services were established.
- Iterative improvement: Successive versions (1.0 → 2.0 → 3.0 → 3.1 → 3.5 → 4.0) simplified the sFSS, reduced pre-defined files, improved instructions, and introduced/enhanced the sFSS Navigator.
- Reproducibility evaluation (Sept 2022): Of nine ENCORE projects tested by non-involved group members, only about half had results that could be reproduced. Barriers included differing library versions, absolute (non-portable) paths, OS differences, missing software installation/running instructions, compilation issues, software errors, large dataset handling challenges, and inadequate documentation of goals and methodology.
- Specific fixes and lessons:
  • Simple fixes (e.g., replace absolute with relative paths) can remove some barriers.
  • Deeper challenges (software/library versioning, environment preservation) require additional tooling and practices.
  • Not being acquainted with a project made critical information hard to find, motivating the development of the sFSS Navigator.
  • Despite consensus on structure, remembering and applying all rules was difficult; merging instructions with templates in README files and completing a Step-by-Step Guide improved consistency.
  • Documentation level was often inadequate; responsibilities were clarified to ensure all relevant documentation is in the sFSS.
- Practicality: Initial setup of a project takes less than 30 minutes; ENCORE is agnostic to domain, data type, programming language, and infrastructure; compatible with cloud sync and HPC workflows.
- Key takeaways (lessons learned):
  • Lack of incentives is a major barrier to transparency and reproducibility.
  • Incremental, group-harmonized adoption with regular evaluations is essential.
  • Group-wide harmonization eases inspection, reuse, and best-practice development.
  • Future ENCORE should explicitly include software engineering best practices and environment preservation methods.
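
One of the simpler barriers named above, absolute paths, can be checked for automatically. The sketch below scans source text for quoted strings that look like absolute POSIX or Windows paths. It is an illustrative heuristic under stated assumptions, not part of ENCORE: the regular expression, the list of POSIX prefixes it recognizes, and the function name `find_absolute_paths` are all hypothetical, and it will miss paths that are built at run time.

```python
import re

# Heuristic: a quoted string starting with a common absolute POSIX prefix
# ("/home/", "/Users/", ...) or with a Windows drive letter ("C:\" or "C:/").
# The prefix list is an assumption and can be extended as needed.
ABSOLUTE_PATH = re.compile(
    r"""["'](/(?:home|Users|mnt|data|tmp)/[^"']*|[A-Za-z]:[\\/][^"']*)["']"""
)


def find_absolute_paths(source: str) -> list[tuple[int, str]]:
    """Return (line number, path) pairs for absolute-looking quoted paths."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in ABSOLUTE_PATH.finditer(line):
            hits.append((lineno, match.group(1)))
    return hits


if __name__ == "__main__":
    sample = (
        'raw = open("/home/alice/project/Data/raw.csv")\n'
        'fig = "Results/figure1.png"\n'
        'tool = "C:/tools/run.exe"\n'
    )
    for lineno, path in find_absolute_paths(sample):
        print(f"line {lineno}: {path}")
```

Flagged paths can then be rewritten relative to the project root, so the compendium keeps working after it is moved to another machine.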

Discussion

The study demonstrates that translating broad reproducibility guidelines into a concrete, standardized, and documented project compendium meaningfully improves transparency, harmonization, and practical reproducibility, though full reproducibility still requires additional measures. By integrating data, code, results, and rich documentation in a single sFSS and using Git/GitHub for version control of code and documentation, ENCORE addresses common causes of irreproducibility (e.g., missing context, scattered resources, unclear execution steps). Internal evaluations revealed that about half of tested projects were reproducible, highlighting progress yet underscoring persistent issues (e.g., environment/version management, documentation granularity, portability). The sFSS Navigator enhances discoverability and onboarding for recipients unfamiliar with a project. The findings underscore the importance of incentives and cultural change: reproducibility work is often perceived as overhead with limited immediate rewards. Aligning with community and publisher policies (e.g., FAIR, journal code/data availability requirements) and integrating persistent archiving (e.g., Zenodo) can improve recognition and accountability. The authors advocate complementing ENCORE with environment preservation tools (Conda, renv, containers such as Docker/Apptainer/Podman, and VMs), and encourage but do not mandate workflow systems (Galaxy, KNIME, Snakemake, Nextflow). Moving towards machine-readable metadata (JSON/YAML/RDF) could enhance searchability, linkage between code/data/results, and compliance checking, while interactive tools could lower the burden of maintaining such metadata. Automation and AI tools are proposed to assist with documentation, testing, meeting summarization, and ENCORE compliance checks, further reducing overhead and improving robustness.
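
The suggestion that machine-readable metadata could support linkage and compliance checking can be illustrated with a small sketch. It assumes a hypothetical `links.json` file in the project root that lists, for each result, the code and data files it depends on, and then reports entries that point to missing files. The file name, the schema (`id`, `code`, `data`, `result`), and the function `check_links` are invented for this example; the summary does not describe ENCORE as defining such a format.

```python
import json
import tempfile
from pathlib import Path


def check_links(root: Path, annotation_file: str = "links.json") -> list[str]:
    """Report annotated paths that do not exist relative to the project root.

    Assumed (hypothetical) schema: a JSON list of records, each with an "id"
    and optional "code", "data", and "result" lists of relative paths.
    """
    records = json.loads((root / annotation_file).read_text(encoding="utf-8"))
    missing = []
    for record in records:
        for role in ("code", "data", "result"):
            for rel in record.get(role, []):
                if not (root / rel).exists():
                    missing.append(f"{record['id']}: {role} '{rel}' not found")
    return missing


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "Processing").mkdir()
        (root / "Processing" / "fit_model.py").write_text("# analysis code\n")
        annotations = [
            {
                "id": "figure-1",
                "code": ["Processing/fit_model.py"],
                "data": ["Data/raw.csv"],
                "result": ["Results/figure1.png"],
            }
        ]
        (root / "links.json").write_text(json.dumps(annotations, indent=2))
        for problem in check_links(root):
            print(problem)
```

Keeping the paths relative to the project root means the annotations stay valid when the compendium is copied elsewhere, at the cost of having to update them whenever files are renamed.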

Conclusion

ENCORE operationalizes computational reproducibility guidelines into a practical, standardized, and flexible project compendium that integrates data, code, results, and documentation, complemented by Git/GitHub version control and a browsable Navigator. Group-wide adoption improved organization, transparency, and harmonization, and facilitated partial reproducibility in internal tests. The work highlights that achieving full reproducibility requires addressing computing environment preservation, improving documentation practices, and strengthening incentives. Future directions include: (i) integrating explicit software engineering best practices; (ii) adopting environment preservation technologies (e.g., Conda/renv, containers, VMs); (iii) enhancing the sFSS Navigator (configurability, media rendering, browsing, linking code/data/results, formatting); (iv) improving explicit link annotations among data, code, results, and concepts (e.g., YAML/JSON); (v) expanding machine readability and interactive tooling to reduce documentation burden; (vi) developing domain-specific templates and collaborating with diverse research groups for broader evaluation; and (vii) advancing automation (ENCORE-AUTOMATION) to streamline project setup and routine tasks. ENCORE compendia can be archived with DOIs (e.g., Zenodo) and shared alongside publications to support peer review and research sustainability.

Limitations

- Structural compromise: ENCORE’s sFSS reflects a consensus among varying prior practices and may not perfectly match individual preferences, though it allows flexibility.
- Navigator scope: The current sFSS Navigator has limited functionality and configuration; improvements are planned (panel layouts, figure/table rendering, browsing and linking results with code/data, proper formatting, reliable relative links).
- Linking granularity: No simple, built-in mechanism exists to explicitly encode links among results, code, data, and concepts beyond structure, paths, and manual documentation. JSON/YAML-based annotations could help but introduce maintenance overhead.
- Environment preservation: ENCORE only partially addresses preservation of the full computing environment (OS, toolchains, versions, dependencies). Complementary use of Conda/renv, containers, and VMs is encouraged but not mandated.
- Workflow systems: Not required to avoid disruption; however, workflows can enhance reproducibility. They bring their own versioning and longevity challenges (outdated components/services, workflow manager versions).
- Sensitive content management: No built-in mechanism to redact or strip sensitive data (patient/controlled access) or copyrighted PDFs before sharing a compendium.
- Storage and large data: The self-contained requirement means large datasets must fit within host storage; otherwise, robust retrieval scripts and documentation are needed to preserve reproducibility.
- Documentation burden and incentives: Sustained, detailed documentation is essential but time-consuming; lack of incentives may limit compliance and completeness.
- Portability pitfalls: Past issues included absolute paths, OS differences, missing installation/execution instructions, compilation errors, and library version mismatches, which can impede reproducibility until systematically addressed.
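
As a lightweight complement to the environment preservation tools mentioned above, a project can at least record which interpreter, platform, and package versions were used. The sketch below does this for a Python environment with the standard library only. It is an assumption-laden illustration, not an ENCORE feature: the function name `snapshot_environment` and the output layout are hypothetical, and a version listing does not replace Conda, renv, containers, or VMs because it does not capture system libraries or non-Python tools.

```python
import json
import platform
from importlib import metadata


def snapshot_environment() -> dict:
    """Collect interpreter, platform, and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with broken or missing metadata
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items(), key=lambda kv: kv[0].lower())),
    }


if __name__ == "__main__":
    # Print the snapshot as JSON; it could instead be saved next to the code.
    print(json.dumps(snapshot_environment(), indent=2))
```

Storing such a snapshot alongside the results gives a recipient a starting point for diagnosing library version mismatches of the kind listed under portability pitfalls.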
