Medicine and Health
How to establish and maintain a multimodal animal research dataset using DataLad
A. Kalantari, M. Szczepanik, et al.
The paper addresses the challenge that most small-animal neuroimaging studies do not share complete datasets and accompanying processing tools, undermining FAIR principles and reproducibility. Only a small fraction of datasets on popular platforms include mouse/rat MRI, and metadata and cross-modal data (e.g., microscopy) are often missing. The authors aim to provide an easily applicable approach to set up a multimodal dataset with access to raw and processed data, methods, results, and provenance. Grounded in FAIR and Open Science, and in response to funder and publisher requirements for research data management (RDM), the study proposes a practical, standardized workflow using DataLad and GIN to improve transparency, collaboration, and reusability in animal research.
The authors contextualize their work with best practices in human MRI data analysis and sharing, highlighting low levels of data sharing and standardization in small-animal studies. They note that on OpenNeuro only about 3% of datasets include mice or rats, and on Zenodo about 30% of MRI datasets are from mice or rats. Most shared neuroimaging datasets do not include all relevant modalities or processing details, contravening FAIR principles and hindering reproducibility and adherence to the 3Rs in animal research. Prior efforts emphasize BIDS for neuroimaging, community metadata standards, and decentralized RDM. However, step-by-step, multimodal, and reproducible pipelines for animal studies remain scarce. This work builds on and integrates these principles, providing a practical protocol that complements existing standards and tools.
The authors detail a step-by-step data management protocol leveraging DataLad for version control and GIN for online hosting and DOI-enabled sharing within a standardized YODA directory structure. Workflow and organization:
- Project planning and metadata curation are managed in an in-house cloud-based relational database capturing the entire longitudinal timeline across modalities (MRI, histology, electrophysiology, behavior). A standardized study identifier (Study ID) ensures traceability and blinding.
- Data are organized per the YODA principles, compatible with BIDS. Raw data remain archived in an authority-compliant structure tied to animal protocols; project-specific copies are placed into a YODA-structured Project folder with subfolders: input (by modality), code, results/output, and docs.
- Backups: an incremental backup routine copies data from local external storage to centrally managed network storage. The local hardware was a Mac Pro with RAID-5 storage, with automated weekly backups to the network storage.
DataLad-based protocol (four mandatory stages): Stage 1: Initialize a DataLad dataset
- Convert an existing Project folder (e.g., Project1) into a DataLad dataset: datalad create --force (or use datalad create for a new, empty dataset). For code-only datasets, use datalad create --force -c text2git so that text files are kept directly in git (enabling line-wise diffs) rather than in the annex.
Stage 2: Version control (local)
- Inspect dataset status: datalad status
- Record changes: datalad save -m "message"
- Programmatic, provenance-tracked execution: datalad run -m "message" --input ... python <script.py> (automatically saves and records inputs/outputs for re-execution)
- Access history and provenance: git log (e.g., git log -2)
- Repeat Stage 2 throughout the project to create a complete change log.
Stage 3: Initialize remote sibling on GIN
- Create a repository on GIN and obtain the SSH URL.
- Register GIN as a sibling: datalad siblings add --dataset . --name gin --url git@gin.g-node.org:/user-name/dataset-name.git (or use create-sibling-gin for automation). SSH keys must be configured.
Stage 4: Upload to GIN
- Push the complete dataset or selective paths: datalad push --to gin
- Repeat pushes as the dataset evolves; only changes are transferred. GIN provides git/git-annex compatibility, version history display, and DOI services for public datasets.
Third-party access and selective retrieval:
- Clone via SSH or HTTPS: datalad clone
- By default, only structure and small files are retrieved; large file contents are placeholders until requested.
- Retrieve full or partial contents on demand: datalad get
Dataset nesting and decentralization:
- Use nested datasets (subdatasets) for modularity (e.g., separate raw_data, proc_data, and multiple code pipelines as independent datasets). Top-level dataset (superdataset) references subdataset repositories (e.g., code on GitHub; data on GIN). This preserves updatability and allows selective cloning.
Metadata and publication:
- Make GIN repository public to obtain a DOI; include DataCite-compliant metadata in datacite.yml and a LICENSE file.
- Provide documentation in docs (e.g., study ID mapping, time points, modality/protocol details). Integrate metadata exported from the relational database (txt, csv, json). The workflow is BIDS-compatible but does not require BIDS to maximize flexibility.
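A minimal datacite.yml sketch following GIN's DataCite template; all field values below are placeholders, not taken from the paper:

```shell
mkdir -p Project5
cat > Project5/datacite.yml <<'EOF'
# Placeholder DataCite metadata for a GIN repository
authors:
  - firstname: "Jane"
    lastname: "Doe"
    affiliation: "Example University"
title: "Multimodal animal research dataset"
description: "Raw and processed data with provenance, managed with DataLad."
keywords:
  - Neuroscience
  - DataLad
license:
  name: "CC BY 4.0"
  url: "https://creativecommons.org/licenses/by/4.0/"
EOF
```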
Software and storage ecosystem:
- Core tools: DataLad, git, git-annex, GIN; code pipelines (e.g., AIDAmri, AIDAqc) installed as subdatasets; compatibility with other platforms (OpenNeuro, OSF, Zenodo, Dataverse, Dryad, figshare) and data services (XNAT, PACS, OMERO) via DataLad extensions.
Data types and tracking:
- Typical file formats include text (txt, csv, json, yml), documents (docx, xlsx), neuroimaging (nii, nii.gz), code (py, m), microscopy (lif, vsi, zvi), images (png, jpg), and MATLAB mat. Text and code files are tracked line by line in git; binary and large formats are tracked whole-file via git-annex checksums.
Identifier and naming:
- Study ID encodes protocol/project/cage/animal (e.g., SPs1c4m1), with optional genotype/sex. Filenames incorporate Study ID, test, and time point (e.g., SPs1c4m1CytD4). IDs should avoid special characters for BIDS compliance.
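The naming scheme can be checked mechanically; the regular expression below is a hypothetical encoding of the SPs1c4m1-style pattern, not one given in the paper:

```shell
# Split a filename stem such as SPs1c4m1CytD4 into Study ID and the remainder.
stem="SPs1c4m1CytD4"
study_id=$(printf '%s' "$stem" | grep -oE '^[A-Z]+s[0-9]+c[0-9]+m[0-9]+')
rest="${stem#"$study_id"}"
echo "$study_id"   # SPs1c4m1
echo "$rest"       # CytD4 (test + time point)
```

Keeping the ID free of special characters, as the authors recommend, makes such parsing trivial and preserves BIDS compatibility.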
Hardware and backup (Methods):
- Main workstation with RAID-5 local storage; manual initial copy to local storage; weekly automated incremental backups to network storage. Access via VNC/SMB or remote desktop; similar workflows on Linux/Windows are feasible.
- Introduces a practical, four-stage, step-by-step protocol for creating and maintaining multimodal animal research datasets with DataLad, enabling version control, provenance tracking, and efficient collaboration.
- Demonstrates YODA-structured projects with nested datasets to separate raw data, processed data, and code (e.g., code hosted on GitHub; data on GIN), allowing decentralization and selective cloning/getting of content.
- Leverages GIN as an open, DOI-granting research data platform, fully compatible with git/git-annex, to publish datasets and their histories; includes guidance for metadata (datacite.yml) and licensing.
- Shows how datalad run stores re-executable provenance (inputs/outputs), supporting reproducible pipelines; compatibility with BIDS without making it a hard requirement.
- Provides a reference test dataset implementing the YODA structure: https://doi.org/10.12751/g-node.3yl5qi.
- Highlights ecosystem flexibility: DataLad interoperates with alternative repositories (e.g., OpenNeuro, Zenodo, OSF, Dataverse, Dryad, figshare) and data services (e.g., XNAT, OMERO, PACS) via extensions.
- Contextual problem metrics underscoring need: ~3% of OpenNeuro datasets include mice/rats; ~30% of MRI datasets on Zenodo are mice/rats; data not shared in 93% of biomedical open-access publications (cited sources).
The proposed workflow addresses deficits in transparency and reproducibility in small-animal neuroimaging by unifying data, code, and results under DataLad’s version control with GIN as a public hosting platform. A standardized identifier scheme and YODA structure ensure traceability while preserving raw data archives. The approach is easy to adopt, relies on free/open-source tools, and supports collaborative, decentralized RDM. It complements community standards like BIDS while permitting flexibility across data modalities and formats. Integration with relational databases enhances machine-readable metadata, enabling downstream discovery and reuse. By enabling dataset nesting and provenance tracking with datalad run/rerun, the workflow facilitates re-executable analyses and selective access to large datasets, improving efficiency and enabling robust cross-lab comparisons. The framework is adaptable to various IT environments and can be integrated with other repositories and domain platforms through DataLad extensions, supporting FAIR data logistics at scales ranging from small labs to multisite consortia.
Open data sharing within a proper RDM protocol enables critical scrutiny of study design and analysis, preventing error repetition and strengthening reproducibility. The presented blueprint—YODA-organized, DataLad-managed, and GIN-published—supports FAIR, decentralized, and provenance-rich workflows for multimodal animal research. It promotes data reuse and collaboration, aligns with the 3Rs by reducing redundant experiments, and can extend to other preclinical domains. Ongoing efforts toward metadata standardization, provenance, workflow management, and interoperability will further bridge the translational gap between basic and clinical research.
Implementing DataLad requires upfront time, resources, and familiarity with version-control concepts. Broader reproducibility in the field is hindered by commonly missing methodological details and limited transparency in publications; while this workflow mitigates such issues by publishing raw/processed data and detailed post-processing instructions (e.g., on GitHub), complete reproducibility still depends on comprehensive documentation and community adherence to best practices.