Orchestrating and sharing large multimodal data for transparent and reproducible research

A. Mammoliti, P. Smirnov, et al.

ORCESTRA is a cloud-based platform for reproducible processing of multimodal biomedical data. Developed by researchers including Anthony Mammoliti and Petr Smirnov, it supports data sharing and management for clinical and genomic research in line with FAIR principles.

Introduction
The demand for large multimodal biomedical datasets, driven by personalized medicine and disease research, has increased the difficulty of reproducing findings due to complex data handling and computational processing requirements. Existing workflow languages and systems (e.g., CWL, WDL, Snakemake, Nextflow) promote reproducibility but have steep learning curves compared to user-friendly platforms like Galaxy, which face limitations in features and scalability. Inadequate sharing of workflows and metadata impairs data provenance and transparency. Multimodal data types—pharmacogenomics, toxicogenomics, radiogenomics, and clinical genomics—are central to biomarker discovery, often requiring integration across multiple datasets. To enhance reproducibility and FAIR compliance, there is a need for scalable, transparent solutions that process and share large, heterogeneous datasets with full provenance. ORCESTRA was developed to address these needs by orchestrating reproducible processing and sharing of multimodal biomedical data.
Literature Review
The authors survey existing orchestration and data-processing platforms—Pachyderm, DNAnexus, Databricks, and Lifebit—highlighting features such as language-agnostic pipelines, large dataset support, automatic triggering, prevention of full recomputation, Docker utilization, versioning, and parallelization. Pachyderm was selected for ORCESTRA due to comprehensive provenance, versioning, and automatic triggering, despite limitations in direct data mounting, cost-efficiency, and persistent resource allocation. The paper also reviews data-sharing practices across domains: pharmacogenomics (e.g., GDC/CCLE, GRAY, DepMap), toxicogenomics (LSDB/TG-GATEs), xenographic and radiogenomics (often supplementary materials), and clinical genomics (compendia like Oncomine, MultiAssayExperiment, curatedData, MetaGx). Many repositories lack consistent FAIR-compliant metadata, file-level versioning, pipeline transparency, and support for multiple processing pipelines, underscoring the need for platforms like ORCESTRA that unify transparent processing and sharing.
Methodology
The ORCESTRA architecture comprises three layers:
- Web-application layer: Built with a Node.js API, a React front end, and a MongoDB database, the web app lets users browse existing data objects, request new customized objects by selecting parameters (e.g., dataset, genome reference, transcriptome source, RNA-seq quantification tools/versions), view request status, and manage user accounts. The UI filters objects via queries to MongoDB and links to detailed metadata pages providing publications, raw data sources, pipeline parameters, and Zenodo DOIs. Registered users can submit external datasets via a Data Submission feature, guided by documentation.
- Data-processing layer: Deployed on a Microsoft Azure Kubernetes Service cluster running Pachyderm. Pipelines use Dockerized R/Bioconductor toolchains (e.g., PharmacoGx, ToxicoGx, Xeva, MetaGxPancreas, RadioGx). RNA-seq raw data are preprocessed with Kallisto and Salmon via Snakemake on HPC and pushed into Pachyderm repositories; microarray, CNV, mutation, and fusion data are processed in Pachyderm or aggregated from public sources. Pachyderm provides automatic triggering on commits, provenance tracking with unique commit IDs, prevention of unnecessary recomputation (e.g., metadata updates without reprocessing raw data), pipeline parallelization, and GitHub integration for versioned pipelines (https://github.com/BHKLAB-Pachyderm). Costs are controlled by turning the cluster on and off as needed; reported average annual costs are approximately CAD $2,800 (VMs about $1,300; storage about $1,500).
- Data-sharing layer: Generated data objects are automatically uploaded to Zenodo, where they obtain DOIs. Each object has an accompanying metadata web page detailing sources, parameters, versions, and links, and an automatically generated BioCompute Object (with its own DOI) describing inputs, steps, software, and parameters.
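The commit-triggered processing and avoidance of unnecessary recomputation described for the data-processing layer can be illustrated with a minimal Python sketch. This is not Pachyderm's API: `ToyPipeline`, `on_commit`, and the file names are hypothetical, and keying a cache on a SHA-256 content hash is an assumed stand-in for however Pachyderm detects unchanged data.

```python
import hashlib


class ToyPipeline:
    """Hypothetical illustration: rerun a step only for files whose content changed."""

    def __init__(self, step):
        self.step = step      # function applied to each file's bytes
        self.cache = {}       # content hash -> previously computed result
        self.executed = []    # file names actually processed on the latest commit

    def on_commit(self, files):
        """Simulate a commit: process new content, reuse cached results otherwise."""
        self.executed = []
        results = {}
        for name, data in sorted(files.items()):
            key = hashlib.sha256(data).hexdigest()
            if key not in self.cache:
                self.cache[key] = self.step(data)
                self.executed.append(name)
            results[name] = self.cache[key]
        return results


# Toy step: the "processing" is just measuring the payload size.
pipeline = ToyPipeline(step=len)

# First commit: both files are new, so both are processed.
pipeline.on_commit({"raw.fastq": b"ACGT" * 4, "metadata.csv": b"sample,tissue\n"})
first_run = list(pipeline.executed)

# Second commit: only the metadata changed, so the raw data is not reprocessed.
pipeline.on_commit({"raw.fastq": b"ACGT" * 4, "metadata.csv": b"sample,tissue,drug\n"})
second_run = list(pipeline.executed)
```

In this toy model, the second commit reprocesses only the metadata file, mirroring the summary's example of a metadata update that does not reprocess raw data.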
ORCESTRA supports restricted/private dataset sharing via Zenodo access controls, shareable links, and a Publish Dataset workflow that flips visibility to public.
Security: Azure Active Directory enforces RBAC on the Kubernetes API server and storage to prevent unauthorized access. ORCESTRA tracks each object with three identifiers: ORCESTRA ID, Pachyderm commit ID, and Zenodo DOI.
Data-object generation and usage: ORCESTRA currently integrates 17 curated data objects across pharmacogenomics (in vitro), toxicogenomics, xenographic pharmacogenomics (in vivo), radiogenomics, and a clinical genomics compendium. Supported molecular profiles include RNA-seq and microarray expression, copy number variation, mutation, and fusion. Users can customize RNA-seq processing (reference genome, transcriptome source Ensembl/Gencode, quantification tool and version). Each object is versioned, fully documented, and reproducible.
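As a rough illustration of how a data object might carry its three identifiers alongside the user-selected processing parameters, here is a minimal Python sketch. The class name, field names, and `filter_objects` helper are assumptions for illustration, not ORCESTRA's actual schema (the summary states only that the web app queries MongoDB), and every identifier value below is a placeholder.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataObjectRecord:
    # The three identifiers the text says ORCESTRA tracks for each object.
    orcestra_id: str
    pachyderm_commit_id: str
    zenodo_doi: str
    # Request parameters; these field names are assumptions.
    dataset: str
    quantification_tool: Optional[str] = None
    transcriptome_source: Optional[str] = None


def filter_objects(records, **criteria):
    """Return the records whose fields equal every given criterion."""
    return [
        r for r in records
        if all(getattr(r, field) == value for field, value in criteria.items())
    ]


# Placeholder values only; none of these identifiers are real.
records = [
    DataObjectRecord("orcestra-0001", "commit-aaaa", "10.0000/placeholder.1",
                     "GRAY", "Kallisto", "Gencode"),
    DataObjectRecord("orcestra-0002", "commit-bbbb", "10.0000/placeholder.2",
                     "GRAY", "Salmon", "Ensembl"),
    DataObjectRecord("orcestra-0003", "commit-cccc", "10.0000/placeholder.3",
                     "CCLE", "Salmon", "Gencode"),
]

# Mimic the UI filter: GRAY objects quantified with Salmon.
salmon_gray = filter_objects(records, dataset="GRAY", quantification_tool="Salmon")
```

Keeping the three identifiers on one immutable record means a downloaded object can be traced back to both the pipeline commit that produced it and the archived copy on Zenodo.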
Key Findings
- Platform capability: ORCESTRA enables automated, customizable, and reproducible processing of multimodal biomedical data with complete provenance and sharing via DOIs. It currently provides 17 integrated data objects: 11 pharmacogenomic (in vitro), 3 toxicogenomic, 1 xenographic pharmacogenomic (in vivo), 1 clinical genomics compendium (21 studies), and 1 radiogenomic.
- Orchestration tool assessment: Pachyderm was chosen for its language-agnostic pipelines, large dataset support, automatic triggering, prevention of full recomputation, Docker integration, commit-level versioning, parallelization, and GitHub integration. Its limitations include the lack of direct data mounting, the need for persistent resource allocation, and lower cost-efficiency compared to some alternatives.
- Reproducible analyses demonstrated:
  • Pharmacogenomics: Across the GRAY, UHNBreast, CCLE, and GDSC2 data objects, ERBB2 mRNA expression showed a strong association with Lapatinib response (AAC). Lapatinib response was more consistent between CTRPv2 and GDSC2 (same assay, CellTiter-Glo) than between CTRPv2 and GDSC1.
  • Toxicogenomics: In the Open TG-GATEs Human data object, the top differentially expressed genes in primary hepatocytes were identified for a high-DILI drug (acetaminophen) versus a no-DILI drug (chloramphenicol), where DILI stands for drug-induced liver injury.
  • Xenographic pharmacogenomics: In PDXE, trastuzumab response correlated strongly with ERBB2 expression in breast cancer PDX models.
  • Clinical genomics: MetaGxPancreas supported the prognostic value of PCOSP and clinical models across pancreatic cancer cohorts.
  • Radiogenomics: The Cleveland dataset showed correlations between gene expression and radiosensitivity (AUC of the radiation survival curve) across tissues.
- Reproducibility and transparency: All analyses are fully reproducible via a Code Ocean compute capsule hosting data objects, code, and figures (https://codeocean.com/capsule/9215268/tree). Each data object is accompanied by detailed metadata, release notes (tracking changes in samples, treatments, assays, and profiles), pipeline parameters, and persistent identifiers.
- Operational metrics and costs: ORCESTRA provides usage metrics (downloads, popularity, object statistics) and maintains average annual operational costs of approximately CAD $2,800 (VMs about $1,300; storage about $1,500).
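The release notes mentioned above track changes in samples, treatments, assays, and profiles between versions of a data object. One way to picture such a comparison is a set difference per category, sketched below in Python. The `release_notes` function and the sample names are hypothetical toy values, and this is not a description of how ORCESTRA actually generates its release notes.

```python
def release_notes(previous, current):
    """Summarize per-category additions and removals between two versions.

    Both arguments map a category name (e.g. "samples") to a set of identifiers.
    """
    notes = {}
    for category in sorted(previous.keys() | current.keys()):
        old = previous.get(category, set())
        new = current.get(category, set())
        notes[category] = {
            "added": sorted(new - old),
            "removed": sorted(old - new),
        }
    return notes


# Toy versions of a data object; all sample names are made up.
previous = {
    "samples": {"cell_line_A", "cell_line_B"},
    "treatments": {"Lapatinib"},
}
current = {
    "samples": {"cell_line_B", "cell_line_C"},
    "treatments": {"Lapatinib", "Trastuzumab"},
}

notes = release_notes(previous, current)
```

Sorting the categories and identifiers keeps the output deterministic, which matters if notes are stored next to versioned objects and compared across releases.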
Discussion
The paper emphasizes challenges posed by heterogeneous, high-dimensional biomedical data, including incomplete metadata, inconsistent data lineage, and limited transparency of computational workflows, all of which hinder reproducibility and FAIR compliance. Many data custodians release only single processed versions without documenting pipeline justification, preventing diverse analyses. Existing portals often lack sufficient file-level metadata, version tracking, or pipeline details, and datasets evolve over time, necessitating robust provenance tracking. ORCESTRA addresses these issues by unifying primary data sources, offering customizable pipeline choices, comprehensive provenance (commit IDs, DOIs, metadata pages, BioCompute Objects), and transparent sharing, thereby operationalizing FAIR principles. It bridges gaps across multiple data types by centralizing access, versioning, and documentation, while crediting data generators and clarifying usage policies. Although one platform cannot resolve all sociopolitical and infrastructural challenges in data sharing, ORCESTRA represents a step towards standardization and community practices that facilitate reproducible research.
Conclusion
ORCESTRA introduces a scalable and transparent paradigm for curating, processing, versioning, and sharing ready-to-analyze multimodal biomedical data with full provenance and reproducibility. By integrating automated pipelines, detailed metadata, persistent identifiers, and standardized BioCompute Objects, it enhances data reusability—a cornerstone of Open Science. Future directions include expanding supported datasets and data types, further automating user data submission and processing through standardized pipelines, enabling community-run local instances, and implementing metrics to track data-object usage in publications to quantify impact.
Limitations
- Platform/tooling constraints: Pachyderm (v1.9.3) lacks direct data mounting from cloud object storage, requiring data copying into Pachyderm’s filesystem; persistent CPU/RAM allocation for pipelines to enable automatic triggering increases resource needs; cloud deployments may be less cost-efficient than on-prem HPC, and Pachyderm lacks built-in low-priority instance cost optimizations available in some alternatives.
- Broader ecosystem limitations: A single platform cannot fully resolve sociopolitical and cultural barriers to data and code sharing; some external data sources provide incomplete metadata or inconsistent versions, which can impact downstream integration and reproducibility despite ORCESTRA’s provenance controls.