Orchestrating and sharing large multimodal data for transparent and reproducible research

A. Mammoliti, P. Smirnov, et al.

ORCESTRA is a cloud-based platform for reproducible processing of multimodal biomedical data. Developed by researchers including Anthony Mammoliti and Petr Smirnov, it supports data sharing and management for clinical and genomic research in line with the FAIR principles.

Introduction

Personalized medicine research is driving demand for large-scale multimodal biomedical data, and that demand makes reproducibility harder to achieve. Existing data-sharing practices fall short because of data size and infrastructure requirements. Workflow languages (CWL, WDL) and workflow management systems (Snakemake, Nextflow) exist, but they have steep learning curves, while user-friendly platforms like Galaxy lack scalability. Computational workflows and metadata are also often missing, which hinders data provenance and transparency. Scalable, reproducible, and transparent solutions for processing and analyzing large multimodal data, with full data provenance, are therefore critical.

Biomedical data encompass various types, including pharmacogenomics, toxicogenomics, radiogenomics, and clinical genomics. These data are crucial for biomarker discovery and translational research, and they require standardized and transparent processing and sharing.

The FAIR data principles (findable, accessible, interoperable, and reusable) emphasize the need for rich metadata, persistent identifiers, standardized formats, and accessible usage licenses. The MAQC Society promotes community-agreed standards for sharing multimodal biomedical data to improve reproducibility. However, many genomic data repositories do not fully adhere to the FAIR principles: they often provide only a single version of a dataset, without adequate documentation of the processing pipelines. To address these issues, the authors developed ORCESTRA.
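The four FAIR requirements named above (rich metadata, a persistent identifier, a standardized format, and a usage license) can be pictured as a completeness check on a dataset record. The following is a minimal sketch only: the field names and the `missing_fair_fields` helper are hypothetical and do not come from the paper or from any FAIR tooling.

```python
# Hypothetical field names, one per requirement mentioned in the text.
REQUIRED_FIELDS = (
    "persistent_identifier",  # e.g. a DOI
    "metadata",               # descriptive metadata about the dataset
    "format",                 # a standardized file or object format
    "license",                # an accessible usage license
)


def missing_fair_fields(record: dict) -> list:
    """Return the required fields that are absent or empty in a record."""
    return sorted(name for name in REQUIRED_FIELDS if not record.get(name))


if __name__ == "__main__":
    # A record that documents only its format fails three of the four checks.
    incomplete = {"format": "rds"}
    print(missing_fair_fields(incomplete))
```

A check like this says nothing about metadata quality; it only flags records that omit one of the listed elements entirely.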

Literature Review

The paper reviews existing data-sharing practices and platforms in several areas: pharmacogenomics (GDC, Broad Institute portal, GRAY dataset, LINCS, DepMap), toxicogenomics (LSDB Archive, Open TG-GATEs), xenographic pharmacogenomics and radiogenomics (supplementary materials from publications), and clinical genomics (Oncomine, MultiAssayExperiment, curatedData, MetaGx). It highlights the limitations of these approaches: data scattered across multiple repositories, inconsistent metadata, missing pipeline documentation, and no version control. The review underscores how difficult it is to ensure reproducibility and transparency when sharing and processing large, complex biomedical datasets.

Methodology

The authors developed ORCESTRA (orcestra.ca), a cloud-based platform using Pachyderm, an open-source orchestration tool, for reproducible data processing. Pachyderm's features, such as language-agnostic pipelines, large dataset support, automatic pipeline triggering, and version control, were leveraged. However, limitations of Pachyderm, such as the lack of direct data mounting and cost-efficiency features, were noted.

ORCESTRA orchestrates data-processing pipelines to create customized, versioned, and fully documented data objects. It integrates multiple data types (pharmacogenomics, toxicogenomics, etc.) and uses various Bioconductor packages (PharmacoGx, ToxicoGx, Xeva, MetaGxPancreas, RadioGx). Data objects are automatically uploaded to Zenodo with DOIs, accompanied by detailed metadata, including pipeline parameters and BioCompute Objects.

The platform provides a web application for users to search, request, and manage data objects. Users can filter objects by various parameters, request custom data objects with specified settings, and track their request status. Registered users can save favorites and submit their own data for processing. Security measures using Azure Active Directory and RBAC are implemented to protect computational resources and data.

The platform's architecture consists of three layers: a web application layer (Node.js API, React front-end, MongoDB database), a data-processing layer (Kubernetes cluster on Azure, Pachyderm, Docker images for R packages), and a data-sharing layer (Zenodo for DOI assignment and custom metadata web pages). The paper describes the functionalities of each layer and the data flow. Cost analysis of the platform's Azure infrastructure is also provided.
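The request workflow described above (search for an existing data object, otherwise request a new one with specified settings, then track its status) can be sketched in a few lines. This is an illustration under stated assumptions, not ORCESTRA's implementation: the `DataObjectRequest` and `Registry` classes, their fields, and the fingerprinting scheme are all hypothetical, and no real ORCESTRA, Pachyderm, or Zenodo API is called.

```python
from dataclasses import dataclass
import hashlib
import json


@dataclass(frozen=True)
class DataObjectRequest:
    """Hypothetical description of a requested data object."""
    dataset: str            # dataset name, e.g. "GDSC"
    dataset_version: str    # hypothetical version label
    tools: tuple = ()       # hypothetical (tool name, tool version) pairs

    def fingerprint(self) -> str:
        """Hash the settings so identical requests map to the same key."""
        payload = json.dumps(
            {
                "dataset": self.dataset,
                "dataset_version": self.dataset_version,
                "tools": sorted(self.tools),  # order of tools is irrelevant
            },
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class Registry:
    """Hypothetical in-memory store of requested data objects."""

    def __init__(self):
        self._objects = {}

    def request(self, req: DataObjectRequest):
        """Return (record, created); reuse the record if the settings match."""
        key = req.fingerprint()
        if key in self._objects:
            return self._objects[key], False
        record = {"fingerprint": key, "request": req,
                  "status": "pending", "doi": None}
        self._objects[key] = record
        return record, True

    def find(self, dataset=None):
        """Filter stored records by dataset name (None matches everything)."""
        return [rec for rec in self._objects.values()
                if dataset is None or rec["request"].dataset == dataset]


if __name__ == "__main__":
    registry = Registry()
    first = DataObjectRequest("GDSC", "v1", (("toolA", "1.0"), ("toolB", "2.0")))
    same = DataObjectRequest("GDSC", "v1", (("toolB", "2.0"), ("toolA", "1.0")))
    print(registry.request(first)[1], registry.request(same)[1])
```

Keying records on a hash of the full settings is one simple way to make "same parameters, same object" explicit; in the sketch, the `doi` field stays `None` until a separate (unmodeled) processing and upload step would fill it in.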

Key Findings

ORCESTRA addresses the challenges of reproducibility and transparency in handling large multimodal biomedical datasets. Key findings include: (1) a flexible framework for processing and sharing diverse data types; (2) automatic generation of versioned data objects with DOIs and detailed metadata; (3) a user-friendly web application for data object search, request, and management; (4) integration of various Bioconductor packages for different data types; (5) implementation of security measures; (6) detailed documentation of data sources, pipelines, and processing parameters; (7) use of BioCompute Objects for enhanced reproducibility; (8) demonstration of platform capabilities through case studies; (9) provision of usage metrics for data object popularity and dataset statistics; (10) an option for users to upload their own data for processing. The case studies use various data objects (GRAY, CCLE, CTRPv2, GDSC, Open TG-GATEs, PDXE, MetaGxPancreas, Cleveland) and examine the association between ERBB2 expression and drug response, drug compound toxicity, and the prognostic value of clinical models for survival prediction. Results are reproducible via a Code Ocean compute capsule.

Discussion

ORCESTRA offers a significant advancement in addressing the reproducibility crisis in biomedical research by providing a transparent and reproducible platform for sharing and processing large multimodal data. The platform's features, including version control, detailed metadata, and the use of DOIs and BioCompute Objects, directly address the limitations of existing data-sharing practices. The integration of diverse data types and analytical tools enhances the value of the data and facilitates broader research applications. The user-friendly interface makes the platform accessible to a wider range of researchers. ORCESTRA is a valuable resource for the biomedical research community, promoting open science principles and accelerating the pace of discovery.

Conclusion

ORCESTRA offers a novel approach to sharing multimodal biomedical data that ensures transparency and reproducibility. Its key contributions are integrated data object creation, version control, and detailed documentation. Future directions include expanding the available datasets, automating data uploads, enhancing community involvement through local instance deployment, and tracking data object usage in publications.

Limitations

While ORCESTRA addresses many challenges in data sharing, some limitations remain. The reliance on Pachyderm introduces certain cost and resource allocation constraints compared to other platforms. The platform's functionality depends on the availability of suitable Bioconductor packages for various data types, and the continuous updating of those packages is crucial. The effectiveness of the platform ultimately hinges on the community's adoption and active participation.
