
Medicine and Health
Open source and reproducible and inexpensive infrastructure for data challenges and education
P. E. DeWitt, M. A. Rebull, et al.
Unlock the potential of research data sharing with this study by Peter E. DeWitt, Margaret A. Rebull, and Tellen D. Bennett. Discover how a cost-effective, reproducible workflow built on GitHub, open-source languages, and Docker can democratize data challenges, demonstrated here with pediatric traumatic brain injury data.
Introduction
Data sharing is necessary to maximize actionable knowledge from research data. The FAIR (Findability, Accessibility, Interoperability, and Reusability) principles are NIH-supported guidelines for scientific data management and stewardship. Data challenges can encourage secondary analyses of datasets, facilitate development of decision support tools, and are common in computational training programs. In biomedicine, however, they often require considerable computational resources, typically expensive cloud infrastructure or industry partnerships (e.g., PhysioNet challenges using Google Cloud, and DREAM Challenges hosted by Sage Bionetworks), which can be prohibitive for investigators without substantial resources. Given the NIH emphasis on data sharing and reuse, there is a need for inexpensive and computationally lightweight methods for sharing data and hosting data challenges. To address this, the authors developed a workflow to share prospectively collected clinical study data and to host a data challenge, demonstrating reproducible model training, testing, and evaluation using public GitHub repositories, open-source languages, and Docker.
Literature Review
The paper situates its contribution within existing guidance and platforms: the FAIR principles provide a framework for findable, accessible, interoperable, and reusable data. Prior biomedical data challenges (e.g., PhysioNet/Computing in Cardiology Challenges and Sage Bionetworks DREAM Challenges) often depend on robust cloud infrastructure such as Google Cloud virtual machines, which can be costly and can limit accessibility for groups without industry partnerships or substantial funding. This context motivates the development of a lightweight, open-source alternative for data sharing and challenges.
Methodology
Clinical context: Pediatric traumatic brain injury (TBI) leads to substantial morbidity and mortality in the U.S., yet recommended treatments have limited evidence bases in part due to limited access to useful clinical datasets.
Data source (PEDALFAST): The NICHD-funded PEDALFAST multi-center prospective cohort study (two ACS-certified level 1 pediatric trauma centers; May 2013–June 2017) enrolled 395 subjects, of whom 388 had sufficient data quality to be de-identified and shared. Inclusion criteria: age <18 years, acute TBI, ICU admission, and either a clinician-documented GCS ≤12 or a neurosurgical procedure within the first 24 hours. Exclusion: discharge from the ICU within 24 hours without key interventions or death. Variables include demographics, injury mechanism and severity, interventions/treatments, neurologic exams (including GCS components), encounter information, and outcomes.
Standardization and sharing: The dataset was mapped to NIH-supported FITBIR CDEs and submitted to FITBIR (study profile 395). Most PEDALFAST data were mapped to standard forms (Demographics, Injury History, Imaging Read, FSS, Surgical/Therapeutic Procedures, GCS, Pupils). A study-specific form reported ICP monitor placement/durations. Two novel FITBIR elements were contributed: PupilReact (captures overall pupil reactivity categories) and EyesObscuredInd (captures when eyes are obscured). The de-identified data were also shared as CSVs in an R data package (pedalfast.data) on CRAN and archived on Zenodo; package release was delayed until the data challenge concluded. The package includes helper functions (e.g., rounding ages to FITBIR standards and mapping integer-coded ordinal variables like GCS to labeled factors).
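To illustrate the kind of convenience the package helpers provide, here is a minimal base-R sketch; the values and the rounding rule are illustrative assumptions, not the pedalfast.data documented interface.

# Illustrative only: values and rounding rule are assumptions, not the
# pedalfast.data API. The package's helpers perform analogous conversions
# on the shared CSVs.

# Integer-coded GCS total scores (valid range 3-15) -> labeled ordered factor
gcs_total  <- c(3L, 8L, 12L, 15L, NA)
gcs_factor <- factor(gcs_total, levels = 3:15, ordered = TRUE)

# Age recorded in days -> whole years, a FITBIR-style coarsening (assumed rule)
age_days  <- c(400, 2200, 6100)
age_years <- floor(age_days / 365.25)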
Data challenge design: The Harmonized Pediatric Traumatic Brain Injury (HPTBI) Data Challenge tasked participants with building reproducible models to predict (1) hospital mortality and (2) Functional Status Scale (FSS) at discharge. Recruitment occurred via social media and direct emails. Participants registered via Google Forms. Training data comprised 300 of 388 subjects with labels and a data dictionary; a holdout test set had 88 subjects, with a site-balanced split. A template GitHub repository and Zenodo archive provided skeletons for R or Python, a minimal Dockerfile, templates for data preparation and model definition/prediction functions, and infrastructure to test code inside Docker. Participants forked the repo, personalized a description.yaml, developed models (R or Python), and extended the Dockerfile with dependencies. Submissions were indicated via Google Forms; multiple trial submissions were allowed (only run success/failure returned) and one final submission was accepted.
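As a rough sketch of what a participant's R submission might contain; the function names, signatures, and column names below are assumptions, not the actual template's interface.

# Hypothetical outline of the data-preparation, model-definition, and
# prediction functions the template asks participants to supply.

prepare_data <- function(raw) {
  # e.g., select predictors, recode integer-coded variables, handle missing values
  raw[stats::complete.cases(raw), , drop = FALSE]
}

build_model <- function(training_data) {
  # e.g., a logistic regression for hospital mortality (column names are assumed)
  stats::glm(mortality ~ age + gcs_icu, data = training_data, family = binomial())
}

predict_mortality <- function(model, new_data) {
  # return character predictions in the form the evaluation scripts expect
  p <- stats::predict(model, newdata = new_data, type = "response")
  ifelse(p >= 0.5, "Mortality", "Alive")
}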
Evaluation infrastructure: Submissions were tagged in participant forks and evaluated on a 2018 MacBook Pro (16 GB RAM, Intel Core i9). An administrative bash script automated assessment: fetching/merging submissions via submodules, branch management, file change checks via sha256, building/running Docker, generating results, and pushing assessments back to participant repos. Evaluation scripts validated prediction vector lengths (mortality and FSS), constrained FSS predictions to integers 6–30, and mortality predictions to character values Alive/Mortality. Metrics: mortality evaluated with Matthews correlation coefficient (MCC) and F1 score; FSS with mean squared error (MSE). To assess reproducibility, each model was trained/evaluated 100 times; mean and standard deviation of metrics were computed. Ranking combined (1) accuracy (MSE for FSS; MCC and F1 for mortality), (2) reproducibility (SDs of metrics), and (3) model parsimony. For each participant, outcome, and dataset, average ranks over mean values and SDs were computed; sums formed outcome ranks, and overall rank summed across outcomes. Ties were broken by assigning the minimum tied rank. Cash prizes were awarded to top performers. Participants were anonymized (P01, P02, ...), with IDs derived by alphabetizing the hash of GitHub user IDs.
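The metrics themselves are straightforward to compute; the sketch below is an independent illustration of MCC, F1, MSE, and the repeated-run reproducibility summary, not the challenge's actual evaluation code.

# Independent illustration of the evaluation metrics (not the challenge code).

mcc <- function(truth, pred, positive = "Mortality") {
  tp <- sum(truth == positive & pred == positive)
  tn <- sum(truth != positive & pred != positive)
  fp <- sum(truth != positive & pred == positive)
  fn <- sum(truth == positive & pred != positive)
  den <- sqrt(tp + fp) * sqrt(tp + fn) * sqrt(tn + fp) * sqrt(tn + fn)
  if (den == 0) 0 else (tp * tn - fp * fn) / den
}

f1 <- function(truth, pred, positive = "Mortality") {
  tp <- sum(truth == positive & pred == positive)
  fp <- sum(truth != positive & pred == positive)
  fn <- sum(truth == positive & pred != positive)
  2 * tp / (2 * tp + fp + fn)
}

mse <- function(truth, pred) mean((truth - pred)^2)

# Reproducibility: retrain and rescore repeatedly, then summarize
# (build_model / predict_mortality as sketched earlier; train/test assumed).
# scores <- replicate(100, mcc(test$mortality, predict_mortality(build_model(train), test)))
# c(mean = mean(scores), sd = sd(scores))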
Key Findings
Participation and tooling: 27 participants registered; 11 (40.7%) submitted final entries. Most used R (8/11) versus Python (3/11). Common failure causes for early submissions included missing required tags/version numbers in description files (resolved with feedback) and incomplete Docker dependencies (missing R/Python packages), also rectified after guidance.
Modeling approaches: A total of 22 models (two per participant) were submitted. Random Forests were most frequent (10/22). Other methods included unpenalized linear regression (3/22), logistic regression (3/22), ridge regression (1/22), support vector machines (1/22), gradient boosting (2/22), and stacked models (2/22).
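Since random forests were the most common choice, a minimal runnable sketch on synthetic stand-in data (not PEDALFAST) shows the typical pattern, using the randomForest package as one of several possible implementations; variable names and ranges are assumptions.

# Synthetic stand-in data; variable names and ranges are assumptions.
library(randomForest)

set.seed(42)
toy <- data.frame(
  age     = runif(300, 0, 18),
  gcs_icu = sample(3:15, 300, replace = TRUE),
  fss     = sample(6:30, 300, replace = TRUE)
)

fit  <- randomForest(fss ~ age + gcs_icu, data = toy, ntree = 500)
pred <- predict(fit, newdata = toy)
mean((toy$fss - pred)^2)  # MSE, the challenge metric for FSS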
Data quality and predictors: Only 1/11 submissions clearly attempted to mitigate inconsistent or illogical data values (e.g., hospital length of stay shorter than time from admission to ICP monitor placement or end). Some participants used inappropriate predictors: one was disqualified for including FSS (a survivor-only discharge measure) in mortality prediction; 3/11 used hospital disposition to predict FSS, which risks target leakage.
Missing data handling: Approaches varied widely. Some submissions used mice (multivariate imputation by chained equations), some modeled missingness as informative, and some used one-hot encoding that ignored or dropped missing values; several submissions lacked any clear accounting for missingness. The most common approach (at least 5 submissions) was replacing missing values with zeros. Zero imputation created implausible values (e.g., the GCS minimum is 3, so a zero implies worse-than-possible severity), potentially harming clinical interpretability and utility despite acceptable challenge performance.
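A short sketch with made-up values makes the problem concrete: zero imputation pushes GCS below its clinical floor of 3, whereas even a simple median fill stays in range.

gcs <- c(7, NA, 14, NA)                                   # illustrative values
zero_imputed   <- ifelse(is.na(gcs), 0, gcs)              # implies an impossible GCS of 0
median_imputed <- ifelse(is.na(gcs), median(gcs, na.rm = TRUE), gcs)  # stays within 3-15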
Performance and overfitting: Some models showed zero error on training but degraded performance on test data, indicating overfitting (e.g., FSS models P24, P03; mortality models P26, P14, P03). Repeated training/testing (100 runs) quantified reproducibility; submissions with zero metric SDs ranked highly on reproducibility.
Ranking and winners: Rankings combined accuracy, reproducibility, and parsimony; adding a categorical assessment of clinical utility did not change overall ranking. Overall winner P07 used relatively simple, interpretable models: Gaussian-response linear regression for FSS and logistic regression for mortality. Final mortality model (after backward stepwise selection and VIF checks) included five predictors: cardiac arrest (any time), age, ICU GCS, mannitol ordered (yes/no), and receipt of enteral nutrition (yes/no). The FSS model underwent backward stepwise selection and cross-validation, resulting in 14 predictors (including ED GCS eye and sedation indicators, CT findings, ICU GCS components/observations, timing of key events and procedures, specific interventions, and hospital length of stay). Prize outcomes: P07 ($500, rank 1), P24 ($250, rank 2), P03/P11/P26 ($125 each, prize rank 3); P22 was ineligible for prize money for administrative reasons.
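The winner's general workflow (logistic regression plus backward stepwise selection and variance inflation factor checks) can be sketched as follows; this is a generic illustration on synthetic data, not the participant's code, and the variable names are assumptions. It uses MASS::stepAIC and car::vif, both standard CRAN tools.

# Generic illustration of backward stepwise logistic regression with VIF checks.
set.seed(1)
toy <- data.frame(                      # synthetic stand-in, not PEDALFAST data
  mortality         = factor(sample(c("Alive", "Mortality"), 300, replace = TRUE, prob = c(0.9, 0.1))),
  cardiac_arrest    = rbinom(300, 1, 0.1),
  age               = runif(300, 0, 18),
  gcs_icu           = sample(3:15, 300, replace = TRUE),
  mannitol          = rbinom(300, 1, 0.2),
  enteral_nutrition = rbinom(300, 1, 0.5)
)

full    <- glm(mortality ~ ., data = toy, family = binomial())
car::vif(full)                          # flag multicollinearity among candidate predictors
reduced <- MASS::stepAIC(full, direction = "backward", trace = FALSE)
summary(reduced)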
Feasibility: Multiple participants and submissions were managed by a single administrator using a standard laptop. The open-source, Docker-based workflow enabled reproducible assessments without costly cloud infrastructure.
Discussion
The study set out to create and demonstrate an open-source, inexpensive, and computationally lightweight infrastructure for data sharing and data challenges. The successful execution of the HPTBI Data Challenge using public GitHub repositories, R/Python, and Docker on a standard laptop shows that robust, reproducible model training and evaluation can be achieved without costly cloud platforms. This addresses barriers to participation and reuse emphasized by FAIR and NIH policies.
Findings also highlighted critical considerations for scientific and clinical utility. Many participants prioritized predictive performance over parsimony and interpretability, and several approaches to missing data (e.g., zero imputation) produced implausible values with limited clinical applicability. Instances of target leakage and failure to address data inconsistencies (e.g., timing logic) further reduced the utility and generalizability of some models. The ranking methodology that incorporated reproducibility and parsimony favored simpler, interpretable, and consistent models, aligning with clinical implementation needs. Operationally, while the workflow generally functioned well, certain procedural requirements (tags/versioning, Docker dependency management) were friction points for participants and may require simplification or more comprehensive base images in future iterations.
Conclusion
The authors developed and shared a reproducible, open-source, and low-cost workflow for hosting data challenges and enabling data reuse. They mapped a prospective pediatric TBI dataset (PEDALFAST) to FITBIR CDEs, created an accessible R data package, and ran the HPTBI Data Challenge using GitHub and Docker on modest hardware. The challenge demonstrated that multiple teams can be supported with minimal administrative overhead while ensuring reproducible training and evaluation. Lessons learned include the importance of guiding participants toward data exploration, proper handling of missing and inconsistent values, avoiding target leakage, and emphasizing parsimony and interpretability alongside accuracy. Future work could streamline submission mechanics (e.g., dropping the strict tagging/versioning requirement), provide more comprehensive base Docker images to reduce setup errors, use established participant platforms for recruitment, and further integrate clinical utility assessments. Overall, open-source, reproducible, and lightweight methods can increase the impact of shared research data and support education.
Limitations
The dataset was not guaranteed to be analysis-ready; inconsistencies (e.g., timing of ICP monitoring versus length of stay) were present by design to encourage data exploration, but many participants did not address them. Several submissions used problematic handling of missing data (e.g., zero imputation creating implausible values like GCS=0) and included inappropriate predictors (target leakage), limiting clinical applicability. Operationally, the workflow required participants to manage Git tags/versioning and Docker dependencies, which caused failures and required administrator support; this support may not scale if participant numbers increase substantially. The evaluation ran on a single laptop and may face scalability constraints with larger challenges. Recruitment was regional and modest, and the number of final submissions was limited. The study did not directly compare performance or user experience against established challenge platforms.