
Medicine and Health
Open source and reproducible and inexpensive infrastructure for data challenges and education
P. E. DeWitt, M. A. Rebull, et al.
Unlock the potential of research data sharing with this study by Peter E. DeWitt, Margaret A. Rebull, and Tellen D. Bennett. Discover how a cost-effective, reproducible workflow built on GitHub, open-source languages, and Docker can democratize data challenges, demonstrated here with pediatric traumatic brain injury data.
Introduction
Data sharing is necessary to maximize actionable knowledge from research data. The FAIR (Findability, Accessibility, Interoperability, and Reusability) principles are NIH-supported guidelines for scientific data management and stewardship. Data challenges can encourage secondary analyses of datasets, facilitate development of decision support tools, and are common in computational training programs. In biomedicine, however, they often require considerable computational resources, typically expensive cloud infrastructure or industry partnerships (e.g., PhysioNet challenges using Google Cloud, and DREAM Challenges hosted by Sage Bionetworks), which can be prohibitive for investigators without substantial resources. Given the NIH emphasis on data sharing and reuse, there is a need for inexpensive and computationally lightweight methods for sharing data and hosting data challenges. To address this, the authors developed a workflow to share prospectively collected clinical study data and to host a data challenge, demonstrating reproducible model training, testing, and evaluation using public GitHub repositories, open-source languages, and Docker.
Literature Review
The paper situates its contribution within existing guidance and platforms: the FAIR principles provide a framework for findable, accessible, interoperable, and reusable data. Prior biomedical data challenges (e.g., PhysioNet/Computing in Cardiology Challenges and Sage Bionetworks DREAM Challenges) often depend on robust cloud infrastructure such as Google Cloud virtual machines, which can be costly and can limit accessibility for groups without industry partnerships or substantial funding. This context motivates the development of a lightweight, open-source alternative for data sharing and challenges.
Methodology
Clinical context: Pediatric traumatic brain injury (TBI) leads to substantial morbidity and mortality in the U.S., yet recommended treatments have limited evidence bases in part due to limited access to useful clinical datasets.
Data source (PEDALFAST): The NICHD-funded PEDALFAST multi-center prospective cohort study (two ACS-certified level 1 pediatric trauma centers; May 2013–June 2017) enrolled 395 subjects, of whom 388 had sufficient data quality to be de-identified and shared. Inclusion criteria: age <18 years, acute TBI, ICU admission, and either a clinician-documented GCS ≤12 or a neurosurgical procedure within the first 24 hours. Exclusion: discharge from the ICU within 24 hours without key interventions or death. Variables include demographics, injury mechanism and severity, interventions/treatments, neurologic exams (including GCS components), encounter information, and outcomes.
Standardization and sharing: The dataset was mapped to NIH-supported FITBIR CDEs and submitted to FITBIR (study profile 395). Most PEDALFAST data were mapped to standard forms (Demographics, Injury History, Imaging Read, FSS, Surgical/Therapeutic Procedures, GCS, Pupils). A study-specific form reported ICP monitor placement/durations. Two novel FITBIR elements were contributed: PupilReact (captures overall pupil reactivity categories) and EyesObscuredInd (captures when eyes are obscured). The de-identified data were also shared as CSVs in an R data package (pedalfast.data) on CRAN and archived on Zenodo; package release was delayed until the data challenge concluded. The package includes helper functions (e.g., rounding ages to FITBIR standards and mapping integer-coded ordinal variables like GCS to labeled factors).
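To illustrate the kind of convenience the package helpers provide, here is a minimal base-R sketch; the values and the rounding rule are illustrative assumptions, not the pedalfast.data documented interface.

# Illustrative only: values and rounding rule are assumptions, not the
# pedalfast.data API. The package's helpers perform analogous conversions
# on the shared CSVs.

# Integer-coded GCS total scores (valid range 3-15) -> labeled ordered factor
gcs_total  <- c(3L, 8L, 12L, 15L, NA)
gcs_factor <- factor(gcs_total, levels = 3:15, ordered = TRUE)

# Age recorded in days -> whole years, a FITBIR-style coarsening (assumed rule)
age_days  <- c(400, 2200, 6100)
age_years <- floor(age_days / 365.25)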
Data challenge design: The Harmonized Pediatric Traumatic Brain Injury (HPTBI) Data Challenge tasked participants with building reproducible models to predict (1) hospital mortality and (2) Functional Status Scale (FSS) at discharge. Recruitment occurred via social media and direct emails. Participants registered via Google Forms. Training data comprised 300 of 388 subjects with labels and a data dictionary; a holdout test set had 88 subjects, with a site-balanced split. A template GitHub repository and Zenodo archive provided skeletons for R or Python, a minimal Dockerfile, templates for data preparation and model definition/prediction functions, and infrastructure to test code inside Docker. Participants forked the repo, personalized a description.yaml, developed models (R or Python), and extended the Dockerfile with dependencies. Submissions were indicated via Google Forms; multiple trial submissions were allowed (only run success/failure returned) and one final submission was accepted.
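As a rough sketch of what a participant's R submission might contain; the function names, signatures, and column names below are assumptions, not the actual template's interface.

# Hypothetical outline of the data-preparation, model-definition, and
# prediction functions the template asks participants to supply.

prepare_data <- function(raw) {
  # e.g., select predictors, recode integer-coded variables, handle missing values
  raw[stats::complete.cases(raw), , drop = FALSE]
}

build_model <- function(training_data) {
  # e.g., a logistic regression for hospital mortality (column names are assumed)
  stats::glm(mortality ~ age + gcs_icu, data = training_data, family = binomial())
}

predict_mortality <- function(model, new_data) {
  # return character predictions in the form the evaluation scripts expect
  p <- stats::predict(model, newdata = new_data, type = "response")
  ifelse(p >= 0.5, "Mortality", "Alive")
}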
Evaluation infrastructure: Submissions were tagged in participant forks and evaluated on a 2018 MacBook Pro (16 GB RAM, Intel Core i9). An administrative bash script automated assessment: fetching/merging submissions via submodules, branch management, file change checks via sha256, building/running Docker, generating results, and pushing assessments back to participant repos. Evaluation scripts validated prediction vector lengths (mortality and FSS), constrained FSS predictions to integers 6–30, and mortality predictions to character values Alive/Mortality. Metrics: mortality evaluated with Matthews correlation coefficient (MCC) and F1 score; FSS with mean squared error (MSE). To assess reproducibility, each model was trained/evaluated 100 times; mean and standard deviation of metrics were computed. Ranking combined (1) accuracy (MSE for FSS; MCC and F1 for mortality), (2) reproducibility (SDs of metrics), and (3) model parsimony. For each participant, outcome, and dataset, average ranks over mean values and SDs were computed; sums formed outcome ranks, and overall rank summed across outcomes. Ties were broken by assigning the minimum tied rank. Cash prizes were awarded to top performers. Participants were anonymized (P01, P02, ...), with IDs derived by alphabetizing the hash of GitHub user IDs.
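The metrics themselves are straightforward to compute; the sketch below is an independent illustration of MCC, F1, MSE, and the repeated-run reproducibility summary, not the challenge's actual evaluation code.

# Independent illustration of the evaluation metrics (not the challenge code).

mcc <- function(truth, pred, positive = "Mortality") {
  tp <- sum(truth == positive & pred == positive)
  tn <- sum(truth != positive & pred != positive)
  fp <- sum(truth != positive & pred == positive)
  fn <- sum(truth == positive & pred != positive)
  den <- sqrt(tp + fp) * sqrt(tp + fn) * sqrt(tn + fp) * sqrt(tn + fn)
  if (den == 0) 0 else (tp * tn - fp * fn) / den
}

f1 <- function(truth, pred, positive = "Mortality") {
  tp <- sum(truth == positive & pred == positive)
  fp <- sum(truth != positive & pred == positive)
  fn <- sum(truth == positive & pred != positive)
  2 * tp / (2 * tp + fp + fn)
}

mse <- function(truth, pred) mean((truth - pred)^2)

# Reproducibility: retrain and rescore repeatedly, then summarize
# (build_model / predict_mortality as sketched earlier; train/test assumed).
# scores <- replicate(100, mcc(test$mortality, predict_mortality(build_model(train), test)))
# c(mean = mean(scores), sd = sd(scores))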
Key Findings
Participation and tooling: 27 participants registered; 11 (40.7%) submitted final entries. Most used R (8/11) versus Python (3/11). Common failure causes for early submissions included missing required tags/version numbers in description files (resolved with feedback) and incomplete Docker dependencies (missing R/Python packages), also rectified after guidance.
Modeling approaches: A total of 22 models (two per participant) were submitted. Random Forests were most frequent (10/22). Other methods included unpenalized linear regression (3/22), logistic regression (3/22), ridge regression (1/22), support vector machines (1/22), gradient boosting (2/22), and stacked models (2/22).
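Since random forests were the most common choice, a minimal runnable sketch on synthetic stand-in data (not PEDALFAST) shows the typical pattern, using the randomForest package as one of several possible implementations; variable names and ranges are assumptions.

# Synthetic stand-in data; variable names and ranges are assumptions.
library(randomForest)

set.seed(42)
toy <- data.frame(
  age     = runif(300, 0, 18),
  gcs_icu = sample(3:15, 300, replace = TRUE),
  fss     = sample(6:30, 300, replace = TRUE)
)

fit  <- randomForest(fss ~ age + gcs_icu, data = toy, ntree = 500)
pred <- predict(fit, newdata = toy)
mean((toy$fss - pred)^2)  # MSE, the challenge metric for FSS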
Data quality and predictors: Only 1/11 submissions clearly attempted to mitigate inconsistent or illogical data values (e.g., hospital length of stay shorter than time from admission to ICP monitor placement or end). Some participants used inappropriate predictors: one was disqualified for including FSS (a survivor-only discharge measure) in mortality prediction; 3/11 used hospital disposition to predict FSS, which risks target leakage.
Missing data handling: Approaches varied widely. Some submissions used mice (multivariate imputation by chained equations), some modeled missingness as informative, and some used one-hot encoding that ignored or dropped missing values; several submissions lacked any clear accounting for missingness. The most common approach (at least 5 submissions) was replacing missing values with zeros. Zero imputation created implausible values (e.g., the GCS minimum is 3, so a zero implies worse-than-possible severity), potentially harming clinical interpretability and utility despite acceptable challenge performance.
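A short sketch with made-up values makes the problem concrete: zero imputation pushes GCS below its clinical floor of 3, whereas even a simple median fill stays in range.

gcs <- c(7, NA, 14, NA)                                   # illustrative values
zero_imputed   <- ifelse(is.na(gcs), 0, gcs)              # implies an impossible GCS of 0
median_imputed <- ifelse(is.na(gcs), median(gcs, na.rm = TRUE), gcs)  # stays within 3-15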
Performance and overfitting: Some models showed zero error on training but degraded performance on test data, indicating overfitting (e.g., FSS models P24, P03; mortality models P26, P14, P03). Repeated training/testing (100 runs) quantified reproducibility; submissions with zero metric SDs ranked highly on reproducibility.
Ranking and winners: Rankings combined accuracy, reproducibility, and parsimony; adding a categorical assessment of clinical utility did not change overall ranking. Overall winner P07 used relatively simple, interpretable models: Gaussian-response linear regression for FSS and logistic regression for mortality. Final mortality model (after backward stepwise selection and VIF checks) included five predictors: cardiac arrest (any time), age, ICU GCS, mannitol ordered (yes/no), and receipt of enteral nutrition (yes/no). The FSS model underwent backward stepwise selection and cross-validation, resulting in 14 predictors (including ED GCS eye and sedation indicators, CT findings, ICU GCS components/observations, timing of key events and procedures, specific interventions, and hospital length of stay). Prize outcomes: P07 ($500, rank 1), P24 ($250, rank 2), P03/P11/P26 ($125 each, prize rank 3); P22 was ineligible for prize money for administrative reasons.
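The winner's general workflow (logistic regression plus backward stepwise selection and variance inflation factor checks) can be sketched as follows; this is a generic illustration on synthetic data, not the participant's code, and the variable names are assumptions. It uses MASS::stepAIC and car::vif, both standard CRAN tools.

# Generic illustration of backward stepwise logistic regression with VIF checks.
set.seed(1)
toy <- data.frame(                      # synthetic stand-in, not PEDALFAST data
  mortality         = factor(sample(c("Alive", "Mortality"), 300, replace = TRUE, prob = c(0.9, 0.1))),
  cardiac_arrest    = rbinom(300, 1, 0.1),
  age               = runif(300, 0, 18),
  gcs_icu           = sample(3:15, 300, replace = TRUE),
  mannitol          = rbinom(300, 1, 0.2),
  enteral_nutrition = rbinom(300, 1, 0.5)
)

full    <- glm(mortality ~ ., data = toy, family = binomial())
car::vif(full)                          # flag multicollinearity among candidate predictors
reduced <- MASS::stepAIC(full, direction = "backward", trace = FALSE)
summary(reduced)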
Feasibility: Multiple participants and submissions were managed by a single administrator using a standard laptop. The open-source, Docker-based workflow enabled reproducible assessments without costly cloud infrastructure.
Discussion
The study set out to create and demonstrate an open-source, inexpensive, and computationally lightweight infrastructure for data sharing and data challenges. The successful execution of the HPTBI Data Challenge using public GitHub repositories, R/Python, and Docker on a standard laptop shows that robust, reproducible model training and evaluation can be achieved without costly cloud platforms. This addresses barriers to participation and reuse emphasized by FAIR and NIH policies.
Findings also highlighted critical considerations for scientific and clinical utility. Many participants prioritized predictive performance over parsimony and interpretability, and several approaches to missing data (e.g., zero imputation) produced implausible values with limited clinical applicability. Instances of target leakage and failure to address data inconsistencies (e.g., timing logic) further reduced the utility and generalizability of some models. The ranking methodology that incorporated reproducibility and parsimony favored simpler, interpretable, and consistent models, aligning with clinical implementation needs. Operationally, while the workflow generally functioned well, certain procedural requirements (tags/versioning, Docker dependency management) were friction points for participants and may require simplification or more comprehensive base images in future iterations.
Conclusion
The authors developed and shared a reproducible, open-source, and low-cost workflow for hosting data challenges and enabling data reuse. They mapped a prospective pediatric TBI dataset (PEDALFAST) to FITBIR CDEs, created an accessible R data package, and ran the HPTBI Data Challenge using GitHub and Docker on modest hardware. The challenge demonstrated that multiple teams can be supported with minimal administrative overhead while ensuring reproducible training and evaluation. Lessons learned include the importance of guiding participants toward data exploration, proper handling of missing and inconsistent values, avoiding target leakage, and emphasizing parsimony and interpretability alongside accuracy. Future work could streamline submission mechanics (e.g., dropping the strict tagging/versioning requirement), provide more comprehensive base Docker images to reduce setup errors, use established participant platforms for recruitment, and further integrate clinical utility assessments. Overall, open-source, reproducible, and lightweight methods can increase the impact of shared research data and support education.
Limitations
The dataset was not guaranteed to be analysis-ready; inconsistencies (e.g., timing of ICP monitoring versus length of stay) were present by design to encourage data exploration, but many participants did not address them. Several submissions used problematic handling of missing data (e.g., zero imputation creating implausible values like GCS=0) and included inappropriate predictors (target leakage), limiting clinical applicability. Operationally, the workflow required participants to manage Git tags/versioning and Docker dependencies, which caused failures and required administrator support; this support may not scale if participant numbers increase substantially. The evaluation ran on a single laptop and may face scalability constraints with larger challenges. Recruitment was regional and modest, and the number of final submissions was limited. The study did not directly compare performance or user experience against established challenge platforms.