Introduction
Understanding individual differences in brain-behavior relationships is a central goal of neuroscience. Machine learning using neuroimaging data, such as functional connectivity, has become increasingly popular for predicting various phenotypes, including cognitive performance, age, and clinical outcomes. Prediction offers advantages in replicability and generalizability by evaluating models on unseen data, typically through techniques like k-fold cross-validation or train/test splits. However, data leakage, where information about the test data is introduced into the model during training, undermines this advantage and compromises the validity of the findings. A recent meta-review highlighted the prevalence of leakage across various fields, including neuroimaging, often leading to inflated model performance and decreased reproducibility. This study aims to quantify the effects of different forms of data leakage on the performance and interpretation of connectome-based predictive models.
Literature Review
Several studies have examined the use of machine learning in neuroimaging to predict various phenotypes. These studies have shown promising results in predicting cognitive performance, age, and clinical outcomes. However, a growing concern is the potential for data leakage to inflate prediction performance and hinder reproducibility. A recent meta-review by Kapoor and Narayanan (2023) highlighted the prevalence of data leakage across seventeen fields, identifying hundreds of papers with potential leakage issues. Another review focused on predictive neuroimaging suggested that several studies might have leaked information through improper dimensionality reduction. Despite these concerns, the severity of performance inflation due to leakage in neuroimaging remains largely unknown, prompting this study to investigate the impact of various leakage forms.
Methodology
This study evaluated the effects of data leakage on functional and structural connectome-based predictive models using four large datasets: the Adolescent Brain Cognitive Development (ABCD) Study, the Healthy Brain Network (HBN) Dataset, the Human Connectome Project Development (HCPD) Dataset, and the Philadelphia Neurodevelopmental Cohort (PNC) Dataset. Three phenotypes (age, attention problems, and matrix reasoning) were predicted using ridge regression with 5-fold cross-validation, 5% feature selection, and a grid search over the L2 regularization parameter. Five forms of data leakage were examined: feature leakage (feature selection performed on the entire dataset before splitting), two forms of covariate-related leakage (site correction and covariate regression fit on the entire dataset rather than on the training data alone), and two forms of subject-level leakage (family structure ignored during data splitting, and repeated subjects appearing in both training and test sets). The effects of leakage were evaluated by comparing prediction performance (Pearson's correlation and cross-validated R-squared) and model coefficients between leaky and non-leaky pipelines. The effect of sample size was also investigated by subsampling the datasets to smaller sizes. Sensitivity analyses were conducted using different machine learning models (support vector regression, connectome-based predictive modeling) and structural connectomes.
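The feature-leakage pipeline described above can be contrasted with its non-leaky counterpart in a short synthetic sketch. This is not the authors' code: the data are simulated, and scikit-learn is assumed. The target is pure noise, mimicking a weakly predictable phenotype, so any positive cross-validated R-squared under the leaky pipeline is an artifact of selecting features on the full dataset before splitting.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n_subjects, n_edges = 200, 1000                 # connectome edges as features
X = rng.standard_normal((n_subjects, n_edges))
y = rng.standard_normal(n_subjects)             # null phenotype: no true signal

cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Non-leaky: the top-5% feature selection is refit inside each training fold.
pipe = make_pipeline(SelectPercentile(f_regression, percentile=5), Ridge())
r2_clean = cross_val_score(pipe, X, y, cv=cv, scoring="r2").mean()

# Leaky: features are selected once on ALL subjects, test folds included.
X_leaky = SelectPercentile(f_regression, percentile=5).fit_transform(X, y)
r2_leaky = cross_val_score(Ridge(), X_leaky, y, cv=cv, scoring="r2").mean()

print(f"non-leaky R^2: {r2_clean:.3f}, leaky R^2: {r2_leaky:.3f}")
```

Because the univariate selection step sees the test subjects' phenotypes, the chance correlations it exploits persist into the held-out folds, inflating the leaky score relative to the pipeline-based one.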
Key Findings
The study found that different analysis choices (e.g., inclusion or exclusion of covariate regression and site correction) produced varying prediction performance. Feature leakage and subject leakage (20% repeated subjects) consistently inflated prediction performance across all datasets and phenotypes, with the largest inflation observed for phenotypes with weaker baseline performance (e.g., attention problems). Covariate-related leakage (leaky site correction and leaky covariate regression) had minimal or even negative effects on performance. Family leakage had little to no impact in the large datasets, where few families had multiple members. Smaller datasets were more susceptible to the variability introduced by leakage. Comparing model coefficients revealed substantial differences between leaky and non-leaky pipelines, particularly for feature leakage and omission of covariate regression. The effects of leakage were broadly similar across models (ridge regression, SVR, CPM) and connectome types (functional and structural). Subsampling to smaller sizes increased both the variability and the impact of leakage, although taking the median performance across multiple iterations of k-fold cross-validation mitigated this effect to some degree.
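The repeated-subject inflation can be illustrated with a small synthetic sketch (scikit-learn, not the study's actual pipeline). Here 20% of subjects appear twice, e.g., from repeated sessions. Plain KFold lets a subject's two copies straddle the train/test split, so a flexible model partially memorizes them; GroupKFold keeps all rows from one subject in the same fold. The phenotype is pure noise, so any positive R-squared reflects leakage.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GroupKFold, KFold, cross_val_score

rng = np.random.default_rng(1)
n_unique, n_edges = 150, 1000
X_u = rng.standard_normal((n_unique, n_edges))
y_u = rng.standard_normal(n_unique)                # null phenotype

dup = rng.choice(n_unique, size=30, replace=False)  # 20% repeated subjects
X = np.vstack([X_u, X_u[dup]])
y = np.concatenate([y_u, y_u[dup]])
subject_id = np.concatenate([np.arange(n_unique), dup])

model = Ridge()  # p >> n, so ridge nearly interpolates the training targets

# Leaky: KFold ignores subject identity; duplicates cross the split.
r2_leaky = cross_val_score(
    model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="r2").mean()

# Non-leaky: GroupKFold keeps each subject's rows within a single fold.
r2_grouped = cross_val_score(
    model, X, y, cv=GroupKFold(n_splits=5), groups=subject_id,
    scoring="r2").mean()

print(f"leaky (KFold) R^2: {r2_leaky:.3f}, grouped R^2: {r2_grouped:.3f}")
```

The same grouped-splitting idea applies to family structure: assigning all members of a family one group id prevents siblings from landing on opposite sides of the split.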
Discussion
This study demonstrates the variable effects of different types of data leakage on connectome-based predictive models. While some forms of leakage (feature and subject leakage) led to substantial performance inflation, others had minimal or no impact. The effect sizes of leakage were influenced by factors such as sample size and the baseline predictive performance of the phenotype. These findings highlight the critical importance of careful data handling and rigorous validation techniques to ensure the validity and reproducibility of neuroimaging predictive modeling studies. The study reinforces the necessity of strictly separating training and test data and of using appropriate data-splitting methods (e.g., k-fold cross-validation that accounts for family structure). The findings emphasize the need for transparency and detailed methodological reporting to promote reproducibility in neuroimaging research.
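One concrete way to keep training and test data strictly separated is to fit any covariate-regression step on the training fold only and then apply its coefficients to the test fold. The sketch below uses simulated data and scikit-learn; it is an illustration of the principle under assumed data, not the study's code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(2)
n = 200
age = rng.uniform(8, 22, size=n).reshape(-1, 1)   # covariate
y = 0.5 * age.ravel() + rng.standard_normal(n)    # phenotype depends on age

for train_idx, test_idx in KFold(n_splits=5, shuffle=True,
                                 random_state=0).split(y):
    # Fit the covariate model on the training fold ONLY.
    cov_model = LinearRegression().fit(age[train_idx], y[train_idx])
    # Residualize both folds with the train-estimated coefficients.
    y_train_res = y[train_idx] - cov_model.predict(age[train_idx])
    y_test_res = y[test_idx] - cov_model.predict(age[test_idx])
    # Train residuals are orthogonal to the covariate by construction...
    assert abs(np.corrcoef(age[train_idx].ravel(), y_train_res)[0, 1]) < 1e-8
    # ...while test labels are never touched during fitting, so no
    # information crosses the split.
```

Fitting this regression once on the full sample before splitting would constitute the covariate-related leakage examined in the study, even though its measured effect on performance was small.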
Conclusion
Data leakage significantly impacts the reproducibility and validity of findings in connectome-based predictive modeling. This study emphasizes the need for researchers to carefully design their analyses to avoid the various forms of leakage identified. While some forms of leakage showed limited effects, the consistent inflation caused by feature and subject leakage highlights the importance of best practices in data handling. Future research should explore additional forms of leakage and develop more robust methods for preventing and detecting leakage in neuroimaging and other machine learning applications.
Limitations
This study investigated a limited set of data leakage types, models, and phenotypes. While the selected forms of leakage are common and impactful, other types (e.g., temporal leakage, hyperparameter selection leakage) were not considered. The findings may not generalize to all datasets, populations, or machine learning models, particularly highly complex models such as neural networks, which are more susceptible to memorizing training data. The analysis focused primarily on large, publicly available datasets, and the effects of leakage may differ in smaller, less well-characterized datasets.