Data leakage inflates prediction performance in connectome-based machine learning models


M. Rosenblatt, L. Tejavibulya, et al.

This research by Matthew Rosenblatt, Link Tejavibulya, Rongtao Jiang, Stephanie Noble, and Dustin Scheinost delves into the critical issue of data leakage in neuroimaging predictive modeling. By examining five types of leakage across four datasets, the study unveils how feature selection and repeated subject leakage can dramatically skew prediction outcomes, particularly in smaller datasets. Discover the nuances of leakage's impact and its significance for achieving valid results!
Introduction

Understanding individual differences in brain-behavior relationships is a key goal in neuroscience, and predictive machine learning models using neuroimaging (e.g., functional connectivity) have become popular for predicting phenotypes like cognition, age, and clinical outcomes. Predictive methods estimate generalizability by evaluating models on unseen test data via cross-validation or train/test splits. Data leakage occurs when information from the test set is introduced into training, violating this separation and potentially inflating performance. A recent meta-review across 17 fields identified eight leakage types and showed that leakage often inflates model performance and undermines reproducibility. In neuroimaging, reviews have noted instances of dimensionality reduction performed prior to data splits, indicating potential leakage. Despite its prevalence, the magnitude of performance inflation due to leakage in neuroimaging predictive models remains unclear. This study systematically evaluates how several common leakage forms affect prediction performance and interpretability in connectome-based models across multiple datasets and phenotypes.

Literature Review

Prior work across scientific machine learning identified widespread leakage, including lack of separate test sets, preprocessing and feature selection on combined data, duplicate data points, illegitimate features, temporal leakage, non-independence between training and test sets, and sampling bias. Leakage was associated with inflated performance and reduced reproducibility. In predictive neuroimaging specifically, some studies performed unsupervised dimensionality reduction on the whole dataset prior to splitting, risking leakage and contributing to reproducibility concerns. The field has emphasized best practices for prediction and warned about the need for strict separation between training and test data. However, quantitative assessments of how much leakage inflates or deflates performance in neuroimaging prediction tasks have been limited, motivating this study.

Methodology

Datasets and phenotypes: Four large resting-state fMRI datasets were used: ABCD (N=7822–7969), HBN (N=1024–1201), HCPD (N=424–605), and PNC (N=1119–1126). Three phenotypes available across datasets were predicted: age, attention problems (CBCL or analogous measure), and matrix reasoning (WISC-V or PMAT). For structural connectivity analyses, diffusion MRI from HCPD (N=635) was used to build structural connectomes.

Preprocessing: All datasets underwent motion correction and additional preprocessing in BioImage Suite, including regression of nuisance covariates (linear/quadratic drifts, mean CSF, mean white matter, global signal), 24-parameter motion regression, temporal smoothing (~0.12 Hz cutoff), gray matter masking, parcellation with the Shen 268-node atlas, and computation of Fisher z-transformed connectivity matrices. Data with poor quality, high motion (>0.2 mm mean FD), missing coverage, or missing phenotypes were excluded.
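The final step above can be sketched in numpy (a minimal, hypothetical illustration; the toy node count stands in for the 268 Shen nodes, and all nuisance regression happens upstream in BioImage Suite):

```python
import numpy as np

def connectivity_matrix(timeseries):
    """Fisher z-transformed functional connectivity.

    timeseries: (n_timepoints, n_nodes) array of parcel-mean signals,
    e.g. 268 nodes for the Shen atlas.
    """
    r = np.corrcoef(timeseries.T)   # node-by-node Pearson correlations
    np.fill_diagonal(r, 0.0)        # zero the diagonal before arctanh (atanh(1) = inf)
    return np.arctanh(r)            # Fisher z-transform

# toy example: 200 timepoints, 10 nodes
ts = np.random.default_rng(0).standard_normal((200, 10))
fc = connectivity_matrix(ts)
```

The upper triangle of such a matrix supplies the "edge" features used by the predictive models described next.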

Gold standard predictive pipeline: Ridge regression with 5-fold cross-validation and nested hyperparameter tuning (five nested folds for HBN/HCPD/PNC; two nested folds for ABCD) was used. Within each training fold, the top 5% of edges most correlated with the target were selected; the selected features were then applied to the test fold. A grid search over L2 regularization α∈{10^−3,10^−2,10^−1,1,10,10^2,10^3} was performed, selecting the model with highest Pearson r in nested folds. Family structure (ABCD, HCPD) was respected by assigning all members of a family to the same fold. Covariate regression (FD, sex, and age except when predicting age) and multi-site correction (ComBat) were performed within the cross-validation loop: parameters were estimated on training data and applied to test data. Performance metrics were Pearson’s correlation r and cross-validated R² (q²), computed by concatenating predictions across folds, repeated over 100 random seeds.
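The pipeline above can be sketched in numpy (a simplified, hypothetical illustration: `pick_alpha` uses a single inner split in place of the paper's nested folds, and the family-grouping, covariate-regression, and ComBat steps are omitted):

```python
import numpy as np

def fit_ridge(X, y, alpha):
    # closed-form ridge regression; the intercept is handled by centering
    xm, ym = X.mean(0), y.mean()
    Xc, yc = X - xm, y - ym
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, xm, ym

def top_edges(X, y, frac=0.05):
    # rank edges by |Pearson r| with the target, using TRAINING data only
    Xc, yc = X - X.mean(0), y - y.mean()
    r = (Xc * yc[:, None]).sum(0) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    k = max(1, int(frac * X.shape[1]))
    return np.argsort(np.abs(r))[-k:]

def pick_alpha(Xtr, ytr, alphas):
    # stand-in for nested tuning: one inner split, highest Pearson r wins
    half = len(ytr) // 2
    scores = []
    for a in alphas:
        w, xm, ym = fit_ridge(Xtr[:half], ytr[:half], a)
        pred = (Xtr[half:] - xm) @ w + ym
        scores.append(np.corrcoef(pred, ytr[half:])[0, 1])
    return alphas[int(np.argmax(scores))]

def cv_predict(X, y, n_folds=5, alphas=(1e-3, 1e-2, 1e-1, 1, 10, 1e2, 1e3), seed=0):
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, n_folds)
    yhat = np.empty(len(y))
    for i, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        feats = top_edges(X[train], y[train])      # selection inside the fold
        a = pick_alpha(X[train][:, feats], y[train], alphas)
        w, xm, ym = fit_ridge(X[train][:, feats], y[train], a)
        yhat[test] = (X[test][:, feats] - xm) @ w + ym
    r = np.corrcoef(yhat, y)[0, 1]                 # Pearson r on concatenated predictions
    q2 = 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)
    return r, q2

# toy demonstration: 200 "subjects", 100 edges, 5 of them truly predictive
rng = np.random.default_rng(3)
X = rng.standard_normal((200, 100))
y = X[:, :5].sum(1) + 0.5 * rng.standard_normal(200)
r, q2 = cv_predict(X, y)
```

Note that q² is computed on the concatenated out-of-fold predictions, so unlike r it can be negative when the model predicts worse than the training mean.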

Leakage conditions: Five leakage forms were evaluated relative to the gold standard: (1) Feature leakage: selecting features (top 5%) using the entire dataset before cross-validation; (2) Leaky site correction: applying ComBat to the full dataset prior to splitting; (3) Leaky covariate regression: regressing covariates using the entire dataset outside cross-validation; (4) Family leakage: ignoring family structure when splitting, allowing relatives across train/test; (5) Subject leakage: duplicating a percentage of subjects (5%, 10%, 20%) so repeats can cross train/test. Additional non-leaky variants examined analysis choices (e.g., excluding site correction or covariate regression).
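Condition (1) can be illustrated with a toy simulation on pure noise (a hypothetical numpy sketch using a deliberately trivial predictor, not the paper's ridge pipeline): selecting edges on the full dataset lets test-set information choose the features, so apparent held-out performance emerges even when no true signal exists.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 200, 2000
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)        # pure noise: no true brain-behavior signal

def corr_with_y(X, y):
    Xc, yc = X - X.mean(0), y - y.mean()
    return (Xc * yc[:, None]).sum(0) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

train, test = np.arange(100), np.arange(100, 200)

# LEAKY: top 5% of edges chosen using ALL subjects, test half included
leaky = np.argsort(np.abs(corr_with_y(X, y)))[-100:]
# GOLD STANDARD: top 5% chosen from the training half only
clean = np.argsort(np.abs(corr_with_y(X[train], y[train])))[-100:]

def held_out_r(feats):
    # trivial predictor: sign-aligned mean of the selected edges
    signs = np.sign(corr_with_y(X[train][:, feats], y[train]))
    pred = (X[test][:, feats] * signs).mean(1)
    return np.corrcoef(pred, y[test])[0, 1]

r_leaky, r_clean = held_out_r(leaky), held_out_r(clean)
```

With no real signal, the leaky r is typically large while the non-leaky r hovers near zero, mirroring the pattern the study reports for phenotypes with weak brain-behavior associations.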

Sample size analyses: To assess interactions with sample size, subsamples of N=100, 200, 300, 400 were drawn (10 resamples each), with 10 iterations of 5-fold CV per resample, across leakage types. ABCD subsampling used four largest sites to ensure ComBat stability; family-based sampling preserved relatedness proportions.

Additional models and modalities: The leakage analyses were replicated with support vector regression (RBF kernel; grid search over C∈{10^−3,…,10^3}) and connectome-based predictive modeling (CPM). Structural connectomes (HCPD diffusion MRI) were built via susceptibility correction, GQI reconstruction in DSI-Studio, and automated tractography using the Shen atlas.

Coefficient and feature distribution comparisons: For each pipeline, coefficients were averaged across folds and correlated with the gold standard coefficients. Feature-selection distributions across the 10 canonical networks (55 network-pair subnetworks) were compared via size-adjusted counts and rank correlations.
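The coefficient comparison might be sketched as follows (a hypothetical numpy illustration; `avg_coef` and the perturbed "leaky" coefficients are invented for demonstration, not the paper's code):

```python
import numpy as np

def avg_coef(fold_feats, fold_weights, n_edges):
    # map each fold's weights back to full edge space (zeros for edges not
    # selected in that fold), then average across folds
    full = np.zeros((len(fold_feats), n_edges))
    for row, (feats, w) in zip(full, zip(fold_feats, fold_weights)):
        row[feats] = w
    return full.mean(0)

rng = np.random.default_rng(1)
n_edges = 20
# hypothetical gold-standard pipeline: two folds, five edges selected each
gold = avg_coef([np.arange(0, 5), np.arange(2, 7)],
                [rng.standard_normal(5), rng.standard_normal(5)], n_edges)
# stand-in for a leaky pipeline: same coefficients plus a perturbation
leaky = gold + 0.5 * rng.standard_normal(n_edges)
r_coef = np.corrcoef(gold, leaky)[0, 1]   # similarity to the gold standard
```

Lower r_coef values indicate that a leaky pipeline is attributing predictive weight to different edges than the gold standard, which is how the paper quantifies interpretability shifts.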

Key Findings

Baseline non-leaky performance (HCPD): Gold standard ridge models showed no predictive power for attention problems (median r=0.01, q²=−0.13), strong age prediction (r=0.80, q²=0.63), and moderate matrix reasoning prediction (r=0.30, q²=0.08). Excluding site correction had negligible impact. Omitting covariate regression increased r but variably affected q² across phenotypes.

Feature leakage: Selecting features on the full dataset inflated performance for all phenotypes, most strongly for those with weaker baseline performance. In HCPD, inflation was age Δr=0.03, Δq²=0.05; matrix reasoning Δr=0.17, Δq²=0.13; attention problems Δr=0.47, Δq²=0.35, raising attention from chance (r=0.01, q²=−0.13) to moderate (r=0.48, q²=0.22). Across datasets, feature leakage effects ranged from Δr=0.03 to 0.52 and Δq²=0.01 to 0.47, with the largest dataset (ABCD) least affected.

Covariate-related leakage: Leaky site correction had minimal impact (Δr ≈ −0.01 to 0.00, Δq² ≈ −0.01 to 0.01). Leaky covariate regression consistently deflated performance: in HCPD, attention Δr=−0.06, Δq²=−0.17; age Δr=−0.02, Δq²=−0.03; matrix reasoning Δr=−0.09, Δq²=−0.08. Across datasets: Δr = −0.09 to 0.00, Δq² = −0.17 to 0.00.

Subject-level leakage: Family leakage had little effect overall (age/matrix Δr=0.00, Δq²=0.00; attention small increase Δr=0.02, Δq²=0.00). Duplicating subjects inflated performance, increasing with duplication rate. In HCPD at 20% subject leakage: attention Δr=0.28, Δq²=0.19; age Δr=0.04, Δq²=0.07; matrix Δr=0.14, Δq²=0.11. Across datasets, 20% duplication yielded Δr=0.06–0.29, Δq²=0.03–0.24. Twin-only ABCD subset showed small inflation (Δr≈0.02–0.04). Simulations indicated larger effects as the proportion of multi-member families increased.
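The memorisation mechanism behind subject leakage can be reproduced in a toy simulation (a hypothetical numpy sketch; a nearest-neighbour predictor is used to make the effect explicit, unlike the paper's ridge models):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 100, 50
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)                  # again, no true signal

# duplicate 20% of subjects, mirroring the strongest duplication condition
dup = rng.choice(n, size=20, replace=False)
Xd, yd = np.vstack([X, X[dup]]), np.concatenate([y, y[dup]])

order = rng.permutation(len(yd))            # naive shuffle: copies can cross the split
train, test = order[:60], order[60:]

# 1-nearest-neighbour regression: any test subject whose copy sits in the
# training set is predicted exactly (distance zero to its duplicate)
dists = ((Xd[test][:, None, :] - Xd[train][None, :, :]) ** 2).sum(-1)
pred = yd[train][dists.argmin(1)]
n_memorised = int((pred == yd[test]).sum())
r_dup = np.corrcoef(pred, yd[test])[0, 1]
```

Splitting by subject (all copies of a subject kept in the same fold), as the gold standard does for families, removes these exact matches and the associated inflation.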

Coefficient and feature distribution shifts: Compared with gold standard, coefficients changed little when excluding site correction (median r_coef=0.75–0.99), moderately when omitting covariate regression (0.31–0.84) or both site and covariate steps (0.32–0.81). Leaky feature selection produced the most dissimilar coefficients (0.39–0.72). Family leakage (0.79–0.94) and 20% subject leakage (0.74–0.93) also altered coefficients. Feature distributions across canonical subnetworks were most perturbed by leaky feature selection and by omitting covariate regression.

Sample size effects: Leakage effects were more variable and potentially larger at smaller sample sizes (e.g., N=100), with wide Δr ranges that narrowed as N increased to 400. Taking medians over multiple CV iterations reduced but did not eliminate leakage effects; feature and subject leakage remained impactful.

Sensitivity analyses and structural connectomes: Trends generalized to SVR and CPM, though effect magnitudes varied (e.g., CPM less affected by feature leakage; SVR most sensitive to subject leakage). Structural connectomes in HCPD showed similar patterns: gold standard performance (matrix r=0.34, q²=0.12; attention r=0.11, q²=−0.07; age r=0.73, q²=0.53). Feature leakage (Δr=0.07–0.57, Δq²=0.12–0.52) and subject leakage (Δr=0.05–0.27, Δq²=0.06–0.20) most inflated performance; leaky covariate regression mildly reduced performance (Δr up to −0.04, Δq² up to −0.04).

Discussion

The study demonstrates that data leakage impacts connectome-based prediction in ways that vary by leakage type, phenotype, dataset size, model, and modality. Critically, leaky feature selection and subject-level leakage (duplicate or repeated subjects crossing train/test) can substantially inflate performance, especially for phenotypes with weak brain-behavior associations (e.g., attention problems). In contrast, leaky covariate regression can deflate performance, highlighting that leakage can also cause underestimation of true effects. Family leakage showed minimal effects in datasets with few multi-member families but can become more consequential when family prevalence is higher (e.g., twin studies). Larger datasets and using multiple cross-validation iterations mitigated but did not eliminate leakage effects; small samples were particularly vulnerable and showed high variability. Coefficient comparisons revealed that leakage alters model interpretability, with the largest deviations for leaky feature selection and when omitting covariate regression, implying that leakage can distort inferred neurobiological relationships. Overall, maintaining strict separation of training and test data and carefully embedding all preprocessing within cross-validation are essential to obtain valid, reproducible estimates of predictive performance and to preserve meaningful model interpretation.

Conclusion

This work systematically quantifies how common forms of data leakage affect connectome-based predictive modeling across multiple large neuroimaging datasets, phenotypes, models, and modalities. Feature selection performed on combined data and subject duplication across train/test cause the greatest performance inflation, whereas leaky covariate regression deflates performance and site-related leakage typically has negligible impact under harmonized conditions. Effects are amplified in smaller samples and for phenotypes with weaker brain-behavior associations. Best practices include strict train/test separation, cross-validated preprocessing (covariate regression and site correction), accounting for relatedness, performing many cross-validation iterations, code sharing, and considering alternative validation strategies (e.g., lockbox or external validation) and model information sheets. Future work should examine additional leakage forms (e.g., hyperparameter selection on test data, temporal leakage, unsupervised dimensionality reduction), broader populations and settings (e.g., site confounds), complex models like deep neural networks, and additional evaluation metrics.

Limitations

The study cannot cover all possible leakage forms or all datasets/phenotypes. Several leakage types were not examined (e.g., temporal leakage, hyperparameter selection on test data, unsupervised dimensionality reduction, phenotype standardization, illegitimate features). Analyses focused on child/adolescent/young adult cohorts from well-harmonized datasets; different populations, dataset quality, or confounds (e.g., diagnosis confounded with site) could alter leakage effects. Family leakage effects may be larger in datasets with more multi-member families (e.g., twin studies). Complex models (e.g., deep networks) could be more susceptible to leakage. Results primarily use Pearson’s r and q²; other metrics may behave differently. In some contexts, what constitutes leakage can be application-dependent (e.g., site separation vs. harmonization).
