Data leakage inflates prediction performance in connectome-based machine learning models

M. Rosenblatt, L. Tejavibulya, et al.

This study by Matthew Rosenblatt, Link Tejavibulya, Rongtao Jiang, Stephanie Noble, and Dustin Scheinost examines data leakage in neuroimaging predictive modeling. Testing five forms of leakage across four datasets, it finds that leakage through feature selection and repeated subjects drastically inflates prediction performance, and that small datasets exacerbate the effects of leakage.

Abstract

Predictive modeling is a central technique in neuroimaging to identify brain-behavior relationships and test their generalizability to unseen data. However, data leakage undermines the validity of predictive models by breaching the separation between training and test data. Leakage is always an incorrect practice but still pervasive in machine learning. Understanding its effects on neuroimaging predictive models can inform how leakage affects existing literature. Here, we investigate the effects of five forms of leakage—involving feature selection, covariate correction, and dependence between subjects—on functional and structural connectome-based machine learning models across four datasets and three phenotypes. Leakage via feature selection and repeated subjects drastically inflates prediction performance, whereas other forms of leakage have minor effects. Furthermore, small datasets exacerbate the effects of leakage. Overall, our results illustrate the variable effects of leakage and underscore the importance of avoiding data leakage to improve the validity and reproducibility of predictive modeling.
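To make the feature-selection form of leakage concrete, the following is a minimal sketch on synthetic data. It is not the authors' pipeline or data. The feature matrix and target are pure noise, so any apparent predictive performance is an artifact. The helper names (`top_k_features`, `fit_predict`, `cv_correlation`), the sample sizes, and the choice of ordinary least squares are all assumptions made for illustration.

```python
import numpy as np

def top_k_features(X, y, k):
    """Return indices of the k features most correlated (in absolute value) with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(-np.abs(r))[:k]

def fit_predict(X_train, y_train, X_test):
    """Fit ordinary least squares with an intercept and predict on X_test."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    return A_test @ coef

def cv_correlation(X, y, k, leaky, n_folds=5, seed=0):
    """Cross-validated correlation between predicted and observed y.

    leaky=True  selects features once on ALL subjects (test data leaks in).
    leaky=False selects features inside each training fold only.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    preds = np.empty(len(y))
    selected_all = top_k_features(X, y, k) if leaky else None
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        if leaky:
            selected = selected_all
        else:
            selected = top_k_features(X[train_idx], y[train_idx], k)
        preds[test_idx] = fit_predict(
            X[train_idx][:, selected], y[train_idx], X[test_idx][:, selected]
        )
    return np.corrcoef(preds, y)[0, 1]

# Hypothetical setup: 100 subjects, 5000 noise "edges", and a noise phenotype.
rng = np.random.default_rng(42)
X = rng.standard_normal((100, 5000))
y = rng.standard_normal(100)

leaky_r = cv_correlation(X, y, k=10, leaky=True)
clean_r = cv_correlation(X, y, k=10, leaky=False)
print(f"leaky feature selection:  r = {leaky_r:.2f}")
print(f"nested feature selection: r = {clean_r:.2f}")
```

Because there is no true signal, the nested version should hover around zero, while the leaky version reports a clearly positive correlation. The only difference between the two runs is whether the held-out subjects were visible during feature selection.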

Publisher

Nature Communications

Published On

Feb 28, 2024

Authors

Matthew Rosenblatt, Link Tejavibulya, Rongtao Jiang, Stephanie Noble, Dustin Scheinost

DOI

https://doi.org/10.1038/s41467-024-46150-w
