Improved metabolomic data-based prediction of depressive symptoms using nonlinear machine learning with feature selection

Y. Takahashi, M. Ueki, et al.

Yuta Takahashi and colleagues present a prediction model for depressive symptoms built on the HSIC Lasso algorithm. The study uses a large metabolomic dataset from a population affected by the Great East Japan Earthquake and identifies metabolites that improve predictive power.

Introduction

The study addresses the challenge of predicting depressive symptoms from plasma metabolomics by overcoming two limitations of prior work: high-dimensional feature sets that contain redundant features, and nonlinear relationships among metabolites, covariates, and depressive phenotypes. Linear models, including linear feature selection such as Lasso, assume linearity, while kernel-based predictors used without feature selection, such as support vector machines (SVM) and kernel regression (KR), can overfit omics data. The authors hypothesize that combining a nonlinear feature selection method that maximizes feature–outcome dependence and minimizes redundancy (Hilbert-Schmidt Independence Criterion Lasso, HSIC Lasso) with nonlinear predictors (SVM/KR) will improve prediction of depressive symptoms, measured by the Center for Epidemiologic Studies-Depression Scale (CES-D), from metabolomic profiles in a large, population-based Japanese cohort affected by the Great East Japan Earthquake. The study's purpose is to develop and evaluate this HSIC Lasso-based pipeline and to identify the metabolite features most predictive of depressive symptoms.

Literature Review

Prior metabolomics studies suggest plasma metabolome profiles can inform depression mechanisms and prediction, but findings have been inconsistent and often not replicated. Earlier prediction models commonly relied on single metabolites or linear approaches and suffered from small sample sizes (often a few hundred or fewer) and statistical issues including multicollinearity, redundancies among metabolites, and unaddressed nonlinear associations among metabolites, covariates (e.g., age, BMI), and depressive phenotypes. Kernel-based methods like SVM/KR can model nonlinearity but may overfit in omics contexts without explicit feature selection. HSIC Lasso was proposed to tackle these issues by selecting features that are maximally dependent on the outcome and minimally redundant with each other using a kernel-based dependence measure (HSIC).
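To make the dependence measure concrete, the following is a minimal sketch of the biased empirical HSIC between two one-dimensional samples using Gaussian kernels. It is not the authors' implementation: the function names, the fixed kernel widths, and the toy data are assumptions chosen for illustration, and some references normalize by (n-1)² rather than n². The toy outcome is a quadratic function of the input, so a linear measure such as Pearson correlation sees almost nothing while HSIC is clearly positive.

```python
import math

def rbf_gram(values, sigma):
    """Gram matrix of a Gaussian (RBF) kernel for a 1-D sample."""
    return [[math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2)) for b in values]
            for a in values]

def center(K):
    """Double-center a Gram matrix: H K H with H = I - (1/n) 1 1^T."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + total for j in range(n)]
            for i in range(n)]

def hsic(x, y, sigma_x=1.0, sigma_y=1.0):
    """Biased empirical HSIC: trace(Kc Lc) / n^2 (kernel widths are assumptions)."""
    n = len(x)
    Kc = center(rbf_gram(x, sigma_x))
    Lc = center(rbf_gram(y, sigma_y))
    # Both centered matrices are symmetric, so the trace of their product
    # equals the sum of their elementwise products.
    return sum(Kc[i][j] * Lc[i][j] for i in range(n) for j in range(n)) / n ** 2

def pearson(a, b):
    """Pearson correlation, shown only as a linear point of comparison."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / math.sqrt(va * vb)

# Toy data (not from the study): symmetric inputs with a quadratic outcome.
x = [-2.0 + 0.5 * i for i in range(9)]
y_quadratic = [v * v for v in x]
y_constant = [1.0] * len(x)

r_linear = pearson(x, y_quadratic)       # close to zero: no linear trend
hsic_dependent = hsic(x, y_quadratic)    # positive: nonlinear dependence
hsic_constant = hsic(x, y_constant)      # zero: a constant carries no information
print(r_linear, hsic_dependent, hsic_constant)
```

HSIC Lasso builds on this quantity by scoring each candidate feature against the outcome and penalizing features whose kernel matrices resemble those of features already selected; that selection step is not shown here.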

Methodology

Design and population: Population-based cross-sectional study using the first batch (n=1008) of the Japanese Multi Omics Reference Panel (jMorp). After listwise deletion of 48 subjects with missing CES-D and exclusion of 63 with unreliable CES-D responses, 897 subjects remained for analysis. Ethical approval was obtained from the Tohoku University Ethics Committee; written informed consent was provided by all participants.

Outcome measures: Depressive symptoms assessed by the Center for Epidemiologic Studies-Depression Scale (CES-D). Both quantitative CES-D scores and binary traits using cutoffs ≥16 and ≥19 to define the depressive group were analyzed.

Metabolomics acquisition: Plasma was prepared and stored at -80°C. Metabolites were extracted using standard methanol extraction.
- NMR: Bruker Avance 600 MHz at 298 K; 1D NOESY and CPMG spectra; processed with Chenomx NMR Suite; target profiling via Chenomx Profiler for identification and quantification.
- MS: UHPLC-QTOF MS (Waters ACQUITY UPLC I-class + Synapt G2-Si, ESI positive; C18 ACQUITY HSS T3 column; MassLynx v4.1). Negative ion mode used HILIC (ZIC-PHILIC) with Q Exactive Orbitrap MS and heated-ESI-II; Xcalibur v4.1.
In total, 306 features were used (37 NMR-identified metabolites, 269 MS-characterized metabolites by intensity).

Covariates: Sex, age, BMI, marital status, damage from the Great East Japan Earthquake (0–4 scale based on official house damage categories), antidepressant intake, Lubben Social Network Scale 6, and social capital score. These were included in variable selection alongside metabolites; selected covariates per fold are listed in Supplementary Table S1. Rationale: these environmental and clinical factors are associated with both CES-D and metabolite profiles in prior literature.

Model evaluation and tuning: Nested fivefold cross-validation.
- Outer loop: fivefold CV to evaluate predictive power.
- Metrics: Pearson correlation between predicted and observed CES-D for quantitative outcomes; AUC for binary outcomes.
- Inner loop(s): For models requiring tuning, fivefold CV on the training partitions. If separate feature selection and prediction algorithms were used, a first inner loop tuned feature selection parameters (e.g., number of features), and a second inner loop tuned prediction model hyperparameters (e.g., kernel parameters).
Subject splits were identical across models for fair comparison.

HSIC Lasso-based pipeline: Feature selection using HSIC Lasso to select features with maximal dependence on outcome and minimal redundancy, assessed via HSIC (kernel-based independence measure; zero indicates independence). Prediction performed with kernel regression (KR) for quantitative outcomes and support vector machine (SVM) for binary outcomes using R kernlab and CVST packages, with radial basis function kernels. First inner loop tuned the number of HSIC-selected features; second inner loop tuned sigma and lambda/C for KR/SVM.

Comparative models:
- Lasso (glmnet) as a linear feature selection + prediction model; lambda tuned via inner fivefold CV.
- SVM/KR without feature selection; sigma and lambda/C tuned via inner CV.
- SVM/KR with univariate P<0.05 variables; hyperparameters tuned via inner CV.
- SVM/KR with Lasso-based feature selection; Lasso lambda (first inner loop) and SVM/KR hyperparameters (second inner loop) tuned.
- Random forest (randomForest): mtry optimized via Out-of-Bag error; number of trees set to 500 (sensitivity in Supplementary Table S2).
- Partial least squares (PLS) and sparse PLS (SPLS) via caret: number of components (PLS) and components plus sparsity parameter eta (SPLS) tuned via inner CV.
- Neural network (keras): two hidden layers, 128 nodes each, ReLU activation; epoch size tuned via inner CV; architecture motivated by prior studies and preliminary trials (architectural sensitivity in Supplementary Table S3).

Cross-validated feature stability: In models with feature selection, features selected in each of the five outer folds were tracked; frequently selected metabolites (≥4 of 5 folds) were summarized.
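The nested cross-validation scheme described above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the study's pipeline: a Nadaraya-Watson smoother with an RBF kernel stands in for the kernel regression fitted with the R packages, only one hyperparameter (sigma) is tuned in a single inner loop, predictions are pooled across outer folds before computing the Pearson correlation, and the data are synthetic. All function names are hypothetical.

```python
import math
import random

def kfold_indices(n, k, seed):
    """Shuffle 0..n-1 and deal the indices into k disjoint test folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def nw_predict(X_tr, y_tr, x_new, sigma):
    """RBF-weighted average of training outcomes (Nadaraya-Watson smoother)."""
    w = [math.exp(-sum((a - b) ** 2 for a, b in zip(x, x_new)) / (2 * sigma ** 2))
         for x in X_tr]
    total = sum(w)
    if total == 0.0:  # every weight underflowed: fall back to the training mean
        return sum(y_tr) / len(y_tr)
    return sum(wi * yi for wi, yi in zip(w, y_tr)) / total

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / math.sqrt(va * vb)

def cv_mse(X, y, rows, sigma, k, seed):
    """Mean squared error of one sigma under k-fold CV restricted to `rows`."""
    sq_err = 0.0
    for fold in kfold_indices(len(rows), k, seed):
        held = [rows[i] for i in fold]
        held_set = set(held)
        tr = [r for r in rows if r not in held_set]
        X_tr, y_tr = [X[r] for r in tr], [y[r] for r in tr]
        sq_err += sum((nw_predict(X_tr, y_tr, X[r], sigma) - y[r]) ** 2
                      for r in held)
    return sq_err / len(rows)

def nested_cv(X, y, sigmas, k=5, seed=0):
    """Outer k-fold estimates performance; inner k-fold picks sigma per outer fold."""
    preds = [None] * len(y)
    chosen = []
    for test in kfold_indices(len(y), k, seed):
        test_set = set(test)
        train = [i for i in range(len(y)) if i not in test_set]
        # Inner loop: tuning sees only the outer training partition.
        best = min(sigmas, key=lambda s: cv_mse(X, y, train, s, k, seed + 1))
        chosen.append(best)
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        for i in test:
            preds[i] = nw_predict(X_tr, y_tr, X[i], best)
    return pearson(preds, y), chosen

# Synthetic nonlinear data (not from the study).
rng = random.Random(42)
X = [[rng.uniform(-3, 3)] for _ in range(100)]
y = [math.sin(x[0]) + rng.gauss(0, 0.2) for x in X]
r, chosen_sigmas = nested_cv(X, y, sigmas=[0.1, 0.3, 1.0, 3.0])
print(round(r, 3), chosen_sigmas)
```

The point of the structure is that the outer test fold never influences hyperparameter choice. A pipeline with separate feature selection would add a second inner loop inside `nested_cv`, again restricted to the outer training rows, and fixing the `seed` is one simple way to keep subject splits identical across models.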

Key Findings

- Across both quantitative CES-D prediction (correlation metric) and binary CES-D classification at cutoffs ≥16 and ≥19 (AUC metric), the HSIC Lasso feature selection combined with SVM/KR achieved higher predictive performance than all comparator models (Lasso, SVM/KR without feature selection, SVM/KR with univariate filtering or Lasso-based selection, random forest, PLS, SPLS, neural network, and multiple regression).
- Frequently selected predictive metabolites included L-leucine, 3-hydroxyisobutyrate, and gamma-linolenyl carnitine. Additional metabolites repeatedly selected across folds (≥4/5) for both quantitative and binary models are reported in Table 3.
- Demographic contrasts between high and low CES-D groups showed significant differences in sex distribution, marital status, earthquake-related house damage, medication use (including antidepressants), and social engagement measures. Self-reported PTSD symptoms were more prevalent in the high CES-D group.
- The study used the largest sample size to date for metabolomics-based depression prediction (n=897), reducing variance in cross-validated estimates compared to prior smaller studies.
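The "selected in at least 4 of 5 folds" criterion used above amounts to a simple frequency count over the per-fold selections. A minimal sketch follows; the function name and the feature labels are placeholders invented for illustration, and the fold contents do not reproduce any result from the study.

```python
from collections import Counter

def stable_features(selected_per_fold, min_folds=4):
    """Return features selected in at least `min_folds` of the outer folds."""
    # set(fold) guards against a feature being listed twice within one fold.
    counts = Counter(f for fold in selected_per_fold for f in set(fold))
    return sorted(f for f, c in counts.items() if c >= min_folds)

# Placeholder selections for five outer folds (invented, not study results).
folds = [
    ["feat_a", "feat_b", "feat_c"],
    ["feat_a", "feat_b", "feat_d"],
    ["feat_a", "feat_c", "feat_b"],
    ["feat_a", "feat_b"],
    ["feat_a", "feat_d", "feat_c"],
]

print(stable_features(folds))               # features meeting the 4-of-5 rule
print(stable_features(folds, min_folds=5))  # features selected in every fold
```

Counting across outer folds rather than fitting once on all data gives a rough indication of how sensitive the selection is to the particular training subjects.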

Discussion

By explicitly addressing both high dimensionality and nonlinearity, the HSIC Lasso-based pipeline improved prediction of depressive symptoms from plasma metabolomics. HSIC Lasso selects features that are nonlinearly dependent on CES-D while minimizing redundancy among selected features, thereby mitigating multicollinearity and overfitting risks common in omics datasets. Coupling these features with nonlinear predictors (KR/SVM with RBF kernels) capitalizes on complex relationships among metabolites and between metabolites and covariates. The outperformance over linear models (Lasso, PLS/SPLS) and nonlinear models without integrated feature selection (SVM/KR alone, random forest, neural network) supports the importance of nonlinear feature selection in this domain. The stability of selected metabolites across folds (e.g., L-leucine, 3-hydroxyisobutyrate, gamma-linolenyl carnitine) suggests potential biomarker candidates for depressive symptoms in this Japanese cohort, warranting further validation. These findings imply that metabolomic signatures, when modeled with appropriate nonlinear feature selection, can enhance risk stratification for depressive symptoms and may contribute to understanding biochemical pathways associated with depression.

Conclusion

The study introduces and validates a nonlinear feature selection plus prediction framework (HSIC Lasso with SVM/KR) that improves metabolomics-based prediction of depressive symptoms compared with multiple state-of-the-art baselines. Robust nested cross-validation in a large cohort (n=897) demonstrated superior predictive performance and identified recurrently predictive metabolites such as L-leucine, 3-hydroxyisobutyrate, and gamma-linolenyl carnitine. Future work should test generalizability across different ethnicities and populations, further assess the identified metabolites’ roles as biomarkers, and explore external validation and longitudinal designs to evaluate predictive utility over time.

Limitations

- Single-cohort, population-based cross-sectional design limits causal inference.
- Generalizability is uncertain; the authors recommend evaluation in other ethnicities and populations beyond Japanese communities affected by the Great East Japan Earthquake.
- Metabolomic features were derived from specific NMR and MS platforms; findings may vary with other platforms or processing pipelines.
- While nested cross-validation reduces overfitting, no external validation cohort is reported.
- Some covariate and environmental factors were self-reported (e.g., PTSD symptoms, social engagement), which may introduce reporting bias.
