Functional connectivity signatures of major depressive disorder: machine learning analysis of two multicenter neuroimaging studies
S. Gallo, A. El-gazzar, et al.
Major depressive disorder (MDD) affects over 163 million people worldwide, motivating efforts to improve diagnosis, prevention, and treatment. Interest has grown in applying artificial intelligence to develop psychiatric biomarkers. Early small-scale studies suggested that resting-state fMRI functional connectivity (FC) could yield high diagnostic accuracy for MDD, but larger datasets in psychiatry often show reduced accuracy, likely due to increased heterogeneity. Large resting-state MDD cohorts were previously unavailable, limiting progress. This study leveraged two large consortia—REST-meta-MDD (mddrest) and PsyMRI (psymri)—to evaluate the potential of whole-brain resting-state FC as a biomarker for MDD using linear and nonlinear SVMs and graph convolutional neural networks (GCNs). The study also aimed to identify neurophysiological signatures of MDD via interpretable deep learning and to assess the influence of medication status on classification performance.
Prior univariate analyses of resting-state FC in MDD have reported consistent group differences but may miss multivariate patterns. Early machine learning studies using SVMs reported high accuracies (up to ~95%) in small samples. Deep learning, particularly GCNs that exploit graph structure inherent to FC, has shown promise in neuroimaging tasks and offers interpretability tools (e.g., GCN-Explainer). However, larger, more heterogeneous datasets in psychiatric neuroimaging often show diminished classification performance, potentially due to clinical and site heterogeneity. Previous multicenter work reported balanced accuracies around 67–69% for MDD classification, and meta-analyses have implicated thalamic hyperactivity/hyperconnectivity in MDD across rest and task paradigms.
Design and datasets: Resting-state fMRI data were acquired from two multicenter consortia: PsyMRI (psymri; 23 cohorts; 531 MDD, 508 HC) and REST-meta-MDD (mddrest; 25 cohorts in China; 1255 MDD, 1083 HC). Two external datasets were used to benchmark sex classification: ABIDE (ASD vs TD; N=2000) and UK Biobank (N=2000). All included data were collected with written informed consent and local IRB approval.
Preprocessing and feature extraction: PsyMRI raw data were preprocessed in-house with FSL and ANTs; mddrest data were preprocessed at the acquisition sites using DPARSF/SPM. Time series were extracted from 112 ROIs based on the Harvard-Oxford atlas. Pairwise Pearson correlations between ROIs formed subject-level FC matrices. The upper triangle of each FC matrix served as the SVM feature vector. For GCNs, graphs were derived by binarizing FC to retain the top 50% of absolute correlations as edges; node features were each ROI’s original (pre-threshold) connectivity profile (the corresponding row of the FC matrix).
Classification tasks: Five contrasts were evaluated with balanced class sizes per task: (I) MDD vs HC; (II) non-medicated MDD vs HC; (III) medicated MDD vs HC; (IV) medicated vs non-medicated MDD; (V) male vs female.
Models and training: Three classifiers were implemented: linear SVM, RBF-kernel SVM, and spatial GCN. Performance was assessed via 5-fold cross-validation, reporting balanced accuracy averaged across folds. Hyperparameters were selected from literature-informed ranges according to accuracy on a validation split of 20% of the training data; after selection, performance was evaluated on the held-out test folds. Permutation testing assessed significance, with Bonferroni correction for multiple comparisons. For between-dataset generalization (train on one consortium, test on the other) in MDD vs HC, SVMs used one-shot training; GCNs used 5-fold CV with model selection on 20% of the test set. Additional metrics (F1-score, sensitivity, specificity) and ComBat site harmonization analyses are reported in Supplementary Information.
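The feature extraction described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' pipeline: the function names, the random stand-in time series, and the choice to apply the 50% threshold globally over all ROI pairs within one subject are assumptions made here for the example.

```python
import numpy as np

N_ROIS = 112  # Harvard-Oxford ROIs, as stated in the text


def fc_matrix(ts):
    """Pearson correlation between ROI time series.

    ts has shape (n_timepoints, n_rois); the result is (n_rois, n_rois).
    """
    return np.corrcoef(ts, rowvar=False)


def svm_features(fc):
    """Vectorize the strict upper triangle (diagonal excluded)."""
    rows, cols = np.triu_indices_from(fc, k=1)
    return fc[rows, cols]


def binarize_top_fraction(fc, keep=0.5):
    """Binary adjacency keeping the `keep` fraction of largest |r| values.

    Assumption: the threshold is computed per subject over all
    off-diagonal ROI pairs; the summary does not specify this detail.
    """
    rows, cols = np.triu_indices_from(fc, k=1)
    threshold = np.quantile(np.abs(fc[rows, cols]), 1.0 - keep)
    adj = (np.abs(fc) >= threshold).astype(np.int8)
    np.fill_diagonal(adj, 0)  # no self-loops
    return adj


# Random data standing in for one subject's ROI time series (hypothetical).
rng = np.random.default_rng(0)
ts = rng.standard_normal((200, N_ROIS))

fc = fc_matrix(ts)
features = svm_features(fc)       # input vector for the SVMs
adj = binarize_top_fraction(fc)   # graph edges for the GCN
node_features = fc                # row i = pre-threshold profile of ROI i
```

With 112 ROIs the SVM feature vector has 112 × 111 / 2 = 6216 entries, while the GCN sees a 112-node graph whose node features are the unthresholded rows of the same matrix.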
Deep learning details: The GCN used binary cross-entropy loss and the Adam optimizer, trained for 100 epochs with an initial learning rate of 0.001 that was decayed by a factor of 10 every 30 epochs. The architecture was optimized per contrast and dataset (details in Supplementary Information).
Interpretability and post hoc analyses: Two complementary experiments focused on MDD vs HC: (1) GCN-Explainer identified the subgraphs (connections) most informative for classification by learning a mask that maximizes mutual information with the predictions; (2) an ablation study virtually removed each ROI’s connectivity profile in the test set and measured the mean drop in balanced accuracy across 10 repetitions per fold, attributing the performance loss to that ROI. Univariate t-tests on FC were also conducted per dataset and contrast, covarying for sex, age, site, and head motion; FDR correction (p<0.05) was applied.
Symptom prediction: Hamilton Depression (HAM-D) scores were predicted using SVR with an RBF kernel and a GCN in 1113 mddrest and 333 psymri patients.
Cross-site/subject analyses: Accuracy variability across sex, diagnosis, scanner manufacturer, and recording site was evaluated for the best-performing model (RBF-SVM).
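The learning-rate schedule described above amounts to a simple step decay. The sketch below assumes 0-indexed epochs with the decay taking effect at epochs 30, 60, and 90; the function name and this indexing convention are illustrative, since the summary does not name the framework or the exact scheduler used.

```python
def step_decay_lr(epoch, initial_lr=0.001, factor=10.0, step=30):
    """Learning rate at a 0-indexed epoch under step decay.

    The rate is divided by `factor` once for every `step` completed epochs.
    """
    return initial_lr / (factor ** (epoch // step))


# Learning rate for each of the 100 training epochs.
schedule = [step_decay_lr(epoch) for epoch in range(100)]
```

Under these assumptions the rate is 0.001 for epochs 0 to 29 and has been divided by 10 three times by the final epochs (90 to 99). In PyTorch, a comparable schedule is available as `torch.optim.lr_scheduler.StepLR` with `step_size=30` and `gamma=0.1`.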
- Overall classification performance:
  - MDD vs HC: mean balanced accuracy ~61% across datasets and models (range 57–63%); significantly above chance after correction in most cases (linear SVM on psymri was not significant).
  - Medicated MDD vs HC, non-medicated MDD vs HC, medicated vs non-medicated MDD: mean ~62% (range 54–67%). At least one model was significant for mddrest and for the combined datasets; none was significant for psymri alone.
  - Cross-dataset generalization (train on one consortium, test on the other) for MDD vs HC yielded lower accuracy: GCN 54.16% (sd 0.66) psymri→mddrest and 56.38% (sd 0.84) mddrest→psymri; linear SVM 55.7% and 54.8%; RBF-SVM 53.1% and 56.1%.
  - Site-level variability for RBF-SVM was considerable (range 48–87%) and not significantly associated with site sample size (r=0.25, p=0.25).
- Sex classification:
  - Within MDD consortia: mean ~68% across datasets/models (range 65–71%).
  - External benchmarks: ABIDE 73%; UK Biobank 81%.
- Symptom prediction:
  - HAM-D prediction was poor: SVR explained 3.5–7% of the variance; the GCN predicted only the training mean.
- Interpretability results (MDD vs HC):
  - GCN-Explainer identified connections consistently across both datasets among top features: left–right thalamus; right lingual gyrus–right supracalcarine cortex; left–right anterior supramarginal gyrus; left–right medial frontal cortex.
  - Ablation: largest mean drops in balanced accuracy (present in both datasets) for thalamus (psymri −6.27% (sd 2.17); mddrest −4.62% (sd 1.08)) and Heschl’s gyrus (psymri −5.99% (sd 3.88); mddrest −4.12% (sd 1.47)).
- Univariate FC analyses:
  - mddrest: 28% of connections differed between MDD and HC, predominantly reduced FC in MDD. Amygdala showed reduced connectivity with 154 regions; insula with 126; anterior cingulate with 100 (but increased with right pre/postcentral gyri). Thalamus showed increased connectivity with 199 regions (primarily frontal and insular), but reduced interhemispheric thalamic connectivity. Effect sizes were small: reduced connections mean d=−0.14 (range −0.34 to −0.08); increased thalamic connections mean d=0.12 (range 0.08–0.18).
  - psymri: only decreased left–right supracalcarine connectivity survived FDR correction; uncorrected patterns resembled mddrest.
  - Replication: two comparisons showed replicable differences across datasets: (1) MDD vs HC: reduced left–right supracalcarine connectivity (psymri t=−4.73, p-corr<0.05; mddrest t=−3.69, p-corr<0.0005); (2) MDD medicated vs HC: increased connectivity between left thalamus and left prefrontal gyrus in medicated patients (nine other FCs decreased in medicated patients; details in Supplementary Tables).
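The accuracy figures in the findings above are balanced accuracies tested against chance by permutation. The sketch below shows how such numbers can be computed. It is a simplified illustration under stated assumptions: labels are permuted against fixed predictions rather than retraining the classifier on each permutation, and the function names, number of permutations, and number of corrected tests are hypothetical rather than taken from the study.

```python
import random


def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall (for two classes: mean of sensitivity and specificity)."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        hits = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


def permutation_p_value(y_true, y_pred, n_perm=1000, seed=0):
    """One-sided p-value for the observed balanced accuracy.

    Simplification: only the labels are shuffled; a full permutation test
    would also retrain the model on each shuffled label set.
    """
    rng = random.Random(seed)
    observed = balanced_accuracy(y_true, y_pred)
    shuffled = list(y_true)
    at_least_as_good = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if balanced_accuracy(shuffled, y_pred) >= observed:
            at_least_as_good += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_good + 1) / (n_perm + 1)


def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * n_tests)


# Toy example: 100 subjects, 40 of 50 correct in each class.
y_true = [1] * 50 + [0] * 50
y_pred = [1] * 40 + [0] * 10 + [0] * 40 + [1] * 10
score = balanced_accuracy(y_true, y_pred)
p_corrected = bonferroni(permutation_p_value(y_true, y_pred), n_tests=5)
```

Balanced accuracy is used instead of plain accuracy because it stays at 50% for a classifier that always predicts one class, regardless of class proportions.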
Machine and deep learning classifiers could distinguish MDD from controls above chance but with modest accuracy (~61%), far below accuracies reported in small-sample studies. Consistent with prior large-scale work, performance likely suffered from substantial clinical and technological heterogeneity across many sites, as reflected in wide site-level variability and diminished cross-dataset generalization. Splitting by medication status did not meaningfully improve performance, suggesting limited impact of antidepressant use on classification at this scale. Deep learning (GCN) did not substantially outperform SVMs, possibly due to dataset size relative to model capacity and the challenges posed by heterogeneous, multicenter data. Despite modest accuracy, interpretability analyses converged on thalamic hyperconnectivity as a robust neurophysiological signature of MDD, replicated across datasets and supported by both GCN-Explainer and ablation. Univariate analyses corroborated a pattern of widespread cortical hypoconnectivity in MDD alongside thalamic hyperconnectivity, aligning with prior literature implicating the (mediodorsal) thalamus in MDD pathophysiology and hypervigilant states. The findings suggest that corticothalamic hyperconnectivity may co-occur with reduced corticocortical connectivity in MDD. The higher sex-classification accuracy, particularly in a harmonized cohort (UK Biobank), underscores how reduced heterogeneity and standardized acquisition can improve predictive performance. Overall, resting-state FC provides informative though insufficiently accurate biomarkers for MDD diagnosis in heterogeneous multicenter settings. Progress may require larger harmonized datasets, dynamic FC analyses, data-driven subtyping/biotypes to reduce clinical heterogeneity, and integration of multimodal data to capture the disorder’s complexity.
Using two of the largest multicenter resting-state fMRI datasets in MDD, the study establishes a realistic, likely lower-bound estimate of MDD classification performance from whole-brain FC (~61% balanced accuracy), not sufficient for clinical diagnostic use. Nevertheless, interpretable deep learning identified consistent thalamic hyperconnectivity as a prominent and reproducible neurophysiological feature of MDD, amidst widespread hypoconnectivity elsewhere. Future work should focus on reducing heterogeneity via harmonized acquisition, exploring dynamic FC, applying data-driven biotyping, and integrating multimodal biomarkers (molecular, genomic, clinical, imaging, physiological, and behavioral) to improve accuracy and generalizability.
- High clinical and technological heterogeneity across many sites (differences in severity, chronicity, scanners, and acquisition protocols) likely reduced classification performance and generalizability; cross-dataset generalization was particularly poor.
- Site harmonization (ComBat) had limited impact on classification accuracy.
- Deep learning methods may have been underpowered given dataset size; no clear advantage over SVMs was observed.
- Visualization reliability is constrained by modest classifier accuracy; SVM interpretability is further limited by the models' deterministic nature (no repeated stochastic training).
- FC was analyzed as stationary, ignoring potentially informative temporal dynamics.
- Single-modality (rs-fMRI) approach may be reductive for heterogeneous psychiatric conditions.
- Symptom severity could not be predicted reliably from FC features (low variance explained).