Distinct brain morphometry patterns revealed by deep learning improve prediction of post-stroke aphasia severity

Medicine and Health

A. Teghipco, R. Newman-Norlund, et al.

This groundbreaking study by Alex Teghipco, Roger Newman-Norlund, Julius Fridriksson, Christopher Rorden, and Leonardo Bonilha reveals that deep learning with convolutional neural networks (CNNs) significantly outperforms traditional machine learning methods in predicting post-stroke aphasia severity. Analyzing three-dimensional brain imaging, the authors underscore the critical role of brain morphometry patterns in understanding the cognitive processes linked to aphasia.

Introduction
Aphasia affects approximately 30% of stroke survivors, with chronic impairments persisting in up to 60% beyond six months. Although lesion location, lesion size, initial aphasia severity, age, and demographics explain a substantial portion of variance, comprehensive models account for only about 50% of chronic aphasia severity, suggesting that additional neurobiological factors are at play. Contemporary models posit that language recovery depends on preservation of hierarchically organized systems beyond the lesion, including core left hemisphere language regions and, when these are compromised, contralateral homotopic and domain-general networks. Morphometric integrity outside the lesion (e.g., atrophy patterns) likely contributes to outcomes but remains under-characterized, partly due to methodological limitations in assessing spatially distributed atrophy. The authors hypothesize that a 3D CNN operating on whole-brain morphometry and lesion maps will outperform support vector machines (SVMs) in classifying severe aphasia and that CNNs will exploit spatially dependent features beyond the lesion to improve prediction.
Literature Review
Prior work shows lesion site and size predict aphasia severity, but extralesional integrity also matters. Studies report atrophy in inferior frontal gyrus and distributed gray matter correlating with post-stroke cognitive/language impairments; preservation of right temporal and supplementary motor areas relates to better language outcomes. In other neurological conditions, specific spatial atrophy patterns associate with symptoms, suggesting patterned morphometry may also be relevant post-stroke. CNNs have been widely used in stroke for lesion detection/segmentation and imaging enhancement; fewer works address outcome prediction, mostly in the acute setting, where CNNs outperformed logistic or linear models. These observations motivate testing whether CNNs can capture spatial dependencies in chronic stroke morphometry that classical methods miss.
Methodology
Design: Cross-sectional predictive modeling comparing 3D CNNs and SVMs in classifying severe versus nonsevere aphasia from structural MRI-derived morphometry and lesion maps.
Participants: Retrospective cohort of chronic left-hemisphere stroke survivors evaluated at the Center for the Study of Aphasia Recovery (C-STAR). The methods text reports N = 213 (age 57.98 ± 11.34 years; 62% male; mean 3.2 ± 3.7 years post-stroke), while the abstract and figures refer to N = 231 scans processed through the lesion/morphometry pipelines. Data were collected at the University of South Carolina and the Medical University of South Carolina under IRB approvals (Pro00053559, Pro00105675, Pro00005458).
Behavioral assessment: Western Aphasia Battery–Revised (WAB-R) Aphasia Quotient (AQ; 0–100). The severe aphasia class was defined as WAB-AQ < 50 (very severe or severe); 35% of the sample was severe and 65% nonsevere (moderate/mild). Imaging occurred within 10 days of WAB-R administration.
MRI acquisition: 3T Siemens Prisma (Trio to Fit upgrade in 2016), 20-channel coil. T1 MPRAGE (1 mm isotropic; TR 2.25 s; TE 4.11 ms; TI 925 ms; FA 9°) and T2 SPACE (1 mm isotropic; TR 3200 ms; TE 567 ms; variable FA).
Preprocessing: Manual T2 lesion tracing in MRIcron; resampling to T1 space (SPM8, nii_preprocess) with subsequent refinement. Enantiomorphic healing minimized normalization deformation. Healed T1 images were segmented into GM/WM/CSF with FAST and nonlinearly normalized to MNI152 2 mm space (FSL fsl_anat/FNIRT). Lesion and tissue maps were transformed to template space (k-nearest-neighbor interpolation) and merged into ordinal morphometric maps with the lesion coded as a fourth tissue value. Volumes were downsampled to 8 mm isotropic, cropped to a common field of view, and scaled to the range −1 to 1.
Cross-validation: Stratified, repeated nested CV with 6 outer folds (~192 train, ~38 test per fold) and 8 inner folds for hyperparameter tuning; 20 repeats with preallocated partitions enabled paired model comparisons. Hyperparameters were tuned by inner-fold grid/random search, with outer-fold retraining using early stopping on a validation split (CNN). Class imbalance was handled by inverse-frequency weighting.
Performance metrics: Precision, class-wise accuracies, balanced accuracy, F1 (primary), and permutation testing (500 label shuffles) to establish chance level.
Models:
- 3D CNN: Single-channel VGG-style architecture with 4–5 convolutional blocks (3×3×3 kernels; 1–4 conv layers per block; 8–128 channels) and max pooling between blocks, followed by three fully connected layers (expand, contract, output). Regularization used batch normalization after the first FC layer, dropout (0.6/0.7/0.8), and L2 weight decay (0.001/0.01). The learning rate was tuned (1e−5, 8e−5, 1e−4) with cosine annealing warm restarts over up to 800 epochs, minibatch size 128, and class-weighted binary cross-entropy loss. Implemented in PyTorch (a simplified sketch follows this section).
- SVMs: Linear and RBF kernels trained with SMO and class-weighted hinge loss; random search over kernel scale (1e−3 to 1e3) and cost (1e−3 to 2e4). Dimensionality reduction variants tuned PCA (1–75 components) and ICA applied to PCA components (1–75 components). Implemented in MATLAB.
Model fusion and feature analyses: Probability-averaging ensembles and a stacked ensemble via regularized LDA (gamma 0–1). SVMs were also trained on CNN penultimate-layer features (~64 dimensions) and on CNN saliency maps. Saliency was computed with deep SHAP and Grad-CAM++ for the CNN and with SHAP for the SVM.
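To make the CNN design concrete, here is a minimal PyTorch sketch of one plausible configuration within the reported search space (4 blocks of 3×3×3 convolutions with channels doubling from 8, max pooling between blocks, and expand/contract/output FC layers with batch norm after the first FC and dropout). The input shape, layer widths, and hyperparameter values are illustrative assumptions, not the tuned values from the paper.

```python
import torch
import torch.nn as nn

class Morphometry3DCNN(nn.Module):
    """VGG-style 3D CNN for single-channel morphometry volumes.

    One illustrative configuration from the described search space:
    4 conv blocks (3x3x3 kernels, channels 8->16->32->64), max pooling
    between blocks, then expand/contract/output fully connected layers.
    """
    def __init__(self, in_shape=(24, 28, 24), dropout=0.6):
        # in_shape approximates an 8 mm downsampled, cropped MNI volume
        # (an assumption; the paper's exact field of view is not given here).
        super().__init__()
        chans = [1, 8, 16, 32, 64]
        blocks = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            blocks += [
                nn.Conv3d(c_in, c_out, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool3d(kernel_size=2, ceil_mode=True),
            ]
        self.features = nn.Sequential(*blocks)
        # Infer the flattened feature size with a dummy forward pass.
        with torch.no_grad():
            n_feat = self.features(torch.zeros(1, 1, *in_shape)).numel()
        self.classifier = nn.Sequential(
            nn.Linear(n_feat, 128),   # expanding FC layer
            nn.BatchNorm1d(128),      # batch norm after the first FC
            nn.ReLU(inplace=True),
            nn.Dropout(dropout),
            nn.Linear(128, 64),       # contracting FC layer
            nn.ReLU(inplace=True),
            nn.Linear(64, 1),         # single logit: severe vs nonsevere
        )

    def forward(self, x):
        return self.classifier(torch.flatten(self.features(x), 1))

# Class-weighted binary cross-entropy (inverse-frequency weighting) and
# cosine annealing with warm restarts, per the methods description.
model = Morphometry3DCNN()
pos_weight = torch.tensor([0.65 / 0.35])  # nonsevere/severe prevalence ratio
loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=50)
```

Inferring the flattened size from a dummy pass keeps the sketch valid for any input shape the cropping step produces.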
Unsupervised subtyping: Consensus clustering of Grad-CAM++ maps using repeated k-means++ (eta distance) on 60% voxel subsamples (1000 iterations), exploring 3–30 clusters; reliable solutions were selected via Hartigan's dip test and the proportion of ambiguously clustered pairs, with final clusters assigned via affinity propagation (a simplified sketch follows below).
Decoding: Subgroup exemplar Grad-CAM++ maps (excluding lesioned voxels) were correlated with 200 topic meta-analyses from an author-topic model (NiMARE/Neurosynth); associations with Bonferroni-corrected p < 0.0001 and r > 0.2 were retained.
ROI analysis: Normalized saliency within the lesion, perilesional, and extralesional regions and their contralateral homologs was compared between severe and nonsevere predictions (two-sample t-tests).
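Below is a minimal Python sketch of the consensus-clustering step, assuming vectorized Grad-CAM++ maps as input. scikit-learn's k-means uses Euclidean distance, so the paper's eta distance is simplified away here; the iteration count is reduced for brevity, and the final affinity-propagation assignment is omitted.

```python
import numpy as np
from sklearn.cluster import KMeans

def consensus_matrix(X, k, n_iters=100, frac=0.6, seed=0):
    """Consensus clustering over random voxel subsamples.

    X: (n_subjects, n_voxels) array of vectorized saliency maps.
    Each iteration runs k-means++ on a random 60% subset of voxels;
    the consensus matrix holds the fraction of runs in which each
    pair of subjects co-clusters.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    co = np.zeros((n, n))
    for i in range(n_iters):
        cols = rng.choice(X.shape[1], int(frac * X.shape[1]), replace=False)
        labels = KMeans(n_clusters=k, init="k-means++", n_init=10,
                        random_state=i).fit_predict(X[:, cols])
        co += labels[:, None] == labels[None, :]
    return co / n_iters

def pac(consensus, lo=0.1, hi=0.9):
    """Proportion of ambiguously clustered pairs: pairs whose consensus
    is neither clearly together (> hi) nor clearly apart (< lo)."""
    vals = consensus[np.triu_indices_from(consensus, k=1)]
    return np.mean((vals > lo) & (vals < hi))

# Scan candidate cluster counts and keep the k with the lowest PAC.
X = np.random.rand(80, 5000)  # placeholder for vectorized Grad-CAM++ maps
scores = {k: pac(consensus_matrix(X, k, n_iters=50)) for k in range(3, 11)}
best_k = min(scores, key=scores.get)
```

A low PAC indicates that most subject pairs are either consistently clustered together or consistently apart across subsamples, i.e., a reliable partition.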
Key Findings
- CNN predictive performance: Median severe-class accuracy 0.88 (range 0.81–0.93), median balanced accuracy 0.77 (0.74–0.78), median nonsevere-class accuracy 0.67 (0.61–0.69), precision 0.59 (0.55–0.60), and median F1 = 0.70 (0.67–0.72). Permutation testing showed performance significantly above chance (p = 0.002; no overlap with the permuted F1 distribution; see the sketch after this list).
- CNN vs SVM: Linear SVMs outperformed RBF SVMs but remained inferior to CNNs. F1: SVM M = 0.65 (SD 0.02) vs CNN M = 0.70 (SD 0.01); t(19) = −10, p = 5.26e−9, d = −2.24. Mean accuracy: SVM M = 0.73 (SD 0.01) vs CNN M = 0.77 (SD 0.01); t(19) = −9.8, p = 7.28e−9, d = −2.2. PCA/ICA dimensionality reduction (1–75 components) did not improve SVM performance.
- Class trade-offs: SVM had higher nonsevere accuracy (better precision) but worse severe-class recall; CNN favored severe-class accuracy, improving balanced measures (F1, balanced accuracy).
- Model fusion: Weighted averaging of CNN/SVM probabilities did not surpass CNN (max ensemble F1 M = 0.71 vs CNN 0.70; t(19) = 2.0, p = 0.056). Stacked LDA improved over SVM but was not significantly better than CNN (F1 M = 0.71, SD 0.03; mean accuracy M = 0.78, SD 0.02; both ns vs CNN).
- Saliency analyses: Grad-CAM++ maps (spatially aware) diverged markedly from SHAP/deep SHAP maps. SHAP vs deep SHAP similarity (M = 0.58, SD 0.08) exceeded similarities of Grad-CAM++ to deep SHAP (M = −0.37, SD 0.16) and to SHAP (M = −0.47, SD 0.16), ps < 1e−123. Group-averaged Grad-CAM++ highlighted contralateral regions (not simply peri-lesional), whereas SVM SHAP emphasized the lesion vicinity.
- Training SVMs on CNN-derived inputs: SVMs trained on Grad-CAM++ saliency achieved higher F1 than on deep SHAP (0.69 vs 0.64; t(19) = 4.9, p = 9.25e−5, d = 1.9) and matched SVMs trained on CNN penultimate-layer features (~64 dimensions), achieving parity with CNN performance.
- ROI saliency differences: Grad-CAM++ showed higher saliency in left-hemisphere regions for nonsevere predictions and higher saliency in right-hemisphere regions for severe predictions (all p < 0.0001). SVM SHAP showed higher lesion saliency for severe and higher peri-/extralesional saliency for nonsevere predictions (all p < 0.0001).
- Lesion-only model: A CNN trained on lesion anatomy alone performed significantly worse than one trained on full morphometry maps (p = 0.02).
- Subtypes: Consensus clustering of Grad-CAM++ maps identified 7 severe and 6 nonsevere subgroups with high internal reliability and consistent within-cluster patterns. Subgroups were not associated with lesion size (severe: F(6,108) = 0.77, p = 0.6; nonsevere: F(5,110) = 0.67, p = 0.65) or accuracy differences.
- Decoding: Subgroup saliency correlated with meta-analytic topics implicating language subsystems (semantics, reading, overt speech, lexical-semantics), domain-general functions (attention, working memory, visuospatial processing, response inhibition), and aging/neurological conditions (e.g., Alzheimer's, epilepsy).
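For illustration, here is a hedged Python sketch of how the chance-level F1 (label permutation) and the paired CNN-vs-SVM comparison over the 20 preallocated CV repeats could be computed. The label and score arrays below are placeholders, not the study's data.

```python
import numpy as np
from scipy import stats
from sklearn.metrics import f1_score

def permuted_f1_null(y_true, y_pred, n_perm=500, seed=0):
    """Null F1 distribution from label shuffling (500 shuffles in the paper)."""
    rng = np.random.default_rng(seed)
    return np.array([f1_score(rng.permutation(y_true), y_pred)
                     for _ in range(n_perm)])

# Placeholder labels/predictions (35% severe, as in the sample).
y_true = np.random.binomial(1, 0.35, 230)
y_pred = np.random.binomial(1, 0.40, 230)
null = permuted_f1_null(y_true, y_pred)
observed = f1_score(y_true, y_pred)
p_chance = np.mean(null >= observed)  # one-sided permutation p-value

# Paired comparison across the 20 CV repeats: preallocated partitions mean
# each repeat yields one F1 per model on identical train/test splits.
cnn_f1 = np.random.normal(0.70, 0.01, 20)  # placeholder per-repeat scores
svm_f1 = np.random.normal(0.65, 0.02, 20)
t, p = stats.ttest_rel(svm_f1, cnn_f1)     # df = 19, matching t(19)
diff = svm_f1 - cnn_f1
cohens_d = diff.mean() / diff.std(ddof=1)  # paired effect size (d_z)
```

Preallocating the CV partitions is what licenses the paired test: both models see exactly the same splits, so repeat-level scores can be differenced directly.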
Discussion
Findings support the hypotheses that 3D CNNs better predict severe aphasia than classical SVMs and that performance gains derive from leveraging spatially dependent morphometry well beyond the lesion. CNNs captured distributed ipsilateral and contralateral patterns, with right-hemisphere morphometry particularly predictive of severe aphasia and left-hemisphere features more informative for nonsevere cases. This aligns with contemporary models in which outcomes depend on the preservation of left-hemisphere language networks and compensatory or maladaptive involvement of contralateral/domain-general networks. Classical methods largely attended to lesion vicinity and only matched CNN performance when provided CNN-learned features or Grad-CAM++ saliency inputs, underscoring the unique value of spatial feature learning. Subtyping revealed heterogeneous, individualized morphometry patterns not tied to lesion size, suggesting multiple pathways by which extralesional brain integrity contributes to aphasia severity. Meta-analytic decoding indicates contributions from both language-specific and domain-general systems, and possible interactions with aging-related atrophy patterns. Collectively, these results highlight that modeling spatial dependencies at multiple scales in volumetric neuroimaging can improve prognostication and refine neurobiological understanding of chronic aphasia.
Conclusion
The study demonstrates that 3D CNNs applied to whole-brain morphometry and lesion maps outperform SVMs in classifying severe aphasia in chronic stroke. CNNs identify distributed, three-dimensional morphometry patterns—often outside the lesion—that are directly associated with aphasia severity, and they reveal robust subtypes not explained by lesion size. These insights suggest deep learning can advance outcome prediction and illuminate individualized neurobiological mechanisms of post-stroke aphasia. Future research should expand cohorts (especially severe cases), incorporate longitudinal designs, leverage higher-resolution and multimodal imaging, integrate demographic/clinical/tabular data, refine fusion/stacking strategies, and evaluate generalizability across modalities and tasks (classification and regression). Such developments could enable clinically useful point-of-care prognostication from intake scans.
Limitations
- Data were downsampled to 8 mm to make deep learning tractable within nested cross-validation, potentially limiting spatial detail and generalizability to higher-resolution data or other modalities.
- Modest, cross-sectional sample; the absence of longitudinal imaging limits causal inference and trajectory prediction.
- Potential overfitting risks inherent to deep models, though addressed with nested CV and permutation tests.
- Saliency/attribution methods have known reliability issues; interpretations may vary across methods despite convergent findings for Grad-CAM++.
- Not all classical machine learning variants or fusion strategies were exhaustively tested; more complex stacking or alternative algorithms might yield different results.
- Models did not integrate non-spatial/tabular data (e.g., demographics), which could affect comparative performance and overall accuracy.
- Differences in task framing (classification here vs. regression in some prior work) complicate direct comparisons with earlier studies.