Medicine and Health
AI-based differential diagnosis of dementia etiologies on multimodal data
C. Xue, S. S. Kowshik, et al.
Dementia poses a major global health challenge, with nearly 10 million new cases annually and substantial clinical and socioeconomic burden. Accurate and timely differential diagnosis is critical for guiding targeted therapies and optimizing patient care but is hampered by symptom overlap across etiologies, heterogeneous MRI findings, and a shortage of specialist clinicians. Although Alzheimer’s disease (AD) is the most common cause, other etiologies such as vascular dementia (VD), Lewy body dementia (LBD), and frontotemporal dementia (FTD) are prevalent, often co-exist, and can lead to misdiagnoses and inappropriate treatments. Access to gold-standard biomarkers (CSF, PET) is limited and blood-based biomarkers remain under active development; specialist access is constrained even in urban centers. Conventional diagnostic workflows rely on neuropsychological testing, clinical assessment, and MRI, but are resource intensive and variable across settings. Machine learning has shown promise, yet most prior work emphasizes imaging-only and AD-centric classification, limiting clinical utility for mixed etiologies and variable data availability. The study aims to develop and validate a scalable AI framework that integrates multimodal clinical and imaging data to perform differential dementia diagnosis across common etiologies, handle missing data, support mixed diagnoses, and align predictions with established biomarker and neuropathological evidence to aid clinical decision-making and screening.
Prior ML and neuroimaging research has largely focused on distinguishing normal cognition (NC), mild cognitive impairment (MCI), and AD using structural MRI, with fewer works contrasting AD to other dementias (VD, LBD, FTD). Earlier approaches often prioritized imaging-only features and AD-centric tasks, limiting applicability where other etiologies are common and can co-occur. Some multimodal studies began incorporating demographics, history, and neuropsychological measures, demonstrating improved differentiation between AD and non-AD dementias. However, real-world clinical adoption requires models that: 1) accommodate mixed etiologies; 2) operate with incomplete or variable modalities; 3) generalize across diverse cohorts; and 4) provide interpretable, etiology-specific outputs aligned with clinical pathways. This work extends prior art by training and validating a transformer-based, multimodal model across nine independent datasets, explicitly modeling 10 etiologies with multilabel outputs and validating against biomarkers and postmortem findings, as well as assessing human–AI collaboration with clinicians.
Study design and population: Data were aggregated from nine independent cohorts totaling 51,269 participants: NACC (n=45,349), ADNI (2,404), NIFD (253), PPMI (198), AIBL (661), OASIS (491), 4RTNI (80), LBDSU (182), and FHS (1,651). Participants spanned NC (19,849), MCI (9,357), and dementia (DE; 22,063). Ten etiologies were modeled: AD (17,346), LBD (2,003), VD (2,032), prion disease (PRD; 114), FTD (3,076), normal pressure hydrocephalus (NPH; 138), systemic and environmental factors (SEF; 808), psychiatric conditions (PSY; 2,700), traumatic brain injury (TBI; 265), and other dementia conditions (ODE; 1,234). Diagnostic categories were defined by neurologist consensus to align with clinical management pathways. Training used NACC, AIBL, PPMI, NIFD, OASIS, LBDSU, and 4RTNI; testing used a held-out NACC subset, plus ADNI and FHS.
Inclusion/exclusion and data harmonization: Eligibility required diagnosis of NC, MCI, or dementia; for non-NACC cohorts, at least one MRI within 6 months of documented diagnosis. NACC UDS 3.0 served as the harmonization dictionary across cohorts. For NACC, among multiple visits, the dementia-diagnosed visit with the richest features (prioritizing imaging), or the most recent if ties, was selected. Data included demographics, personal/family history, medications, labs, physical/neurological exam findings, neuropsychological tests, functional assessments, and multisequence MRI (T1w, T2w, FLAIR, SWI; DWI excluded from training due to 2D acquisition). In total, 391 non-imaging features were used.
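The NACC visit-selection rule described above can be read as a sort over candidate visits. The following is a minimal sketch under stated assumptions, not the authors' code: the `Visit` fields and the `select_visit` helper are hypothetical, "richest features" is taken to mean the count of non-missing features, and the fallback to all visits when no dementia-diagnosed visit exists is an assumption the summary does not confirm.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Visit:
    """Hypothetical per-visit record; field names are illustrative."""
    visit_date: date
    has_dementia_dx: bool
    has_mri: bool
    n_features: int  # number of non-missing features at this visit


def select_visit(visits):
    """Pick one visit: imaging first, then feature count, then recency.

    Raises ValueError if `visits` is empty.
    """
    # Prefer dementia-diagnosed visits; falling back to all visits
    # otherwise is an assumption, not stated in the source.
    candidates = [v for v in visits if v.has_dementia_dx] or list(visits)
    # Tuples compare left to right, so imaging availability dominates,
    # feature count breaks imaging ties, and the date breaks full ties.
    return max(candidates, key=lambda v: (v.has_mri, v.n_features, v.visit_date))
```

Encoding the priorities as a single tuple key keeps the tie-breaking order explicit and easy to change if the actual rule differs.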
Imaging preprocessing: NIFTI MRI volumes were skull-stripped (SynthStrip), reoriented (fslorient2std), linearly registered to MNI152 (FSL FLIRT), resampled/cropped, resized, and intensity-normalized to [0,1]. Subvolumes (128×128×128) were generated per sequence.
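The last two preprocessing steps (intensity normalization to [0,1] and fixed-size subvolumes) can be illustrated with NumPy. This is only a sketch: skull stripping, reorientation, and registration are done by the external tools named above and are not reproduced here, and the center crop or zero pad used to reach the target shape is an assumption, since the summary does not say how the 128×128×128 subvolumes were extracted. Both function names are hypothetical.

```python
import numpy as np


def min_max_normalize(volume):
    """Rescale voxel intensities linearly to the range [0, 1]."""
    v = volume.astype(np.float32)
    lo, hi = float(v.min()), float(v.max())
    if hi == lo:
        # Constant volume: avoid division by zero.
        return np.zeros_like(v)
    return (v - lo) / (hi - lo)


def center_crop_or_pad(volume, target=(128, 128, 128)):
    """Center-crop axes that are too long and zero-pad axes that are too short."""
    out = volume
    for axis, size in enumerate(target):
        n = out.shape[axis]
        if n > size:
            start = (n - size) // 2
            slicer = [slice(None)] * out.ndim
            slicer[axis] = slice(start, start + size)
            out = out[tuple(slicer)]
        elif n < size:
            before = (size - n) // 2
            pad = [(0, 0)] * out.ndim
            pad[axis] = (before, size - n - before)
            out = np.pad(out, pad, mode="constant")
    return out
```

Normalizing before padding keeps the padded background at 0, which is the lower end of the normalized range.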
Imaging embeddings: A 3D Swin UNETR encoder (pretrained with self-supervision on 3D medical volumes) extracted 768×4×4×4 features per subvolume. A learnable downsampling module (four convolutional blocks) produced a 256-dim embedding per sequence. Encoder weights were frozen; downsampling was trainable. Imaging embeddings were combined with non-imaging embeddings.
Multimodal transformer backbone: Numerical features were linearly projected; categorical features were embedded via lookup; imaging embeddings were treated as numerical tokens and projected. A transformer aggregated all tokens with attention to produce multilabel predictions. A feature masking mechanism randomly masked tokens to simulate arbitrary missingness and improve robustness. Missing labels across cohorts were handled by multilabel heads (13 binary heads for NC, MCI, DE, and 10 etiologies) with label-specific loss masking.
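The feature masking idea can be sketched in a few lines of plain Python. The sketch assumes that each feature token is either observed or missing, that training additionally hides observed tokens at random with some probability, and that at least one observed token is always kept visible. The masking probability, the keep-one rule, and the function name `random_feature_mask` are illustrative assumptions; the summary does not give these details.

```python
import random


def random_feature_mask(available, mask_prob, rng):
    """Return a list of booleans, True where a token should be hidden.

    available: list of bool, True where the feature was actually observed.
    mask_prob: probability of hiding an observed feature during training.
    rng: a random.Random instance, passed in for reproducibility.
    """
    # Truly missing features are always hidden; observed ones are hidden at random.
    hidden = [(not a) or (rng.random() < mask_prob) for a in available]
    # Assumption: never hide everything if at least one feature was observed.
    if all(hidden) and any(available):
        keep = rng.choice([i for i, a in enumerate(available) if a])
        hidden[keep] = False
    return hidden
```

A mask of this shape could be passed to an attention layer as a key padding mask, so that a case with a dropped modality at training time looks like a case that never had that modality.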
Loss and optimization: The total loss combined a per-label focal loss (to address class imbalance), a ranking loss encouraging positive labels to score higher than negative labels by a margin ε=0.25, and L2 regularization. Weights: λ=0.005, β=0.0005. Training used AdamW (lr=0.001), cosine annealing with warm restarts (first restart at epoch 64, cycle length doubling after each restart), batch size 128, and 256 epochs. The best model was selected by validation performance.
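A per-sample version of such a loss, including the label-specific masking for cohorts that lack some labels, might look like the sketch below. Several choices are assumptions rather than details from the summary: the focal exponent `gamma=2.0`, the hinge form of the ranking term, computing the ranking term on probabilities, and the assignment of λ to the ranking term and β to the L2 term. All function names are hypothetical.

```python
import math


def focal_loss(p, y, gamma=2.0):
    """Binary focal loss for one label; p is the predicted probability."""
    p = min(max(p, 1e-7), 1.0 - 1e-7)  # clamp for numerical safety
    pt = p if y == 1 else 1.0 - p      # probability assigned to the true class
    return -((1.0 - pt) ** gamma) * math.log(pt)


def ranking_loss(scores, labels, known, margin=0.25):
    """Mean hinge penalty when a positive does not beat a negative by `margin`."""
    pos = [s for s, y, k in zip(scores, labels, known) if k and y == 1]
    neg = [s for s, y, k in zip(scores, labels, known) if k and y == 0]
    if not pos or not neg:
        return 0.0
    total = sum(max(0.0, margin - (sp - sn)) for sp in pos for sn in neg)
    return total / (len(pos) * len(neg))


def total_loss(probs, labels, known, params=(), lam=0.005, beta=0.0005):
    """Focal + lam * ranking + beta * L2, skipping labels with known=False."""
    focal = sum(focal_loss(p, y) for p, y, k in zip(probs, labels, known) if k)
    rank = ranking_loss(probs, labels, known)
    l2 = sum(w * w for w in params)
    return focal + lam * rank + beta * l2
```

Because unknown labels are filtered out of both terms, a cohort that records only some of the 13 labels contributes gradient only through the labels it actually has.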
Interpretability: Approximate Shapley value analysis (permutation sampling) on NACC test cases (n=500 per class) identified top contributing features for NC, MCI, and dementia predictions. Missing features received zero Shapley values.
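Permutation sampling approximates each feature's Shapley value by averaging its marginal contribution over random feature orderings. The sketch below assumes that an "absent" feature is represented by a baseline value and that unobserved features are skipped, which gives them a Shapley value of exactly zero, consistent with the description above. The `model` callable, the baseline convention, and the function name are illustrative assumptions, not the authors' implementation.

```python
import random


def shapley_permutation(model, x, baseline, observed, n_permutations, rng):
    """Approximate Shapley values by sampling random feature orderings.

    model: callable mapping a list of feature values to a float score.
    x: feature values for the case being explained.
    baseline: values that stand in for a feature that is absent.
    observed: list of bool; unobserved features keep a zero attribution.
    """
    n = len(x)
    phi = [0.0] * n
    for _ in range(n_permutations):
        order = list(range(n))
        rng.shuffle(order)
        current = list(baseline)
        prev = model(current)
        for i in order:
            if not observed[i]:
                continue  # missing feature: stays at baseline, contributes 0
            current[i] = x[i]
            new = model(current)
            phi[i] += new - prev  # marginal contribution of feature i
            prev = new
    return [v / n_permutations for v in phi]
```

For an additive model the marginal contribution of a feature does not depend on the ordering, so the estimate is exact after a single permutation; for a model with interactions, more permutations reduce the sampling variance.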
Benchmarking: A CatBoost baseline was trained on two feature subsets (common minimal across cohorts; expanded NACC/ADNI subset). The proposed model was evaluated on both subsets without retraining.
Biomarker and neuropathological validation: Model probabilities were compared with gold-standard biomarkers: in NACC, binary UDS indicators for Aβ PET, tau PET, FDG PET for AD; FDG and MRI evidence for FTD; DaTscan for LBD. In ADNI, positivity thresholds: Aβ PET >20 centiloids; tau PET meta-temporal SUVr>1.74; FDG meta-ROI SUVr<1.21. Postmortem validation (NACC, FHS, ADNI) assessed P(AD) vs Thal Aβ phases, Braak stages, CERAD neuritic plaque density, and presence of CAA/arteriolosclerosis; P(VD) vs arteriolosclerosis and old microinfarcts; P(FTD) vs TDP-43 pathology.
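The ADNI positivity thresholds listed above translate directly into a small lookup. This sketch only encodes those three cutoffs; the function name, argument names, and dictionary keys are hypothetical, and measurements that are not supplied are simply left out of the result.

```python
def adni_biomarker_status(centiloid=None, tau_suvr=None, fdg_suvr=None):
    """Map ADNI PET measurements to positive/negative using the stated cutoffs.

    Any measurement passed as None is omitted from the returned dict.
    """
    status = {}
    if centiloid is not None:
        status["amyloid_pet"] = centiloid > 20   # Aβ PET > 20 centiloids
    if tau_suvr is not None:
        status["tau_pet"] = tau_suvr > 1.74      # meta-temporal SUVr > 1.74
    if fdg_suvr is not None:
        status["fdg_pet"] = fdg_suvr < 1.21      # meta-ROI SUVr < 1.21 (hypometabolism)
    return status
```

Note the direction of the FDG cutoff: lower uptake counts as positive, the opposite of the amyloid and tau thresholds.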
Clinician comparison and AI-augmentation: Twelve neurologists reviewed 100 randomly selected NACC cases (15 NC, 15 MCI, and 7 per etiology) with demographics, history, neuropsychological tests, functional scales, and MRI; they provided confidence scores (0–100) for NC, MCI, DE, and each etiology. Seven neuroradiologists reviewed 70 confirmed dementia cases with MRI and demographics, rating etiologies. AI-augmented scores were computed as the mean of clinician confidence and model probabilities. Performance was measured by AUROC and AUPR.
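The augmentation rule and its evaluation can be sketched as follows. The summary states that the augmented score is the mean of clinician confidence and model probability; rescaling the 0–100 confidence to 0–1 before averaging is an assumption made here so the two quantities share a scale. The AUROC helper uses the pairwise (Mann–Whitney) definition, and both function names are hypothetical.

```python
def augmented_score(clinician_confidence, model_probability):
    """Mean of clinician confidence (0-100, rescaled to 0-1) and model probability."""
    return 0.5 * (clinician_confidence / 100.0 + model_probability)


def auroc(scores, labels):
    """AUROC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative count as half a correct pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Since AUROC depends only on the ranking of cases, the rescaling matters through how much weight the clinician's score gets relative to the model's in the average, not through the absolute values.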
Statistics and metrics: Group differences in cohort characteristics were assessed by ANOVA and χ² tests. Kolmogorov–Smirnov (KS) tests compared P(AD) between AD and non-AD etiologies in MCI and dementia. Kruskal–Wallis with post-hoc Dunn’s tests evaluated P(DE) across CDR categories (NACC, ADNI) and panel labels (FHS). One-sided Mann–Whitney U or t-tests compared biomarker-positive and biomarker-negative groups. Brunner–Munzel tests compared confidence distributions. Interrater agreement was assessed via pairwise Pearson correlations and bootstrapping. ROC and PR curves were generated with micro-, macro-, and weighted-average AUROC/AUPR; additional metrics included accuracy, sensitivity, specificity, F1, and the Matthews correlation coefficient (MCC). The significance level was 0.05.
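The threshold-based metrics named above all derive from the four cells of a binary confusion matrix. A minimal sketch using the standard definitions follows; returning 0.0 when a denominator is zero is a convention chosen here, not something the summary specifies, and the function name is hypothetical.

```python
import math


def confusion_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity, F1, and MCC from confusion counts."""

    def safe_div(a, b):
        # Convention (assumption): undefined ratios are reported as 0.0.
        return a / b if b else 0.0

    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "sensitivity": safe_div(tp, tp + fn),   # true positive rate
        "specificity": safe_div(tn, tn + fp),   # true negative rate
        "f1": safe_div(2 * tp, 2 * tp + fp + fn),
        "mcc": safe_div(tp * tn - fp * fn, mcc_denominator),
    }
```

MCC uses all four cells, which makes it more informative than accuracy for rare labels such as PRD or NPH, where a classifier can be accurate simply by predicting the majority class.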
Computing: Python 3.11.7, PyTorch 2.1.0; NVIDIA RTX GPUs. Training ~7 minutes/epoch on Quadro RTX8000; inference <1 minute per instance.
- Overall cognitive status classification (NC/MCI/DE): On held-out NACC, ADNI, and FHS test sets, microaveraged AUROC=0.94 and AUPR=0.90; macroaveraged AUROC=0.93 and AUPR=0.84; weighted-average AUROC=0.94 and AUPR=0.87. Performance was consistent across age, gender, and race (micro-AUROC>0.88, micro-AUPR>0.82).
- Robustness to missing data: Despite ~69% feature missingness in ADNI relative to NACC, weighted-average AUROC=0.91 and AUPR=0.86 for NC/MCI/DE. With 94% fewer features in FHS vs NACC, weighted AUROC=0.68 and AUPR=0.53 for NC/MCI/DE. Random feature masking maintained reliable predictions across omitted modalities (MRI, UPDRS, GDS, NPI-Q, FAQ, neuropsych tests).
- Alignment with prodromal AD: P(AD) was higher in MCI cases with AD etiology than in MCI due to non-AD causes; similarly, P(AD) was higher in dementia when AD was the primary cause (significant differences; Table S9), supporting early-stage detection utility.
- Correlation with impairment severity: P(DE) increased with higher CDR scores in NACC and ADNI (P<0.0001), and with panel ratings (normal vs impaired vs dementia) in FHS (P<0.0001; normal vs impaired not significant), indicating sensitivity to clinical severity.
- Etiology-level performance (10 etiologies): Microaveraged AUROC=0.96 and AUPR=0.70; macroaveraged AUROC=0.91 and AUPR=0.36; weighted-average AUROC=0.94 and AUPR=0.73. Performance was stable across demographic subgroups (micro-AUROC>0.94, micro-AUPR>0.66).
- Mixed etiologies: For co-occurring dementias (≥25 positive cases), AUROC ranged 0.63–0.97 and AUPR 0.08–0.60; AD+VD+PSY achieved AUROC=0.73, AUPR=0.48. The abstract reports a mean AUROC of 0.78 for two co-occurring pathologies.
- Biomarker validation: P(AD) was higher in biomarker-positive vs negative groups for Aβ PET, tau PET, and FDG PET in NACC and ADNI (all P<0.0001). P(FTD) was higher with MRI/FDG evidence for FTD (NACC; P≤10⁻⁷). P(LBD) was higher in DaTscan-positive cases (NACC; P=6.26×10⁻⁶). Results align with ATN criteria and etiology-specific imaging biomarkers.
- Neuropathology validation: P(AD) increased with higher Thal Aβ phases (P=7.11×10⁻⁵), Braak stages (P=6.07×10⁻⁶), and CERAD neuritic plaque density (P=1.73×10⁻⁶), and was higher with CAA (P=0.01) and arteriolosclerosis (P=0.01). P(VD) was higher with arteriolosclerosis (P=0.0002) and old microinfarcts (P=0.0001). P(FTD) was higher with TDP-43 pathology (P=0.0008).
- AI-augmented clinicians: For neurologists (n=12; 100 cases), AI assistance increased mean AUROC by 26.25% and mean AUPR by 73.23% across labels (significant for all etiologies, P<0.05), with largest gains for PRD (AUROC +73%, AUPR +242%) and TBI (AUROC +72%, AUPR +257%). For neuroradiologists (n=7; 70 cases), AI assistance increased mean AUROC by 16.19% and AUPR by 41.79%, with significant AUROC improvement across most etiologies (except TBI, ODE) and largest AUPR gain in PRD (+200%).
The study demonstrates that a transformer-based multimodal AI model can perform robust differential diagnosis across the dementia spectrum, integrating demographics, clinical history, neuropsychological assessments, functional scales and multisequence MRI. The model maintains high discrimination for NC, MCI and dementia across diverse cohorts and demographic subgroups while preserving performance under substantial data missingness, reflecting real-world clinical variability. Crucially, etiology-specific probabilities align with established biomarkers (ATN for AD, FDG/MRI for FTD, DaTscan for LBD) and neuropathological hallmarks (Thal, Braak, CERAD, CAA, arteriolosclerosis, microinfarcts, TDP-43), supporting biological validity. The framework effectively addresses mixed dementias by outputting multilabel probabilities, aiding clinicians in prioritizing likely contributors to cognitive impairment. In head-to-head comparisons, AI-augmented clinician assessments significantly improved diagnostic accuracy (AUROC/AUPR) relative to clinician-only evaluations, suggesting practical utility as a decision-support tool across generalist and specialist settings. These findings address key gaps in prior AD-centric, imaging-only approaches by offering a generalizable, interpretable, and resilient model capable of guiding management pathways in complex, mixed-etiology dementia presentations.
This work introduces and validates a multimodal, transformer-based AI framework for differential dementia diagnosis across ten etiologies, robust to missing data and reflective of clinical reasoning by aligning categories with management pathways. The model achieves strong performance for cognitive status and etiology classification, handles co-occurring pathologies, and exhibits biologically grounded outputs corroborated by biomarkers and neuropathology. AI augmentation improves clinician diagnostic performance, indicating potential for integration into clinical workflows and trial screening. Future directions include prospective, multi-center validation across more diverse populations; assessment of impact on clinical outcomes and resource utilization; incorporation of disease staging and longitudinal trajectories; and deeper evaluation of AD heterogeneity and subtype-specific performance.
- Generalizability: Cohorts were predominantly White; external datasets (ADNI, FHS) showed some performance variability versus NACC, warranting broader demographic and geographic validation.
- Data imbalance and rare etiologies: Overrepresentation of AD may bias recognition toward AD versus less frequent etiologies (e.g., PRD, NPH, TBI, SEF), reflected in lower macro-averaged AUPR and variability in co-occurrence detection.
- Annotation uncertainty: Training labels derived from routine clinical practice may include inconsistencies; clinician variability was evident for challenging categories (SEF, TBI, ODE), potentially limiting model sensitivity.
- Feature availability: FHS analyses with limited features showed reduced discrimination in early impairment (normal vs impaired); although masking improves robustness, sparse inputs constrain performance.
- Disease staging: Mild, moderate, and severe dementia were combined as a single label; staging was not modeled and may affect management nuances.
- AD heterogeneity: The model does not explicitly address AD biological and clinical subtypes; stratified evaluations are needed.
- Imaging constraints: DWI was excluded due to 2D acquisition; broader imaging harmonization and additional modalities (e.g., PET) could further enhance performance.