
Identification of clinical disease trajectories in neurodegenerative disorders with natural language processing
N. J. Mekkes, M. Groot, et al.
Nienke J. Mekkes, Minke Groot, and colleagues apply natural language processing to medical record summaries of brain donors. The study identifies 84 neuropsychiatric signs and symptoms and uses them to examine clinical misdiagnosis and clinical manifestations across neurodegenerative disorders.
Introduction
Neurodegenerative disorders such as Alzheimer's disease (AD), frontotemporal dementia (FTD), Parkinson's disease (PD), dementia with Lewy bodies (DLB), vascular dementia (VD) and mixed dementias are heterogeneous, share overlapping clinical and pathological features, and are often clinically misdiagnosed—up to one-third of cases in some cohorts. Brain banks enable mechanistic research but typically lack structured and comprehensive clinical data, limiting integration of key clinical parameters and temporal profiles into postmortem studies. This study aims to systematically extract, harmonize and analyze clinical signs and symptoms from unstructured medical record summaries of Netherlands Brain Bank donors to construct clinical disease trajectories. The objectives are to: (1) map cross-disorder symptomatology and its timing relative to neuropathological diagnoses (NDs); (2) compare clinical diagnoses (CDs) with NDs to quantify misdiagnosis patterns; (3) build predictive models of ND from temporal symptom trajectories; and (4) identify symptom-driven clinical subtypes across and within disorders to inform diagnosis, prognosis and biological investigations.
Literature Review
Prior work has noted extensive heterogeneity within dementias and movement disorders and frequent misdiagnosis due to overlapping symptoms and comorbidity. Some studies have incorporated clinical diagnoses, symptom sets or temporal profiles, but comprehensive approaches combining standardized symptom ontologies, temporal trajectories, and neuropathologically confirmed diagnoses in large autopsy cohorts have been lacking. There is ongoing debate regarding whether synucleinopathies (PD, PDD, DLB, MSA) reflect shared pathology with region-specific manifestations versus distinct disease processes; temporal analyses of key symptoms may help differentiate them. Reported clinical–pathological discrepancies and their impact on research (e.g., GWAS, epidemiology) underscore the need for more accurate, temporally resolved clinical data linked to ND.
Methodology
Cohort and data source: 3,042 Netherlands Brain Bank (NBB) donors with semi-structured medical record summaries (1982–2020). Inclusion required >500 characters in clinical-neuropathological summaries. Neuropathologists assigned neuropathological diagnoses (NDs). Clinical diagnoses (CDs) were matched to a modified Human Disease Ontology.
Parsing and temporal alignment: Python-based parsers extracted 'clinical history' per sentence and year. Ambiguous or ranged time references were standardized to specific years; 'year unknown' entries were excluded from temporal modeling.
Ontology of signs/symptoms: A cross-disorder clinical categorization of 90 neuropsychiatric signs/symptoms across 5 domains (psychiatric, cognitive, motor, sensory/autonomic, general) with definitions, inclusion/exclusion criteria, and UMLS IDs. Attributes were iteratively defined and externally validated by a neurologist.
Labeling and gold standard: From 293 randomly selected donors, 18,917 sentences were labeled for 90 attributes by one scorer; 1,000 sentences were independently scored by a second scorer to assess interannotator agreement (Cohen’s κ=0.86).
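As an illustration of the agreement statistic, the sketch below computes Cohen's κ for a single attribute scored as present (1) or absent (0) by two scorers. The function name and the toy labels are hypothetical, and the summary does not state how agreement was pooled across the 90 attributes, so this shows only the single-attribute case.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters scoring the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)

# Toy data (not from the study): 1 = attribute present in the sentence.
scorer_1 = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
scorer_2 = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]
kappa = cohen_kappa(scorer_1, scorer_2)
```

With these toy labels the scorers agree on 9 of 10 sentences, yet κ is lower than 0.9 because part of that agreement is expected by chance.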
NLP model development: The task was multilabel sentence classification. Models compared: Bag-of-Words (logistic regression), linear SVM (One-vs-Rest), PubMedBERT, Bio_ClinicalBERT, and T5. Data were split with MultilabelStratifiedKFold: 80% for training/validation with stratified fivefold cross-validation, and 20% as a hold-out test set. Hyperparameters were optimized with Optuna (25–30 trials per model), selecting on micro-F1 first and then favoring micro-precision. Six underperforming attributes were excluded; the final set comprised 84 attributes with micro-precision ≥0.8 or micro-F1 ≥0.8 on test data. PubMedBERT was selected and fine-tuned on all labeled data, then applied to the full corpus (199,901 sentences). Predictions were aggregated per donor-year to construct binary year×attribute clinical disease trajectories.
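The final aggregation step can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the tuple layout of the input, and the rule that a single positive sentence marks an attribute as present for that year are all assumptions.

```python
from collections import defaultdict

def build_trajectories(predictions, attributes):
    """Collapse sentence-level predictions into binary year x attribute rows.

    predictions: iterable of (donor_id, year, predicted_attributes);
    year is None when the sentence has no usable year.
    Assumption: one positive sentence is enough to flag an attribute that year.
    """
    observed = defaultdict(set)  # (donor_id, year) -> attributes seen that year
    for donor_id, year, predicted in predictions:
        if year is None:
            continue  # cannot be placed on the timeline
        observed[(donor_id, year)] |= set(predicted)

    trajectories = defaultdict(dict)  # donor_id -> {year: [0/1 per attribute]}
    for donor_id, year in sorted(observed):
        seen = observed[(donor_id, year)]
        trajectories[donor_id][year] = [int(a in seen) for a in attributes]
    return dict(trajectories)

# Toy data (not from the study).
attributes = ["dementia", "fatigue", "memory impairment"]
sentence_predictions = [
    ("d1", 2001, {"memory impairment"}),
    ("d1", 2001, {"dementia"}),
    ("d1", 2003, set()),
    ("d1", None, {"fatigue"}),
    ("d2", 1999, {"fatigue"}),
]
trajectories = build_trajectories(sentence_predictions, attributes)
```

Each donor ends up with one binary row per observed year, in the column order given by `attributes`.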
Statistical analyses:
- Enrichment of signs/symptoms per ND via permutation tests (100,000 label permutations) with Benjamini–Hochberg FDR correction; overrepresentation of a priori diagnostic attributes assessed by χ² tests.
- Observational profiles: pairwise two-sided Mann–Whitney U-tests on number of year-level observations per donor per ND (FDR-corrected); sex-balanced subsampling for sensitivity analyses.
- Temporal profiles: pairwise Mann–Whitney U-tests on age-at-observation distributions (FDR-corrected) and kernel density plots.
- Survival analysis: Kaplan–Meier survival after first observation of a symptom; pairwise Mann–Whitney U-tests (FDR-corrected).
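The permutation-plus-FDR pattern used for the enrichment analysis can be sketched in plain Python. The test statistic here (number of donors in the diagnosis group who show the sign) and all function names are assumptions; the summary specifies only that labels were permuted 100,000 times and that Benjamini–Hochberg correction was applied.

```python
import random

def permutation_p_value(has_sign, in_group, n_permutations=10_000, seed=0):
    """One-sided p-value that a sign is enriched in a diagnosis group.

    has_sign, in_group: equal-length lists of 0/1, one entry per donor.
    """
    rng = random.Random(seed)
    observed = sum(s for s, g in zip(has_sign, in_group) if g)
    shuffled = list(in_group)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break the link between diagnosis and sign
        count = sum(s for s, g in zip(has_sign, shuffled) if g)
        at_least_as_extreme += count >= observed
    # Add-one correction avoids reporting an impossible p = 0.
    return (at_least_as_extreme + 1) / (n_permutations + 1)

def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

In practice one p-value would be computed per sign and diagnosis pair, and the whole list passed through `benjamini_hochberg` before thresholding.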
Diagnosis accuracy assessment: CDs were cleaned, mapped to ontology classes, and donors labeled as clinically 'accurate', 'ambiguous', or 'inaccurate' relative to ND using rule-based criteria. Agreement summarized via confusion matrices and Jaccard scores.
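For one diagnosis, a Jaccard score between ND and CD can be read as the overlap between the set of donors with that ND and the set of donors who received that CD. The sketch below assumes this set-based definition and a hypothetical data layout (one ND per donor, a set of CDs per donor); the summary does not spell out the exact computation.

```python
def jaccard_score(nd_by_donor, cd_by_donor, diagnosis):
    """Jaccard overlap between donors with a given ND and donors with that CD."""
    nd_donors = {d for d, nd in nd_by_donor.items() if nd == diagnosis}
    cd_donors = {d for d, cds in cd_by_donor.items() if diagnosis in cds}
    union = nd_donors | cd_donors
    if not union:
        return 0.0  # diagnosis absent from both sources
    return len(nd_donors & cd_donors) / len(union)

# Toy data (not from the study).
nd_by_donor = {"a": "AD", "b": "AD", "c": "DLB", "d": "AD"}
cd_by_donor = {"a": {"AD"}, "b": {"AD", "VD"}, "c": {"AD"}, "d": {"PD"}}
js_ad = jaccard_score(nd_by_donor, cd_by_donor, "AD")
js_dlb = jaccard_score(nd_by_donor, cd_by_donor, "DLB")
```

A score near 1 means the clinical and neuropathological labels pick out almost the same donors; a low score can arise from missed cases, from over-diagnosis, or from both.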
Predictive modeling: A GRU-D recurrent neural network (robust to missing temporal data) predicted ND from trajectories. The dataset was filtered to donors with a single ND label among common diagnoses (including AD-DLB) and to post-1997 autopsies. Signs associated with neurodegeneration in progressive disorders were forward-imputed after their first observation according to predefined rules. Fivefold stratified splits were used, with training for 50 epochs and with sex, age at death, and age at observation included as inputs. Performance was summarized by confusion matrices and proportions of accurate/ambiguous/inaccurate predictions.
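The forward-imputation step can be illustrated with a short sketch. The predefined rules themselves are not listed in this summary, so the set of attributes treated as persistent is a hypothetical input, and only years that already have an entry in the trajectory are filled.

```python
def forward_impute(trajectory, persistent_attributes):
    """Carry selected attributes forward from their first observed year.

    trajectory: {year: set_of_observed_attributes} for one donor.
    persistent_attributes: attributes assumed not to remit once observed
    (hypothetical stand-in for the study's predefined rules).
    """
    carried = set()
    imputed = {}
    for year in sorted(trajectory):
        observed = set(trajectory[year])
        carried |= observed & persistent_attributes
        imputed[year] = observed | carried
    return imputed

# Toy data (not from the study).
donor_trajectory = {2001: {"memory impairment"}, 2003: {"fatigue"}, 2005: set()}
imputed = forward_impute(donor_trajectory, {"memory impairment"})
```

Here 'memory impairment' persists into 2003 and 2005 once observed, while 'fatigue' is left as recorded because it is not in the persistent set.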
Dimensionality reduction and clustering: Seurat with weighted nearest neighbors (WNN) on two assays derived from trajectories: (1) lifetime observation counts per attribute (flattened) and (2) temporal counts in overlapping age bins (e.g., 15–45, 20–50, 25–55). Standard normalization, scaling, PCA per assay, followed by FindMultiModalNeighbors, FindClusters, and UMAP visualization. Overrepresentation of NDs and inaccurate CDs tested by one-sided Fisher’s exact tests (FDR-corrected). FindMarkers identified cluster-defining attributes. Subclustering performed within major clusters (DEM, PD*, MS*, PSYCHIATRIC). Validation via enrichment for APOE ε4/ε4 genotype (Fisher’s exact test).
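The overlapping age bins behind the temporal assay can be sketched as below. The 30-year width and 5-year step are inferred from the examples given (15–45, 20–50, 25–55); the upper age limit of 100 and the half-open interval convention are assumptions.

```python
def overlapping_age_bins(start=15, stop=100, width=30, step=5):
    """Return (low, high) age windows such as (15, 45), (20, 50), (25, 55)."""
    return [(low, low + width) for low in range(start, stop - width + 1, step)]

def temporal_counts(ages_at_observation, bins):
    """Count observations of one attribute per window, using low <= age < high."""
    return [
        sum(low <= age < high for age in ages_at_observation)
        for low, high in bins
    ]

bins = overlapping_age_bins()
# Toy data (not from the study): ages at which one attribute was observed.
counts = temporal_counts([44, 46, 52], bins[:3])
```

Because the windows overlap, a single observation contributes to several neighboring bins, which smooths the temporal profile compared with disjoint bins.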
Software and code: Python 3.8 (Pandas, Scikit-learn, Optuna, SimpleTransformers, SciPy, statsmodels), R (Seurat). Fine-tuned models and code are publicly available; datasets are provided in Supplementary materials and via project websites, with original summaries accessible through NBB under data use restrictions.
Key Findings
- Data resource and NLP performance:
- Constructed clinical disease trajectories for 3,042 donors across 84 reliably predicted signs/symptoms from 90 initial attributes.
- Interannotator agreement: Cohen’s κ=0.86.
- Enrichment of labeled attributes known to be diagnostically important: χ²=171.28, P=1×10^-31; for predicted set: χ²=295.96, P=2.5×10^-66.
- Cross-disorder enrichment patterns:
- ‘Dementia’ and ‘memory impairment’ enriched in dementias (AD, FTD, DLB, VD, PDD) but not in PD without dementia.
- MS enriched for ‘impaired mobility’, ‘muscle weakness’, and ‘fatigue’.
- Differential motor attributes: ‘impaired mobility’ enriched in MS, PD, PDD, PSP, ATAXIA, MSA; ‘muscle weakness’ in VD, MND, PSP, MSA, MS.
- Features distinguishing frequently misdiagnosed dementias/movement disorders:
- AD: ‘paranoia’, ‘façade behavior’ uniquely enriched.
- VD: ‘hearing problem’, ‘muscle weakness’.
- PDD: ‘depressed mood’.
- DLB: ‘apraxias’.
- MSA: ‘ataxia’, ‘muscle fasciculation’.
- PSP: ‘visual impairment’.
- Temporal and survival profiling:
- ‘Dementia’ observed at younger ages in FTD than in other dementias; after first ‘dementia’ observation, survival shorter in VD, PD, PDD versus AD or FTD.
- ‘Bradykinesia’ observed earlier in MSA; survival after first ‘bradykinesia’ shorter in MSA, PSP, DLB than in PD/PDD—supporting qualitative differences among synucleinopathies.
- Mixed dementias (e.g., AD-VE, AD-PD) show later ‘dementia’ onset than AD/VD; AD, DLB, FTD exhibit longer survival post-‘dementia’ than several other dementia subtypes.
- FTD subtypes: lower ‘dementia’ observations in PSP; higher ‘compulsive behavior’ in FTD-TDP-B/C; earliest ‘dementia’ onset in FTD-TAU/CBD; latest in PiD/PSP.
- Clinical vs neuropathological diagnosis accuracy (examples; 'ND x% also CD' is the share of donors with that ND who also received the matching CD; JS is the Jaccard score):
- AD: ND 84% also CD; JS=0.642.
- FTD: ND 83% also CD; JS=0.466.
- MSA: often clinically called PD; JS=0.465.
- VD: ND 49% also CD; JS=0.117.
- PSP: clinically mapped to multiple disorders; JS=0.510.
- DLB: ND 69% also CD; JS=0.116.
- AD-DLB: most clinically labeled AD only; JS=0.138.
- MS: ND 97% also CD; JS=0.951.
- Predictive modeling (GRU-D, n=1,810 donors):
- Model: 1,342 accurate, 83 ambiguous, 385 inaccurate predictions.
- Clinicians: 1,236 accurate, 311 ambiguous, 263 inaccurate.
- GRU-D outperformed CD for FTD, performed similarly for AD and PD, and performed worse for MS and PSP; both performed poorly for DLB, VD, MND, and MSA. The best performance was observed for diagnoses with ≥100 training cases; rare/mixed diagnoses were often missed.
- Some donors were consistently misclassified by both clinicians and model, indicating atypical symptomatology.
- Symptom-driven clusters (WNN-UMAP; six main clusters):
- 1: LATE-DEM; 2: PD* (extrapyramidal); 3: EARLY-DEM; 4: CTRL/ASYM.; 5: MS* (motor predominant); 6: PSYCHIATRIC.
- Inaccurately diagnosed donors overrepresented outside their typical disease clusters (e.g., inaccurate AD often in PD*; inaccurate MSA in dementia clusters), indicating masquerading symptom patterns.
- Genetic validation: APOE ε4/ε4 was significantly enriched in EARLY-DEM (P=5.50×10^-10), modestly enriched in LATE-DEM (P=1.32×10^-3), and underrepresented in CTRL/ASYM. (P=2.87×10^-10).
- Subclustering insights:
- Dementia supercluster: s-LATE-DEM (AD, DEM-SICC, inaccurate FTD-TDP), s-EARLY-DEM (FTD-TDP, FUS, TAU, PiD; younger onset; more ‘compulsive behavior’), MOTOR-DEM (motor attributes; enriched inaccurate AD), PSYCH-DEM (DLB, DLB-SICC, PD, PD-AD, psychiatric donors).
- PD* supercluster: EARLY-PD*, LATE-PD* (narrow symptom range), EARLY-/LATE-MENTAL-PD* (broader cognitive/psychiatric attributes), separating onset age and mental features as independent axes.
- MS* supercluster: SENSORY-MS (sensory/autonomic, fatigue), COG/PSYCH-MS (cognitive/psychiatric), VERBAL-MOTOR-DIS (later onset; enriched ALS/other MNDs, controls, MSA).
- PSYCHIATRIC: PSY-DEP (enriched for controls; ‘depressed mood’), PSY-MANIC (enriched for BP; ‘mania’ and extrapyramidal signs), PSY-PSYCHOSIS (enriched for SCZ; early onset; ‘psychosis’, ‘feeling suicidal’).
Discussion
The study addresses the challenge of clinical heterogeneity and frequent misdiagnosis in neurodegenerative disorders by converting unstructured medical summaries into standardized, temporally resolved clinical disease trajectories. Validation via enrichment analyses, temporal and survival profiling demonstrates that predicted symptom patterns align with clinical expectations (e.g., earlier dementia in FTD, distinct bradykinesia dynamics across synucleinopathies). Comparative analysis of clinical versus neuropathological diagnoses quantifies disease-specific misdiagnosis and highlights atypical donors whose symptoms masquerade as other disorders. Predictive modeling shows that trajectory-based machine learning can match or exceed clinical diagnosis for some disorders but struggles with rare/mixed cases, emphasizing the need for larger, more balanced datasets. Symptom-driven clustering and subclustering reveal cross-diagnostic clinical subtypes (e.g., early/late dementia, psychiatric and motor-rich dementia subtypes; PD subtypes distinguished by onset age and mental features; MS sensory vs cognitive/psychiatric vs motor-verbally impaired) and provide orthogonal genetic validation through APOE ε4/ε4 enrichment. Collectively, these findings underscore the value of integrating structured clinical parameters with pathology to refine diagnosis, understand heterogeneity, and guide targeted biomarker discovery and individualized research designs.
Conclusion
This work delivers a unique, large-scale resource of clinical disease trajectories for 3,042 brain donors, generated via an NLP pipeline that standardizes neuropsychiatric signs and symptoms into temporal profiles linked to neuropathology. The resource enables: (1) cross-disorder symptom enrichment and temporal/survival analyses; (2) quantification of clinical–pathological diagnostic discrepancies; (3) trajectory-based prediction of neuropathological diagnoses; and (4) discovery of symptom-driven clinical subtypes across and within disorders, supported by genetic validation. These advances provide a roadmap for integrating clinical parameters into postmortem research and for advancing more personalized understandings of neurodegeneration. Future work should expand cohorts (especially rare/mixed disorders), harmonize data across brain banks, incorporate additional clinical covariates (comorbidities, treatments), and further refine models to achieve clinically actionable diagnostic support.
Limitations
- Missingness and sampling bias in retrospectively summarized clinical records; not all signs/symptoms are assessed at every visit. Year-level collapsing, imputation rules and statistical methods mitigate but do not eliminate bias.
- Potential labeling errors in training data and prediction errors from the supervised NLP approach.
- The symptom ontology, although curated and neurologist-validated, may omit relevant attributes or nuances.
- Temporal and survival differences may be confounded by unmeasured variables (comorbidities, treatments, care pathways).
- Neuropathological diagnoses were assigned by different pathologists over extended time spans, potentially introducing variability.
- Cohort composition biases (e.g., higher education, predominantly Dutch/Caucasian ancestry, enrichment for brain disease among donors) may limit generalizability.
- Predictive modeling performance is constrained by class imbalance and limited sample sizes for rare/mixed diagnoses, reducing sensitivity in these groups.