Multi-modality machine learning predicting Parkinson's disease

M. B. Makarious, H. L. Leonard, et al.

This study uses multimodal data and machine learning to predict Parkinson's disease risk. Developed with the GenoML package, the model shows potential for large-scale screening and identifies the University of Pennsylvania Smell Identification Test (UPSIT) and the polygenic risk score (PRS) as key predictive features.

Introduction

The study addresses the need for early and accurate diagnosis of Parkinson’s disease (PD), where first-visit clinical diagnoses are only about 80% accurate in pathologically confirmed cases. With growing availability of clinical, demographic, and genomic datasets and advances in automated ML, the authors aim to build cost-effective, data-driven, and scalable multimodal models that integrate clinico-demographic, genetic, and transcriptomic data to improve PD risk prediction near diagnosis. Using GenoML, an open-source AutoML framework, they test and tune models across multiple algorithms, seeking not only to enhance prediction accuracy versus single-biomarker approaches but also to provide biological insights via network analyses and drug-gene interaction exploration.

Literature Review

Prior PD diagnostic and risk prediction efforts often used single modalities and linear models. ML studies have explored CSF biomarkers, neuroimaging, RNA-based signatures, movement metrics, and wearable sensor data, often achieving good classification performance but with higher cost or limited accessibility. UPSIT (olfaction) has consistently been a strong predictor but lacks PD specificity. Reviews of ML in PD emphasize varied modalities and methodologies. The present work differentiates itself by integrating readily accessible clinico-demographic data with genetics (including PRS) and whole-blood transcriptomics, trained and validated in publicly available cohorts (PPMI, PDBP) to maximize transparency, reproducibility, and scalability.

Methodology

Study cohorts and data: Individual-level baseline data came from AMP-PD, with PPMI used for training and PDBP for external validation. After exclusions (misdiagnoses, genetically enriched sub-studies, relatedness, non-European ancestry outliers, excessive missingness), the final PPMI training cohort included 427 PD cases and 171 controls, and the PDBP validation cohort included 804 cases and 442 controls. Features included clinico-demographics (sex, age, education, family history, inferred Ashkenazi ancestry, UPSIT), genetics (WGS-derived variants filtered using PD GWAS p-value thresholds; PRS from genome-wide significant loci), and transcriptomics (whole-blood RNA-seq protein-coding genes; variance-stabilized counts adjusted for age, sex, plate, percentage usable bases, and 10 RNA PCs).

Data processing:
- DNA and RNA principal components computed separately (10 PCs each). Genetic variant dosages adjusted for DNA PCs and transcript counts adjusted for RNA PCs to mitigate population structure and technical confounding; all features then Z-transformed.
- Genetic pruning removed long-LD regions (e.g., HLA) while retaining known PD risk SNPs.
- Preliminary feature filtering used p-value thresholds from PD GWAS (genetics) and differential expression (transcriptomics); crossing the thresholds for the two modalities gave 49 data combinations (genetics p ≤ 1E-2..1E-8; transcriptomics p ≤ 1E-2..1E-8).
- Feature selection employed extremely randomized trees (extraTrees) on combined modalities to remove redundant/low-impact features and limit overfitting; correlation-based pruning ensured low inter-feature correlation among top features.

Model development:
- PPMI split 70:30 (train:test) for algorithm comparison across 12 classifiers (scikit-learn estimators plus XGBoost's XGBClassifier): LogisticRegression, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, SGDClassifier, SVC, MLPClassifier, KNeighborsClassifier, LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis, BaggingClassifier, XGBClassifier.
- Best algorithm selected by AUC and balanced accuracy in withheld PPMI samples; AdaBoostClassifier performed best for the combined modality model.
- Hyperparameter tuning of AdaBoost performed via fivefold cross-validation across estimator counts (1–1000), iterated 25 times in full PPMI (no fixed holdout).
- Post hoc threshold optimization using Youden’s J index to improve balanced accuracy under class imbalance, applied in withheld PPMI and again in PDBP validation without retraining or reweighting.

Model interpretation and biology:
- SHAP used for feature importance (surrogate XGBoost for interpretability), with an interactive web application for per-sample decision plots.
- Case-only gene co-expression network constructed from top transcriptomic features; Leiden community detection yielded communities; drug target enrichment analyzed using DrugBank and GLAD4U.

Software and reproducibility: GenoML (open-source) and scikit-learn pipelines were used; code, models, and figures are publicly available on GitHub; AMP-PD data are access-controlled.
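The algorithm comparison step described under Model development can be sketched roughly as follows. This is an illustration only, not the GenoML implementation: the data are synthetic (AMP-PD data are access-controlled), only three of the twelve classifiers are included, and the helper name `compare_classifiers` is hypothetical.

```python
# Sketch only: synthetic data stands in for the access-controlled AMP-PD features.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split


def compare_classifiers(X, y, candidates, seed=0):
    """Fit each candidate on 70% of the data and score it on the withheld 30%."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, stratify=y, random_state=seed
    )
    results = {}
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        prob = model.predict_proba(X_test)[:, 1]  # probability of the case class
        results[name] = {
            "auc": roc_auc_score(y_test, prob),
            "balanced_accuracy": balanced_accuracy_score(
                y_test, model.predict(X_test)
            ),
        }
    return results


# Assumed stand-in: about 600 samples with a case-heavy imbalance similar to PPMI.
X, y = make_classification(
    n_samples=600, n_features=30, n_informative=8,
    weights=[0.3, 0.7], random_state=0,
)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
    "AdaBoostClassifier": AdaBoostClassifier(random_state=0),
}

results = compare_classifiers(X, y, candidates)
best = max(results, key=lambda name: results[name]["auc"])
for name, scores in results.items():
    print(f"{name}: AUC={scores['auc']:.3f}, "
          f"balanced accuracy={scores['balanced_accuracy']:.3f}")
print("Best by AUC:", best)
```

Reporting balanced accuracy alongside AUC matters here because the cohort is imbalanced: a model that labels nearly everyone a case can reach high raw accuracy while its balanced accuracy stays near 50%.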

Key Findings

Predictive performance:
- In withheld PPMI samples (70:30 split), combined multimodal model (AdaBoost) outperformed single-modality models:
  • Combined: AUC 89.72%, Accuracy 85.56%, Balanced accuracy 82.41%, Sensitivity 0.89, Specificity 0.76, PPV 0.91, NPV 0.73.
  • Clinico-demographic: AUC 87.52%, Accuracy 79.44%, Balanced accuracy 75.27%.
  • Transcriptomics-only: AUC 79.73%, Accuracy 73.89%, Balanced accuracy 54.60%.
  • Genetics-only: AUC 70.66%, Accuracy 70.00%, Balanced accuracy 60.64%.
- Cross-validation/tuning in PPMI improved stability and performance: Combined model mean AUC during CV increased from 86.99% (baseline) to 90.17% (tuned), with reduced variance.
- External validation in PDBP:
  • Untuned combined model: AUC 83.84%, Accuracy 75.81%, Balanced accuracy 69.31%, Sensitivity 0.93, Specificity 0.46.
  • Tuned combined model: AUC 85.03%, Accuracy 75.00%, Balanced accuracy 68.09%, Sensitivity 0.93, Specificity 0.43.
- Threshold optimization (Youden’s J):
  • PPMI withheld (threshold 51%): Accuracy 85%, Balanced accuracy 83.95%, Sensitivity 0.86, Specificity 0.82, PPV 0.93, NPV 0.69.
  • PDBP validation (threshold 51%): Accuracy 78.58%, Balanced accuracy 77.97%, Sensitivity 0.80, Specificity 0.76, PPV 0.85, NPV 0.68.
- Statistical comparison: Combined model consistently outperformed clinico-demographic model in PPMI (t-test statistic ≈ 10.23; p ≈ 9.95e-23).

Feature contributions:
- Model used 51 SNPs and 418 protein-coding transcripts plus clinico-demographics and PRS.
- SHAP analyses identified UPSIT and PRS as top contributors; many smaller-effect transcripts and SNPs augmented accuracy.
- Directionality varied across genetic features; sex did not contribute in final model due to balanced distribution in PPMI.

Network and enrichment:
- Case-only RNA network: 13 gene communities (300 genes), modularity 0.794.
- Drug enrichment: DrugBank enrichment for fostamatinib targets (FDR 2.21e-4; genes MYLK, EPHA8, HCK, DYRK1B, BUB1B-PAK6) and copper (FDR 0.0286; HSP90AA, CBX5, HSPD1). GLAD4U enrichment for L-lysine-related genes (FDR 0.0057). Gamma-hydroxybutyric acid repeatedly top unadjusted hit (genes SLC16A7, SLC16A3, GABBR1).

Translational implications:
- At estimated PD prevalence of ~2%, optimized PPMI model yields PPV ~8.75% and NPV ~99.66%; low false omission rate suggests usefulness for large-scale screening to flag high-risk individuals for follow-up (biobanks, registries, trial recruitment).
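The prevalence-adjusted PPV and NPV quoted under Translational implications follow from Bayes' rule applied to sensitivity, specificity, and prevalence. The sketch below plugs in the rounded optimized PPMI values reported above (sensitivity 0.86, specificity 0.82) at an assumed 2% prevalence, so it only approximates the quoted figures, which were presumably computed from unrounded inputs. The function names are illustrative.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    false_pos = (1 - specificity) * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


def balanced_accuracy(sensitivity, specificity):
    """Balanced accuracy is the mean of sensitivity and specificity."""
    return (sensitivity + specificity) / 2


# Rounded values from the threshold-optimized PPMI model; 2% prevalence is
# the estimate used in the text.
ppv, npv = predictive_values(sensitivity=0.86, specificity=0.82, prevalence=0.02)
print(f"PPV = {ppv:.2%}, NPV = {npv:.2%}")
print(f"Balanced accuracy = {balanced_accuracy(0.86, 0.82):.2%}")
```

The same function shows why PPV is 0.93 in the case-enriched PPMI test set but drops below 10% at population prevalence: false positives among the large unaffected majority dominate, while the NPV rises because very few negatives are true cases.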

Discussion

Integrating clinico-demographic data with genetics (including PRS) and whole-blood transcriptomics improves PD diagnostic classification versus single-modality approaches. The multimodal model enhances balanced accuracy, crucial for diseases with low prevalence. UPSIT provides strong signal but lacks disease specificity; genetics/PRS add PD-specific information, and transcriptomics contributes additional, diverse signals. The approach generalizes across cohorts, though performance attenuates in PDBP due to cohort differences (diagnosis timing, medication status, UPSIT and age distributions). Post hoc threshold optimization can rebalance sensitivity and specificity for external datasets. Beyond prediction, feature selection and case-only network analysis reveal biologically coherent gene communities and potential drug-gene interactions, which may inform therapeutic target nomination. The open-source, transparent pipeline facilitates reproducibility, transfer learning, and practical deployment in large healthcare systems for risk stratification and trial enrichment.
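The post hoc threshold optimization mentioned above uses Youden's J, defined as sensitivity + specificity - 1, and picks the probability cutoff that maximizes it. A minimal sketch, assuming simulated predicted probabilities in place of real model output and a hypothetical helper name `youden_threshold`:

```python
import numpy as np
from sklearn.metrics import roc_curve


def youden_threshold(y_true, y_score):
    """Return (threshold, sensitivity, specificity) at the maximum of Youden's J."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score, drop_intermediate=False)
    j = tpr - fpr  # equals sensitivity + specificity - 1
    best = int(np.argmax(j))
    return thresholds[best], tpr[best], 1 - fpr[best]


# Simulated scores (an assumption for illustration): 300 controls, 700 cases,
# with cases scoring higher on average.
rng = np.random.default_rng(0)
y_true = np.array([0] * 300 + [1] * 700)
y_score = np.clip(
    np.concatenate([rng.normal(0.40, 0.15, 300), rng.normal(0.65, 0.15, 700)]),
    0.0,
    1.0,
)

threshold, sens, spec = youden_threshold(y_true, y_score)
print(f"threshold={threshold:.2f}, sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Because only the cutoff changes, the fitted model is left untouched, which matches the description of applying the optimization in PDBP without retraining or reweighting.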

Conclusion

This work demonstrates that a transparent, AutoML-driven multimodal framework (GenoML) combining clinico-demographics, genetics/PRS, and blood transcriptomics yields improved peri-diagnostic PD prediction, validated externally. The model achieves strong AUC and balanced accuracy, with UPSIT and PRS as principal contributors, and identifies PD-relevant gene communities and drug-target enrichments. The approach is suited for large-scale screening to prioritize individuals for further evaluation and for clinical trial recruitment. Future directions include expanding to more diverse genetic ancestry groups (via GP2), integrating additional predictors (e.g., REM sleep behavior disorder, gastrointestinal features) when harmonized across cohorts, exploring ensemble/transfer learning with other modalities (e.g., imaging, wearable sensors), incorporating sex chromosome data, and prospective validation in pre-diagnostic or prodromal populations.

Limitations

- Limited ancestral diversity; models trained/validated in European ancestry may not generalize across ancestries.
- External validation dataset (PDBP) differs in recruitment (diagnosis timing, medication status, lack of DaTScan confirmation), attenuating performance relative to PPMI.
- Some established predictors (e.g., constipation, RBD) were not included due to feature selection results or sparsity/availability, potentially limiting performance.
- Lower specificity in external validation at default thresholds; class imbalance and real-world prevalence impact PPV/NPV.
- Time-varying features (UPSIT, RNA-seq, age) may peak at diagnosis, potentially reducing pre-diagnostic performance.
- Sex chromosome data (X/Y) unavailable in AMP-PD v1.
- Potential overfitting mitigated but still possible given hypothesis-free RNA features; reliance on public datasets may introduce cohort-specific biases.
