Estimating individual treatment effect on disability progression in multiple sclerosis using deep learning
J. R. Falet, J. Durso-Finley, et al.
The study addresses the lack of reliable biomarkers of disability progression in multiple sclerosis (MS), especially in progressive forms, which makes efficient phase 2 trials difficult. Traditional MRI endpoints that work for focal inflammatory activity in relapsing-remitting MS (RRMS) are not predictive enough for progression independent of relapses. The hypothesis is that predictive enrichment, using individualized treatment effect estimates derived from baseline clinical and MRI features, can identify the patients most likely to benefit and so enable smaller, shorter proof‑of‑concept trials. The purpose is to develop and validate a deep learning framework that estimates the conditional average treatment effect (CATE) on disability progression, improving trial power and feasibility and potentially accelerating drug development for progression in MS.
Predictive enrichment is a recognized strategy to increase trial power by selecting subgroups likely to respond (Temple, 2010). Prior MS work (Bovis et al., 2019) used Cox proportional hazards models to identify responders in RRMS (e.g., to laquinimod). In treatment effect estimation, uplift modeling and meta-learners (e.g., T‑learner) are common, with many tree-based and meta-learning approaches surveyed by Gutierrez & Gérardy. A recent meta-learning approach predicted treatment effect on MRI lesion activity in RRMS from baseline MRI and clinical variables (Durso-Finley et al., 2022). Prior subgroup analyses in PPMS (e.g., OLYMPUS) suggested younger age and presence of gadolinium-enhancing lesions predict greater response to B‑cell depleting therapy. However, single biomarkers like brain atrophy have shown limited correlation with EDSS progression over long follow-up in PPMS, RRMS, and SPMS, limiting their utility as phase 2 surrogates.
Design and data: Individual participant data were pooled from six randomized clinical trials (n=3830): the RRMS trials OPERA I, OPERA II, and BRAVO, and the PPMS trials ORATORIO, OLYMPUS, and ARPEGGIO. The data were partitioned into (1) an RRMS subset for pre‑training (n=2520); (2) a PPMS anti‑CD20 subset from ORATORIO and OLYMPUS (n=992), split 70%/30% into fine‑tuning (n=695) and test (n=297) sets; and (3) the PPMS laquinimod trial ARPEGGIO (n=318) as an independent test set. Participants with <24 weeks' follow‑up, <2 clinical visits, or missing baseline features were excluded.
Features and outcome: Nineteen baseline features recorded at screening were used, including sex (binary), EDSS and Functional Systems Scores (ordinal), age, height, weight, disease duration (from symptom onset), timed 25‑foot walk (T25FW), nine‑hole peg test (9HPT) for the dominant and non‑dominant hands, gadolinium‑enhancing lesion count, T2 lesion volume, and normalized brain volume. The primary training target was each individual's slope of EDSS change over time, obtained by linear regression of EDSS across visits; this directly models the rate of disability progression and avoids limitations of time‑to‑CDP24. Evaluation used time to 24‑week confirmed disability progression (CDP24) for comparability with trial endpoints and to compute survival‑based effect sizes.
Causal framework and model: The individual treatment effect is framed as the CATE, τ(x) = μ1(x) − μ0(x), where μ1(x) and μ0(x) are the expected outcomes under treatment and control given baseline features x. A neural T‑learner variant was implemented: an ensemble of multi‑headed MLPs with a shared trunk and treatment‑specific heads that predict EDSS slope on the active and placebo arms. Because a lower EDSS slope is the favourable outcome, reported τ(x) values were multiplied by −1 so that positive values indicate benefit. Pre‑training used a 5‑headed MLP on the RRMS arms; fine‑tuning froze the shared trunk and trained two new heads on PPMS anti‑CD20 vs placebo.
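As a minimal sketch of the quantities described above, the Python below computes a per‑participant EDSS slope by least squares and forms a sign‑flipped T‑learner estimate. It is not the authors' implementation: a single‑feature linear model per arm stands in for the multi‑headed MLP, and the names `fit_line`, `edss_slope`, and `t_learner_benefit` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit y = intercept + slope * x.

    Returns (intercept, slope).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def edss_slope(visit_years, edss_scores):
    """Training target for one participant: EDSS change per year across visits."""
    return fit_line(visit_years, edss_scores)[1]


def t_learner_benefit(feature, slopes, treated, new_feature):
    """T-learner with one linear model per arm (assumed stand-in for the MLP heads).

    Returns -(mu1(x) - mu0(x)), so a positive value means predicted benefit.
    """
    arm1 = [(x, y) for x, y, t in zip(feature, slopes, treated) if t]
    arm0 = [(x, y) for x, y, t in zip(feature, slopes, treated) if not t]
    a1, b1 = fit_line(*zip(*arm1))   # outcome model fitted on the active arm
    a0, b0 = fit_line(*zip(*arm0))   # outcome model fitted on the placebo arm
    mu1 = a1 + b1 * new_feature      # predicted EDSS slope under treatment
    mu0 = a0 + b0 * new_feature      # predicted EDSS slope under placebo
    return -(mu1 - mu0)
```

Fitting the two arms separately is what makes this a T‑learner; the shared trunk in the study's network is a refinement that this sketch does not attempt.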
The architecture included one common hidden layer and one treatment‑specific hidden layer with ReLU activations. Hyperparameters tuned by randomized search were learning rate, momentum, L2 weight decay, hidden width, max‑norm constraint, and dropout.
Training procedure: Models were trained by mini‑batch gradient descent with momentum, using stratified batches that preserved treatment allocation proportions. The loss was the MSE computed on the head corresponding to each patient's allocated arm; squared errors per head were weighted by n_i/(m·n) to compensate for arm imbalance. Overfitting was controlled with 4‑fold cross‑validation and early stopping at the epoch of lowest validation MSE (maximum 100 epochs), together with dropout, L2 regularization, and a max‑norm constraint. The models from the CV folds were ensembled by averaging their predictions. Model selection during tuning used crogging (cross‑validation aggregation): validation predictions were aggregated across folds to compute metrics, and the models with the highest ADwabc were selected among those within 1 SD of the best factual MSE.
Evaluation metrics: Predictive enrichment performance was assessed by the average difference curve AD(c), summarised by ADwabc (area‑weighted). Survival analyses used Kaplan–Meier curves and Cox proportional hazards (CPH) models (hazard ratios, log‑rank p‑values). A simulation estimated sample sizes for placebo‑controlled 1‑ or 2‑year trials under varying enrichment thresholds, using the observed CDP24 rates and HRs in predicted responder subgroups (power 80%, α=0.05, 2:1 randomization).
Baselines: The model was compared against ridge regression and CPH T‑learners using the same features and targets, single‑feature heuristics (including ratios to disease duration), an MLP without RRMS pre‑training, and a prognostic MLP trained only to predict placebo slope (using predicted placebo progression as a proxy for potential treatment effect).
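The average difference curve can be illustrated with a short Python sketch. It assumes AD(c) is the placebo‑minus‑treated difference in mean outcome (here, EDSS slope) among participants whose predicted benefit is at or above the c‑th quantile. The study's exact estimator and the weighting behind ADwabc are not reproduced, and the function names are hypothetical.

```python
def average_difference(benefit_scores, outcomes, treated, c):
    """AD(c) under the stated assumption, for 0 <= c < 1.

    Keeps the top (1 - c) fraction of participants ranked by predicted
    benefit, then returns mean placebo outcome minus mean treated outcome.
    """
    ranked = sorted(zip(benefit_scores, outcomes, treated),
                    key=lambda row: row[0], reverse=True)
    keep = max(1, round(len(ranked) * (1.0 - c)))
    subgroup = ranked[:keep]
    on_drug = [y for _, y, t in subgroup if t]
    on_placebo = [y for _, y, t in subgroup if not t]
    if not on_drug or not on_placebo:
        raise ValueError("subgroup needs participants in both arms")
    return sum(on_placebo) / len(on_placebo) - sum(on_drug) / len(on_drug)


def ad_curve(benefit_scores, outcomes, treated,
             thresholds=(0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7)):
    """Evaluate AD(c) at each threshold; returns a list of (c, AD(c)) pairs."""
    return [(c, average_difference(benefit_scores, outcomes, treated, c))
            for c in thresholds]
```

If the model ranks responders well, AD(c) should rise as c increases, which is the near‑monotonic behaviour the study summarises with a Spearman correlation.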
- Predictive ranking on anti‑CD20 test set (n=297): ADwabc = 0.0565 with near‑monotonic AD(c) (Spearman r=0.943), indicating effective ranking of responders.
- Anti‑CD20 survival effects (time‑to‑CDP24): Whole test set HR 0.743 (95% CI 0.482–1.15; p=0.179). Enriched subgroups: top 50% predicted responders HR 0.492 (0.266–0.912; p=0.0218); top 30% HR 0.361 (0.165–0.79; p=0.008). Corresponding non‑responders: bottom 50% HR 1.11 (0.599–2.05; p=0.744); bottom 70% HR 0.976 (0.578–1.65; p=0.925).
- ORATORIO-only subset (n=188): Whole-group HR 0.661 (0.383–1.14; p=0.135). Enriched: top 50% HR 0.516 (0.241–1.1; p=0.084); top 30% HR 0.282 (0.105–0.762; p=0.0082). Non‑responders: bottom 50% HR 0.849 (0.385–1.87; p=0.685); bottom 70% HR 0.915 (0.471–1.78; p=0.791).
- Demographic subgroup ADwabc (anti‑CD20): men 0.0405 vs women 0.0844; age <51 years 0.0353 vs ≥51 years 0.0661; disease duration <5 years 0.0385 vs ≥5 years 0.0117; EDSS <4.5 0.069 vs ≥4.5 0.0451.
- Responder phenotype (anti‑CD20 test set): responders were younger and had shorter disease duration, higher EDSS and higher scores on certain Functional Systems (notably cerebellar and visual), and more lesion activity (higher T2 lesion volume and gadolinium‑enhancing lesion count); normalized brain volume did not differ significantly.
- Generalization to laquinimod (ARPEGGIO, n=318): ADwabc = 0.0211. Whole-group HR 0.667 (0.369–1.2; p=0.933). Enriched: top 50% HR 0.492 (0.219–1.11; p=0.0803); top 30% HR 0.338 (0.131–0.872; p=0.0186). Non‑responders: bottom 50% HR 0.945 (0.392–2.28; p=0.901); bottom 70% HR 0.967 (0.447–2.09; p=0.933). Responder characteristics were broadly concordant with anti‑CD20 results.
- Baseline comparisons (ADwabc): MLP best overall (anti‑CD20: 0.0565; laquinimod: 0.0211). MLP without pre‑training: 0.0486 and 0.019. Prognostic MLP: 0.0408 and 0.017. Ridge regression: 0.0227 and 0.0194. CPH: 0.0305 and 0.0031. Single‑feature models underperformed; T2 lesion volume/disease duration ratio was the strongest single‑feature heuristic (0.0432; 0.0164).
- Traditional surrogate approach (brain atrophy at 48 weeks) failed to detect significant effects in the anti‑CD20 test set (mean difference 0.066; 95% CI −0.397 to 0.529; p=0.7786) and in ORATORIO subset (0.110; 95% CI −0.352 to 0.572; p=0.6379).
- Sample size simulations: For a 2‑year trial, unenriched requires n=1374 randomized; enriching to top 50% responders reduces to n=245 randomized (screen 490), and to top 30% responders to n=111 randomized (screen 370), reflecting substantial gains in power and efficiency. One‑year scenarios also show large reductions (e.g., top 50%: n=371 randomized; screen 742).
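The link between hazard ratio, required trial size, and screening burden can be sketched in Python. The study used a simulation based on observed CDP24 rates; the code below instead uses Schoenfeld's closed‑form approximation for the number of events in a log‑rank comparison, so it will not reproduce the reported sample sizes. The function names are hypothetical.

```python
from math import ceil, log
from statistics import NormalDist


def required_events(hazard_ratio, alpha=0.05, power=0.80, treated_fraction=2 / 3):
    """Schoenfeld approximation to the number of events needed.

    treated_fraction=2/3 corresponds to 2:1 randomization (an assumption
    taken from the trial design described above).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    p = treated_fraction
    return ceil((z_alpha + z_power) ** 2 / (p * (1 - p) * log(hazard_ratio) ** 2))


def screened(randomized, enriched_fraction):
    """Participants to screen so that `randomized` fall in the enriched subgroup."""
    return round(randomized / enriched_fraction)


# Screening counts implied by the randomized counts reported above.
for n_randomized, fraction in [(245, 0.5), (111, 0.3), (371, 0.5)]:
    print(n_randomized, fraction, screened(n_randomized, fraction))
```

Because the required number of events scales with 1/(ln HR)², moving the hazard ratio further from 1 through enrichment shrinks the trial sharply, at the cost of screening more participants per randomized one.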
The findings demonstrate that a deep learning CATE estimator using readily collected baseline clinical and scalar MRI features can rank and identify PPMS patients more likely to benefit from anti‑CD20 monoclonal antibodies. Predictive enrichment based on model scores substantially increases detectable effect sizes, enabling smaller, shorter proof‑of‑concept trials without relying on imperfect surrogate outcomes such as brain atrophy. The model trained on anti‑CD20 data generalized to laquinimod, indicating that some predictors of response may be at least partially mechanism‑agnostic and related to underlying disease activity and progression risk. The responder profile (younger age, shorter disease duration, higher EDSS and higher scores in specific Functional Systems domains, and greater lesion burden) aligns with prior subgroup analyses and meta‑analyses, supporting biological plausibility. The superior performance of a non‑linear MLP over linear baselines suggests that meaningful non‑linear interactions among features underlie heterogeneity of treatment effect. A model trained only to predict placebo progression also performed reasonably well, indicating that prognostic risk correlates with predicted benefit and that higher progression risk can inform enrichment, especially when extrapolating to drugs with different mechanisms.
Operationally, the approach can underpin FDA‑endorsed enrichment strategies, in which an initial enriched trial establishes proof‑of‑concept and is followed by broader confirmatory testing or stratified designs. Compared with strategies that infer from RRMS efficacy or rely on MRI surrogates, direct CATE‑based enrichment better targets the population in which clinical benefit on disability progression is most likely, improving participant risk–benefit and trial feasibility.
This work introduces and validates an ensemble multi‑headed MLP framework to estimate individual treatment effects on disability progression in MS from baseline clinical and MRI features. The approach reliably ranks responders to anti‑CD20 therapy and generalizes to laquinimod, enabling predictive enrichment that markedly reduces sample sizes for 1–2 year trials. The method offers a practical route to efficient proof‑of‑concept studies in progressive MS, potentially accelerating therapeutic development. Future directions include: enhancing interpretability of neural CATE models; incorporating voxel‑level MRI via CNN modules to capture finer predictive signals; expanding validation across therapies with diverse mechanisms (including neuroprotective/remyelinating agents) to assess generalizability; and evaluating long‑term outcomes to understand benefits beyond 2–4 years and potential late responders.
Key limitations include: use of a neural network “black‑box” model with limited interpretability; a risk of overfitting that regularization, early stopping, and ensembling mitigate but do not eliminate; reliance on scalar MRI‑derived metrics rather than voxel‑level imaging, potentially missing subtle features; limited external validation across drug mechanisms, requiring more diverse datasets; uncertain applicability for predicting long‑term benefit beyond 2–4 years, particularly for those predicted to be non‑responders; and the choice of EDSS slope as the training target, made because of the limitations of CDP24, which is advantageous for modeling but may introduce differences relative to the time‑to‑event outcomes used in trials. Additionally, dataset heterogeneity across trials and exclusions for missing data may affect generalizability.