Machine learning models to accelerate the design of polymeric long-acting injectables

P. Bannigan, Z. Bao, et al.

Explore groundbreaking research conducted by Pauric Bannigan, Zeqing Bao, and other experts from the University of Toronto, revealing how machine learning algorithms can efficiently predict drug release from long-acting injectables, paving the way for faster, cost-effective formulation development.

Introduction
Polymeric long-acting injectables (LAIs) provide sustained drug exposure and improved adherence for chronic conditions but remain challenging to develop due to complex, interacting factors spanning drug and polymer physicochemical properties, device geometry, and processing conditions. Historically, only a limited number of biodegradable polymers (notably PLGA) have GRAS status, constraining material choices, and small changes in formulation parameters can unpredictably affect drug loading, release rates, and stability. Traditional development relies on extensive experimental iteration, while empirical and mechanistic release models and molecular simulations offer insight but are either limited in predictive power a priori or computationally costly. This study asks whether machine learning can accurately predict in vitro fractional drug release profiles from polymeric LAIs across diverse drug–polymer systems and whether interpretable ML can guide the rational design of new formulations, thereby accelerating development.
Literature Review
Prior approaches include empirical/mathematical models of release mechanisms that require high-quality input data and typically cannot predict release a priori; molecular dynamics and related simulations can link formulation parameters to release behavior but are computationally intensive and not yet substitutes for experiments. Machine learning has been explored primarily with neural networks on small, narrow datasets (e.g., ~85 PLGA protein formulations or <20 small-molecule cases), which risks overfitting in low-data regimes. A key barrier is the scarcity of open, standardized datasets for formulation science. Tree-based models have been highlighted for performance and interpretability advantages in similar low-data tasks.
Methodology
Data curation: A dataset was compiled from the literature (Web of Science search terms such as “polymeric microparticle/microgel” and “drug delivery”) and from prior in-house studies. Fractional release profiles were digitized where necessary. The final dataset spanned 181 drug release profiles across 43 unique drug–polymer combinations, totaling approximately 373–378 individual fractional release measurements. LAIs encompassed soluble and insoluble polymeric systems and varied morphologies (spherical, cylindrical). Drug and polymer descriptors were computed primarily with RDKit.
Features: Seventeen initial input features captured drug properties (e.g., Drug_MW, Drug_TPSA, Drug_NHA, Drug_Tm, Drug_pKa, Drug_LogP), polymer properties (Polymer_MW, cross-linking ratio (CL_Ratio), lactide:glycolide ratio (LA/GL)), system/formulation attributes (drug loading capacity (DLC/LAC), initial drug-to-polymer or drug-to-material ratio, surface area-to-volume ratio (SA/V)), release media surfactant percentage (S), measurement time (Time), and early fractional release values (T at early timepoints such as 6 h, 12 h, and 24 h, i.e., 0.25, 0.5, and 1.0 days). Early points that were not reported were imputed from best-fit polynomials over 0–7 days. Models that included early release points were termed few-shot; those without were termed zero-shot.
Model training and evaluation: Multiple algorithms were benchmarked, including tree-based models (LightGBM (LGBM), Random Forest (RF), XGBoost (XGB), Natural Gradient Boosting (NGB, where applicable), Decision Trees), linear/regularized models, KNN, SVM, and neural networks (Keras/TensorFlow). A nested cross-validation scheme was used, with grouping by drug–polymer combination so that no specific system leaked across the train/test boundary. In each of 10 trials, 20% of drug–polymer groups were held out as the outer test set (GroupShuffleSplit). Inner-loop model selection used GroupKFold (k=10) and a randomized hyperparameter search (~100 configurations).
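The grouped nested cross-validation described above can be sketched roughly as follows. This is an illustration only, not the authors' code: the data are synthetic, scikit-learn's RandomForestRegressor stands in for LGBM, the hyperparameter ranges are invented, and the number of trials, folds, and search iterations is reduced (3, 4, and 4 instead of 10, 10, and ~100) so that it runs quickly.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GroupKFold, GroupShuffleSplit, RandomizedSearchCV

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): 20 hypothetical drug-polymer groups,
# 12 measurements each, 5 anonymous features, target = fractional release.
n_groups, rows_per_group, n_features = 20, 12, 5
groups = np.repeat(np.arange(n_groups), rows_per_group)
X = rng.normal(size=(n_groups * rows_per_group, n_features))
y = np.clip(0.5 + 0.2 * X[:, 0] + 0.05 * rng.normal(size=len(X)), 0.0, 1.0)

# Outer loop: each trial holds out 20% of the groups as a test set.
outer = GroupShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
outer_mae = []
for train_idx, test_idx in outer.split(X, y, groups):
    # A drug-polymer group must never sit on both sides of the split.
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])

    # Inner loop: grouped k-fold plus randomized hyperparameter search.
    search = RandomizedSearchCV(
        RandomForestRegressor(random_state=0),
        param_distributions={
            "n_estimators": randint(20, 60),  # illustrative range
            "max_depth": randint(2, 8),       # illustrative range
        },
        n_iter=4,
        cv=GroupKFold(n_splits=4),
        scoring="neg_mean_absolute_error",
        random_state=0,
    )
    search.fit(X[train_idx], y[train_idx], groups=groups[train_idx])

    # The refit best model is scored once on the untouched outer test groups.
    preds = search.predict(X[test_idx])
    outer_mae.append(mean_absolute_error(y[test_idx], preds))

mean_outer_mae = float(np.mean(outer_mae))
print(f"outer MAE per trial: {np.round(outer_mae, 3)}, mean: {mean_outer_mae:.3f}")
```

Passing `groups` to both the outer splitter and the inner search is what keeps every row of a given drug–polymer system on one side of each split; an ungrouped split would let the model see other time points from a test system during training.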
Performance was quantified by mean absolute error (MAE) of fractional release.
Feature refinement: Agglomerative hierarchical clustering (farthest-neighbor) on absolute Spearman correlations identified redundant feature clusters (e.g., strong correlations among Drug_NHA, Drug_TPSA, Drug_MW; and among early-time release features). Guided by dendrogram linkage distances, progressively pruned feature sets were evaluated with fixed LGBM hyperparameters to balance performance and parsimony, yielding an optimal 15-feature LGBM model.
Model interpretation and design guidance: SHAP (TreeSHAP) analyses quantified global and local feature contributions. Dimensionality reduction (PCA followed by t-SNE) on SHAP values was used to visualize how manipulable features (drug MW, polymer MW, CL ratio, etc.) relate to predicted release behavior.
Prospective validation: Guided by SHAP-derived criteria for fast vs slow release, PLGA microparticles (MPs) were formulated for salicylic acid (fast-release) and olaparib (slow-release) using an oil-in-water emulsion method with PLGA grades (e.g., RG 502H, RG 504H), size selection by sieving, and characterization by laser diffraction and SEM. In vitro release was measured under sink conditions (e.g., 0.5 wt% SDS for release media) with HPLC quantification, and experimental profiles were compared against LGBM predictions.
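As a rough illustration of the feature-refinement step, the sketch below clusters features by farthest-neighbor (complete) linkage on a distance of 1 minus the absolute Spearman correlation. The data are synthetic and only borrow the descriptor names from the text; the 0.5 distance cutoff and the rule of keeping the first member of each cluster are assumptions, not the authors' settings.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n = 200

# Synthetic descriptors (assumption): three size-related features share one
# latent variable, mimicking the Drug_MW / Drug_NHA / Drug_TPSA redundancy.
size = rng.normal(size=n)
features = {
    "Drug_MW": size + 0.05 * rng.normal(size=n),
    "Drug_NHA": size + 0.05 * rng.normal(size=n),
    "Drug_TPSA": size + 0.05 * rng.normal(size=n),
    "Polymer_MW": rng.normal(size=n),
    "Drug_LogP": rng.normal(size=n),
}
names = list(features)
X = np.column_stack([features[k] for k in names])

# Pairwise Spearman correlation matrix between columns (features).
corr, _ = spearmanr(X)

# Convert to a distance: highly correlated features are close together.
dist = 1.0 - np.abs(corr)
dist = (dist + dist.T) / 2.0  # remove tiny floating-point asymmetries
np.fill_diagonal(dist, 0.0)

# Complete linkage is the "farthest-neighbor" agglomerative criterion.
Z = linkage(squareform(dist, checks=False), method="complete")
labels = fcluster(Z, t=0.5, criterion="distance")  # illustrative cutoff

# Group feature names by cluster and keep one representative per cluster.
clusters = {}
for name, lab in zip(names, labels):
    clusters.setdefault(int(lab), []).append(name)
kept = [members[0] for members in clusters.values()]

print("clusters:", list(clusters.values()))
print("kept:", kept)
```

In practice the cutoff would be swept across dendrogram linkage distances, with the model re-evaluated on each pruned feature set, as the text describes.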
Key Findings
- Tree-based models outperformed linear, kernel, and neural network baselines under grouped nested cross-validation; LGBM achieved the best accuracy with a significantly narrower absolute-error distribution (p<0.05 vs others).
- Few-shot models (including early release measurements such as T=1.0 day) outperformed zero-shot counterparts, demonstrating the value of limited initial experimental data for improved predictions.
- Feature clustering and ablation indicated redundancy among some descriptors. Reducing from 17 to an optimized 15 features maintained strong performance; excessive pruning (≤13 features) degraded accuracy (e.g., MAE rising from ~0.115–0.116 to ~0.142 and higher across trials), confirming the importance of a sufficiently rich descriptor set.
- SHAP analysis ranked Time and early release (e.g., T=1.0 day) as most influential, followed by Drug_MW and Polymer_MW; other descriptors had smaller marginal effects but contributed synergistically.
- SHAP- and t-SNE-derived design rules for PLGA MPs distinguished fast- vs slow-release regimes (e.g., lower drug and polymer MWs associated with faster release; medium/high MWs with slower release; lower TPSA, SA/V, initial D/M ratio, and DLC favored faster release).
- Prospective study: Predicted and experimental profiles for SA-PLGA (fast) and OLA-PLGA (slow) showed good agreement overall; OLA-PLGA exhibited a faster-than-predicted release after day 5, likely due to PLGA hydrolysis effects not fully captured by the model.
- The approach demonstrates that interpretable ML can both predict in vitro release profiles and guide formulation design decisions, potentially reducing experimental burden.
Discussion
The study demonstrates that ML, particularly LGBM, can generalize across unseen drug–polymer pairs to predict fractional release profiles when trained on literature-curated datasets and grouped by drug–polymer combinations. By identifying key determinants of release (time dependence, initial release rate proxies, and molecular weights of drug and polymer), the models provide actionable insights. The few-shot paradigm balances minimal experimental input with substantial gains in accuracy, aligning with practical formulation workflows. Prospective validation confirms that data-driven design criteria can translate into formulations whose in vitro performance largely matches predictions, supporting the central hypothesis that ML can accelerate LAI development. The observed deviation in the slow-release OLA-PLGA after day 5 highlights domain-specific mechanistic phenomena (e.g., PLGA hydrolysis) that suggest directions for enriching models (time-dependent polymer degradation) and datasets (more slow-release systems). Overall, the results are relevant to pharmaceutical formulation science, offering a template for combining interpretable ML with targeted experiments for efficient design cycles.
Conclusion
This work provides a proof-of-concept that interpretable, tree-based ML models can accurately predict in vitro drug release from polymeric LAIs and inform the design of new formulations. An optimized 15-feature LGBM model, trained with grouped nested cross-validation, outperformed alternative algorithms and enabled derivation of practical design rules. A prospective study validated model-guided design for fast- and slow-release PLGA systems, showing strong agreement between predictions and experiments. Future work should expand and standardize open datasets (particularly for slow-release and diverse polymer systems), incorporate mechanistic descriptors such as time-dependent polymer degradation, and evaluate in vivo translation. With larger datasets, neural networks may become more competitive; meanwhile, tree-based models offer robust, interpretable performance to reduce time and cost in LAI development.
Limitations
- Limited dataset size and heterogeneity (≈43 drug–polymer combinations; ≈373–378 measurements) constrain generalizability and favor tree-based models over data-hungry neural networks.
- Some early-time release points were imputed, potentially introducing bias.
- Few-shot models rely on initial experimental measurements (e.g., T=1.0 day), limiting true a priori prediction in zero-shot settings.
- While grouping prevented identical drug–polymer systems from appearing in both train and test, overlap at the level of individual drugs or polymers across inner/outer loops may introduce subtle correlations.
- Mechanistic factors like time-dependent PLGA hydrolysis and morphology changes are not explicitly modeled, contributing to deviations in slow-release profiles (e.g., OLA-PLGA after day 5).
- The majority of approved LAIs use PLGA; reliance on a narrow material set may limit extrapolation to novel polymers without additional data.
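To make the imputation caveat above concrete, here is a minimal sketch of how an unreported early release value might be estimated from a polynomial fit over the 0–7 day window. The release profile is hypothetical, and the polynomial degree (2) and the clipping to [0, 1] are assumptions; the summary only states that best-fit polynomials were used.

```python
import numpy as np

# Hypothetical reported profile (assumption): time in days and cumulative
# fractional release, with no measurement before day 2.
t_days = np.array([0.0, 2.0, 3.0, 5.0, 7.0])
release = np.array([0.0, 0.30, 0.41, 0.58, 0.70])


def impute_early_release(t, frac, t_query, degree=2):
    """Estimate fractional release at t_query from a polynomial fitted to
    the points within the first 7 days. The degree is an assumption."""
    window = t <= 7.0
    coeffs = np.polyfit(t[window], frac[window], deg=degree)
    estimate = np.polyval(coeffs, t_query)
    # Fractional release is bounded, so clip any fit overshoot.
    return float(np.clip(estimate, 0.0, 1.0))


# Early timepoints used as few-shot features: 6 h, 12 h, 24 h.
early = {tq: impute_early_release(t_days, release, tq) for tq in (0.25, 0.5, 1.0)}
for tq, value in early.items():
    print(f"T={tq} d: imputed fractional release = {value:.3f}")
```

Because the imputed values come from a smooth fit rather than a measurement, they cannot reflect burst release or lag phases that occur before the first reported point, which is one way such imputation could bias a few-shot model.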