Machine learning guided aptamer refinement and discovery

A. Bashir, Q. Yang, et al.

This study introduces MLPD, a methodology that combines machine learning with particle display to guide aptamer discovery, improving affinity and specificity. Conducted by Ali Bashir, Qin Yang, Jinpeng Wang, and colleagues, the work shows how model-guided sequence exploration can yield novel high-affinity aptamers for diagnostics and therapeutics.
Introduction

Aptamers are single-stranded DNA/RNA ligands with high affinity and specificity and advantageous properties for therapeutics and diagnostics (non-immunogenicity, safety, modularity, manufacturability). High-quality aptamers are rare in sequence space, and experimental discovery methods such as SELEX and variants are constrained by practical library sizes (~10^15 molecules) and synthesis limits, sampling only a minute fraction of possible sequences. Thus, intelligently exploring sequence space could yield better binders with fewer experiments. Prior computational efforts, including docking-based filtering and machine learning (e.g., random forests trained on aptamer affinities), suggest that predictive models can guide design, but have not fully realized de novo sequence generation. This study proposes and validates a machine learning–guided particle display (MLPD) workflow to: (1) improve existing experimental aptamers, (2) discover de novo high-affinity DNA aptamers, and (3) truncate aptamers while retaining or improving affinity. Neutrophil gelatinase-associated lipocalin (NGAL), a clinically relevant biomarker, serves as the demonstration target.

Literature Review

The paper reviews constraints of traditional aptamer discovery (SELEX) and improvements focused on affinity, specificity, and success rates, all limited by physical library size and synthesis strategies. It cites fully specified oligo libraries (~10^6 diversity) and the limitations of random library strategies for precise exploration. Computational approaches have filtered pools via secondary structure prediction and docking. Machine learning has modeled sequence–fitness landscapes (e.g., random forests on aptamer affinities) and has succeeded in other biomolecular sequence design tasks, such as deep learning–guided optimization of yeast 5′ UTRs for expression and of antibody specificity. These precedents motivate the use of modern neural networks to predict and generate aptamer sequences with enhanced affinity, while leveraging experimental data from selection methods like particle display.

Methodology

Overview: The MLPD pipeline integrates Particle Display (PD) with neural network models to predict and generate high-affinity aptamer sequences. PD partitions aptamer particles into positive/negative pools at multiple affinity thresholds via FACS using fluorescently labeled target. Pools are sequenced by NGS to provide training data. Models are trained to predict affinity-related labels from sequence features, then used to generate candidates by guided mutational walks from seed sequences. Selected candidates are synthesized and validated by PD and K_D measurements. Truncation of aptamers to minimal core sequences is performed using model-guided substring scoring across sequence backgrounds and experimental validation.

Particle Display (PD): A DNA library with a 40-nt random region flanked by primer sites was converted into monoclonal aptamer-coated particles via emulsion PCR. PD rounds incubated ~10^8 particles with NGAL at defined concentrations, labeled bound target with an anti-His-tag fluorophore, and sorted positives (>F_max/3) from negatives by FACS. Two rounds established three stringency thresholds, each separated by a fourfold step (approximately <2 µM, <512 nM, and <128 nM, calibrated via K_D curves on subsets). Round 1 positives were amplified and mixed to seed Round 2, yielding higher positive fractions. Positive and negative pools for each condition were sequenced on an Illumina NextSeq.

Training data processing: 500,454,107 quality-filtered reads were clustered (Levenshtein distance ≤5) into 910,441 clusters to avoid train–test leakage from sequencing errors. Cluster representatives (highest-count sequence per cluster) were used for efficiency; cluster membership was not split across partitions. Data were split 80/20 into train/test. Counts were normalized by total reads per pool. Concordance across affinity thresholds was verified; sequences passing higher stringency were typically present at lower ones.
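
As a rough illustration of this preprocessing, the sketch below (Python) clusters reads by edit distance and then splits the data at the cluster level so that near-duplicate reads never straddle train and test. The greedy clustering and helper names are illustrative assumptions, not the authors' exact pipeline.

import random

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cluster_reads(reads_by_count, max_dist=5):
    """Greedily assign each read to the first cluster whose representative
    (the highest-count sequence seen so far) lies within max_dist edits.
    Quadratic and simplistic, but enough to convey the idea."""
    representatives, clusters = [], []
    for seq, _count in sorted(reads_by_count.items(), key=lambda kv: -kv[1]):
        for idx, rep in enumerate(representatives):
            if levenshtein(seq, rep) <= max_dist:
                clusters[idx].append(seq)
                break
        else:
            representatives.append(seq)
            clusters.append([seq])
    return representatives, clusters

def split_clusters(representatives, test_frac=0.2, seed=0):
    """80/20 split over cluster representatives, keeping clusters intact."""
    rng = random.Random(seed)
    shuffled = representatives[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]   # train, test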

Sequence features and models: Input features concatenated a one-hot encoding of the 40-nt variable region (160 dimensions) with counts of all k-mers up to length 4 (340 dimensions), for a total of 500 dimensions. Three prediction targets were trained with a least-squares loss: (1) the Counts model predicts the normalized sequencing fraction in each positive/negative pool (a variant with a latent affinity layer was also explored); (2) the Binned model predicts ternary presence/absence labels per stringency bin; (3) the SuperBin model summarizes the maximum stringency level reached into a single target with seven affinity levels, removing ambiguous or conflicting labels. Architectures used three convolutional layers followed by 1–4 fully connected layers; Adam or SGD with momentum was used depending on the model (learning rates around 0.001–0.004). Hyperparameters were tuned via Google Vizier, ranking configurations by AUC on the top 1% of outputs in the test set.
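
A minimal sketch of this 500-dimensional featurization (in Python, with illustrative names not taken from the paper's code): a 160-dim one-hot encoding of the 40-nt variable region concatenated with counts of all k-mers for k = 1–4 (4 + 16 + 64 + 256 = 340 dimensions).

from itertools import product
import numpy as np

BASES = "ACGT"
KMERS = [''.join(p) for k in range(1, 5) for p in product(BASES, repeat=k)]
KMER_INDEX = {kmer: i for i, kmer in enumerate(KMERS)}   # 340 k-mers in total

def featurize(seq, length=40):
    """One-hot (160 dims) + k-mer counts (340 dims) = 500-dim feature vector."""
    assert len(seq) == length
    onehot = np.zeros((length, 4), dtype=np.float32)
    for pos, base in enumerate(seq):
        onehot[pos, BASES.index(base)] = 1.0
    kmer_counts = np.zeros(len(KMERS), dtype=np.float32)
    for k in range(1, 5):
        for start in range(length - k + 1):
            kmer_counts[KMER_INDEX[seq[start:start + k]]] += 1.0
    return np.concatenate([onehot.ravel(), kmer_counts])  # shape (500,)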

Model-guided candidate generation (walking): Seeds were selected from three sources: (i) 400 high-performing experimental PD sequences (experimental seeds), (ii) 14,977 ML seeds obtained by scoring up to 1 billion random sequences per model and selecting the top ~5,000 per model, and (iii) 177 purely random seeds (baseline). For each seed, iterative mutational walks were run for five rounds: generate random mutants (0–2 nt substitutions per step, with up to 4 substitutions described for the general walking procedure), score them with the model, and advance the top scorers (e.g., the top 200 become the next round's parents, and the top 5 per step are selected for experimental validation). In total, 82,931 walked aptamers were synthesized and PD-tested across four stringencies (512 nM, 128 nM, 32 nM, 8 nM); across the broader study, 187,499 aptamers were experimentally evaluated.
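
The sketch below (Python) illustrates one round of such a walk under simplified assumptions: score_fn stands in for the trained model, and the mutant and survivor counts are illustrative rather than the exact values used in the study.

import random

BASES = "ACGT"

def mutate(seq, max_subs=2, rng=random):
    """Apply 0 to max_subs random single-base substitutions."""
    n_subs = rng.randint(0, max_subs)
    seq = list(seq)
    for pos in rng.sample(range(len(seq)), n_subs):
        seq[pos] = rng.choice([b for b in BASES if b != seq[pos]])
    return ''.join(seq)

def walk_round(parents, score_fn, n_mutants_per_parent=50, n_keep=200, rng=None):
    """Mutate every parent, score all mutants with the model, and keep the
    top n_keep as the next round's parents (a few top scorers per round
    would also be synthesized for testing)."""
    rng = rng or random.Random(0)
    candidates = {m for p in parents
                    for m in (mutate(p, rng=rng)
                              for _ in range(n_mutants_per_parent))}
    ranked = sorted(candidates, key=score_fn, reverse=True)
    return ranked[:n_keep]

# Five rounds from a single seed, as described above:
# parents = [seed]
# for _ in range(5):
#     parents = walk_round(parents, score_fn=model_score)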

Experimental validation: Approximate affinities were assigned by PD bin thresholds and gating at F_max/3; absolute K_D values were measured for subsets via bead-based fluorescence binding assays over titration series and fitted to a single-site binding model. Calibration ensured comparability of PD and MLPD stringency thresholds.
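
As context for the single-site fit, a minimal SciPy sketch is shown below; the titration concentrations and signal values are placeholders, not data from the study.

import numpy as np
from scipy.optimize import curve_fit

def single_site(conc_nM, f_max, kd_nM):
    """One-site binding: fraction of maximal signal at a given target concentration."""
    return f_max * conc_nM / (kd_nM + conc_nM)

conc = np.array([2.0, 8.0, 32.0, 128.0, 512.0, 2048.0])   # nM titration (placeholder)
signal = np.array([0.11, 0.34, 0.62, 0.85, 0.95, 0.99])   # normalized fluorescence (placeholder)

(f_max_fit, kd_fit), _ = curve_fit(single_site, conc, signal,
                                   p0=[1.0, 100.0], bounds=(0, np.inf))
print(f"fitted K_D ~ {kd_fit:.1f} nM, F_max ~ {f_max_fit:.2f}")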

Motif discovery: Enrichment analyses using MEME compared motifs between walked sequences and their seeds (in random-seed experiments) and between PD test positives and full PD test sets, revealing a shared 7-nt motif (consensus TGGATAG).
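
As a trivial, hedged sanity check (not a substitute for MEME's statistical model), one could simply compare how often the consensus appears in each sequence set:

def motif_fraction(seqs, motif="TGGATAG"):
    """Fraction of sequences containing the motif at least once."""
    return sum(motif in s for s in seqs) / max(len(seqs), 1)

# enrichment = motif_fraction(pd_positives) / motif_fraction(pd_background)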

Truncation strategy: For selected top binders (one PD-derived, G12, and one ML-derived, G13, both <8 nM by PD), all substrings of lengths 15, 19, 23, 27, 31, 35, and 39 nt were embedded across all positions within a 40-nt window, flanked by homopolymer backgrounds (A, C, G, T) to create multiple sequence contexts. Each variant was scored by the SuperBin model, yielding score distributions per subsequence. Candidates with consistently high median scores (and/or low variance) were synthesized as 5′-biotinylated oligos and tested. Secondary structures were predicted with ViennaRNA (DNA parameters) to contextualize motif placement.
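
A sketch of that truncation scan, assuming a generic score_fn in place of the trained SuperBin model (core lengths and the 40-nt window follow the description above; helper names are illustrative):

from statistics import median

def truncation_scores(full_seq, score_fn,
                      core_lengths=(15, 19, 23, 27, 31, 35, 39), window=40):
    """Embed every substring of each core length at every offset of a 40-nt
    window, padded with homopolymer backgrounds (A, C, G, T), and summarize
    the model scores for each core by their median across all contexts."""
    results = {}
    for k in core_lengths:
        for start in range(len(full_seq) - k + 1):
            core = full_seq[start:start + k]
            scores = []
            for background in "ACGT":
                for offset in range(window - k + 1):
                    padded = background * offset + core + background * (window - k - offset)
                    scores.append(score_fn(padded))
            results[core] = median(scores)
    return results

# Candidates with the highest (and most consistent) median scores would be
# synthesized for experimental testing, e.g.:
# top_cores = sorted(truncation_scores(g12_seq, model_score).items(),
#                    key=lambda kv: -kv[1])[:10]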

Key Findings
  • Predictive performance on held-out data: All models improved with increasing stringency; the Binned model generally performed best. Test-set AUCs (examples): Count-Sum Top 1%: Counts 0.84, Binned 0.89, SuperBin 0.83; Affinity threshold 128 nM: Counts 0.86, Binned 0.95, SuperBin 0.87.
  • ML-guided walks outperformed random walks across seed types. Starting from experimental seeds, ML walks improved recovery of high-affinity candidates 11.3-fold at 8 nM and 4.6-fold at 32 nM versus random walks, despite many mutations being deleterious near strong seeds.
  • From ML-screened seeds, ML walks generated de novo candidates beyond the training stringency (down to 8 nM). Walked sequences were at Levenshtein distance ≥10 from the experimental seeds, indicating novelty.
  • Large enrichment over PD alone: Across ML models, there was a 460-fold increase in the fraction of <128 nM aptamers compared to PD; using recalibrated affinities for top PD candidates, enrichment rose to 1214-fold.
  • Specific examples: A seed improved from 275 nM to 8 nM K_D via ML-guided walking. The best-performing aptamer overall originated from an ML-walk of an ML-screened seed, surpassing the best experimental seed.
  • Motif discovery: A consistent 7-nt motif, TGGATAG, was enriched in both ML-selected and PD-positive sequences; similar motif occurrences were noted in independent NGAL aptamer studies.
  • Truncation: Model-guided truncation identified 23-nt cores maintaining high affinity. G12 truncation had K_D ~11 nM vs 8 nM full-length; G13 truncation improved to 1.5 nM, a 5.2-fold gain over its full-length form and higher affinity than the best original PD candidate. Structures for both 23-mers featured a hairpin with TGGATAG in the loop, mirroring their full-length counterparts.
  • Overall impact: The approach predicted high-affinity aptamers from experimental candidates at an 11-fold higher rate than random perturbation, generated novel high-affinity sequences at higher rates than PD alone, and produced markedly shorter aptamers (>70% length reduction) with equal or superior affinity.

Discussion

The study addresses the fundamental limitation of experimental aptamer discovery—sparse sampling of vast sequence space—by training neural networks on PD-derived affinity partitions and using them to guide sequence exploration. Models trained predominantly on lower stringency data (<128 nM) extrapolated to much tighter binders, demonstrating that binarized affinity thresholds carry sufficient signal for useful generalization. Across seed types, ML-guided mutation robustly enriched high-affinity candidates, including de novo sequences, and in some cases exceeded the best experimental seeds. The consistent discovery of the TGGATAG motif and its placement in predicted hairpin loops supports biological plausibility and provides mechanistic clues. Model-guided truncation successfully identified minimal cores that retained or improved binding, enabling cost and complexity reductions for potential therapeutic or diagnostic use. The framework is target-agnostic and can be extended to multi-parameter optimization (e.g., affinity, specificity, kinetics, stability) by integrating orthogonal PD-style screens and jointly guiding walks up or down different property landscapes. Active learning strategies could further improve efficiency by iteratively retraining on new experimental results to better align model predictions with ground truth.

Conclusion

This work presents MLPD, a combined experimental–computational workflow that: (1) trains neural networks on PD-derived affinity partitions, (2) uses models to guide mutational walks to enrich high-affinity aptamers, (3) discovers de novo binders surpassing initial exemplars, and (4) computationally identifies minimal-length cores retaining or enhancing affinity. On the NGAL target, MLPD substantially outperformed random exploration and PD alone, uncovered an enriched sequence motif (TGGATAG), and produced 23-nt aptamers with low-nanomolar affinity, including a 1.5 nM truncation. Future directions include increasing and diversifying positive training examples (larger libraries, SELEX pre-enrichment), leveraging structural signals to mine low-abundance binders, adopting active learning and more sophisticated exploration–exploitation strategies, and extending to multi-objective design to simultaneously optimize affinity, specificity, kinetics, and stability.

Limitations
  • Model selection used a single 20% held-out test set for hyperparameter tuning, risking optimistic estimates; cross-validation could improve robustness.
  • Limited number of high-stringency positive examples constrained training signal; more positives would likely improve performance.
  • PD-derived labels are indirect measures of affinity and require careful calibration across experiments (dependence on F_max, instrument gains, target concentration).
  • Aggressive optimization may induce neural network pathologies; a conservative mutation strategy was used but may not fully prevent overfitting to model artifacts.
  • Computationally exhaustive exploration of sequence space is infeasible; results depend on seed selection and local walks.
  • Demonstration is on a single protein target (NGAL); generalization to other targets, while plausible, requires empirical validation.