Design of target specific peptide inhibitors using generative deep learning and molecular dynamics simulations

Chemistry

S. Chen, T. Lin, et al.

This innovative study by Sijie Chen, Tong Lin, Ruchira Basu, Jeremy Ritchey, Shen Wang, Yichuan Luo, Xingcan Li, Dehua Pei, Levent Burak Kara, and Xiaolin Cheng showcases a computational method combining a GRU-based variational autoencoder (GRU-VAE) with Rosetta FlexPepDock docking and molecular dynamics simulations, leading to significantly improved peptide inhibitors targeting β-catenin and NEMO. With some inhibitors demonstrating up to a 15-fold improvement in binding affinity, this research elegantly merges deep learning with physics-based molecular modeling.

Introduction
Designing peptide inhibitors for specific protein-protein interactions (PPIs) remains challenging due to vast sequence spaces, conformational flexibility, solvent effects, and limited peptide–protein structural data. Structure-based methods (e.g., Rosetta FlexPepDock) can refine peptide–protein complexes but are computationally intensive and may bias searches when initiated from templates. Sequence-based generative approaches (e.g., RNN/Transformer VAEs and GPT-style models) can explore large sequence spaces but are often limited by the scarcity of labeled peptide–protein interaction data. This study asks whether integrating a deep generative model with physics-based docking and molecular dynamics (MD) can efficiently discover target-specific peptide extensions with improved affinity. The work focuses on extending a β-catenin inhibitor (at both N- and C-termini) and a NEMO-binding peptide to enhance potency against their respective PPIs. The purpose is to develop and validate a scalable, target-adaptable pipeline that couples learned sequence priors with hierarchical biophysical evaluation to overcome data scarcity and structural complexity in peptide design.
Literature Review
Two broad computational paradigms exist for peptide design: (1) structure-based, which starts from a protein pocket or a bound peptide motif (e.g., Rosetta FlexPepDock for high-resolution peptide–protein modeling, pocket-based design, and template-based strategies), and (2) sequence-based generative methods, leveraging VAEs, RNNs, and Transformers to learn sequence representations and generate candidates with desired properties. Structure-based approaches can produce accurate binding pose refinements but are computationally costly and may bias sequence exploration. Sequence-based methods are limited by scarce labeled PPI datasets (only a fraction of peptide-containing PDB entries annotate peptide–protein interfaces). Pretraining on large unlabeled protein datasets (GPT-like models) and fine-tuning addresses some data limitations. Prior successes include conditional VAEs generating molecules with targeted properties and deep generative models coupled with MD for antimicrobial peptide discovery. AlphaFold can rank peptide binders but is computationally heavy and limited to natural amino acids. Overall, combining generative models with hierarchical physics-based scoring presents a promising path to target-specific peptide design.
Methodology
Overall pipeline: A GRU-based variational autoencoder (VAE) represents peptides in a latent space. A Metropolis–Hastings (MH) sampler draws extension encodings conditioned on desired binding properties, which are then decoded into sequences. Candidates are ranked by Rosetta FlexPepDock, and the top candidates are refined and re-ranked by MD-based MM/GBSA. Iterative fine-tuning on target-specific scores enriches the generator for better binders.

Datasets and labels: Unlabeled sequences for VAE training (sequence encoding) came from UniProt, PixelDB, and THPdb. Labeled data for classifier training comprised positives from PixelDB (peptides in peptide–protein complexes with ΔG < −35 REU) and sliced THPdb peptide therapeutics, with negatives drawn from random UniProt sequences. For target-specific fine-tuning, labels were updated using Rosetta FlexPepDock scores of generated peptides, with the top 10% designated as positives.

VAE architecture and encoding: Each peptide is split into a base (the central 15 residues) and an extension (N- or C-terminal residues). Tokenization uses a 24-symbol vocabulary (start, end, unknown, and pad tokens plus the 20 natural amino acids). Embeddings feed a GRU encoder that outputs the mean and variance of the latent vector z; a KL-divergence term regularizes z toward a prior, and reconstruction is performed by a GRU decoder trained with cross-entropy loss. Extensions of up to 35 tokens are supported (N-terminal generation used 2–5 residues; C-terminal generation used up to 7–9 residues depending on the study arm).

Conditional sampling with MH: The goal is to sample z from q(z|c=1), where c indicates a potential PPI binder. A Gaussian mixture model approximates q(z), and q(c|z) is estimated by supervised classifiers trained on bootstrapped labeled datasets (four models averaged): an SVC for N-terminal extensions and XGBoost for C-terminal extensions (the best performers). A symmetric Gaussian proposal distribution with 500 burn-in steps is used, and the acceptance probability is computed in log scale. Each accepted z is decoded recursively into an extension sequence and concatenated with the base peptide.
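To make the sampling step concrete, here is a minimal Python sketch of Metropolis–Hastings sampling in the VAE latent space. Only the overall scheme (a Gaussian mixture prior for q(z), an averaged classifier ensemble for q(c|z), a symmetric Gaussian proposal, 500 burn-in steps, and log-scale acceptance) follows the paper's description; the function name, default step size, and total step count are illustrative assumptions, not the authors' code.

```python
import numpy as np

def mh_sample_latent(gmm, classifiers, z0, n_steps=1500, burn_in=500,
                     step_size=0.1, rng=None):
    """Sample latent vectors z ~ q(z | c=1) by Metropolis-Hastings.

    gmm         -- fitted sklearn GaussianMixture approximating q(z)
    classifiers -- ensemble (e.g., 4 bootstrapped SVC or XGBoost models)
                   whose averaged predict_proba estimates q(c=1 | z)
    z0          -- starting latent vector (1-D NumPy array)
    """
    rng = rng or np.random.default_rng()

    def log_target(z):
        # By Bayes' rule, log q(z | c=1) = log q(c=1 | z) + log q(z) + const.
        p_c = np.mean([clf.predict_proba(z[None, :])[0, 1]
                       for clf in classifiers])
        return np.log(p_c + 1e-12) + gmm.score_samples(z[None, :])[0]

    z, logp = z0.copy(), log_target(z0)
    samples = []
    for step in range(n_steps):
        # Symmetric Gaussian proposal: the Hastings correction cancels,
        # so acceptance depends only on the target-density ratio (log scale).
        z_new = z + step_size * rng.standard_normal(z.shape)
        logp_new = log_target(z_new)
        if np.log(rng.random()) < logp_new - logp:
            z, logp = z_new, logp_new          # accept the move
        if step >= burn_in:
            samples.append(z.copy())           # keep post-burn-in states
    return np.array(samples)
```

Each retained z would then be decoded token by token into an extension sequence and concatenated with the base peptide, as described above.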
Hierarchical binding evaluation: Rosetta FlexPepDock refines each peptide–protein complex and computes interface metrics: I_sc (interface score), pep_sc, I_bsa (interface buried SASA), and rmsALL_if (RMSD of interface atoms). A combined ranking over these metrics selects the top candidates for explicit-solvent MD (GROMACS, CHARMM27 force field, TIP3P water, 310 K, NPT at 1 atm, PME electrostatics, LINCS constraints). Three independent 10 ns production runs were performed per complex, with 1000 snapshots per trajectory used for MM/GBSA binding free energies in GMX_MMPBSA (GB-OBC model, 0.15 M salt, ε_in = 2, ε_out = 80, LCPO nonpolar term). Mean ± SD over the triplicates is reported.

Design tasks:
- β-catenin N-terminal extension: The base parent peptide was YPEDILDKHLQRVIL (derived from Peptide 9 by removing two N-terminal glycines). The pretrained VAE-MH generated 100 extensions per length (2–5 residues). Rosetta screened 300 peptides; the top 10 were evaluated by MM/GBSA, and 4 were selected for synthesis based on energy and polarity.
- β-catenin C-terminal extension with iterative fine-tuning: Two bases were used: full-length YPEDILDKHLQRVIL extended by up to 7 residues, and truncated YPEDILDKHLQRV extended by up to 9 residues (to probe recovery of the binding lost by removing the IL dipeptide). Three fine-tuning cycles (Fine-tune1/2/3) used top-scoring Rosetta regions to retrain q(c|z) and enrich high-ranking peptides. Of 4320 generated sequences, 162 passing the FlexPepDock cutoffs (I_sc < −6 REU, I_bsa > 250 Ų, rmsALL_if < −0.2 Å) proceeded to MM/GBSA; the top 10 from each base sequence were selected for synthesis (see the filtering sketch at the end of this section).
- NEMO (IKKβ–NEMO) N-terminal extension: The parent NBD peptide TALDWSWLQTE binds weakly (IC50 > 100 µM). Because of NEMO/IKK flexibility and the flat binding interface, emphasis was placed on broader sequence sampling, with multiple fine-tuning cycles guided by Rosetta interface energies. The top 67 candidates were further inspected by MD and structural analysis; 4 were selected for synthesis considering pose, polarity, diversity, and length.

Experimental assays: β-catenin binders were tested by competitive fluorescence polarization (FP). NEMO binders were tested in a competitive HTRF assay using GST-NEMO and a biotinylated IKKβ peptide. Selected peptides were synthesized by SPPS and purified by RP-HPLC; their identities were confirmed by MS.
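As referenced above, the hierarchical screen amounts to a filter-then-rank step: hard cutoffs on the FlexPepDock interface metrics, followed by re-ranking of the survivors with the more expensive MM/GBSA calculation. The sketch below applies the stated cutoffs to a table of metrics; the column names, example rows, and use of pandas are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical FlexPepDock results: one row per generated peptide, with
# the interface metrics described above (values are made up).
df = pd.DataFrame({
    "sequence":  ["YPEDILDKHLQRVWWFTDDHW", "YPEDILDKHLQRVAAAA"],
    "I_sc":      [-8.1, -4.2],    # interface score (REU)
    "I_bsa":     [310.0, 180.0],  # interface buried SASA (Angstrom^2)
    "rmsALL_if": [-0.5, 0.1],     # interface-atom RMSD metric (Angstrom)
})

# Cutoffs reported for the C-terminal extension study:
# I_sc < -6 REU, I_bsa > 250 A^2, rmsALL_if < -0.2 A.
passed = df[(df.I_sc < -6) & (df.I_bsa > 250) & (df.rmsALL_if < -0.2)]

# Survivors are ranked (here simply by I_sc) and the top candidates are
# forwarded to MD-based MM/GBSA for re-ranking.
shortlist = passed.sort_values("I_sc").head(10)
print(shortlist["sequence"].tolist())
```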
Key Findings
- β-catenin N-terminal extensions (Pretrain model):
  - Rosetta metric distributions improved with longer N-extensions (more favorable I_sc, larger I_bsa) with minimal pose perturbation (rmsALL_if centered around ~−0.8 Å).
  - The pretrained VAE-MH outperformed Rosetta Design for N-extensions (lower I_sc, higher I_bsa).
  - Hierarchical selection: of 300 candidates screened, the top 10 by MM/GBSA improved by ~4–8 kcal/mol over the parent; two selected candidates (NAL-6, NAL-9) showed improved IC50 values.
  - Experimental (FP): of the 4 peptides tested, 2 improved on the parent IC50 of 0.15 µM: NAL-9 (EGEKQ extension), IC50 = 0.084 µM, and NAL-6, IC50 = 0.10 µM; NAL-10 performed poorly (IC50 > 3 µM). MD suggested beneficial interactions of the negatively charged extensions with β-catenin residue R212 and release of E209, with potential to exploit nearby C213 for covalent strategies.
  - An OBOC library of random N-extensions (including noncanonical residues) yielded no binders stronger than the parent, indicating limited headroom for improvement via N-terminal extension and highlighting the focused power of the computational pipeline (2 of 4 designed candidates improved).
- β-catenin C-terminal extensions with iterative fine-tuning:
  - Fine-tuning enriched high-ranking sequences (rank-difference increases across cycles), and FlexPepDock metric distributions shifted favorably (lower I_sc, higher I_bsa, lower rmsALL_if) relative to the pretrained model.
  - The 162 top Rosetta peptides underwent MM/GBSA; extensions of the truncated base (YPEDILDKHLQRV) showed broader MM/GBSA energy distributions.
  - Experimental (FP) results for 8 peptides (Table 2): 4 improved over the parent. The best, CAL-2 (Fine-tune3; YPEDILDKHLQRVWWFTDDHW), reached IC50 = 0.010 ± 0.006 µM, ~15-fold better than the parent (0.15 ± 0.04 µM). Others: CAL-4, 0.070 ± 0.021 µM; CAL-1, 0.078 ± 0.01 µM; CAL-3, 0.089 ± 0.013 µM. Three full-length IL-containing extensions were modest (0.20–0.30 µM). CAL-7, affected by a human synthesis error, showed no improvement (IC50 = 1.4 µM), consistent with design predictions.
  - Structural insights (MD): aromatic residues at positions 14–15 and downstream aromatic/acidic patterns were favored. CAL-2 interactions included W14 forming a π–cation contact with R376 and stacking with W338, and W21 forming π–π interactions with W338, rationalizing the enhanced affinity. Across TCF/LEF1 complexes, acidic residues engaging Q302/R376 were observed and were captured by the model in its C-terminal proposals.
- NEMO extensions:
  - Despite the challenging flat, flexible binding surface, sequence-space fine-tuning shifted Rosetta energy distributions favorably over cycles.
  - Of 4 synthesized peptides, 2 improved on the NBD parent: NBD+2, IC50 ~50 µM, and NBD+12, IC50 ~75 µM; the others remained >100 µM. Favorable polar/acidic content matched a polar binding region (K90, R87, Q86, F82), and the N-terminal Trp of NBD+2 stacks against F82.
- Screening/ranking performance:
  - Rosetta alone separated good from poor binders but lacked fine rank resolution; MM/GBSA correlated with IC50 (r² ≈ 0.6), supporting the hierarchical selection (see the sketch below).
  - Overall, 6 of 12 designed β-catenin peptides improved over the parent; the best, CAL-2, exceeded the most potent training-set binder by ~3-fold (0.010 vs 0.037 µM) using only natural amino acids.
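The reported r² ≈ 0.6 between MM/GBSA energies and measured potencies can be understood as an ordinary coefficient of determination against log-transformed IC50 values. A minimal sketch, using hypothetical numbers rather than the paper's actual data:

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical MM/GBSA binding free energies (kcal/mol) and measured
# IC50 values (uM) for a handful of designed peptides; illustration only.
dg   = np.array([-52.0, -48.5, -45.0, -41.0, -38.5])
ic50 = np.array([0.010, 0.070, 0.15, 0.80, 3.0])

# Compare energies with log potencies, since binding free energy scales
# roughly with ln(IC50) for competitive inhibitors.
r, _ = pearsonr(dg, np.log10(ic50))
print(f"r^2 = {r**2:.2f}")
```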
Discussion
The study addresses whether combining deep generative sequence models with physics-based docking and MD can efficiently identify target-specific peptide binders from vast sequence spaces. For β-catenin, the pipeline reliably found improved binders: modest gains via N-terminal extension and significant gains via C-terminal extension after iterative fine-tuning on target-specific Rosetta scores, confirming that transfer learning enhances generation of target-aligned sequences. The hierarchical ranking—fast Rosetta screening followed by more accurate MM/GBSA—balanced speed and fidelity, with MM/GBSA correlating better with experimental potency. Structural analyses (MD) elucidated why certain sequence motifs (aromatic residues and acidic patterns near critical β-catenin hotspots R376/W338) enhanced binding, validating that the pipeline learns and exploits biophysically plausible interactions. For NEMO, despite difficult, flat, flexible interfaces and weak, unstructured parent binding, the approach still yielded two improved peptides, underscoring generality and robustness when structural guidance is limited. Collectively, results show that integrating data-driven latent space sampling with physics-based refinement can navigate enormous sequence spaces, enrich high-affinity candidates, and generate experimentally validated improvements over parent peptides, including surpassing previous bests within the training distribution.
Conclusion
This work presents an integrated computational framework combining a GRU-based VAE with Metropolis–Hastings sampling, Rosetta FlexPepDock screening, and MD-based MM/GBSA refinement to design target-specific peptide inhibitors. Applied to β-catenin and NEMO, the pipeline generated focused candidate sets from enormous sequence spaces and delivered experimentally validated affinity improvements. For β-catenin, 6/12 designed peptides improved over the parent, with C-terminal extension CAL-2 achieving IC50 = 0.010 ± 0.006 µM (~15× improvement), outperforming the best training-set binder using only natural amino acids. For NEMO, 2/4 designs improved over a weakly binding parent. Iterative fine-tuning effectively transferred target-specific information into the generative model. Future directions include: incorporating noncanonical amino acids and stapling chemistries; improving scoring via enhanced sampling and free-energy methods; explicit modeling of conformational ensembles for flexible targets; integration with cell permeability and stability predictors; and exploring de novo peptide design beyond extensions, as well as covalent strategies suggested by β-catenin C213 proximity.
Limitations
- The N-terminal extension design space appeared limited for β-catenin; both computational design and OBOC library screening yielded only modest or no improvements, likely because the polar, solvent-exposed surface competes with binding.
- Reliance on Rosetta scoring for fine-tuning introduces model bias; Rosetta separated good from poor binders but lacked fine rank resolution, and discrepancies between Rosetta-style optimization objectives and MD/MM energy functions hindered evolutionary-algorithm optimization.
- MM/GBSA, while correlating with IC50 (r² ≈ 0.6), is an approximate endpoint method; more rigorous alchemical free-energy methods were not used because of their computational cost.
- Target-specific labeled data remain scarce; iterative fine-tuning mitigates but does not eliminate this limitation.
- Experimental validation sets were relatively small (β-catenin: 12 peptides; NEMO: 4), limiting statistical generalizability. Additionally, downstream properties such as cell permeability, proteolytic stability, and in-cell efficacy were not comprehensively evaluated.