Sequence-based drug design as a concept in computational drug design
L. Chen, Z. Fan, et al.
The study addresses limitations of the conventional structure-based drug design (SBDD) pipeline, which relies on accurate 3D protein structures, correct identification of orthosteric or allosteric pockets, and robust virtual screening or de novo design steps. Many targets lack high-resolution structures; even with modern predictors (AlphaFold, RoseTTAFold), active-site precision and pocket definition can be unreliable, especially for flexible or multi-domain proteins. Errors can propagate across pipeline steps, yielding false positives and reduced success rates. The authors propose a sequence-to-drug paradigm: learning compound–protein interactions (CPIs) directly from protein sequences via end-to-end differentiable deep learning, thus avoiding intermediate structural steps and potentially reducing error accumulation. The research tests whether a sequence-only model can generalize across protein and chemical space, capture binding-relevant information, and yield experimentally validated hits and targets, including for proteins lacking 3D structures.
End-to-end differentiable learning has transformed fields such as computer vision and speech recognition, and underpins successes like AlphaFold in protein structure prediction. Prior CPI prediction models that use protein sequences include deep neural networks and attention-based approaches (e.g., TransformerCPI, GraphDTA, MolTrans, DeepDTA, MONN, HyperAttentionDTI), as well as models built on embeddings from protein language models (e.g., TAPE-BERT, UniRep). However, the sequence-to-drug paradigm had not been comprehensively validated through both generalization tests and experimental hit discovery. The paper also outlines the challenges facing SBDD: incomplete structural coverage, uncertain local active-site accuracy in predicted structures, dynamic or allosteric pockets, and docking limitations (scoring errors and false positives). The recent growth of ultra-large make-on-demand libraries motivates computationally efficient virtual screening methods that can explore vast chemical spaces without depending on structures.
Model: TransformerCPI2.0 is an end-to-end, sequence-based CPI predictor. Key updates over the earlier TransformerCPI: (1) the 3-gram Word2Vec embeddings were removed; (2) the pretrained protein language model TAPE-BERT (a 12-layer transformer) computes 768-dimensional protein embeddings; (3) the CNN/GLU layers were replaced with a self-attention transformer encoder (3 layers, 8 heads, hidden size 768, FFN size 3072) without positional embeddings; (4) a virtual atom was introduced to capture molecule-level context within a GCN-based atom embedding (one GCN layer is recommended to avoid over-smoothing). Protein embeddings serve as the target sequence and atom embeddings as the memory of a transformer decoder (3 layers, 8 heads). Training used AdamW or a variant (see the paper's Methods), with datasets curated to mitigate ligand bias.

Dataset curation: Training, validation, and test sets were constructed from ChEMBL23 so that each ligand appears with both positive and negative labels, paired with different proteins. Label reversal splits, in which a ligand appears in opposite classes in the training and test sets, force models to use protein information and overcome ligand bias. Baselines retrained on the same data: CPI-GNN, GraphDTA (GAT-GCN), MolTrans, and GCN.

Evaluation: ROC-AUC and PRC (precision-recall curve) on the ChEMBL test set. Generalization was tested on (i) a large external set with all-new ligands and targets (342,477 pairs; pos:neg 1:3) and (ii) a time-split ChEMBL27 set (92,919 pairs; pos:neg 1:1) representing future data.

Virtual screening comparison: Enrichment factors (EF0.5%, EF1%, EF5%) were benchmarked on DUD-E and DEKOIS2.0 against GOLD and AutoDock Vina.

Model interpretation tools:
(1) Drug resistance mutation analysis: for a given protein–ligand pair, mutate each residue to all 20 amino acids, compute the score changes ΔS, average them per position, and normalize to a relative activity change ΔR in [0,1] to prioritize positions important for binding. Results were visualized as heatmaps and profiles.
(2) Trifluoromethyl substitution effect analysis: tests whether the sign of the predicted activity change (Δsc = score(CF3) − score(CH3)) matches the experimental ΔpAct for 18,217 methyl-to-trifluoromethyl (Me→CF3) pairs (after de-duplication), with analyses on the subsets showing changes of ≥1 order of magnitude (15.73% of pairs) and ≥3 orders of magnitude (188 cases).

Prospective applications:
- SPOP (E3 ligase adaptor) hit discovery: Virtual screening of the ~1.6M-compound ChemDiv in-stock library. Steps: score with TransformerCPI2.0; select the top ~2% (35,000 diverse compounds); filter PAINS; cluster by ECFP (~800 clusters); apply Lipinski's rules; select 82 candidates for assays. Assays: fluorescence polarization (FP) for competition with substrate peptide; NMR (CPMG, STD); GST pull-down; protein thermal shift (PTS); cell permeability; cellular Co-IP and ubiquitination assays; cell proliferation; PK, acute toxicity, and in vivo xenograft efficacy. Hit expansion yielded 26 analogs of 221C7, of which 19 were active.
- RNF130 (E3 ligase, no 3D structure) hit discovery: A similar pipeline on the ~2M-compound Chemspace in-stock library: top 10,000 (~0.5%) diverse compounds, PAINS filtering, clustering (~200 clusters), Lipinski's rules, 87 candidates purchased. Binding was validated by SPR (KD), CETSA, and ITC (KD, thermodynamics, stoichiometry).
- Drug repurposing (inverse workflow): 2,204 human proteins from DrugBank were scored proteome-wide against four proton pump inhibitors (PPIs: rabeprazole, lansoprazole, omeprazole, pantoprazole). Top predictions were inspected for novelty and feasibility, and ARF1 was selected for testing. Experiments: PTS to assess binding to and destabilization of wild-type ARF1; mutation analysis, which highlighted region 150–165 containing C159; PTS with DTT and with the ARF1 C159A mutant to test covalency; intact MS and MS/MS to locate the covalent site (C159); docking to propose a binding pose. Functional assay: inhibition of ARNO-catalyzed GDP/MANT-GTP exchange by rabeprazole. Cellular assays: ARF1 activity (G-LISA) and lipid droplet accumulation (Nile Red) in CT26 cells. In vivo: efficacy in the BALB/c CT26 syngeneic model (40 mg/kg ip, 10 days), with immune profiling by FACS and IHC. ARF1 knockdown controls were used to demonstrate on-target dependence.

All assays followed the detailed protocols in the paper's Methods (protein expression/purification, FP, PTS/CETSA, NMR, MS, SPR, ITC, Co-IP, ubiquitination, proliferation, PK/tox, xenografts, FACS, staining, synthesis of 230D7/222A5).
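The drug resistance mutation analysis described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `toy_score` is a hypothetical stand-in for a trained CPI model, the sequence is made up, and the min-max scaling of |mean ΔS| is an assumption, since the summary only states that ΔS is averaged per position and normalized to [0,1].

```python
from statistics import mean
from typing import Callable, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def mutation_scan(sequence: str, score_fn: Callable[[str], float]) -> List[float]:
    """Return one relative activity change (dR, in [0, 1]) per residue position."""
    wild_type_score = score_fn(sequence)
    mean_delta = []
    for pos in range(len(sequence)):
        deltas = []
        for aa in AMINO_ACIDS:
            mutant = sequence[:pos] + aa + sequence[pos + 1:]
            # dS for one substitution (zero when aa equals the wild-type residue)
            deltas.append(score_fn(mutant) - wild_type_score)
        mean_delta.append(mean(deltas))  # average dS at this position

    # Assumption: min-max scaling of |mean dS|; the source only says "normalize".
    magnitude = [abs(d) for d in mean_delta]
    lo, hi = min(magnitude), max(magnitude)
    if hi == lo:
        return [0.0] * len(sequence)
    return [(m - lo) / (hi - lo) for m in magnitude]


def toy_score(seq: str) -> float:
    """Hypothetical scorer: rewards W at index 2 and F at index 5."""
    return 0.6 * (seq[2] == "W") + 0.3 * (seq[5] == "F") + 0.1


wild_type = "MKWLEFGA"  # made-up sequence for illustration only
profile = mutation_scan(wild_type, toy_score)
# Positions ranked by dR, highest first; the top entries are candidate hotspots.
hotspots = sorted(range(len(profile)), key=lambda i: profile[i], reverse=True)[:2]
print(hotspots, [round(p, 2) for p in profile])
```

With a real model, `score_fn` would wrap a forward pass for a fixed ligand, and the resulting profile could be compared against a known binding pocket, as the paper does for HIV-1 reverse transcriptase.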
- Predictive performance and generalization: TransformerCPI2.0 achieved the best ROC-AUC and PRC among the baselines on the ChEMBL test set, including superior performance in the low false-positive region. It also generalized best to a large external set (342,477 pairs; all-new ligands and 1,192 new targets; pos:neg 1:3) and to a time-split ChEMBL27 set (92,919 pairs; all-new ligands and 637 new targets; pos:neg 1:1), indicating transfer to unseen proteins and compounds and to future data.
- Virtual screening vs structure-based docking: On the DUD-E and DEKOIS2.0 benchmarks, TransformerCPI2.0 yielded enrichment factors comparable to docking. It performed slightly below GOLD but similar to or better than AutoDock Vina (and better than both in a second benchmark table), despite using no protein structural information.
- Binding-site interpretation: Drug resistance mutation analysis on HIV-1 reverse transcriptase with doravirine (PDB: 4NCG) showed that high-ΔR positions overlap with the known binding pocket and correctly highlighted known resistance sites (e.g., P225, F227, L234, P236). A negative control (aspirin) produced a distinct ΔR pattern.
- Substitution effect analysis (Me→CF3): Only 15.73% of substitutions produced activity changes of ≥1 order of magnitude. TransformerCPI2.0 achieved higher accuracy than the baselines (sign agreement between Δsc and ΔpAct) on both the full activity-changing subset and the stringent ≥3 orders subset (188 cases), with statistically significant improvements (reported P-values ≤ 0.001 in most comparisons).
- SPOP inhibitors: Of 82 tested candidates, four were identified as hits by FP (hit rate ~5%); 221C7 was the most active (FP IC50 = 4.51 µM). Direct binding was validated by NMR, and GST pull-down showed disruption of the SPOP MATH–PTEN interaction. 222A5 served as a negative control (FP IC50 > 100 µM, no disruption). Hit expansion: 26 analogs purchased, 19 active. 230D7 exhibited improved cell permeability and an FP IC50 of 8.70 µM, with direct binding (PTS, NMR), disruption of SPOP–PTEN/DUSP7 interactions (Co-IP), reduced ubiquitination of PTEN/DUSP7, increased PTEN/DUSP7 protein levels, and decreased p-AKT/p-ERK in 786-O cells. It showed a selective antiproliferative effect in clear cell renal cell carcinoma (ccRCC) lines (~20 µM). In vivo, 230D7 reduced 786-O xenograft growth in NSG mice dose-dependently at 25 and 50 mg/kg with no body weight loss; tumor tissues showed increased PTEN/DUSP7 and decreased p-AKT/p-ERK.
- RNF130 binder: iRNF130-63 was discovered for a target lacking structural data. Binding was confirmed by SPR (KD ~9.36 µM, fast on/off kinetics), CETSA stabilization, and ITC (KD 1.23 µM; ΔG −33.8 kJ/mol; ΔH −7.31 kJ/mol; N ≈ 1.0).
- Proton pump inhibitor (PPI) repurposing to ARF1: TransformerCPI2.0 ranked ARF1 among the top predicted targets for PPIs. PPIs bound ARF1 and destabilized it (PTS). Mutation analysis implicated region 150–165, and experiments showed covalent binding at C159: the PTS effects were abolished by DTT or by the C159A mutation, and MS/MS identified a C159 adduct. Rabeprazole, the most potent PPI in PTS, inhibited ARNO-mediated GDP→GTP exchange in vitro. In CT26 cells, rabeprazole reduced ARF1 activity (G-LISA) and induced lipid droplet accumulation. In vivo (BALB/c CT26), rabeprazole (40 mg/kg) significantly inhibited tumor growth (P=0.0001), increased CD3+CD8+ T cells, decreased exhausted T-cell subsets (FACS), and modulated IHC markers (↑CD8, ↓PD1). ARF1 knockdown abrogated the effects of rabeprazole on lipid droplets, tumor growth, and immune modulation, confirming ARF1 dependence.
- Novelty and non-similarity: The hits and targets were unseen in training; the discovered compounds showed low similarity to known actives and were not retrieved by other tools, supporting model generalization beyond memorization.
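As a quick consistency check on the ITC values reported for iRNF130-63, the binding free energy can be recomputed from the dissociation constant using ΔG = RT ln(KD), and the entropic term obtained from ΔG = ΔH − TΔS. The sketch below assumes a temperature of 298.15 K, which the summary does not state, so the result should only be expected to match the reported ΔG approximately.

```python
import math

R = 8.314462618  # molar gas constant, J/(mol*K)


def binding_free_energy_kj(kd_molar: float, temperature_k: float = 298.15) -> float:
    """dG = R*T*ln(KD / 1 M), returned in kJ/mol (negative when KD < 1 M)."""
    return R * temperature_k * math.log(kd_molar) / 1000.0


kd_itc = 1.23e-6  # ITC KD reported above, in mol/L
dh_itc = -7.31    # ITC dH reported above, in kJ/mol

# Assumption: 298.15 K; the measurement temperature is not given in the summary.
dg = binding_free_energy_kj(kd_itc)
minus_t_ds = dg - dh_itc  # entropic term -T*dS, rearranged from dG = dH - T*dS

print(f"dG = {dg:.1f} kJ/mol, -T*dS = {minus_t_ds:.1f} kJ/mol")
```

Under that assumption the computed ΔG lands within about 0.1 kJ/mol of the reported −33.8 kJ/mol, and the reported numbers imply that the entropic term is larger in magnitude than ΔH, which would point to a largely entropy-driven interaction.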
The results demonstrate that a sequence-only, end-to-end differentiable model can approach structure-based screening performance while avoiding reliance on 3D structural information and manual multi-step pipelines, thereby reducing potential error accumulation. TransformerCPI2.0 learned binding-relevant signals: it identified mutation hotspots overlapping orthosteric pockets and captured SAR trends in challenging substitution-effect cases. Prospectively, it found active scaffolds against difficult E3 ubiquitination pathway targets: new SPOP inhibitors (with cellular and in vivo efficacy) and the first reported RNF130 binder, found without structural templates, underscoring its utility for targets lacking high-quality structures. The inverse workflow enabled proteome-wide target identification, repurposing proton pump inhibitors (PPIs) to ARF1 by predicting and experimentally validating covalent engagement at C159, accompanied by functional inhibition and antitumor immune responses consistent with the ARF1-CSC lipid metabolism axis. These findings suggest the sequence-to-drug concept can complement or substitute for SBDD, particularly early in campaigns or for dynamic or structure-poor targets, and can be scaled to ultra-large, synthesizable libraries to explore novel chemotypes. The SPOP and RNF130 chemotypes may also serve as warheads for PROTAC design, expanding therapeutic modalities.
This proof-of-concept establishes the sequence-to-drug paradigm using TransformerCPI2.0 to directly discover modulators from protein sequences. The model generalizes across proteins and chemotypes, interprets binding-relevant features, and enables both virtual screening (SPOP, RNF130) and inverse target identification/repurposing (PPIs→ARF1). Experimentally validated hits showed direct binding, mechanistic effects, cellular activity, and in vivo efficacy. The approach offers an alternative or complement to SBDD, especially for targets with limited or uncertain structural data, and is well-suited for rapid screening of ultra-large libraries. Future work may (i) extend interpretability and systematic benchmarking for pocket localization and activity cliffs, (ii) integrate structural cues when available (hybrid models), (iii) couple with generative design, and (iv) expand to proteome-scale target deconvolution and deorphanization.
- Generalization is not guaranteed for every novel target; performance depends on training data coverage and quality.
- Interpretation tools (mutation analysis, substitution effect analysis) are qualitative aids; systematic, quantitative validation of predicted binding sites and activity cliffs is beyond the current scope.
- Lack of explicit 3D structural constraints may limit pose interpretability or specificity for certain allosteric/dynamic sites.
- β-Lactam-containing hits (e.g., 221C7/230D7) may carry risks (potential covalent reactivity, immunogenicity, antibiotic resistance concerns), necessitating careful medicinal chemistry optimization.
- Docking comparisons show that while competitive with some methods, performance can be inferior to top commercial docking in some benchmarks.
- Covalent predictions (e.g., PPIs to ARF1) require targeted validation; redox environment (DTT sensitivity) and mutation data are necessary to confirm mechanisms.