Deep-learning-enabled antibiotic discovery through molecular de-extinction
F. Wan, M. D. T. Torres, et al.
Fangping Wan and colleagues used deep learning to mine the proteomes of extinct organisms, uncovering 37,176 potential antibiotic peptides, 11,035 of them unique to extinct species. Synthesized candidates showed antimicrobial activity against bacterial pathogens in vitro and in mouse infection models, pointing toward new anti-infective therapies.
Introduction
Antimicrobial resistance causes approximately 1.27 million deaths annually and could reach 10 million by 2050 without effective new drugs, imposing severe economic and societal burdens. Leveraging evolutionary molecular diversity, the authors previously introduced the concept of molecular de-extinction to resurrect extinct biomolecules as therapeutics. Advances in computational methods and machine learning have enabled the discovery and design of antibiotics, including identification of encrypted peptides (EPs) within proteins. Building on these efforts, this study introduces Antibiotic Peptide de-Extinction (APEX), a multitask deep learning framework to systematically mine proteomes of extinct organisms (the extinctome) to discover antimicrobial peptide antibiotics.
Literature Review
Prior computational and AI approaches for antibiotic discovery include ML-based generation of antibiotics and prediction of antimicrobial activity, haemolysis, and resistance. Proteome mining has previously identified encrypted antimicrobial peptides in modern human proteins, aided by simple composition-based scoring functions emphasizing charge and hydrophobicity. However, such scoring functions may bias toward conventional AMP features. Deep learning approaches (e.g., Stokes et al., 2020; Ma et al., 2022) and comprehensive AMP predictors have advanced the field. The authors’ prior work mined the human proteome for EPs and coined molecular de-extinction, yielding preclinical candidates (e.g., neanderthalin-1). This work extends those ideas to extinct organism proteomes, aiming to access unexplored sequence space with potentially distinct mechanisms and properties.
Methodology
Data collection and preprocessing: The extinctome was assembled from the NCBI taxonomy browser (accessed December 2021), yielding 12,860 protein sequences from 208 extinct species (5,190 non-redundant). EPs were defined as substrings of length 8–50 aa from these proteins, producing 10,311,899 EP sequences. The modern human proteome consisted of 20,388 reviewed proteins (20,307 unique) from UniProt. In-house training data comprised 14,738 antimicrobial activity measurements (MICs) across 988 peptides against 34 bacterial strains; labels were -log10(MIC), with MIC>128 µM set to 140 µM before transformation. Public AMPs were sourced from DBAASP; a balanced binary set (5,093 AMPs, 5,500 non-AMPs) was created using MIC ≤ 30 µM as AMP criterion, removing overlaps with in-house data.
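The two preprocessing steps described above, enumerating substrings of length 8–50 and converting MICs to regression labels, can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names are hypothetical, and merging duplicate substrings into a set is an assumption about how EPs were counted.

```python
import math

MIN_LEN, MAX_LEN = 8, 50    # EP length range stated in the text
MIC_CAP_UM = 128.0          # measurements above this are treated as inactive
MIC_INACTIVE_UM = 140.0     # value substituted before the log transform


def enumerate_eps(protein, min_len=MIN_LEN, max_len=MAX_LEN):
    """Return the distinct substrings of `protein` with length in [min_len, max_len].

    Using a set (deduplication) is an assumption made for this sketch.
    """
    eps = set()
    n = len(protein)
    for start in range(n):
        for length in range(min_len, max_len + 1):
            end = start + length
            if end > n:
                break  # longer windows from this start would also overrun
            eps.add(protein[start:end])
    return eps


def mic_to_label(mic_um):
    """Convert an MIC in µM to the label -log10(MIC), with MIC > 128 set to 140."""
    if mic_um > MIC_CAP_UM:
        mic_um = MIC_INACTIVE_UM
    return -math.log10(mic_um)
```

With this labeling, more potent peptides (lower MIC) receive larger labels, and every inactive measurement collapses to the same value, -log10(140).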
Model architecture (APEX): APEX uses a peptide encoder that combines a GRU-based recurrent neural network with layer normalization and a two-layer attention mechanism to extract sequence features. Each residue is represented by a 566-dimensional AAindex feature vector, and sequences are padded to length 52, including start and stop tokens. The encoder output feeds two downstream fully connected neural networks (FCNNs): (1) a multitask regressor predicting species-specific antimicrobial activity across 34 strains (mean squared error loss) and (2) a binary classifier predicting AMP vs non-AMP for public data (binary cross-entropy loss). A multitask constraint regularizes the output layer weights using bacterial phylogenetic similarity (from GTDB), encouraging similar predictors for phylogenetically related species. The total loss combines regression, classification, multitask constraint, and L2 regularization terms. Training used Adam (lr=1e-4 with scheduled decay), batch size 128, mini-batch sampling from in-house and public data, and 5,000 epochs over the in-house set.
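The four-term loss described above can be illustrated with a small pure-Python sketch. The summary does not give the exact form of the multitask constraint, so the pairwise penalty used here (similarity-weighted squared distance between per-strain output weight vectors) is an assumption, as are the weighting coefficients `lam_cls`, `lam_mt`, and `lam_l2` and all function names.

```python
import math


def mse(pred, target):
    """Mean squared error for the regression head."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def bce(prob, label, eps=1e-12):
    """Binary cross-entropy for the AMP vs non-AMP head."""
    total = 0.0
    for p, y in zip(prob, label):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(prob)


def phylo_penalty(weights, similarity):
    """Assumed constraint: sum over strain pairs of S[i][j] * ||w_i - w_j||^2."""
    total = 0.0
    for i in range(len(weights)):
        for j in range(i + 1, len(weights)):
            sq_dist = sum((a - b) ** 2 for a, b in zip(weights[i], weights[j]))
            total += similarity[i][j] * sq_dist
    return total


def l2_penalty(weights):
    """Sum of squared weights."""
    return sum(w ** 2 for row in weights for w in row)


def total_loss(reg_pred, reg_target, cls_prob, cls_label, weights, similarity,
               lam_cls=1.0, lam_mt=0.1, lam_l2=1e-4):
    """Weighted sum of the four terms; the coefficients are hypothetical."""
    return (mse(reg_pred, reg_target)
            + lam_cls * bce(cls_prob, cls_label)
            + lam_mt * phylo_penalty(weights, similarity)
            + lam_l2 * l2_penalty(weights))
```

Under this form, two strains with high phylogenetic similarity pay a large penalty when their output weight vectors diverge, which pushes their predictors toward each other.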
Model selection and ensembling: The in-house dataset was split into CV (790 peptides) and independent (198 peptides) sets. Fivefold CV optimized hyperparameters; baselines (elastic net, linear SVR, extra-trees, random forest, GBDT) used k-mer and modLAMP-derived features. Top eight APEX configurations (by CV R²) were ensembled (ensemble APEX v1). To improve robustness, each of the eight was trained with five random seeds, yielding 40 models (ensemble APEX v2), whose predictions were averaged.
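The ensembling step, eight configurations times five seeds with averaged predictions, can be sketched as below. Models are represented as plain callables that return one prediction per strain; this interface and the function names are hypothetical stand-ins for trained APEX networks.

```python
from itertools import product
from statistics import fmean


def ensemble_members(n_configs=8, n_seeds=5):
    """List the (configuration index, seed) pairs that make up the ensemble."""
    return list(product(range(n_configs), range(n_seeds)))


def ensemble_predict(models, peptide):
    """Average per-strain predictions over all models in the ensemble.

    Each model is assumed to be a callable mapping a peptide sequence to a
    list with one predicted value per strain.
    """
    per_model = [model(peptide) for model in models]
    n_strains = len(per_model[0])
    return [fmean(pred[k] for pred in per_model) for k in range(n_strains)]
```

Averaging over seeds reduces the variance that comes from random initialization and mini-batch order, which is the robustness motivation given in the text.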
Mining and selection: APEX (ensemble v2) predicted activity for 10,311,899 EPs, identifying 37,176 predicted actives (median MIC ≤ 80 µM). Sequence-space comparison used Smith-Waterman similarity and UMAP to contrast with DBAASP peptides. EPs were classified as archaic (AEPs: not present in an extant protein set from UniProt Swiss-Prot, TrEMBL, and isoforms) or modern (MEPs: also present in extant organisms). For experimental validation, peptides were prioritized via ranked lists for broad-spectrum potency (median MIC across 11 pathogens), Gram-negative selectivity, or Gram-positive selectivity. Filters included: length 8–30 aa; exclusion of sequences present in the modern human proteome or in-house dataset; exclusion of candidates with >75% similarity to DBAASP peptides or to each other (keep top-ranked); activity thresholds (median MIC <80 µM for broad-spectrum) and selectivity thresholds (selectivity score <0.75). This yielded 3,784 candidates; 69 peptides (20 AEPs, 49 MEPs; 21 from secreted and 48 from non-secreted proteins) were synthesized for testing.
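Two of the selection steps above lend themselves to a short sketch: the archaic vs modern label and the greedy redundancy filter over a ranked list. Note the loud assumption: `difflib.SequenceMatcher` is used here only as a convenient stand-in for the Smith-Waterman-based similarity mentioned in the text, so the resulting scores are not comparable to the paper's. Function names and the exact filter order are also hypothetical.

```python
from difflib import SequenceMatcher


def classify_ep(ep, extant_proteins):
    """Label an EP 'MEP' if it occurs inside any extant protein, else 'AEP'."""
    return "MEP" if any(ep in protein for protein in extant_proteins) else "AEP"


def similarity(a, b):
    """Stand-in similarity in [0, 1]; NOT Smith-Waterman (assumption)."""
    return SequenceMatcher(None, a, b).ratio()


def select_candidates(ranked, known, min_len=8, max_len=30, max_sim=0.75):
    """Walk a best-first ranked list and keep non-redundant peptides.

    A peptide is dropped if it falls outside the length window, or if its
    similarity to a known peptide or to an already kept (higher-ranked)
    candidate exceeds max_sim.
    """
    kept = []
    for pep in ranked:
        if not (min_len <= len(pep) <= max_len):
            continue
        if any(similarity(pep, k) > max_sim for k in known):
            continue
        if any(similarity(pep, k) > max_sim for k in kept):
            continue
        kept.append(pep)
    return kept
```

Because the list is processed in rank order, the top-ranked member of each cluster of similar sequences is the one that survives, matching the "keep top-ranked" rule in the text.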
Wet-lab assays: In vitro MICs (broth microdilution) were measured against 11 clinically relevant pathogens (7 Gram negatives, 4 Gram positives, including ESKAPEE strains). Mechanistic assays included cytoplasmic membrane depolarization (DiSC3-5 fluorescence) and outer membrane permeabilization (NPN uptake). Secondary structure was predicted with S4PRED and experimentally assessed by circular dichroism in 60% TFE. Synergy was measured by checkerboard assays with FICI calculations against A. baumannii and P. aeruginosa. Cytotoxicity was evaluated on HEK293T cells (MTT, CC50). Serum stability was measured in 25% human serum by UPLC-MS/MS. In vivo efficacy was tested in mouse skin abscess and neutropenic thigh infection models with single-dose treatments at each peptide’s MIC, monitoring bacterial load reductions and animal weight over 2–4 days.
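For the checkerboard assays mentioned above, the fractional inhibitory concentration index (FICI) is the sum of each agent's MIC in combination divided by its MIC alone. The sketch below uses this standard definition; the interpretation cut-offs are one common convention and are an assumption here, since the summary does not state which thresholds the authors applied.

```python
def fici(mic_a_alone, mic_b_alone, mic_a_combo, mic_b_combo):
    """FICI = MIC_A(combo)/MIC_A(alone) + MIC_B(combo)/MIC_B(alone)."""
    return mic_a_combo / mic_a_alone + mic_b_combo / mic_b_alone


def interpret_fici(value):
    """Map a FICI value to a category (cut-offs are an assumed convention)."""
    if value <= 0.5:
        return "synergistic"
    if value <= 1.0:
        return "additive"
    if value <= 4.0:
        return "indifferent"
    return "antagonistic"
```

As a hypothetical example, two peptides with MICs of 8 µM each alone that inhibit growth together at 2 µM and 1 µM give FICI = 2/8 + 1/8 = 0.375, which falls in the synergistic range under these cut-offs.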
Key Findings
Model performance: On the independent test set, ensemble APEX v2 achieved average R²=0.546, Pearson r=0.728, Spearman ρ=0.607 across species-specific MIC predictions, outperforming all baselines. Against a set of 69 newly synthesized EPs, APEX provided the highest mean species-wise Pearson correlation of log MIC (Fig. 2b). Compared to a prior charge/hydrophobicity-based scoring function, APEX better predicted activities of 56 validated human peptides and outperformed on the 69 EPs.
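The three metrics reported above can be computed for a single strain as in the following dependency-free sketch; averaging the results over strains would then give species-wise means. The function names are hypothetical, and this is an illustration of the standard definitions rather than the authors' evaluation code.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot
```

R² penalizes systematic offsets between predicted and measured values, while the two correlations only measure linear and rank agreement, which is one reason the three numbers can differ for the same predictions.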
Extinctome mining: APEX evaluated 10,311,899 EPs from 208 extinct species (5,190 non-redundant proteins) and predicted 37,176 actives (median MIC ≤ 80 µM), including 11,035 archaic EPs absent from extant proteomes, occupying novel sequence-space clusters relative to DBAASP peptides.
Experimental validation: Of 69 synthesized peptides (20 AEPs, 49 MEPs), 41 showed antimicrobial activity (MIC ≤ 128 µM) against at least one pathogen (59% hit rate), versus 24% for the prior scoring function on comparable sources. APEX’s predicted vs measured log MICs across pathogens yielded Pearson r=0.448 and Spearman ρ=0.404. Peptides included active leads such as mammuthusin-2 (woolly mammoth), elephasin-2 (straight-tusked elephant), hydrodamin-1 (ancient sea cow), mylodonin-2 (giant sloth), and megalocerin-1 (giant elk).
Composition and properties: AEPs differ from conventional AMPs and modern EPs, with higher content of uncharged polar residues (e.g., Asn, Gln, Ser, Thr) and aliphatic Ile/Leu, lower amphiphilicity, slightly lower hydrophobicity, and similar net positive charge. AEPs showed lower aggregation propensity and smaller predicted hydrophobic face angles. Structural analysis indicated APEX-selected peptides tended to adopt α-helical conformations in helix-inducing conditions despite their unusual compositions.
Mechanism of action: Most AEPs/MEPs depolarized the cytoplasmic membrane strongly, often more than polymyxin B (e.g., anomalopterin-1, mylodonin-4, equusin-2, hesperelin-3, hydrodamin-2), but generally did not robustly permeabilize the Gram-negative outer membrane at MIC, contrasting with many known AMPs and previously identified human EPs.
Synergy: Checkerboard assays revealed frequent synergistic or additive interactions (FICI as low as 0.38), with some combinations reducing MICs by up to 64-fold (e.g., equusin-1 + equusin-3 to 62.5 nM). Examples: hesperelin-1 + hesperelin-3; mammuthusin-1 + mammuthusin-3; equusin-2 + equusin-4; hydrodamin-1 + hydrodamin-3.
Safety and stability: 39/41 active peptides showed no notable cytotoxicity to HEK293T cells up to 128 µM; lophisin-1 and xenothrixin-1 had CC50 ≈ 68–71 µM, but their MICs were 8–8.84-fold lower than their CC50 values. In 25% human serum, about 40% of elephasin-2 and mammuthusin-2 remained after 6 h; the other peptides degraded faster.
In vivo efficacy: In mouse skin abscess and neutropenic thigh infection models (A. baumannii), single-dose treatments at MIC reduced bacterial loads by 2–5 log10 over 2–4 days. Notably, mylodonin-2 matched polymyxin B at day 2 in the thigh model; by day 4, elephasin-2 and mammuthusin-2 achieved reductions similar to polymyxin B. No adverse weight changes or overt toxicity were observed.
Discussion
This study demonstrates that deep learning-driven molecular de-extinction can uncover antibiotic peptides from the proteomes of extinct organisms, addressing the urgent need for new antimicrobials. APEX outperformed traditional ML baselines and composition-based scoring, enabling exploration of novel sequence space and identification of potent, selective, and synergistic peptides with low cytotoxicity. The distinct amino acid compositions and physicochemical properties of AEPs suggest alternative interaction modes with bacterial membranes, particularly enhanced cytoplasmic membrane depolarization rather than the strong outer membrane permeabilization typical of many AMPs. The combination of computational discovery with wet-lab validation (spanning in vitro potency, mechanism assays, serum stability, and in vivo efficacy) supports the translational potential of these molecules. These findings validate the hypothesis that extinct proteomes harbor encrypted antimicrobial sequences and that AI can effectively resurrect them for modern therapeutic challenges.
Conclusion
The authors present APEX, a multitask deep learning framework that mines extinct organism proteomes to discover antibiotic peptides. By screening over 10 million EPs, APEX identified 37,176 predicted actives, many absent from extant proteomes, and delivered a 59% experimental hit rate among 69 synthesized candidates. Several peptides demonstrated strong in vitro activity, membrane-depolarizing mechanisms, synergy, favorable safety, partial serum stability, and in vivo efficacy comparable to polymyxin B. This work establishes molecular de-extinction as a viable strategy for antibiotic discovery and expands accessible peptide sequence space. Future work should incorporate structural/3D descriptors to enhance prediction accuracy, expand and diversify the in-house activity dataset, and leverage newly available extinct and extant proteomes to refine AEP classification and broaden discovery.
Limitations
APEX is sequence-based and omits explicit structural/3D information, potentially limiting accuracy. The in-house training dataset size constrains model generalizability; expanding it could improve performance. Extinct proteomes are sparsely annotated and limited in number and quality, restricting candidate identification and functional inference. The AEP vs MEP classification depends on currently available extant proteome databases, which may change as new data emerge. Many peptides showed limited serum stability, and broader pharmacokinetic and toxicity evaluations are needed to assess clinical translatability.