ProtGPT2 is a deep unsupervised language model for protein design

Biology

N. Ferruz, S. Schmidt, B. Höcker

Discover ProtGPT2, a groundbreaking language model developed by Noelia Ferruz, Steffen Schmidt, and Birte Höcker that generates novel protein sequences. These sequences preserve natural amino acid preferences while sampling previously unexplored regions of protein space, yielding well-folded structures of unique complexities. Rapid generation and public accessibility make this a significant advancement in protein research.
Introduction

The study addresses whether large autoregressive language models can learn the “protein language” to generate de novo protein sequences that resemble natural proteins in composition, structure, and potential function. Motivated by advances in NLP and analogies between protein sequences and human language, the authors aim to train a generative model that (i) effectively captures protein sequence regularities, (ii) yields sequences likely to fold into stable, globular structures, and (iii) explores relationships to natural proteins, including sampling unexplored regions of protein space. The purpose is to enable high-throughput protein design from sequence alone, with potential impact across biomedical and environmental applications.

Literature Review

The paper situates ProtGPT2 within the trajectory of applying NLP to protein biology. Prior work includes supervised models for predicting protein properties and numerous unsupervised Transformer-based models (e.g., ESM, ProtTrans, ProteinBERT) that primarily use BERT-like masked language modeling for representation learning. Autoregressive protein language models (e.g., ProGen, RITA, DARK) demonstrate strong generative capabilities. However, generation quality depends heavily on decoding strategies, and prior work has not fully established large-scale, natural-like de novo sequence generation coupled to broad structural and evolutionary analyses. The authors leverage insights from NLP decoding research (e.g., top-k sampling, repetition penalties) to improve generation quality for proteins.

Methodology

Model and data: An autoregressive, decoder-only Transformer (GPT2-large architecture) with 36 layers, a model dimension of 1280, and 738M parameters was trained on UniRef50 (version 2021_04), using 44.88 million sequences for training and 4.99 million for validation. Sequences were tokenized with a byte-pair encoding (BPE) tokenizer trained on Swiss-Prot (2021_04), yielding a 50,256-token vocabulary (on average ~4 amino acids per token). Training minimized the negative log-likelihood under a causal language-modeling objective using Adam (β1=0.9, β2=0.999), a learning rate of 1e-3, and a global batch of 65,536 tokens (128 GPUs × 512 tokens per device, batch size 8 per device), running for 4 days on 128 NVIDIA A100 GPUs with DeepSpeed parallelism.
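The causal language-modeling objective above reduces to minimizing the average negative log-likelihood of each token given its prefix. A minimal pure-Python sketch of that loss on a toy next-token model (illustrative only; not the actual ProtGPT2 tokenizer or network):

```python
import math

def causal_nll(token_ids, next_token_probs):
    """Average negative log-likelihood of a sequence under a causal LM.

    next_token_probs(prefix) returns a dict mapping each candidate next
    token to its model probability given that prefix.
    """
    nll = 0.0
    for i in range(1, len(token_ids)):
        probs = next_token_probs(token_ids[:i])
        nll -= math.log(probs[token_ids[i]])
    return nll / (len(token_ids) - 1)  # per-token average

# Toy uniform model over a 4-token vocabulary: every next token has
# probability 0.25, so the per-token loss equals ln(4).
uniform = lambda prefix: {t: 0.25 for t in range(4)}
loss = causal_nll([0, 1, 2, 3], uniform)
```

In training, the same quantity is computed from the model's softmax outputs over the whole batch and minimized by gradient descent.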

Inference and sampling: The authors systematically evaluated decoding strategies (greedy, beam search, random sampling) across parameters: repetition penalty 1.1–3.0, top-k 250–1000 (step 50), and top-p 0.7–1.0 (step 0.05). They selected top-k=950, repetition penalty=1.2, default temperature and top-p=1.0 based on matching amino acid propensities to natural proteins.
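The two decoding transforms the authors settled on, top-k filtering and a repetition penalty, act on the model's next-token scores before sampling. A framework-free sketch of both (function names and the toy logits are illustrative; real pipelines apply these transforms inside the decoding loop):

```python
import math

def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Down-weight tokens already present in the generated prefix:
    positive logits are divided by the penalty, negative ones multiplied."""
    out = list(logits)
    for t in set(generated_ids):
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out

def top_k_probs(logits, k=950):
    """Keep the k highest-scoring tokens and renormalize them
    into a sampling distribution."""
    ranked = sorted(range(len(logits)), key=lambda t: logits[t], reverse=True)[:k]
    weights = {t: math.exp(logits[t]) for t in ranked}
    z = sum(weights.values())
    return {t: w / z for t, w in weights.items()}

# Toy 5-token vocabulary; token 0 was already generated, so it is penalized.
logits = [2.0, 1.0, 0.5, -0.5, -1.0]
penalized = apply_repetition_penalty(logits, generated_ids=[0], penalty=1.2)
probs = top_k_probs(penalized, k=3)
```

The next token is then drawn at random from `probs`; with k=950 and penalty=1.2, this is the configuration the authors report as best matching natural amino acid propensities.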

Datasets: Three 10,000-sequence datasets were assembled: (1) ProtGPT2-generated sequences (sampled with the chosen parameters; from an initial 100,000 generations, truncated sequences were removed and the remainder randomly downsampled to 10,000; average length 149.2±50.9 aa), (2) natural sequences (randomly sampled from UniRef50 and length-matched to the ProtGPT2 set), and (3) random sequences (strings drawn from the 25 amino acid/IUPAC characters, with lengths sampled from a normal distribution spanning 5–267 aa).
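The filter-then-downsample step for the generated set can be sketched as below, assuming truncated generations are detected by a missing end-of-text terminator (the marker shown is GPT-2's default; the paper does not spell out its exact filtering criterion):

```python
import random

END = "<|endoftext|>"  # GPT-2's end-of-text token; assumed terminator here

def build_dataset(sequences, n=10_000, seed=0):
    """Drop generations truncated by the sampling window (no terminator),
    then randomly downsample the survivors to a fixed-size dataset."""
    complete = [s for s in sequences if s.endswith(END)]
    rng = random.Random(seed)
    return rng.sample(complete, min(n, len(complete)))

# Toy pool: one truncated generation is filtered out before downsampling.
pool = ["MKVL" + END, "MAAG", "MGGT" + END]
dataset = build_dataset(pool, n=2)
```

The natural control set is then drawn from UniRef50 with lengths matched to this filtered set.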

Analyses:

  • Amino acid propensities and decoding effects were compared across sampling strategies to approach natural distributions.
  • Disorder and secondary structure: IUPred3 (short/long/glob) assessed disorder and globularity; PSIPRED v4.0 predicted secondary structure content using HHblits-derived MSAs.
  • Homology detection: HHblits searches against Uniclust30 (2018_08) produced identity vs. alignment length plots; the HSSP curve was used to benchmark potential homology.
  • Structure prediction: AlphaFold (ColabFold batch v1.2) produced five models per ProtGPT2 sequence; per-residue confidence (pLDDT) served as a proxy for structural order; the natural and random datasets were assessed the same way for comparison.
  • Energetics: Rosetta RelaxBB (Rosetta v3.12, REF2015) performed Monte Carlo optimization starting from AlphaFold structures; total energy per residue (REU/res) compared across datasets.
  • Dynamics: HTMD/ACEMD used Amber ff19SB for explicit-solvent MD. Twelve sequences per dataset were simulated for 100 ns in three replicas each, totaling 108 trajectories (10.8 μs). RMSD distributions were compared; pLDDT distributions were matched across datasets to control for disorder effects.
  • Protein space networks: HMM profiles for SCOPe 2.07 (95% filter) and ProtGPT2 sequences were compared all-against-all with HHsearch. Networks were built linking sequences with ≥20 aa alignments and ≥70% HHsearch probability using Protlego; topology similarity assessed with FoldSeek; examples with AlphaFold-predicted structures were highlighted.
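The network-construction step above reduces to thresholding pairwise HHsearch hits and linking the survivors; a minimal union-find sketch with toy hits in place of real profile comparisons (thresholds taken from the text, everything else illustrative):

```python
def count_components(nodes, hits, min_len=20, min_prob=70.0):
    """Link two profiles when an HHsearch hit spans >= min_len aligned
    residues with >= min_prob probability; return the component count."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, aln_len, prob in hits:
        if aln_len >= min_len and prob >= min_prob:
            parent[find(a)] = find(b)  # union the two components
    return len({find(n) for n in nodes})

# Toy data: the A-B hit qualifies; the C-D hit is too short to count.
nodes = ["A", "B", "C", "D"]
hits = [("A", "B", 45, 98.0), ("C", "D", 12, 95.0)]
n_components = count_components(nodes, hits)
```

In the study, the same thresholding over all SCOPe and ProtGPT2 profile pairs yields the 59,612-node network analyzed for connectivity.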

Key Findings

  • Sampling and propensities: Random sampling with top-k=950 and repetition penalty=1.2 yielded amino acid propensities matching natural proteins; greedy/beam search produced repetitive/degenerate outputs.
  • Disorder and secondary structure: IUPred3 globular domain detection: ProtGPT2 87.59% vs natural 88.40%. Ordered residue content: ProtGPT2 82.59% vs natural 79.71%. PSIPRED secondary structure (natural vs ProtGPT2): alpha-helix 45.19% vs 48.64%; beta-sheet 41.87% vs 39.70%; coil 12.93% vs 11.66% (n=10,000 per dataset).
  • Homology and relatedness: HHblits against Uniclust30 showed 96.2% of natural and 93.0% of ProtGPT2 sequences had best alignments above the HSSP curve; 93% of random sequences fell below. ProtGPT2 distributions showed fewer high-identity long alignments than natural sequences, indicating sequences are related yet distant and not products of memorization.
  • Structure prediction (AlphaFold): ProtGPT2 mean pLDDT 63.2 (best of 5 per sequence) and 59.6 (average over 5); 37% with pLDDT>70. Natural: 66% with pLDDT>70; mean 75.3. Random: mean 44; 7.4%>70. These indicate many ProtGPT2 sequences are predicted ordered/globular.
  • Energetics (Rosetta Relax): Average total energy per residue (REU/res): ProtGPT2 −1.73, natural −1.90, random 0.13. The negative (favorable) averages for ProtGPT2 and natural structures fall within the range expected for relaxed, well-packed models, whereas the random set does not.
  • Dynamics (MD): Average RMSD per trajectory: natural 2.93 Å, ProtGPT2 3.12 Å, random 9.41 Å; ProtGPT2 vs natural not significantly different (Mann-Whitney U-test p=0.39), suggesting comparable flexibility.
  • Protein space networks: Network of 59,612 nodes and 427,378 edges (1,847 components) showed a major component with 30,690 nodes; inclusion of ProtGPT2 increased connectivity and bridged previously separate islands, indicating exploration of dark regions.
  • Novel and complex folds: ProtGPT2 generated diverse folds (all-β, α/β, α+β, membrane, all-α), including a topology (example 4266) not matched in PDB (DALI Z=5.4, RMSD 3.0 Å over 67 aa, 9% identity to 5B48). Generated structures are non-idealized, with longer loops and complex surfaces akin to natural proteins.
  • Functional hotspot preservation: In two cases, ProtGPT2 sequences preserved multiple ligand-binding side-chain positions relative to distantly related natural proteins (e.g., FAD-binding site in 1X0P; C2E-binding site in 5MIT) despite low sequence identity (~29–31%), indicating retention of functional determinants in zero-shot generation.
  • Practicality: ProtGPT2 can generate sequences in seconds on standard workstations; model and datasets are publicly available for generation or finetuning.
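The dynamics comparison above rests on a Mann-Whitney U test; the statistic itself is just a count of pairwise wins. A toy sketch (the RMSD values are invented for illustration, and the p-value computation is left to a statistics library):

```python
def mann_whitney_u(xs, ys):
    """U statistic for sample xs: pairs (x, y) with x > y count 1,
    exact ties count 0.5."""
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

# Invented RMSD values (angstroms); heavily overlapping distributions
# give a U near n*m/2, consistent with a non-significant difference.
natural = [2.5, 3.0, 3.2]
generated = [2.8, 3.1, 3.3]
u_nat = mann_whitney_u(natural, generated)
```

A U far from n*m/2 in either direction would indicate that one sample's RMSD values systematically exceed the other's, which is what the p=0.39 result rules out for ProtGPT2 versus natural proteins.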

Discussion

The findings demonstrate that an autoregressive Transformer trained solely on unannotated protein sequences can learn sequence regularities that yield de novo proteins with natural-like amino acid usage, disorder/secondary structure content, and predicted structural order. ProtGPT2-generated sequences are broadly related to natural proteins yet maintain evolutionary distance, mitigating concerns about simple memorization. Structural predictions, energetics, and MD analyses suggest many generated sequences are well-folded, stable, and exhibit dynamics comparable to natural proteins. Network analyses show that ProtGPT2 expands connectivity in protein space, bridging islands and sampling dark regions, and can produce complex, non-idealized folds, including membrane proteins and challenging all-β architectures. Preservation of ligand-binding hotspots indicates that functional features can emerge without explicit supervision, supporting the model’s utility for augmenting existing folds and potentially enabling novel functions. These results address the core research goals by establishing that large autoregressive language models can generate fit, structurally ordered, and functionally suggestive protein sequences while exploring regions underrepresented in current databases, advancing high-throughput protein design.

Conclusion

ProtGPT2 is a 738M-parameter autoregressive language model that effectively learns the protein language from UniRef50 and generates de novo protein sequences with natural-like properties. Computational analyses indicate high rates of predicted globularity, plausible folding (AlphaFold pLDDT), reasonable energetics (Rosetta Relax), natural-like dynamics (MD), and exploration of previously under-sampled regions of protein space with complex, non-idealized folds. Evidence of preserved ligand-binding hotspots suggests potential functional relevance. The model enables rapid sequence generation and is available for direct use or finetuning to condition on specific families, folds, or functions. Future work includes adding conditional tags for controlled generation, large-scale experimental validation of foldability and function, and targeted finetuning to engineer specific activities and novel topologies.

Limitations

The study relies on computational proxies: predicted structures (AlphaFold pLDDT), energy functions (Rosetta REF2015), and short-timescale MD for a limited subset; experimental validation of folding, stability, and function was not performed. Homology assessments depend on current sequence databases and may overlook relationships in uncharted regions. Decoding choices influence sequence quality, and while amino acid propensities match natural distributions, broader biochemical property distributions were not exhaustively analyzed. The MD set was small (12 sequences per dataset), and network-based novelty is contingent on existing structural annotations. Model training used UniRef50 clustering, which may bias generalization versus UR100-scale diversity.
