Biology
Protein remote homology detection and structural alignment using deep learning
T. Hamamsy, J. T. Morton, et al.
Discover TM-Vec and DeepBLAST, two deep learning methods developed by Tymor Hamamsy and colleagues that advance remote protein homology detection. Using sequence information alone, they improve the identification of remote homologs and produce structure-aware alignments that outperform traditional sequence-based techniques.
Introduction
The study addresses the challenge of detecting remote protein homology and aligning proteins with low sequence similarity, where traditional sequence-based methods (effective above ~25% sequence identity) fail. Structural homology often persists across long evolutionary timescales and can enable improved functional annotation, design, and evolutionary insights. However, most proteins lack experimentally determined structures, and while predictors like AlphaFold2 have advanced coverage, they struggle with short sequences and are computationally intensive at repository scale. Existing structural alignment tools (TM-align, Dali, FAST, Mammoth) are accurate but require solved structures and are too computationally costly for large-scale searches. The authors propose a hybrid, scalable, structure-aware sequence-based framework: TM-Vec for rapid retrieval of structurally similar proteins directly from sequences by predicting TM-scores, and DeepBLAST for producing structure-aware alignments from sequences via a differentiable dynamic programming approach. The importance lies in enabling structure-informed annotation and homology detection at repository scale, closing the sequence-structure-function gap in massive sequence collections.
Literature Review
Prior approaches include classical sequence homology and alignment tools such as BLAST, Needleman–Wunsch, Smith–Waterman, HMMER, DIAMOND, HHblits, and MMseqs2, which perform well for closely related sequences but lose sensitivity below ~25% identity. Structure-based alignment tools (TM-align, Dali, FAST, Mammoth) provide robust structural similarity measures when structures are available. Recent advances include structure search tools like FoldSeek and embedding-based methods (ProtTucker/EAT) that learn structure-aware embeddings from domains, as well as large protein language models (ProtTrans, ESM) that capture sequence-structure-function signals. Despite these advances, there remained a need for tools that: (1) explicitly predict structural similarity directly from sequence for scalable search over large sequence databases; and (2) generate structural alignments using sequence alone, leveraging differentiable dynamic programming and language-model embeddings.
Methodology
TM-Vec: structure-aware retrieval via TM-score prediction from sequences.
- Architecture: Twin (Siamese) neural network operating on residue embeddings from a frozen pretrained protein language model (ProtTrans ProtT5-XL-UniRef50). For each sequence, per-residue embeddings (dim=1024) are passed through φ: transformer encoder layers, average pooling, dropout, and fully connected layers to produce a 512-d vector representation (z). Cosine similarity between the two sequence vectors approximates the structural TM-score. (A minimal PyTorch sketch of this encoder follows the list.)
- Training objective: Minimize L1 distance between predicted cosine similarity and ground-truth TM-score from TM-align for training pairs.
- Data for training: Two protein-chain-pair datasets from SWISS-MODEL (up to 300 aa, 277k unique chains; up to 1000 aa), generating ~141M training/validation pairs (300 aa model) and 320M pairs (1000 aa model); held-out test: 1M pairs (300 aa model). CATH NR-S40 domain pairs (~23M pairs) constructed with undersampling of different folds, with held-out sets for left-out pairs (100k), left-out domains (100k), and left-out folds (500k). TM-align provides ground-truth TM-scores.
- Indexing and search: Encode large sequence repositories (e.g., Swiss-Prot, CATH, UniRef50) into 512-d embeddings; build a Faiss cosine-similarity index. Query: encode the query via TM-Vec and perform approximate nearest-neighbor search. Retrieval is sublinear (approximately O(log n) in the number of indexed proteins), consistent with the scaling observed in practice. (A minimal Faiss indexing sketch also follows the list.)
- Training details: Models with 2 or 4 transformer encoder layers (17.3M or 34.1M parameters; sizes 199MB/391MB). ProtTrans frozen. Adam optimizer, LR 1e-5, batch size 32. Example: SWISS-MODEL up-to-300 aa model trained for 5 epochs on 8×V100 GPUs over 5 days.
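To make the twin-network design concrete, below is a minimal PyTorch sketch of the φ encoder and the L1 training objective, assuming per-residue embeddings from the frozen language model are already computed. It is an illustration, not the authors' code: the layer counts, head count, and names (TMVecEncoder, tmvec_loss) are assumptions, and random tensors stand in for ProtTrans embeddings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM = 1024   # per-residue embedding size from the frozen language model
VEC_DIM = 512    # fixed-length protein vector used for search

class TMVecEncoder(nn.Module):
    """phi: transformer encoder layers -> average pooling -> dropout -> FC."""
    def __init__(self, n_layers: int = 2, n_heads: int = 4, dropout: float = 0.1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=EMB_DIM, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(EMB_DIM, VEC_DIM)

    def forward(self, residue_emb: torch.Tensor) -> torch.Tensor:
        # residue_emb: (batch, length, EMB_DIM) from the frozen ProtTrans model
        h = self.encoder(residue_emb)
        return self.fc(self.dropout(h.mean(dim=1)))  # average-pool over residues

def tmvec_loss(enc, emb_a, emb_b, tm_score):
    """L1 distance between predicted cosine similarity and the TM-align TM-score."""
    pred = F.cosine_similarity(enc(emb_a), enc(emb_b), dim=-1)
    return (pred - tm_score).abs().mean()

# Toy usage with random stand-ins for the embeddings of two length-120 chains:
enc = TMVecEncoder()
a, b = torch.randn(1, 120, EMB_DIM), torch.randn(1, 120, EMB_DIM)
loss = tmvec_loss(enc, a, b, torch.tensor([0.72]))
```

Because both sequences pass through the same encoder, each database protein needs to be encoded only once; at query time, similarity reduces to a cosine between 512-d vectors.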
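The indexing-and-search step might look like the following Faiss sketch, again with assumed names and random vectors standing in for real TM-Vec embeddings. The flat inner-product index shown here is exact; the sublinear behaviour noted above comes from Faiss's approximate index types (e.g., IVF or HNSW).

```python
import faiss
import numpy as np

rng = np.random.default_rng(0)
db_vecs = rng.standard_normal((100_000, 512)).astype("float32")  # database embeddings
queries = rng.standard_normal((50, 512)).astype("float32")       # query embeddings

# L2-normalise so that inner product equals cosine similarity
faiss.normalize_L2(db_vecs)
faiss.normalize_L2(queries)

index = faiss.IndexFlatIP(512)   # exact inner-product index over 512-d vectors
index.add(db_vecs)

scores, ids = index.search(queries, 10)  # top-10 structural neighbours per query
```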
DeepBLAST: structure-aware alignment from sequences via differentiable Needleman–Wunsch.
- Inputs: Two sequences X and Y; obtain residue embeddings HX, HY from ProtTrans (frozen). Learn match scoring μ and gap scoring g via neural mappings M and G (eight CNN layers, dim=1024) applied to residue embeddings.
- Differentiable dynamic programming: Replace non-differentiable max/argmax with smooth log-sum-exp (softmax) operators to enable backprop through alignment scores and traceback. Implement GPU-accelerated differentiable Needleman–Wunsch enabling batching and parallelism. Loss function: cross-entropy between predicted traceback (probabilistic alignment matrix) and ground-truth alignment traceback. (A minimal sketch of the smoothed recursion follows the list.)
- Training data: ~5M structure-derived alignments from TM-align on curated ~40k PDB structures; exclude alignments with >10 consecutive gaps or TM-score<0.6. Training: 20 epochs on 24×A100 GPUs over 6 days; Adam with LR 5e-5; batch size 360; >1.2B parameters in the CNN components; ProtTrans frozen.
- Alignment accuracy assessment: Held-out test set of 1M structural alignments; median precision and recall ~87% for correctly aligned residue pairs.
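A minimal sketch of the smoothed dynamic program may help: replacing max(a, b, c) with the log-sum-exp operator τ·log(e^(a/τ) + e^(b/τ) + e^(c/τ)) makes the alignment score differentiable, so gradients can flow back into the match and gap scorers. The plain-Python recursion below illustrates that idea; it is not the authors' batched GPU kernel, and a single learned scalar `gap` stands in for the CNN-derived gap scores.

```python
import torch

def smooth_nw(match: torch.Tensor, gap: torch.Tensor, tau: float = 1.0):
    """Differentiable Needleman-Wunsch score for one pair of sequences.

    match: (n, m) tensor of learned match scores mu(x_i, y_j)
    gap:   scalar tensor of learned gap score g
    """
    n, m = match.shape
    prev = [gap * j for j in range(m + 1)]       # V[0, j] = j * g
    for i in range(1, n + 1):
        cur = [gap * i]                          # V[i, 0] = i * g
        for j in range(1, m + 1):
            choices = torch.stack([
                prev[j - 1] + match[i - 1, j - 1],  # align x_i with y_j
                prev[j] + gap,                      # gap in y
                cur[j - 1] + gap,                   # gap in x
            ])
            # soft max: tau * logsumexp(. / tau) approaches hard max as tau -> 0
            cur.append(tau * torch.logsumexp(choices / tau, dim=0))
        prev = cur
    return prev[m]

# Toy usage: gradients of the smoothed score w.r.t. `match` yield a soft
# alignment matrix, which the cross-entropy traceback loss is trained against.
match = torch.randn(5, 7, requires_grad=True)
gap = torch.tensor(-2.0)
score = smooth_nw(match, gap)
score.backward()
print(match.grad.shape)  # (5, 7): expected match posteriors
```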
Benchmarks and datasets:
- SWISS-MODEL and CATH (CATHS40, CATHS100; NR-S40) for TM-score prediction and representation quality; visualization via t-SNE; classification via adjusted mutual information and triplet-scoring AUPR. Retrieval comparisons vs FoldSeek, MMseqs2, HHblits, DIAMOND, ProtTucker/EAT.
- Malidup and Malisam curated structural alignment benchmarks for alignment assessment (F1 scores) vs sequence-based (BLAST, HMMER, Needleman–Wunsch, Smith–Waterman) and structure-based (FAST, TM-align, Dali, Mammoth-local) methods.
- Microbiome Immunity Project (MIP) 200k predicted structures (148 putative folds) for generalization to novel folds.
- DIAMOND benchmark: UniRef50 lookup (~7.74M representatives with SCOP family annotations), 1.71M query proteins, assessing family-level sensitivity among top-k nearest neighbors. (A sketch of this sensitivity computation follows the list.)
- BAGEL bacteriocin dataset: class and subclass clustering, comparison vs structure-prediction-plus-TM-align pipelines (AlphaFold2/ColabFold, OmegaFold, ESMFold).
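To illustrate the family-level sensitivity metric referenced in the list above, the sketch below counts a query as a hit if any of its top-k nearest neighbours shares its SCOP family. Variable names are assumptions for the example, not the benchmark's code.

```python
import numpy as np

def topk_family_sensitivity(neighbor_ids: np.ndarray,
                            query_families: list,
                            db_families: list,
                            k: int) -> float:
    """Fraction of queries with at least one same-family hit in their top k."""
    hits = sum(
        any(db_families[n] == query_families[q] for n in row[:k])
        for q, row in enumerate(neighbor_ids)
    )
    return hits / len(neighbor_ids)

# Toy usage with 3 queries and their top-2 neighbour ids from the index:
nbrs = np.array([[0, 1], [2, 0], [1, 2]])
print(topk_family_sensitivity(nbrs, ["a", "b", "c"], ["a", "b", "c"], k=2))
```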
Key Findings
- TM-score prediction accuracy (TM-Vec):
• SWISS-MODEL held-out 1.01M pairs: low error (≈0.025) across the full range of sequence identities; median error 0.005 at >90% identity; resolves similarity even below 10% identity (median error 0.026). Correlation with TM-align TM-scores r=0.97 (P<1×10^-5).
• CATH held-out: pairs r=0.936, median error 0.023; domains r=0.901, median error 0.023; held-out folds r=0.781, median error 0.042 (all P<1×10^-5). Errors were highest in the TM-score range 0.75–1.0, and accuracy was reduced on unseen folds, but generalization remained robust.
• MIP novel folds: correlation r=0.785 (P<1×10^-3). For pairs where both proteins have putative folds: TPR 99.9% for same fold (TM-score≥0.5) and FPR 3.9%.
- Representation quality and retrieval:
• TM-Vec embeddings better separate CATH structure classes across tiers vs raw ProtTrans embeddings (t-SNE). At topology level, macro AUPR=0.94 vs GRAFENE 0.79 and ProtTrans 0.66.
• Retrieval of same fold (topology): CATHS100 97% accuracy; CATHS40 88.1% accuracy. On ProtTucker test (219 domains), TM-Vec retrieved homology with 81% accuracy vs ProtTucker/EAT 78% and FoldSeek 77%.
• On CATHS20 homology retrieval, TM-Vec 88% vs FoldSeek 85%, ProtTucker 71%, HHblits 49%. A TM-Vec model trained on SWISS-MODEL chains achieved 71% on this CATH domain benchmark.
- Alignment benchmarks (Malidup/Malisam):
• DeepBLAST outperformed all tested sequence-only methods; F1 (mean±s.e.): Malidup 0.265±0.020; Malisam 0.066±0.009. Sequence baselines: Needleman–Wunsch 0.098±0.010 (Malidup), 0.025±0.003 (Malisam); Smith–Waterman 0.114±0.010, 0.031±0.003; BLAST and HMMER failed to detect most alignments. Structure-based upper bounds: Dali 0.791±0.014 (Malidup), 0.619±0.029 (Malisam); TM-align 0.576±0.024, 0.393±0.031; FAST 0.569±0.026, 0.300±0.030; Mammoth-local 0.483±0.020, 0.187±0.017.
• On Malidup, Spearman correlations of predicted TM-scores: DeepBLAST vs TM-align r_s=0.81; TM-Vec vs TM-align r_s=0.66; DeepBLAST vs TM-Vec r_s=0.75. Example: duplicated Annexin domains (24.7% identity) aligned by DeepBLAST with TM-score 0.81 vs Needleman–Wunsch 0.33.
- Runtime and scalability (TM-Vec + DeepBLAST):
• Encoding 50,000 queries takes ~40 min on one GPU; vector search is sublinear, with 50,000 queries against 5M proteins completing in ~20 s, so encoding dominates total runtime.
• TM-Vec is faster than BLAST across tested scales; e.g., 1,000 queries against a 100k-protein database yield a ~10× speedup, and against a 1M-protein database ~100×; DIAMOND remains faster, but TM-Vec has higher remote homology sensitivity than BLAST with sublinear scaling.
• DeepBLAST GPU implementation ~10× faster than CPU; batch-parallel; runtime scales linearly with sequence lengths; GPU runtime does not grow linearly with batch size due to parallelism.
- DIAMOND benchmark (family-level sensitivity):
• All proteins up to 1000 aa: top-1 nearest neighbor shares family 92.1%; top-50 sensitivity 96.9%.
• Multiple-domain proteins: top-1 86.2% (≤600 aa) and 82.6% (≤1000 aa); top-50 94.6% (≤1000 aa).
- Bacteriocin case study:
• TM-Vec embeddings clustered bacteriocins by class and subclass; 94% of annotated bacteriocins had nearest neighbor in same class. TM-Vec better distinguished class/subclass relationships than pipelines using predicted structures (AlphaFold2/ColabFold, OmegaFold, ESMFold) plus TM-align, likely due to difficulties predicting short peptides (<50 aa). For PDB-available bacteriocins, AlphaFold2 predictions sometimes had TM-scores <0.5 vs ground truth.
• A k-NN classifier for bacteriocins vs nontoxins achieved precision 98% and recall 93%. (A minimal classifier sketch follows.)
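A minimal sketch of such a k-NN classifier over TM-Vec embeddings, with random vectors standing in for real embeddings and an assumed k of 5 (the paper's exact setup may differ):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 512)).astype("float32")  # stand-in TM-Vec vectors
y = rng.integers(0, 2, size=1000)                       # 1 = bacteriocin, 0 = nontoxin

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = KNeighborsClassifier(n_neighbors=5, metric="cosine").fit(X_tr, y_tr)
pred = clf.predict(X_te)
print(precision_score(y_te, pred), recall_score(y_te, pred))
```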
Discussion
The combined TM-Vec and DeepBLAST framework addresses remote homology detection by predicting structural similarity and inferring structural alignments directly from sequences, thus bypassing the need for known or accurately predicted structures. TM-Vec effectively encodes structure-aware representations enabling rapid, scalable nearest-neighbor search with strong correlation to structure-based TM-scores, robust generalization to unseen folds (CATH held-out folds; MIP novel folds), and superior or competitive retrieval and classification performance against existing sequence- and structure-based baselines. DeepBLAST complements TM-Vec by producing sequence-only structural alignments with accuracy approaching structure-based methods on challenging low-identity benchmarks, substantially outperforming classical sequence aligners in the midnight zone. Together, they enable a retrieval-then-align pipeline that scales to modern repositories and improves sensitivity in difficult homology regimes. The bacteriocin case study demonstrates practical benefits where structure prediction struggles (very short peptides): TM-Vec recovers biologically meaningful class and subclass stratification and supports annotation of putative bacteriocins, suggesting broader applicability in functional annotation and natural product discovery. These results collectively indicate that structure-aware sequence embeddings plus differentiable alignment can close key gaps between sequence, structure, and function at scale.
Conclusion
This work introduces TM-Vec, a twin-network, structure-aware embedding and TM-score prediction model for scalable structural similarity search over sequence databases, and DeepBLAST, a GPU-accelerated differentiable Needleman–Wunsch aligner that produces structural alignments from sequences. TM-Vec achieves high correlation with structure-derived TM-scores, robust generalization to novel folds, competitive retrieval/classification performance across CATH tiers, and repository-scale runtimes surpassing BLAST with sublinear scaling. DeepBLAST consistently outperforms sequence-only baselines on difficult structural alignment benchmarks and approximates structure-based methods. Case studies, including bacteriocins, highlight improved class/subclass discrimination over structure-prediction-plus-alignment pipelines, especially for short sequences. Future directions include: accelerating encoding via massively parallel GPU inference; improving sensitivity to local similarities by training TM-Vec with local structural objectives; enhancing DeepBLAST indel handling using linear affine gap costs in differentiable DP; and integrating TM-score prediction and alignment into a multitask framework sharing a single pretrained language model to further boost accuracy.
Limitations
- TM-Vec is optimized for predicting global structural similarity (TM-score) and is not well-suited for detecting subtle structural effects of point mutations (e.g., VIPUR variants, both deleterious and synonymous), nor for tasks requiring primarily local similarity (e.g., certain family-level retrievals), where DIAMOND performed better in some settings.
- Generalization to entirely unseen folds degrades compared to seen folds (higher error in TM-score range 0.75–1.0), though still acceptable for remote homology discovery.
- DeepBLAST struggles with large insertions/deletions common among remote homologs, reflecting limitations of the training data and current DP formulation; incorporating linear affine gap costs into differentiable DP may improve performance.
- Runtime bottleneck for TM-Vec lies in encoding (not search); scaling to billions of proteins will require faster encoders and large-scale multi-GPU or high-memory CPU infrastructures for indexing/search (e.g., Faiss).