Introduction
Identifying evolutionarily conserved functions between proteins is crucial in biotechnology. The standard approach compares sequences, which works well when sequence similarity is high (>25%). However, structural homology persists across longer evolutionary timescales, and many proteins have no detectable sequence homologs in standard databases because their evolutionary relationships are too distant. Metagenomics studies indicate that structural homology detection could significantly improve protein function annotation, so methods are needed that identify structurally similar proteins despite low sequence similarity. Existing structural alignment tools are computationally expensive, which hinders their application to large-scale databases. This research addresses the challenge by developing scalable tools for structure-aware search and alignment that operate on protein sequences.
Literature Review
Traditional sequence homology detection methods, such as BLAST and HMMER, are effective for proteins with high sequence similarity. However, they struggle with remotely homologous proteins, which show low sequence similarity but significant structural similarity. Structural alignment tools such as TM-align, Dali, FAST, and Mammoth align proteins by their three-dimensional structures and measure structural similarity. These methods are powerful, but they are computationally expensive and require protein structures, which are unavailable for most proteins. The rapid growth of protein sequence databases therefore calls for scalable structure-aware search and alignment tools that can handle massive datasets.
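As a point of reference for the structural similarity measure used throughout this summary, the TM-score of a given superposition can be computed from the distances between aligned residue pairs. The sketch below follows the commonly cited TM-score definition rather than anything stated in this summary; the function name, the lower bound of 0.5 on the distance scale, and the toy inputs are assumptions made for illustration. The full TM-score is the maximum of this quantity over all superpositions, a search that is not shown here.

```python
def tm_score(distances, target_length):
    """TM-score of one fixed superposition (illustrative sketch).

    distances: distances in angstroms between aligned residue pairs.
    target_length: number of residues in the protein used for normalization.
    """
    if target_length <= 15:
        raise ValueError("the distance scale d0 is only defined for lengths above 15")
    # Length-dependent distance scale from the standard TM-score definition.
    d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    # Assumption: clamp d0 from below so very short proteins do not produce
    # a tiny or negative scale.
    d0 = max(d0, 0.5)
    # Each aligned pair contributes between 0 and 1; unaligned residues
    # contribute nothing, so partial alignments lower the score.
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return total / target_length


# A perfect superposition of all 100 residues, and one covering only half of them.
perfect = tm_score([0.0] * 100, 100)
half_covered = tm_score([0.0] * 50, 100)
```

Because every term is divided by the target length, the score stays between 0 and 1 regardless of protein size, which is what makes it usable as a regression target for sequence-based predictors.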
Methodology
The authors developed two deep learning tools, TM-Vec and DeepBLAST. TM-Vec is a twin neural network that maps each protein sequence to a vector representation and predicts the TM-score (a measure of structural similarity) of a sequence pair directly from the sequences. Trained on millions of protein pairs with known structures, it allows large sequence databases to be indexed and queried efficiently for structurally similar proteins. DeepBLAST combines protein language model embeddings with a differentiable Needleman-Wunsch algorithm to predict structural alignments between protein sequences, leveraging the structural information implicitly captured in the embeddings. It is trained on pairs of proteins with known sequences and structures to reproduce the structural alignments generated by TM-align; because the Needleman-Wunsch step is differentiable, the model can be trained efficiently through backpropagation. The authors benchmarked TM-Vec and DeepBLAST against state-of-the-art sequence and structure alignment methods on various datasets, including SWISS-MODEL, CATH, Malidup, Malisam, and a bacteriocin dataset.
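To make the idea of a differentiable Needleman-Wunsch algorithm concrete, the sketch below contrasts the classic dynamic program, which takes a hard maximum at every cell, with a smoothed variant that replaces the maximum with a temperature-scaled log-sum-exp. This is a generic illustration of the smoothing technique, not DeepBLAST's implementation: the function names, the single linear gap penalty, and the fixed toy scores are assumptions, whereas in DeepBLAST the scores are predicted from protein language model embeddings.

```python
import math


def soft_max(values, temperature):
    """Smooth upper bound on max(values), computed stably via log-sum-exp."""
    largest = max(values)
    shifted = sum(math.exp((v - largest) / temperature) for v in values)
    return largest + temperature * math.log(shifted)


def needleman_wunsch_score(match_scores, gap_penalty, temperature=None):
    """Global alignment score by dynamic programming.

    match_scores[i][j]: score for aligning residue i of X with residue j of Y.
    gap_penalty: positive cost subtracted for each gap position.
    temperature: None gives the classic hard maximum; a positive value gives
    the smoothed recursion, which is differentiable in the scores.
    """
    n, m = len(match_scores), len(match_scores[0])
    if temperature is None:
        combine = max
    else:
        combine = lambda values: soft_max(values, temperature)

    table = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = -gap_penalty * i  # prefix of X aligned to gaps only
    for j in range(1, m + 1):
        table[0][j] = -gap_penalty * j  # prefix of Y aligned to gaps only

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = combine([
                table[i - 1][j - 1] + match_scores[i - 1][j - 1],  # match or mismatch
                table[i - 1][j] - gap_penalty,                     # gap in Y
                table[i][j - 1] - gap_penalty,                     # gap in X
            ])
    return table[n][m]


# Toy scores for two length-2 sequences: +1 on the diagonal, -1 elsewhere.
toy_scores = [[1.0, -1.0], [-1.0, 1.0]]
hard = needleman_wunsch_score(toy_scores, gap_penalty=1.0)
soft = needleman_wunsch_score(toy_scores, gap_penalty=1.0, temperature=1.0)
```

The smoothed score is never below the hard score and approaches it as the temperature goes to zero. Because it varies smoothly with every entry of `match_scores`, gradients can flow from an alignment loss back into whatever model produced those scores.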
Key Findings
TM-Vec demonstrated high accuracy in predicting TM-scores, even for protein pairs with very low sequence identity (<0.1%). Its predictions correlated strongly with TM-align (r=0.97), and it outperformed traditional sequence alignment methods in resolving structural differences below 25% sequence identity. TM-Vec's performance was validated on the SWISS-MODEL and CATH databases, where it proved robust to out-of-distribution observations. Visualization of TM-Vec embeddings revealed that they capture latent structural features, separating structural categories more clearly than sequence-based embeddings. In benchmarks on CATH, TM-Vec outperformed existing methods at retrieving proteins with the same fold and at classifying proteins based on structural features. DeepBLAST outperformed all tested sequence alignment methods on the Malidup and Malisam benchmarks and achieved performance comparable to some structure-based methods. In one example, DeepBLAST accurately aligned two duplicated Annexin domains with 24.7% sequence identity, while Needleman-Wunsch failed. TM-Vec's runtime scaled sublinearly with database size, significantly outperforming BLAST in speed while maintaining high accuracy. A case study on bacteriocins demonstrated TM-Vec's ability to accurately classify and annotate bacteriocins, even those with low sequence similarity. In distinguishing bacteriocin classes, TM-Vec outperformed AlphaFold2, OmegaFold, and ESMFold combined with TM-align.
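The retrieval results above rest on comparing a query vector against stored protein vectors. The sketch below shows the simplest form of that search, assuming the similarity between two embeddings is their cosine similarity; this summary does not state which similarity function TM-Vec uses. The function names and the three-dimensional toy vectors are made up for illustration, and the exhaustive scan shown here is linear in database size, so the sublinear scaling reported above would require an indexing structure that is not shown.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_neighbors(query, database, k=3):
    """Return the k database entries most similar to the query.

    database: dict mapping a protein name to its embedding vector.
    Returns a list of (name, similarity) pairs, most similar first.
    """
    scored = [(name, cosine_similarity(query, vector))
              for name, vector in database.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for learned protein embeddings (made-up values).
database = {
    "protein_a": [1.0, 0.0, 0.0],
    "protein_b": [0.9, 0.1, 0.0],
    "protein_c": [0.0, 1.0, 0.0],
}
hits = top_k_neighbors([1.0, 0.2, 0.0], database, k=2)
```

Since each protein is encoded once and then compared with cheap vector arithmetic, the cost of a query no longer depends on running a pairwise alignment against every database entry.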
Discussion
The results demonstrate the effectiveness of TM-Vec and DeepBLAST in addressing the challenges of remote homology detection and structural alignment. TM-Vec's ability to accurately predict TM-scores from sequences enables fast and sensitive structure-aware searches in large databases. DeepBLAST's ability to produce accurate structural alignments using only sequence information provides a valuable tool for analyzing proteins with unknown structures. The application to bacteriocins highlights the potential of these methods in natural product discovery. The scalability of TM-Vec opens up new possibilities for analyzing large-scale metagenomics datasets.
Conclusion
TM-Vec and DeepBLAST offer significant advancements in protein homology detection and structural alignment. TM-Vec enables efficient, large-scale structural similarity searches, while DeepBLAST provides accurate structural alignments using only sequence data. Promising applications include protein annotation, function prediction, and drug discovery. Future research could focus on improving TM-Vec's ability to detect local structural similarities and DeepBLAST's handling of insertions and deletions. Integrating the two tools into a unified framework may further enhance their performance.
Limitations
TM-Vec is less effective in detecting structural differences caused by point mutations. DeepBLAST struggles with large insertions or deletions, often found in remote homologs. While TM-Vec is highly scalable, encoding speed remains a computational bottleneck for extremely large databases. The accuracy of DeepBLAST is limited by the quality of the training data, specifically the manual curation efforts in gold standard datasets like Malidup and Malisam. The comparison to state-of-the-art structure prediction methods was limited to the bacteriocin dataset and may not generalize to all protein families.