
Multilingual translation for zero-shot bio-medical classification using BioTranslator
H. Xu, A. Woicik, et al.
BioTranslator is a multilingual translation method developed by Hanwen Xu, Addie Woicik, Hoifung Poon, Russ B. Altman, and Sheng Wang. It lets scientists move beyond controlled vocabularies by translating text descriptions of new biological concepts into non-textual biological data instances, which supports identifying novel cell types, predicting protein functions, and locating drug targets.
Introduction
High-throughput biological techniques generate massive omics datasets, constantly revealing new aspects of biological systems. A crucial initial step in analyzing these datasets is annotation, where each data instance (e.g., a cell) is categorized into a term (e.g., cell type) from a predefined controlled vocabulary (CV). CVs are essential for integrating data from different sources and facilitating data retrieval. However, the CV-based approach presents a significant limitation: it cannot readily handle new discoveries that don't fit into existing CV categories. Creating new CV entries is time-consuming and requires considerable domain expertise to ensure semantic distinction from existing terms. This limitation hinders the exploration of novel biological phenomena. To overcome this challenge, the authors propose BioTranslator, a novel method that uses multilingual translation to link textual descriptions to non-textual biological data. Unlike traditional similarity-based approaches that rely on matching text to existing CVs, BioTranslator translates free-text descriptions directly into biological data representations (e.g., gene expression vectors). This approach mirrors the shift from the Yahoo directory (hierarchical category-based search) to Google search (free-text-based search). BioTranslator's core innovation lies in its multilingual translation framework, which integrates multiple biomedical data modalities (text, gene expression, protein sequences, etc.) into a shared latent space. This contrasts with previous bilingual approaches that only translate between two modalities. The method leverages large-scale pretrained language models, fine-tuned using biomedical ontologies, to achieve this multilingual representation learning. The framework allows for zero-shot classification, where new classes can be identified using only textual descriptions without needing any annotated training instances for those specific classes. 
The applications demonstrated span several areas of biological research, including cell type classification, protein function prediction, and drug target identification.
Literature Review
The paper reviews existing annotation paradigms based on controlled vocabularies (CVs) and highlights their limitations in handling novel biological findings. It discusses the challenges associated with creating new CV entries—the time and domain expertise required, and the need to avoid semantic overlap with existing terms. The authors then contrast their multilingual approach with existing text-based similarity methods, which merely map text descriptions to existing CVs, and with bilingual cross-modal learning approaches that only translate between two modalities. The existing literature on cross-modal learning, such as image caption generation and protein sequence-text translation (ProTranslator), is reviewed to highlight the novelty of BioTranslator's multilingual framework. The use of large-scale pretrained language models and contrastive learning for biomedical data representation is also contextualized within the existing literature.
Methodology
BioTranslator employs a multilingual translation framework to bridge text and various non-textual biological data modalities. Training begins with the construction of paired data, in which textual descriptions obtained from biomedical ontologies are linked to corresponding non-textual data instances (e.g., gene expression vectors for cell types, protein sequences for protein functions). This paired data plays the role that parallel corpora play in machine translation. The text encoder is a large-scale pretrained language model (PubMedBERT) fine-tuned on existing biomedical ontologies from diverse domains (225 in total, containing 2,010,648 textual descriptions) to improve the quality of the textual embeddings. Fine-tuning uses contrastive learning: embeddings of neighboring nodes in the ontologies are treated as positive pairs and all other pairs as negative pairs, the cosine similarity of each pair is computed, and cross-entropy is used as the loss function, optimized with Adam at a learning rate of 1 × 10⁻⁵. Each type of biological data (gene expression vectors, protein sequences, etc.) is encoded by a separate deep neural network, and the resulting embeddings are concatenated into a combined feature vector. The model then learns a mapping between textual descriptions and these combined feature vectors by minimizing a cross-entropy loss, again with the Adam optimizer, so that textual descriptions and biological data instances are placed in a shared embedding space. At test time, BioTranslator embeds a new class from its textual description and identifies nearby biological data instances in the shared embedding space, thereby performing zero-shot classification.
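The contrastive fine-tuning objective described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes one plausible reading in which the cosine similarity of a pair is rescaled from [-1, 1] to a probability and scored with binary cross-entropy against a neighbor / non-neighbor label. The function names, the rescaling, and the toy embeddings are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pair_loss(u, v, is_neighbor, eps=1e-7):
    """Binary cross-entropy on a rescaled cosine similarity.

    Assumption: similarity in [-1, 1] is mapped linearly to (0, 1);
    the source only states that cosine similarity and cross-entropy are used.
    """
    p = (cosine(u, v) + 1.0) / 2.0
    p = min(max(p, eps), 1.0 - eps)  # keep log() finite
    y = 1.0 if is_neighbor else 0.0
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


# Toy text embeddings for three ontology terms (hypothetical values).
parent = [1.0, 0.2, 0.0]
child = [0.9, 0.3, 0.1]       # neighbor of `parent` in the ontology graph
unrelated = [-0.1, 0.0, 1.0]  # not a neighbor

loss_pos = pair_loss(parent, child, True)       # similar and labeled neighbor
loss_neg = pair_loss(parent, unrelated, False)  # dissimilar and labeled non-neighbor
loss_bad = pair_loss(parent, unrelated, True)   # dissimilar but labeled neighbor
```

In this toy setup the loss is smallest when neighboring terms already have similar embeddings and largest when a dissimilar pair is labeled as neighbors, which is the pressure that pulls neighboring ontology terms together during fine-tuning.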
The method also allows for generating textual descriptions for biological data instances and interpreting biomedical discoveries through text-based analysis. The specific types of biological data used depend on the application: gene expression vectors for cell type classification, protein sequences for protein function prediction, and drug SMILES representations for drug target identification. Training text data for these applications are drawn from existing databases.
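The zero-shot step from the methodology can be sketched as a nearest-class lookup in the shared embedding space. The sketch below is illustrative only: it assumes the class text embeddings and the data instance embeddings have already been produced by the respective encoders, and the names `zero_shot_classify`, `class_embeddings`, and `cell_embeddings`, along with all numeric values, are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(instance_embeddings, class_embeddings):
    """Assign each instance to the class whose text embedding is most similar.

    No labeled instances of any class are used; each class is represented
    only by the embedding of its textual description.
    """
    predictions = []
    for inst in instance_embeddings:
        scores = {name: cosine(inst, emb) for name, emb in class_embeddings.items()}
        predictions.append(max(scores, key=scores.get))
    return predictions


# Hypothetical embeddings of two class descriptions in the shared space.
class_embeddings = {
    "T cell": [0.9, 0.1, 0.0],
    "hepatocyte": [0.0, 0.2, 0.9],
}

# Hypothetical embeddings of three cells from the gene expression encoder.
cell_embeddings = [
    [0.8, 0.2, 0.1],
    [0.1, 0.1, 1.0],
    [1.0, 0.0, 0.1],
]

predicted = zero_shot_classify(cell_embeddings, class_embeddings)
print(predicted)
```

Adding a new class in this scheme only requires embedding its textual description and inserting it into `class_embeddings`; the instance encoder is left unchanged.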
Key Findings
The paper demonstrates BioTranslator's effectiveness across several applications. First, it validates the assumption that classes with similar biological annotations have similar textual descriptions using Gene Ontology (GO) data. Analysis shows significant correlation between annotation-based GO term similarity and text-based GO term similarity, and between text-based GO term similarity and graph-based GO similarity derived from the GO graph. Similar consistencies are observed for cell type and pathway annotations. Second, BioTranslator is evaluated on GO-based protein function prediction, outperforming traditional text modeling approaches (TF-IDF, Word2Vec, Doc2Vec) and a bilingual translation method (ProTranslator) on five datasets. It demonstrates improved performance in both zero-shot and few-shot scenarios compared to state-of-the-art methods such as DeepGOPlus. Third, BioTranslator effectively generates biologically meaningful textual descriptions for sets of proteins, exceeding the performance of ProTranslator in terms of BLEU score. Fourth, BioTranslator achieves high accuracy (average AUROC of 0.90) in classifying cells into specific subtypes using only textual descriptions and no access to annotated cells or marker genes, across multiple single-cell RNA-seq datasets (Tabula Muris, Tabula Sapiens, Tabula Microcebus). The method also identifies marker genes for these unseen cell types with high accuracy. A text-based cell type marker gene network is constructed, enabling text-based query and visualization of marker genes. Finally, the multilingual nature of BioTranslator allows for accurate predictions between different modalities (drug-gene, gene-phenotype, phenotype-pathway) without using paired data between those modalities, showing superiority in low-data settings compared to supervised approaches. Biologically meaningful interpretations of the predictions are demonstrated.
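For readers unfamiliar with the AUROC figure quoted above, the metric can be read as the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. The following sketch computes it by direct pairwise comparison for a single class. The scores and labels are made-up toy values, not results from the paper, and the paper's exact evaluation protocol is not reproduced here.

```python
def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy similarity scores of six cells to one cell type description,
# with 1 marking cells that truly belong to that type.
scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]

value = auroc(scores, labels)
print(round(value, 3))
```

Here 8 of the 9 positive-negative pairs are ordered correctly, so the value is 8/9. An average AUROC would then be obtained by repeating this per class and taking the mean.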
Discussion
BioTranslator addresses a critical need in biological data analysis by enabling the classification and annotation of novel biological entities using only textual descriptions. Its multilingual translation framework advances existing methods by removing the need for paired data between every pair of modalities, which matters most for newly discovered entities where paired data is scarce or unavailable. Its superior performance across protein function prediction, cell type classification, drug target identification, and phenotype-pathway association highlights the versatility of the approach. The ability to generate textual descriptions for sets of proteins can further support the expansion and curation of existing biological ontologies. The findings suggest a shift in how researchers interact with large biological datasets, from selecting predefined categories to querying with free text, which could accelerate data exploration and discovery in biomedical research.
Conclusion
BioTranslator presents a powerful new approach to biomedical data analysis by bridging textual descriptions and diverse biological data modalities through a multilingual translation framework. Its ability to perform zero-shot classification and cross-modal prediction without paired data significantly advances current capabilities. Future directions include adapting the method to handle noisy and informal text, integrating BioTranslator into existing CV curation workflows, and exploring its application to other biological problems. The successful integration of language models into the field of biology paves the way for more efficient exploration of vast biomedical datasets.
Limitations
BioTranslator's performance depends heavily on the quality and completeness of existing controlled vocabularies, ontologies, and annotations, and its accuracy may degrade when user-provided textual descriptions are noisy or of low quality. Future work should address noisy user-generated text and expand the coverage of current ontologies. In addition, results still require validation by human experts to ensure biological accuracy.