Multilingual translation for zero-shot bio-medical classification using BioTranslator

Biology

H. Xu, A. Woicik, et al.

Discover BioTranslator, a groundbreaking multilingual translation method developed by Hanwen Xu, Addie Woicik, Hoifung Poon, Russ B. Altman, and Sheng Wang. This innovative tool enables scientists to move beyond controlled vocabularies by translating text descriptions of new biological concepts into actionable data instances, facilitating the identification of novel cell types, predicting protein functions, and locating drug targets with ease.
Introduction

High-throughput techniques have created ever-expanding repositories of omics datasets. Early analysis often relies on annotation to controlled vocabularies (CVs), which enables integration across datasets but constrains discovery to pre-defined terms. This CV-based paradigm struggles with new findings that do not fit existing categories, and creating or curating new CV terms is time-consuming and requires expert knowledge. The authors propose BioTranslator, a multilingual cross-modal translation framework that maps user-written textual descriptions directly to non-text biological data (e.g., gene expression vectors), enabling zero-shot retrieval and classification of biological instances without pre-existing CV labels for new classes. Unlike methods that map a query to the nearest existing CV term by text similarity, BioTranslator performs direct text-to-biology translation, co-embedding multiple modalities into a shared latent space by fine-tuning large-scale language models on biomedical ontologies. This approach aims to unlock free-text interaction with biological data, supporting tasks such as novel cell type identification, protein function prediction, and drug target identification.

Literature Review

The work positions itself relative to: (1) CV-based annotation systems used across biology to standardize datasets but limited to known concepts; (2) cross-modal learning approaches (e.g., image captioning) that are typically bilingual (two modalities) and require paired data; (3) prior biomedical bilingual translators such as ProTranslator that translate between protein sequence and text; and (4) ontology- and graph-based function prediction methods that leverage the Gene Ontology (GO) structure. The authors argue that a multilingual translation framework, where multiple biomedical modalities and text are jointly embedded into a shared space, enables capabilities not achievable by bilingual models: translation between non-paired modalities via text, improved performance by integrating diverse ontologies, and better handling of low-resource settings. They also note the promise of large-scale pretrained language models (e.g., PubMedBERT) over classical text representations (TF-IDF, Word2Vec, Doc2Vec) for capturing contextual semantics in biomedical text.

Methodology

BioTranslator is a multilingual translation framework between text and multiple biological modalities. Training pairs are constructed from biomedical ontologies: each class node’s textual description (text) is paired with biological instances annotated to that class (non-text). One text can map to multiple instances and vice versa.

Text encoder and multilingual fine-tuning:

  • Initialize text encoder with PubMedBERT using [CLS] token readout.
  • Fine-tune on 225 biomedical ontologies (2,010,648 textual descriptions) via contrastive learning: ontology neighbor nodes serve as positives; other pairs are negatives. Cosine similarity with cross-entropy loss is used.
  • Optimization and hyperparameters: Adam, learning rate 1e-5, batch size 16, max text length 256 tokens.
  • Rationale: fine-tuning on multiple ontologies aligns semantically related classes, estimating P(Y_unseen|Y_seen) to support zero-shot classification.
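The contrastive fine-tuning step above can be sketched in a few lines: the embedding of an ontology term's description is pushed toward a neighbor term's embedding and away from others, using cross-entropy over cosine similarities. This is a minimal pure-Python illustration with toy 3-d vectors, not the actual PubMedBERT training code; the temperature `tau` is an assumed detail.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_loss(anchor, positive, negatives, tau=1.0):
    """Cross-entropy over cosine similarities: the positive (an ontology
    neighbor's description embedding) should score higher than the negatives."""
    sims = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    logits = [s / tau for s in sims]
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return -math.log(exps[0] / sum(exps))

# Toy 3-d embeddings: the positive points the same way as the anchor.
anchor = [1.0, 0.0, 0.0]
loss_good = contrastive_loss(anchor, [0.9, 0.1, 0.0], [[0.0, 1.0, 0.0]])
loss_bad = contrastive_loss(anchor, [0.0, 1.0, 0.0], [[0.9, 0.1, 0.0]])
assert loss_good < loss_bad  # an aligned positive yields the smaller loss
```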

Mapping text to biological instances:

  • For each biological instance represented by k feature types, embed each feature using a separate fully connected network to obtain k embeddings (dimension h_y). Concatenate to form combined vector F_i ∈ R^{k*h_y}.
  • Train with paired text–instance data to annotate instances to CV terms using logistic cross-entropy over terms with learnable weights W; text embeddings Y_i have fixed dimension d_bio=768 from the text encoder.
  • Loss_BioTranslator = −∑_i ∑_j [ A_ij log σ(F_i^T W_j) + (1−A_ij) log(1−σ(F_i^T W_j)) ], where σ(x) = 1/(1+exp(−x)) and A_ij = 1 if instance i is annotated to term j, else 0.
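The annotation objective is a per-term binary cross-entropy over instance–term pairs. A minimal numeric sketch with toy embeddings and weights (the logistic function is written out explicitly; vectors and annotations are illustrative only):

```python
import math

def sigmoid(x):
    """Logistic function sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def annotation_loss(F, W, A):
    """Binary cross-entropy over instance-term pairs: F[i] is an instance
    embedding, W[j] the learnable weight vector for CV term j, and
    A[i][j] in {0, 1} the annotation matrix."""
    loss = 0.0
    for i, f in enumerate(F):
        for j, w in enumerate(W):
            p = sigmoid(sum(a * b for a, b in zip(f, w)))
            loss -= A[i][j] * math.log(p) + (1 - A[i][j]) * math.log(1 - p)
    return loss

# Toy example: one instance, two terms; the instance is annotated to term 0.
F = [[1.0, 0.0]]
W = [[2.0, 0.0], [-2.0, 0.0]]   # term 0 scores high, term 1 scores low
A = [[1, 0]]
assert annotation_loss(F, W, A) < annotation_loss(F, W, [[0, 1]])
```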

Zero-shot inference:

  • For a novel class, embed the user’s text to Y_novel; embed query instance features to F_q. Compute probability via dot product and logistic mapping to assign relevance without any annotated instances for the novel class.
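Zero-shot scoring thus reduces to a logistic over the dot product of the two embeddings. A toy sketch, with made-up 3-d vectors standing in for the learned embeddings:

```python
import math

def zero_shot_score(f_query, y_novel):
    """Relevance of a query instance to a novel, never-annotated class:
    logistic over the dot product of the two embeddings in the shared space."""
    return 1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(f_query, y_novel))))

y_novel = [0.5, -0.5, 1.0]   # embedding of the user's free-text description
close = zero_shot_score([0.6, -0.4, 0.9], y_novel)   # similar instance
far = zero_shot_score([-0.6, 0.4, -0.9], y_novel)    # dissimilar instance
assert close > 0.5 > far
```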

Applications and data modalities:

  1. Protein function prediction (GO): Texts are GO term descriptions; biological features include protein sequence, networks, and descriptions; trained to retrieve proteins for held-out GO terms.
  2. Cell type classification: Texts are Cell Ontology names/descriptions; biological instances are single-cell RNA-seq gene expression vectors; evaluated with 5-fold cross-validation and cross-dataset settings, including zero-shot for unseen types. Training hyperparameters for BioTranslator and baselines: LR 1e-4, batch size 128, 15 epochs. The method can also identify marker genes using two strategies (correlation-based using expression; cross-modal embedding-based without expression).
  3. Multimodal prediction without paired data: Train separate encoders to map each modality to the shared text-guided embedding space:
    • Gene-to-text: Mashup network features; 15,835 gene–text pairs from STRING; 2-layer MLP; 30 epochs.
    • Pathway-to-text: sum-pooled gene embeddings within each Reactome pathway.
    • Drug-to-text: SMILES features with textual descriptions from ChEBI; 32,967 drug–text pairs; Transformer (ChemBERTa initialization), 5 epochs, LR 3e-5, linear schedule with 400 warmup steps; test drugs excluded from training to avoid leakage.
    • Phenotype-to-text: HPO network features via Mashup and HPO ‘def’ field; 17,575 phenotype–text pairs; 50% phenotypes held out for test with texts never seen during training to avoid leakage.
  • Downstream tasks (no paired data between non-text modalities): Gene2Drug (targets) on GDSC and STITCH; Gene2Phenotype (Monarch); Pathway2Phenotype (Reactome–Monarch). Predictions are made via nearest neighbors/dot product in the shared space.
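Retrieval between non-paired modalities then reduces to ranking by dot product in the shared text-guided space. A hypothetical Gene2Drug example (drug names and 3-d vectors are illustrative only, not from the paper):

```python
def retrieve(query_emb, candidates):
    """Rank candidate embeddings from another modality by dot product
    with the query embedding in the shared text-guided space."""
    scored = [(name, sum(a * b for a, b in zip(query_emb, emb)))
              for name, emb in candidates.items()]
    return sorted(scored, key=lambda t: -t[1])

gene = [1.0, 0.2, 0.0]                       # hypothetical gene embedding
drugs = {"drugA": [0.9, 0.1, 0.0],           # hypothetical drug embeddings
         "drugB": [-0.2, 0.8, 0.1]}
ranking = retrieve(gene, drugs)
assert ranking[0][0] == "drugA"              # the aligned drug ranks first
```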

Marker gene identification:

  • Approach 1: Spearman correlation between gene expression across cells and predicted probability for a cell type; AUROC computed per type for seen and unseen.
  • Approach 2: Cross-modal embedding method without expression: compute P_gene,CL = 1/(1+exp(−F_gene^T W Y_CL)) using gene embeddings from GO-trained model and cell type text embedding Y_CL; ground truth via binarized known markers; cross-dataset evaluation across Tabula Muris (Droplet/FACS), Tabula Sapiens, Tabula Microcebus, Lemur1–4; only shared genes considered.
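Approach 1 above can be illustrated with a small, dependency-free Spearman correlation: a gene whose expression tracks the predicted cell-type probability across cells is a marker candidate. This sketch ignores rank ties, which a real pipeline (e.g., a library routine) would handle properly.

```python
def rank(xs):
    """Ranks 1..n by value (ties broken by position; enough for a sketch)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for pos, i in enumerate(order):
        r[i] = pos + 1
    return r

def spearman(x, y):
    """Spearman rho as the Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Toy data: expression rises monotonically with the predicted probability.
expression = [0.1, 0.5, 0.9, 1.3]
probability = [0.05, 0.2, 0.6, 0.95]
assert abs(spearman(expression, probability) - 1.0) < 1e-9
```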

Network analysis:

  • Construct a cell type–marker gene network using textual embeddings; edges connect nodes with high text-embedding similarity. Communities are detected, embedded by averaging member embeddings, and annotated with the nearest GO term (restricted to terms with 100–2,500 annotated genes; ancestors/descendants searched if needed); visualized with Cytoscape v3.9.1.
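Edge construction by text-embedding similarity can be sketched as follows; the similarity threshold and the toy 2-d embeddings are illustrative assumptions, not the paper's values.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def build_edges(nodes, threshold=0.9):
    """Connect node pairs whose text embeddings exceed a similarity cutoff."""
    names = list(nodes)
    edges = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine(nodes[a], nodes[b]) > threshold:
                edges.append((a, b))
    return edges

# Toy embeddings: the two T-cell-related nodes are close, the hepatocyte is not.
nodes = {"T cell": [1.0, 0.1], "proT cell": [0.95, 0.15], "hepatocyte": [0.0, 1.0]}
assert build_edges(nodes) == [("T cell", "proT cell")]
```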

Evaluation baselines and comparisons:

  • Text models: TF-IDF, Word2Vec, Doc2Vec.
  • Bilingual translator: ProTranslator.
  • Graph-based zero-shot: clusDCA, ML-ZSL, MZSL-GCN (using GO graph).
  • Protein function prediction comparator: DeepGOPlus (with/without BLAST) for few-shot.
  • Supervised models for cross-modal tasks: neural networks and logistic regression trained with varying ratios of paired data; class-feature fusion via bi-nonlinear or concatenation; LR uses no class features. BioTranslator uses no paired data between non-text modalities and no test-item texts.

Key Findings

  • Text similarity reflects annotation and ontology structure:

    • Strong correlation between text-based GO term similarity and annotation-based similarity (ANOVA p=2.28e-207); higher text similarity corresponds to closer GO graph distance. Similar consistency observed for pathways and cell types.
  • Protein function prediction (zero-shot over held-out GO functions):

    • Across five datasets (GOA Human/Mouse/Yeast, SwissProt, CAFA3) and three domains (BP/MF/CC), BioTranslator outperformed TF-IDF, Word2Vec, and Doc2Vec by mean absolute AUROC gains of 0.10 (BP), 0.13 (MF), and 0.12 (CC).
    • Multilingual BioTranslator substantially outperformed bilingual ProTranslator, indicating benefits of integrating 225 ontologies for text fine-tuning.
    • Outperformed graph-based zero-shot methods (clusDCA, ML-ZSL, MZSL-GCN), showing advantages of textual descriptions over GO graph for novel function annotation and applicability to arbitrary user-defined functions.
    • Few-shot: BioTranslator surpassed DeepGOPlus, particularly for functions with very few training annotations; analysis with BLAST and without BLAST showed consistent improvements.
    • Text generation from protein sets: Achieved BLEU=0.32 on GOA (human) vs ProTranslator’s 0.26, with many biologically accurate generated descriptions.
    • Robustness: Consistent performance across different pretrained LMs; PubMedBERT (domain-specific) performed best; fine-tuning on ontologies crucial, as untuned models underperformed.
  • Cell type classification and marker genes:

    • Zero-shot cell type identification using only textual descriptions on Tabula Muris, Tabula Sapiens, and Tabula Microcebus achieved average AUROC ≈0.90 when 50% of cell types were unseen.
    • Cross-dataset classification across eight datasets showed strong performance; e.g., AUROC on unseen types >0.90 when trained on Tabula Sapiens and tested on Tabula Microcebus.
    • Marker gene identification for unseen cell types feasible using only textual descriptions; AUROC ~0.82 using path-distance on a text-based cell type–marker gene network; literature-supported markers identified (e.g., RAG1, CD3E, CD44, ID2 for proT).
    • Using only cell type names (without full descriptions) yielded slightly lower but still strong performance.
  • Multimodal predictions without paired data:

    • BioTranslator, trained without any paired data between non-text modalities, achieved promising AUROC on Gene2Drug (GDSC, STITCH), Gene2Phenotype, and Pathway2Phenotype.
    • Outperformed supervised baselines on two low-resource tasks (Pathway2Phenotype and Gene2Drug on GDSC); on larger datasets (STITCH, Gene2Phenotype), outperformed supervised logistic regression without class features and was competitive with models using class features.
    • Case study: Predicted association of a drug with the ‘Respiratory electron transport’ pathway; overlapping informative terms across drug, pathway, and gene descriptions provided interpretability.

Discussion

BioTranslator bridges free-text queries and diverse biomedical data by co-embedding textual descriptions with biological modalities via a multilingual translation framework. Fine-tuning domain-specific language models (PubMedBERT) on a large collection of biomedical ontologies aligns semantically related classes, enabling zero-shot classification, text generation for new biological sets, and cross-modal prediction without paired data. The results show that text similarity correlates strongly with annotation and ontology proximity, supporting the central premise. BioTranslator outperforms classical text models, bilingual ProTranslator, and graph-based zero-shot approaches for protein function prediction; it accurately annotates unseen cell types and identifies markers using only text; and it enables unsupervised cross-modal association predictions, particularly excelling in low-resource scenarios. These capabilities suggest substantial utility for expanding and curating controlled vocabularies, accelerating biological discovery, and supporting tasks where annotations are sparse or absent. The interpretability of predictions via overlapping, biologically meaningful terms further enhances usability.

Conclusion

The study introduces BioTranslator, a multilingual cross-modal framework that translates user-written text into biological data embeddings, enabling zero-shot classification and discovery across proteins, cells, drugs, phenotypes, and pathways. By fine-tuning PubMedBERT on 225 ontologies and learning mappings from multiple biological modalities to a shared text-guided space, BioTranslator achieves state-of-the-art performance in annotating novel protein functions, identifying and characterizing unseen cell types, generating textual descriptions of protein sets, and predicting cross-modal associations without paired data. Future work includes: (1) improving robustness to noisy, lay, and short user text via text normalization; and (2) leveraging BioTranslator to refine and expand controlled vocabularies by extracting high-contribution phrases and collaborating with experts to curate new terms.

Limitations

  • Dependence on expert-written ontology descriptions and CV annotations for training; performance may degrade with noisy or non-expert user inputs.
  • Current system assumes availability and quality of ontology-derived text and existing annotations; gaps or biases in these resources can affect generalizability.
  • Cross-modal encoders are trained per modality with available features (e.g., Mashup networks, SMILES); performance may vary with feature quality and coverage.
  • While strong in low-resource settings, supervised models with extensive paired data and class features can outperform on large, well-annotated datasets.
  • Evaluation of some associations (e.g., drug–pathway) is limited by lack of large, well-curated benchmarks.