Microbial gene expression analysis of healthy and cancerous esophagus uncovers bacterial biomarkers of clinical outcomes

D. E. Schaffer, W. Li, et al.

Discover how the esophageal microbiome influences esophageal carcinoma in a groundbreaking study by Daniel E. Schaffer, Wenrui Li, Abdurrahman Elbasi, Dario C. Altieri, Qi Long, and Noam Auslander. This research unveils significant links between microbial proteins and patient survival, suggesting new prognostic biomarkers related to esophageal cancer development.

Introduction

Esophageal carcinoma (ESCA) is a common and highly lethal cancer, with 600,000 new cases per year and a 5-year survival rate of 15–26%. The healthy esophagus harbors resident bacteria (notably Streptococcus), and shifts in the esophageal microbiome have been associated with ESCA and Barrett’s esophagus. Short-read RNAseq enables tumor transcriptome profiling, but the short read length (75–80 nt) and the low abundance of microbial reads make accurate microbial identification and assembly difficult. Existing approaches are limited by read length, database size, and computational cost. The study aims to develop an efficient method to detect and assemble microbial transcripts from human RNAseq, and to compare microbial taxa and microbial protein expression between ESCA and healthy esophagus. The central questions are whether specific bacterial taxa or microbial proteins are associated with the presence of ESCA, and whether any microbial features correlate with clinical outcomes. Answers to both would provide potential biomarkers and mechanistic insight into host–microbe interactions in ESCA.

Literature Review

Prior work has linked microbiome alterations to gastrointestinal cancers, including colorectal cancer where Streptococcus bovis, Bacteroides fragilis, and Fusobacterium nucleatum have been implicated. ESCA has shown high bacterial read abundance in TCGA compared to other cancers, but many studies rely on tumor-adjacent tissue rather than truly healthy controls, potentially missing full cancer–healthy differences. Short-read RNAseq poses challenges for metatranscriptomic assembly and classification due to read length and large microbial databases. Alignment-free and database-guided methods have detected uncommon tumor bacteria, and comparisons of tumor vs adjacent tissue have shown shifted microbial prevalence. However, comprehensive comparisons to healthy esophagus and associations with clinical outcomes remain limited. The study builds on previous deep learning-based viral detection and targeted assembly approaches to better characterize bacterial expression from short reads.

Methodology

Data sources: The tumor cohort consisted of 75-base RNAseq reads from 170 TCGA ESCA samples. The healthy cohort consisted of 76-base reads from GTEx esophagus samples (reported as 165 samples and up to 742 individuals). After filtering, analyses used reads from 161 ESCA cases and 742 healthy esophagus samples. Protocols involved total RNA isolation, poly(A) selection via oligo(dT), cDNA synthesis, amplification, and Illumina HiSeq 2000 sequencing.

Preprocessing: Reads mapping to the human genome were removed. For unmapped reads containing ambiguous nucleotides (N), small numbers of Ns were replaced with random nucleotides, for prediction only. The 75-base TCGA reads were padded with one random 3' nucleotide to reach the 76 bases required as model input. On validation data, replacing 1–2 nucleotides had minimal impact on performance. Overall, 2,656,993,221 TCGA reads and 631,388,031 GTEx reads were considered after filtering.

Deep learning classification: A convolutional neural network was trained to classify 76-base sequences as human, bacterial, or viral, using reads simulated from the human transcriptome, viral transcripts, and bacterial genomes. Multiple models and hyperparameter settings were screened; the best model was selected on held-out validation data and tested on an independent set.

Seeded assembly: Reads assigned a high bacterial or viral model score (≥0.46) were designated as seed reads. Seeds were prioritized by descending bacterial score, then by viral score. A greedy extension assembly (modified from prior work) extended each seed in both directions. During extension, an N matched any base and the non-N nucleotide was retained.

Taxonomic and gene mapping: Assembled contigs were aligned with BLAST to curated microbial nucleotide databases, including NCBI representative bacterial genomes and curated viral databases (reference human viruses and novel/non-human viruses). For taxa-level characterization, species-level hits were pooled to genera to reduce the number of hypotheses tested.
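The read preparation, seed selection, and greedy extension steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names, the `min_overlap` parameter, and the rule of taking the longest available overlap at each step are hypothetical. Only the 0.46 score cutoff, the 76-base input length, the ordering of seeds by bacterial then viral score, and the N-matching rule come from the text. Reverse-complement matching is omitted for brevity.

```python
import random

SEED_THRESHOLD = 0.46  # score cutoff quoted in the text


def prepare_read(read, target_len=76, rng=None):
    """Replace Ns with random bases and pad the 3' end with random bases (model input only)."""
    rng = rng or random.Random(0)  # fixed seed so the sketch is reproducible
    bases = [rng.choice("ACGT") if b == "N" else b for b in read.upper()]
    while len(bases) < target_len:
        bases.append(rng.choice("ACGT"))
    return "".join(bases)


def pick_seeds(scored_reads, threshold=SEED_THRESHOLD):
    """scored_reads holds (read, bacterial_score, viral_score) tuples."""
    seeds = [r for r in scored_reads if max(r[1], r[2]) >= threshold]
    seeds.sort(key=lambda r: (-r[1], -r[2]))  # bacterial score first, then viral
    return [r[0] for r in seeds]


def bases_match(a, b):
    """An N matches any base."""
    return a == b or a == "N" or b == "N"


def merge(a, b):
    """Merge two equal-length overlapping strings, keeping the non-N base."""
    return "".join(y if x == "N" else x for x, y in zip(a, b))


def best_right_extension(contig, reads, min_overlap):
    """Return (overlap, read_index) for the longest suffix/prefix overlap, or None."""
    best = None
    for i, read in enumerate(reads):
        max_k = min(len(read) - 1, len(contig))  # the read must add at least one base
        for k in range(max_k, min_overlap - 1, -1):
            if all(bases_match(x, y) for x, y in zip(contig[-k:], read[:k])):
                if best is None or k > best[0]:
                    best = (k, i)
                break
    return best


def assemble(seed, reads, min_overlap=20):
    """Greedily extend a seed in both directions; min_overlap must be at least 1."""
    contig, pool = seed, [r for r in reads if r != seed]
    extended = True
    while extended:
        extended = False
        for flip in (False, True):  # False: extend right; True: extend left via reversal
            c = contig[::-1] if flip else contig
            p = [r[::-1] for r in pool] if flip else pool
            hit = best_right_extension(c, p, min_overlap)
            if hit:
                k, i = hit
                c = c[:-k] + merge(c[-k:], p[i][:k]) + p[i][k:]
                contig = c[::-1] if flip else c
                pool.pop(i)  # each read is used at most once
                extended = True
    return contig
```

Left extension reuses the right-extension routine on reversed strings, which keeps the sketch short. A production assembler would also index reads (for example by k-mer) rather than scanning the whole pool at every step.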
Contigs were also mapped to microbial genes via BLASTX against the RefSeq non-redundant microbial protein database; presence per sample was defined by the lowest E-value hit (≤1e-5 threshold implied) per gene.

Covariate-adjusted enrichment analyses: To adjust for confounders between TCGA and GTEx (risk, age, sex, weight, smoking, and sequencing depth), individuals with complete data (715 GTEx and 122 TCGA) were used to fit a bootstrap logistic regression predicting cohort membership. The resulting AUC of 1.0 indicated that the two cohorts were fully separable on these covariates. Weighted chi-squared tests (weights = inverse probability of cohort assignment) evaluated over- and under-representation of bacterial genera and microbial protein families, with FDR correction for multiple testing.

Phylogenetics: Representative 16S rRNA sequences (one per genus) were aligned with MUSCLE v5.1, a tree was constructed with FastTree v2.1.11, and the result was visualized with iTOL.

Survival analyses: ESCA overall and disease-free survival (TCGA Clinical Data Resource) were compared between positive and negative samples for each bacterial species or genus (requiring ≥5 positives and ≥5 negatives) and for each microbial protein family, using log-rank tests (lifelines) with FDR correction.

Host transcriptome analyses: TCGA ESCA RSEM values were used to compare host gene expression between samples positive and negative for the microbial iron-related proteins significantly associated with poor outcomes (accessions WP_006809445.1, WP_002335904.1, WP_131667507.1). Differentially expressed genes were selected by unadjusted p<0.05 with a median Z-score >0.2 in positives and <0 in negatives. KEGG enrichment identified pathways upregulated in Fe-positive samples.

Genome-scale metabolic modeling: Sample-specific genome-scale metabolic models (GSMMs), constrained by host gene expression, were analyzed by flux balance analysis on Recon1 using COBRA Toolbox v3.0 to estimate oxygen consumption and ATP production in Fe-positive vs Fe-negative samples.
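The covariate-adjusted enrichment test can be illustrated with a short sketch. It assumes that each sample already has a fitted probability of belonging to the tumor cohort (the logistic regression itself is not shown), and it builds a weighted 2×2 table of cohort by presence. Several details are assumptions made for illustration rather than facts from the study: the function names, the clipping of extreme probabilities, the rescaling of weights to the sample count, and the use of the Benjamini-Hochberg procedure for the FDR step (the text says only "FDR correction").

```python
import math


def weighted_chi2(in_tumor, present, p_tumor, eps=1e-3):
    """Inverse-probability-weighted chi-squared test on a 2x2 table.

    in_tumor[i], present[i] are booleans; p_tumor[i] is the fitted probability
    that sample i belongs to the tumor cohort (hypothetical input).
    """
    table = [[0.0, 0.0], [0.0, 0.0]]  # rows: healthy/tumor, columns: absent/present
    for t, x, p in zip(in_tumor, present, p_tumor):
        prob = p if t else 1.0 - p          # probability of the cohort actually observed
        table[int(t)][int(x)] += 1.0 / max(prob, eps)  # clipping is an assumption
    total = sum(map(sum, table))
    scale = len(in_tumor) / total           # assumption: rescale weights to sample count
    table = [[v * scale for v in row] for row in table]
    n = len(in_tumor)
    rows = [sum(r) for r in table]
    cols = [table[0][j] + table[1][j] for j in range(2)]
    if min(rows + cols) == 0:
        return 0.0, 1.0                     # degenerate table: no evidence either way
    chi2 = sum((table[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
               for i in range(2) for j in range(2))
    # Survival function of a chi-squared variable with 1 degree of freedom.
    return chi2, math.erfc(math.sqrt(chi2 / 2))


def benjamini_hochberg(pvals):
    """Return BH-adjusted q-values in the original order of pvals."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    q = [0.0] * m
    running = 1.0
    for rank in range(m, 0, -1):            # walk from the largest p-value down
        i = order[rank - 1]
        running = min(running, pvals[i] * m / rank)
        q[i] = running
    return q
```

In use, `weighted_chi2` would be called once per genus or protein family and the collected p-values passed to `benjamini_hochberg`. When cohort membership is predicted almost perfectly, inverse-probability weights become very large, which is why the sketch clips them.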

Key Findings
  • Model performance: On held-out test data, one-vs-all AUPRCs were 0.89 (human), 0.91 (bacterial), 0.80 (viral); AUROCs were 0.95 (human), 0.94 (bacterial), 0.89 (viral).
  • Microbial diversity: Across 161 ESCA and 742 healthy esophagus samples, 961 unique bacterial species were detected. Among 145 genera sufficiently represented, 32 were significantly over-represented in ESCA (P<0.05), with additional genera significantly under-represented in ESCA.
  • Under-abundant in ESCA (healthy-enriched): Genera markedly less frequent in ESCA than healthy included Cutibacterium, Sphingomonas, Fictibacterius, and Corynebacterium. Sphingobium occurred in 18.3% of healthy vs only 1 ESCA sample.
  • Over-abundant in ESCA: Nine genera were ≥50 percentage points more common in ESCA than healthy, including Bacillus (detected in nearly all ESCA samples with any bacterial reads vs ~21% of healthy), Clostridiodoter, Peribacillus, Candidatus, Rhodothermaceae, Delftia, Halopolsenodona, Methylophilus, and Larkinella. Increased Proteobacteria were noted, consistent with prior reports in other cancers.
  • Helicobacter prevalence was very low in both cohorts (GTEx 0.1%, TCGA 0.6%), consistent with its association being specific to gastric cancer.
  • Viruses: 691 unique viral strains were detected (61 ESCA, 503 healthy). Herpesviruses were most common (32 ESCA, 162 healthy). Geobacillus bacteriophages (E2, E3 types) appeared in 192 healthy samples but none in ESCA; Geobacillus bacteria were rarely detected, suggesting potential alternative hosts or higher phage expression.
  • Survival by taxa: No bacterial genera or species showed significant associations with ESCA survival after FDR correction.
  • Microbial proteins: 16,261 bacterial protein families were identified across samples; 87.6% (14,284) showed ≤5 percentage-point difference in prevalence between ESCA and healthy. 697 proteins were overrepresented in ESCA by >25 percentage points; many were phage-related (e.g., major capsid, terminase, portal, tail proteins). Other ESCA-enriched proteins included acyl-CoA dehydrogenase, LLM-class flavin-dependent oxidoreductase, ABC transporters, S49 peptidases, and phosphatases. Over 2,000 protein families were significantly differentially present after covariate adjustment (q<0.05).
  • Survival by proteins: 34 microbial protein families were significantly negatively associated with survival (q<0.05). Several are linked to mitochondrial functions and iron–sulfur biology (pyruvate dehydrogenase, succinate dehydrogenase, aconitase, and iron–sulfur cluster assembly SufB). Presence of microbial Fe-related proteins was associated with worse survival.
  • Host pathway associations: ESCA samples positive for microbial Fe proteins showed upregulation of host pathways: bacterial infection (P=2.2e-5), endocytosis (P=2.6e-6), oxidative phosphorylation (P<0.0001), and ferroptosis (P=5.2e-5). Elevated host ferroptosis gene expression correlated with worse survival (e.g., P=4.64e-09; additional association P=9.31e-03). GSMM predicted higher oxygen consumption and ATP production in Fe-positive samples, consistent with enhanced mitochondrial respiration and oxidative stress.
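
The survival comparisons reported above split ESCA samples by presence or absence of a microbial feature and compare the two groups with a log-rank test. The study used the lifelines package for this; the sketch below is a dependency-free two-group log-rank test written only to show the calculation. The function names are hypothetical, and the group-size filter mirrors the ≥5 positives and ≥5 negatives rule that the text states for taxa.

```python
import math


def logrank_p(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; events are 1 (death/relapse observed) or 0 (censored)."""
    data = ([(t, e, 0) for t, e in zip(times_a, events_a)]
            + [(t, e, 1) for t, e in zip(times_b, events_b)])
    event_times = sorted({t for t, e, _ in data if e})
    obs_a = exp_a = var = 0.0
    for t in event_times:
        n_a = sum(1 for s, _, g in data if s >= t and g == 0)  # at risk in group A
        n_b = sum(1 for s, _, g in data if s >= t and g == 1)  # at risk in group B
        d_a = sum(1 for s, e, g in data if s == t and e and g == 0)
        d_b = sum(1 for s, e, g in data if s == t and e and g == 1)
        n, d = n_a + n_b, d_a + d_b
        obs_a += d_a
        exp_a += d * n_a / n                # expected events in A under the null
        if n > 1:
            var += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if var == 0:
        return 1.0
    chi2 = (obs_a - exp_a) ** 2 / var
    return math.erfc(math.sqrt(chi2 / 2))   # chi-squared tail, 1 degree of freedom


def survival_p_for_feature(times, events, positive, min_group=5):
    """Return the log-rank p-value for one feature, or None if either group is too small."""
    pos = [i for i, flag in enumerate(positive) if flag]
    neg = [i for i, flag in enumerate(positive) if not flag]
    if len(pos) < min_group or len(neg) < min_group:
        return None
    return logrank_p([times[i] for i in pos], [events[i] for i in pos],
                     [times[i] for i in neg], [events[i] for i in neg])
```

The p-values from all tested features would then be corrected for multiple testing before any is reported as significant.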

Discussion

The study demonstrates that although bacterial taxa differ markedly between ESCA and healthy esophagus, the presence of a taxon alone does not predict clinical outcomes. Instead, the expression of specific microbial proteins, especially those related to mitochondrial metabolism and iron–sulfur cluster machinery, correlates with poor survival. This points to microbial functional activity, rather than specific taxa, as the relevant signal. The pronounced enrichment of phage proteins in ESCA suggests that bacteriophages may modulate tumor-associated microbiomes and gene transfer, potentially favoring cancer-associated bacterial functions. Host transcriptomic analyses show that microbial Fe-related proteins are associated with upregulation of oxidative phosphorylation and ferroptosis pathways, and metabolic modeling supports increased oxygen consumption and ATP production in the affected tumors. These findings suggest a microbiome–mitochondria–iron axis contributing to ESCA progression and underscore the need for functional microbial profiling beyond species identification. The proposed pipeline mitigates short-read limitations by combining deep learning-based read triage with targeted assembly, enabling detection of microbial genes from standard RNAseq and facilitating insights into host–microbe interactions in cancer.

Conclusion

This work introduces an efficient RNAseq-based pipeline that integrates a deep learning classifier with targeted assembly to sensitively detect microbial transcripts in human tissues. Applied to ESCA and healthy esophagus, the approach reveals extensive differences in bacterial genera but no taxa-level survival associations. In contrast, microbial protein families carry both signals: phage components are strongly enriched in ESCA, and 34 protein families, notably bacterial proteins involved in mitochondrial metabolism and iron–sulfur cluster biology, are associated with poor outcomes. These microbial functional markers align with host pathway dysregulation (oxidative phosphorylation, endocytosis, bacterial infection, ferroptosis) and increased metabolic flux, suggesting mechanistic links to cancer progression. The framework provides candidate prognostic microbial gene biomarkers and can be extended to other malignancies. Future directions include: validating microbial protein biomarkers in independent cohorts; establishing causal mechanisms via experimental models; extending the method to fungal and other eukaryotic components; improving quantification beyond presence/absence; and harmonizing cohorts to minimize confounding and technical variability.

Limitations
  • Cohort differences (TCGA vs GTEx) in demographics, risk factors, and sequencing depth introduce confounding; although addressed via weighted analyses (AUC=1.0 for cohort separability), residual confounding may remain.
  • Potential contamination in sequencing pipelines; common contaminants were screened, but low-biomass samples remain vulnerable.
  • Short-read constraints and variable success of targeted assembly may limit recovery of full-length microbial transcripts and bias detection toward more abundant elements; highly divergent microbial elements may be missed.
  • Microbial feature quantification is binary (presence/absence), limiting expression-level analyses and differential abundance estimation.
  • RNAseq protocols optimized for eukaryotic mRNAs may underrepresent microbial RNAs, and recovered microbial reads may not fully represent the true microbiome.
  • The current pipeline focuses on bacteria and viruses; fungal and other eukaryotic microbes are not profiled.
  • Differences in sequencing depth (generally higher in ESCA) may inflate detection rates in tumors and contribute to asymmetry in protein prevalence.