
A deep learning approach reveals unexplored landscape of viral expression in cancer
A. Elbasir, Y. Ye, et al.
Discover viRNAtrap, a deep learning pipeline that identifies viral reads and assembles viral contigs from RNA sequencing data. Applied to tumor samples, it revealed unexpected viruses in cancer tissues and human endogenous viruses associated with poor survival. The research, by Abdurrahman Elbasir, Ying Ye, and colleagues, opens new avenues for studying viral expression in clinical settings.
Introduction
Viral infections contribute to approximately 15% of cancers, acting as direct carcinogens via viral oncogene expression or as indirect carcinogens through infection-associated mutagenesis and inflammation. Well-established oncogenic viruses include high-risk human papillomaviruses (HPV), hepatitis B virus (HBV), hepatitis C virus (HCV), Kaposi’s sarcoma herpesvirus, and Merkel cell polyomavirus. Advances in sequencing have facilitated the detection of viral sequences in tumor tissues, notably using TCGA datasets. However, current discovery approaches mainly depend on sequence homology to known viruses, a limitation when reads originate from novel or divergent viruses, especially with short-read RNA-seq. Deep learning methods for metagenomic DNA have demonstrated potential to identify viral sequences beyond homology constraints. Motivated by these advances, the authors propose a deep learning framework for RNA-seq to distinguish viral from human reads and reconstruct viral contigs, aiming to reveal overlooked viral diversity in cancers and to connect viral expression with genomic and clinical features.
Literature Review
Prior TCGA-wide surveys used DNA-seq and RNA-seq to screen for known viruses, evaluate integration sites (e.g., HBV), and discover novel isolates via homology-based methods. Combined DNA/RNA studies quantified known cancer-associated viruses across cancers. Nonetheless, incomplete sampling of viral diversity and rapid viral evolution limit homology-based detection, particularly for short reads. In metagenomics, alignment-free and deep learning tools (e.g., VirFinder, Seeker, DeepViFi, DeepViral, ViraMiner, PPR-Meta) can identify viral sequences and human-infecting virus potential, suggesting applicability to RNA-seq. The clinical literature documents associations between oncovirus presence and cancer phenotypes (e.g., HPV with better survival in HNSC) and implicates HERV expression (especially HERV-K/HERV-H) in tumors and poor outcomes. Reports also note plant and other unexpected viruses in human tissues, though routes of exposure and contamination remain debated.
Methodology
Overview: The viRNAtrap framework integrates a deep learning classifier that labels RNA-seq reads as viral or human with an assembly procedure that uses the model scores to build viral contigs from unmapped reads. Assembled contigs are post-processed via BLAST searches against curated viral databases (reference human viruses, curated potentially functional HERVs, and a database of divergent/non-human-host viruses) to annotate viral origins.
Data: 7,272 TCGA RNA-seq samples from 14 cancer types (6,717 tumors; 555 matched non-cancer) were analyzed. BAM files were downloaded from GDC. Reads were aligned with Bowtie2 to hg19 (1000 Genomes version) and the PhiX phage (NC_001422); only unmapped reads were retained for viral discovery.
Deep learning model: Input sequences were one-hot encoded 48 bp fragments, reflecting the TCGA RNA-seq read length. The positive class comprised viral coding sequences from NCBI Virus Variation; the negative class comprised human transcripts (hg19) from NCBI. Human transcripts were cut into 48 bp segments, while viral sequences were cut into 48 bp segments with a 2 bp sliding window, which yields more viral fragments and balances the classes. Architecture: one 2D convolutional layer (64 filters, ReLU), a pooling layer, and three fully connected layers (final output with activation). The learning rate was 0.0005. A train/test split was used to evaluate model performance.
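The input preparation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the channel order `ACGT`, the all-zero encoding for ambiguous bases, and the function names `segment` and `one_hot` are assumptions, as is the reading that human transcripts are cut without overlap (step of 48) while viral sequences use a step of 2.

```python
from typing import List

READ_LEN = 48      # fragment length stated in the text
ALPHABET = "ACGT"  # assumed channel order; not specified in the source


def segment(sequence: str, step: int, length: int = READ_LEN) -> List[str]:
    """Cut a sequence into fixed-length fragments, advancing `step` bases each time."""
    return [sequence[i:i + length] for i in range(0, len(sequence) - length + 1, step)]


def one_hot(fragment: str) -> List[List[int]]:
    """Encode a fragment as one row of four 0/1 values per base.

    Bases outside ALPHABET (e.g. N) become all-zero rows (an assumption).
    """
    rows = []
    for base in fragment.upper():
        rows.append([1 if base == letter else 0 for letter in ALPHABET])
    return rows


# Hypothetical usage: a step of 48 for human transcripts, a step of 2 for viral sequences.
human_fragments = segment("ACGT" * 24, step=48)  # 96 bp -> 2 fragments
viral_fragments = segment("ACGT" * 24, step=2)   # 96 bp -> 25 fragments
```

The smaller step produces many overlapping fragments from the same viral sequence, which is how a smaller pool of viral sequence can be brought closer in size to the human class.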
Assembly: Model-scored unmapped reads above a threshold (seed ~0.7) were assembled into contigs. The approach uses model scores to seed and extend assemblies, improving speed relative to de novo assembly of all unmapped reads.
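To make the seed-and-extend idea concrete, here is a deliberately simplified sketch. Only the seed threshold of about 0.7 comes from the text; the exact-overlap extension rule, the `min_overlap` default, rightward-only extension, and the function names are assumptions for illustration and do not reproduce the published assembler.

```python
from typing import Dict, List, Set


def extend_right(contig: str, reads: List[str], used: Set[int], min_overlap: int) -> str:
    """Greedily append unused reads whose prefix exactly matches the contig's suffix."""
    extended = True
    while extended:
        extended = False
        for idx, read in enumerate(reads):
            if idx in used:
                continue
            # Try the longest possible overlap first, down to min_overlap.
            longest = min(len(read) - 1, len(contig))
            for k in range(longest, min_overlap - 1, -1):
                if contig.endswith(read[:k]):
                    contig += read[k:]
                    used.add(idx)
                    extended = True
                    break
            if extended:
                break
    return contig


def assemble(scored_reads: Dict[str, float],
             seed_threshold: float = 0.7,
             min_overlap: int = 24) -> List[str]:
    """Seed contigs from reads scoring at or above seed_threshold, best score first."""
    reads = sorted(scored_reads, key=scored_reads.get, reverse=True)
    used: Set[int] = set()
    contigs = []
    for idx, read in enumerate(reads):
        if idx in used or scored_reads[read] < seed_threshold:
            continue
        used.add(idx)
        contigs.append(extend_right(read, reads, used, min_overlap))
    return contigs
```

Because only high-scoring reads start a contig, the search space is much smaller than assembling every unmapped read, which is the speed advantage the text describes. In this sketch any unused read may extend a seed regardless of its score; that choice is an assumption.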
Databases and search: BLASTn (E-value thresholds typically ≤1e-3 to 1e-2) of contigs against: (1) NCBI RefSeq human viruses plus additional HPV strains (from PaVE), (2) curated potentially functional HERVs (n=3,044) compiled from the HERV annotation database; candidate HERV loci were filtered for evidence of retroviral proteins (POL/RT/GAG/ENV) via BLASTn to nr and profile searches using MAFFT-aligned vertebrate POL genes, and (3) a collection of divergent viruses from various hosts (e.g., plant, arthropod, bat, mouse, bacterial phages). Quality criteria included minimum contig lengths and coverage/similarity thresholds; potential vector/contaminant accessions were removed.
Quality standards: Reference viruses required strong aggregate coverage and similarity across contigs (e.g., average similarity >90%, sufficient coverage; short maximal contigs under 100 bp manually verified). HERVs required contigs >200 bp and high identity. Divergent viruses required either maximal contig length >300 bp with moderate coverage or ≥100 bp with higher coverage for manual review. All divergent hits were verified by BLAST against nr; selected accessions were further searched with STAR to increase coverage.
Performance evaluation: Model performance was compared with DeepViFi, DeepViral, VirFinder, and an off-the-shelf Seq2Seq baseline. A simulation assessed detection probability versus read count; ≥5–9 viral reads yielded high identification rates (~93–99%).
Genomic/clinical correlates: Tumor mutation burden (TMB) and aneuploidy (chromosome-arm level alterations) were compared between virus-positive and virus-negative samples. Immune composition was inferred with CIBERSORT (22-cell signature). Survival analyses employed Kaplan–Meier estimates and log-rank tests (Python lifelines v0.26.4), with FDR corrections applied within cancer types and globally for HERV analyses.
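The summary does not say which FDR procedure was used. Assuming the common Benjamini-Hochberg method, the correction applied to a set of log-rank p-values could look like the sketch below; the function name and the example p-values are hypothetical.

```python
from typing import List


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical log-rank p-values for four comparisons within one cancer type.
q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

A comparison is then called FDR-significant when its adjusted value falls below the chosen level (for example 0.05), which is why some nominally significant survival trends do not survive correction.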
Experimental validation: Geobacillus virus E2 expression was tested in cancer cell lines (CCLE/OVCAR) via RT–qPCR. RNA extraction (TRI-reagent), cDNA synthesis (Thermo Fisher High-Capacity kit), and qPCR (QuantStudio 3) were performed with primers specific to Geobacillus virus E2; GAPDH served as the reference control, and fold change was computed as 2^-ΔΔCt.
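For readers unfamiliar with the 2^-ΔΔCt calculation, the following sketch shows the arithmetic. The Ct values are invented for illustration and are not data from the study; the function and argument names are likewise hypothetical.

```python
def fold_change(ct_target_sample: float, ct_ref_sample: float,
                ct_target_control: float, ct_ref_control: float) -> float:
    """Relative expression by the 2^-ddCt method.

    dCt   = Ct(target) - Ct(reference gene, here GAPDH) within each condition
    ddCt  = dCt(sample) - dCt(control)
    """
    delta_sample = ct_target_sample - ct_ref_sample
    delta_control = ct_target_control - ct_ref_control
    return 2.0 ** -(delta_sample - delta_control)


# Hypothetical Ct values: the target amplifies 3 cycles earlier in the sample
# than in the control after GAPDH normalization, giving an 8-fold difference.
example = fold_change(24.0, 18.0, 27.0, 18.0)
```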
Contaminant filtering and caveats: Decontamination tools (e.g., ViperEditor) and Kraken2 were considered; Kraken2 was ineffective with 48 bp reads. Manual inspection supported key divergent virus calls.
Key Findings
- The deep learning classifier achieved AUROC 0.81, AUPRC 0.82, accuracy 0.71, recall 0.83, precision 0.67, F1-score 0.74 on test data. It outperformed DeepViFi, DeepViral, and VirFinder on most metrics (precision lower than the best comparator), and enabled fast, alignment-free seeding of assemblies.
- Known oncoviruses were robustly detected across TCGA tumor types; viRNAtrap had >95% true positive rates for HR-HPV in CESC/HNSC and for HBV in LIHC when compared with TCGA clinical annotations.
- HBV expression was found in 58 LIHC tumors and 7 matched non-cancer samples; HCV in 13 LIHC tumors. HBV was also detected in a few samples of other cancer types.
- Adenovirus 2 (AV2) reads were identified in 8 LIHC samples (6 tumors, 2 non-cancer) and sporadically in KIRC, PAAD, and one LUAD non-cancer sample, exceeding prior TCGA-wide reports that found AV2 in ≤6 LIHC cases.
- Additional human viruses detected included human polyomaviruses (notably polyomaviruses 6 and 7) in BCA and HNSC, and Parvovirus B19 in some tumor and matched non-cancer samples.
- HR-HPV-positive HNSC samples had lower TMB and aneuploidy than HR-HPV-negative samples. HR-HPV-positive patients showed a trend toward better overall survival (not significant after FDR correction). In LIHC, HBV-positive tumors showed higher TMB and a trend toward better overall survival (not significant after FDR correction).
- HERVs: A curated set of potentially functional HERVs revealed widespread expression across cancers. HERV-K and HERV-H positivity associated with significantly lower overall survival in COAD, LUAD, LUSC, and LIHC (FDR-significant in selected comparisons). Some HERV loci were proximal to cancer-relevant genes (e.g., within hundreds of bp of ICOS and PADI). HERVs linked to poor survival were not associated with TMB or aneuploidy but were more likely to co-occur with mutations in driver genes (TP53, KRAS, ARID1A, PTEN), without a single dominant gene association.
- Divergent/non-canonical viruses: Plant mosaic viruses (e.g., watermelon mosaic virus; Bermuda grass latent virus) were detected across multiple tumor types, particularly adenocarcinomas. A Reoviridae-related member and other unexpected viruses were observed across several cancers.
- Arthropod-associated viruses: IV31 (Armadillidium vulgare indeterminate virus 31; Iridoviridae/Batidoviridae-like) reads were detected in 152 endometrial (UCEC) tumor samples (>25% of UCEC cohort). IV31 presence associated with improved overall survival and higher inferred CD8+ T-cell and Treg fractions. IV31 positively co-occurred with Trichomonas vaginalis (TV) expression (Fisher’s exact p = 1.8e-8), and both IV31 and TV associated with PTEN mutations; IV31 also associated with CTNNB1 and PIK3R1 mutations.
- Bacteriophage-related signal: Geobacillus virus E2 expression was detected in 3 UCEC samples and, in an external analysis, in ~10% of high-grade serous ovarian cancers; expression was validated by RT–qPCR in a predicted-positive ovarian cancer cell line.
- Additional findings included Redondoviridae expression in head and neck cancers not previously reported.
- Simulation indicated high detection probability when ≥5–9 viral reads are present (93–99%).
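As a quick consistency check on the classifier metrics listed in the first finding, the F1-score can be recomputed from the reported precision and recall. The helper name is hypothetical; the inputs are the values quoted above.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Reported values: precision 0.67, recall 0.83.
reported_f1 = round(f1_score(0.67, 0.83), 2)
```

The result rounds to 0.74, matching the reported F1-score.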
Discussion
The study addresses the limitations of homology-based viral discovery from short-read tumor RNA-seq by introducing an alignment-free deep learning framework that discriminates viral from human reads and uses classifier scores to guide contig assembly. This approach expands the detectable virome in cancer tissues, recovering both known oncoviruses and unexpected viral clades with limited similarity to reference databases. The robust recovery of HR-HPV and HBV, including concordance with clinical annotations, confirms accuracy for known oncogenic viruses, while additional detections (e.g., AV2, polyomaviruses) highlight improved sensitivity.
Systematic characterization of HERVs across cancers using a curated function-oriented database reveals that HERV-K/HERV-H expression is associated with poor overall survival in multiple cancer types, independent of TMB or aneuploidy, and in the context of varied driver mutations. This supports a detrimental prognostic role for endogenous retroviral activity in several tumors.
The identification of divergent and non-human-host viruses (plant, arthropod, bacterial phage-related) in tumor RNA-seq underscores the complexity of the tumor virome and potential environmental, commensal, or contaminant sources. Notably, IV31 in UCEC showed an association with improved survival and immune infiltration and co-occurred with TV, suggesting possible interactions between infections and the tumor microenvironment. The validation of Geobacillus virus E2 in ovarian cancer cell lines offers orthogonal evidence that some non-canonical viral signals represent genuine expression rather than purely computational artifacts. Nonetheless, contamination and misclassification remain critical considerations; the authors applied stringent filters, manual BLAST verification, and secondary alignments (STAR) to mitigate false positives.
Overall, the framework facilitates rapid, sensitive detection of viral expression from RNA-seq, enabling discovery-oriented analyses that link viral presence with genomic features and patient outcomes.
Conclusion
The authors present viRNAtrap, a deep learning-based, alignment-free pipeline to identify viral reads and assemble viral contigs from tumor RNA-seq. Applied to 14 TCGA cancers, the method recapitulates known oncoviruses, uncovers endogenous retroviral expression linked to poor survival across multiple cancers, and detects unexpected divergent viruses, including arthropod- and plant-associated clades, with select findings confirmed experimentally (Geobacillus virus E2 in ovarian cancer cell lines). The approach broadens the landscape of viral expression detectable in human cancers and provides a resource and methodology for future discovery of cancer-related viruses. Future work should refine organism-agnostic models, improve sensitivity for low-abundance viruses, clarify sources and biological relevance of divergent viral signals, expand experimental validations, and assess clinical utility for prognosis and therapeutic stratification.
Limitations
Potential contamination and environmental sources cannot be fully excluded for divergent/non-human-host viruses despite stringent filtering and manual verification. The deep learning model was trained to distinguish human from viral reads, so its performance on reads from other organisms is less well defined. Short read lengths (48 bp) limit alignment-based confirmation, and tools like Kraken2 perform poorly on such short fragments. Sensitivity drops when very few viral reads are present (<5), as indicated by simulations. Some survival associations (e.g., HR-HPV, HBV) were not significant after multiple testing correction. HERV curation may miss conserved loci filtered during pre-processing, and TV alignments required strict parameters due to high false positive rates. Some taxonomic assignments in the text may contain inconsistencies or typos, and not all reported divergent viruses have experimental validation.