Introduction
Viral infections are causally linked to approximately 15% of human cancers. Traditionally, viral expression in tumors has been studied by aligning RNA sequencing reads to databases of known viruses, which limits the discovery of divergent or novel viruses. This research addresses that limitation with an alignment-free approach, recognizing that the catalog of known viral clades, and of viruses known to infect humans, is incomplete. The rapidly evolving genomes of viruses and cancers, coupled with the short reads produced by RNA sequencing technologies, make identifying new cancer-associated viruses extremely challenging. Deep learning models that detect viral reads directly from RNA sequencing data have the potential to overcome these hurdles, which prompted the development of viRNAtap.
Literature Review
Existing methods for identifying viruses in cancer rely heavily on alignment-based tools that measure sequence similarity to known viruses. Studies using TCGA data have focused on screening for known viruses, analyzing viral integration sites, or discovering novel viral isolates through RNA sequencing. Because these methods depend on homology, divergent viruses may be missed. Deep learning has recently been applied successfully to detect bacterial viruses in metagenomic DNA sequencing data, which inspired the alignment-free method presented in this paper.
Methodology
The viRNAtap framework consists of two main components: a deep learning model and a contig assembly module. The deep learning model was trained to distinguish viral from human RNA reads using a dataset of 48 bp fragments drawn from viral and human transcripts. The model architecture incorporated a 2D convolutional layer and three fully connected layers. Performance was evaluated using AUROC, AUPRC, accuracy, recall, precision, and F1-score, and on these metrics the model outperformed existing viral identification models (DeepViFi, DeepViral, VirFinder). The pipeline takes RNA sequencing reads that remain unmapped after alignment to the human genome and the PhiX phage, scores them with the model, and assembles the high-scoring reads into candidate viral contigs. These contigs are then searched against three curated viral databases: RefSeq reference human viruses, a curated database of potentially functional HERVs, and a database of divergent viruses from various hosts. The analysis covered 7272 samples from 14 cancer types in TCGA, including both tumor and matched non-cancer samples. Genomic correlates of viral expression, including tumor mutation burden (TMB) and chromosome-level copy number alterations, were also analyzed. Kaplan-Meier curves, log-rank tests, Wilcoxon rank-sum tests, and Fisher's exact tests were used to assess associations between viral expression and overall survival, TMB, aneuploidy, and other genomic features. Geobacillus virus E2 expression was validated experimentally by qRT-PCR in ovarian cancer cell lines.
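The read-scoring step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the non-overlapping windowing, the one-hot encoding, the 0.5 threshold, the rule of keeping a read when its best fragment passes, and every function name here are assumptions made for illustration. In particular, `toy_score` is a hypothetical stand-in for the trained neural network and simply returns the G/C fraction of a fragment.

```python
from typing import Callable, Iterable, List, Tuple

FRAGMENT_LEN = 48  # fragment length stated in the summary
BASES = "ACGT"

Encoded = List[List[int]]


def fragments(read: str, size: int = FRAGMENT_LEN) -> List[str]:
    """Split a read into non-overlapping windows of `size` bases.

    Any tail shorter than `size` is dropped (an assumption of this sketch).
    """
    return [read[i:i + size] for i in range(0, len(read) - size + 1, size)]


def one_hot(fragment: str) -> Encoded:
    """Encode each base as a 4-element indicator row (A, C, G, T).

    Unknown bases such as N become an all-zero row.
    """
    return [[1 if base == b else 0 for b in BASES] for base in fragment.upper()]


def toy_score(encoded: Encoded) -> float:
    """Hypothetical placeholder for the trained model: fraction of G/C bases."""
    gc = sum(row[1] + row[2] for row in encoded)
    return gc / len(encoded)


def select_viral_candidates(
    reads: Iterable[str],
    score_fn: Callable[[Encoded], float],
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
) -> List[Tuple[str, float]]:
    """Keep reads whose best-scoring fragment reaches the threshold."""
    kept = []
    for read in reads:
        scores = [score_fn(one_hot(f)) for f in fragments(read)]
        if scores and max(scores) >= threshold:
            kept.append((read, max(scores)))
    return kept


if __name__ == "__main__":
    unmapped = ["G" * 48, "A" * 48, "ACGT" * 5]  # made-up reads
    for read, score in select_viral_candidates(unmapped, toy_score):
        print(f"{read[:12]}... score={score:.2f}")
```

In a real pipeline the reads returned by `select_viral_candidates` would be handed to the assembly step; here they are only printed.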
Key Findings
viRNAtap identified the expression of known cancer-associated viruses (e.g., HBV, HCV, HPV) with high accuracy. The pipeline also uncovered the expression of unexpected and divergent viruses not previously linked to cancer, including mosaic plant viruses in various adenocarcinomas, Ventorina (a member of the Reoviridae family), Geminivirus, Catuvirus, and Coxsackievirus. Notably, the study identified Armadillidium vulgare iridescent virus (IV31), a virus infecting arthropods, in a significant number of endometrial cancer samples, where its presence was associated with improved overall survival. The presence of IV31 was also significantly associated with mutations in PTEN, CTNNB1 and PIK3R1. The findings point to a potential association between IV31 and Trichomonas vaginalis (TV) infection, suggesting a possible indirect link between the arthropod virus and endometrial cancer through TV infection. The analysis also revealed expression patterns of HERVs across different cancer types, with several HERVs showing a significant association with poor overall survival, specifically in COAD, LUAD, LUSC, and LIHC. Furthermore, HERVs associated with poor survival tended to be expressed in the presence of somatic mutations in frequently mutated cancer driver genes (TP53, KRAS, ARID1A, and PTEN). Finally, the study validated the presence of Geobacillus virus E2 in ovarian cancer cell lines through qRT-PCR, confirming its expression in high-grade serous ovarian cancer samples.
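Associations between virus presence and gene mutations, such as those reported above, are the kind of question the Fisher's exact test named in the Methodology answers. The sketch below is a generic two-sided implementation in plain Python, not the study's analysis code. The function name is an assumption, and the counts in the usage example are invented purely for illustration; they are not data from the paper.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    Rows could be virus-positive / virus-negative samples and columns
    mutated / wild-type for one gene. The p-value sums the hypergeometric
    probabilities of all tables (with the same margins) that are no more
    likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


if __name__ == "__main__":
    # Hypothetical counts, NOT taken from the study:
    # virus-positive: 12 mutated, 8 wild-type
    # virus-negative: 30 mutated, 150 wild-type
    p = fisher_exact_two_sided(12, 8, 30, 150)
    print(f"two-sided p-value: {p:.3g}")
```

An exact test is a natural choice when one group is small, as would be the case if only a minority of samples express a given virus.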
Discussion
The alignment-free approach of viRNAtap is a significant advance in identifying viral expression in cancer, particularly for divergent viruses that traditional alignment-based methods miss. The findings expand our understanding of the tumor virome and its potential role in cancer development and progression. The discovery of unexpected viruses associated with certain cancers raises new questions about the mechanisms of carcinogenesis. The association between HERV expression and poor prognosis reinforces the importance of investigating these endogenous viruses as potential therapeutic targets. The association of IV31, a virus that typically infects arthropods, with better survival in endometrial cancer suggests complex interactions within the tumor microenvironment that warrant further investigation, and its potential link to Trichomonas vaginalis infection points to co-infections as a factor in cancer prognosis. The successful validation of Geobacillus virus E2 expression in ovarian cancer further demonstrates the sensitivity of the viRNAtap pipeline. Future research should focus on elucidating the precise roles of these newly identified viruses in cancer initiation and progression.
Conclusion
viRNAtap represents a powerful new tool for identifying viral sequences in RNA sequencing data, overcoming the limitations of alignment-based approaches. This study successfully identified both known and novel viruses in various cancer types, revealing new insights into the complex interplay between viral infections and cancer. Future research should focus on functional studies to elucidate the mechanistic roles of the identified viruses and explore their potential as diagnostic and therapeutic targets.
Limitations
The study's reliance on TCGA data, which has inherent biases, could limit the generalizability of the findings. Further validation in independent cohorts is necessary. While viRNAtap demonstrates high sensitivity, the interpretation of some findings, such as the presence of plant viruses, requires further investigation to rule out potential contamination. The association of some viruses with survival requires further functional validation to determine causality.