MArVD2: a machine learning enhanced tool to discriminate between archaeal and bacterial viruses in viral datasets

D. Vik, B. Bolduc, et al.

MArVD2 is a machine learning tool, developed by researchers including Dean Vik and Benjamin Bolduc, that discriminates archaeal viruses from bacterial viruses in viral datasets. At the recommended probability threshold of 0.80, it correctly classified 85% of verified archaeal viruses in benchmarking, with only two false positives among verified phage.

Introduction
Archaea play critical roles in global biogeochemical cycles across diverse environments. In the mesopelagic ocean, Nitrososphaeria are key ammonia oxidizers that contribute to nitrogen cycling and greenhouse gas emissions; in wetlands and permafrost, methanogenic Euryarchaeota drive methane production. As oceanic low-oxygen zones expand and permafrost thaws, understanding archaeal ecology and archaeal viruses becomes increasingly important for ecological assessments and climate predictions. While bacterial viruses (phages) have been extensively cataloged through metagenomics and analytical pipelines, archaeal viruses remain underrepresented, particularly outside extreme environments. Fewer than ~380 archaeal viruses have well-documented genomes or large genome fragments, with additional putative archaeal viruses in IMG/VR, compared to hundreds of thousands of phage population genomes. A major barrier is the lack of robust, scalable methods to systematically distinguish archaeal viruses from phages in large datasets. This study addresses that gap by developing MArVD2, a machine learning-enhanced tool that improves archaeal virus identification across environments. It builds on curated references and uses genome features and homology signals to deliver high-throughput, accurate classification.
Methodology
Resources and curation: The authors developed OcAVdb, a curated database of marine archaeal viruses assembled from literature through 2019. Only sequences >10 kb that clustered exclusively with archaeal viruses in gene-sharing networks (vConTACT2) and contained archaeal/archaeal-virus-like ORFs (DRAMv annotations) were retained. Training data comprised 857 viral sequences (>10 kb; roughly balanced archaeal viruses and phages) from RefSeq v85, VirSorter curated databases, and a marine virome from the Eastern Tropical South Pacific (ETSP), covering hot springs, hypersaline ponds, and oceans. Benchmarking data included archaeal viruses and randomly selected phages from IMG/VR v2.0 plus 25 additional putative marine archaeal viruses from Tara Oceans GOV2.0 mesopelagic samples. All benchmarking sequences were >10 kb and were verified as viral (VirSorter) and curated using vConTACT2 and DRAMv. Manual verification identified “verified archaeal viruses” and “verified phage” by clustering with references; others were categorized as putative or singletons.
Feature extraction and model development: MArVD2 computes a feature table with 27 genomic and annotation-derived features per input contig. ORFs are predicted with Prodigal (-p meta), and annotations are integrated via: (i) MMseqs2 searches against viral sequences in NCBI nr, (ii) hmmsearch against pVOGs, and (iii) iterative jackhmmer against OcAVdb. Stringent thresholds are applied (MMseqs2 e-value <1e-5; hmmsearch full-length e-value <1e-10; jackhmmer e-value <1e-5). Genomic features include gene length, gene density, and strand bias among others. Co-correlation filtering removed features with Pearson r>0.95. Using scikit-learn, a random forest classifier was trained with a 70:30 split (training:out-of-bag), fivefold cross-validation, and recursive feature elimination to identify the minimal informative feature subset; feature importances were evaluated via Gini importance. During development, F1 plateaued at 0.98 using only eight top features, though all 27 contributed to optimal performance. Proximity matrix-based clustering highlighted 19 discordant training sequences (10 archaeal viruses, 9 phage), often with few OcAVdb hits and including pTN2-like Thermococcales plasmids, indicating underrepresented reference space.
Benchmarking and sensitivity analyses: On an independent benchmarking dataset (verified archaeal viruses and phage from IMG/VR and GOV2.0), performance metrics were computed (TPR, SPEC, ACC, MCC, FDR), along with AUROC and AUPRC. Threshold calibration used prediction probabilities from the RF (fraction of trees voting archaeal). Additional tests evaluated robustness to dataset size (5–75% of sequences), contig length (1, 2.5, 5, 7.5, 10, >10 kb), and inclusion of non-viral microbial fragments (10–75% of data; 10–200 kb fragments from IMG/M, verified non-viral by VirSorter). Comparisons to the original MArVD (re-implemented using VirSorter outputs, MetaGeneAnnotator, BLASTP against RefSeq v77, and Pfam annotations) were performed on the same datasets to contextualize improvements.
Implementation and availability: MArVD2 builds and saves the trained random forest model and feature table for reuse; re-running with new data generates per-sequence predictions and probabilities. All data (OcAVdb, training, benchmarking), model files, and code are publicly available (CyVerse, Zenodo, Bitbucket, Bioconda).
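The co-correlation filtering step described above (removing features with Pearson r>0.95) can be illustrated with a minimal sketch. This is not MArVD2's code: the feature names, toy values, and the rule of keeping the first feature of a correlated pair are all assumptions made for illustration, and the summary does not say which member of a pair the authors dropped.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant feature has no defined correlation
    return cov / sqrt(var_x * var_y)

def drop_correlated(features, cutoff=0.95):
    """Keep a feature only if its r with every already-kept feature is <= cutoff.

    Assumption: features are visited in insertion order, so the first
    member of a highly correlated pair is the one that survives.
    """
    kept = []
    for name, values in features.items():
        if all(pearson_r(values, features[k]) <= cutoff for k in kept):
            kept.append(name)
    return kept

# Hypothetical per-contig feature columns (one value per contig).
toy_features = {
    "gene_density": [0.9, 1.1, 1.0, 1.3, 0.8],
    "genes_per_10kb": [9.0, 11.0, 10.0, 13.0, 8.0],  # a rescaled copy of gene_density
    "strand_bias": [0.2, 0.9, 0.5, 0.1, 0.7],
}

kept_features = drop_correlated(toy_features)
print(kept_features)  # ['gene_density', 'strand_bias']
```

The sketch follows the text literally and filters on r > 0.95. Filtering on the absolute value of r, so that strongly anti-correlated features are also removed, is a common variant that the summary neither confirms nor rules out.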
Key Findings
- Training performance: F1 score plateaued at 0.98 using eight most important features (of 27 total), indicating a compact, informative feature set. Only 19/857 training sequences showed discordance in proximity clustering; many had few OcAVdb hits or represented plasmid-like elements (e.g., pTN2-like Thermococcales plasmids).
- Benchmarking accuracy: Among 221 verified archaeal viruses, MArVD2 correctly classified 212; it misclassified 18 of 582 verified phages as archaeal. Additionally, 47 putative archaeal viruses were correctly classified. Overall metrics: TPR 0.96, ACC 0.97, SPEC 0.97, MCC 0.92, FDR 0.08.
- Comparison to original MArVD: On the same benchmarking data, original MArVD had TPR 0.98, ACC 0.92, SPEC 0.90, MCC 0.79, FDR 0.27; MArVD2 greatly reduced false positives and improved overall accuracy and MCC.
- Threshold calibration: Average prediction probability for verified archaeal viruses was 0.87; with threshold >0.87, 71% of verified archaeal viruses were recovered with only one false positive. At 0.80 threshold, 85% of verified archaeal viruses (n=188) were identified with only two false positives among verified phage. FPR remained <2% until threshold <0.55; at 0.55, 95% of archaeal viruses (n=210) were recovered with 13 verified phage and 20 putative phage false positives.
- ROC/PR performance: AUROC 0.99 and AUPRC 0.99, with precision remaining ≥98% until sensitivity exceeded 80%, indicating strong performance even for unbalanced datasets dominated by phage.
- Effect of contig length: Performance (TPR, ACC, MCC, AUROC, AUPRC) exceeded 90% only for contigs >10 kb. Specificity remained high across all sizes; FDR stayed <15% across fragment sizes.
- Effect of microbial contamination: Inclusion of microbial fragments increased FDR and reduced MCC (e.g., up to 53% FDR at 75% microbial contamination using a permissive threshold). Using the recommended 0.80 threshold reduced FDR to 16% in the 75% microbial dataset; false positives above this threshold were archaeal-derived (metagenomic) sequences.
- Practical guidance: For optimal performance, use viral datasets pre-filtered by virus identification tools, contigs ≥10 kb, and an archaeal-virus probability threshold of 0.80.
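As a worked check of the benchmarking figures, the sketch below recomputes the overall metrics from the counts given above (212 of 221 verified archaeal viruses correct, 18 of 582 verified phages misclassified). It assumes the reported metrics were computed over verified sequences only, with archaeal virus as the positive class; the function name is hypothetical and the formulas are the standard confusion-matrix definitions.

```python
from math import sqrt

def confusion_metrics(tp, fn, fp, tn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    return {
        "TPR": tp / (tp + fn),                    # sensitivity / recall
        "SPEC": tn / (tn + fp),                   # specificity
        "ACC": (tp + tn) / (tp + fn + fp + tn),   # accuracy
        "FDR": fp / (tp + fp),                    # false discovery rate
        "MCC": (tp * tn - fp * fn)
        / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }

# Counts taken from the benchmarking bullet above.
metrics = confusion_metrics(tp=212, fn=221 - 212, fp=18, tn=582 - 18)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
# TPR: 0.96
# SPEC: 0.97
# ACC: 0.97
# FDR: 0.08
# MCC: 0.92
```

Under that assumption the counts reproduce the reported TPR, SPEC, ACC, FDR, and MCC to two decimal places.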
Discussion
The study addresses the critical need for scalable discrimination between archaeal viruses and bacteriophages, a bottleneck that has limited ecological and evolutionary insights into archaeal viromes outside extreme environments. By integrating curated references (OcAVdb), genome-derived features, and multi-database homology signals into a random forest model, MArVD2 achieves high precision and robust overall performance across marine, hypersaline, and hot spring environments. Benchmarking demonstrates high AUROC/AUPRC and low false discovery rates at practical probability thresholds, indicating reliability even in datasets with many more phage than archaeal viruses. Threshold calibration enables users to balance sensitivity and precision; a recommended 0.80 probability threshold recovers ~85% of verified archaeal viruses with very few false positives. Sensitivity analyses reveal that performance depends on contig length and sample purity: specificity remains high even on short contigs, but overall accuracy and MCC drop for contigs shorter than 10 kb and with increasing microbial contamination. These findings guide practical usage: pre-filtering for viral contigs and prioritizing sequences ≥10 kb minimizes false positives. The model also highlighted underrepresented reference space (e.g., Thermococcales-associated elements), underscoring the importance of continued database expansion and the flexibility of MArVD2 to incorporate user-defined training data. Collectively, MArVD2 substantially improves archaeal virus detection and classification, enabling more accurate mapping of archaeal virus–host interactomes and their ecological roles.
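The sensitivity/precision trade-off behind threshold calibration can be sketched as a simple sweep over probability cutoffs. The probabilities and labels below are invented toy values, not MArVD2 output, and treating a contig as archaeal when its probability is greater than or equal to the cutoff is an assumption (the summary does not state whether the comparison is strict).

```python
def sweep_thresholds(predictions, thresholds):
    """For each cutoff, return (fraction of true archaeal viruses recovered,
    number of non-archaeal sequences called archaeal).

    predictions: list of (archaeal_probability, is_archaeal_virus) pairs.
    """
    n_archaeal = sum(1 for _, is_archaeal in predictions if is_archaeal)
    results = {}
    for cutoff in thresholds:
        called = [(p, label) for p, label in predictions if p >= cutoff]
        true_pos = sum(1 for _, label in called if label)
        false_pos = len(called) - true_pos
        results[cutoff] = (true_pos / n_archaeal, false_pos)
    return results

# Hypothetical per-contig probabilities (fraction of trees voting archaeal).
toy_predictions = [
    (0.97, True), (0.92, True), (0.85, True), (0.81, True), (0.60, True),
    (0.83, False), (0.57, False), (0.30, False), (0.12, False), (0.05, False),
]

results = sweep_thresholds(toy_predictions, [0.55, 0.80, 0.87])
for cutoff, (recall, false_pos) in results.items():
    print(f"cutoff {cutoff:.2f}: recall={recall:.2f}, false_positives={false_pos}")
```

On these toy values, raising the cutoff lowers both the recall and the false-positive count, which is the same qualitative pattern the benchmarking reports between the 0.55, 0.80, and 0.87 thresholds.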
Conclusion
MArVD2, a machine-learning enhanced tool leveraging curated archaeal virus references and multi-source genome feature profiles, substantially improves the scalable discrimination of archaeal viruses from bacteriophages. It delivers high accuracy and precision on independent benchmarks across marine, hypersaline, and hot spring environments, with optimal performance at a probability threshold of 0.80 and for contigs ≥10 kb. The tool’s modular design, public availability, and support for user-defined training sets position it to improve archaeal virus discovery as references expand. This capability will facilitate higher-resolution analyses of archaeal virus ecology and their contributions to Earth system processes. Future work should expand reference coverage (including non-marine and underrepresented archaeal hosts), improve performance on shorter contigs, and refine strategies to reduce false positives in datasets with cellular contamination.
Limitations
- Performance declines on short contigs (<10 kb): TPR, ACC, MCC, AUROC, and AUPRC drop below 90% for shorter fragments, although specificity remains high.
- Sensitivity to microbial contamination: Increasing proportions of non-viral microbial fragments elevate FDR and reduce MCC; even with threshold tuning, contaminated datasets can yield non-trivial false positives.
- Reference coverage gaps: Misclassifications/proximity outliers were enriched for sequences with few OcAVdb hits (e.g., Thermococcales plasmid-like elements), indicating that incomplete reference space limits performance for some archaeal lineages.
- Environment coverage: Training and benchmarks emphasized marine, hypersaline, and hot spring environments; performance in other habitats may vary until additional archaeal virus references are incorporated.
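One general reason that false discovery rate rises as non-target sequences make up more of a dataset is arithmetic: even at a fixed sensitivity and specificity, a smaller fraction of true positives means that false positives make up a larger share of all positive calls. The sketch below illustrates only that effect. It assumes TPR and specificity stay constant as the mix changes, which the study does not claim; microbial fragments may well be misclassified at a different rate than phage.

```python
def expected_fdr(tpr, spec, positive_fraction):
    """Expected FDR for a classifier with fixed TPR and specificity
    applied to a dataset in which `positive_fraction` of sequences
    are true archaeal viruses."""
    false_calls = (1 - spec) * (1 - positive_fraction)
    true_calls = tpr * positive_fraction
    return false_calls / (false_calls + true_calls)

# Rates taken from the benchmarking counts (212/221 and 564/582).
tpr, spec = 212 / 221, 564 / 582

# At the benchmark's own class balance this recovers the observed 18/230.
benchmark_fdr = expected_fdr(tpr, spec, 221 / 803)
print(f"benchmark mix: FDR={benchmark_fdr:.3f}")

# Hypothetical mixes with progressively fewer archaeal viruses.
for fraction in (0.10, 0.05, 0.01):
    print(f"{fraction:.0%} archaeal: FDR={expected_fdr(tpr, spec, fraction):.3f}")
```

This is why pre-filtering inputs with a virus identification tool and applying a stricter probability threshold both help: the first raises the fraction of true targets, and the second raises effective specificity.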