Veterinary Science

Divide-and-conquer: machine-learning integrates mammalian and viral traits with network features to predict virus-mammal associations

M. Wardeh, M. S. C. Blagrove, et al.

This study by Maya Wardeh, Marcus S. C. Blagrove, Kieran J. Sharkey, and Matthew Baylis uses machine learning to predict over 20,000 unknown virus-mammal associations, highlighting a significant gap in our understanding of these relationships, especially among wild mammals.

Introduction

The study addresses the limited understanding of mammalian host ranges of known viruses, a critical factor for assessing zoonotic risk and preventing spillover into human populations. Many mammalian viruses have unknown or under-described host associations, with research effort biased toward humans and domesticated species. High-profile outbreaks (e.g., SARS-CoV, MERS-CoV, SARS-CoV-2) underscore the importance of mapping potential host ranges, as cross-species transmission often involves intermediate or multiple hosts. Existing data, although biased, can be leveraged to estimate under-observed associations. The authors frame the problem as predicting unknown virus–mammal associations by integrating information on viral traits, mammalian traits, and global network topology of known associations, with the goal of quantifying underestimation in host range and identifying priority knowledge gaps.

Literature Review

Prior work has shown that host range correlates with zoonotic potential and that viral sharing patterns can be analyzed via global host–pathogen networks. Network topology and features such as viral genome type (DNA vs RNA) and host taxonomic groups (e.g., bats vs rodents) influence pathogen sharing. Motifs—small subgraphs serving as building blocks of complex networks—capture indirect interactions and have been applied in biological and ecological systems to predict missing links. Previous studies have used network and phylogeographic methods to understand disease emergence, predict missing links in host–parasite networks, and estimate mammalian viral diversity. However, integrating network motifs with detailed viral and mammalian trait data in a multi-perspective machine-learning framework has been limited. The authors build on this literature by combining motif-based network features with host and virus trait models to improve prediction of unknown associations.

Methodology

Data: Species-level virus–mammal associations were extracted from EID2 (Dec 2019), integrating evidence from GenBank sequence metadata and PubMed titles/abstracts. After curation and aggregation to species-level, 6,331 associations between 1,896 virus species and 1,436 terrestrial mammal species were retained (0.23% of 2,722,656 possible links). Evidence included both sequence/isolation/PCR (70.48%) and serology (29.52%); 22.79% of associations had both publication and sequence evidence; 33.03% sequence-only; 44.18% publication-only. A separate pipeline using sequence-evidence-only associations (55.82% of total) was also trained for sensitivity analysis.
Framework: The problem was framed as supervised link prediction using three complementary perspectives, each producing probabilistic predictions for all possible virus–mammal pairs and combined by majority voting (predicted if at least two perspectives agree).
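The voting rule in the framework can be illustrated with a minimal Python sketch. It assumes each perspective returns one probability per virus–mammal pair and applies the 0.5 per-perspective threshold stated in the ensembling description. The function name `majority_vote`, the strict `>` comparison, and the example pairs and probabilities are hypothetical, not taken from the study.

```python
from typing import Sequence


def majority_vote(probs: Sequence[float], threshold: float = 0.5,
                  min_votes: int = 2) -> bool:
    """Return True when at least `min_votes` perspectives exceed `threshold`.

    Assumption: a perspective "votes" for a link when its probability is
    strictly greater than the threshold; the source does not say how ties
    at exactly 0.5 are handled.
    """
    votes = sum(p > threshold for p in probs)
    return votes >= min_votes


# Hypothetical (mammalian, viral, network) probabilities for three pairs.
candidates = {
    ("virus_A", "mammal_X"): (0.81, 0.62, 0.30),  # two votes
    ("virus_A", "mammal_Y"): (0.20, 0.45, 0.95),  # one vote
    ("virus_B", "mammal_X"): (0.55, 0.51, 0.90),  # three votes
}

predicted = {pair: majority_vote(p) for pair, p in candidates.items()}
for pair, is_predicted in predicted.items():
    print(pair, "predicted" if is_predicted else "not predicted")
```

Requiring agreement between at least two perspectives means a single over-confident model (for example the network score of 0.95 in the second pair) cannot produce a prediction on its own.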

Mammalian perspective (local host models): For each mammal with ≥2 known viruses (n=699), models were trained to predict associations with 1,896 viruses using viral traits (Table 1), including host-driven distances (mean phylogenetic, mean ecological, maximum phylogenetic breadth), genome/capsid properties (RNA vs DNA, sense, circularity, segmentation, envelope), replication/entry/release features (GC content, genome size, replication site, release mode, cell entry mechanisms), and transmission routes. Response was binary (known association=1, otherwise 0). Class imbalance was addressed via SMOTE. Eight algorithms were trained per host: avNNet, GBM, Random Forest, XGBoost, SVM (radial, linear, polynomial with class weights), and Naive Bayes, using 10x10 CV with adaptive resampling to optimize AUC. The best-performing classifier per host (based on AUC, TSS, F1) was selected, then refit in 50 replicate runs to form a bagged ensemble; median probabilities were used; 90% CIs were computed across replicates.
Viral perspective (local virus models): For each virus with ≥2 known hosts (n=556), models predicted associations with 1,436 mammals using mammalian traits (Table 2): phylogeny (phylogenetic distance to known hosts, evolutionary distinctiveness), taxonomy/domestication, ecological traits (body mass, life-history, reproductive traits, habitat, diet), mean ecological distance (Gower-based), and geospatial features (range size, climate, land cover and agriculture diversity, biodiversity, urbanization/human population). The same algorithms, SMOTE balancing, cross-validation and tuning, model selection, and 50-replicate ensembles were used as in the mammalian perspective.
Network perspective (global model): The virus–mammal bipartite network was encoded via counts of higher-order potential motifs for each candidate association. For each virus–mammal pair, the focal link was "force-inserted" and counts of all 3-, 4-, and 5-node potential motifs involving the pair were computed within the 3-step ego networks of the virus and mammal. This produced motif count features (Fig. 4C/D). Research effort covariates (total sequences and publications per virus and mammal from EID2) were included in network models. Extreme class imbalance was addressed by repeated random under-sampling to balanced sets (1,000 positives + 1,000 unknowns) and training the same algorithm set with 10-fold CV and adaptive tuning to optimize AUC. This was repeated 100 times to produce an ensemble; median probabilities and 90% CIs were computed. The best-performing ensemble (SVM with radial kernel and class weights) by AUC, F1, and TSS was retained.
Ensembling and validation: The three perspectives’ probabilities were combined by majority voting at a 0.5 threshold per perspective. Performance was assessed on a held-out test set comprising 15% of all pairs (n=407,265; 954 positives). Models were trained on the remaining 85% (n=2,315,391; 5,377 positives). Additional validations included systematic removal of known associations (leave-one-link-out) and evaluation against an external dataset of wild mammal–virus associations from literature. Variable importance was computed via model-independent ROC-based filter methods for mammalian and viral perspectives, and via model-based relative influence for network motif features. A secondary pipeline incorporated research effort directly as predictors into mammalian and viral perspectives; results were highly concordant with the main pipeline.
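The repeated under-sampling described for the network perspective (balanced draws of 1,000 positives and 1,000 unknowns, 100 replicates, median probability with a 90% interval) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the one-feature "midpoint" model is a toy stand-in for the SVM, the data are synthetic, and every function name is hypothetical.

```python
import math
import random
import statistics


def fit_midpoint(pos, neg):
    """Toy stand-in model: decision midpoint between the two class means."""
    return (statistics.mean(pos) + statistics.mean(neg)) / 2.0


def predict_proba(midpoint, x):
    """Logistic score around the midpoint (placeholder for a real classifier)."""
    return 1.0 / (1.0 + math.exp(-(x - midpoint)))


def percentile(sorted_vals, q):
    """Linear-interpolation percentile of a sorted list, q in [0, 1]."""
    pos = q * (len(sorted_vals) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def undersampled_ensemble(positives, unknowns, x_new,
                          n_replicates=100, n_per_class=1000, seed=0):
    """Fit one model per balanced random draw; summarise the predictions."""
    rng = random.Random(seed)
    probs = []
    for _ in range(n_replicates):
        pos_sample = rng.sample(positives, min(n_per_class, len(positives)))
        unk_sample = rng.sample(unknowns, min(n_per_class, len(unknowns)))
        midpoint = fit_midpoint(pos_sample, unk_sample)
        probs.append(predict_proba(midpoint, x_new))
    probs.sort()
    # Median plus the 5th and 95th percentiles gives a 90% interval.
    return statistics.median(probs), percentile(probs, 0.05), percentile(probs, 0.95)


# Synthetic one-dimensional feature values (hypothetical, for illustration).
data_rng = random.Random(42)
positives = [data_rng.gauss(2.0, 1.0) for _ in range(5377)]
unknowns = [data_rng.gauss(0.0, 1.0) for _ in range(20000)]

median_p, lower_p, upper_p = undersampled_ensemble(positives, unknowns, x_new=1.5)
print(f"median={median_p:.3f}, 90% interval=[{lower_p:.3f}, {upper_p:.3f}]")
```

Each replicate sees a different balanced subset of the unknown pairs, so the spread of the replicate probabilities reflects how sensitive a prediction is to which unknowns happened to be sampled.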

Key Findings

Predicted unknown associations: Median 20,832 unknown virus–mammal associations overall (90% CI [2,736, 97,062]); 18,920 [2,440, 91,517] in wild or semi-domesticated mammals. Individual perspectives predicted: mammalian 41,537 [4,275, 238,971]; viral 21,352 [2,536, 95,630]; network 76,081 [27,738, 205,814]. Overall associations increased ~4.29-fold ([~1.43, ~16.33]); ~4.89-fold ([~1.5, ~19.81]) for wild/semi-domesticated mammals.
Sequence-evidence-only pipeline: Predicted 15,721 [1,603, 88,553] unknown associations overall; 13,930 [1,298, 83,043] in wild/semi-domesticated mammals.
Example (West Nile virus, WNV): Mammalian perspective predicted median 90 [17, 410] new WNV–mammal links (~2.61-fold [~1.3, ~8.32]); viral perspective 48 [0, 214] new hosts (~1.86-fold [~1, 4.82]); network perspective 721 [448, 1,317] (~13.88-fold [9, 24.52]). After voting, median 117 [15, 509] new WNV–mammal associations (~3.45-fold [~1.3, ~12.2]). For Rousettus leschenaultii, predicted 45 [5, 235] additional viruses (~1.37-fold [~1.26, ~13.37]).
Host range expansion: Average predicted mammalian host range per virus = 14.33 [4.78, 54.53] (~3.18-fold [~1.23, ~9.86]). RNA viruses: 21.65 [7.01, 82.96] hosts (~4.00-fold [~1.34, ~14.15]); DNA viruses: 7.85 [2.81, 29.47] (~2.43-fold [~1.14, ~6.89]).
Wild/semi-domesticated mammals: Predicted ~4.28-fold [~1.2, ~14.64] increase in number of virus species per species, to 16.86 [4.95, 68.5] on average; average of 13.45 [1.73, 65.04] unobserved viruses per species. Order-level fold increases per species included Ruminantia (~9.12), Primates (~7.12), Suina (~7.12), Perissodactyla (~5.74), Lagomorpha (~5.27), Rodentia (~4.79), Carnivora (~3.91), Chiroptera (~2.79) (Table 4).
Zoonotic and domestic-virus focus: Predict a 5.35-fold increase in associations between wild/semi-domesticated mammals and known zoonotic viruses (excluding rabies), and a 5.20-fold increase with viruses of economically important domestic species. Bats: 5.55-fold (3.77 more associations per species for all viruses), 7.42-fold (2.30 per species for zoonoses), 8.29-fold (2.42 per species for domestic-animal viruses). Rodents: 5.45-fold (2.69 per species overall), 6.43-fold (3.69 per species zoonoses), 7.7-fold (2.92 per species domestic-animal viruses). Notable predicted gaps include Lyssaviruses (non-rabies), Bornaviruses, and Rotaviruses.
Viral family-level host range increases (median predicted hosts [CI] and fold increase): Orthomyxoviridae 71 [15.5, 293.25] (~9.51); Bornaviridae 60.25 [15, 196.5] (~7.76); Rhabdoviridae 52.8 [23.68, 149.09] (~7.33); Hepeviridae 70.67 [25.33, 220] (~6.67); Filoviridae 48.5 [12.45, 161.65] (~5.71); Togaviridae 31.75 [7, 155.62] (~5.77); Flaviviridae 40.59 [11.26, 131.77] (~5.09); Coronaviridae 22.86 [6.23, 94.89] (~4.81); Paramyxoviridae 23.22 [7.76, 88.76] (~4.77); Poxviridae 32.5 [9.39, 111.21] (~4.56); Reoviridae 26.28 [9.39, 98.79] (~4.46).
Variable importance: Mammalian perspective—mean phylogenetic distance to known hosts (median 95.4% [75.6%, 100%]) and mean ecological distance (90.90% [43.50%, 100%]) were top predictors; maximum phylogenetic breadth also influential (74.70% [16.60%, 100%]). Viral perspective—mean phylogenetic distance (all viruses median 98.75% [93.01%, 100%]) and mean ecological distance (94.39% [71.86%, 100%]) were top; life-history (longevity, body mass, reproductive traits) added significant signal. Network perspective—most influential motifs: M4.1 (median 100% [90.19%, 100%]), M5.1 (97.84% [89.19%, 99.93%]), M5.7 (97.22% [87.7%, 98.77%]), M4.6 (96.75% [86.13%, 100%]); 5-node motifs had higher median influence than 3- or 4-node motifs. Research effort covariates were also important (viruses 90.26% [82.94%, 95.36%]; mammals 88.42% [78.38%, 94.87%]).
Performance: On held-out test set, ensemble voting achieved AUC 0.938 [0.862–0.959], F1-score 0.284 [0.464–0.124], TSS 0.876 [0.724–0.918] without research effort in local models; with research effort, AUC 0.920 [0.823–0.944], F1 0.272 [0.526–0.093], TSS 0.840 [0.646–0.888]. Voting outperformed any individual perspective. Systematic removal test recovered ~90.70% (virus-removed) and 89.92% (mammal-removed) of links. External validation: predicted 84.02% [77.69%, 89.60%] of external associations with detection quality >0 (77.82% [68.46%, 86.51%] any quality); similar with research effort included.
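A high TSS can coexist with a low F1-score when positives are rare, as in the held-out set (954 positives among 407,265 pairs). The sketch below uses the standard definitions, TSS = sensitivity + specificity - 1 and F1 = harmonic mean of precision and recall. The confusion-matrix counts are hypothetical and chosen only to match the test-set size; they are not the study's results. Note also the assumption that every unknown pair counts as a negative, so predictions of real but undocumented associations are scored as false positives.

```python
def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute sensitivity, specificity, precision, TSS and F1 from counts."""
    sensitivity = tp / (tp + fn)   # recall on known associations
    specificity = tn / (tn + fp)   # fraction of unknown pairs left unpredicted
    precision = tp / (tp + fp)
    tss = sensitivity + specificity - 1.0
    f1 = 2.0 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "tss": tss,
        "f1": f1,
    }


# Hypothetical counts: 954 positives and 406,311 unknown pairs in total.
metrics = confusion_metrics(tp=850, fp=4000, tn=402311, fn=104)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these illustrative counts, roughly 1% of unknown pairs are predicted as associations, which barely lowers specificity but outnumbers the true positives several times over, so precision and F1 stay low while TSS remains high.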

Discussion

The multi-perspective framework addresses the core question of how under-described current virus–mammal association data are and which hosts are likely susceptible to known viruses. By independently modeling associations from mammal, virus, and network viewpoints and combining via voting, the approach improves precision-recall (F1) and robustness over single-perspective models. Findings indicate substantial underestimation of associations—particularly in wild and semi-domesticated mammals—expanding predicted host ranges of many key viral families (e.g., Orthomyxoviridae, Rhabdoviridae, Bornaviridae, Flaviviridae) and highlighting major gaps for zoonotically relevant viruses (Lyssaviruses excluding rabies, Bornaviruses, Rotaviruses). The framework pinpoints taxa (bats, rodents, wild ruminants) where surveillance can be prioritized and suggests specific host–virus pairs for targeted sampling. Variable-importance analyses confirm the central role of phylogenetic and ecological proximity to known hosts, along with life-history and network motif structure, in shaping host susceptibility, providing biological interpretability. Comparisons with external datasets and link-removal tests support predictive validity. Overall, the results help quantify and locate knowledge gaps to inform surveillance strategies, risk assessments for zoonotic spillover, and disease management at the wildlife–livestock–human interface.

Conclusion

The study introduces a divide-and-conquer, machine-learning framework integrating mammalian traits, viral traits, and network motif features to predict unknown associations between known viruses and mammalian hosts. The approach predicts over 20,000 previously unrecognized associations and a ~3–4-fold expansion of average viral host ranges, revealing pronounced under-description in wild and semi-domesticated mammals and in key zoonotic viral groups. The ensemble voting method improves performance over single-perspective models and enables species- and virus-level prioritization for surveillance. Future work should incorporate additional viral genetic features and geographic distributions, explicit receptor compatibility data as a fourth perspective, vectors and intermediate hosts, and avian species to better capture ecological transmission pathways. Expanding and standardizing negative (unsusceptible) data and addressing research-effort biases will further enhance inference and generalizability.

Limitations

Research effort bias: Predictions are influenced by uneven study intensity across species and viruses; heavily researched viruses and mammals tend to yield more predicted associations. Including research effort mitigates but does not eliminate this bias.
Lack of negative (unsusceptible) labels: Absence of standardized non-association data limits the ability to disentangle true biological unsuitability from under-sampling.
Scope limited to known viruses (known-unknowns): The framework cannot predict associations for completely novel, previously unobserved viruses (unknown-unknowns) due to missing viral and network features.
Taxonomic and data coverage: Birds and other non-mammalian hosts were not integrated; limited avian data restricts capturing cross-taxa transmission pathways important for certain viruses (e.g., flaviviruses, influenza).
Evidence variability: Association evidence types (serology vs sequence/isolation) vary in strength and specificity across clades; while a sequence-only pipeline was evaluated, residual heterogeneity may affect estimates.

