Understanding and modeling human traits and diseases: Insights from the comparative genomics resources of Zoonomia

M. Ye and D. Zhang

This commentary by Maosen Ye and Deng-Feng Zhang examines the genetic underpinnings of human traits and diseases using data from the Zoonomia Project. Drawing on analyses of evolutionary constraint across 240 mammalian genomes, it highlights the roles of conserved coding and noncoding regions and discusses how these resources can guide the development of better animal models for studying human diseases.

Introduction

The paper addresses the challenge of translating statistical associations from human genetic studies, particularly GWAS of complex traits and diseases, into biological mechanisms and clinical strategies. It posits that evolutionary conservation (constraint) across species is indicative of functional importance, and thus comparative genomics can prioritize and interpret disease-relevant variants. Leveraging the Zoonomia Project’s whole-genome comparisons across 240 placental mammals, the authors outline how conservation at single bases, genes, and noncoding regulatory elements can illuminate genetic architecture and guide the selection and design of animal models to study human traits and diseases.

Literature Review

The article synthesizes findings from a special issue of Science reporting Zoonomia Project resources and analyses. Sullivan et al. quantified single-base constraint across 240 mammals, showing that 3.3% of human genome bases are constrained, with most of these bases lying in noncoding regions. Constraint scores correlate negatively with human allele counts and are higher for pathogenic than for benign ClinVar variants, which helps prioritize causal variants; constrained bases also cluster, implying constraint at the level of genes and elements. Kirilenko et al. introduced TOGA, a machine-learning-based ortholog and coding annotation pipeline applied to 488 placental mammal genome assemblies, enabling large-scale comparative gene annotation, although it focuses on conserved coding genes. Andrews et al. examined conservation of ENCODE-defined cis-regulatory elements (CREs) and transcription factor binding sites (TFBSs), identifying highly constrained subsets presumed to govern core biological processes, and found primate-specific regulatory elements tied to environmental interactions; nearly all primate-specific TFBSs overlap transposable elements. They also found that variants in constrained noncoding regions explain a larger share of trait heritability than variants in less constrained regions. Kaplow et al. developed TACIT, a phylogeny-aware machine learning toolkit that predicts tissue-specific enhancer activity across species and identifies candidate enhancers linked to phenotypes (e.g., brain size), demonstrating utility for mapping enhancer variation to complex traits.

Methodology

This commentary integrates methodologies from recent comparative genomics studies: (1) multi-species whole-genome alignment across 240 placental mammals to compute base-pair evolutionary constraint scores and assess conservation at single-base resolution; (2) orthology inference and large-scale coding gene annotation using TOGA, a machine-learning classifier applied to hundreds of mammalian assemblies; (3) conservation analyses of noncoding regulatory elements leveraging ENCODE-defined CREs/TFBSs and comparative constraint metrics; (4) machine learning approaches (TACIT) that, trained on tissue- or cell-type-specific enhancers across species and informed by phylogeny, predict enhancer activity genome-wide in any species and relate enhancer variation to phenotypes; and (5) integration with human datasets (ClinVar and GWAS) to connect conservation with pathogenicity, allele frequency, and trait heritability. The paper discusses how these tools and resources can be operationalized to choose appropriate species and targets for modeling coding versus noncoding disease variants in animals.
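Step (5), connecting constraint to pathogenicity, can be illustrated with a small sketch. It estimates how well constraint scores separate pathogenic from benign variants by computing the probability that a randomly chosen pathogenic variant scores higher than a randomly chosen benign one. The function name `separation_auc` and the toy scores are assumptions made for illustration; they are not taken from the paper or from ClinVar, and the studies summarized here may have used a different statistic.

```python
from itertools import product


def separation_auc(pathogenic_scores, benign_scores):
    """Probability that a random pathogenic variant has a higher constraint
    score than a random benign variant (ties count as one half).

    A value near 0.5 means no separation; a value near 1.0 means pathogenic
    variants almost always sit at more constrained positions.
    """
    if not pathogenic_scores or not benign_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p, b in product(pathogenic_scores, benign_scores):
        if p > b:
            wins += 1.0
        elif p == b:
            wins += 0.5
    return wins / (len(pathogenic_scores) * len(benign_scores))


# Invented toy constraint scores, for illustration only (not real data).
toy_pathogenic = [6.1, 4.8, 7.3, 2.9]
toy_benign = [0.4, -1.2, 3.1, 1.0]

auc = separation_auc(toy_pathogenic, toy_benign)
print(f"separation AUC on toy data: {auc:.4f}")
```

The pairwise loop is quadratic in the number of variants, which is fine for a sketch but would likely be replaced by a rank-based computation on real variant sets.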

Key Findings

Only 3.3% of bases in the human genome are evolutionarily constrained; 80.7% of these constrained bases are noncoding and 19.3% are within coding sequences (Sullivan et al.).
Common variants are depleted in constrained bases; constraint scores are negatively correlated with allele counts in human populations.
Pathogenic variants have higher constraint scores than benign variants (ClinVar), supporting the use of constraint to prioritize causal variants; constrained bases show clustering consistent with element- or gene-level constraint.
Using only primate species reduces resolution to 10–100 bp, whereas hundreds of mammalian genomes enable base-pair specificity.
57.6% of coding sequence bases are constrained, validating coding gene annotations.
TOGA enables scalable orthology-informed coding gene annotation across 488 placental mammal assemblies.
Among noncoding regulatory elements, 439,000 human CREs (47.5% of all ENCODE CREs; ~4% of the human genome) and ~2 million TFBSs (~0.8% of the human genome) are highly constrained across mammals (Andrews et al.).
Primate-specific CREs constitute ~10% of human CREs and are linked to environment-interaction genes; ~20% of TFBSs are primate specific, and nearly all primate-specific TFBSs overlap transposable elements.
Variants in constrained noncoding regions explain a larger proportion of heritability of human traits compared with variants in less constrained regions (Andrews et al.).
TACIT predicts tissue-specific enhancer activity across species and can identify candidate enhancers associated with phenotypes such as brain size (Kaplow et al.).
Comparative genomics resources and ML tools collectively aid in interpreting GWAS signals, especially in noncoding regions, and in selecting species/targets for disease modeling.
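The percentages reported in the findings above can be combined into a quick back-of-envelope consistency check. The sketch below uses only figures stated in this summary; the derived quantities (the share of the genome that is constrained and noncoding, the implied coding fraction of the genome, and the implied total number of ENCODE CREs) are computed here and are not figures reported by the authors.

```python
# Figures as stated in the summary.
CONSTRAINED_PCT_OF_GENOME = 3.3      # % of human genome bases under constraint
NONCODING_SHARE_PCT = 80.7           # % of constrained bases that are noncoding
CODING_SHARE_PCT = 19.3              # % of constrained bases that are coding
CODING_BASES_CONSTRAINED_PCT = 57.6  # % of coding bases that are constrained
CONSTRAINED_CRES = 439_000           # highly constrained human CREs
CONSTRAINED_CRE_SHARE_PCT = 47.5     # % of all ENCODE CREs they represent

# Derived quantities (back-of-envelope, not reported in the source).
constrained_noncoding_pct = CONSTRAINED_PCT_OF_GENOME * NONCODING_SHARE_PCT / 100
constrained_coding_pct = CONSTRAINED_PCT_OF_GENOME * CODING_SHARE_PCT / 100

# If constrained coding bases are 57.6% of all coding bases, the coding
# portion of the genome implied by these numbers is:
implied_coding_pct_of_genome = (
    constrained_coding_pct / CODING_BASES_CONSTRAINED_PCT * 100
)

# If 439,000 CREs are 47.5% of the ENCODE set, the implied set size is:
implied_total_cres = CONSTRAINED_CRES / (CONSTRAINED_CRE_SHARE_PCT / 100)

print(f"constrained noncoding: {constrained_noncoding_pct:.2f}% of genome")
print(f"constrained coding:    {constrained_coding_pct:.2f}% of genome")
print(f"implied coding share:  {implied_coding_pct_of_genome:.2f}% of genome")
print(f"implied ENCODE CREs:   {implied_total_cres:,.0f}")
```

The implied coding share comes out at roughly 1% of the genome, which is in line with the commonly cited size of the protein-coding portion, so the reported percentages are mutually consistent.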

Discussion

The synthesis argues that evolutionary constraint provides a principled filter to bridge statistical associations to mechanism by highlighting genomic positions and elements likely to be functionally important. Single-base constraint informs variant pathogenicity and prioritization, while conservation of CREs/TFBSs delineates regulatory programs under purifying selection that contribute disproportionately to trait heritability. Machine learning approaches such as TOGA and TACIT extend these insights by scaling orthology-aware annotation and by predicting tissue-specific regulatory activity across species, enabling cross-species functional inference. For disease modeling, constrained coding variants can be directly engineered (e.g., knock-in models) in suitable species. For noncoding variants, where sequence divergence and context-dependence are greater, conservation analyses and activity prediction are essential to select the right species and the correct CREs to perturb. Integrating base-level constraint, element-level conservation, 3D chromatin context, and transcriptional regulation supports moving beyond traditional transgenic/knockout paradigms toward “trans-element” and “trans-epigenic” models that more faithfully recapitulate human regulatory variation.
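The prioritization logic described in this discussion can be sketched as a simple filter-and-rank step over the variants at an associated locus. Everything in the sketch is hypothetical: the `Variant` fields, the score threshold of 2.0, the variant identifiers, and the wording of the suggested strategies are illustrative assumptions that paraphrase the discussion, not a procedure defined by the authors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    variant_id: str          # hypothetical identifier
    constraint_score: float  # per-base constraint score (higher = more constrained)
    is_coding: bool          # True if the variant falls in coding sequence


def prioritize(variants, min_score):
    """Keep variants at or above the threshold, most constrained first."""
    kept = [v for v in variants if v.constraint_score >= min_score]
    return sorted(kept, key=lambda v: v.constraint_score, reverse=True)


def suggest_model_strategy(variant):
    """Paraphrase of the modeling guidance in the discussion (an assumption)."""
    if variant.is_coding:
        return "knock-in of the coding variant in a suitable species"
    return ("check CRE conservation and predicted activity to choose "
            "the species and regulatory element to perturb")


# Invented toy locus, for illustration only.
locus = [
    Variant("var_A", 5.2, True),
    Variant("var_B", 0.3, False),
    Variant("var_C", 3.9, False),
    Variant("var_D", 1.1, True),
]

ranked = prioritize(locus, min_score=2.0)  # threshold is a hypothetical choice
for v in ranked:
    print(v.variant_id, v.constraint_score, "->", suggest_model_strategy(v))
```

In practice the threshold would be chosen from the score distribution or a false discovery rate rather than fixed by hand, and noncoding candidates would need the element-level and activity evidence the discussion calls for.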

Conclusion

Comparative genomics across hundreds of mammals, exemplified by Zoonomia, yields high-resolution maps of evolutionary constraint that illuminate both coding and noncoding genome function. Coupled with tools like TOGA and TACIT and with functional genomics datasets (ENCODE, ClinVar, GWAS), these resources enhance variant interpretation, prioritize disease mechanisms, and guide the rational design of animal models for human traits and diseases. Future work should expand comparative functional genomic resources, especially for noncoding regions, and integrate multi-layer regulatory information (co-regulation, gene–gene and gene–environment interactions, 3D genome) to enable precise “trans-element” and “trans-epigenic” modeling and targeted interventions across diverse species.

Limitations

The commentary highlights several limitations: (1) TOGA currently focuses on conserved coding gene annotation, leaving noncoding annotations less complete; (2) noncoding elements show lower sequence conservation and complex, context-specific regulation, complicating cross-species functional inference; (3) resolution of constraint diminishes when limited to closer clades (e.g., primates), potentially reducing power to pinpoint causal bases; (4) existing animal models for noncoding variation are limited in scope (often rodents and select noncoding RNAs), underscoring challenges in selecting appropriate species and regulatory targets; and (5) as a synthesis article, it does not present new experimental validation, and the translational utility depends on continued development of comparative resources and high-throughput assays.
