The DoGA consortium expression atlas of promoters and genes in 100 canine tissues

M. Hörtenhuber, M. K. Hytönen, et al.

Researchers led by Matthias Hörtenhuber, together with a diverse team of experts, built a tissue biobank for dogs and wolves and used it to generate gene expression data that could also illuminate human diseases. The study offers an interactive atlas of canine biology and development.

Introduction

Dogs are powerful models for inherited diseases and complex traits because their unique population history and breed structure allow genetic mapping with smaller cohorts. Despite advances in canine reference genomes (e.g., CanFam3.1 and newer long-read assemblies), functional annotation remains incomplete, especially for regulatory elements such as transcription start sites (TSSs), promoters, and enhancers, which often underlie complex traits. The Dog Genome Annotation (DoGA) Consortium was formed to address this gap by building a large dog and wolf tissue biobank, applying 5′-end transcriptomic technologies (STRT2-seq, CAGE-seq) to identify and annotate promoter regions, and constructing a promoter and gene expression atlas across 100 tissues. This work aims to facilitate gene discovery in regulatory regions, improve understanding of gene regulation in dogs and wolves, and enhance translational relevance to human health.

Literature Review

The paper situates its contribution within prior canine genomics work. The widely used CanFam3.1 reference (2005; improved 2014) contained >20,000 gaps and missed many regulatory elements. Recent long-read assemblies from several breeds and the dingo have greatly improved contiguity, but functional annotation of regulatory elements has remained limited. Existing resources, such as BarkBase and EpiC-dog, provide epigenomic annotations and tissue-enriched gene sets but cover fewer tissue types and lack base-pair resolution of TSS usage. The FANTOM consortium's promoter-level atlas in mammals provides a methodological template for promoter discovery and annotation that the authors adapt to dogs. The study also uses external datasets for validation and comparison, including ATAC-seq and histone mark ChIP-seq, and the Human Protein Atlas for cross-species tissue-enrichment comparisons.

Methodology

Biobank and sampling: The DoGA Biobank contains nearly 6000 samples from up to 132 tissue types, preserved in RNAlater or snap-frozen, across 49 animals (13 dogs, 12 dog embryos, and 24 wolves, of which 11 were adults and 13 pups). Pet dogs underwent thorough clinical exams before euthanasia. Necropsy and histology ensured that only tissues without pathological changes were selected. Aliquots of approximately 1 × 1 × 1 mm were prepared for RNA; adjacent tissue slices were formalin-fixed and HE-stained for pathology review. Wolf tissues were collected postmortem from wildlife parks and from additional Russian wolves under ethics approvals. Metadata are accessible via the DoGA DCC.

RNA isolation: Tissues in RNAlater were stored at −80 °C after stabilization; snap-frozen tissues were also stored at −80 °C. RNA was extracted using QIAzol and/or RNeasy Mini kits, with protocol adjustments for tough tissues. RNA quality control included DeNovix spectrophotometry, Qubit quantification, and Bioanalyzer/TapeStation integrity measurement; RIN >7.0 was required.

STRT2-seq library preparation and sequencing: The STRT protocol was adapted for bulk RNA (40 ng input; 48 samples per library), with 8 bp UMIs and Globin-Lock to deplete globin mRNA. ERCC spike-ins were added for normalization. Reverse transcription used oligo(dT) with template switching to incorporate UMIs, followed by PCR (10 cycles), barcoding, pooling, adapter ligation, and further amplification (10–12 cycles). Libraries were QC’d by TapeStation and KAPA quantification and sequenced on Illumina NextSeq 500 (75 cycles, High Output v2).
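
The role of the ERCC spike-ins can be illustrated with a small sketch. It assumes that each sample received the same amount of spike-in RNA, so that differences in total ERCC counts reflect technical variation. The function names and example numbers are hypothetical, and the code only loosely mirrors the idea of size-factor scaling followed by a log2 transform with a pseudocount of 1; it is not the pipeline used in the study.

```python
import math

def spike_in_size_factors(ercc_totals):
    """Scale each sample's total ERCC count by the mean across samples,
    so that the size factors average to 1."""
    avg = sum(ercc_totals) / len(ercc_totals)
    return [total / avg for total in ercc_totals]

def log_normalize(counts, size_factor, pseudo=1.0):
    """Divide raw counts by the sample's size factor, then apply log2
    with a pseudocount to keep zero counts finite."""
    return [math.log2(c / size_factor + pseudo) for c in counts]

# Hypothetical total ERCC counts for three samples.
size_factors = spike_in_size_factors([500, 1000, 1500])

# Hypothetical promoter counts for the second sample (size factor 1.0).
normalized = log_normalize([0, 3, 7], size_factors[1])
```

A sample with fewer recovered spike-in reads gets a size factor below 1, so its endogenous counts are scaled up before the log transform.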

Data processing: BCL files were demultiplexed with Picard, and the resulting FASTQ files were aligned with HISAT2 to CanFam3.1 and UU_Cfam_GSD_1.0/CanFam4, each combined with the ERCC sequences. UMI-annotated BAMs were generated by merging aligned and unaligned BAMs, and lane-level BAMs were then merged. PCR duplicates were marked. QC metrics are provided in the Supplementary Data.
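
The summary does not detail how duplicates were identified, so the following is only a minimal sketch of the general idea behind UMI-aware duplicate marking: reads that share chromosome, strand, 5′ position, and UMI are assumed to be PCR copies of one molecule. The `Read` tuple and `mark_duplicates` function are hypothetical and operate on toy records rather than BAM files.

```python
from collections import namedtuple

# Hypothetical minimal read record; real pipelines work on BAM records.
Read = namedtuple("Read", ["name", "chrom", "strand", "five_prime", "umi"])

def mark_duplicates(reads):
    """Return {read name: True if the read is a duplicate}.

    The first read seen for a (chrom, strand, 5' position, UMI) key is
    kept as the original; later reads with the same key are flagged.
    """
    seen = set()
    flags = {}
    for read in reads:
        key = (read.chrom, read.strand, read.five_prime, read.umi)
        flags[read.name] = key in seen
        seen.add(key)
    return flags

reads = [
    Read("r1", "chr1", "+", 1000, "ACGTACGT"),
    Read("r2", "chr1", "+", 1000, "ACGTACGT"),  # same position and UMI as r1
    Read("r3", "chr1", "+", 1000, "TTGCAAGC"),  # same position, different UMI
    Read("r4", "chr1", "-", 1000, "ACGTACGT"),  # same UMI, opposite strand
]
flags = mark_duplicates(reads)
```

Only `r2` is flagged here, because `r3` carries a different UMI and `r4` maps to the opposite strand.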

Promoter identification and classification: 5′-end read positions were clustered with Paraclu (as in FANTOM5). Clusters overlapping the last exon ±500 bp were excluded. Gene-level expression was computed by summing cluster expression over annotated gene bodies (NCBI CanFam4 v106). Counts were normalized to ERCC using logNormCounts (SingleCellExperiment). Filters: normalized count >50 across samples; comprehensive promoters: >20 in ≥1 sample; robust promoters: >30 in ≥3 samples. Promoters within ±500 bp of annotated TSSs were classified as known; the highest-expressed site per gene was primary; additional sites alternative; clusters not within ±500 bp of annotated genes were novel.
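
The filtering and labeling rules above can be sketched as follows. The thresholds (>20 in at least one sample, >30 in at least three samples) come from the text; everything else is an assumption for illustration. In particular, each cluster is assumed to already carry a `gene` field that is set when the cluster lies within ±500 bp of an annotated gene or TSS and is `None` otherwise, and the function names are hypothetical.

```python
def support_level(norm_counts, comprehensive=(20, 1), robust=(30, 3)):
    """Classify one cluster by its normalized counts across samples.

    Each rule is (threshold, minimum number of samples above it).
    """
    def passes(threshold, min_samples):
        return sum(c > threshold for c in norm_counts) >= min_samples

    if passes(*robust):
        return "robust"
    if passes(*comprehensive):
        return "comprehensive"
    return "filtered"

def classify(clusters):
    """Label clusters as primary, alternative, or novel.

    clusters: list of dicts with keys 'id', 'gene' (None when no
    annotated gene is nearby; assumed precomputed) and 'expr'.
    """
    best = {}  # gene -> highest-expressed cluster seen so far
    for c in clusters:
        gene = c["gene"]
        if gene is not None and (gene not in best or c["expr"] > best[gene]["expr"]):
            best[gene] = c

    labels = {}
    for c in clusters:
        if c["gene"] is None:
            labels[c["id"]] = "novel"
        elif best[c["gene"]] is c:
            labels[c["id"]] = "primary"
        else:
            labels[c["id"]] = "alternative"
    return labels

clusters = [
    {"id": "p1", "gene": "GENE_A", "expr": 100.0},
    {"id": "p2", "gene": "GENE_A", "expr": 40.0},
    {"id": "p3", "gene": None, "expr": 10.0},
]
labels = classify(clusters)
```

Because any cluster that passes the robust rule also passes the comprehensive rule, the robust set is a subset of the comprehensive set in this sketch.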

Validation with external epigenomics: Overlaps were computed between extended STRT2 clusters (−1000 bp upstream, +500 bp downstream) and ATAC-seq peaks (BarkBase) and ChIP-seq peaks for H3K27ac and H3K4me3 (Son et al.), re-mapped to CanFam4 using nf-core pipelines. Overlaps were summarized by promoter class.
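
A minimal sketch of the overlap computation follows. It assumes half-open coordinates, that the extension is applied to the cluster boundaries, and that "upstream" is strand-aware (toward lower coordinates on the plus strand, higher on the minus strand). The function names and toy coordinates are hypothetical; a real analysis would more likely use an interval library or bedtools.

```python
def extend(start, end, strand, upstream=1000, downstream=500):
    """Extend a cluster 1000 bp upstream and 500 bp downstream,
    taking the strand into account."""
    if strand == "+":
        return max(0, start - upstream), end + downstream
    return max(0, start - downstream), end + upstream

def overlaps(a, b):
    """True if two half-open intervals (start, end) share at least 1 bp."""
    return a[0] < b[1] and b[0] < a[1]

def overlap_fraction(clusters, peaks):
    """Fraction of clusters whose extended window overlaps any peak.

    clusters: list of (chrom, start, end, strand)
    peaks: dict mapping chrom -> list of (start, end)
    """
    if not clusters:
        return 0.0
    hits = 0
    for chrom, start, end, strand in clusters:
        window = extend(start, end, strand)
        if any(overlaps(window, peak) for peak in peaks.get(chrom, [])):
            hits += 1
    return hits / len(clusters)

clusters = [
    ("chr1", 5000, 5020, "+"),  # window 4000-5520 reaches the peak
    ("chr1", 5000, 5020, "-"),  # window 4500-6020 misses the peak
    ("chr2", 100, 120, "+"),    # no peaks on chr2
]
peaks = {"chr1": [(4100, 4200)]}
fraction = overlap_fraction(clusters, peaks)
```

Running the same function separately on the primary, alternative, and novel subsets would give per-class overlap percentages of the kind reported in the findings.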

Tissue enrichment analysis: Tissue-enriched promoters/genes were defined when the median expression in a tissue was ≥3× higher than the mean across all other tissues (median used for robustness). Cross-species conservation used 1:1 orthologs inferred by gene symbols compared to the Human Protein Atlas.
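
The enrichment rule can be sketched in a few lines. The threshold of 3× comes from the text; the interpretation of the background as the mean of the per-tissue medians of all other tissues is an assumption, as are the function name and the example values.

```python
from statistics import mean, median

def enriched_tissues(expr_by_tissue, fold=3.0):
    """Return tissues whose median expression is at least `fold` times
    the mean of the other tissues' medians (assumed background)."""
    medians = {tissue: median(values) for tissue, values in expr_by_tissue.items()}
    enriched = []
    for tissue, med in medians.items():
        others = [m for t, m in medians.items() if t != tissue]
        if not others:
            continue  # enrichment is undefined with a single tissue
        if med > 0 and med >= fold * mean(others):
            enriched.append(tissue)
    return enriched

# Hypothetical normalized expression of one promoter, three replicates each.
expression = {
    "testis": [90, 100, 110],
    "liver": [10, 12, 8],
    "retina": [20, 18, 22],
}
result = enriched_tissues(expression)
```

Using the median within a tissue keeps a single outlier replicate from driving an enrichment call, which matches the robustness argument given above.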

RT-PCR/Sanger validation of new first exons: Five robustly expressed candidates with putative new first exons (ATP5ME, MTIF3, TNNI3, TBC1D22B, GPT) were validated by RT-PCR and Sanger sequencing in tissues with specific expression (retina, frontal cortex, testis, myocardium). Consensus cDNAs were aligned (UCSC BLAT) to CanFam4.1.

Use-case analyses:
  • OMIA gene expression profiles across tissues.
  • Conservation of tissue-enriched expression between dog and human.
  • Cardiac sub-tissue promoter usage (atria vs ventricles).
  • Overlap between robust promoters and lineage- and behavior-associated SNP clusters (lifted over to CanFam4).
  • Embryonic development (20–25 vs 30 days) promoter and gene usage dynamics.

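Resources and data availability: The DoGA DCC hosts sample and library metadata together with raw and processed data for CanFam3.1 and CanFam4. The data can also be explored through the Expression Atlas (interactive browser) and through Zenbu and UCSC track hubs, and the analysis scripts are available on GitLab. Whole-genome sequencing (WGS) was performed for six dogs and four wolves (HiSeq X, PE150), with variants processed using an established pipeline. The WGS data are deposited under BioProject PRJNA907518, and the STRT2-seq data are deposited in SRA under the same project.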

Key Findings
  • Biobank: 5926 samples (1850 unique) from 132 tissues across 13 dogs, 12 embryos, and 24 wolves; histopathology confirmed normal tissues for transcriptomics.
  • STRT2-seq expression atlas: 361 samples from 100 tissues across 13 organ systems (≥3 biological replicates for nearly all tissues; both sexes represented).
  • Promoter discovery: 106,681 comprehensive promoters identified; 46,644 robust promoters reproducible across replicates. Robust set composition (Table 1): primary 15,654; alternative 16,596; novel 14,394. Comprehensive set: primary 24,978; alternative 40,282; novel 41,421 (as reported).
  • Epigenomic validation (robust promoters; overlap percentages): ATAC-seq 55.2% overall (primary 75.0%, alternative 47.9%, novel 42.3%); H3K27ac 54.9% (primary 71.7%, alternative 53.1%, novel 12.9%); H3K4me3 35.3% (primary 59.2%, alternative 29.5%, novel 5.4%).
  • RT-PCR/Sanger validation: New first exons/TSSs were supported for ATP5ME, MTIF3, TBC1D22B, and GPT; TNNI3 showed a novel TSS upstream of exon 3; CanFam4.1 captured most of the new TSSs, an improvement over CanFam3.1.
  • Tissue enrichment: 12,505 robust promoters and 9367 genes enriched in one or more tissues. Testis had the most enriched promoters (6626) and genes (6326). By organ system, 9108 robust promoters and 4770 genes were organ-system enriched; embryo had most enriched genes (1825), CNS had most enriched robust promoters (1779).
  • Cross-species conservation: Among 1001 enriched 1:1 human orthologs, 726 (73%) showed enrichment in the same tissue in dog and human; testis had the largest shared set (473 genes).
  • Cardiac sub-tissue patterns: As in humans, NPPA enriched in atria; MYL2 enriched in ventricles; RGS3 showed chamber-specific alternative promoter usage (RGS3_p1 high in right ventricle; RGS3_p2 high in right atrium), mirroring human data.
  • OMIA disease gene context: Clear tissue enrichments observed (e.g., SIX6 mainly in retina/eye tissues). RBP4 exhibited multiple alternative promoters with differential tissue enrichment (8 enriched in liver; 3 broadly expressed including retina).
  • SNP overlaps: Of 6327 lineage-specific SNP clusters, 291 (7%) overlapped promoter regions (robust set: 1213 primary, 1256 alternative, 1053 novel promoters overlapping at least one cluster). Of 1079 behavior-associated SNP clusters, 284 (26%) overlapped promoters (257 primary, 240 alternative, 276 novel); the “Sorry when wrong” behavior category had the most overlaps; many overlapped promoters were alternative or novel without strong tissue-level enrichment.
  • Embryonic development: Enriched genes increased from 96 (20–25 days) to 454 (30 days), reflecting organogenesis onset. Stage-specific promoter/gene activity observed (e.g., AIPL1 earlier; CBX2 peaking at 30 days).
Discussion

The study delivers the first comprehensive base pair–resolution promoterome for the CanFam4 reference across 100 tissues, substantially enhancing functional annotation beyond prior canine RNA-seq and histone mark datasets. High overlap of robust promoters with promoter-associated chromatin features and successful RT-PCR validations support data quality. Tissue-level clustering and enrichment analyses demonstrate biologically coherent expression patterns, including fine-grained CNS and cardiac sub-tissue resolution. The DoGA resource enables prioritization of disease-associated genes and variants by providing tissue context and promoter usage, exemplified by OMIA gene profiles and alternative promoter usage in RBP4. Cross-species comparisons affirm dogs as relevant models for human tissue biology, with strong conservation of tissue-enriched gene expression. Overlaps between lineage/behavior SNP clusters and promoters, including many previously unannotated promoter regions, highlight candidates for functional follow-up and potential mechanistic links to breed traits and behaviors. Embryonic analyses reveal dynamic promoter usage aligning with developmental milestones. Collectively, these resources will accelerate discovery related to regulatory variants, alternative TSS usage, allele-specific expression, and comparative genomics in canids with translational relevance to humans.

Conclusion

The DoGA consortium presents a large canine biobank and an atlas of promoter and gene expression across 100 tissues, identifying ~107k promoter candidates (including ~41k novel) and a robust set of ~46.6k promoters validated by chromatin marks and targeted RT-PCR. The resource reveals extensive tissue-specific and developmental regulation, conservation with human tissue-enriched gene expression, and overlaps with lineage- and behavior-associated SNPs that can guide functional studies. Open-access tools (DCC, Expression Atlas, genome browser hubs, and analysis scripts) make the data broadly usable. Future work will expand tissues and developmental stages, enhance resolution (e.g., single-cell), refine promoter classifications (including grouping nearby novel promoters), extend regulatory annotations to elements like enhancers, and deepen dog–wolf comparative neurogenomics to study domestication and behavior.

Limitations
  • Incompleteness: The promoter catalog and expression atlas are not exhaustive; additional tissues, life stages, and deeper sequencing would uncover more, especially lowly expressed promoters.
  • Sampling constraints: Recruitment of privately owned dogs with health/behavioral issues may introduce bias; mitigated through clinical exams and exclusion criteria.
  • Lack of single-cell resolution: Limits detection of cell type–specific promoters within complex tissues.
  • Tissue gaps and RNA quality: Some tissues (e.g., parts of intestine) lacked high-quality RNA, reducing coverage; low-input and truncated transcripts can obscure true TSSs.
  • Threshold and annotation dependencies: Promoter classification (primary/alternative/novel) depends on distance thresholds and current gene models; improved models may reclassify sites. Difficulty grouping nearby novel promoters into larger loci prevented labeling them as primary/alternative.
  • Fine-grained CNS sampling: High similarity among closely related CNS subregions may reduce detection of enriched genes/promoters compared with more distinct tissues like testis.
  • Ortholog mapping: Conservation analyses relied on gene symbols rather than orthology frameworks like TOGA, potentially missing or misassigning orthologs.