Biology
Universal prediction of vertebrate species age at maturity
A. M. Budd, S. Y. Yong, et al.
The study addresses the need for a universal, rapid, and generalisable method to estimate vertebrate species’ age at maturity, a key life-history trait closely linked to extinction risk, generation time, and population growth. Traditional methods are species-specific, indirect, or impractical for many taxa (e.g., long-lived or cryptic species). The authors hypothesise that CpG density in gene promoters—reflecting regulatory architecture and epigenetic states—can reliably predict species-level age at maturity across vertebrates. They aim to develop and validate an all-vertebrate model and group-specific models (fish, mammals, reptiles) leveraging genome sequences to enable broad conservation applications, including triage for data-deficient species.
Prior work showed promoter CpG density is associated with lifespan across vertebrates and can predict age at maturity in mammals. Vertebrate promoters exhibit distinctive CpG distributions predictive of epigenetic regulation (DNA methylation, histone marks) and gene expression states, providing a mechanistic basis for life-history associations. Alternative approaches to infer age at maturity include physiological predictors in fishes, mark–recapture and genetic methods in turtles, and hormonal assays in whales, but these are taxon-specific and resource-intensive. DNA methylation emerged as a strong predictor of age at maturity in mammals (explaining ~72% variance), but promoter CpG-based methods promise broader applicability across taxa due to data availability and conserved promoter architecture.
Data collection and curation:
- Compiled vertebrate age-at-maturity records from AnAge, FishBase, Amniote Life History Database, PanTHERIA, Animal Diversity Web, and primary literature. Removed duplicate entries across databases. Detected and removed within-species outliers using the 1.5× IQR rule. Calculated a single species-level value as the mean of the remaining reports; sexes were combined because sex-specific genomic data were limited and reported values were often identical between sexes.
- Initially identified 1379 species with both age-at-maturity and genome data (NCBI genomes). Removed three extreme outlier species (z-score > 8) based on known age at maturity. Excluded 17 species with no promoter BLAST matches. Final modelling set: 1359 species.
- Grouping: fish (polyphyletic; n=331, 55 orders), reptiles including birds (monophyletic Sauropsida; n=461, 31 orders; birds n=403 from 28 orders), mammals (n=550, 21 orders), amphibians (n=17, 2 orders). No amphibian-specific model due to insufficient data and lack of amphibian reference promoters.
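The within-species curation step described above (1.5× IQR outlier removal, then a mean) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `species_level_age`, the quartile method, and the rule of keeping all reports when fewer than four exist are assumptions.

```python
from statistics import mean, quantiles


def species_level_age(reports):
    """Collapse several reported ages at maturity (years) into one value.

    Reports outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are dropped, then the
    remaining reports are averaged.
    """
    values = sorted(reports)
    if len(values) < 4:
        # Assumption: with very few reports, quartiles are unreliable,
        # so every report is kept.
        return mean(values)
    # Assumption: "inclusive" quartiles; the source does not state the method.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    kept = [v for v in values if low <= v <= high]
    return mean(kept)


# Hypothetical reports for one species; 12.0 is flagged as an outlier.
example_reports = [2.0, 2.5, 3.0, 2.5, 12.0]
example_value = species_level_age(example_reports)
```

With these hypothetical reports the fences are 1.75 and 3.75 years, so only the 12.0 report is discarded before averaging.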
Genomic data and promoter retrieval:
- Downloaded genomes from NCBI (accessed 28/01/2024). Where multiple assemblies existed, used NCBI reference/representative assemblies; assessed genome completeness with BUSCO (v5.2.2) and noted contiguity (N50). For within-species variability, used 98 assemblies across 14 species.
- Reference promoter sets from EPDnew: human (Homo sapiens), chicken (Gallus gallus), zebrafish (Danio rerio). Extracted ±100 bp around TSS for representative promoters.
- For each species genome, built BLAST+ databases (v2.12.0) and queried group-appropriate reference promoters (human for all-vertebrate and mammal models; chicken for reptiles; zebrafish for fish), requiring ≥70% identity. Selected the top hit per promoter per species.
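The hit-selection step can be illustrated with a short parser for BLAST+ tabular output (`-outfmt 6`, default 12 columns). This is a sketch under stated assumptions: the source does not say how the "top hit" was ranked, so ranking by bit score is an assumption, and the function name and example identifiers are hypothetical.

```python
def top_hit_per_promoter(blast_lines, min_identity=70.0):
    """Return {promoter_id: hit}, keeping one best hit per query promoter.

    Expects default BLAST+ outfmt 6 columns:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    best = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.split("\t")
        hit = {
            "subject": fields[1],
            "identity": float(fields[2]),
            "start": int(fields[8]),
            "end": int(fields[9]),
            "bitscore": float(fields[11]),
        }
        if hit["identity"] < min_identity:
            continue  # enforce the >=70% identity requirement
        query = fields[0]
        # Assumption: "top hit" means the highest bit score.
        if query not in best or hit["bitscore"] > best[query]["bitscore"]:
            best[query] = hit
    return best


# Hypothetical hits for three reference promoters in one genome.
example_lines = [
    "PROM_A\tchr1\t85.0\t200\t30\t0\t1\t200\t5000\t5199\t1e-40\t150.0",
    "PROM_A\tchr2\t92.0\t200\t16\t0\t1\t200\t900\t1099\t1e-60\t210.0",
    "PROM_B\tchr3\t65.0\t200\t70\t0\t1\t200\t100\t299\t1e-80\t300.0",
    "PROM_B\tchr4\t71.0\t150\t43\t0\t1\t150\t700\t849\t1e-20\t90.0",
    "PROM_C\tchr5\t60.0\t120\t48\t0\t1\t120\t10\t129\t1e-05\t40.0",
]
example_hits = top_hit_per_promoter(example_lines)
```

Promoters with no qualifying hit (here `PROM_C`) are simply absent from the result, which corresponds to the missing values that are later imputed or filtered.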
Feature construction and CpG metrics:
- Computed CpG observed/expected (CpG O/E) per promoter: [#CpG/N]/([#C/N]×[#G/N]). Also computed genome-wide GC proportion. Encoded species order as one-hot variables.
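The CpG O/E formula above translates directly into code. The sketch below follows the formula as written (all frequencies normalised by sequence length N); the function names and the choice to return 0.0 when the ratio is undefined are assumptions made for illustration.

```python
def cpg_observed_expected(sequence):
    """CpG O/E = (#CpG / N) / ((#C / N) * (#G / N)) for one promoter sequence."""
    seq = sequence.upper()
    n = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    if n == 0 or c_count == 0 or g_count == 0:
        # Assumption: an undefined ratio is reported as 0.0.
        return 0.0
    return (cpg_count / n) / ((c_count / n) * (g_count / n))


def gc_proportion(sequence):
    """Fraction of bases that are G or C."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


# Toy 8 bp sequence: 2 C, 2 G, 2 CpG dinucleotides.
toy = "ACGTCGAA"
toy_oe = cpg_observed_expected(toy)   # (2/8) / ((2/8) * (2/8)) = 4.0
toy_gc = gc_proportion(toy)           # 4/8 = 0.5
```

In the study's workflow the O/E ratio would be computed for each retrieved promoter hit, while GC proportion is a single genome-wide feature.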
Modelling framework:
- Target: natural log-transformed age at maturity (years).
- Split species by percentiles of transformed age and by order into 70% training, 15% validation, and 15% test data. The test set was held fixed while the training/validation partition was redrawn 10 times, giving 10 outer folds for nested cross-validation (CV).
- Feature filtering: removed promoters with <10% species coverage in training/validation. Imputed remaining missing promoter values as zero (reflecting no detected ortholog). Standardised features and target (z-scores) within training/validation; applied corresponding scaling to test to prevent leakage.
- Inner CV: Within each outer fold, fitted elastic net regression (glmnet/glmnetUtils) with 10-fold inner CV to optimise alpha and lambda; used lambda.1se for parsimony. Predicted training and validation data, computed Pearson correlations, and compared absolute errors between training and validation via two-sided unpaired t-tests; models showing significant overfitting were discarded.
- Ensemble (bagging): Refit retained models on combined training+validation; aggregate predictions by median across models to form ensemble per group (all-vertebrate, fish, mammal, reptile). Also developed relative age at maturity models (age at maturity divided by maximum lifespan), imputing missing lifespan by closest taxonomic rank.
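Two details of the framework above are easy to get wrong: scaling statistics must come from training/validation data only, and ensemble predictions made on the standardised log scale must be back-transformed to years. The following is a minimal sketch of both steps, not the authors' R workflow; the function names and toy numbers are hypothetical, and the sample standard deviation is an assumption.

```python
import math
from statistics import mean, median, stdev


def fit_scaler(values):
    """Mean and sample SD estimated from training/validation data only."""
    return mean(values), stdev(values)


def standardise(values, mu, sd):
    """Apply previously fitted scaling (used unchanged on the test set)."""
    return [(v - mu) / sd for v in values]


def ensemble_age_years(z_predictions, target_mu, target_sd):
    """Median of per-model z-scored ln(age) predictions, returned in years."""
    z = median(z_predictions)
    return math.exp(z * target_sd + target_mu)


# Hypothetical training ages (years) and their natural-log transform.
train_ages = [1.0, 2.0, 4.0, 8.0]
train_log_ages = [math.log(a) for a in train_ages]
mu, sd = fit_scaler(train_log_ages)

# A test-set target is scaled with the training statistics, never its own.
test_z = standardise([math.log(3.0)], mu, sd)

# Hypothetical standardised predictions from three retained models.
model_outputs = [-0.2, 0.0, 0.3]
predicted_years = ensemble_age_years(model_outputs, mu, sd)
```

Here the median model output is 0.0, so the prediction equals the geometric mean of the training ages (about 2.83 years). Taking the median rather than the mean limits the influence of any single poorly behaved model in the ensemble.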
Uncertainty quantification (prediction intervals):
- Re-ran elastic net in Python (v3.12) using MAPIE (v0.7.0) to estimate 90% prediction intervals via three methods: naive (mean±k·SD), jackknife+-after-bootstrap (jackknife+ab), and cross-validation. Standardisation mirrored R workflow with strict separation to avoid leakage. Bagged intervals by taking median bounds across retained models. Evaluated coverage as PICP and width as MPIW; selected jackknife+ab for final predictions due to best trade-off between coverage and width.
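The two evaluation metrics and the bagging of interval bounds can be sketched without any conformal-prediction library. This illustrates the standard definitions of PICP (prediction interval coverage probability) and MPIW (mean prediction interval width) only; it does not reproduce the MAPIE interval estimation itself, and all names and numbers are hypothetical.

```python
from statistics import mean, median


def picp(y_true, lower, upper):
    """Fraction of true values falling inside their prediction interval."""
    inside = [lo <= y <= hi for y, lo, hi in zip(y_true, lower, upper)]
    return sum(inside) / len(inside)


def mpiw(lower, upper):
    """Mean width of the prediction intervals."""
    return mean(hi - lo for lo, hi in zip(lower, upper))


def bag_intervals(per_model_intervals):
    """Median lower and upper bound per sample across retained models.

    per_model_intervals: one list per model, each a list of (lower, upper)
    tuples in the same sample order.
    """
    lowers, uppers = [], []
    for sample_bounds in zip(*per_model_intervals):
        lowers.append(median(lo for lo, _ in sample_bounds))
        uppers.append(median(hi for _, hi in sample_bounds))
    return lowers, uppers


# Hypothetical intervals (years) from three models for two species.
intervals = [
    [(0.0, 2.0), (1.0, 3.0)],
    [(1.0, 3.0), (2.0, 4.0)],
    [(0.5, 2.5), (1.5, 3.5)],
]
bagged_lower, bagged_upper = bag_intervals(intervals)
coverage = picp([1.0, 4.0], bagged_lower, bagged_upper)
width = mpiw(bagged_lower, bagged_upper)
```

For a nominal 90% interval, a PICP close to 0.90 with the smallest MPIW is the desired outcome, which is the trade-off used to choose between the three interval methods.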
Application to new species:
- Predicted ages at maturity for 1912 species lacking reported values but with genomes (group-specific models only). Preprocessing ensured feature alignment to training (promoter set filtering, one-hot order alignment, scaling). Annotated IUCN categories (accessed 05/08/2023).
- Dataset and grouping: Final modelling set included 1359 species after removing 3 extreme outliers and 17 with no promoter matches; groups: fish (331), mammals (550), reptiles incl. birds (461), amphibians (17; not modelled separately).
- Model performance (test data):
  - All-vertebrate model: R=0.78, explaining ~61% variance between known and predicted ages.
  - Within-group (using all-vertebrate model): fish R≈0.17 (R^2≈0.028), mammals R≈0.91 (R^2≈0.84), reptiles R≈0.46 (R^2≈0.21).
  - Group-specific models: fish R≈0.67 (R^2≈0.45), mammals R≈0.93 (R^2≈0.86), reptiles R≈0.72 (R^2≈0.52); substantial improvement for fish and reptiles over the all-vertebrate model.
  - Relative age at maturity models performed worse: R^2=0.34 (all-vertebrate), 0.37 (fish), 0.51 (mammals), 0.26 (reptiles).
- Error metrics: Median error rates for group-specific models were ~25–34%, corresponding to ~0.4–0.9 years; the abstract reports a median error of ~30% (<1 year). Amphibians showed the highest errors, attributed to their small sample size (n=17).
- Robustness and determinants of error: Prediction error increased with evolutionary divergence from reference species and decreased with promoter sequence similarity (identity, length, number of BLAST hits) in fish and reptiles; opposite trend in mammals likely due to many shared promoters. Genome completeness (BUSCO) did not correlate with error in the all-vertebrate model.
- Within-species reproducibility: Across 98 assemblies from 14 species, predictions clustered near species medians. Deviations were linked to lower completeness or contiguity, but the method was robust overall to variation in assembly quality.
- Prediction intervals (90% nominal): Jackknife+ab yielded PICP near 0.88–0.89 with narrower MPIW than CV; CV overcovered (>0.90) with wider intervals. Selected jackknife+ab for final predictions. Coverage was lower for extreme early/late maturing species.
- Jackknife+ab MPIW (absolute, in years) and PICP on test data: all-vertebrate MPIW≈3.73, PICP≈0.88; fish MPIW≈7.18, PICP≈0.88; mammals MPIW≈3.07, PICP≈0.89; reptiles MPIW≈2.65, PICP≈0.89.
- Biological signals: Top promoter features included genes with reproductive tissue-biased expression (e.g., CARF, dpfl, MAP2K7, ZNF646, RP3-336H9/RPGR; KMT2A, RRP7A, HSP90AB1; RP6-170F5/SERTM2) and homeobox genes (LHX8, hoxc11a, hoxc3a). Mammal model highlighted PAX8 and FKBP11 (mTOR pathway). Nine genes overlapped with a methylation-based age-at-maturity study (e.g., ESRRB, FAF1, RBM10, ZMIZ1, DICER1, LRBA, TRPM3, NPTN, KIFC3). Enrichment analyses implicated developmental, regulatory, transcriptional, and RNA biosynthetic processes.
- New predictions: Generated age-at-maturity predictions (with intervals) for 1912 species lacking reported values (fish n=645, mammals n=136, reptiles incl. birds n=687). Predictions typically fell within ranges of close relatives.
- Conservation relevance: Age at maturity correlated with IUCN Red List category among fish (R=0.41, p=1e-10), mammals (R=0.36, p=4.8e-17), and reptiles (R=0.48, p=2e-26); relationships weaker for predicted vs. known values. Relative age at maturity correlated poorly (max R≈0.07).
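The performance figures above pair a Pearson correlation R with its square (for example, R=0.78 gives R^2≈0.61) and report median error rates alongside errors in years. A minimal sketch of these calculations follows. The summary does not define the error rate, so treating it as |predicted − known| / known is an assumption, and the data are hypothetical.

```python
import math
from statistics import mean, median


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def median_errors(known, predicted):
    """Median relative error (fraction) and median absolute error (years).

    Assumption: relative error is |predicted - known| / known.
    """
    relative = [abs(p - k) / k for k, p in zip(known, predicted)]
    absolute = [abs(p - k) for k, p in zip(known, predicted)]
    return median(relative), median(absolute)


# Hypothetical known and predicted ages at maturity (years).
known_ages = [2.0, 4.0, 10.0]
predicted_ages = [3.0, 3.0, 10.0]
r = pearson_r(known_ages, predicted_ages)
variance_explained = r ** 2
rel_err, abs_err = median_errors(known_ages, predicted_ages)
```

Reporting both error forms matters because a fixed relative error corresponds to very different absolute errors in early- and late-maturing species.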
The study demonstrates that promoter CpG density, augmented with genome GC content and taxonomic order, can predict vertebrate species’ age at maturity with substantial accuracy, enabling a universal, genomics-based approach applicable across major vertebrate groups. Group-specific models outperform the all-vertebrate model, especially for fish and reptiles, reflecting deep evolutionary divergence and, for fishes, genome duplication history and polyphyly. Mammalian performance is highest, likely benefitting from closer reference (human) promoters and recent common ancestry. Error analyses indicate that evolutionary distance from reference species and promoter sequence similarity strongly influence prediction accuracy and uncertainty for non-mammalian groups. Within-species analyses show robustness to assembly quality variability, indicating predictions can be made from a single moderate-quality genome. The models capture biological processes relevant to reproductive maturation rather than general aging, as evidenced by enriched developmental/regulatory functions and reproductive tissue-biased genes (e.g., LHX8, PAX8, FKBP11). Overlap with methylation-based predictors supports convergent biological signals linking promoter architecture and epigenetic regulation to life-history traits. The predicted ages correlate with extinction risk categories, supporting integration of these predictions into conservation assessments where demographic data are lacking. Although relative age-at-maturity models underperform, absolute age estimates combined with uncertainty intervals provide actionable information for prioritising species in conservation triage. Overall, the approach offers a scalable, cost-effective tool for estimating a critical life-history parameter directly from genome sequences, complementing and extending previous lifespan predictors and methylation-based models.
This work delivers universal and group-specific ensemble elastic net models that predict vertebrate species’ age at maturity from promoter CpG content, genome GC, and taxonomic order. The models achieve strong test performance (up to R^2≈0.86 for mammals; 0.45–0.52 for fish and reptiles), median errors of ~25–34% (<1 year), robust within-species reproducibility, and well-calibrated prediction intervals using jackknife+-after-bootstrap. Predictions for 1912 previously unreported species expand available life-history data and show anticipated relationships with extinction risk. The method enables rapid, generalisable estimates from a single moderate-quality genome and can directly support conservation decision-making. Future directions include: developing amphibian-specific and finer-scale clade models as more promoter annotations and genomes become available; incorporating sex-specific predictions where sexed genomes and sex-specific maturity data exist; exploring additional covariates (e.g., body size, metabolic rate, lifespan) when data availability permits; and refining uncertainty quantification and feature selection as reference resources improve.
- Taxon coverage: No amphibian-specific model due to limited data and lack of amphibian reference promoters; amphibians exhibited highest errors.
- Reference dependence: Accuracy declines with evolutionary divergence from the reference species and with lower promoter sequence similarity, particularly for fish and reptiles.
- Data requirements: Requires a reasonably complete genome assembly and known species order; while robust to variable assembly quality, lower contiguity/completeness can increase variability for individual assemblies.
- Biological scope: Models estimate species-level mean age at maturity; sex-specific differences were not modelled due to data limitations.
- Extremes: Prediction interval coverage is lower for very early- or late-maturing species, suggesting underestimation of uncertainty at extremes.
- Relative measure: Models for relative age at maturity (age at maturity:lifespan) underperformed, and lifespan imputation may introduce additional uncertainty.