Herbarium collections remain essential in the age of community science

I. Eckert, A. Bruneau, et al.

Discover how traditional herbarium specimens capture the diversity of vascular plants in Canada far more effectively than iNaturalist observations. This study by Isaac Eckert, Anne Bruneau, Deborah A. Metsger, Simon Joly, T. A. Dickinson, and Laura J. Pollock reveals that digitizing herbarium collections is a crucial investment for biodiversity modeling and conservation.
Introduction

Community science platforms (e.g., iNaturalist) have generated unprecedented volumes of biodiversity occurrence data, now comprising about half of GBIF records since 2010. However, these data remain spatially and taxonomically biased, often concentrated in urban areas and toward common or conspicuous species. Natural history collections such as herbaria, though also biased, may better represent rare species and rural areas and contain irreplaceable historical, morphological, and genetic information. The central research question is whether digitized herbarium specimens outperform community science observations in representing plant diversity, distributions, and environmental niches, and how each data source contributes to species distribution modeling (SDM) needs. The study focuses on Canadian vascular plants, comparing the coverage, biases, and niche representation of iNaturalist observations versus herbarium specimens, and projects the benefits of digitizing Canada’s undigitized herbarium holdings for biodiversity modeling and conservation planning.

Literature Review

Prior studies show community science can be biased toward larger, common species and human-dense areas, which can reduce conservation efficiency. Herbaria have been shown to better match expected biodiversity patterns and reduce spatial and taxonomic bias relative to community science records. Natural history collections contain extensive legacy data valuable for taxonomy, systematics, global change biology, and trait/genomic research. Despite this, only ~21% of the world’s ~396M herbarium specimens have been digitized. Recent work (e.g., Daru & Rodriguez 2023) suggests collections outperform community science in bias metrics. Calls exist for large-scale digitization (e.g., metaherbarium concepts), but barriers include funding and time. The literature highlights the need to assess whether the rapid accumulation of community observations can eventually offset biases, and whether additional digitization materially improves SDM performance under climate change, motivating this study.

Methodology
  • Study scope: All 4392 terrestrial vascular plant species in Canada (VASCAN list). Occurrence data were drawn from GBIF (Canada and USA) for 1900 to January 2024; after removing records with coordinate uncertainty >25 km, 7,519,260 records remained (23% herbarium, 72% iNaturalist).
  • Data partitioning: iNaturalist identified via institution code "iNaturalist"; herbarium via basisOfRecord "PRESERVED_SPECIMEN".
  • Bias assessment: Temporal bias visualized by annual accumulation since 1900. Spatial bias quantified with species-level Nearest Neighbor Index (NNI; log-transformed; spatialEco in R); two-sided t-test for differences. Also modeled counts vs human population density (GPWv4 2020, resampled to 25 km grid) using negative-binomial GLMs (MASS), reporting AIC and Kullback–Leibler pseudo-R².
  • Spatial distribution visualization: Rasterized counts to 25 km grid across Canada and USA; mapped sampling imbalance (herbarium minus iNaturalist density).
  • Taxonomic bias: Compared variance in records per species between data types using F test on relative counts.
  • Phylogenetic/functional bias: Built a phylogeny (rtrees) for Canadian plants. Compiled TRY traits (e.g., seed mass, height, SLA, lifespan, N-fixation, growth form, photosynthetic pathway, dispersal syndrome, phenology, leaf compoundness, woodiness, leaf type). Imputed missing values using phylogenetic vector regressions (PVR) and missForest; retained 4147 species with complete data. Constructed a functional dendrogram (Gower distances, UPGMA). Calculated phylogenetic/functional signal of record counts using Pagel’s λ with phylosig (after log-transforming counts). Visualizations used square-root counts.
  • Diversity coverage and accumulation: Quantified taxonomic (proportion of species), phylogenetic (evolutionary distinctiveness via phyloregion::evol_distinct), and functional distinctiveness coverage for each data type. Built randomized accumulation curves (1000 runs) adding records and tallying marginal diversity gained; fit logarithmic beta regressions (betareg) to compare accumulation rates and to extrapolate required additional observations.
  • Niche coverage: Used BIEN range polygons (available for 3269 species; 3174 with GBIF records). Rasterized ranges at 25 km and computed spatial coverage (proportion of occupied cells with records). For environmental niche, compiled five climate normals (mean annual precipitation, precipitation as snow, humidity, degree days >0°C, degree days >18°C; AdaptWest), standardized variables, computed Euclidean distances among in-range cells, clustered to a climatic dendrogram per species, and computed cell climatic distinctiveness (normalized to sum to 1). Built niche accumulation curves for both data types (1000 randomizations), fitting logarithmic beta regressions to estimate niche coverage gains per additional record.
  • Extrapolation of digitization benefits: Estimated 7.3M undigitized Canadian herbarium specimens. Allocated expected additional records per species proportional to current herbarium incidence and extrapolated taxonomic/phylogenetic/functional and niche coverage gains using fitted herbarium curves. Estimated the number of additional iNaturalist observations needed to match digitization benefits using iNaturalist curves.
  • SDM linkage: Used pre-built boosted regression tree SDMs (Eckert et al. 2023) with climatic, topographic, land cover, and soil predictors; predictions aggregated to 25 km and thresholded (0.5). Computed range-filling score (proportion of BIEN range cells predicted occupied). Modeled relationship between current niche coverage (square-root transformed) and range filling via inflated beta regression (GAMLSS, BEINF family), then predicted increases in range filling based on potential niche coverage from digitization. Assessed nonlinearity between increases in niche coverage and increases in range filling via linear regression on log-log scale.
  • Cost estimation: Surveyed curators and literature to estimate digitization cost at approximately $3 per specimen using traditional workflows; noted potential reductions via high-throughput automation.
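The randomized accumulation curves described in the methodology can be illustrated with a minimal sketch. It covers only the taxonomic case (proportion of species covered as records are added in random order) and omits the logarithmic beta regression that the study fits to the curves. The function name `accumulation_curve`, the toy record lists, and the number of runs are assumptions for illustration and do not come from the study.

```python
import random


def accumulation_curve(records, n_species_total, n_runs=1000, seed=0):
    """Mean fraction of the species pool covered after each added record.

    records: list of species labels, one entry per occurrence record.
    Each run shuffles the records and tallies the species seen so far.
    """
    rng = random.Random(seed)
    totals = [0.0] * len(records)
    for _ in range(n_runs):
        order = list(records)
        rng.shuffle(order)
        seen = set()
        for i, species in enumerate(order):
            seen.add(species)
            totals[i] += len(seen) / n_species_total
    return [t / n_runs for t in totals]


# Hypothetical toy data: 12 records over the same 4 species.
even_records = ["a", "b", "c", "d"] * 3          # evenly sampled
skewed_records = ["a"] * 9 + ["b", "c", "d"]     # dominated by one species

even_curve = accumulation_curve(even_records, n_species_total=4, n_runs=200)
skewed_curve = accumulation_curve(skewed_records, n_species_total=4, n_runs=200)

for n in (1, 6, 12):
    print(n, round(even_curve[n - 1], 3), round(skewed_curve[n - 1], 3))
```

Both toy datasets reach full coverage once all 12 records are added, but the evenly sampled one gets there with fewer records. This is the kind of per-record efficiency difference the accumulation curves are meant to expose.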
Key Findings
  • Dataset: 12,293,856 records across 3968 species (1900–Jan 2024); after filtering high spatial uncertainty, 7,519,260 records remained (23% herbarium, 72% iNaturalist).
  • Biases: Herbarium records showed less temporal clustering (more even coverage over 12 decades) and less spatial clustering than iNaturalist. Mean log NNI: herbarium −0.88 vs iNaturalist −0.93 (p=0.006). Record density correlated with human population density for both data types but more strongly for iNaturalist (pseudo-R²≈0.32) than herbarium (pseudo-R²≈0.11), both p<0.001. Approximately 37% of the land area (mostly northern Canada) had no records.
  • Taxonomic representation and bias: Although species had more iNaturalist observations on average (mean 1234 vs 396 for herbarium), variance in iNaturalist counts was 4.8 times higher (F=4.776, df=4391, p<0.001). Half of all iNaturalist observations (≈2.7M) represented only 4% (178) of species. Consequently, 47% of Canadian plants are better represented by herbarium records.
  • Phylogenetic and functional bias: Record counts showed stronger phylogenetic and functional signal (Pagel’s λ) in iNaturalist than in herbarium data, indicating stronger clade- and trait-level biases in iNaturalist.
  • Diversity coverage: Of 4392 species, herbarium records represented 3662 (83.4%) vs iNaturalist 3504 (79.8%). Species at risk: herbarium 93.2% vs iNaturalist 88.7%. Unrepresented by either: 714 species (16.3%). Herbarium captured 3.4% more phylogenetic diversity and 1.8% more functional diversity. It would take over 4.2 million additional iNaturalist observations to match the current diversity coverage of digitized herbarium specimens.
  • Niche coverage: On average, both data types together captured 9.2% of species’ environmental niches. Separately: iNaturalist 5.7%, herbarium 5.1%. Per record, herbarium captured 1.8 times more niche space; accumulation curves indicate herbarium accumulates environmental niche coverage about 9% faster. Overlap in niches captured by both was small (mean 1.6%).
  • SDM performance: Current SDMs using both data types fill, on average, 20.8% of species’ Canadian ranges. Range filling increases with niche coverage (inflated beta regression; AIC −2102; R²=0.33; p<0.001). Increases in niche coverage translate nonlinearly and disproportionately to increases in range filling (log–log slope >1; R²=0.16; p<0.001).
  • Digitization benefits: Digitizing the remaining 7.3M Canadian herbarium specimens could add approximately 156 species (3.6%), 5.3% more phylogenetic diversity, and 2.3% more functional diversity, and increase herbarium-conferred niche coverage from 5.1% to about 19.8%; combined with existing iNaturalist data, average niche coverage could reach ~23.8%. This would increase the ability to model species distributions more than fivefold per species on average. Matching these gains would require about 27.3 million additional iNaturalist observations, which is likely to take decades at current rates (GBIF had ~78M iNaturalist observations across all taxa as of Jan 2024).
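Several of the percentages in the findings above can be rechecked from the reported counts. The sketch below recomputes the species-coverage percentages against the 4392-species list and checks that the combined niche coverage matches inclusion-exclusion on the reported means. That last step assumes all three means were taken over the same set of species, which the summary does not state. The helper `pct` and the variable names are illustrative.

```python
TOTAL_SPECIES = 4392  # VASCAN list size reported in the summary


def pct(part, whole):
    """Percentage of `whole` made up by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)


herbarium_pct = pct(3662, TOTAL_SPECIES)        # species with herbarium records
inat_pct = pct(3504, TOTAL_SPECIES)             # species with iNaturalist records
neither_pct = pct(714, TOTAL_SPECIES)           # species with no records at all
digitization_gain_pct = pct(156, TOTAL_SPECIES)  # species added by digitization

# Inclusion-exclusion on mean niche coverage (percent):
# union = iNaturalist + herbarium - overlap
# Assumption: all three means are averaged over the same species.
combined_niche = round(5.7 + 5.1 - 1.6, 1)

print(herbarium_pct, inat_pct, neither_pct, digitization_gain_pct, combined_niche)
```

The recomputed values agree with the reported 83.4%, 79.8%, 16.3%, 3.6%, and 9.2%.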
Discussion

The study demonstrates that despite the rapid growth of community science, herbarium specimens provide more even, less biased coverage and more efficiently capture taxonomic, phylogenetic, functional, and environmental niche dimensions of biodiversity on a per-record basis. Because SDM performance scales positively—and nonlinearly—with niche coverage, the more informative herbarium records disproportionately improve predictive ability for species ranges. Consequently, mass digitization of existing herbarium collections represents a practical, cost-effective pathway to rapidly reduce biodiversity data shortfalls, especially in the context of climate-informed conservation (e.g., 30x30 targets). Beyond superior occurrence information, physical specimens enable additional layers of data (morphology, genetics, traits) critical for understanding evolutionary history, detecting cryptic taxa, and evaluating phenological and functional responses to global change. Despite demonstrated benefits, digitization progress is impeded by funding and time costs; however, emerging high-throughput imaging and AI workflows substantially reduce both, making large-scale digitization increasingly feasible. Community science remains valuable for engagement and continued data inflow but is unlikely, on its own, to match the specific benefits of herbarium digitization in the near term. Targeted collection strategies, particularly focusing on rare species and undersampled regions, together with digitization, could synergistically accelerate improvements in biodiversity knowledge and modeling needed for effective conservation planning under climate change.
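The "nonlinear and disproportionate" scaling refers to a log-log slope greater than 1, and a small sketch shows what that means. It estimates the slope by ordinary least squares on log-transformed synthetic data. The exponent 1.5 and the data values are purely illustrative assumptions, since the summary reports only that the slope exceeds 1, and `loglog_slope` is a hypothetical helper name.

```python
import math


def loglog_slope(x, y):
    """Ordinary least-squares slope of log(y) regressed on log(x)."""
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    sxx = sum((a - mean_x) ** 2 for a in lx)
    return sxy / sxx


# Synthetic data following y = 2 * x**1.5 (illustrative exponent, not the study's).
niche_gain = [0.01, 0.02, 0.04, 0.08]
range_gain = [2 * v ** 1.5 for v in niche_gain]

slope = loglog_slope(niche_gain, range_gain)
doubling_factor = 2 ** slope  # multiplier on y when x doubles

print(round(slope, 3), round(doubling_factor, 3))
```

With a slope above 1, doubling the gain in niche coverage more than doubles the gain in range filling. This is why more informative records pay off disproportionately.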

Conclusion

Herbarium specimens, even at one-third the record count of iNaturalist, better capture taxonomic, phylogenetic, and functional diversity and more efficiently sample species’ environmental niches, leading to superior inputs for species distribution modeling. Digitizing Canada’s 7.3 million undigitized specimens would substantially expand diversity and niche coverage and more than quintuple modeling capacity per species, benefits that would otherwise require tens of millions of additional community science observations. Given the cost-effectiveness and growing feasibility of high-throughput digitization, investing in herbarium digitization—complemented by targeted specimen collection and strategic use of community science—offers a critical path to close biodiversity data gaps and improve conservation planning under climate change. Future work should: refine digitization workflows with AI/automation; prioritize taxa and regions to maximize niche coverage gains; integrate trait/genomic data layers from specimens; and build open, interoperable platforms (e.g., a global metaherbarium) to fully leverage digitized collections.

Limitations
  • Coverage constraints: Niche analyses were limited to species with available BIEN range maps (3269 total, 3174 with GBIF records), potentially biasing results toward better-known taxa.
  • Spatial scale assumptions: Analyses used a coarse 25 km grid, assuming at least some suitable habitat within occupied range cells; finer-scale heterogeneity and absences within polygons are not captured.
  • Digitization extrapolation: Assumes undigitized herbarium holdings mirror the taxonomic composition of currently digitized specimens and can be adequately georeferenced; actual representation may differ due to past clade- or project-focused digitization.
  • Model fitting: Some beta regressions did not converge for niche/density accumulation; results emphasized community averages rather than species-specific outcomes.
  • Bias persistence: While herbarium records reduce biases relative to iNaturalist, they still exhibit temporal and spatial unevenness and share some systematic biases.
  • SDM framework choices: Use of previously built BRTs, thresholding at 0.5, and aggregation to 25 km may influence absolute range-filling estimates; alternative modeling frameworks or thresholds might yield different magnitudes though qualitative patterns likely persist.
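The threshold sensitivity noted in the last limitation can be seen in a small sketch of the range-filling score: the proportion of a species' range cells whose predicted suitability passes the threshold. The suitability values are hypothetical, `range_filling` is an illustrative helper name, and treating a cell exactly at the threshold as occupied is an assumption, since the summary does not say how ties are handled.

```python
def range_filling(suitability, threshold=0.5):
    """Fraction of range cells predicted occupied at the given threshold.

    suitability: predicted suitability (0 to 1) for each cell inside the
    species' range polygon. Cells with suitability >= threshold count as
    occupied (assumed tie rule).
    """
    if not suitability:
        raise ValueError("suitability must contain at least one cell")
    occupied = sum(1 for s in suitability if s >= threshold)
    return occupied / len(suitability)


# Hypothetical predictions for ten 25 km cells inside one species' range.
cells = [0.9, 0.7, 0.55, 0.45, 0.4, 0.3, 0.2, 0.1, 0.05, 0.02]

for t in (0.4, 0.5, 0.6):
    print(t, range_filling(cells, threshold=t))
```

On these toy values, moving the threshold from 0.5 to 0.4 or 0.6 changes the score substantially. This is why absolute range-filling estimates depend on that choice even when the qualitative pattern holds.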