
Enabling FAIR data in Earth and environmental science with community-centric (meta)data reporting formats
R. Crystal-Ornelas, C. Varadharajan, et al.
This research addresses data interoperability in Earth and environmental science. It applies the FAIR principles and introduces eleven (meta)data reporting formats developed by a team of authors including Robert Crystal-Ornelas and Charuleka Varadharajan, with the aim of improving data accessibility and supporting scientific collaboration.
Introduction
The FAIR principles (Findable, Accessible, Interoperable, and Reusable) are crucial for increasing transparency and reproducibility in Earth and environmental science research. While advancements in data repositories and search engines have improved data preservation, findability, and accessibility, significant challenges remain in data interoperability and reuse. This is largely due to the diversity of Earth science data and the limited time and resources available to researchers for effective data management. Formal (meta)data standards exist, but they are often limited in scope, and their development can be lengthy. Reporting formats, which are community-driven initiatives focused on harmonizing data types within scientific domains, offer a more agile alternative. These formats provide instructions, templates, and tools for consistent data formatting, facilitating efficient data collection, harmonization, and reuse. However, such formats are currently lacking for many environmental data types, and their adoption is often hampered by complexity and resource constraints. This paper focuses on the development and implementation of community-centric (meta)data reporting formats within the context of the Environmental Systems Science Data Infrastructure for a Virtual Ecosystem (ESS-DIVE) data repository, addressing the challenges of integrating multidisciplinary data from various sources (hydrological, geological, ecological, biological, and climatological). The goal is to create a flexible and modular framework that can readily accommodate future reporting formats and significantly enhance data interoperability and reuse, thereby advancing scientific understanding and prediction.
Literature Review
The authors reviewed existing literature on FAIR data principles, highlighting the importance of data standardization for improved data findability, understanding, and reuse. They examined various formal (meta)data standards, such as the ISO 8601 standard for date and time formatting and the Open Geospatial Consortium's Sensor Observation Service standard for sensor data. While acknowledging the value of these accredited standards, the authors emphasized their limited availability for many environmental data types and the considerable time involved in their development. This review led them to focus on the development and implementation of community-centric reporting formats as a more practical and efficient approach to improve data interoperability and reuse within specific scientific domains. Existing examples of successful reporting formats, such as FLUXNET's half-hourly flux and meteorological reporting format, were cited as evidence of their potential to facilitate both access and reuse of consistently formatted data. However, the authors also recognized the limitations of existing formats, such as the lack of comprehensive coverage for most environmental data types and the challenges of adoption due to complexity and resource constraints. This gap in current standards and the need for practical, community-driven solutions motivated the creation of the specific reporting formats presented in the study.
Methodology
The research employed a community-centric approach involving domain scientists, software engineers, and informatics specialists to develop the eleven (meta)data reporting formats. The process began with a thorough review of 112 pre-existing data standards and resources, creating crosswalks to identify gaps and determine essential variables and metadata for harmonization. This revealed that no existing standards fully met the ESS research community's needs, necessitating the creation of new formats. Eleven reporting formats were subsequently developed (six cross-domain and five domain-specific), covering various data types and providing guidelines for formatting and describing general research elements (e.g., file metadata, tabular data, physical samples, model data) and specific data types relevant to interdisciplinary research (e.g., biogeochemical samples, soil respiration, leaf-level gas exchange). Existing standards and conventions were adopted wherever possible, but new formats were created where needed. Throughout the development process, a balance was sought between practicality for scientists and the machine-actionability that the FAIR principles call for. Key aspects of the development included creating harmonized templates with consistent terms and formats (e.g., the YYYY-MM-DD date format) and ensuring spatial data consistency (latitude and longitude in decimal degrees). All formats requiring CSV files adopted relevant recommendations from existing CSV standards. The reporting formats were shared and archived in three ways: (1) as datasets in the ESS-DIVE repository, enabling public download and citation; (2) on GitHub, for version control, feedback, and ongoing edits; and (3) as a website via GitBook, for broader accessibility.
The development process included iterative feedback from the scientific community (a total of 247 individuals from 128 institutions), ensuring the practicality and usefulness of the formats. Guidelines were then established to aid other research communities in replicating the community-centric approach.
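The harmonization rules described above (YYYY-MM-DD dates, latitude and longitude in decimal degrees, CSV files) lend themselves to simple automated checks. The following is a minimal sketch of such a check, not the authors' tooling: the column names `date`, `latitude`, and `longitude` and the helper functions are hypothetical and chosen only for illustration; the actual reporting formats define their own field names.

```python
import csv
import io
import re
from datetime import datetime

# Hypothetical column names; the real reporting formats define their own.
DATE_COL, LAT_COL, LON_COL = "date", "latitude", "longitude"


def is_iso_date(text):
    """True if text is a real calendar date written exactly as YYYY-MM-DD."""
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", text):
        return False
    try:
        datetime.strptime(text, "%Y-%m-%d")
    except ValueError:
        return False
    return True


def in_range(text, low, high):
    """True if text parses as a number between low and high (inclusive)."""
    try:
        return low <= float(text) <= high
    except ValueError:
        return False


def check_row(row):
    """Return a list of problems found in one CSV row (empty if none)."""
    problems = []
    if not is_iso_date(row.get(DATE_COL) or ""):
        problems.append(f"{DATE_COL} is not a valid YYYY-MM-DD date")
    if not in_range(row.get(LAT_COL) or "", -90.0, 90.0):
        problems.append(f"{LAT_COL} is not decimal degrees in [-90, 90]")
    if not in_range(row.get(LON_COL) or "", -180.0, 180.0):
        problems.append(f"{LON_COL} is not decimal degrees in [-180, 180]")
    return problems


def check_csv(text):
    """Map file line numbers to the problems found on that line."""
    report = {}
    reader = csv.DictReader(io.StringIO(text))
    # Data rows start on line 2 because line 1 holds the header.
    for line_no, row in enumerate(reader, start=2):
        problems = check_row(row)
        if problems:
            report[line_no] = problems
    return report


# Made-up sample data for illustration only.
sample = (
    "date,latitude,longitude\n"
    "2021-06-01,37.87,-122.25\n"
    "06/01/2021,37.87,-122.25\n"
    "2021-06-02,137.0,-122.25\n"
)

if __name__ == "__main__":
    for line_no, problems in check_csv(sample).items():
        print(line_no, problems)
```

In this made-up sample, the second data row uses a non-ISO date and the third has a latitude outside the valid range, so only those two lines are reported. A check of this kind only covers formatting; it says nothing about whether the values themselves are scientifically plausible.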
Key Findings
The community-centric approach resulted in four key outcomes:
1. **Comprehensive Crosswalks:** The teams created (meta)data crosswalks (Supplementary Files 1–20) mapping existing resources to each data type, revealing gaps in existing standards and informing the development of the new formats.
2. **Eleven New Reporting Formats:** Eleven reporting formats (Supplementary Table 1) were created to encompass diverse ESS (meta)data. These included six cross-domain formats (dataset metadata, file-level metadata, CSV file guidelines, sample metadata, terrestrial model data archiving, and location metadata) and five domain-specific formats (microbial amplicon abundance tables, leaf-level gas exchange, soil respiration, water and soil chemistry measurements, and hydrologic measurements). All formats included a minimal set of required metadata fields for programmatic parsing and optional fields for detailed context.
3. **Multi-Platform Data Sharing:** The developed reporting formats were published in the ESS-DIVE repository, hosted on GitHub for version control and feedback, and rendered as a project website using GitBook, ensuring wide accessibility and flexibility.
4. **Community Guidelines:** Guidelines for creating community-centric (meta)data reporting formats were established, encouraging communities to review existing standards, create crosswalks, develop and test templates iteratively with user feedback, define a minimum set of metadata for reuse, and host the final documentation on publicly accessible and easily updatable platforms (Box 1). These guidelines emphasize the importance of integrating feedback from scientists who collect and reuse the data to create formats that are both scientifically useful and practically implementable.
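The crosswalk step in the first outcome can be pictured as a table that maps each (meta)data concept a community needs to the term, if any, that each reviewed standard uses for it. The sketch below is an illustrative assumption about how such a table might be queried for gaps; the concept names, standard names, and terms are all invented and do not come from the paper's supplementary files.

```python
# Hypothetical crosswalk: each needed concept maps to the term an existing
# standard uses for it, or None when that standard has no equivalent.
crosswalk = {
    "collection_date": {"standard_a": "date", "standard_b": "collected_on"},
    "latitude": {"standard_a": "lat", "standard_b": None},
    "sample_depth": {"standard_a": None, "standard_b": None},
    "instrument": {"standard_a": None, "standard_b": "device"},
}


def gaps(table):
    """Concepts for which no reviewed standard provides a term."""
    return sorted(
        concept for concept, terms in table.items() if not any(terms.values())
    )


def coverage(table):
    """Fraction of the needed concepts that each standard covers."""
    standards = sorted({name for terms in table.values() for name in terms})
    total = len(table)
    return {
        name: sum(1 for terms in table.values() if terms.get(name)) / total
        for name in standards
    }


if __name__ == "__main__":
    print("Uncovered concepts:", gaps(crosswalk))
    print("Coverage per standard:", coverage(crosswalk))
```

Concepts returned by `gaps` are candidates for new fields in a reporting format, while the per-standard coverage suggests which existing terms could be adopted as-is.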
The paper presents several examples of datasets published on ESS-DIVE that utilize the newly developed reporting formats (Table 1), demonstrating their real-world applicability and impact. The adoption of these formats enhances data curation, findability, and reuse, allowing for automated metadata quality assessments and the assignment of DOIs for increased searchability across various platforms (e.g., Google Dataset Search, DataONE, OSTI, DataCite).
Discussion
The study demonstrates the value of community-led (meta)data reporting formats in making archived data more reusable and interoperable. The scientist-centric approach ensured the formats' practicality and usefulness for researchers. The adoption of consistent data compilation methods helps avoid ad hoc practices and facilitates efficient data integration, particularly for multi-year projects or collaborations involving multiple teams or analyses. The use of reporting formats like those developed for ESS-DIVE enables automated metadata quality assessments, DOI assignment, and searchability across the DataONE network, significantly enhancing data accessibility and reuse. The authors acknowledged the challenges of adopting pre-existing standards that may not fully meet the research community’s needs. They highlighted the pragmatic choices made in balancing ideal FAIR principles with the practical needs of time-constrained researchers, such as replacing complex terminology with more user-friendly alternatives. The iterative feedback process and the multi-platform approach to disseminating the reporting formats are key contributors to success. The paper advocates for incentives to promote widespread adoption of these or similar formats. These incentives include involving data collectors and reusers in the development process, providing user support, and creating software tools that facilitate data conversion and validation. Future work involves developing automated formatting checkers and software for data conversion and integration, further enhancing the FAIRness and usability of the data.
Conclusion
This research presents a successful model for developing and implementing community-centric (meta)data reporting formats for Earth and environmental science. The eleven newly developed formats significantly improve data interoperability and reusability within the ESS-DIVE repository. The community-centric approach, iterative feedback process, and multi-platform dissemination strategy proved effective in creating practical and scientifically useful formats. Future work should focus on automating data validation, conversion, and integration to further enhance data FAIRness and promote wider adoption within the research community. This model can be replicated by other research communities to improve data management practices and facilitate greater scientific discovery.
Limitations
While the study provides a comprehensive set of reporting formats, the authors acknowledge that the success of these formats depends on their adoption by the scientific community. The long-term impact of these formats remains to be fully evaluated. Furthermore, the formats were developed within a specific context (ESS-DIVE), and their direct applicability to other data repositories or research domains might require adaptations. The reliance on community participation and engagement presents a challenge in terms of resource allocation and sustaining long-term commitment.