The benefits and struggles of FAIR data: the case of reusing plant phenotyping data
E. A. Papoutsoglou, I. N. Athanasiadis, et al.
This summary covers research by Evangelia A. Papoutsoglou, Ioannis N. Athanasiadis, Richard G. F. Visser, and Richard Finkers on how FAIR data can improve reusability and reproducibility in agricultural studies, illustrated by a case study in potato genetics.
Introduction
Plant phenotyping is critical for breeding improved plant varieties, but the resulting data are heterogeneous because experimental goals, designs, settings, and data collection modalities vary widely. This complexity hinders reuse, especially for meta-analyses across independently generated datasets, where ambiguities, missing documentation, undiscoverable data, and differing data types get in the way. The FAIR (Findable, Accessible, Interoperable, Reusable) data principles aim to improve this landscape. In plant phenotyping, the MIAPPE metadata standard supports findability, interoperability, and reusability, while accessibility is facilitated by implementations such as BrAPI. Without such standards, data reuse becomes arduous “data archaeology,” requiring aggregation, syntactic and semantic alignment, and linkage to confounder data.
To investigate the benefits and struggles of FAIR data, the authors re-implemented a meta-analysis using FAIR infrastructures: annotating primary data with metadata, discovering and aggregating them, integrating with secondary environmental data, and exploring them interactively via Jupyter notebooks. The proof of concept (PoC) evaluates technical feasibility and the potential time/effort benefits when FAIR and MIAPPE are used.
As a case, they revisit Hurtado-Lopez’s thesis on genotype-by-environment and QTL-by-environment interactions for developmental traits in potato, based on five field experiments conducted across the Netherlands, Venezuela, Finland, and Ethiopia over 11 years with partially overlapping subsets of the CxE population. Traits of morphological, developmental, and agronomic nature were evaluated, integrating genetic, phenotypic, molecular, and environmental data. The thesis highlighted effects of temperature and photoperiod.
Because experiments were performed by different, uncoordinated teams over time, documentation gaps and format heterogeneity complicated reuse, often requiring direct communication with the original data collectors. The thesis identified three documentation elements as crucial for successful multi-environment reuse: content, origin/source, and structure. All three are now part of MIAPPE. Although standards such as MIAME existed at the time, data were often disorganized or incomplete, which made reuse time-consuming; some data could not be used at all. Harmonizing formats and file structures was a further challenge, though substantial reuse and novel findings were still achieved.
More organized documentation would have resolved ambiguities, reduced time spent locating information, and enabled uniform integration and manipulation across experiments. The envisioned scenario streamlines acquisition, integration, and analysis/reuse, making processes reproducible. In the PoC, datasets are FAIRified and the discovery/integration process is explored, with descriptive visualizations for exploratory analyses, and an investigation of challenges and benefits.
Methodology
The study presents a proof-of-concept pipeline to FAIRify, discover, integrate, and reuse plant phenotyping data, using potato CxE experiments (five field trials: Netherlands 1999, Venezuela 2003, Finland 2004 and 2005, Ethiopia 2010) and associated environmental data (photoperiod and temperature) from Hurtado-Lopez’s thesis.
Data acquisition and preparation: Field trial data (spreadsheets) were retrieved from archived sources. Selected traits were extracted into tabular text files. Environmental data comprised daily photoperiod (sourced via timeanddate.com when original resolution was coarse) and daily average temperatures (Excel files). The authors curated metadata from the thesis, publications, and local documents.
Metadata standardization: Metadata were organized according to MIAPPE 1.1 using the MIAPPE spreadsheet template and then transformed into RDF using the Plant Phenotype Experiment Ontology (PPEO). MIAPPE sections compiled included Investigation (5 experiments), Study (per experiment), Person (coordinators and thesis author), Data files, Biological material (each CxE genotype and cultivars), Environment (available conditions), Events (e.g., planting dates, treatments when dates existed), Observation unit (mostly plant-level; 2003 Venezuela data were genotype-level), and Observed variable (trait definitions). Ambiguities and inconsistencies in original documentation were noted and addressed where possible.
Environmental data modeling: Weather data were transformed into RDF using the AEMET weather ontology to represent weather stations, locations, and variable measurements.
FAIR Data Point (FDP): An FDP was constructed to expose dataset metadata in a hierarchical structure (FDP → Catalog → Dataset → Distributions). The standard FDP metadata model focuses on resource-level descriptors and lacked domain-specific content descriptors necessary for effective discovery. The authors extended the Dataset-level metadata by embedding MIAPPE descriptors to enable content-oriented indexing and searchability. A later FDP implementation addressed this by adopting DCAT2 for enhanced interoperability.
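The hierarchy described above, with content descriptors embedded at the Dataset level, can be sketched in plain Python. This is only an illustration of the idea, not the FDP metadata model itself or the authors' RDF: the class names, the `miappe` dictionary keys, the `find_datasets` helper, and the endpoint URL are all assumptions introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Distribution:
    title: str
    access_url: str   # hypothetical URL, not the authors' deployment
    media_type: str


@dataclass
class Dataset:
    title: str
    themes: list[str]
    # Content descriptors embedded at Dataset level. The keys are illustrative,
    # not actual MIAPPE or PPEO property names.
    miappe: dict[str, list[str]] = field(default_factory=dict)
    distributions: list[Distribution] = field(default_factory=list)


@dataclass
class Catalog:
    title: str
    datasets: list[Dataset] = field(default_factory=list)


@dataclass
class FairDataPoint:
    title: str
    catalogs: list[Catalog] = field(default_factory=list)

    def find_datasets(self, key: str, value: str) -> list[Dataset]:
        """Return datasets whose embedded descriptor `key` contains `value`."""
        return [
            ds
            for cat in self.catalogs
            for ds in cat.datasets
            if value in ds.miappe.get(key, [])
        ]


fdp = FairDataPoint(
    title="Example FDP",
    catalogs=[
        Catalog(
            title="Phenotypic data",
            datasets=[
                Dataset(
                    title="CxE potato field trials",
                    themes=["plant phenotyping"],
                    miappe={
                        "observed_variable": ["tuber weight"],
                        "country": ["Netherlands", "Venezuela", "Finland", "Ethiopia"],
                    },
                    distributions=[
                        Distribution(
                            title="SPARQL endpoint",
                            access_url="http://localhost:3030/pheno/sparql",
                            media_type="application/sparql-results+json",
                        )
                    ],
                )
            ],
        )
    ],
)

# A content-oriented search that a generic theme label alone could not answer.
hits = fdp.find_datasets("observed_variable", "tuber weight")
```

With only the `themes` field, a search for a specific trait would have nothing to match against; the embedded descriptors are what make the dataset discoverable by content.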
Data exposure and access: RDF data (phenotypic and weather) were served via a Jena Fuseki triple store as SPARQL endpoints (two datasets/endpoints, one for phenotypic, one for weather, supporting federated queries). FDP metadata were exposed as RDF/Turtle via a Python Flask server. The phenotypic dataset distribution included a SPARQL endpoint for querying.
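To make the access path more concrete, the following sketch builds a SPARQL 1.1 Protocol GET request for a federated query that reaches from one endpoint into the other via the `SERVICE` keyword. The endpoint URLs (Fuseki's default port 3030 with dataset names `pheno` and `weather`) and the `example.org` class IRIs are assumptions made for illustration; the authors' actual queries and vocabulary terms are not reproduced here. No request is sent, so the snippet runs without a triple store.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Assumed local Fuseki endpoints; the real deployment URLs are not given.
PHENO_ENDPOINT = "http://localhost:3030/pheno/sparql"
WEATHER_ENDPOINT = "http://localhost:3030/weather/sparql"

# SERVICE is the SPARQL 1.1 keyword for federated queries.
# The class IRIs below are placeholders, not PPEO or AEMET terms.
FEDERATED_QUERY = f"""
SELECT ?study ?station WHERE {{
  ?study a <http://example.org/Study> .
  SERVICE <{WEATHER_ENDPOINT}> {{
    ?station a <http://example.org/WeatherStation> .
  }}
}}
""".strip()


def sparql_get_url(endpoint: str, query: str) -> str:
    """Build a SPARQL Protocol GET URL; the query goes in the `query` parameter."""
    return f"{endpoint}?{urlencode({'query': query})}"


request_url = sparql_get_url(PHENO_ENDPOINT, FEDERATED_QUERY)

# Decode the URL again to confirm the query survives URL encoding unchanged.
decoded_query = parse_qs(urlparse(request_url).query)["query"][0]
```

The resulting `request_url` could be fetched with any HTTP client, typically with an `Accept: application/sparql-results+json` header.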
Discovery and integration workflow: Users navigate from the FDP root to relevant catalogs (phenotypic), inspect datasets, and select the SPARQL distribution. They query Investigation and Study metadata to summarize studies (IDs, locations, dates, coordinates, altitude). Overlapping genotypes across experiments are verified via SPARQL (101 common genotypes across all five). Weather data discovery occurs similarly; federated SPARQL queries match studies to nearby weather stations by computing distances between study coordinates and station coordinates. Trait selection and comparability checks precede aggregation: for plant-level data (most studies), per-plant values are averaged to genotype-level; for 2003 Venezuela, data were already genotype-level. Cross-study comparisons focus on tuber weight per genotype, plotting across experiments, including for the subset of genotypes present in all studies. An integrative measure, cumulative photo-beta thermal time (PBTT), is computed per experiment (combining temperature and photoperiod), and tuber weight performance ranges (min–max per genotype) are plotted against PBTT to visualize genotype performance stability across environments.
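The aggregation steps in this workflow (averaging plant-level values to genotype level, finding genotypes shared by all studies, and taking a min-max range per genotype) can be sketched as follows. The authors perform these steps with SPARQL and Jupyter notebooks; this standalone Python version uses invented genotype IDs and tuber weights purely for illustration, and the function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# (study, genotype, plant-level tuber weight); all values are invented.
plant_records = [
    ("1999NL", "CE001", 1.25), ("1999NL", "CE001", 1.75),
    ("1999NL", "CE002", 0.5), ("1999NL", "CE002", 1.0),
    ("2004Fin", "CE001", 1.0), ("2004Fin", "CE001", 0.5),
    ("2004Fin", "CE003", 0.5),
]
# The 2003 Venezuela data were already genotype-level, so no averaging is needed.
genotype_records = {("2003VE", "CE001"): 0.5, ("2003VE", "CE002"): 0.25}


def genotype_means(records):
    """Average per-plant values to one value per (study, genotype)."""
    grouped = defaultdict(list)
    for study, genotype, value in records:
        grouped[(study, genotype)].append(value)
    return {key: mean(values) for key, values in grouped.items()}


def common_genotypes(table):
    """Return the genotypes present in every study of the table."""
    by_study = defaultdict(set)
    for study, genotype in table:
        by_study[study].add(genotype)
    return set.intersection(*by_study.values())


def performance_range(table, genotype):
    """Return (min, max) of a genotype's values across studies."""
    values = [v for (_, g), v in table.items() if g == genotype]
    return min(values), max(values)


per_genotype = {**genotype_means(plant_records), **genotype_records}
shared = common_genotypes(per_genotype)
ce001_range = performance_range(per_genotype, "CE001")
```

In this toy data only `CE001` appears in all three studies, which mirrors how the overlap check narrows the comparison to the genotypes common to every experiment.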
Reproducibility: The pipeline, RDF generation scripts (PPEO-based), Jupyter notebooks for data exploration and SPARQL queries, and code to run the FDP and triple store are provided in a public repository with Zenodo archival DOIs.
Key Findings
- FAIR-based discovery and integration were successfully demonstrated for multi-environment plant phenotyping data using MIAPPE, an FDP, RDF, and SPARQL endpoints, enabling cross-domain federation with environmental data.
- Genotype overlap: 101 genotypes were common across all five experiments, permitting multi-environment comparisons.
- Environmental alignment: Federated SPARQL queries matched each study location to an appropriate weather station by minimal coordinate distance (e.g., Finland studies matched to Ruukki station; Wageningen to WUR station; Merida to Merida station; Holeta to Holeta station), confirming availability of relevant weather data.
- Integrated metric: Cumulative photo-beta thermal time (PBTT) per experiment was computed: 1999NL = 56.18; 2003VE = 23.11; 2004Fin = 31.31; 2005Fin = 39.3; 2010ET = 24.13.
- Trait variability: Visualizations of tuber weight per genotype across experiments revealed sharp differences among genotypes and substantial variability for the same genotype across environments, with performance stability interpretable from min–max lines plotted against PBTT.
- Efficiency insights: FAIRifying historical data and structuring their metadata are expected to substantially reduce researcher effort. Before FAIR practices, harmonizing and interpreting data took about two weeks per experiment on average; with structured MIAPPE documentation and FDP-based discovery, this could drop to hours for investigators already familiar with the material.
- Metadata enhancement: Embedding MIAPPE metadata into the FDP Dataset level improved content-oriented findability and machine-actionable search compared to generic theme labels.
- Interoperability and reusability were achieved through explicit licensing (CC-BY 4.0), MIAPPE-compliant metadata, and RDF-based data models, supporting reproducible integration and exploration.
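The environmental alignment finding rests on matching each study to the station at minimal distance. The source does not state which distance formula the federated queries use, so the sketch below substitutes the haversine great-circle distance as one reasonable choice. The coordinates are rough, illustrative values for the named places, not taken from the paper's data, and the function names are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest_station(study_coord, stations):
    """Return the name of the station closest to the study coordinates."""
    return min(stations, key=lambda name: haversine_km(*study_coord, *stations[name]))


# Approximate (lat, lon) pairs for illustration only.
stations = {
    "WUR": (51.97, 5.67),
    "Merida": (8.59, -71.15),
    "Ruukki": (64.66, 25.10),
    "Holeta": (9.06, 38.49),
}
studies = {
    "1999NL": (51.96, 5.64),
    "2003VE": (8.60, -71.20),
    "2004Fin": (64.70, 25.00),
    "2005Fin": (64.70, 25.00),
    "2010ET": (9.00, 38.50),
}

matches = {study: nearest_station(coord, stations) for study, coord in studies.items()}
```

Because the four sites lie thousands of kilometres apart, any sensible distance measure gives the same matches here; the choice of formula would matter only where candidate stations are close to one another.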
Discussion
The study addresses how FAIR principles and community standards can enable reproducible reuse of heterogeneous plant phenotyping data. By FAIRifying legacy datasets with MIAPPE and exposing them via an FDP and SPARQL endpoints, the authors demonstrate end-to-end discovery, access, and integration with environmental data. This directly responds to the challenges identified in the case study (format heterogeneity, incomplete documentation, and logistical barriers) by providing rich, machine-readable metadata and uniform access protocols.
Findability improved through FDP indexing and embedded MIAPPE descriptors that summarize dataset content, enabling users and machines to assess relevance without downloading data. Accessibility and interoperability were realized using HTTP, RDF serialized as Turtle (TTL), SPARQL, and ontologies (PPEO for phenotyping and AEMET for weather). Reusability benefited from MIAPPE minimum information and explicit licensing, allowing reliable interpretation and combination of datasets across domains.
Nonetheless, the work reveals practical hurdles: metadata often existed as free text and required substantial curation; experimental details known to original data generators were not always recorded in structured form; and ambiguous trait definitions and inconsistent abbreviations impeded integration. The high technical barrier (RDF/SPARQL) suggests the need for user-friendly interfaces and better indexing of FDPs. The discussion also emphasizes that defining domain-appropriate “minimum information” for findability is crucial, as generic themes are insufficient for effective search.
Overall, the pipeline shows that FAIR practices make meta-analyses more feasible and reproducible, while highlighting areas for community improvement in metadata completeness, tooling, and standard adoption.
Conclusion
This work contributes a practical demonstration of FAIRifying and reusing heterogeneous plant phenotyping data by integrating MIAPPE-compliant metadata, an FDP for discovery, RDF representations, and SPARQL-based querying, including cross-domain federation with environmental data. The proof of concept shows tangible benefits for findability, interoperability, and reusability, and provides reproducible materials (code, data, notebooks).
Future directions include: broader adoption of DCAT2-enhanced FDPs and indexing across repositories for content-aware search; improved graphical user interfaces to lower the expertise barrier; clearer domain-specific minimum information profiles for findability; prioritization strategies for FAIRifying historical datasets; increased automation of FAIRification when legacy formats are consistent; and stronger encouragement by journals to require and promote data standards and FAIR-ready submissions.
Limitations
- The proof of concept was conducted only once, limiting the generalizability of conclusions about FAIRification benefits and scalability.
- No new biological analyses were performed; the focus was on data discovery, integration, and exploratory visualization.
- Metadata completeness was limited by the original documentation; some environmental and event details were sparse or missing, constraining reuse.
- FAIRification steps involved manual curation and domain knowledge; automation was limited and may depend on the consistency of legacy data formats.
- FDP findability remains constrained without broader indexing infrastructures; current implementations require enhancement for content-based search and better user experience.