
On the Readiness of Scientific Data Papers for a Fair and Transparent Use in Machine Learning
J. Giner-Miguelez, A. Gómez, et al.
This study analyzes how the documentation of scientific data papers aligns with the needs of machine learning practitioners and regulators for fairness and trustworthiness. By examining 4,041 data papers across domains and comparing them with dataset papers from the NeurIPS Datasets & Benchmarks track, the authors identify coverage gaps and trends and propose practical recommendations to make datasets more transparent and ML-ready. Research conducted by Joan Giner-Miguelez, Abel Gómez, and Jordi Cabot.
Introduction
The paper investigates whether scientific data papers published across disciplines provide the documentation dimensions needed for fair and trustworthy use in machine learning. Motivated by documented harms from dataset biases and poor generalization in ML models, and by regulatory and community calls for better dataset documentation, the authors evaluate the presence and evolution of key documentation dimensions: uses, generalization limits, social concerns, maintenance, and provenance details on collection and annotation. Two research questions guide the study: (RQ1) to what extent are data papers documented with the dimensions demanded by recent ML documentation frameworks, and (RQ2) how have documentation practices evolved in recent years with respect to those dimensions? The work situates the analysis in the context of the FAIR principles and the rise of data papers, assessing how existing scientific data documentation aligns with the needs articulated by ML frameworks such as Datasheets for Datasets and Croissant.
Literature Review
Prior work emphasizes structured documentation for datasets to mitigate harms and improve model reliability. Datasheets for Datasets and related frameworks propose documenting recommended and non-recommended uses, generalization limits, social concerns, and maintenance policies, along with detailed provenance of collection and annotation processes. FAIR principles and institutional data management plans have promoted data sharing and documentation via data papers. Studies have examined peer review of datasets, data citation practices, and publisher guidelines, and highlighted the need to characterize the people involved in data creation (e.g., collection and annotation teams) and, when applicable, profiles of data subjects. In NLP, data statements encourage reporting speech context (dialect, modality), while Croissant and DescribeML propose machine-readable metadata for ML datasets. Community works also stress maintenance and deprecation practices, ethical issues, and licensing for ML uses.
Methodology
The authors analyzed 4,041 open-access data papers published between 2015 and June 2023 in two interdisciplinary data journals: Nature's Scientific Data (2,549 papers) and Elsevier's Data in Brief (1,492 papers). The journals were selected for active status, interdisciplinary scope, and English-language data papers. The pipeline proceeded as follows:
- Paper identification: OpenAlex was used to list publications, and in-house scripts with web scraping verified the data paper type via publisher sites (DOI-based). Full manuscripts (PDFs) were obtained from the publishers.
- Text preparation: SciPDF (GROBID) parsed the manuscripts, excluding headers, footers, and references, and sections were chunked into passages (max ~1,000 words, with overlap and section titles retained for context). Figures and tables were processed with tabula-py; table captions and the paragraphs referencing them were combined and summarized with an LLM to aid detection of dimensions often reported in tabular form.
- Dimension extraction: a Retrieval-Augmented Generation approach tailored per dimension (custom retriever prompts and chains), followed by a final zero-shot classification step using a BART model fine-tuned on MNLI to determine the presence or absence of each dimension. The pipeline also classified the types of collection and annotation processes and the team types against predefined categories. A sketch of this classification step appears below.
- Topic analysis: dataset uses were clustered with BERTopic, with semi-supervised cleaning and a fine-tuned language model for topic representation.
- Comparative sample: 232 dataset papers from NeurIPS Datasets & Benchmarks (2021–2023) were collected via OpenReview; supplementary documentation attachments (data cards, etc.) were merged into the manuscripts for analysis.
- Extraction validation: reported accuracies from prior work (uses: 88.26%, collection: 70%, annotation: 81.25%) plus a manual evaluation on ~1% of the sample, which found high accuracy for presence detection of limits/social concerns (94.59%), ML-tested (97.30%), collection team profile (97.10%), and target profile (83.33%), with lower accuracy for identifying an annotation process (83.78%) and annotation validation methods (72.22%).
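The paper's exact prompts and retrieval chains are not reproduced here; the following is a minimal sketch of the final presence/absence step, assuming the publicly available facebook/bart-large-mnli checkpoint and Hugging Face's zero-shot classification pipeline. The passage text and candidate labels are hypothetical, not the authors' actual prompts.

```python
# Minimal sketch of the presence/absence check for one documentation dimension.
# Assumes the public facebook/bart-large-mnli checkpoint; the passage and
# candidate labels below are illustrative, not the authors' actual prompts.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# A retrieved passage (e.g., returned by the per-dimension RAG retriever).
passage = (
    "The dataset was collected from wearable sensors worn by 30 volunteers; "
    "results may not generalize to older populations or other sensor models."
)

# Hypothetical candidate labels for the "generalization limits" dimension.
labels = [
    "mentions generalization limits of the dataset",
    "does not mention generalization limits",
]

result = classifier(passage, candidate_labels=labels)
present = result["labels"][0] == labels[0] and result["scores"][0] > 0.5
print(result["labels"][0], round(result["scores"][0], 3), "-> present:", present)
```

In the study, such passages come from the per-dimension retriever and the labels are tailored to each dimension; the 0.5 threshold here is purely illustrative.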
Key Findings
Key findings from the analysis:
- Sample size and diversity: 4,041 data papers, with a growing publication trend over 2015–2023 (a dip in 2021; 2023 is partial). 16.5% of the datasets represent people (used to assess social concerns and the collection target profile). Topics are diverse, with prominent areas such as RNA sequencing (9.1%), medical imaging (8.1%), chemistry/physics (5.7%), material properties (4.9%), and climate/hydrology (4.6%).
- Collection processes: types are varied, e.g., physical data collection 29%, direct measurement 13%, manual human curation 12%, software collection 11%, document analysis 10%.
- Teams: collection teams are mainly internal (88.05%), with external (11.20%) and crowdsourced (0.74%) teams rare; annotation teams are similarly internal (86.83%), external (11.27%), or crowdsourced (1.90%). Only 42.28% of papers include a human-involved annotation process; common annotation types include text entity annotation (40%) and semantic segmentation (13%).
- Overall presence of dimensions (Fig. 5): collection sources and infrastructure 98.7%; collection description 97.40%; recommended uses 97.2%; annotation description 97.10%; annotation infrastructure 73.66%; profile of the collection target (people datasets) 64.15%; annotation validation 30.46%; speech context (language datasets) 17.82%; social concerns (people datasets) 12.35%; tested using an ML approach 8.81%; generalization limits 8.14%; collectors' profile 3.91%; annotators' profile 3.56%; maintenance policies 0.42%.
- Trends (2015–2023): some improvement in documenting generalization limits (3.3% in 2016 vs 12.35% in 2022) and the ML-tested dimension (a clear increase over time), while team profiling remains low and flat (a sketch of how such per-year rates can be computed appears after this list). Documented limits include non-recommended uses, collection constraints, and annotation caveats. Social concerns disclosures (12.35% of people datasets) mainly mention social bias, privacy, and sensitivity issues.
- Topic variation: medical imaging papers more often report team profiles (annotators 8.64%; collectors 8.79%) and ML testing (19.22%); human movement recognition (30.11%) and agriculture/cropland (16.15%) papers also frequently report ML testing.
- Across journals (Fig. 7): Scientific Data and Data in Brief differ mainly in the less-documented dimensions: generalization limits 12.37% vs 0.68%; social concerns 17.11% vs 4.29%; ML-tested 11.98% vs ~3.2%; annotation validation notably higher in Scientific Data (39%) than in Data in Brief (9.75%). Collection-related dimensions are similar across the two venues.
- NeurIPS comparison (Fig. 9): relative to the data journals, NeurIPS Datasets & Benchmarks papers document more of the ML-focused dimensions: annotation infrastructure 95.59% (vs 73.66%), annotation validation 61.23% (vs 30.46%), speech context 73.03% (vs 17.82%), social concerns 58.58% (vs 12.35%), ML-tested 93.53% (vs 8.81%), generalization limits 65.09% (vs 8.14%), and maintenance policies 53.88% (vs 0.42%). Team profiling remains scarce in both (collectors' profile 2.59% vs 3.91%; annotators' profile 2.20% vs 3.56%).
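Percentages such as the year-over-year figures above are simple presence rates over the extracted labels. The snippet below is a minimal sketch of that aggregation, assuming a hypothetical CSV of per-paper extraction results with one boolean column per dimension plus the publication year; the file and column names are illustrative, not the authors' released schema.

```python
# Sketch: per-year presence rate of one dimension from hypothetical extraction output.
# The file name and column names are illustrative assumptions, not the paper's schema.
import pandas as pd

df = pd.read_csv("extracted_dimensions.csv")  # columns: paper_id, year, has_limits, ...

# Share of papers per year that document generalization limits (e.g., 3.3% in 2016).
trend = (
    df.groupby("year")["has_limits"]
      .mean()               # fraction of papers with the dimension present
      .mul(100)             # express as a percentage
      .round(2)
)
print(trend)
```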
Discussion
Findings indicate that dimensions explicitly requested by journal submission guidelines (recommended uses, collection and annotation descriptions, sources and infrastructure) are consistently present, answering part of RQ1. However, critical ML-relevant dimensions, namely generalization limits, social concerns, maintenance policies, ML-tested results, and detailed profiles of collection and annotation teams, are underreported, limiting the transparent and fair use of these datasets for ML. Trends over time show modest improvement in some areas (e.g., generalization limits, ML-tested), partially addressing RQ2, but many gaps persist. Differences between venues suggest that aligning author guidelines with ML documentation frameworks, as NeurIPS Datasets & Benchmarks does, is associated with markedly higher reporting of ML-critical dimensions (limits, social concerns, maintenance, speech context, annotation validation). The analysis supports strengthening submission guidelines and providing structured templates to capture the underrepresented dimensions. It also highlights metadata shortcomings and the need for text analysis tools to enrich and validate dataset documentation. Improving machine-readable descriptions (e.g., Croissant, DescribeML) could enhance discoverability and suitability assessment for ML use cases.
Conclusion
The study provides a large-scale assessment of documentation practices in scientific data papers relative to ML needs, revealing strong coverage of commonly requested dimensions but substantial gaps in ML-critical aspects such as generalization limits, social concerns, maintenance policies, ML testing, and team profiling. The authors propose actionable recommendations and structured templates to improve submission guidelines, including standardized reporting of limitations, maintenance policies (maintainers, update schedules, errata, deprecation), ML experiment details (tasks, models, metrics, references), profiles of data participants (teams and targets), and annotation guidelines, infrastructure, and validation. They advocate integrating ML-oriented licensing (e.g., the Montreal Data License) and machine-readable documentation (e.g., Croissant, DescribeML) into submission workflows to support long-term fair and transparent reuse, particularly in ML contexts. Future directions include developing a taxonomy of data limitation aspects to help authors reason systematically about generalization limits, improving metadata quality, and continuing to use text analysis tools to complement structured metadata for better dataset discoverability.
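To make the idea of machine-readable documentation concrete, the sketch below assembles a minimal Croissant-style JSON-LD record in Python. This is an illustrative sketch only: the property names approximate the Croissant vocabulary, a real record should follow the MLCommons Croissant specification (and its validator) for the authoritative context and required fields, and the dataset name, URL, and file are hypothetical.

```python
# Illustrative sketch of a Croissant-style JSON-LD dataset description.
# Property names approximate the Croissant vocabulary; consult the MLCommons
# Croissant specification for the authoritative @context and required fields.
# The dataset name, URL, and file below are hypothetical.
import json

metadata = {
    "@context": "http://mlcommons.org/croissant/",  # placeholder; the spec defines the full context
    "@type": "sc:Dataset",
    "name": "example-sensor-dataset",
    "description": "Wearable-sensor recordings with documented uses, "
                   "generalization limits, and a maintenance policy.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "url": "https://example.org/datasets/example-sensor-dataset",
    "distribution": [{
        "@type": "cr:FileObject",
        "name": "recordings.csv",
        "contentUrl": "https://example.org/datasets/recordings.csv",
        "encodingFormat": "text/csv",
    }],
}

with open("croissant_metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```

Embedding such a record alongside the data paper would let repositories and ML tooling assess suitability (license, format, provenance) without parsing the manuscript text.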
Limitations
The analysis focuses on presence/absence of dimensions rather than completeness or depth. Extraction relies on LLM-based methods that may hallucinate or misclassify; reported accuracies from prior work are 88.26% (uses), 70% (collection), and 81.25% (annotation). Manual validation on ~1% of the sample showed high accuracy for presence detection of limits/social concerns (94.59%), ML-tested (97.30%), collection team profile (97.10%), and target profile (83.33%), with lower accuracy recognizing annotation processes (83.78%) and annotation validation methods (72.22%). The sample covers two journals and publications up to June 2023, which may limit generalizability. Some dimensions (social concerns, speech context) were evaluated only for relevant subsamples (people or language datasets). Metadata quality issues (e.g., funding) necessitate complementary text analysis; the extracted dataset should be used for statistical trend analysis rather than as ground truth.