On the Readiness of Scientific Data Papers for a Fair and Transparent Use in Machine Learning

J. Giner-Miguelez, A. Gómez, et al.

This study analyzes how scientific data documentation aligns with machine learning and regulatory needs for fairness and trustworthiness. By examining 4,041 data papers across domains and comparing them with NeurIPS D&B dataset descriptions, the authors identify coverage gaps and trends and propose practical recommendations to make datasets more transparent and ML-ready. Research conducted by Joan Giner-Miguelez, Abel Gómez, and Jordi Cabot.

Abstract

To ensure the fairness and trustworthiness of machine learning (ML) systems, recent legislative initiatives and relevant research in the ML community have pointed out the need to document the data used to train ML models. Besides, data-sharing practices in many scientific domains have evolved in recent years for reproducibility purposes. In this sense, academic institutions' adoption of these practices has encouraged researchers to publish their data and technical documentation in peer-reviewed publications such as data papers. In this study, we analyze how this broader scientific data documentation meets the needs of the ML community and regulatory bodies for its use in ML technologies. We examine a sample of 4041 data papers of different domains, assessing their coverage and trends in the requested dimensions and comparing them to those from an ML-focused venue (NeurIPS D&B), which publishes papers describing datasets. As a result, we propose a set of recommendation guidelines for data creators and scientific data publishers to increase their data's preparedness for its transparent and fairer use in ML technologies.

Publisher

Scientific Data

Published On

Jan 13, 2025

Authors

Joan Giner-Miguelez, Abel Gómez, Jordi Cabot

DOI

https://doi.org/10.1038/s41597-025-04402-4

Explore these studies to deepen your understanding

Adjacent work that informs or extends this paper's methodology and findings.

Psychology

Neuroimaging the effects of smartphone (over-)use on brain function and structure—a review on the current state of MRI-based findings and a roadmap for future research

C. Montag and B. Becker

Computer Science

Using the interest theory of rights and Hohfeldian taxonomy to address a gap in machine learning methods for legal document analysis

A. Izzidien

Medicine and Health

Recent Advancements and Perspectives in the Diagnosis of Skin Diseases Using Machine Learning and Deep Learning: A Review

J. Zhang, F. Zhong, et al.
