The increasing use of large language models trained on vast, inconsistently documented datasets raises significant legal and ethical concerns. This multidisciplinary study audits over 1,800 text datasets to improve data transparency. Tools and standards were developed to trace dataset lineage, including sources, creators, licenses, and usage. The analysis reveals a sharp disparity in licensing between commercially viable and non-commercial datasets, with restrictions concentrated on low-resource languages, creative tasks, and synthetic data. High rates of license miscategorization and omission on popular hosting sites were also observed. The study's findings highlight a crisis in data misattribution and offer tools, including the Data Provenance Explorer (www.dataprovenance.org), to improve dataset transparency and support responsible AI development.
Publisher
Nature Machine Intelligence
Published On
Aug 30, 2024
Authors
Shayne Longpre, Robert Mahari, Anthony Chen, Naana Obeng-Marnu, Damien Sileo, William Brannon, Niklas Muennighoff, Nathan Khazam, Jad Kabbara, Kartik Perisetla, Xinyi (Alexis) Wu, Enrico Shippole, Kurt Bollacker, Tongshuang Wu, Luis Villa, Sandy Pentland, Sara Hooker
Tags
large language models
data transparency
dataset lineage
licensing issues
responsible AI
data misattribution
multidisciplinary study