Abstract
The increasing use of large language models trained on vast, inconsistently documented datasets raises significant legal and ethical concerns. This multidisciplinary study audits over 1,800 text datasets to improve data transparency. Tools and standards were developed to trace dataset lineage, including sources, creators, licenses, and usage. The analysis reveals a sharp divide in licensing between commercially viable and non-commercial datasets, with restrictions disproportionately affecting low-resource languages, creative tasks, and synthetic data. High rates of license miscategorization and omission on popular hosting sites were also observed. The findings highlight a crisis in data misattribution, and the study offers tools, including the Data Provenance Explorer (www.dataprovenance.org), to improve dataset transparency and support responsible AI development.
Publisher
Nature Machine Intelligence
Published On
Aug 30, 2024
Authors
Shayne Longpre, Robert Mahari, Anthony Chen, Naana Obeng-Marnu, Damien Sileo, William Brannon, Niklas Muennighoff, Nathan Khazam, Jad Kabbara, Kartik Perisetla, Xinyi (Alexis) Wu, Enrico Shippole, Kurt Bollacker, Tongshuang Wu, Luis Villa, Sandy Pentland, Sara Hooker
Tags
large language models
data transparency
dataset lineage
licensing issues
responsible AI
data misattribution
multidisciplinary study