Interdisciplinary Studies
A large-scale audit of dataset licensing and attribution in AI
S. Longpre, R. Mahari, et al.
This study by Shayne Longpre, Robert Mahari, and colleagues examines data transparency in the text datasets used to train and finetune large language models. Auditing more than 1,800 datasets, it documents widespread licensing omissions and mislabeling and makes the case for tools that support responsible AI development.
~3 min • Beginner • English
Introduction
The paper investigates the crisis in data transparency, licensing, and attribution across widely used text datasets for training and finetuning language models. As model developers increasingly combine and repackage thousands of datasets with minimal documentation, the study asks how to systematically trace dataset lineage (sources, creators, licenses, subsequent uses) and quantify the state of licensing accuracy and availability in the ecosystem. The context includes rising legal scrutiny, diminished disclosure of training sources, and ethical risks such as data leakage, exposure of personally identifiable information (PII), and unintended biases. The purpose is to create tools, standards, and a large-scale audit to inform responsible dataset use and improve provenance transparency. The importance lies in mitigating legal and ethical risks, enabling informed dataset selection, and supporting accountability for the data driving modern AI breakthroughs.
Literature Review
The authors situate their work within efforts on dataset documentation and transparency, including datasheets for datasets, data statements, and model cards. They reference recent large-scale pretraining corpora and instruction-tuning collections (e.g., The Pile, RefinedWeb, FLAN, Super-Natural Instructions), as well as analyses of webtext documentation and “documentation debt.” They highlight trends toward repackaging diverse sources, the decline in disclosing training data in proprietary models, and prior calls for improved transparency and bias analysis in multilingual and multimodal datasets. The legal backdrop references scholarship on fair use and machine learning, derivative works, indirect liability, and ongoing litigation related to training on copyrighted materials. This synthesis underscores gaps in current practices and motivates a rigorous, scalable approach to provenance and licensing audits.
Methodology
Scope and selection: The audit covers 44 widely used alignment/instruction-finetuning data collections comprising 1,858 individual text datasets. Collections were selected by legal and AI experts for their adoption and impact (many receive hundreds to millions of downloads per month). Overlaps among collections were not deduplicated, to preserve original formatting and curation choices.
Data Provenance Explorer (DPExplorer) and DPCollection: The team designed a schema and tools to collect and expose three categories of information per dataset: (1) identifiers (links/IDs across GitHub, Hugging Face, Papers with Code, Semantic Scholar, arXiv), (2) dataset characteristics (languages, task categories, topics, text length metrics, dialogue formats, time of collection), and (3) provenance (licenses and their conditions, original data sources, creators, attribution links, citation and download counts). Data are provided via an open-source repository and an interactive web interface (www.dataprovenance.org) that supports filtering by license conditions and auto-generates provenance cards.
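To make the schema concrete, the sketch below shows what a single dataset's record could look like across the three categories above. The field names and values are illustrative assumptions, not the exact DPCollection keys.

```python
# Illustrative provenance record mirroring the three schema categories.
# All field names and values are hypothetical examples, not the actual
# DPCollection schema.
record = {
    "identifiers": {
        "hf_id": "example-org/example-dataset",  # hypothetical Hugging Face ID
        "github_url": None,
        "semantic_scholar_id": None,
    },
    "characteristics": {
        "languages": ["en"],
        "task_categories": ["summarization"],
        "target_len_chars": {"min": 12, "mean": 240.5, "max": 4096},
        "format": "zero-shot",
    },
    "provenance": {
        "licenses": ["CC BY-SA 4.0"],
        "use_category": "commercial",            # vs. NC/A-O or unspecified
        "attribution_required": True,
        "share_alike": True,
        "text_sources": ["wikipedia.org"],
        "creators": ["Example University"],
    },
}
```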
Metadata acquisition pipeline: The collection used a mix of manual and automated procedures (Extended Data Fig. 3). Automated extraction retrieved license indicators from Hugging Face configurations and GitHub pages; the Semantic Scholar API provided release dates and citation counts. Text metrics (min/mean/max input and target lengths, dialogue turns) were computed in characters to avoid tokenizer biases across languages. GPT-4 was used to annotate topics by sampling 100 examples per dataset and to assist experts as an in-context retriever to extract mentions of dataset sources from arXiv papers.
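As a minimal sketch of how the character-based length metrics could be computed (assuming each example is a plain input/target string pair; this is not the authors' actual pipeline code):

```python
from statistics import mean

def char_length_stats(examples):
    """Min/mean/max input and target lengths, measured in characters.

    Measuring characters rather than tokens avoids tokenizer biases
    across languages. `examples` is an iterable of
    (input_text, target_text) string pairs.
    """
    def summarize(lengths):
        return {"min": min(lengths), "mean": mean(lengths), "max": max(lengths)}

    input_lens = [len(inp) for inp, _ in examples]
    target_lens = [len(tgt) for _, tgt in examples]
    return {"input": summarize(input_lens), "target": summarize(target_lens)}

# Toy usage with two examples
stats = char_length_stats([
    ("Translate to English: bonjour", "hello"),
    ("Summarize the passage.", "A short summary."),
])
print(stats["target"])
```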
License annotation protocol (human-in-the-loop): Legal experts and trained annotators followed a structured protocol: (1) aggregate all self-reported licenses from GitHub, arXiv, Hugging Face, Papers with Code, and collection sources; (2) search for explicit data licenses (distinct from code licenses) attributable to data authors; (3) identify license type (standard, custom, request form, unspecified), listing multiples if present; (4) categorize licenses along three practical axes—permitted use (commercial vs non-commercial/academic-only), attribution requirement, and share-alike—using the strictest applicable condition across multiple licenses; (5) determine original text sources, tracing back to the root dataset and including newly introduced sources across derivation stages; (6) collect additional provenance signals (e.g., prevalence of use, creators, potential competitor status) to support organization-specific risk tolerances. The authors note this process reflects best efforts and does not constitute legal advice; enforceability depends on circumstances discussed in the legal section.
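A minimal sketch of step (4), resolving multiple licenses to the strictest applicable condition on each axis; the rank ordering and the small condition table here are illustrative assumptions, not the authors' full mapping.

```python
# Resolve several self-reported licenses to the strictest condition per
# axis (permitted use, attribution, share-alike). The rank ordering and
# condition table are illustrative assumptions.
USE_RANK = {"commercial": 0, "unspecified": 1, "non-commercial/academic-only": 2}

LICENSE_CONDITIONS = {  # hypothetical subset of the full mapping
    "CC BY 4.0":    {"use": "commercial", "attribution": True, "share_alike": False},
    "CC BY-SA 4.0": {"use": "commercial", "attribution": True, "share_alike": True},
    "CC BY-NC 4.0": {"use": "non-commercial/academic-only", "attribution": True, "share_alike": False},
}

def strictest(licenses):
    """Combine licenses by taking the strictest value on each axis."""
    conds = [LICENSE_CONDITIONS[name] for name in licenses]
    return {
        "use": max((c["use"] for c in conds), key=USE_RANK.__getitem__),
        "attribution": any(c["attribution"] for c in conds),
        "share_alike": any(c["share_alike"] for c in conds),
    }

print(strictest(["CC BY 4.0", "CC BY-NC 4.0"]))
# {'use': 'non-commercial/academic-only', 'attribution': True, 'share_alike': False}
```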
Dataset characteristics and measures: Task categories (>20 types) and formats (zero-shot, few-shot, chain-of-thought, multi-turn dialogue, response ranking) were catalogued. Diversity was measured via normalized Shannon entropy for discrete features and differential entropy for continuous features. A language representation score per country was computed using population-weighted coverage of languages present across datasets.
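For the discrete case, the normalized Shannon entropy is H = −(Σ p_i log p_i) / log K over the K observed feature values, so 0 indicates a single dominant category and 1 a uniform spread. A minimal sketch (a plain re-implementation of the standard formula, not the authors' code):

```python
import math
from collections import Counter

def normalized_shannon_entropy(values):
    """Normalized Shannon entropy of a discrete feature.

    H = -(sum of p_i * log p_i) / log K over the K distinct values;
    returns 0.0 for a single category and 1.0 for a uniform spread.
    """
    counts = Counter(values)
    n = sum(counts.values())
    k = len(counts)
    if k <= 1:
        return 0.0
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(k)

# Example: task-category labels across a small set of datasets
print(normalized_shannon_entropy(["qa", "qa", "translation", "summarization"]))
```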
Software: The analysis relied on a modern Python stack (datasets, huggingface-hub, pandas, numpy, pyarrow, openai, etc.) and JavaScript visualization libraries (observablehq, P5, D3) for DPExplorer. Repositories for data and code are openly available.
Key Findings
• Licensing omissions and errors are widespread on aggregators: Across GitHub, Hugging Face, and Papers with Code, 69–72% of dataset licenses were unspecified, compared with 30% after the authors’ re-annotation. The team reassigned 46–65% of dataset licenses per platform, substantially improving coverage. Aggregators labeled use cases too permissively in 27–29% (Hugging Face/GitHub) and 16% (Papers with Code) of cases, often by confusing code licenses with data licenses. On Hugging Face, 66% of analyzed licenses differed in use category from the authors’ annotations, commonly marked more permissively than the original author’s intent.
• License type distribution: The most common were CC BY-SA 4.0 (15.7%), OpenAI Terms of Use (12.3%), and CC BY 4.0 (11.6%). A long tail of custom licenses accounted for 9.6%. Overall, 85% of dataset licenses request attribution and 30% include share-alike, creating challenges for large compilations that must meet numerous attribution and compatibility requirements.
• Commercially viable vs non-commercial/academic-only (NC/A-O) datasets: NC/A-O datasets exhibit significantly greater diversity of tasks, topics, and sources and have much longer target text lengths. Table 3 reports mean target lengths of 1,580.7±965.6 characters for NC/A-O vs 102.7±14.6 for commercial and 90.5±14.3 for unspecified; input lengths are similar across categories. Creative, brainstorming, explanation, logic, and math tasks are overrepresented in NC/A-O, whereas commercially viable sets skew toward short text generation, translation, and classification. Government and search query sources tend to be commercially usable; general web, exams, and model-generated sources are more restricted.
• Synthetic data and licensing: 45% of NC/A-O datasets are synthetic (often generated via commercial APIs, notably OpenAI), versus <14% in commercial/unspecified categories. The rise of synthetic datasets helps explain longer target lengths and broader task/topic diversity among NC/A-O data.
• Temporal shift toward restrictive licensing: In 2023, 61% of traced datasets were NC/A-O, and only 12% were unspecified—contrasting with pre-2022 years when 50–80% lacked licenses. This trend suggests increasing reliance on explicit, often restrictive licenses.
• Language coverage: Commercial datasets show slightly greater language variety overall, but many language families (Turkic, Sino-Tibetan, Japonic, Indo-European’s long tail) have >35% NC/A-O coverage. Code language datasets are predominantly commercially viable (78%). Geographically, dataset language representation is heavily skewed toward English and Western Europe, with sparse coverage for many countries in the Global South.
• Data sources and creators: Highly used sources include wikipedia.org (14.9%), undisclosed webpage crawls (7.0%), Reddit (6.2%), and Twitter (4.0%); least represented include commerce, reviews, legal, academic papers, and search queries. Major dataset contributors include AI2 (the Allen Institute for AI; 12.3%), the University of Washington (8.9%), and Facebook AI Research (8.4%).
Discussion
The findings demonstrate a systemic provenance and licensing gap: widely used aggregators frequently omit or mislabel dataset licenses, often more permissively than the original terms. This creates practical, ethical, and legal risk for practitioners, since training is costly and its effects are largely irreversible once problematic data has been used. The audit and tooling enable developers to assess provenance, filter datasets by license constraints, and document sources through auto-generated provenance cards.
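As an illustration of this kind of license-aware filtering, the pandas sketch below keeps only datasets whose resolved use category permits commercial use; the column names are assumptions, not the actual DPCollection schema.

```python
import pandas as pd

# Hypothetical frame mirroring the audited metadata; column names are
# assumptions, not the actual DPCollection schema.
df = pd.DataFrame([
    {"dataset": "a", "use_category": "commercial", "attribution": True, "share_alike": False},
    {"dataset": "b", "use_category": "non-commercial/academic-only", "attribution": True, "share_alike": True},
    {"dataset": "c", "use_category": "unspecified", "attribution": False, "share_alike": False},
])

# The most conservative selection discussed above: commercially
# licensed data only.
commercial_ok = df[df["use_category"] == "commercial"]
print(commercial_ok["dataset"].tolist())  # ['a']
```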
Legally, the paper underscores unresolved questions about training on copyrighted works (copying during crawling, derivative work status) and the evolving application of fair use in the United States. While fair use may protect training on copyrighted materials in some contexts, the authors argue supervised datasets created explicitly for ML are less likely to fall under fair use; thus, license terms govern their usage. They also highlight ambiguity around LLM-generated annotations (e.g., OpenAI Terms of Use potentially restricting use of outputs to train competing models) and possible indirect liability theories or contractual claims.
Given these uncertainties, provenance-aware practices can reduce risk: using only commercial-licensed data, negotiating permissions for restricted sets, or making informed, context-dependent risk assessments. The authors encourage dataset creators to select and communicate appropriate licenses and regulators to clarify license enforceability to promote responsible, inclusive, and transparent ML. The work provides actionable infrastructure—DPExplorer and the audited DPCollection—to attribute data, assess compliance, and support more rigorous documentation for multi-source training corpora.
Conclusion
This work delivers the most extensive public audit to date of text dataset provenance for alignment/finetuning: (1) a curated DPCollection of 1,858 datasets across 44 widely used collections with enriched licensing, source, and creator metadata; (2) the open DPExplorer interface and repositories to filter, download, and document data by provenance; and (3) a landscape analysis exposing a widening divide between commercially open and non-commercial/academic-only data, with creative, long-form, synthetic, and many lower-resource language datasets concentrated under restrictive terms. The authors’ re-annotation substantially reduces unspecified licenses and corrects permissive mislabeling, enabling more confident dataset selection.
They advocate for better licensing hygiene by dataset creators, broader adoption of data provenance cards for scalable attribution, and regulatory clarification on license enforceability and fair use boundaries. Future work includes extending the audit beyond text and finetuning to pretraining and multimodal data, improving automated provenance extraction, expanding coverage of non-Western and low-resource languages, and developing community standards for dataset license compatibility and documentation.
Limitations
• Scope and selection: The initial focus is on instruction/alignment finetuning datasets; findings may not generalize to all pretraining or domain-specific corpora. Collections were selected by expert judgment and popular adoption, introducing potential selection bias.
• No deduplication across collections: Overlapping datasets/examples were retained to preserve original design choices, which can complicate aggregate statistics and license interactions.
• License enforceability and legal interpretation: Annotations reflect best efforts and do not constitute legal advice; enforceability depends on jurisdiction, facts of creation/derivation, and whether dataset creators hold copyright in their compilations. Contractual terms (e.g., OpenAI Terms of Use) may have different implications for direct users vs third parties.
• Annotation constraints: Manual curation is time-intensive; some fields (e.g., sources, topics) used GPT-4 assistance and paper-based inference, which can introduce errors. Aggregator metadata and timestamps were used as proxies (e.g., for release timing), and topic/source identification may remain incomplete.
• Metrics and measures: Lengths measured in characters (to reduce tokenizer bias) are not directly comparable to token-based measures; entropy-based diversity metrics summarize but may not capture all aspects of dataset heterogeneity.