On responsible machine learning datasets emphasizing fairness, privacy and regulatory norms with examples in biometrics and healthcare

Computer Science

S. Mittal, K. Thakral, et al.

This study highlights the critical need for responsible machine learning datasets, focusing on fairness, privacy, and compliance with regulations. Conducted by Surbhi Mittal, Kartik Thakral, Richa Singh, Mayank Vatsa, Tamar Glaser, Cristian Canton Ferrer, and Tal Hassner, it uncovers widespread issues within computer vision datasets, especially in biometrics and healthcare, and proposes a new evaluation rubric.

Introduction
The paper addresses the question of how to systematically assess the responsibility of machine learning datasets by quantifying three critical dimensions: fairness, privacy, and regulatory compliance. Motivated by growing concerns over bias, privacy risks, and manipulation in AI systems, as well as by data cascades that undermine performance and trust, the authors argue that data quality and stewardship are as important as algorithmic advances. They focus on human-centric data, particularly biometric (face) and healthcare imaging, where regulations and ethical stakes are prominent. The purpose is to develop and apply a quantitative audit rubric that evaluates datasets for inclusivity and diversity of demographic representation (fairness), for the presence of privacy-sensitive annotations (privacy), and for adherence to regulatory norms such as institutional approval, consent, and data expungement (compliance). By auditing 60 datasets, the study aims to highlight systemic shortcomings, trends over time, and trade-offs such as the fairness-privacy paradox, ultimately informing better dataset design and governance.
Literature Review
The authors survey evolving discourse on dataset quality, noting both qualitative and quantitative approaches. Prior qualitative frameworks include datasheets for datasets (Gebru et al.) and responsible data development processes (Hutchinson et al.), as well as work on documenting dataset context and archival-inspired stewardship addressing consent, inclusivity, and privacy. Quantitative efforts have largely focused on bias at the model level, audits of algorithms, and toolkits for bias analysis; fewer works quantify fairness and privacy jointly at the dataset level. Privacy quantification has leveraged human judgments of sensitive content, inference of locations from auxiliary data, and predictive models of privacy risk, alongside formal privacy metrics like k-anonymity, l-diversity, t-closeness, and M-invariance. Policy and regulatory literature highlights the GDPR’s role in governing biometric data, global proliferation of data protection laws, domain-specific statutes like HIPAA and BIPA, and the European AI Act’s implications for trustworthy AI, including facial processing. The review identifies gaps: limited joint, quantitative assessment of fairness, privacy, and regulatory compliance specifically at the dataset level, motivating the proposed rubric.
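To make the formal privacy metrics named above concrete, the following minimal Python sketch checks k-anonymity on a toy table. The records, field names, and choice of quasi-identifiers are illustrative and are not drawn from the paper.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the smallest equivalence-class size over the quasi-identifier columns.

    A table is k-anonymous when every combination of quasi-identifier values
    is shared by at least k records.
    """
    groups = Counter(tuple(rec[q] for q in quasi_identifiers) for rec in records)
    return min(groups.values())

# Toy example: with quasi-identifiers (age band, ZIP prefix) this table is 2-anonymous.
records = [
    {"age": "20-30", "zip": "110", "diagnosis": "A"},
    {"age": "20-30", "zip": "110", "diagnosis": "B"},
    {"age": "30-40", "zip": "302", "diagnosis": "A"},
    {"age": "30-40", "zip": "302", "diagnosis": "C"},
]
print(k_anonymity(records, ["age", "zip"]))  # prints 2
```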
Methodology
The study introduces a responsible audit rubric to quantitatively assess datasets across three axes: fairness (F), privacy (P), and regulatory compliance (R). The rubric is applied to 60 publicly accessible datasets (52 face-based biometric and 8 healthcare chest X-ray datasets), selected after screening more than 100 candidates and excluding decommissioned, inaccessible, or very small datasets. A minimal code sketch of the score computations follows at the end of this section.

Quantifying fairness: The framework evaluates three components: (1) inclusivity, (2) diversity, and (3) label reliability. Demographics D = {gender, skin tone, ethnicity, age} are considered, each with a predefined subgroup set S (e.g., gender includes male, female, and other; ethnicity follows FairFace with an added mixed-race category; age uses AgeDB groupings). Inclusivity for demographic i is the ratio of subgroups present in the dataset to the total number of predefined subgroups. Diversity is computed with Shannon's diversity index over the subgroup distribution (based on sample counts, or on numbers of individuals if sample counts are unavailable), normalized to lie between 0 and 1 per demographic. Label reliability reflects the source of demographic annotations: self-reported labels score 1.0, classifier-generated labels 0.67, and apparent (human-observed) labels 0.33; for healthcare labels, a score of 1 is assigned if a medical professional provides or validates the annotations, and 0 otherwise. The overall fairness score F aggregates inclusivity, diversity, and label reliability across demographics, with a maximum of 5 for the considered set; higher F indicates a fairer dataset.

Quantifying privacy: The audit inspects dataset publications, websites, and repositories for the presence of six privacy-relevant annotation categories A = {A1: name identification; A2: sensitive or protected attributes such as gender and race; A3: accessories (e.g., hats, sunglasses); A4: critical objects (e.g., credit cards, signatures); A5: location information (coordinates, landmarks); A6: medical condition information}. A privacy leakage score PL sums the presence of these attributes (one point each), and the privacy preservation score is its complement, P = 6 − PL, so datasets exposing fewer of these attributes score higher. Equal weights are assigned by default; sensitivity analyses can reweight the attributes.

Quantifying regulatory compliance: The regulatory score R is the sum (0–3) of three binary factors validated from dataset documentation: (1) institutional approval (e.g., IRB or equivalent), (2) individual consent obtained, and (3) a data expungement/correction facility for subjects. Missing documentation defaults to zero.

Analysis procedure: For each dataset, the authors compute the 3-tuple (F, P, R), visualize distributions and trends, and cluster datasets in the FPR space. For face datasets, a 3D scatter plot is augmented with a hypothetical ideal dataset (perfect F, P, and R), and DBSCAN clustering (eps = 1) identifies clusters and outliers. A weighted-average responsibility score in [0, 1] is also reported by normalizing each score by its maximum (F/5, P/6, R/3) and averaging. All computations are reproducible with code and data provided via OSF and the project page.
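The Python sketch below illustrates how the F, P, and R scores described above could be computed. It is a minimal reconstruction from the description in this section, not the authors' released code: the subgroup lists, function names, and in particular the aggregation that rescales fairness to the 0-5 range are assumptions made for illustration.

```python
import math

# Illustrative demographic subgroup sets. These are placeholders: the paper
# specifies gender = {male, female, other}, ethnicity following FairFace plus a
# mixed-race category, and AgeDB-style age groups; exact labels may differ.
SUBGROUPS = {
    "gender": {"male", "female", "other"},
    "skin_tone": {"light", "medium", "dark"},
    "ethnicity": {"white", "black", "indian", "east_asian", "southeast_asian",
                  "middle_eastern", "latino_hispanic", "mixed"},
    "age": {"0-12", "13-18", "19-25", "26-35", "36-45", "46-55", "56-65", "66+"},
}

# Label reliability by annotation source, as described above.
RELIABILITY = {"self_reported": 1.0, "classifier": 0.67, "apparent": 0.33}

# The six privacy-relevant annotation categories A1-A6.
PRIVACY_ATTRIBUTES = ["name", "sensitive_attributes", "accessories",
                      "critical_objects", "location", "medical_condition"]


def inclusivity(present_subgroups, demographic):
    """Ratio of predefined subgroups that actually appear in the dataset."""
    predefined = SUBGROUPS[demographic]
    return len(set(present_subgroups) & predefined) / len(predefined)


def diversity(counts, demographic):
    """Shannon diversity over subgroup sample counts, normalized to [0, 1].

    Normalizing by the maximum entropy of the predefined subgroup set is an
    assumption; the paper only states the index is normalized per demographic.
    """
    total = sum(counts.values())
    k = len(SUBGROUPS[demographic])
    if total == 0 or k < 2:
        return 0.0
    h = -sum((c / total) * math.log(c / total) for c in counts.values() if c > 0)
    return h / math.log(k)


def fairness_score(annotations):
    """Aggregate inclusivity, diversity and label reliability into F (0 to 5).

    `annotations` maps a demographic to {"counts": {subgroup: n}, "source": str}.
    The exact aggregation that yields the 0-5 range is not given above, so the
    rescaling here (per-demographic mean of the three components, averaged over
    the four demographics, times 5) is purely illustrative.
    """
    per_demo = []
    for demo in SUBGROUPS:
        spec = annotations.get(demo)
        if spec is None:  # demographic not annotated at all
            per_demo.append(0.0)
            continue
        inc = inclusivity(spec["counts"], demo)
        div = diversity(spec["counts"], demo)
        rel = RELIABILITY[spec["source"]]
        per_demo.append((inc + div + rel) / 3.0)
    return 5.0 * sum(per_demo) / len(SUBGROUPS)


def privacy_score(present_attributes, weights=None):
    """Count the attribute categories present (equal weights by default) as PL;
    the preservation score is P = 6 - PL, so fewer exposed attributes score higher."""
    weights = weights or {a: 1.0 for a in PRIVACY_ATTRIBUTES}
    leakage = sum(weights[a] for a in PRIVACY_ATTRIBUTES if a in present_attributes)
    return len(PRIVACY_ATTRIBUTES) - leakage


def regulatory_score(institutional_approval, consent, expungement):
    """R in 0..3: one point per documented governance factor."""
    return int(institutional_approval) + int(consent) + int(expungement)


def responsibility(f, p, r):
    """Normalized weighted-average responsibility score in [0, 1]."""
    return (f / 5.0 + p / 6.0 + r / 3.0) / 3.0
```

For instance, a web-scraped face dataset with only classifier-predicted gender labels, name and location annotations, and no documented approvals would land at roughly F ≈ 0.8, P = 4, R = 0 under this sketch, for a responsibility score of about 0.28, in line with the low values reported in the next section.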
Key Findings
- Across 60 datasets, the mean ± s.d. fairness score was 0.96 ± 0.64 (max 5), indicating low fairness overall; 91% of datasets scored ≤ 2/5 on fairness.
- UTKFace had the highest fairness among the listed datasets with F = 2.71 (of 5), still just over half the maximum. Average fairness was higher for healthcare datasets (1.34 ± 0.17) than for biometric datasets (0.90 ± 0.67).
- Privacy preservation scores P (max 6) showed that few datasets achieved perfect preservation; DroneSURF contained no private information and achieved P = 6. Healthcare datasets de-identified identities but disclosed medical conditions and sometimes locations, reducing P.
- Regulatory compliance scores R (max 3) averaged 0.58; the FB Fairness Dataset (Casual Conversations) satisfied all three criteria (R = 3). 89% of datasets had R = 0 or 1.
- The fairness-privacy paradox was observed: including sensitive attribute annotations supports fairness evaluation but increases privacy leakage risk.
- Clustering in FPR space (DBSCAN, eps = 1) for face datasets yielded five clusters with centers at (F, P, R): (0.67, 5, 2), (1.14, 4, 1), (1.37, 3, 1), (0.69, 4.94, 0.28), and (1.45, 3, 0); their distances from the ideal FPR dataset were 4.56, 4.79, 5.11, 5.20, and 5.53, respectively. Two outliers, the FB Fairness Dataset (1.56, 5, 3) and UTKFace (2.71, 5, 1), were closest to the ideal (see the sketch after this list).
- Using the normalized weighted-average responsibility score, the top datasets were the FB Fairness Dataset (Casual Conversations), IISCIFD, and UTKFace. High R boosted the FB Fairness Dataset and IISCIFD; high F boosted UTKFace.
- Temporal trend: fairness and regulatory scores have generally increased for more recently published datasets.
- Source trend: non-web-collected datasets achieved higher fairness and regulatory scores on average; many large web-curated datasets lacked institutional approval and often had unreliable annotations, adversely affecting fairness and compliance.
- Table 2 provides detailed per-dataset F, P, R, and weighted averages for the 52 biometric and 8 healthcare datasets, corroborating overall low fairness and compliance and moderately higher privacy preservation.
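The reported distances can be reproduced (up to rounding of the published cluster centers) under two assumptions: the hypothetical ideal sits at the maximum scores (F = 5, P = 6, R = 3) and distances are Euclidean in FPR space. The short sketch below, with illustrative names and formatting, recomputes those distances for the cluster centers and the two outliers, along with the normalized responsibility score defined in the Methodology.

```python
import math

IDEAL = (5.0, 6.0, 3.0)  # assumed ideal: maximum F, P, and R

POINTS = {
    "cluster 1": (0.67, 5.00, 2.00),
    "cluster 2": (1.14, 4.00, 1.00),
    "cluster 3": (1.37, 3.00, 1.00),
    "cluster 4": (0.69, 4.94, 0.28),
    "cluster 5": (1.45, 3.00, 0.00),
    "FB Fairness Dataset": (1.56, 5.00, 3.00),
    "UTKFace": (2.71, 5.00, 1.00),
}

for name, fpr in POINTS.items():
    dist = math.dist(fpr, IDEAL)                       # Euclidean distance to the ideal
    resp = (fpr[0] / 5 + fpr[1] / 6 + fpr[2] / 3) / 3  # normalized responsibility score
    print(f"{name:20s} distance = {dist:.2f}   responsibility = {resp:.2f}")
```

Under these assumptions the cluster-center distances come out as 4.56, 4.79, 5.12, 5.21, and 5.53, matching the reported values up to rounding, while the two outliers are markedly closer to the ideal (about 3.58 and 3.20) and receive the highest responsibility scores (about 0.72 and 0.57), consistent with their top ranking.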
Discussion
The audit demonstrates that widely used biometric and healthcare imaging datasets frequently underperform on fairness and regulatory compliance, despite relatively better privacy preservation. These results directly address the study’s central aim of quantifying dataset responsibility across F, P, and R and underscore the pivotal role of data stewardship in trustworthy AI. The identified fairness-privacy paradox highlights a core tension: sensitive attribute labels are necessary for fairness assessment and debiasing but increase privacy risk, complicating design choices and regulatory interpretations. The analysis also reveals that datasets with explicit governance (institutional approvals, consent mechanisms, expungement) achieve higher overall responsibility scores, and that controlled, non-web collection improves fairness and compliance relative to web scraping. Trends indicate gradual improvement over time, suggesting that tightening regulations and community norms are positively influencing dataset practices. The findings advocate for integrating quantitative audits with qualitative documentation (datasheets) and for embedding regulatory-compliant processes to enhance dataset responsibility for downstream ML training.
Conclusion
Whereas contemporary research predominantly focuses on developing trustworthy machine learning algorithms, our work emphasizes assessing the integrity of AI by examining datasets through the lens of fairness, privacy and regulatory compliance. We conduct a large-scale audit of datasets, specifically those related to faces and chest X-rays, and propose recommendations for creating responsible ML datasets. Our objective is to initiate a dialogue on establishing quantifiable criteria for dataset responsibility, anticipating that these criteria will be further refined in subsequent studies. Such progress would facilitate effective dataset examination, ensuring alignment with responsible AI principles. As global data protection laws tighten, the scientific community must reconsider how datasets are crafted. We advocate for the implementation of quantitative measures, combined with qualitative datasheets and the proposed recommendations, to encourage the creation of responsible datasets. This initiative is vital for advancing responsible AI systems. We lay the groundwork for an ethical and responsible AI research and development framework by merging quantitative analysis with qualitative evaluations and practical guidance.
Limitations
- The fairness computation relies on label distributions across predefined demographic subgroups and does not account for image-level diversity (e.g., duplicates, quality) within subgroups.
- For some applications, unequal subgroup distributions may be appropriate (e.g., allocating more data to harder-to-model groups), which the current fairness formulation does not capture.
- The current F, P, and R definitions are tailored to human-centric datasets; object-centric datasets can face fairness issues but are not addressed, as regulations primarily concern impacts on people.
- Intersectional fairness is not explicitly quantified because of combinatorial subgroup expansion; future work is needed to capture intersectionality.
- The recommendations and datasheet guidance aim for high standards that may exceed current technological capabilities; they serve as aspirational goals.
- Model unlearning and removal of training data remain open research problems; the framework does not solve post hoc data removal from models.
- Design choices (e.g., the Shannon diversity index, equal weighting of privacy attributes) are one of many possible options and may be adapted to application needs.