We investigate the contents of web-scraped data used to train AI systems, at scales where human dataset curators and compilers no longer manually annotate every sample. Building on prior privacy concerns about machine learning models, we ask: What are the legal privacy implications of web-scraped machine learning datasets? In an empirical study of a popular training dataset, we find a significant presence of personally identifiable information despite sanitization efforts. Our audit provides concrete evidence for the concern that any large-scale web-scraped dataset may contain personal data. We use these findings from a real-world dataset to inform our legal analysis with respect to existing privacy and data protection laws. We surface various privacy risks of current data curation practices that may propagate personal information to downstream models. From our findings, we argue for a reorientation of current frameworks of "publicly available" information to meaningfully limit the development of AI built on indiscriminate scraping of the internet.
This paper investigates the legal and technical implications of using a large-scale web-scraped dataset, DataComp CommonPool, for training AI systems. Through an empirical audit, the authors reveal significant privacy concerns, particularly the presence of personally identifiable information (PII) despite cleaning efforts. The audit uncovers various types of sensitive data, including credit card and passport numbers, as well as resumes linked to identifiable individuals. The authors argue that current data curation practices are insufficient and call for more stringent privacy regulations. They advocate a reorientation in the understanding of ‘publicly available’ information, emphasizing the legal pitfalls and ethical considerations surrounding the use of such datasets in AI. The findings underscore the need for meaningful regulatory frameworks to address the privacy risks intrinsic to developing AI models on scraped internet data.
This paper employs the following methods:
- Legal Analysis
- Dataset Audit
- Empirical Study
The following datasets were used in this research:
- DataComp CommonPool
- LAION-5B
The paper reports the following key findings:
- Instances of personal information found in DataComp CommonPool
- Current cleaning methods are insufficient for privacy protection
- Critique of existing privacy frameworks based on audit findings
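The audit's detection of card-number-like strings can be illustrated with a minimal sketch. This is a hypothetical regex-plus-Luhn-checksum scan, not the authors' actual tooling; real audits of datasets like CommonPool also cover faces, resumes, and other PII types.

```python
import re

def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum
    (a necessary but not sufficient test for real card numbers)."""
    digits = [int(d) for d in number]
    # Double every second digit from the right; subtract 9 if the result > 9.
    for i in range(len(digits) - 2, -1, -2):
        digits[i] *= 2
        if digits[i] > 9:
            digits[i] -= 9
    return sum(digits) % 10 == 0

# Match 13-16 digit runs, optionally separated by spaces or dashes.
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def find_card_like_numbers(text: str) -> list[str]:
    """Flag substrings that look like payment card numbers."""
    hits = []
    for match in CARD_RE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if 13 <= len(digits) <= 16 and luhn_valid(digits):
            hits.append(digits)  # report the normalized digit string
    return hits
```

Even a scan like this shows why scale matters: false positives and formatting variants make it impractical to verify every flagged sample by hand across billions of examples.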
The authors identified the following limitations:
- Incomplete sanitization of personal data
- Challenges in identifying and controlling personal data due to scale
- Inability to audit all individual samples effectively
The paper reports the following compute details:
- Number of GPUs: None specified
- GPU Type: None specified
- Compute Requirements: None specified