We investigate the contents of web-scraped data used to train AI systems, at scales where human dataset curators and compilers no longer manually annotate every sample. Building on prior privacy concerns about machine learning models, we ask: What are the legal privacy implications of web-scraped machine learning datasets? In an empirical study of a popular training dataset, we find a significant presence of personally identifiable information despite sanitization efforts. Our audit provides concrete evidence for the concern that any large-scale web-scraped dataset may contain personal data. We use these findings from a real-world dataset to inform our legal analysis with respect to existing privacy and data protection laws. We surface various privacy risks of current data curation practices that may propagate personal information to downstream models. From our findings, we argue for a reorientation of current frameworks of "publicly available" information to meaningfully limit the development of AI built on indiscriminate scraping of the internet.
This paper investigates the legal and technical implications of using a large-scale web-scraped dataset, DataComp CommonPool, for training AI systems. Through an empirical audit, the authors reveal significant privacy concerns, particularly the presence of personally identifiable information (PII) despite cleaning efforts. The audit uncovers various types of sensitive data, including credit card and passport numbers, as well as resumes linked to identifiable individuals. The authors argue that current data curation practices are insufficient and call for more stringent privacy regulations. They advocate a reorientation in the understanding of ‘publicly available’ information, emphasizing the legal pitfalls and ethical considerations surrounding the use of such datasets in AI. The findings underscore the need for meaningful regulatory frameworks to address the privacy risks intrinsic to developing AI models on scraped internet data.
This paper employs the following methods:
- Legal Analysis
- Dataset Audit
- Empirical Study
The following datasets were used in this research:
- DataComp CommonPool
- LAION-5B
The paper reports the following key findings:
- Instances of personal information found in DataComp CommonPool
- Current cleaning methods are insufficient for privacy protection
- Critique of existing privacy frameworks based on audit findings
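The audit's detection of card-number-like strings can be illustrated with a minimal sketch. This is a hypothetical regex-plus-Luhn-checksum scan, not the authors' actual tooling; real audits of datasets like CommonPool also cover faces, resumes, and other PII types.

```python
import re

def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum
    (a necessary but not sufficient test for real card numbers)."""
    digits = [int(d) for d in number]
    # Double every second digit from the right; subtract 9 if the result > 9.
    for i in range(len(digits) - 2, -1, -2):
        digits[i] *= 2
        if digits[i] > 9:
            digits[i] -= 9
    return sum(digits) % 10 == 0

# Match 13-16 digit runs, optionally separated by spaces or dashes.
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def find_card_like_numbers(text: str) -> list[str]:
    """Flag substrings that look like payment card numbers."""
    hits = []
    for match in CARD_RE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if 13 <= len(digits) <= 16 and luhn_valid(digits):
            hits.append(digits)  # report the normalized digit string
    return hits
```

Even a scan like this shows why scale matters: false positives and formatting variants make it impractical to verify every flagged sample by hand across billions of examples.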
The authors identified the following limitations:
- Incomplete sanitization of personal data
- Challenges in identifying and controlling personal data due to scale
- Inability to audit all individual samples effectively
The paper reports the following compute details:
- Number of GPUs: None specified
- GPU Type: None specified
- Compute Requirements: None specified