A major AI training data set contains millions of examples of personal data

https://www.technologyreview.com/2025/07/18/1120466/a-major-ai-training-data-set-contains-millions-of-examples-of-personal-data/ (www.technologyreview.com)
The article discusses the discovery of personally identifiable information (PII) within DataComp CommonPool, a large open-source dataset used for training image generation models. Researchers found millions of images of sensitive documents like passports and credit cards, estimating that hundreds of millions of images contain PII. The presence of this data raises significant privacy concerns, as the dataset has been widely downloaded and used to train various AI models. The authors highlight the challenges in effectively filtering PII from web-scraped data and the need for the AI community to address these ethical issues.
0 points by raj 2 months ago
