The German research organization LAION recently released a new dataset, Re-LAION-5B, which it says has been thoroughly cleaned of known links to suspected child sexual abuse material (CSAM). The dataset is a re-release of an earlier dataset, LAION-5B, with fixes based on recommendations from organizations including the Internet Watch Foundation and Human Rights Watch.
The Clean-Up Process
LAION stated that it has been committed to removing illegal content from its datasets from the beginning and has implemented measures to that end. The new dataset is available in two versions: Re-LAION-5B Research and Re-LAION-5B Research-Safe, the latter of which also removes additional NSFW content. According to LAION, both versions were filtered to remove thousands of links to known and likely CSAM.
Evaluation of LAION’s Datasets
LAION’s datasets do not contain images. They are indexes of links to images, paired with image alt text, that LAION curated from a different dataset, the Common Crawl. Even so, LAION’s datasets have at times contained links to illegal content, as well as to inappropriate material such as pornographic imagery and racist slurs.
The release of Re-LAION-5B follows an investigation by the Stanford Internet Observatory in December 2023, which found that LAION-5B included links to illegal images scraped from social media posts and adult websites. The report also highlighted the presence of inappropriate content in the dataset, leading to concerns about the ethical implications of using such data.
Ethical Concerns and Recommendations
The Stanford report recommended that models trained on LAION-5B be deprecated and their distribution ceased where feasible. Although LAION temporarily took LAION-5B offline, concerns remain about how CSAM in the training data affects the output of AI models trained on such datasets. Some AI companies, such as Runway, have responded by removing models trained on the affected data.
LAION emphasized that third parties can use the metadata in Re-LAION-5B to clean their existing copies of LAION-5B by removing any matching illegal content. The organization reiterated that its datasets are intended for research purposes, not commercial use. Nevertheless, some organizations have used LAION’s datasets to train image-generating models, raising questions about the ethical use of such data.
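The article describes this clean-up only at a high level. As a rough illustration, the Python sketch below shows one way a third party might filter a local copy against the cleaned metadata. It assumes the clean-up works by keeping only the entries whose link still appears in the cleaned release and that entries are matched on URL. The record layout and the function names are hypothetical and are not taken from LAION's tooling.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical record shape: (image URL, alt text).
# Real metadata files may carry more fields than this.
Record = Tuple[str, str]


def normalize(url: str) -> str:
    """Strip surrounding whitespace so trivially different copies compare equal."""
    return url.strip()


def clean_local_copy(
    local_rows: Iterable[Record], cleaned_urls: Iterable[str]
) -> Iterator[Record]:
    """Yield only the local rows whose URL also appears in the cleaned metadata."""
    allowed = {normalize(u) for u in cleaned_urls}
    for url, alt_text in local_rows:
        if normalize(url) in allowed:
            yield (url, alt_text)


if __name__ == "__main__":
    # Placeholder data for illustration only.
    local = [
        ("https://example.com/a.jpg", "a cat on a sofa"),
        ("https://example.com/b.jpg", "a dog in a park"),
    ]
    cleaned = ["https://example.com/a.jpg"]
    for row in clean_local_copy(local, cleaned):
        print(row)
```

Keeping an allow-set in memory is the simplest design, but it would not scale to billions of entries as written. A real clean-up at that size would more likely join the two metadata tables on disk.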
The release of LAION’s new dataset raises important ethical considerations regarding the use of AI training data and the responsibility of organizations to ensure that their datasets are free from illegal or harmful content. As AI technologies continue to advance, it is crucial for researchers and companies to prioritize ethical practices and transparency in data collection and usage.