Major AI Image Dataset Is Back Online After Being Pulled Over CSAM


The open-source LAION-5B dataset used to train AI image generators has been re-released after it was pulled last year when child sexual abuse material (CSAM) was discovered among the billions of pictures.

LAION, a German nonprofit research organization, says it has worked with the Stanford Internet Observatory — which discovered the CSAM — and the nonprofits Internet Watch Foundation, Human Rights Watch, and the Canadian Centre for Child Protection to cleanse the dataset of harmful imagery.

The newly released dataset is called Re-LAION-5B and is available to download in two versions: Re-LAION-5B research and Re-LAION-5B research-safe, with the latter additionally removing NSFW content. Thousands of CSAM links have been filtered out of both sets, and both are available under the Apache 2.0 license.

“LAION has been committed to removing illegal content from its datasets from the very beginning and has implemented appropriate measures to achieve this from the outset,” LAION writes in a blog post. “LAION strictly adheres to the principle that illegal content is removed ASAP after it becomes known.”

As TechCrunch notes, LAION never actually hosted these images. The dataset is a curated index of links to images and their corresponding alt text, all drawn from Common Crawl — a separate web-crawl dataset.
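To make that concrete, here is a minimal sketch (not LAION's own tooling) of how such a link-and-caption index is typically consumed. It assumes the metadata ships as parquet shards with URL and TEXT columns, as in earlier LAION metadata releases; the shard filename is hypothetical.

```python
# Minimal sketch of consuming a link-and-caption index.
# Assumes parquet shards with URL and TEXT columns, as in
# earlier LAION releases; the filename below is hypothetical.
import pandas as pd
import requests

shard = pd.read_parquet("relaion5b-research-part-00000.parquet")  # hypothetical shard

for row in shard.head(3).itertuples():
    # Each record is only a pointer plus a caption: the image
    # bytes are fetched from the original web host, not from LAION.
    try:
        resp = requests.get(row.URL, timeout=10)
    except requests.RequestException:
        continue  # linkrot is common in web-scale indexes
    if resp.ok:
        print(f"{row.TEXT!r}: {len(resp.content)} bytes")
```

This pointer-based design is why cleaning the dataset meant filtering links rather than deleting files: LAION can drop entries from its index, but the underlying images live elsewhere on the web.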

LAION said that in total, 2,236 links were removed from LAION-5B, which contains 5.5 billion image-text pairs.

The removal followed a study from the Stanford Internet Observatory in December last year. At the time, the Observatory's chief technologist, David Thiel, condemned the practice of scraping billions of images from the open web and making them available to AI image companies, accusing companies of "rushing to market" with generative AI products.

“Taking an entire internet-wide scrape and making that dataset to train models is something that should have been confined to a research operation, if anything, and is not something that should have been open-sourced without a lot more rigorous attention,” Thiel said at the time.

The Stanford report recommended that AI image generators trained on LAION-5B “should be deprecated and distribution ceased where feasible.” TechCrunch reports that Runway — which partnered with Stability AI — recently removed the Stable Diffusion 1.5 model from the AI hosting platform Hugging Face.

LAION says that its dataset is intended for research, not commercial, purposes. However, Google once confirmed it used LAION data to build the first iteration of its Imagen model, and it is widely suspected that most AI image companies have trained on LAION's datasets.
