The Large-scale Artificial Intelligence Open Network (LAION) released LAION-5B, an AI training dataset containing over five billion image-text pairs. LAION-5B contains images and captions scraped from the internet and is 14x larger than its predecessor LAION-400M, making it the largest freely available image-text dataset.
The release was announced on the LAION blog. LAION-5B was collected by parsing files in the Common Crawl dataset to find image tags with alt-text values. The corresponding images were downloaded and filtered using CLIP to keep only those images whose content resembled their alt-text description. Overall, the dataset contains 2.32 billion images with English text, 2.26 billion with text from other languages, and 1.27 billion whose text language could not be determined unambiguously. The release also includes several nearest-neighbor indices of the data, a web demo using the data for semantic search, and a reproduction of CLIP trained on the data. LAION team members hope that releasing LAION-5B will democratize multi-modal AI research:
By releasing an updated version of an openly available dataset that contains 5 billion image-text pairs, we have set new standards for the scale of openly available datasets and enable researchers from all over the world to train state-of-the-art language-vision models like GLIDE or Turing Bletchley...This dataset extends the possibilities in multi-language large-scale training and research of language-vision models, that were previously restricted to those having access to proprietary large datasets, to the broad community.
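The nearest-neighbor indices and semantic-search demo mentioned above rest on embedding similarity search. The sketch below illustrates the general idea rather than LAION's actual infrastructure: it embeds a text query with an openly available CLIP checkpoint (via the Hugging Face transformers wrapper, an assumed dependency) and queries a small in-memory faiss index of placeholder image embeddings; LAION's released indices are pre-built over the full dataset.

```python
# Sketch of CLIP-embedding semantic search over a nearest-neighbor index.
# The checkpoint name is an illustrative choice, and the "image embeddings"
# here are random placeholders standing in for embeddings computed offline.
import numpy as np
import faiss
import torch
from transformers import CLIPModel, CLIPProcessor

model_name = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(model_name)
processor = CLIPProcessor.from_pretrained(model_name)

# Placeholder image embeddings (10,000 vectors of dimension 512 for ViT-B/32).
image_embeddings = np.random.rand(10_000, 512).astype("float32")
faiss.normalize_L2(image_embeddings)

index = faiss.IndexFlatIP(512)   # inner product on normalized vectors = cosine similarity
index.add(image_embeddings)

# Embed a text query and retrieve the most similar images.
inputs = processor(text=["a photo of a red bicycle"], return_tensors="pt", padding=True)
with torch.no_grad():
    query = model.get_text_features(**inputs)
query = torch.nn.functional.normalize(query, dim=-1).numpy().astype("float32")

scores, ids = index.search(query, 5)
print(ids[0], scores[0])   # indices and similarity scores of the nearest images
```

Normalizing both the stored embeddings and the query makes the inner-product index equivalent to cosine-similarity search, which is the usual setup for CLIP-based retrieval.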
Multi-modal AI models, especially those trained on combined image and text data, have made impressive gains in recent years, driven in part by large datasets. In 2021, OpenAI published their paper on Contrastive Language–Image Pre-training (CLIP), which was pre-trained on 400 million image-text pairs and achieved high performance on a variety of multi-modal benchmarks with no fine-tuning. Although OpenAI did open-source the CLIP code and model weights, they did not make their dataset publicly available. This prompted LAION to attempt to replicate OpenAI's dataset collection process; the resulting dataset, LAION-400M, was released last year, contains 413M image-text pairs, and has subsequently been used "in many papers and experiments."
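As a brief illustration of CLIP's zero-shot usage, the sketch below scores an image against a few candidate captions using the open-sourced code and weights; the image path and labels are placeholders, and the example closely follows the usage pattern shown in OpenAI's repository.

```python
# Zero-shot classification with the open-sourced CLIP code and weights
# (installable from the openai/CLIP GitHub repository). The image path and
# candidate captions below are placeholders.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # hypothetical local image
text = clip.tokenize(["a diagram", "a photo of a dog", "a photo of a cat"]).to(device)

with torch.no_grad():
    # The model returns image-text similarity logits; softmax turns them into probabilities.
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print(probs)  # highest probability for the caption that best matches the image
```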
The new dataset, LAION-5B, was collected using a three-stage pipeline. First, a distributed cluster of worker machines parsed datafiles from Common Crawl to collect all HTML image tags that had alt-text attributes. Language detection was performed on the alt-text; in cases where the detection confidence was low, the language was recorded as "unknown." The raw images were then downloaded from the tagged URLs and passed, along with their alt-text, to a CLIP model to compute embeddings for both. A similarity score was calculated for the two embeddings; pairs with low similarity were discarded. Additional filtering removed duplicates as well as samples where the text was fewer than five characters long or the image resolution was too large.
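The CLIP filtering step amounts to embedding each downloaded image and its alt-text, computing their cosine similarity, and discarding pairs that score below a cutoff. A simplified sketch, assuming the Hugging Face transformers CLIP wrapper and an illustrative threshold rather than LAION's exact pipeline or published cutoff:

```python
# Simplified sketch of the CLIP similarity filter: keep an image-text pair only
# if the embeddings of the image and its alt-text are similar enough.
# The threshold value is illustrative, not LAION's published cutoff.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

SIMILARITY_THRESHOLD = 0.28   # illustrative value

def keep_pair(image: Image.Image, alt_text: str) -> bool:
    """Return True if the CLIP cosine similarity of image and alt-text passes the threshold."""
    inputs = processor(text=[alt_text], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    img_emb = torch.nn.functional.normalize(img_emb, dim=-1)
    txt_emb = torch.nn.functional.normalize(txt_emb, dim=-1)
    similarity = (img_emb * txt_emb).sum().item()
    return similarity >= SIMILARITY_THRESHOLD
```

At LAION's scale this check runs in batches across the distributed worker cluster described above; the per-pair function here only shows the core similarity test.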
LAION engineer Romain Beaumont joined a Hacker News discussion about the release. In response to a criticism that the dataset was uncurated, Beaumont replied,
Non annotated datasets are the base of self supervised learning, which is the future of machine learning. Image/text with no human label is a feature, not a bug. We provide safety tags for safety concerns and watermark tags to improve generations...It also so happens that this dataset collection method has been proven by using LAION-400M to reproduce clip models. (And by a bunch of other models trained on it)
The LAION-5B dataset can be downloaded from the Hugging Face website.
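As a starting point, the released metadata (image URLs and captions) can be read directly from the Hub. The sketch below is a hedged example: the repository id and column names are assumptions based on how LAION publishes its parquet metadata, and the images themselves still have to be fetched from their source URLs, for example with the LAION team's img2dataset tool.

```python
# Sketch of streaming LAION metadata from the Hugging Face Hub.
# The repository id and the "URL"/"TEXT" column names are assumptions;
# only URLs and captions are hosted, not the images themselves.
from datasets import load_dataset

metadata = load_dataset("laion/laion2B-en", split="train", streaming=True)  # assumed repo id

for i, sample in enumerate(metadata):
    print(sample["URL"], "->", sample["TEXT"])   # assumed column names
    if i >= 4:
        break
```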