AI, ML & Data Engineering

Google's JEST Algorithm Automates AI Training Dataset Curation and Reduces Training Compute

Jul 30, 2024

Google DeepMind recently published a new algorithm for curating AI training datasets: multimodal contrastive learning with joint example selection (JEST), which uses a pre-trained model to score the learnability of batches of data. Google's experiments show that image-text models trained with JEST-curated data require 10x less computation than baseline methods.

JEST addresses the problem of curating training datasets; that is, filtering a dataset to choose the specific examples that will be most effective in training a model. Because manually curating datasets is time-consuming, JEST automates the process by using a pre-trained reference model to select the best batches of samples based on their learnability score, which combines the losses from both the reference model and the learner model being trained. The goal is to find batches that have a high loss for the learner but a low loss for the reference, which means that the data is both "unlearned and learnable." According to Google,

[W]e find that central to the performance of our framework is the ability to steer the curation process towards the distribution of smaller, well-curated datasets...Crucially, we find this process [enables] strong data quality bootstrapping: a reference model trained on a small curated dataset can effectively guide the curation of a much larger dataset, allowing the training of a model which strongly surpasses the quality of the reference model on many downstream tasks.
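The learnability score described above can be illustrated with a minimal sketch. It assumes the score is simply the learner's loss minus the reference model's loss, and it scores examples one at a time, whereas JEST scores whole batches jointly. The function names and loss values are hypothetical and chosen only for illustration.

```python
def learnability_scores(learner_losses, reference_losses):
    """Per-example learnability: learner loss minus reference loss (assumed form)."""
    if len(learner_losses) != len(reference_losses):
        raise ValueError("loss lists must have the same length")
    return [l - r for l, r in zip(learner_losses, reference_losses)]


def top_k_indices(scores, k):
    """Indices of the k highest-scoring examples, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Hypothetical per-example losses for four candidate examples.
learner_losses = [2.0, 0.25, 2.5, 1.0]     # loss under the model being trained
reference_losses = [0.5, 0.25, 2.25, 0.5]  # loss under the pre-trained reference

scores = learnability_scores(learner_losses, reference_losses)
selected = top_k_indices(scores, k=2)
print(scores)    # [1.5, 0.0, 0.25, 0.5]
print(selected)  # [0, 3]
```

Example 0 is hard for the learner but easy for the reference, so it scores highest. Example 1 is already learned (both losses are low), and example 2 is hard for both models, which suggests noisy data, so both score low.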

JEST is applied during the training process. Given a large super-batch of training data, JEST builds the training batch iteratively: it selects chunks, or sub-batches, by calculating their joint learnability conditioned on the sub-batches previously sampled. The research team found that this improves the quality of the batches, similar to the concept of hard negatives.
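The iterative selection can be sketched roughly as follows. This is a simplified illustration, not DeepMind's implementation. It assumes a hypothetical matrix `pair_scores`, where `pair_scores[i][j]` is the learnability contribution of pairing image `i` with text `j`, and it assumes each chunk is sampled without replacement with probability proportional to the exponential of its conditional score. All names and the random scores are made up for the example.

```python
import math
import random


def conditional_scores(pair_scores, selected, candidates):
    """Score each candidate given the examples already in the batch."""
    scores = {}
    for i in candidates:
        s = pair_scores[i][i]  # the candidate's own contribution
        for j in selected:     # interactions with previously sampled examples
            s += pair_scores[i][j] + pair_scores[j][i]
        scores[i] = s
    return scores


def sample_chunk(scores, chunk_size, rng):
    """Sample chunk_size distinct candidates, weighted by exp(score)."""
    pool = dict(scores)
    chunk = []
    for _ in range(chunk_size):
        ids = list(pool)
        top = max(pool.values())
        weights = [math.exp(pool[i] - top) for i in ids]  # shift for stability
        pick = rng.choices(ids, weights=weights, k=1)[0]
        chunk.append(pick)
        del pool[pick]
    return chunk


def select_sub_batch(pair_scores, batch_size, n_chunks, seed=0):
    """Build a batch chunk by chunk, conditioning on earlier chunks."""
    if batch_size % n_chunks != 0 or batch_size > len(pair_scores):
        raise ValueError("batch_size must divide into chunks and fit the super-batch")
    rng = random.Random(seed)
    chunk_size = batch_size // n_chunks
    selected = []
    remaining = list(range(len(pair_scores)))
    for _ in range(n_chunks):
        scores = conditional_scores(pair_scores, selected, remaining)
        chunk = sample_chunk(scores, chunk_size, rng)
        selected.extend(chunk)
        remaining = [i for i in remaining if i not in chunk]
    return selected


# Hypothetical super-batch of 8 examples with random pairwise scores.
data_rng = random.Random(42)
pair_scores = [[data_rng.uniform(-1.0, 1.0) for _ in range(8)] for _ in range(8)]
sub_batch = select_sub_batch(pair_scores, batch_size=4, n_chunks=2)
print(sub_batch)  # four distinct indices drawn from the super-batch
```

Because the second chunk is scored against the first, examples that interact strongly with those already chosen become more likely to be picked, which is where the analogy with hard negatives comes from.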

Because the learnability score is computed online during training, it imposes some additional compute cost. To address this, JEST uses model approximation for efficient scoring; for example, the vision component of the reference model can drop layers or image patches. The researchers also improved efficiency by training the learner at different image resolutions.
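As a rough illustration of the patch-dropping idea, the sketch below keeps a random subset of an image's patch tokens before scoring. It shows the general technique only. The `drop_patches` helper, the 50% keep ratio, and the integer stand-ins for patch embeddings are assumptions, not details from the paper.

```python
import random


def drop_patches(patches, keep_ratio, rng):
    """Keep a random subset of patch tokens, preserving their original order."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, round(len(patches) * keep_ratio))
    keep = sorted(rng.sample(range(len(patches)), n_keep))
    return [patches[i] for i in keep]


# Hypothetical image split into a 4x4 grid: 16 patch tokens, shown as IDs.
patches = list(range(16))
kept = drop_patches(patches, keep_ratio=0.5, rng=random.Random(0))
print(len(kept))  # 8
```

The scoring model then processes half as many tokens per image, which reduces the cost of each scoring pass in exchange for a coarser view of the image.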

The DeepMind team ran several experiments to evaluate JEST. They first trained an image-text reference model on a curated dataset based on the Web Language Image (WebLI) dataset. They then trained learner models using JEST and compared them to models trained using baseline uniform batch selection. Models trained using JEST achieved the same benchmark performance as the baseline models while requiring 10x fewer training FLOPs.

In a discussion on Hacker News, several users praised DeepMind's work. One wrote:

So the paper itself is pretty significant, I think, from looking at it. The general methodology seems to be: train small model as a discriminatory scoring model on very high quality data...This turns out to be significant FLOPs and quality win, even counting for the initial model training and scoring part of it...As always, appreciate the publishing from DeepMind - this looks like great work.

Another user pointed out that JEST was similar to another method called Cappy, which also uses a "pretrained small scorer." Other related techniques include RHO-LOSS, which inspired JEST and is open-source. Google has not open-sourced JEST.

About the Author

Anthony Alford

Anthony is a Senior Director, Development at Genesys, where he is working on several AI and ML projects related to customer experience. He has over 20 years of experience in designing and building scalable software. Anthony holds a Ph.D. degree in Electrical Engineering with specialization in Intelligent Robotics Software and has worked on various problems in the areas of human-AI interaction and predictive analytics for SaaS business optimization.
