
Google Text Embedding Model Gecko Distills Large Language Models for Improved Performance

Gecko is a text embedding model that Google created by distilling knowledge from large language models (LLMs) into a general-purpose model. Gecko is trained using a novel approach on a variety of tasks, including document retrieval, semantic similarity, and classification, and aims to be both general-purpose and highly performant.

To train Gecko, Google created a fine-tuning dataset, dubbed FRet (Few-shot Prompted Retrieval), using queries generated from LLMs. LLMs are also used to mine both negative and positive passages associated with the queries.

In more detail, Google's approach with Gecko consists of two steps, each of which relies on LLMs to first generate and then rank data:

Starting with a large corpus of (unlabeled) passages, we use a few-shot prompted LLM to generate a relevant task and query for each passage [...]. We then embed the concatenated task and query using a pretrained embedding model to obtain nearest neighbor passages, use an LLM to rerank the passages, and obtain positive and negative passages based on the LLM scores.

The first step produces a diverse set of (query, passage) pairs spanning diverse tasks. In the second step, an existing embedding model retrieves the top N passages most similar to the query, and the LLM ranks each of them, yielding a positive target and negative targets for training.
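To make the mining loop concrete, here is a minimal sketch of the two-step process described above. The helpers `llm_generate`, `llm_score`, and `embed` are hypothetical stand-ins for a few-shot prompted LLM, an LLM relevance scorer, and a pretrained embedding model; none of them is part of a published Gecko API.

```python
import numpy as np

def mine_fret_pair(passage, corpus_passages, corpus_embeddings,
                   llm_generate, llm_score, embed, top_n=20):
    # Step 1: prompt a few-shot LLM to invent a task description and
    # a query for which this passage is relevant.
    task, query = llm_generate(passage)

    # Step 2a: embed the concatenated task and query and retrieve the
    # top-N nearest-neighbor passages from the corpus (rows of
    # corpus_embeddings are assumed to be unit-normalized).
    query_vec = embed(f"{task} {query}")
    sims = corpus_embeddings @ query_vec
    top_idx = np.argsort(-sims)[:top_n]
    candidates = [corpus_passages[i] for i in top_idx]

    # Step 2b: let the LLM score each candidate. The highest-scoring
    # passage becomes the positive target (it may differ from the
    # original seed passage); the rest serve as negatives.
    scores = np.array([llm_score(query, p) for p in candidates])
    order = np.argsort(scores)
    positive = candidates[order[-1]]
    negatives = [candidates[i] for i in order[:-1]]
    return task, query, positive, negatives
```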

According to Google, the use of FRet is what distinguishes its approach and gives Gecko its performance advantage, in particular the use of LLMs to re-rank the retrieved passages.

The re-ranking step is key to enhance the quality as we discover that the best passage to answer the generated query often differs from the original source passage.

Other key factors include generating queries and passages for a diverse set of tasks and carefully formatting the training data.
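As an illustration of what unified formatting might look like, the sketch below prepends the generated task description to each query so a single model can be trained across many tasks. The exact template is an assumption for illustration, not the published FRet format.

```python
def format_training_example(task, query, positive, negatives):
    # Hypothetical unified layout: embedding the task description in
    # the query string lets one model distinguish, e.g., retrieval
    # queries from classification or similarity inputs.
    return {
        "query": f"task: {task} | query: {query}",
        "positive": positive,
        "negatives": negatives,
    }
```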

According to Google, Gecko achieves the best performance in its class on the MTEB benchmark and competes with systems based on models 7x larger or embeddings with 5x more dimensions. On the lower end of the spectrum, Gecko with 256-dimensional embeddings outperforms all existing entries that use a 768-dimensional embedding size, making it, Google says, a very appealing option as a compact text embedding model.

Text embeddings are a foundational tool in natural language processing. They convert unstructured text into vector representations associated with meaning, semantics, and relationships and are used in a variety of applications, including document retrieval, text clustering, semantic search, similarity scoring, and text classification.
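As a simple example of similarity scoring, two texts are related when their embedding vectors point in similar directions, and cosine similarity is the standard measure. The `embed` function below is a placeholder for any text embedding model, Gecko included.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # 1.0 means identical direction (very similar meaning),
    # values near 0.0 mean the texts are unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Usage with a placeholder embedding model:
# v1 = embed("how do I reset my password")
# v2 = embed("steps for password recovery")
# cosine_similarity(v1, v2)  # high score indicates related meaning
```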

Google has not open-sourced Gecko at the time of writing, and it is not yet clear how it will be made available to the public.
