At the recent QCon San Francisco conference, Sam Partee, principal engineer at Redis, gave a talk about Retrieval Augmented Generation (RAG). In his talk, he discussed Generative Search, which combines the power of large language models (LLMs) with vector databases to improve the efficiency and security of information retrieval. Partee covered several practical techniques, such as Hypothetical Document Embeddings (HyDE), semantic caching, and pre/post-filtering.
Large Language Models and Vector Similarity Search
There is a lot of talk about using generative AI applications in production. Unfortunately, directly generating outputs with a Large Language Model (LLM) in production presents several challenges: cost, quality control (for example, inaccurate content, often referred to as 'hallucinations'), latency, and security. One option is LLM fine-tuning, but the infrastructure costs and the need to safeguard sensitive data often present obstacles.
Partee instead proposed replacing the generation task with a retrieval task. Retrieving relevant documents can be improved using LLMs and vector similarity search. The goal is to create so-called "embeddings", vectors that capture the meaning of parts of a document; these embeddings can be created with an LLM. When someone wants to search for a specific document, they can retrieve the documents whose embeddings lie close to the embedding of their query in that embedding space.
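As a minimal sketch of this idea (the embedding model, the example documents, and the `search` helper below are illustrative choices, not details from the talk), documents can be embedded once and a query can then be matched against them by cosine similarity:

```python
# Minimal sketch of embedding-based retrieval.
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding model

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Redis supports vector similarity search over indexed embeddings.",
    "HyDE embeds a hypothetical answer instead of the raw question.",
    "Semantic caching returns stored answers for similar queries.",
]

# Offline: embed every document once and keep the vectors around.
doc_vectors = model.encode(documents, normalize_embeddings=True)

def search(query: str, k: int = 2) -> list[str]:
    """Return the k documents whose embeddings are closest to the query."""
    query_vector = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vectors @ query_vector  # cosine similarity (vectors are normalized)
    top = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top]

print(search("How does semantic caching work?"))
```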
Retrieval Augmented Generation (RAG)
Partee introduced the concept of Retrieval Augmented Generation (RAG), a fast and efficient alternative to fine-tuning. Rather than generating answers from scratch, engineers can pre-compute all possible answers or relevant data in a knowledge base and index these using their embedding vectors. This approach allows for real-time updates to the knowledge base (by adding new vectors for the new information) while also ensuring that sensitive data is not used in model training or fine-tuning. RAG can be employed effectively across several use cases, including question answering by retrieving relevant documents, summarizing relevant retrieved passages, and retrieving information needed for customer service.
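A hypothetical sketch of the online half of this pattern, building on the `search` helper above: the retrieved passages are placed into the prompt so the LLM only has to answer from that context. The `call_llm` function is a stand-in for whatever completion API is used, not a real client.

```python
# Hypothetical sketch of the RAG flow: retrieve first, then generate.
def call_llm(prompt: str) -> str:
    # Placeholder: swap in whichever chat/completion API the application uses.
    raise NotImplementedError

def answer_with_rag(question: str) -> str:
    # Retrieve the passages most relevant to the question (see the search
    # helper sketched earlier), then ask the model to answer from them only.
    context = "\n".join(search(question, k=3))
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return call_llm(prompt)
```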
Partee went on to provide practical guidance on implementing a RAG system, discussing two key abstraction levels: the offline process (for creating and storing vectors) and the online process (for generating prompts based on user queries). During the offline process, an engineer processes all of their text to create the vectors that encode the meaning of individual chunks of text. He mentioned two approaches here: either splitting documents into chunks and summarizing each chunk individually, or working at the sentence level and including the context surrounding each sentence. The goal during offline vector generation is to ensure the specificity of each individual chunk while also taking its context into account.
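As an illustration of the offline step, one simple approach is to split each document into overlapping chunks so that every vector stays specific while retaining some surrounding context; the chunk size and overlap below are arbitrary example values, not recommendations from the talk.

```python
# Offline step sketch: split text into overlapping chunks before embedding.
def chunk_text(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into chunks of `size` characters that overlap by `overlap`."""
    step = size - overlap
    return [text[start:start + size] for start in range(0, len(text), step)]

# Each chunk would then be embedded and stored alongside its vector, e.g.:
# chunk_vectors = model.encode(chunk_text(raw_document), normalize_embeddings=True)
```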
Practical Tips and Tricks
During his talk, Partee discussed multiple tips and tricks for setting up a vector search system. The first trick was using hybrid search, which combines vector search with traditional filtering methods such as pre-filtering and post-filtering. With hybrid search, vector search is still used to find the most relevant information, but information that can easily be filtered out no longer has to be considered. This functionality is present in the Redis Vector Library (RVL), which Redis has recently launched.
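The sketch below shows the pre-filtering idea in plain Python: a cheap metadata filter first narrows the candidate set, and only the remaining items are ranked by vector similarity. The metadata fields are made up for illustration; in practice a vector database would apply the filter as part of the indexed query rather than in application code.

```python
import numpy as np

# Hybrid search sketch: pre-filter on metadata, then rank by similarity.
def hybrid_search(query_vector: np.ndarray, items: list[dict],
                  category: str, k: int = 5) -> list[dict]:
    # Pre-filter: drop everything that is not in the requested category.
    candidates = [item for item in items if item["category"] == category]
    # Vector search only over the remaining candidates.
    scores = [float(np.dot(item["vector"], query_vector)) for item in candidates]
    top = np.argsort(scores)[::-1][:k]
    return [candidates[i] for i in top]
```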
The second trick Partee discussed was the Hypothetical Document Embeddings (HyDE) approach. In that case, an engineer does not search for a vector close to the question being posed but instead for a vector close to a hypothetical answer to the query. This hypothetical answer can be generated by an LLM. He explained that such a generated answer is often closer, in the embedding space, to the documents one is searching for than the question itself is.
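A hypothetical sketch of HyDE, reusing the placeholder `call_llm` and `search` helpers from the earlier sketches:

```python
# HyDE sketch: embed a hypothetical answer instead of the raw question.
def hyde_search(question: str, k: int = 3) -> list[str]:
    hypothetical_answer = call_llm(
        f"Write a short, plausible answer to this question: {question}"
    )
    # Search with the generated answer, which tends to land closer to the
    # stored passages in the embedding space than the question itself.
    return search(hypothetical_answer, k=k)
```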
The last piece of advice concerned the system level. This includes updating vector embeddings and document-level records, and "hot swapping" the database if the first two approaches are not possible. He also described so-called Semantic Caching: if one has already generated an answer to a very similar question, it's possible to return that answer instead of creating a new one. Although one still has to check whether a query close to the new query has already been answered, retrieving an old answer is much faster than generating a new answer from scratch. This is a smart way to boost an LLM application's queries-per-second (QPS) performance.
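A minimal sketch of semantic caching, building on the earlier helpers; the similarity threshold is an illustrative value, not one recommended in the talk.

```python
# Semantic caching sketch: reuse an existing answer for sufficiently similar queries.
cache: list[tuple[np.ndarray, str]] = []  # (question embedding, answer) pairs

def cached_answer(question: str, threshold: float = 0.9) -> str:
    query_vector = model.encode([question], normalize_embeddings=True)[0]
    for cached_vector, answer in cache:
        if float(np.dot(cached_vector, query_vector)) >= threshold:
            return answer  # cache hit: skip the expensive generation step
    answer = answer_with_rag(question)  # cache miss: generate as usual
    cache.append((query_vector, answer))
    return answer
```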