Vector Search and Retrieval Augmented Generation (RAG)

Paul Yang - December 15th, 2023

Vector search and retrieval employ the power of mathematical vectors to efficiently locate and retrieve data items with shared characteristics. Each piece of content, be it text, image, or sound, is mapped into a vector within a multi-dimensional space, where the 'location' or 'coordinate' of the item is determined by its features.

Imagine a bustling city, where every building has an address based on its location. Each address is like a vector: it tells you where to find the building amongst the maze of streets. Similarly, in vector space, each item's 'address' (its vector) helps you navigate the complex environment to find what you're looking for.

One way to assign these coordinates is by semantic meaning. Buildings that are close together contain related information, so if I want to know something about a particular topic, I can simply find the right location on the map and go there. This has been made possible by the ability of large language models (LLMs) to better encode the semantic relevance of text and images into these vector representations. This, in turn, becomes the building block of “long-term memory” for LLMs, as it allows us to summon the right piece of relevant context on the fly and incorporate it into a prompt.

Creating the Vectors

Creating vectors from raw data is a fundamental step in transforming unstructured items into a format that can be processed and understood by computer systems. In its most basic approach, this process may involve simple feature extraction techniques that assign numerical values to different attributes of the data. For instance, in text analysis, this could begin with a 'bag-of-words' model, where each unique word is represented as a dimension in the vector space and the value in each dimension is the frequency of the word's occurrence in the text. Alternatively, for image data, features might include pixel intensity, color histograms, or edges. These manual and often simplistic strategies are a stepping stone to the more advanced representations employed today.
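As a toy illustration of the bag-of-words approach, the sketch below builds word-count vectors in plain Python; the two sentences and the derived vocabulary are invented for the example.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Map a text to a vector of word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

docs = ["the cat sat on the mat", "the dog chased the cat"]

# One dimension per unique word across the corpus
vocabulary = sorted({word for doc in docs for word in doc.lower().split()})
# -> ['cat', 'chased', 'dog', 'mat', 'on', 'sat', 'the']

print(bag_of_words(docs[0], vocabulary))  # [1, 0, 0, 1, 1, 1, 2]
print(bag_of_words(docs[1], vocabulary))  # [1, 1, 1, 0, 0, 0, 2]
```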

The rise of LLMs, such as GPT (Generative Pre-trained Transformer) and BERT (Bidirectional Encoder Representations from Transformers), has dramatically improved the capacity for creating rich, nuanced vectors, especially in the realm of Natural Language Processing (NLP). LLMs leverage deep learning and vast amounts of training data to capture complex language constructs, context, and even sentiment. They generate embeddings – high-dimensional vectors that encode the meaning of words, phrases, or entire documents – with a level of sophistication that previous methods can rarely achieve. Their understanding of semantics translates into vector spaces where geometric relationships between vectors accurately represent the linguistic relationships between the items they encode, allowing for more intelligent retrieval systems.
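As one concrete way to produce such embeddings, the sketch below assumes the sentence-transformers library and its all-MiniLM-L6-v2 model; any embedding model or hosted embedding API would play the same role.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 is a small, widely used embedding model (384 dimensions)
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "How do I reset my password?",
    "Steps to recover account access",
    "Best hiking trails near Denver",
]

embeddings = model.encode(sentences)  # numpy array of shape (3, 384)
print(embeddings.shape)
```

The first two sentences share almost no words, but their embeddings end up close together in the vector space because they mean nearly the same thing.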

Once created, these vectors are typically stored in databases or search indexes optimized for high-dimensional data. Traditional relational databases are not ideal for vector storage and retrieval because they are not optimized for high-dimensional similarity searches. Instead, specialized systems equipped with vector indexing capabilities are used. These storage solutions often employ techniques like Approximate Nearest Neighbor (ANN) algorithms, which sacrifice a small degree of accuracy for significant improvements in search efficiency. In practice, mainstream SQL systems are gaining vector support as well, for example Postgres via the pgvector extension, so a specialized vector database is no longer strictly necessary.
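To sketch what an ANN index looks like in code, the example below assumes the hnswlib library (FAISS, Annoy, or a vector database's built-in index would serve equally well) and indexes random vectors purely for illustration.

```python
# pip install hnswlib numpy
import hnswlib
import numpy as np

dim, num_items = 384, 10_000
data = np.random.rand(num_items, dim).astype(np.float32)

# Build an HNSW index using cosine distance
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_items, ef_construction=200, M=16)
index.add_items(data, np.arange(num_items))

# ef trades recall for speed at query time: higher is slower but more accurate
index.set_ef(50)

query = np.random.rand(dim).astype(np.float32)
labels, distances = index.knn_query(query, k=5)
print(labels, distances)  # ids and cosine distances of the 5 nearest items
```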

Facilitating Retrieval

Euclidean distance and cosine similarity are two popular metrics used to compare items and determine their level of resemblance. Euclidean distance is the straight-line distance between two points in vector space, which corresponds to the geometric notion of how far apart two items are. Cosine similarity, alternatively, measures the cosine of the angle between two vectors, providing an assessment of their orientation irrespective of their magnitude.
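To make the two metrics concrete, here is a small numpy sketch that computes both for a pair of example vectors.

```python
import numpy as np

def euclidean_distance(a, b):
    """Straight-line distance between two points in vector space."""
    return np.linalg.norm(a - b)

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, ignoring magnitude."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([3.0, 2.0, 1.0])

print(euclidean_distance(a, b))  # ~2.83
print(cosine_similarity(a, b))   # ~0.714
```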

There are a few reasons to prefer cosine similarity. In traditional text analysis, frequency counts (like term frequencies in documents) produce vectors with varying magnitudes: a longer text tends to have higher counts of every word, so when vectors were based on how many times specific words occurred, the normalization provided by cosine similarity ensured that a longer document did not have an advantage over a shorter one. For modern LLM-generated vectors, Euclidean distance can become less meaningful due to the curse of dimensionality, where the distance between any two points (i.e., vectors) tends to be about the same in high-dimensional space. Cosine similarity, by focusing on the angle, often gives more nuanced differentiation between vectors in these spaces.
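A quick way to see the normalization effect: compare a term-frequency vector for a short text against the vector for the same text repeated twice. Euclidean distance reports them as far apart, while cosine similarity correctly reports the content as identical.

```python
import numpy as np

short_doc = np.array([2.0, 1.0, 0.0, 1.0])  # term counts for a short text
long_doc = short_doc * 2                    # the same text, repeated twice

# Euclidean distance grows with document length...
print(np.linalg.norm(short_doc - long_doc))  # ~2.45

# ...but cosine similarity is unchanged by the repetition
cos = np.dot(short_doc, long_doc) / (
    np.linalg.norm(short_doc) * np.linalg.norm(long_doc))
print(cos)  # 1.0
```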

Retrieval Augmented Generation

Vector comparison can power semantic search, which goes beyond the limitations of traditional keyword-based search by not just looking at the presence or absence of words, but by grasping the underlying intent and contextual meaning of a search query. The results with the highest vector similarity are returned, yielding a more intuitive search experience, akin to asking a knowledgeable friend. By leveraging context and broader semantic relationships, semantic search connects users with information that may not match the search terms verbatim but matches them conceptually, providing a richer, more human-like understanding of queries.
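Putting embeddings and cosine similarity together, a minimal semantic search might look like the sketch below (again assuming sentence-transformers; the documents and query are invented). The query should retrieve the refund-policy document even though it shares no keywords with it.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "The office is closed on national holidays.",
    "Contact support to reset a forgotten password.",
]
doc_vecs = model.encode(documents, normalize_embeddings=True)

query = "How do I get my money back?"
query_vec = model.encode([query], normalize_embeddings=True)[0]

# With unit-normalized vectors, cosine similarity is just a dot product
scores = doc_vecs @ query_vec
best = int(np.argmax(scores))
print(documents[best])  # likely the refund-policy document
```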

Recently, and importantly, the concept of long-term memory for LLMs has been built on exactly this: retrieving the relevant context on the fly from vectorized representations of content and incorporating it into the LLM's input prompt. This is called Retrieval-Augmented Generation (RAG). In RAG systems, when presented with a query, the model first retrieves relevant context or documents from a vast corpus of knowledge using semantic search techniques. It then uses the sourced content to generate accurate and context-aware responses. For example, the infamous incident of a lawyer citing hallucinated cases could have been mitigated if actual cases had been retrieved and the GPT-4 response formed with real cases included in the input prompt.
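A bare-bones RAG loop is retrieve-then-prompt. The sketch below combines the retrieval pattern above with a call to OpenAI's chat completions API; the document snippets, model choice, and prompt wording are all illustrative assumptions, not a prescribed setup.

```python
# pip install openai sentence-transformers numpy
import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Placeholder corpus; in practice this would be your real documents,
# embedded ahead of time and stored in a vector index
documents = [
    "Smith v. Jones (2019): the court held that ...",
    "Doe v. Acme Corp (2021): summary judgment was denied because ...",
]
doc_vecs = embedder.encode(documents, normalize_embeddings=True)

def answer(query, k=2):
    # 1. Retrieve: rank documents by cosine similarity to the query
    query_vec = embedder.encode([query], normalize_embeddings=True)[0]
    top = np.argsort(doc_vecs @ query_vec)[::-1][:k]
    context = "\n\n".join(documents[i] for i in top)

    # 2. Generate: include the retrieved context in the prompt
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Answer using only the provided context. "
                        "If the context is insufficient, say so."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content

print(answer("What happened in Smith v. Jones?"))
```

Because the model is instructed to answer only from the retrieved context, it grounds its response in real documents rather than inventing plausible-sounding ones.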

This retrieval-augmented approach essentially marries the best of both worlds—the extensive knowledge encoded in data with the sophisticated language understanding and generation capabilities of state-of-the-art LLMs.

About

Einblick is an AI-native data science platform that provides data teams with an agile workflow to swiftly explore data, build predictive models, and deploy data apps. Founded in 2020, Einblick was developed based on six years of research at MIT and Brown University. Einblick is funded by Amplify Partners, Flybridge, Samsung Next, Dell Technologies Capital, and Intel Capital. For more information, please visit www.einblick.ai and follow us on LinkedIn and Twitter.