Supercharging LLMs with Custom Data via Retrieval Augmented Generation (RAG)
by Tobias Abdon
Introduction: Unlocking the Potential of Retrieval Augmented Generation (RAG) in Enterprise Data Retrieval
Large Language Models (LLMs) like ChatGPT have changed knowledge work forever. With a well-thought-out prompt, you can generate code, marketing content, or letters home to mom. It truly is a major leap forward that is helping people be more productive on a global scale.
While these LLMs are extremely helpful, they do have one weakness: static data. Each model is trained on a fixed dataset, and as time moves forward, as it tends to do, that dataset grows stale.
As of this writing, ask ChatGPT about its training data and it will tell you that its knowledge only runs up to a fixed cutoff date.
But what if you want an LLM-powered chat experience with up-to-date data? Or what if you want that same experience with your private data?
That’s where a very useful technique called Retrieval Augmented Generation (RAG) comes in. In this article we will discuss what RAG is and how it works.
RAG Helps Fill in the Knowledge Gaps
RAG is a way of using LLMs that lets them pull in information from outside databases or documents at query time to generate a custom response. Unlike a plain LLM, which can only draw on the information it was originally trained on, a RAG-powered system can use extra, up-to-date information to produce responses that are more accurate and relevant to the question at hand.
How Does RAG Work?
There are two parts to implementing RAG. The first part is preparing and setting up your data sources. The second part is building a chat app that is RAG capable.
Let’s talk about those two parts now.
However, before we get into that, a quick word on text embeddings. A text embedding is a numerical vector, typically in a high-dimensional space, that captures the semantic meaning of a piece of text in a form computers can easily process. Words, phrases, sentences, and even larger chunks of text can each be represented as one of these vectors, and because texts with similar meanings end up with similar vectors, a computer can compare and search over meaning by comparing numbers.
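To make that concrete, here is a minimal sketch of creating embeddings and comparing them. It assumes the OpenAI Python SDK (v1+) with an API key in the environment and numpy installed; any embedding provider works the same way, and the model name is just an example.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(text: str) -> np.ndarray:
    """Convert a piece of text into a numerical vector (an embedding)."""
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(response.data[0].embedding)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Measure how close two embeddings are in direction (1.0 = identical)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

password_q = embed("How do I reset my password?")
login_help = embed("Steps to recover a forgotten login credential")
revenue    = embed("Quarterly revenue grew twelve percent year over year")

print(cosine_similarity(password_q, login_help))  # higher: similar meaning
print(cosine_similarity(password_q, revenue))     # lower: unrelated meaning
```

The key takeaway is that semantically related sentences score higher than unrelated ones, even when they share no words, which is exactly what makes embeddings useful for retrieval.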
Part 1: Preparing The Data Source
In this part, our goal is to create text embeddings of our data and store them in a vector database. Your data can be virtually any text, such as a website, a database, or custom docs. It could even be all of these things combined.
- Collect Source Data: The first step is to collect the source data. This could be as simple as locating all of your files in one directory, scraping a website, or getting access to a DB.
- Chunk Your Data: Next, we split the text into chunks. A chunk is simply a smaller piece of the source text; it could be a page of a PDF, a paragraph, or even a single sentence.
- Create Embeddings of Each Chunk: Next, we transform our chunks of data into numerical vectors, otherwise known as embeddings. There are different tools to do this. If you’re using OpenAI, you’ll want to use their Embeddings API. If you’re using another LLM, check with their docs to see how you should proceed.
- Upload to a Vector Database: Great, so now we have all of our vectors. What do we do with them? We save them to a vector database! There are many vector databases. Pinecone is often used because of how easy it is to set up, Redis is a great enterprise-grade option, and Chroma and FAISS are great open-source choices. A sketch of this whole pipeline follows the list.
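Put together, Part 1 looks roughly like the sketch below. It assumes a single hypothetical source file (`employee_handbook.txt`), a naive fixed-size chunking strategy, OpenAI embeddings, and Chroma as the vector database; swap in your own source data, chunking logic, embedding model, and vector store as needed.

```python
import chromadb
from openai import OpenAI

openai_client = OpenAI()                                   # embeddings provider
chroma_client = chromadb.PersistentClient(path="./rag-index")
collection = chroma_client.get_or_create_collection("company-docs")

def chunk_text(text: str, size: int = 500) -> list[str]:
    """Naive fixed-size chunking; real apps often split on paragraphs or tokens."""
    return [text[i:i + size] for i in range(0, len(text), size)]

# Steps 1 & 2: collect the source data and break it into chunks
with open("employee_handbook.txt") as f:                   # hypothetical source file
    chunks = chunk_text(f.read())

# Step 3: create an embedding for each chunk
response = openai_client.embeddings.create(
    model="text-embedding-3-small", input=chunks
)
embeddings = [item.embedding for item in response.data]

# Step 4: upload the chunks and their embeddings to the vector database
collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=embeddings,
)
print(f"Indexed {collection.count()} chunks")
```

This indexing step only needs to run when your source data changes, not on every user query.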
Part 2: Building a Chat App
Now that we have our text properly converted to embeddings and uploaded to a vector database, we can build a chat app that uses those embeddings.
At a high level, the chat app will complete these steps (sketched in code after the list):
- User Gives a Prompt: This step is obvious, but listing it here for clarity. The chat app needs to take in a user prompt.
- Convert Query to Embeddings: The user prompt is converted to an embedding, using the same embedding tool used to create the vectors in Part 1.
- Vector DB is Queried with User Embeddings: Next, the user prompt that’s been converted to an embedding is submitted to the vector DB as a query. This is where the vector DB magic comes in: it finds the closest matches in its database to the user prompt and returns one or more of the data chunks that were created in Part 1, Step 2 (see above).
- Send the User Prompt + Vector DB Response to the LLM: Then, the user prompt and resulting vector DB response are sent to the LLM, where it generates a response.
- Receive the Response: The chat app receives the response and renders it to the user.
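Here is a sketch of those five steps, continuing the same assumptions as the Part 1 sketch (OpenAI for embeddings and chat, Chroma as the vector database, the `company-docs` collection created earlier; the model names are just examples).

```python
import chromadb
from openai import OpenAI

openai_client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
collection = chromadb.PersistentClient(path="./rag-index").get_collection("company-docs")

def answer(question: str) -> str:
    # Step 2: convert the user prompt to an embedding (same model as Part 1)
    query_embedding = openai_client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding

    # Step 3: query the vector DB for the chunks closest to the prompt
    results = collection.query(query_embeddings=[query_embedding], n_results=3)
    context = "\n\n".join(results["documents"][0])

    # Step 4: send the user prompt plus the retrieved chunks to the LLM
    completion = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Answer the question using only the provided context."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )

    # Step 5: return the response for the chat app to render
    return completion.choices[0].message.content

print(answer("What is our parental leave policy?"))  # Step 1: the user's prompt
```

In a real chat app you would wrap this function behind a web or chat UI, and you might also pass recent conversation history to the LLM alongside the retrieved context.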
Mixing LLMs and Ever-Changing Data
When used in a business, RAG allows LLMs to answer questions with insights from the company’s own, always-changing data, such as customer chats, transaction history, or specialized research. This external data lets the LLM craft responses that are not only clear and coherent but also grounded in specific information relevant to the question.
Security Considerations
In the above application, you’ll notice there are several systems that will have access to potentially confidential data. First, the developer’s computer will need access; that’s a given. Second, the vector database will store the embeddings of your data, so you need to make sure it is secured properly.
Lastly, the LLM will handle the confidential data. If you’re using an LLM from a provider like OpenAI, be sure to review the data policies to ensure they meet your requirements. At the time of this writing, if you use the OpenAI API they will not store your data or use it for training. Other LLMs might have different policies; it’s always best to verify for yourself.
Conclusion
In this article we introduced you to the concept of retrieval augmented generation (RAG) as a way to get large language model responses using private data. This method makes it possible to create chat apps that interface with enterprise data. You learned about the process of preparing data for upload to a vector database, and how the user-side chat mechanics work. Finally, we covered some important security considerations.
If this was helpful, please follow on Twitter @tobiasbdon.