Improve LLM RAG responses using search data
How to use all the data in your Elastic cluster as context for your RAG implementation using large language models on Amazon Bedrock
Foundational large language models (LLMs) are trained on huge datasets. To put this into context, consider that GPT-4o was reportedly trained on around 13 trillion tokens! But there's a caveat to all that data: it's public and it's not up to date. And, most importantly, it does not include any data that is private or unique to your organization.
And the thing is, your organization already holds huge amounts of extremely valuable data that would make the inferences of foundational LLMs much more relevant to your users. We’re going to call this your organizational body of knowledge.
In this article, we’ll talk about using an organization’s existing body of knowledge as added context for foundational LLMs through retrieval-augmented generation (RAG), and how Elastic makes that a whole lot simpler by removing the need to move data around.
Let’s dig in!
Working with code and data
For those coming from a developer background, code is pretty much the primary artifact you’re used to working with. Yes, we need data to validate that our read and write operations work as expected, and data is of course involved in making sure our business logic behaves as we anticipate.
Data, however, serves a very different purpose in the workflows of those working with machine learning.
Data and code are inseparably linked to the outcome the developer is looking to achieve. Models require access to data, whether for training or as additional context in RAG implementations built on foundational models.
If we look at the RAG scenario specifically, this historically meant fetching data from different sources, transforming it, and then generating vector representations of it. Those vectors then had to be stored in purpose-specific data repositories that support this very specific data type and the query capabilities the RAG implementation needs, so it can find the data most relevant to the response the model is looking to generate.
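As a rough sketch of what that historically looked like in code, the snippet below embeds a couple of passages with an Amazon Bedrock embeddings model and leaves the final write to a purpose-built vector store as a placeholder. The model ID and passage contents are illustrative assumptions, not a prescribed setup:

```python
import json
import boto3

# Sketch of the traditional RAG ingestion flow: fetch passages, generate
# vector representations, then hand them to a purpose-built vector store.
bedrock = boto3.client("bedrock-runtime")

def embed(text: str) -> list[float]:
    # Generate an embedding for one passage using a Bedrock embeddings model.
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v1",  # illustrative model choice
        body=json.dumps({"inputText": text}),
    )
    return json.loads(response["body"].read())["embedding"]

passages = [
    "Refunds are processed within 5 business days.",   # illustrative content
    "VPN access requires a hardware security key.",
]

# In the traditional architecture, these vectors would now be written to a
# dedicated vector database (for example, Amazon MemoryDB with vector search).
vectors = [{"text": p, "embedding": embed(p)} for p in passages]
```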
Here’s a sample architecture of the traditional flow of storing vectors from multiple data sources:
Most organizations already hold lots of centralized knowledge
Most organizations already have processes in place to consolidate all the relevant knowledge that users are looking to access. And this knowledge is usually made available through search-engine capabilities and knowledge bases that users reach with traditional query methods: the good old “search” feature that most applications offer as a standard capability today.
And this knowledge already flows in from different sources: documents stored in JSON format, data in relational databases, and files in object storage that require more intricate parsing before they can be integrated into a queryable, central repository.
All of this data, once fetched and transformed, is stored and indexed in a system with specialized capabilities for fast, human-friendly searches across large amounts of content: a search engine.
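In Elastic’s case, that indexing and querying step can be as simple as the following sketch with the official Python client. The deployment credentials, index name, and document fields are placeholders, not a prescribed schema:

```python
from elasticsearch import Elasticsearch

# Connect to an Elastic Cloud deployment; credentials are placeholders.
es = Elasticsearch(cloud_id="<deployment-cloud-id>", api_key="<api-key>")

# Index a transformed document into the central, queryable repository.
es.index(
    index="knowledge-base",
    document={
        "title": "Expense policy",
        "content": "Meals while traveling are reimbursed up to a daily limit.",
        "source": "s3://corp-docs/policies/expenses.pdf",  # illustrative path
    },
)

# The good old "search" feature: a traditional full-text query.
results = es.search(
    index="knowledge-base",
    query={"match": {"content": "expense reimbursement"}},
)
for hit in results["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```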
Serving knowledge through chat and search
For those paying close attention, I’m sure you’ve already noticed that there are a lot of similarities between the two diagrams above:
- They both get data from application-specific sources.
- They both use something (Amazon EMR, in the examples above) to fetch, process, and store that data.
- They both store the processed data in some repository (Amazon MemoryDB in the first example and Elastic Cloud in the second).
- They both make the stored data available to end users for consumption.
The fundamental difference between the two scenarios is that the processed data ends up in a different format and is made available to users through a different mechanism (in the first example, most likely through a chatbot with context provided by the data in Amazon MemoryDB; in the second, via an API).
Putting both things together
Let’s put these two diagrams together without any optimization and see how they look:
We have the same data sources, but now we’re running two different sets of pipelines with Amazon EMR: one pushing data into Elastic Cloud for API-based querying, and another pushing data through vector generation for eventual access via RAG, exemplified by LangChain in the diagram above. The end user ends up querying two separate data stores depending on the mechanism they choose, whether the chatbot or the traditional search functionality.
Now, let’s make this simpler!
And now we get to the central idea of this article, one that we can accomplish thanks to Elastic Cloud’s Elastic Learned Sparse EncodeR (ELSER).
ELSER is a retrieval model that enables you to perform semantic search on data stored in your Elasticsearch cluster. Not only does it simplify your architecture; because it searches using weighted token expansions rather than exact keyword matches, it also provides more relevant context for the responses generated by the LLM.
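As an example of what that looks like in practice, here’s a sketch of a semantic query with the Python client. It assumes documents have already been enriched with ELSER’s weighted tokens by an ingest pipeline; the index name, token field, and model ID below depend on how that pipeline was configured and are assumptions here:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(cloud_id="<deployment-cloud-id>", api_key="<api-key>")

# A semantic (text_expansion) query against documents that an ingest pipeline
# has enriched with ELSER's weighted tokens. The "ml.tokens" field and the
# ".elser_model_2" model ID reflect a typical ELSER setup, not a requirement.
response = es.search(
    index="knowledge-base",
    query={
        "text_expansion": {
            "ml.tokens": {
                "model_id": ".elser_model_2",
                "model_text": "how do I get reimbursed for travel meals?",
            }
        }
    },
)
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```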
Now let’s rework our architecture to remove all unnecessary components and use Elastic as the single place to store our organizational knowledge, regardless of how it’s consumed:
Easier, right? Let’s look at some of the changes that make this solution using Elastic Cloud’s ELSER capabilities so much more efficient:
- We have removed Amazon MemoryDB from the architecture, since Elastic serves all our requirements.
- We can reuse the pipelines that already keep Elastic up to date for API consumption to also split our data into passages, which helps ELSER work much more efficiently.
- LangChain can index and store vectors directly in Elastic as well, completely eliminating the need for additional data repositories (see the sketch after this list).
- Amazon Bedrock dramatically reduces the effort of running a foundational LLM by providing fully serverless capabilities and off-the-shelf access to the most relevant large language models available today.
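Here’s a minimal sketch of that consolidated path with LangChain, ELSER, and Bedrock in Python. LangChain’s integration surface evolves quickly, and the credentials, index name, and model IDs below are illustrative assumptions rather than a definitive implementation:

```python
from langchain_community.vectorstores import ElasticsearchStore
from langchain_community.llms import Bedrock
from langchain.chains import RetrievalQA

# Store passages directly in Elasticsearch; with the sparse vector strategy,
# ELSER produces the representations inside the cluster, so no separate
# vector database is needed. Credentials and names are placeholders.
store = ElasticsearchStore(
    es_cloud_id="<deployment-cloud-id>",
    es_api_key="<api-key>",
    index_name="organizational-knowledge",
    strategy=ElasticsearchStore.SparseVectorRetrievalStrategy(
        model_id=".elser_model_2"
    ),
)
store.add_texts(["Refunds are processed within 5 business days."])

# Bedrock gives serverless access to a foundation model; this model ID is
# just one example of a model available through Bedrock.
llm = Bedrock(model_id="anthropic.claude-v2")

# Retrieval and generation wired together: passages retrieved from Elastic
# are injected as context into the prompt sent to the Bedrock-hosted LLM.
qa = RetrievalQA.from_chain_type(llm=llm, retriever=store.as_retriever())
print(qa.invoke({"query": "How long do refunds take?"})["result"])
```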
Let’s look at some benefits
This solution has many benefits that stem from the reduction of moving pieces and consolidation of data.
Since we’re using the same pipelines and a single data repository, we can be reasonably confident that the data available through traditional API queries and the data included in LLM responses reflect the same source of truth at the same point in time.
We’re also reducing the cost and effort of data storage and transport, and making it easier for data engineers to keep data consistent when source schemas change, since there are fewer moving components to modify and update.
Elastic Cloud’s ELSER capabilities provide our solution with improved accuracy and relevance in responses generated by the LLM.
And, last but not least, by switching to services like Amazon Bedrock and using Elastic Cloud we’re close to eliminating all effort related to operating and scaling the underlying infrastructure while being able to access leading-edge capabilities.
This is awesome! What do I do next?
Look for Elastic Cloud in AWS Marketplace so you can start trying this solution out using the available free trial. Getting it in AWS Marketplace will make it very easy to configure Elastic to work with your AWS account.
And, finally, be on the lookout for a hands-on lab we’ll be releasing soon, where you’ll be able to get step-by-step guidance on how to actually build this solution on your own AWS environment.
See you all soon!