Build a multi-tenant chatbot with RAG using Amazon Bedrock and Amazon EKS
With the availability of Generative AI models, many customers are exploring ways to build chatbot applications in which each chatbot instance specializes in a specific tenant's contextual information, and to run such multi-tenant applications at scale on cost-efficient infrastructure familiar to their development teams. Amazon Bedrock, a fully managed cloud service, tackles this challenge head-on by offering a range of powerful foundation models (FMs) through an Application Programming Interface (API) with no infrastructure to manage, streamlining the development of highly adaptable and cost-effective Generative AI chatbots.
Amazon Elastic Kubernetes Service (Amazon EKS) addresses several challenges around building chatbot applications, such as multi-tenancy, scaling, resiliency, and cost. Amazon EKS is commonly adopted by AWS customers for large-scale deployment of container-based multi-tenant applications that cater to both internal and external audiences. Various surveys and reports, including those from the Cloud Native Computing Foundation (CNCF), Dynatrace, Splunk, and Datadog, also indicate the growth of Kubernetes as the container orchestration solution of choice in enterprise environments.
This post delves into a solution for building a multi-tenant chatbot with Retrieval Augmented Generation (RAG). RAG is a common pattern where a general-purpose language model is queried with a user question along with additional contextual information extracted from private documents. We have provided step-by-step deployment instructions and sample code for the solution in a GitHub repository.
As depicted in Figure 1, the solution uses Amazon EKS as the foundation for running the containerized chatbot application, with Kubernetes namespaces serving as a logical unit of compute workload per tenant. Istio is an open-source service mesh commonly used with Kubernetes for deploying multi-tenant applications; it provides features such as traffic management, security, and observability at the pod level. We use the Istio service mesh for ingress control and routing functions, as it supports external authorization, JWT validation at the edge, and request routing based on various HTTP parameters. We use an OpenID Connect (OIDC) proxy to relay authorization requests to an identity provider, which in our case is Amazon Cognito User Pools. Requests from clients are received through a Network Load Balancer (NLB). For each tenant, we use the DNS subdomain pattern to direct traffic from the end user, with the DNS subdomain resolving to the IP addresses of the NLB. The NLB forwards all traffic to an Istio Ingress Gateway. A detailed discussion of how Istio supports multi-tenancy on Amazon EKS can be found in the post SaaS Identity and Routing with Istio Service Mesh and Amazon EKS, along with sample code.
Figure 1 – High-level architecture
The private documents that serve as the knowledge base for the RAG implementation are converted into vector embeddings, then indexed and stored in an in-memory Facebook AI Similarity Search (FAISS) index for efficient similarity search. We use the Titan Embeddings model to convert textual data to vector embeddings and Claude Instant to generate responses to user queries; both models are available through Amazon Bedrock. RAG is implemented with FAISS, an open-source library for in-memory storage of vector embeddings and similarity search, chosen for its simplicity and small footprint, which suit demo use cases. For enterprise implementations, vector databases such as pgvector, Amazon OpenSearch Service, and other AWS Partner products would be the preferred options, as they support scalability, persistence, and ubiquitous access. The LangChain library provides the functionality for running queries against the text model and for managing chat history by saving it in Amazon DynamoDB. It orchestrates the steps in processing user input: retrieving the chat history, requesting a vector embedding of the chat history combined with the user input from the embeddings model, using the returned embedding to search the vector database, and sending the retrieved documents along with the chat history and user input to the Claude Instant model to receive the output.
Containerization of chatbot and RAG components
The chatbot application consists of two component microservices: a frontend chat interface that interacts with the client and a RAG API. Both microservices are built as Docker container images for scheduling as Kubernetes pods on Amazon EKS.
The Dockerfile for the chat interface is shown in the following snippet.
We use a lightweight Python base image from Amazon ECR Public Gallery, and use the multistage build approach. In the first stage, we build and install the required Python packages into the user home directory. In the second stage, we create a non-root Linux user and group for the Streamlit-based chat interface to run, copy the installed Python packages from the first stage image, and also copy the application code along with all the associated modules into the application working directory in the container image.
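A sketch of such a multi-stage Dockerfile is shown below; the base image tag, file names, user names, and Streamlit port are assumptions and may differ from the repository.

```dockerfile
# Stage 1: install Python dependencies into the user site-packages
FROM public.ecr.aws/docker/library/python:3.11-slim AS builder
COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt

# Stage 2: copy the installed packages and run as a non-root user
FROM public.ecr.aws/docker/library/python:3.11-slim
RUN groupadd appgroup && useradd --create-home --gid appgroup appuser
WORKDIR /app
# Packages installed with `pip install --user` land under ~/.local
COPY --from=builder --chown=appuser:appgroup /root/.local /home/appuser/.local
COPY --chown=appuser:appgroup app.py ./
ENV PATH=/home/appuser/.local/bin:$PATH
USER appuser
EXPOSE 8501
ENTRYPOINT ["streamlit", "run", "app.py", "--server.port=8501"]
```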
The Dockerfile for the RAG API Docker image is shown in the following code snippet.
For the RAG API, we use the same Python base image as the frontend, and follow similar steps as outlined for the frontend. For serving the API, we use the FastAPI framework, and a combination of Gunicorn/Uvicorn to run an efficient ASGI web server.
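The RAG API image can be sketched along similar lines; again, the base image tag, file names, worker count, and port are assumptions.

```dockerfile
# Stage 1: install Python dependencies into the user site-packages
FROM public.ecr.aws/docker/library/python:3.11-slim AS builder
COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt

# Stage 2: copy the installed packages and run as a non-root user
FROM public.ecr.aws/docker/library/python:3.11-slim
RUN groupadd appgroup && useradd --create-home --gid appgroup appuser
WORKDIR /app
COPY --from=builder --chown=appuser:appgroup /root/.local /home/appuser/.local
COPY --chown=appuser:appgroup main.py ./
ENV PATH=/home/appuser/.local/bin:$PATH
USER appuser
EXPOSE 8000
# Gunicorn manages Uvicorn ASGI workers that serve the FastAPI app in main.py
CMD ["gunicorn", "main:app", "--workers", "2", \
     "--worker-class", "uvicorn.workers.UvicornWorker", "--bind", "0.0.0.0:8000"]
```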
Once the container images are built, they are pushed to Amazon Elastic Container Registry (Amazon ECR), ready for deployment. The images are pulled from Amazon ECR when the chatbot pod is deployed on Amazon EKS.
As shown in Figure 2, a Kubernetes namespace is created for each tenant deployment, into which the chatbot pod is deployed with environment variables configured to tenant-specific settings. The chatbot pod consists of two containers: the main container is the frontend, while the RAG API acts as a sidecar container serving the API required for the Q&A task. We also create an AWS Identity and Access Management (AWS IAM) role per tenant that has the necessary permissions for the tenant-specific Amazon Simple Storage Service (Amazon S3) bucket and Amazon DynamoDB table. To simplify the use case, we've created dedicated Amazon S3 and Amazon DynamoDB resources per tenant. The AWS IAM role is associated with a Kubernetes Service Account that's also provisioned along with the pod, and allows the containers in the pod to access the required data resources.
Figure 2 – Amazon EKS pod architecture
It is also a common pattern to commingle tenant data into fewer buckets and tables than there are tenants, using IAM conditions and dynamic credentials generated with AWS Security Token Service (AWS STS) to restrict access to tenant- and user-specific data on each incoming request. Details and sample code for implementing the dynamic credentials pattern can be found in the AWS SaaS Microservices Workshop, and the post on SaaS Data Isolation with Dynamic Credentials Using HashiCorp Vault in Amazon EKS.
The Kubernetes manifest used to deploy the service account, deployment and service is listed as follows.
As can be seen in the container definitions, several environment variables are passed to the containers that are used by the application to locate tenant-specific data resources and API endpoints.
It is also worth noting that the container image locations are parameterized and can be passed in during manifest deployment; in our case they are set to the Amazon ECR repository URI for each image.
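A condensed sketch of such a manifest is shown below; the namespace, role ARN, image parameters, and environment variable names are placeholders, not the exact values used in the repository.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: chatbot
  namespace: tenant-a
  annotations:
    # IAM role granting access to the tenant's S3 bucket and DynamoDB tables
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/tenant-a-chatbot-role
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: chatbot
  namespace: tenant-a
spec:
  replicas: 1
  selector:
    matchLabels:
      app: chatbot
  template:
    metadata:
      labels:
        app: chatbot
    spec:
      serviceAccountName: chatbot
      containers:
        - name: frontend                       # Streamlit chat interface
          image: ${ECR_REPO_FRONTEND}:latest   # parameterized image location
          ports:
            - containerPort: 8501
          env:
            - name: SESSIONS_TABLE
              value: Sessions_tenant-a
        - name: rag-api                        # sidecar serving the Q&A API
          image: ${ECR_REPO_RAGAPI}:latest
          ports:
            - containerPort: 8000
          env:
            - name: CONTEXT_BUCKET
              value: tenant-a-context-bucket
            - name: CHATHISTORY_TABLE
              value: ChatHistory_tenant-a
---
apiVersion: v1
kind: Service
metadata:
  name: chatbot
  namespace: tenant-a
spec:
  selector:
    app: chatbot
  ports:
    - port: 80
      targetPort: 8501
```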
Ingesting data with FAISS and Titan embeddings model
The contextual data that the chatbot uses for generating responses to user queries is ingested into a FAISS index as part of the deployment. As shown in Figure 3, contextual data is first read from the tenant’s Amazon S3 bucket.
Figure 3 – Ingestion of contextual data
The ingestion mechanism uses LangChain’s CharacterTextSplitter to chunk the ingested data into reasonable sizes, and the FAISS library converts each of the chunks into a vector embedding by calling the Titan embeddings model, creating an index of the embeddings. This indexing improves the response time for similarity search. The set of indexes so created are output in a binary formatted file and uploaded back into the Amazon S3 bucket, which is accessed by the RAG-API microservice during bootstrap and loaded into its memory. With larger vector databases, the index is created and managed by the database instance and can be queried by the application using database-specific APIs. Having a dedicated vector database decouples the loading, updating, and indexing of context data from the application, as well as the scaling of the database instance.
The following code snippet shows the implementation details of this sequence.
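The ingestion step can be sketched as follows. The bucket layout, object keys, and the pre-1.0 LangChain import paths are assumptions; adjust them to the versions pinned in the repository.

```python
def ingest_context(bucket: str, doc_key: str, index_prefix: str = "faiss_index") -> None:
    """Read a tenant document from S3, chunk it, embed it with Titan, and
    save the resulting FAISS index back to S3 for the RAG API to load."""
    # Third-party imports are kept local so the module imports without them installed.
    import boto3
    from langchain.embeddings import BedrockEmbeddings
    from langchain.text_splitter import CharacterTextSplitter
    from langchain.vectorstores import FAISS

    s3 = boto3.client("s3")
    text = s3.get_object(Bucket=bucket, Key=doc_key)["Body"].read().decode("utf-8")

    # Chunk the document into pieces small enough for the embeddings model
    splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    chunks = splitter.split_text(text)

    # FAISS embeds each chunk via the Titan Embeddings model and builds the index
    bedrock = boto3.client("bedrock-runtime")
    embeddings = BedrockEmbeddings(client=bedrock, model_id="amazon.titan-embed-text-v1")
    index = FAISS.from_texts(chunks, embeddings)

    # Serialize the index and upload it for the RAG API to load at bootstrap
    index.save_local(f"/tmp/{index_prefix}")
    for name in ("index.faiss", "index.pkl"):
        s3.upload_file(f"/tmp/{index_prefix}/{name}", bucket, f"{index_prefix}/{name}")
```

Because the index is rebuilt from scratch on each ingestion run, updating the knowledge base means re-running this step and restarting the RAG API pods, which is acceptable for a demo but is the limitation a dedicated vector database removes.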
Integration with Claude Instant
The RAG API microservice receives user queries from the frontend, converts each into a prompt formatted for the appropriate FM, and interacts with the model using the LangChain library. The LangChain library is initialized to maintain chat history in a tenant-specific Amazon DynamoDB table. When a query is received, LangChain is called to retrieve similarity search results from the vector database, and to send the prompt along with the matching search results and the previous chat history as context to the Claude Instant model. The response received from the FM is returned by the RAG API to the client through the chat interface.
The flow of a chat conversation is shown in Figure 4. When the user connects to the chat interface for the first time by pointing a browser at its Fully Qualified Domain Name (FQDN), such as tenant.example.com, the request is intercepted by the Istio Ingress Gateway and forwarded to the OIDC proxy for authorization. Because the request doesn't contain the cookie associated with the authorization token, the OIDC proxy redirects the user's browser to the Hosted UI of the tenant's User Pool. Once the user signs in, the User Pool returns a code that is passed to the OIDC proxy, which exchanges it with the User Pool for the user's identity token, formatted as a JSON Web Token (JWT). The OIDC proxy caches the JWT and returns a cookie to the user's browser for use in future requests. When a request reaches the tenant's pod, the JWT is verified against its issuer URL to ensure that it is genuine, and the request is verified to have come from the Istio Ingress Gateway. After passing these checks, the request reaches the chat interface deployed in the tenant's namespace.
Figure 4 – Chat conversation flow
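The JWT checks performed at the tenant's pod can be expressed as Istio resources along the following lines; the namespace, labels, issuer URL, and ingress gateway principal are placeholders.

```yaml
apiVersion: security.istio.io/v1beta1
kind: RequestAuthentication
metadata:
  name: jwt-verify
  namespace: tenant-a
spec:
  selector:
    matchLabels:
      app: chatbot
  jwtRules:
    # Verify the JWT signature against the tenant User Pool's issuer URL
    - issuer: https://cognito-idp.us-east-1.amazonaws.com/us-east-1_EXAMPLE
      forwardOriginalToken: true
---
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: require-jwt-from-gateway
  namespace: tenant-a
spec:
  selector:
    matchLabels:
      app: chatbot
  action: ALLOW
  rules:
    - from:
        # Only accept traffic that arrived via the Istio Ingress Gateway
        - source:
            principals: ["cluster.local/ns/istio-system/sa/istio-ingressgateway-service-account"]
      when:
        # And only when the request carries a JWT from the expected issuer
        - key: request.auth.claims[iss]
          values: ["https://cognito-idp.us-east-1.amazonaws.com/us-east-1_EXAMPLE"]
```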
The following code snippet shows that two Amazon DynamoDB tables are created for each tenant during deployment. The Sessions table keeps track of user interactions, and the ChatHistory table is used by LangChain for recording the history of chat conversations.
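A boto3 sketch of this table creation is shown below. The table names and key schema are assumptions, though LangChain's DynamoDB chat-history integration does expect a string partition key named SessionId.

```python
def create_tenant_tables(tenant_id: str) -> None:
    """Create the per-tenant Sessions and ChatHistory DynamoDB tables."""
    import boto3  # imported locally so the module loads without boto3 installed

    dynamodb = boto3.client("dynamodb")
    for table_name in (f"Sessions_{tenant_id}", f"ChatHistory_{tenant_id}"):
        dynamodb.create_table(
            TableName=table_name,
            # LangChain's DynamoDBChatMessageHistory expects a string
            # partition key named SessionId
            KeySchema=[{"AttributeName": "SessionId", "KeyType": "HASH"}],
            AttributeDefinitions=[{"AttributeName": "SessionId", "AttributeType": "S"}],
            BillingMode="PAY_PER_REQUEST",  # per-request billing suits spiky chat traffic
        )
```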
The TenantId and UserEmail are extracted from the authentication JWT by the Istio proxy sidecar and injected as the headers X-Auth-Request-Tenantid and X-Auth-Request-Email, respectively. Upon receiving user input, the chat application creates a SessionId, a Version 4 UUID string, associated with the TenantId and UserEmail, and records the UNIX time of the last user interaction. A concatenated string of TenantId:UserEmail:SessionId is sent to the RAG API along with the chat query, to be used as a session identifier by the backend, so that chat histories can be tied back to specific users. This can provide rich contextual information for downstream uses such as product recommendations, customer relations, and technical support. The following code snippet shows this processing within the application.
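The construction of the session record and its identifier might look like the following; the function and field names are illustrative, not taken from the repository.

```python
import time
import uuid


def new_session(tenant_id: str, user_email: str) -> dict:
    """Create a session record and the TenantId:UserEmail:SessionId key sent to the RAG API."""
    session_id = str(uuid.uuid4())  # Version 4 UUID string
    return {
        "TenantId": tenant_id,
        "UserEmail": user_email,
        "SessionId": session_id,
        "LastInteraction": int(time.time()),  # UNIX time of the last user interaction
        # Concatenated session identifier used by the RAG API backend
        "SessionKey": f"{tenant_id}:{user_email}:{session_id}",
    }
```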
Since the chat interface and RAG API microservices are collocated within the same Kubernetes pod, the chat interface communicates with the RAG API over the localhost (127.0.0.1) network interface.
When the RAG API microservice starts, it loads the FAISS index into memory and starts listening for connections. When a request is received, it initializes an AWS boto3 client for Amazon Bedrock containing credentials, and passes the request data, the FM parameters, and the Bedrock client data structure to LangChain. The LangChain library is configured to save chat conversations in an Amazon DynamoDB table, and starts a ConversationalRetrievalChain, which automatically searches for context in the FAISS index, retrieves the saved history, formats the query based on a template definition, bundles the context together with the formatted query, sends it to the Claude Instant model, and saves the model's response. The RAG API returns the response received from LangChain to the frontend application for presentation to the user. The sequence of events managed by the RAG API is detailed in the following code snippet.
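This sequence can be sketched as below. The import paths follow the pre-1.0 LangChain layout, and the table name, index path, and model parameters are assumptions; a production version would also load the index once at startup rather than per request.

```python
def answer_query(session_key: str, query: str, index_path: str = "/tmp/faiss_index") -> str:
    """Answer a chat query using the FAISS index, DynamoDB-backed history, and Claude Instant."""
    # Third-party imports are kept local so the module imports without them installed.
    import boto3
    from langchain.chains import ConversationalRetrievalChain
    from langchain.embeddings import BedrockEmbeddings
    from langchain.llms import Bedrock
    from langchain.memory import ConversationBufferMemory
    from langchain.memory.chat_message_histories import DynamoDBChatMessageHistory

    from langchain.vectorstores import FAISS

    bedrock = boto3.client("bedrock-runtime")

    # Load the FAISS index built during ingestion
    embeddings = BedrockEmbeddings(client=bedrock, model_id="amazon.titan-embed-text-v1")
    vectorstore = FAISS.load_local(index_path, embeddings)

    # Persist chat history per TenantId:UserEmail:SessionId key in DynamoDB
    history = DynamoDBChatMessageHistory(table_name="ChatHistory_tenant-a",
                                         session_id=session_key)
    memory = ConversationBufferMemory(memory_key="chat_history",
                                      chat_memory=history, return_messages=True)

    # Claude Instant generates the response from retrieved context plus history
    llm = Bedrock(client=bedrock, model_id="anthropic.claude-instant-v1",
                  model_kwargs={"max_tokens_to_sample": 512, "temperature": 0.0})
    chain = ConversationalRetrievalChain.from_llm(
        llm=llm, retriever=vectorstore.as_retriever(), memory=memory
    )
    return chain({"question": query})["answer"]
```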
In this post, we showed you how to build a multi-tenant chatbot with RAG using Amazon Bedrock and Amazon EKS. The RAG implementation uses private data as context for large language models to produce predictable and relevant responses to chat queries. While this post demonstrates the chatbot use case, this approach and components can also be applied to use cases and tasks such as question answering, text summarization, and content creation.
Amazon EKS provides constructs for multi-tenancy that allow workload isolation among tenants while enabling efficient sharing of resources. We encourage you to explore Amazon Titan models and Amazon Bedrock for your Generative AI use cases, and to build a resilient and cost-effective solution with Amazon EKS as the underlying orchestration platform.
Check out the step-by-step deployment instructions and sample code for the solution discussed in this post in the GitHub repository.