AWS Storage Blog
Building self-managed RAG applications with Amazon EKS and Amazon S3 Vectors
Retrieval-Augmented Generation (RAG) is a technique that optimizes large language model (LLM) outputs by referencing authoritative knowledge bases outside of the model’s training data before generating responses. This addresses common limitations of traditional LLMs, such as outdated knowledge, hallucinated facts, and misinterpreted terminology. Organizations can implement RAG to enhance their generative AI applications with current, accurate, and domain-specific information without the significant cost of retraining foundation models (FMs). RAG also delivers up-to-date answers by integrating frequently refreshed data sources, improves user trust through source attribution, and gives developers fine-grained control over how information is retrieved and used. These benefits make RAG especially valuable for enterprise use cases, such as customer support, internal knowledge retrieval, and content generation tools that demand high accuracy and domain relevance.
Although managed RAG solutions such as Amazon Bedrock Knowledge Bases streamline deployment and reduce operational overhead, many organizations are also exploring self-managed RAG architectures to meet more specialized needs. This approach is particularly valuable for customers who:
- Are deeply experienced with or want to use open source frameworks to build their generative AI solutions.
- Need to fine-tune performance parameters and scale infrastructure to maximize performance.
- Operate in areas where managed services aren’t currently offered or need specific capabilities not currently available.
- Want to deploy solutions across multiple on-premises and cloud environments.
In this post, we provide a reference architecture for building and deploying a self-managed RAG application using Amazon Elastic Kubernetes Service (Amazon EKS), Amazon S3 Vectors, and open source tools such as Ray, Hugging Face, and LangChain. This implementation allows organizations to maintain control over their AI infrastructure while using the scalability and flexibility of Amazon Web Services (AWS) services.
Benefits of using S3 Vectors for RAG
S3 Vectors is a cost-effective, scalable solution for storing and querying vector embeddings, and it is the first cloud object store to offer built-in support for vector data. A core benefit of S3 Vectors is its ability to elastically scale from zero to tens of millions of vector embeddings within a single vector index without requiring you to provision or manage database infrastructure. Its pay-per-use pricing model eliminates the need for idle database instances, making it an economical option for developers and organizations working with large-scale AI workloads.
S3 Vectors also offers intuitive APIs that integrate into existing applications, providing fast onboarding and flexibility across use cases. These advantages make S3 Vectors particularly well-suited for RAG applications, where sub-second access to vector embeddings and efficient storage are critical for delivering accurate, high-performance responses.
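As a quick illustration, the following minimal sketch uses the same boto3 s3vectors operations (put_vectors and query_vectors) that appear later in this post. The bucket name, index name, and three-dimensional example vectors are placeholders; a real index uses the dimension of your embedding model (384 in this walkthrough).
import boto3

s3vectors = boto3.client("s3vectors", region_name="us-west-2")

# Store a single embedding with metadata attached (placeholder names and values)
s3vectors.put_vectors(
    vectorBucketName="my-vector-bucket",      # placeholder bucket name
    indexName="documents",                    # placeholder index name
    vectors=[{
        "key": "doc-001",
        "data": {"float32": [0.1, 0.2, 0.3]},  # must match the index dimension
        "metadata": {"title": "Example document"}
    }]
)

# Query the index for the nearest neighbors of a query embedding
response = s3vectors.query_vectors(
    vectorBucketName="my-vector-bucket",
    indexName="documents",
    queryVector={"float32": [0.1, 0.2, 0.3]},
    topK=3,
    returnMetadata=True,
    returnDistance=True
)
for match in response["vectors"]:
    print(match["key"], match["distance"], match["metadata"])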
Architecture overview
The architecture is designed for developers and architects who want to rapidly build and deploy RAG-capable generative AI applications using open source tools and models on AWS. It assumes familiarity with Amazon EKS and S3 Vectors, along with a conceptual understanding of AI, machine learning, and LLMs. The reference architecture consists of two main subsystems:
- Response generation layer handles the request-response flow between users and the application.
- Knowledge processing layer enables RAG capabilities by processing and vectorizing data.
First, we examine a high-level conceptual view of how RAG works, independent of any specific implementation technologies, as shown in the following figure:
This conceptual diagram shows the fundamental RAG process: user prompts are enhanced with relevant context extracted from a knowledge base through vector similarity searches, allowing the LLM to generate more accurate and informed responses. This reference architecture implements a batch processing approach where all documents in the S3 bucket are processed together in a single job, rather than processing files individually as they are uploaded.
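Expressed as code, the conceptual flow looks roughly like the following sketch. The embed, search_knowledge_base, and generate functions are placeholders standing in for the embedding model, the vector similarity search, and the LLM call described above.
from typing import List

def embed(text: str) -> List[float]:
    """Placeholder: convert text into a vector with your embedding model."""
    raise NotImplementedError

def search_knowledge_base(query_vector: List[float], top_k: int = 5) -> List[str]:
    """Placeholder: return the top-k most similar document chunks from the vector store."""
    raise NotImplementedError

def generate(prompt: str) -> str:
    """Placeholder: call the LLM inference server with the augmented prompt."""
    raise NotImplementedError

def answer(user_prompt: str) -> str:
    # 1. Embed the user prompt with the same model used to index the knowledge base
    query_vector = embed(user_prompt)
    # 2. Retrieve the most relevant chunks through vector similarity search
    context_chunks = search_knowledge_base(query_vector)
    # 3. Augment the prompt with the retrieved context
    augmented_prompt = "Context:\n" + "\n".join(context_chunks) + f"\n\nQuestion: {user_prompt}"
    # 4. Generate the final, grounded response
    return generate(augmented_prompt)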
Next, we explore how this conceptual flow is implemented using AWS services, container technologies, and open source products, as shown in the following figure:
AWS services:
- Amazon EKS provides the managed Kubernetes environment where we deploy the front end server, inference server, and embedding service. Amazon EKS handles the container orchestration, scaling, and management so that we can focus on application development rather than infrastructure maintenance.
- Amazon S3 general purpose buckets serve as the initial data lake for raw documents from both internal and external sources.
- S3 Vectors buckets store and manage the vector embeddings with built-in similarity search capabilities. This purpose-built solution removes the need for separate vector databases.
- AWS Identity and Access Management (IAM) with service accounts secures access between Amazon EKS workloads and AWS services.
- Amazon CloudWatch provides comprehensive monitoring and observability for the entire application stack on Amazon EKS.
- Amazon Elastic Container Registry (Amazon ECR) stores and manages the container images used by the EKS cluster.
Open source products:
- Hugging Face Text Generation Inference (TGI): This is a toolkit for deploying and serving LLMs.
- Ray: This is a unified compute framework for scaling AI and Python workloads that enables distributed processing of documents across the cluster with intelligent allocation of CPU, GPU, and memory resources. Although other approaches such as KEDA with standard deployments could handle basic embedding tasks, Ray provides superior scalability for large document collections (from thousands to millions), built-in fault tolerance mechanisms, and support for both batch and streaming workloads within the same framework. Ray’s architecture facilitates future extensibility for advanced features such as automated vector database refreshing, periodic re-embedding with newer models, and complex preprocessing pipelines. This makes it particularly well-suited for enterprise-scale RAG implementations where processing requirements may evolve over time.
- LangChain: This is a framework for developing applications powered by LLMs.
In this architecture, the knowledge processing layer manages the following data processing workflows:
- Data ingestion: Data from external and internal sources is uploaded to a standard S3 bucket, either manually or programmatically.
- Batch processing: The embedding service runs as a one-time job that processes all documents currently stored in the S3 bucket.
- Embedding generation: The embedding service, deployed on Amazon EKS, performs the following functions:
- Retrieves data from Amazon S3 using the AWS SDK or Amazon S3 API
- Preprocesses the data using Ray Data for chunking and transformation (a minimal chunking sketch follows this list)
- Runs Ray jobs to create vectorized embeddings using open source models (for example, Qwen3-Embedding-0.6B; the walkthrough below uses intfloat/multilingual-e5-small)
- Stores the vectorized embeddings in S3 Vectors buckets, which are optimized for vector operations. For guidance on ingesting vectors, refer to the Amazon S3 best practices.
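The batch job later in this post keeps preprocessing simple and embeds each document as a single record. If your documents are long, a chunking step such as the following sketch (using LangChain's RecursiveCharacterTextSplitter with illustrative chunk sizes) splits text into overlapping pieces before embedding, so that each chunk becomes its own entry in the vector index.
# A minimal chunking sketch; chunk_size and chunk_overlap are illustrative values
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # characters per chunk
    chunk_overlap=100   # overlap to preserve context across chunk boundaries
)

document_text = "..."  # text loaded from a document in the S3 data bucket
chunks = splitter.split_text(document_text)

# Each chunk would then be embedded and written to the vector index as its own entry,
# for example with a key such as f"{doc_id}-chunk-{i}"
for i, chunk in enumerate(chunks):
    print(i, len(chunk))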
Meanwhile, the response generation layer handles the following user interactions:
- Request: Users submit natural language requests through a web-based chat interface served by the frontend application running on Amazon EKS.
- Context augmentation: The frontend server runs LangChain processes that:
- Convert user queries to embeddings using the same model as the embedding service
- Perform vector similarity searches against the S3 Vector index to retrieve relevant information
- Construct contextualized prompts by combining user queries with retrieved information
- Send these enhanced prompts to the inference server
- Inference: The inference server uses Hugging Face TGI to serve open source LLMs (this walkthrough deploys Mistral-7B-Instruct-v0.2), generating responses based on the contextualized prompts. Although vLLM is another excellent option for model serving, with its PagedAttention mechanism offering superior throughput for many models, we selected TGI for this architecture for the following reasons:
- Broader model support: TGI works with a wider range of model architectures out-of-the-box.
- Streamlined deployment: TGI needs less configuration for basic deployment scenarios.
- Production-ready features: Built-in support for multiple model loading strategies, tensor parallelism, and quantization.
For production deployments where maximum throughput is critical, especially with very large batch sizes, vLLM may offer performance advantages. Both TGI and vLLM are excellent choices, and this architecture can be adapted to use vLLM by substituting the inference container and client implementation.
- Response processing: Generated responses are processed and returned to the user.
Walkthrough
You can follow these steps to deploy the preceding architecture in your environment:
- Environment setup with prerequisites
- Configure EKS cluster
- Setting up S3 Vector buckets
- Deploy Ray on Amazon EKS for distributed processing
- Deploy Ray job for embedding generation
- Deploy the inference server
- Query S3 Vector buckets with LangChain
- Frontend application deployment
- Monitoring and troubleshooting
Step 1: Environment setup with prerequisites
1. Open AWS CloudShell or your local terminal and run the following commands:
# Create a working directory for our RAG application
mkdir -p rag-application
cd rag-application
# Set up environment variables
export ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
export REGION=us-west-2
export CLUSTER_NAME="rag-eks-cluster"
export DATA_BUCKET_NAME="rag-data-${ACCOUNT_ID}"
export VECTOR_BUCKET_NAME="rag-vectors-${ACCOUNT_ID}"
# Verify AWS CLI is configured with proper permissions
aws sts get-caller-identity
2. Install the necessary AWS Command Line Interface (AWS CLI) tools if not already available:
# Install eksctl
curl --silent --location "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_Linux_amd64.tar.gz" | tar xz -C /tmp
sudo mv /tmp/eksctl /usr/local/bin
# Install kubectl (adjust the version/date in the path to a current Amazon EKS kubectl release if the download fails)
curl -o kubectl https://s3.us-west-2.amazonaws.com/amazon-eks/1.33.0/2023-11-14/bin/linux/amd64/kubectl
chmod +x ./kubectl
sudo mv ./kubectl /usr/local/bin
# Install Helm
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
# Verify tool installation
eksctl version
kubectl version --client
helm version
3. Create a standard S3 bucket for data storage:
# Create standard S3 bucket for raw document storage
aws s3 mb s3://${DATA_BUCKET_NAME} --region ${REGION}
# Verify bucket creation
aws s3 ls | grep ${DATA_BUCKET_NAME}
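4. (Optional) Seed the data bucket with a sample document. The embedding job in Step 5 scans the documents/ prefix and reads JSON records with id, title, source, and text fields; the example below assumes a JSON Lines layout (one object per line), which Ray Data reads reliably. The file name and content are placeholders.
# Create a sample document in JSON Lines format (one JSON object per line)
cat <<'EOF' > sample-doc.jsonl
{"id": "doc-001", "title": "What is RAG?", "source": "internal-wiki", "text": "Retrieval-Augmented Generation (RAG) augments LLM prompts with relevant context retrieved from a knowledge base."}
EOF
# Upload it under the documents/ prefix that the embedding job scans
aws s3 cp sample-doc.jsonl s3://${DATA_BUCKET_NAME}/documents/sample-doc.jsonl
# Verify the upload
aws s3 ls s3://${DATA_BUCKET_NAME}/documents/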
Step 2: Configure EKS cluster
1. Open AWS CloudShell or your local terminal and run the following commands:
# Create EKS cluster config
cat <<EOF > eks-cluster.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: ${CLUSTER_NAME}
  region: ${REGION}
  version: "1.33"
vpc:
  clusterEndpoints:
    publicAccess: true
managedNodeGroups:
  - name: cpu-nodes
    instanceType: m5.xlarge
    desiredCapacity: 2
    minSize: 1
    maxSize: 3
  - name: gpu-nodes
    instanceType: g4dn.xlarge
    desiredCapacity: 1
    minSize: 1
    maxSize: 2
    volumeSize: 100
EOF
# Create the EKS cluster (this will take 15-20 minutes)
eksctl create cluster -f eks-cluster.yaml
# Update kubeconfig to access the cluster
aws eks update-kubeconfig --name ${CLUSTER_NAME} --region ${REGION}
# Verify cluster is running and nodes are available
kubectl get nodes
2. Install the NVIDIA device plugin for GPU support:
# Install NVIDIA device plugin for Kubernetes
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml
# Verify GPU nodes and plugin
kubectl get pods -n kube-system | grep nvidia
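3. (Optional) Confirm that the GPU nodes advertise allocatable GPUs; the g4dn.xlarge nodes should report nvidia.com/gpu: 1 in their capacity and allocatable resources:
# Check that the GPU nodes expose the nvidia.com/gpu resource
kubectl describe nodes | grep nvidia.com/gpu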
Step 3: Setting up S3 Vector buckets
1. Open AWS CloudShell or your local terminal and run the following commands:
# Create a Vector Bucket
aws s3vectors create-vector-bucket \
--vector-bucket-name "${VECTOR_BUCKET_NAME}" \
--region ${REGION}
# Verify vector bucket creation
aws s3vectors list-vector-buckets --region ${REGION}
# Create the IAM policy and service account first
aws iam create-policy \
  --policy-name EKSVectorBucketPolicy \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": [
          "s3vectors:PutVectors",
          "s3vectors:GetVectors",
          "s3vectors:QueryVectors"
        ],
        "Resource": "arn:aws:s3vectors:'${REGION}':'${ACCOUNT_ID}':bucket/'${VECTOR_BUCKET_NAME}'"
      },
      {
        "Effect": "Allow",
        "Action": [
          "s3:GetObject",
          "s3:ListBucket"
        ],
        "Resource": [
          "arn:aws:s3:::'${DATA_BUCKET_NAME}'",
          "arn:aws:s3:::'${DATA_BUCKET_NAME}'/*"
        ]
      }
    ]
  }'
# Associate the OIDC provider with your cluster
eksctl utils associate-iam-oidc-provider \
--region=${REGION} \
--cluster=${CLUSTER_NAME} \
--approve
# Create service account with IAM role
eksctl create iamserviceaccount \
--name vector-processing-sa \
--namespace default \
--cluster ${CLUSTER_NAME} \
--attach-policy-arn arn:aws:iam::${ACCOUNT_ID}:policy/EKSVectorBucketPolicy \
--approve
# Get the IAM role ARN that was created for this service account
SA_ROLE_ARN=$(kubectl get serviceaccount vector-processing-sa -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}')
# Now set up the Vector Bucket policy using this role
cat <<EOF > vector-bucket-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "${SA_ROLE_ARN}"
      },
      "Action": [
        "s3vectors:PutVectors",
        "s3vectors:GetVectors",
        "s3vectors:QueryVectors"
      ],
      "Resource": "arn:aws:s3vectors:${REGION}:${ACCOUNT_ID}:bucket/${VECTOR_BUCKET_NAME}"
    }
  ]
}
EOF
aws s3vectors put-vector-bucket-policy \
--vector-bucket-name "${VECTOR_BUCKET_NAME}" \
--policy file://vector-bucket-policy.json
# Create a Vector Index
aws s3vectors create-index \
--vector-bucket-name "${VECTOR_BUCKET_NAME}" \
--index-name "documents" \
--data-type "float32" \
--dimension 384 \
--distance-metric "cosine"
# Verify index creation
aws s3vectors list-indexes --vector-bucket-name ${VECTOR_BUCKET_NAME}
# Verify service account creation
kubectl get serviceaccount vector-processing-sa
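2. (Optional) Verify that pods using the vector-processing-sa service account assume the IAM role. This sketch launches a short-lived pod with the public AWS CLI image (the image choice is an assumption) and prints the caller identity, which should show the role created by eksctl:
# Run a one-off pod with the service account and print its assumed identity
kubectl run irsa-check --rm -it --restart=Never \
  --image=amazon/aws-cli \
  --overrides='{"spec":{"serviceAccountName":"vector-processing-sa"}}' \
  -- sts get-caller-identity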
Step 4: Deploy Ray on Amazon EKS for distributed processing
1. Open AWS CloudShell or your local terminal and run the following commands:
# Clone the KubeRay repository
git clone https://github.com/ray-project/kuberay.git
cd kuberay/helm-chart
helm install ray-operator ./kuberay-operator --namespace ray-system --create-namespace
# Wait for the operator to start
kubectl wait --for=condition=available --timeout=90s deployment/kuberay-operator -n ray-system
# Create Ray cluster configuration
cat <<EOF > ray-cluster.yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: ray-cluster
spec:
  rayVersion: '2.9.0'
  headGroupSpec:
    rayStartParams:
      dashboard-host: '0.0.0.0'
    template:
      spec:
        serviceAccountName: vector-processing-sa
        containers:
          - name: ray-head
            image: rayproject/ray:2.9.0
            resources:
              limits:
                cpu: '2'
                memory: '8G'
  workerGroupSpecs:
    - groupName: worker-group
      replicas: 2
      minReplicas: 1
      maxReplicas: 4
      rayStartParams: {}
      template:
        spec:
          serviceAccountName: vector-processing-sa
          containers:
            - name: ray-worker
              image: rayproject/ray:2.9.0
              resources:
                limits:
                  cpu: '2'
                  memory: '8G'
                  nvidia.com/gpu: '1'
EOF
# Create the Ray cluster
kubectl apply -f ray-cluster.yaml
# Verify the Ray cluster is running
kubectl get rayclusters
# Check the status of the Ray pods
kubectl get pods -l ray.io/cluster=ray-cluster
2. You can access the Ray dashboard by port-forwarding to the Ray head pod:
export RAY_HEAD_POD=$(kubectl get pods -l ray.io/cluster=ray-cluster,ray.io/node-type=head -o jsonpath='{.items[0].metadata.name}')
kubectl port-forward $RAY_HEAD_POD 8265:8265
3. Then, go to http://localhost:8265 in your browser to access the Ray dashboard.
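4. (Optional) With the dashboard port-forward still running, you can submit a trivial job through the Ray Jobs API to confirm that the cluster accepts work. This assumes the Ray CLI is installed locally (for example, pip install "ray[default]"):
# Submit a trivial job that prints the cluster's aggregate resources
ray job submit --address http://localhost:8265 -- python -c "import ray; ray.init(); print(ray.cluster_resources())"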
Step 5: Deploy Ray job for embedding generation
This implementation performs bulk processing of all documents currently in your S3 bucket. It is designed for initial data loading or periodic batch updates, not the real-time processing of individual files. Each time you run the embedding job, it processes all JSON documents in the s3://your-bucket/documents/ folder. For production use cases needing real-time processing of new documents, you would need to implement an event-driven architecture with Amazon EventBridge and modify the processing logic to handle individual files.
1. Create a Dockerfile for the embedding service:
# Create a directory for the embedding service
mkdir -p ray-embeddings/app
cd ray-embeddings
2. Create a Dockerfile with the following content:
# Create the Dockerfile
cat <<EOF > Dockerfile
# Dockerfile for ray-embeddings
FROM rayproject/ray:2.9.0-py310-gpu
# Install additional dependencies
RUN pip install --no-cache-dir \
boto3 \
transformers \
torch \
pandas \
huggingface_hub
# Set up working directory
WORKDIR /app
# Copy the embedding script
COPY app/embedding_job.py /app/
# Default command - will be overridden by Kubernetes job spec
CMD ["python", "/app/embedding_job.py"]
EOF
3. Create the embedding script:
# Create the embedding script
mkdir -p app
cat <<EOF > app/embedding_job.py
# app/embedding_job.py
import ray
from ray import data
import boto3
import os
from transformers import AutoTokenizer, AutoModel
import torch
import json

ray.init()

# Environment variables
data_bucket = os.environ.get('DATA_BUCKET', 'my-rag-data')
vector_bucket = os.environ.get('VECTOR_BUCKET', 'my-rag-vectors')

# Load the model once as a Ray actor and share it across tasks
@ray.remote(num_gpus=1)
class EmbeddingModel:
    def __init__(self, model_name):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)
        self.model.to('cuda' if torch.cuda.is_available() else 'cpu')

    def get_embedding(self, text):
        inputs = self.tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
        if torch.cuda.is_available():
            inputs = {k: v.to('cuda') for k, v in inputs.items()}
        with torch.no_grad():
            outputs = self.model(**inputs)
        embeddings = outputs.last_hidden_state.mean(dim=1)
        return embeddings[0].cpu().numpy().tolist()

# Initialize embedding model
embedding_model = EmbeddingModel.remote("intfloat/multilingual-e5-small")

# Read documents from standard S3 bucket
print(f"Reading documents from s3://{data_bucket}/documents/")
ds = ray.data.read_json(f"s3://{data_bucket}/documents/")

# Process and embed documents
def process_document(doc):
    text = doc["text"]
    embedding = ray.get(embedding_model.get_embedding.remote(text))
    # Prepare metadata for the S3 vector index
    metadata = {
        "id": doc["id"],
        "title": doc.get("title", ""),
        "source": doc.get("source", ""),
        "text": text
    }
    return {"embedding": embedding, "metadata": metadata}

# Process documents and get embeddings
print("Processing documents and generating embeddings...")
embedded_ds = ds.map(process_document)

# Write embeddings to the S3 vector index
def write_to_vector_bucket(batch):
    # Create the client inside the task so it is not serialized across Ray workers
    s3vectors_client = boto3.client('s3vectors')
    vectors_to_write = []
    for _, row in batch.iterrows():
        metadata = dict(row["metadata"])
        vectors_to_write.append({
            "key": metadata['id'],
            # Ensure plain Python floats for the API call
            "data": {"float32": [float(x) for x in row["embedding"]]},
            "metadata": metadata
        })
    # Write in batches to the S3 vector index
    print(f"Writing {len(vectors_to_write)} vectors to {vector_bucket}")
    s3vectors_client.put_vectors(
        vectorBucketName=vector_bucket,
        indexName="documents",
        vectors=vectors_to_write
    )
    return batch

result = embedded_ds.map_batches(
    write_to_vector_bucket,
    batch_format="pandas",
    batch_size=500
)
print(f"Successfully processed {result.count()} documents")
EOF
4. Build and push the Docker image to Amazon ECR:
# Build and push the Docker image to Amazon ECR
cd ..
export IMAGE_REPO="ray-embeddings"
# Create ECR repository if it doesn't exist
aws ecr create-repository --repository-name ${IMAGE_REPO} --region ${REGION} || true
# Login to ECR
aws ecr get-login-password --region ${REGION} | docker login --username AWS --password-stdin ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com
# Build the Docker image
docker build -t ${IMAGE_REPO}:latest ray-embeddings/
# Tag the image for ECR
docker tag ${IMAGE_REPO}:latest ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/${IMAGE_REPO}:latest
# Push the image to ECR
docker push ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/${IMAGE_REPO}:latest
5. Now that the image is built and pushed to Amazon ECR, run the following block to save the job manifest to a file named embedding-job.yaml:
# Create a Kubernetes job to run the embedding process
cat <<EOF > embedding-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: embedding-job
spec:
  template:
    spec:
      serviceAccountName: vector-processing-sa
      containers:
        - name: embedding
          image: ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/${IMAGE_REPO}:latest
          command: ["python", "/app/embedding_job.py"]
          resources:
            limits:
              nvidia.com/gpu: 1
          env:
            - name: DATA_BUCKET
              value: "${DATA_BUCKET_NAME}"
            - name: VECTOR_BUCKET
              value: "${VECTOR_BUCKET_NAME}"
      restartPolicy: Never
  backoffLimit: 1
EOF
6. Apply the job manifest to start the embedding generation process:
# Deploy the embedding job
kubectl apply -f embedding-job.yaml
# Monitor the job progress
kubectl get jobs embedding-job
kubectl logs -f job/embedding-job
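7. (Optional) After the job completes, spot-check that vectors were written to the index. The following uses the S3 Vectors ListVectors operation; if the command name or options differ in your AWS CLI version, check aws s3vectors help:
# List a few vector keys from the index to confirm the job wrote data
aws s3vectors list-vectors \
  --vector-bucket-name ${VECTOR_BUCKET_NAME} \
  --index-name documents \
  --max-results 5 \
  --region ${REGION}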
Step 6: Deploy the inference server
1. Create Kubernetes deployment for the Hugging Face TGI server:
cat <<EOF > tgi-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: inference-server
  template:
    metadata:
      labels:
        app: inference-server
    spec:
      containers:
        - name: tgi
          image: ghcr.io/huggingface/text-generation-inference:latest
          args:
            - --model-id=mistralai/Mistral-7B-Instruct-v0.2
            - --num-shard=1
            # TGI listens on port 80 by default; expose it on 8080 to match the Service below
            - --port=8080
          ports:
            - containerPort: 8080
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: 16Gi
            requests:
              cpu: 2
              memory: 8Gi
---
apiVersion: v1
kind: Service
metadata:
  name: inference-server
spec:
  selector:
    app: inference-server
  ports:
    - port: 8080
      targetPort: 8080
  type: ClusterIP
EOF
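2. Apply the manifest and, once the pod is ready (the first start can take several minutes while the model downloads), send a quick test request. The request body follows TGI's generate API; adjust it if your TGI version differs:
# Deploy the inference server and wait for it to become ready
kubectl apply -f tgi-deployment.yaml
kubectl rollout status deployment/inference-server --timeout=15m
# Port-forward and send a test prompt to the TGI generate endpoint
kubectl port-forward service/inference-server 8080:8080 &
curl -s http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{"inputs": "What is Retrieval-Augmented Generation?", "parameters": {"max_new_tokens": 64}}'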
Step 7: Query S3 Vector buckets with LangChain
1. Open AWS CloudShell or your local terminal and run the following commands:
cat <<EOF > query.py
import boto3
import json
import numpy as np
from transformers import AutoTokenizer, AutoModel
import torch
from langchain.chains import RetrievalQA
from langchain.llms import HuggingFaceTextGenInference
from langchain.schema import BaseRetriever, Document
from typing import List

class S3VectorBucketRetriever(BaseRetriever):
    def __init__(self, vector_bucket_name, index_name, embedding_function):
        self.s3vectors_client = boto3.client('s3vectors')
        self.vector_bucket_name = vector_bucket_name
        self.index_name = index_name
        self.embedding_function = embedding_function

    def get_relevant_documents(self, query: str) -> List[Document]:
        # Generate embedding for the query
        query_embedding = self.embedding_function(query)
        # Search the S3 vector index
        response = self.s3vectors_client.query_vectors(
            vectorBucketName=self.vector_bucket_name,
            indexName=self.index_name,
            queryVector={"float32": query_embedding},
            topK=5,
            returnDistance=True,
            returnMetadata=True
        )
        # Process results
        documents = []
        for item in response['vectors']:
            documents.append(
                Document(
                    page_content=item['metadata'].get("text", ""),
                    metadata={
                        "id": item['metadata'].get("id", ""),
                        "title": item['metadata'].get("title", ""),
                        "source": item['metadata'].get("source", ""),
                        "similarity": item['distance']
                    }
                )
            )
        return documents

# Initialize embedding function (same model used by the embedding job)
tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-small")
model = AutoModel.from_pretrained("intfloat/multilingual-e5-small")

def get_embedding(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    embeddings = outputs.last_hidden_state.mean(dim=1)
    return embeddings[0].numpy().tolist()

# Initialize retriever
retriever = S3VectorBucketRetriever(
    vector_bucket_name="${VECTOR_BUCKET_NAME}",
    index_name="documents",
    embedding_function=get_embedding
)

# Initialize LLM (using the Hugging Face TGI server)
llm = HuggingFaceTextGenInference(
    inference_server_url="http://inference-server:8080/generate",
    max_new_tokens=512,
    top_k=10,
    top_p=0.95,
    typical_p=0.95,
    temperature=0.7,
    repetition_penalty=1.03
)

# Create QA chain
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True
)

# Example query
query = "What are the key benefits of RAG applications?"
result = qa(query)
print(result['result'])
print("\nSources:")
for doc in result['source_documents']:
    print(f"- {doc.metadata['title']} (Similarity: {doc.metadata['similarity']})")
EOF
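2. The script resolves the in-cluster Service name inference-server, so the simplest way to try it is from a pod that can reach that Service and has the Python dependencies installed. One option, assuming the frontend pod from Step 8 is already running (its image includes the same libraries), is to copy the script into that pod and execute it there:
# Copy query.py into a running frontend pod and execute it there
FRONTEND_POD=$(kubectl get pods -l app=rag-frontend -o jsonpath='{.items[0].metadata.name}')
kubectl cp query.py ${FRONTEND_POD}:/app/query.py
kubectl exec -it ${FRONTEND_POD} -- python /app/query.py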
Step 8: Frontend application deployment
1. Open AWS CloudShell or your local terminal and run the following commands:
# Create a directory for the frontend application
mkdir -p rag-frontend/app
cd rag-frontend
# Create the Dockerfile
cat <<EOF > Dockerfile
# Dockerfile for rag-frontend
FROM python:3.10-slim
# Install dependencies
RUN pip install --no-cache-dir \
streamlit \
langchain \
boto3 \
transformers \
torch \
requests \
python-dotenv \
langchain-community \
langchain-huggingface
# Set up working directory
WORKDIR /app
# Copy application files
COPY app/ /app/
# Expose Streamlit port
EXPOSE 8501
# Start the Streamlit application
CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0"]
EOF
# Create the frontend application files
mkdir -p app
cat <<EOF > app/app.py
# app/app.py
import os
import streamlit as st
from langchain.chains import RetrievalQA
from langchain.llms import HuggingFaceTextGenInference
from langchain.schema import BaseRetriever, Document
from typing import List
import boto3
import torch
from transformers import AutoTokenizer, AutoModel
import numpy as np

# Configure page
st.set_page_config(page_title="RAG Demo", page_icon="🔍", layout="wide")
st.title("RAG Application with S3 Vector Buckets")

# Environment variables
vector_bucket_name = os.environ.get("VECTOR_BUCKET_NAME", "my-rag-vectors")

# Load embedding model
@st.cache_resource
def load_embedding_model():
    tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-small")
    model = AutoModel.from_pretrained("intfloat/multilingual-e5-small")
    return tokenizer, model

tokenizer, model = load_embedding_model()

# Define embedding function
def get_embedding(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    embeddings = outputs.last_hidden_state.mean(dim=1)
    return embeddings[0].numpy().tolist()

# Define S3 Vector Bucket retriever
class S3VectorBucketRetriever(BaseRetriever):
    def __init__(self, vector_bucket_name, index_name, embedding_function):
        self.s3vectors_client = boto3.client('s3vectors')
        self.vector_bucket_name = vector_bucket_name
        self.index_name = index_name
        self.embedding_function = embedding_function

    def get_relevant_documents(self, query: str) -> List[Document]:
        # Generate embedding for the query
        query_embedding = self.embedding_function(query)
        # Search the S3 vector index
        response = self.s3vectors_client.query_vectors(
            vectorBucketName=self.vector_bucket_name,
            indexName=self.index_name,
            queryVector={"float32": query_embedding},
            topK=5,
            returnDistance=True,
            returnMetadata=True
        )
        # Process results
        documents = []
        for item in response['vectors']:
            documents.append(
                Document(
                    page_content=item['metadata'].get("text", ""),
                    metadata={
                        "id": item['metadata'].get("id", ""),
                        "title": item['metadata'].get("title", ""),
                        "source": item['metadata'].get("source", ""),
                        "similarity": item['distance']
                    }
                )
            )
        return documents

# Initialize retriever
retriever = S3VectorBucketRetriever(
    vector_bucket_name=vector_bucket_name,
    index_name="documents",
    embedding_function=get_embedding
)

# Initialize LLM client
def get_llm():
    inference_server_url = os.environ.get("INFERENCE_SERVER_URL", "http://inference-server:8080/generate")
    llm = HuggingFaceTextGenInference(
        inference_server_url=inference_server_url,
        max_new_tokens=512,
        top_k=10,
        top_p=0.95,
        typical_p=0.95,
        temperature=0.7,
        repetition_penalty=1.03
    )
    return llm

# Create QA chain
@st.cache_resource
def create_qa_chain():
    llm = get_llm()
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=retriever,
        return_source_documents=True
    )
    return qa_chain

qa_chain = create_qa_chain()

# Create the chat interface
if "messages" not in st.session_state:
    st.session_state.messages = []

# Display chat history
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])
        if "sources" in message and message["sources"]:
            with st.expander("View Sources"):
                for source in message["sources"]:
                    st.markdown(f"**{source['title']}** (Similarity: {source['similarity']:.2f})")
                    st.markdown(f"Source ID: {source['id']}")

# Get user input
if prompt := st.chat_input("Ask a question about our knowledge base"):
    # Add user message to chat history
    st.session_state.messages.append({"role": "user", "content": prompt})

    # Display user message
    with st.chat_message("user"):
        st.markdown(prompt)

    # Display assistant response
    with st.chat_message("assistant"):
        with st.spinner("Thinking..."):
            # Get response from QA chain
            result = qa_chain({"query": prompt})
            response = result['result']
            source_docs = result.get('source_documents', [])

            # Extract source information
            sources = []
            for doc in source_docs:
                sources.append({
                    "id": doc.metadata.get("id", ""),
                    "title": doc.metadata.get("title", ""),
                    "similarity": doc.metadata.get("similarity", 0)
                })

            # Display response
            st.markdown(response)

            # Display sources
            if sources:
                with st.expander("View Sources"):
                    for source in sources:
                        st.markdown(f"**{source['title']}** (Similarity: {source['similarity']:.2f})")
                        st.markdown(f"Source ID: {source['id']}")

    # Add assistant response to chat history
    st.session_state.messages.append({
        "role": "assistant",
        "content": response,
        "sources": sources
    })
EOF
# Create Kubernetes ConfigMap for environment variables
cat <<EOF > frontend-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: rag-frontend-config
data:
  INFERENCE_SERVER_URL: "http://inference-server:8080/generate"
  VECTOR_BUCKET_NAME: "${VECTOR_BUCKET_NAME}"
EOF
# Build and push the Docker image to Amazon ECR
cd ..
export FRONTEND_REPO="rag-frontend"
# Create ECR repository if it doesn't exist
aws ecr create-repository --repository-name ${FRONTEND_REPO} --region ${REGION} || true
# Login to ECR
aws ecr get-login-password --region ${REGION} | docker login --username AWS --password-stdin ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com
# Build the Docker image
docker build -t ${FRONTEND_REPO}:latest rag-frontend/
# Tag the image for ECR
docker tag ${FRONTEND_REPO}:latest ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/${FRONTEND_REPO}:latest
# Push the image to ECR
docker push ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/${FRONTEND_REPO}:latest
# Create the frontend deployment
cat <<EOF > frontend-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rag-frontend
spec:
  replicas: 2
  selector:
    matchLabels:
      app: rag-frontend
  template:
    metadata:
      labels:
        app: rag-frontend
    spec:
      serviceAccountName: vector-processing-sa
      containers:
        - name: frontend
          image: ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/${FRONTEND_REPO}:latest
          ports:
            - containerPort: 8501
          envFrom:
            - configMapRef:
                name: rag-frontend-config
          resources:
            limits:
              cpu: 1
              memory: 2Gi
            requests:
              cpu: 0.5
              memory: 1Gi
---
apiVersion: v1
kind: Service
metadata:
  name: rag-frontend
spec:
  selector:
    app: rag-frontend
  ports:
    - port: 80
      targetPort: 8501
  type: LoadBalancer
EOF
# Apply the ConfigMap and deployment
kubectl apply -f frontend-config.yaml
kubectl apply -f frontend-deployment.yaml
# Check the deployment status
kubectl get deployment rag-frontend
# Check when the service is ready with an external IP
kubectl get service rag-frontend
# Get the external endpoint (this may take a few minutes to provision)
export FRONTEND_URL=$(kubectl get service rag-frontend -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
echo "Your RAG application is available at: http://${FRONTEND_URL}"
Step 9: Monitoring and troubleshooting
1. Set up Amazon CloudWatch Container Insights for Amazon EKS monitoring:
# Install CloudWatch agent for EKS
kubectl apply -f https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/quickstart/cwagent-fluentd-quickstart.yaml
# Verify CloudWatch agent installation
kubectl get pods -n amazon-cloudwatch
2. Check your application logs:
# Check frontend logs
kubectl logs -l app=rag-frontend
# Check inference server logs
kubectl logs -l app=inference-server
# Check embedding job logs
kubectl logs job/embedding-job
Cleaning up resources (optional)
When you’re done experimenting, you can clean up all resources to avoid incurring charges:
# Delete the EKS cluster
eksctl delete cluster --name ${CLUSTER_NAME} --region ${REGION}
# Delete S3 Vector Bucket
aws s3vectors delete-vector-bucket --vector-bucket-name ${VECTOR_BUCKET_NAME} --region ${REGION}
# Delete S3 bucket (empty it first)
aws s3 rm s3://${DATA_BUCKET_NAME} --recursive
aws s3 rb s3://${DATA_BUCKET_NAME}
# Delete IAM policy
aws iam delete-policy --policy-arn arn:aws:iam::${ACCOUNT_ID}:policy/EKSVectorBucketPolicy
Conclusion
This reference architecture has demonstrated a practical implementation of a self-managed RAG application using Amazon EKS and Amazon S3 Vectors. We have used purpose-built vector storage with S3 Vectors to remove the need for a separate vector database while gaining the performance benefits of a solution designed specifically for vector operations. The combination of Amazon EKS for container orchestration and S3 Vectors for optimized vector storage and retrieval provides a solid foundation for building sophisticated AI applications that can use both the structure of your enterprise data and the capabilities of modern LLMs.
This self-managed approach offers distinct advantages over fully managed alternatives, particularly for organizations that:
- Need complete control over model selection, fine-tuning, and deployment parameters
- Need to maintain data sovereignty or meet specific compliance requirements
- Desire the flexibility to customize any component of the RAG pipeline to match specific use cases
- Already have Amazon EKS expertise and want to integrate RAG capabilities into existing container workflows
As you implement this architecture, consider your specific requirements for security, performance, and business needs to make appropriate adjustments to the design. The broad service portfolio of AWS enables you to extend this foundation with more capabilities such as streaming data processing with Amazon Kinesis, model monitoring with Amazon SageMaker Model Monitor, or enhanced security controls.
Ready to enhance your AI applications with better context and accuracy while maintaining control over your implementation? Start building your self-managed RAG solution today using this reference architecture as your guide. Share your experiences in the AWS Community Forum, and for implementation assistance or architectural guidance, reach out to your AWS account team or an AWS Solutions Architect.