This Guidance demonstrates question answering using Retrieval Augmented Generation (RAG) with foundation models in Amazon SageMaker JumpStart. Generative AI is powered by large language models (LLMs), commonly referred to as foundation models, that are pre-trained on vast amounts of data. This Guidance shows how to solve a question answering task with Amazon SageMaker LLMs and embedding endpoints so you can build models that generate text based on specific, enterprise data rather than generic data. This can help you automate tasks, enhance your applications, and improve information retrieval.

Architecture Diagram

Download the architecture diagram PDF 

Well-Architected Pillars

The AWS Well-Architected Framework helps you understand the pros and cons of the decisions you make when building systems in the cloud. The six pillars of the Framework allow you to learn architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable systems. Using the AWS Well-Architected Tool, available at no charge in the AWS Management Console, you can review your workloads against these best practices by answering a set of questions for each pillar.

The architecture diagram above is an example of a solution created with Well-Architected best practices in mind. For a fully Well-Architected workload, follow as many of these best practices as possible.

  • The services in this Guidance collectively support operational excellence by automating tasks, improving security, enhancing scalability, and streamlining management and operations of the generative AI application. For example, SageMaker JumpStart simplifies machine learning (ML) model deployment, API Gateway provides secure and scalable API access, Lambda automates processing and response formatting, OpenSearch Service improves data retrieval, and Fargate automates resource provisioning for indexing jobs.

    Read the Operational Excellence whitepaper 
  • Amazon Cognito helps ensure that only authenticated and authorized users can access the application. It manages user identities through multi-factor authentication (MFA) options. Amazon Virtual Private Cloud (Amazon VPC) isolates resources, such as SageMaker endpoints and Lambda functions, within a private network. This isolation protects communication between components of the application, enhancing data privacy and security. Amazon VPC also allows for the implementation of network security measures, such as security groups and network access control lists (NACLs). These services help you safeguard sensitive data and maintain the confidentiality, integrity, and availability of the application.

    Read the Security whitepaper 
  • SageMaker JumpStart simplifies the deployment and management of ML models, including model versioning and monitoring. This reduces the risk of model deployment errors and helps ensure that models are consistently available and reliable for inference. Additionally, Lambda functions process user input and invoke SageMaker endpoints. Because Lambda is serverless, it handles scaling and availability automatically, so the application can reliably process user requests without manual scaling or server management.

    Fargate initiates indexing jobs for embeddings and automates resource provisioning and container management, so that indexing jobs are completed reliably and at scale. This automation reduces the risk of resource limitations or failures during indexing processes.

    Read the Reliability whitepaper 
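The Lambda-to-SageMaker path described above can be sketched as a minimal handler. This is an illustrative sketch, not the Guidance's actual code: the endpoint name is a placeholder, and the request/response schema shown is a common text-generation payload that varies by model.

```python
import json

# Hypothetical endpoint name; replace with your deployed
# SageMaker JumpStart LLM endpoint.
ENDPOINT_NAME = "jumpstart-llm-endpoint"

def build_payload(question: str, context: str = "") -> dict:
    """Format the request body for the LLM endpoint.

    The exact schema varies by model; this shape is a common
    text-generation payload and is an assumption here.
    """
    if context:
        prompt = f"Context: {context}\n\nQuestion: {question}\nAnswer:"
    else:
        prompt = question
    return {"inputs": prompt, "parameters": {"max_new_tokens": 256}}

def handler(event, _context):
    """Lambda entry point: parse the user question, invoke the
    SageMaker endpoint, and return the generated answer."""
    import boto3  # imported lazily so the module loads without the SDK

    body = json.loads(event.get("body", "{}"))
    payload = build_payload(body.get("question", ""), body.get("context", ""))

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    result = json.loads(response["Body"].read())
    return {"statusCode": 200, "body": json.dumps(result)}
```

Because Lambda manages scaling, the same handler serves one request or thousands of concurrent requests without code changes.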
  • In a generative AI application where tasks may involve complex ML inference, data processing, and retrieval, efficiency is crucial to delivering a responsive, high-performing user experience. By using SageMaker JumpStart, Lambda, OpenSearch Service, and Fargate, this Guidance efficiently manages workloads, enables quick response times, and scales to meet performance demands.

    SageMaker JumpStart optimizes model deployment and monitoring so that ML inferences are initiated efficiently, leading to faster response times and better performance for users. Lambda functions automatically scale to handle concurrent requests so the application can maintain performance efficiency, even during periods of high user demand. OpenSearch Service indexes and searches embeddings, enhancing the application's information retrieval capabilities and enabling users to quickly access the information they need. Fargate invokes indexing jobs for embeddings. It automates resource provisioning, allowing the application to efficiently process and index large amounts of data without manual intervention.

    Read the Performance Efficiency whitepaper 
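The embedding retrieval step that OpenSearch Service performs above can be illustrated with a k-nearest-neighbor query body. The index name, vector field, and client call shown are assumptions; adjust them to match your embedding model and index mapping.

```python
# Sketch of an OpenSearch k-NN query for embedding-based retrieval.
# Index and field names are hypothetical.

INDEX_NAME = "documents"      # hypothetical k-NN-enabled index
VECTOR_FIELD = "embedding"    # hypothetical vector field

def knn_query(query_vector: list[float], k: int = 3) -> dict:
    """Build the request body for a k-nearest-neighbor search
    against a k-NN-enabled OpenSearch index."""
    return {
        "size": k,
        "query": {"knn": {VECTOR_FIELD: {"vector": query_vector, "k": k}}},
    }

# With the opensearch-py client (not imported here), the query would
# be submitted roughly as:
#   client.search(index=INDEX_NAME, body=knn_query(vec))
```

The documents returned by this query become the context passages that are fed to the LLM alongside the user's question.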
  • SageMaker JumpStart provides pre-built ML models and workflows, reducing the time and resources required to develop and train models from scratch. This can lead to cost savings by accelerating the development cycle. Lambda follows a pay-as-you-go pricing model, meaning you only pay for the compute time used when your function is invoked. OpenSearch Service allows you to easily scale your cluster based on your search and analytics workloads, so you can optimize costs by adjusting resources to match your actual usage. Fargate automatically manages the underlying infrastructure, so you don't need to provision or manage servers or pay for unused server capacity.

    Read the Cost Optimization whitepaper 
  • Services such as Lambda, SageMaker, and Fargate contribute to sustainability by optimizing resource usage. They automatically scale resources based on workload demand, reducing unnecessary energy consumption during periods of low activity. For example, as a serverless compute infrastructure, Fargate runs containerized application workloads and minimizes your overall resource footprint. Similarly, SageMaker JumpStart helps prevent idle, overprovisioned resources by automatically adjusting compute resources to match workload needs.

    Read the Sustainability whitepaper 

Implementation Resources

The sample code is a starting point. It is industry validated and prescriptive but not definitive, and it offers a peek under the hood to help you begin.

Machine Learning

Question answering using Retrieval Augmented Generation with foundation models in Amazon SageMaker JumpStart

This blog post describes RAG and its advantages, and demonstrates how to get started quickly with a sample notebook that solves a question answering task using a RAG implementation with LLMs in JumpStart.
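As a rough illustration of the RAG flow the blog post walks through, the sketch below shows retrieval plus prompt augmentation. The `embed` stub, the in-memory corpus, and the dot-product scoring are all hypothetical stand-ins so the flow is self-contained; in the sample notebook, embedding and generation are calls to SageMaker endpoints.

```python
# Minimal RAG flow sketch: embed the question, retrieve the closest
# documents, and augment the prompt with them before generation.

def embed(text: str) -> list[float]:
    """Placeholder embedding: in practice, invoke a SageMaker
    embedding endpoint here."""
    return [float(ord(c)) for c in text[:8]]

def retrieve(question: str, corpus: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the k corpus documents whose embeddings score highest
    against the question embedding (dot product, for illustration)."""
    q = embed(question)
    def score(vec: list[float]) -> float:
        n = min(len(q), len(vec))
        return sum(a * b for a, b in zip(q[:n], vec[:n]))
    ranked = sorted(corpus, key=lambda doc: score(corpus[doc]), reverse=True)
    return ranked[:k]

def augment_prompt(question: str, passages: list[str]) -> str:
    """Prepend retrieved passages so the LLM answers from specific
    enterprise data rather than its generic pre-training data."""
    context = "\n".join(passages)
    return f"Use only this context:\n{context}\n\nQuestion: {question}\nAnswer:"
```

The augmented prompt is what gets sent to the LLM endpoint, which is the essential difference between RAG and asking the model directly.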


The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.

References to third-party services or organizations in this Guidance do not imply an endorsement, sponsorship, or affiliation between Amazon or AWS and the third party. Guidance from AWS is a technical starting point, and you can customize your integration with third-party services when you deploy the architecture.
