Overview
This Guidance shows how to optimize generative AI models using Amazon SageMaker, a fully managed service for building, training, and deploying large language models (LLMs) at scale. Advanced optimization techniques such as speculative decoding, quantization, and compilation can deliver higher throughput, lower latency, and reduced inference costs. The streamlined interface in SageMaker abstracts away the complex research and experimentation normally required to optimize generative AI models, so you can rapidly develop and deploy high-performing, cost-effective generative AI applications.
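As a concrete illustration, the sketch below runs a quantization job with the SageMaker Python SDK's ModelBuilder.optimize() interface; the model ID, S3 output path, IAM role, and instance type are placeholder assumptions, and the exact configuration keys may vary by SDK version. Swapping the quantization_config for a speculative_decoding_config or compilation_config applies the other techniques named above.

    from sagemaker.serve.builder.model_builder import ModelBuilder
    from sagemaker.serve.builder.schema_builder import SchemaBuilder

    # Placeholder sample request/response used to infer the serving schema.
    sample_input = {"inputs": "What is Amazon SageMaker?",
                    "parameters": {"max_new_tokens": 128}}
    sample_output = [{"generated_text": "Amazon SageMaker is a fully managed service..."}]

    model_builder = ModelBuilder(
        model="meta-llama/Meta-Llama-3-8B",  # assumed model ID
        schema_builder=SchemaBuilder(sample_input, sample_output),
        role_arn="arn:aws:iam::111122223333:role/SageMakerExecutionRole",  # placeholder
    )

    # Run an optimization job that quantizes the model weights with AWQ.
    optimized_model = model_builder.optimize(
        instance_type="ml.g5.12xlarge",                     # placeholder target instance
        output_path="s3://amzn-s3-demo-bucket/optimized/",  # placeholder bucket
        quantization_config={
            "OverrideEnvironment": {"OPTION_QUANTIZE": "awq"},
        },
        accept_eula=True,
    )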
The following diagrams illustrate the steps required to configure this application. The first shows how data scientists can optimize LLMs within SageMaker; the second outlines the deployment of those optimized LLMs.
How it works
Optimization for data scientists
This architecture diagram shows how data scientists can optimize LLMs within Amazon SageMaker to deliver responses that are faster, more accurate, and more cost-effective. The next tab outlines the deployment of the optimized LLMs in Amazon SageMaker.

LLM deployment in applications
This architecture diagram shows how to deploy the optimized LLMs in SageMaker, with AWS CloudFormation provisioning all the necessary application resources.
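To make the deployment step concrete, the sketch below deploys the optimized model from the previous example to a real-time SageMaker endpoint and invokes it; the instance type and endpoint name are placeholder assumptions.

    # Deploy the optimized model (returned by ModelBuilder.optimize() above)
    # to a real-time inference endpoint.
    predictor = optimized_model.deploy(
        initial_instance_count=1,
        instance_type="ml.g5.12xlarge",          # placeholder instance type
        endpoint_name="optimized-llm-endpoint",  # placeholder endpoint name
    )

    # Invoke the endpoint with a sample prompt.
    response = predictor.predict({
        "inputs": "Summarize the benefits of model quantization.",
        "parameters": {"max_new_tokens": 128},
    })
    print(response)

In the full Guidance, AWS CloudFormation provisions the surrounding application resources; the SDK calls above capture only the model deployment itself.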

Get Started
Well-Architected Pillars
The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, follow as many of these best practices as possible.
Related Content
Disclaimer