
Overview

This Guidance demonstrates how to streamline access to numerous large language models (LLMs) through a unified, industry-standard API gateway based on OpenAI API standards. By deploying this Guidance, you can simplify integration while gaining access to tools that track LLM usage, manage costs, and implement crucial governance features. This allows easy switching between models, efficient management of multiple LLM services within applications, and robust control over security and expenses.
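Because the gateway follows the OpenAI API standard, applications address every underlying model through the same request shape; switching models only changes the `model` field. A minimal, stdlib-only sketch of that request (the gateway URL, API key, and model name below are illustrative placeholders, not values from this Guidance):

```python
import json

def build_chat_request(model, prompt,
                       gateway_url="https://my-llm-gateway.example.com/v1",
                       api_key="sk-placeholder"):
    """Build an OpenAI-style chat completion request aimed at the gateway.

    The gateway routes the call to the provider behind `model`, so the
    application code stays identical across LLM services.
    """
    return {
        "url": f"{gateway_url}/chat/completions",
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "body": json.dumps({
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        }),
    }

# The same request shape works for any model the gateway proxies:
req = build_chat_request("anthropic.claude-3-sonnet", "Summarize our Q3 report.")
print(req["url"])  # ends with /chat/completions regardless of the model chosen
```

Sending the request (for example with `urllib.request` or the `openai` client's `base_url` option) is the only step that differs between environments; the payload itself is provider-agnostic.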

How it works

These technical details include an architecture diagram that illustrates how to use this solution effectively, showing the key components and their interactions step by step.

Deploy with confidence

Everything you need to launch this Guidance in your account is right here

We'll walk you through it

Dive deep into the implementation guide for additional customization options and service configurations to tailor to your specific needs.

Open guide

Let's make it happen

Ready to deploy? Review the sample code on GitHub for detailed deployment instructions, then deploy as-is or customize it to fit your needs.

Go to sample code

Well-Architected Pillars

The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.

LiteLLM application logs are stored in S3 buckets for audit and analysis purposes. Amazon ECS and Amazon EKS feature built-in tools and plugins to monitor the health and performance of their respective clusters, streaming log data to Amazon CloudWatch for event analysis. These managed services reduce the operational burden of deploying and maintaining application platform infrastructure. CloudWatch Logs provides comprehensive insight into both the infrastructure and application levels of Amazon ECS and Amazon EKS clusters, enabling effective troubleshooting and analysis.

Read the Operational Excellence whitepaper

ACM provides managed SSL/TLS certificates for secure communication and automatically manages these certificates to prevent vulnerabilities. AWS WAF protects web applications from common exploits and provides real-time monitoring and custom rule creation capabilities. Additionally, Amazon ECS and Amazon EKS clusters operate with public and private networks for additional security and isolation. AWS Identity and Access Management (IAM) roles and policies follow the least-privilege principle for both deployment of the Guidance and cluster operations, while Secrets Manager stores external model provider credentials and other sensitive settings securely.
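With provider credentials held in Secrets Manager, application code only ever handles them in memory after retrieval. The sketch below shows the parsing side of that flow; the JSON layout and key names are illustrative assumptions, not the actual secret format used by this Guidance:

```python
import json

def parse_provider_secret(secret_string):
    """Parse a Secrets Manager SecretString into per-provider credentials.

    Assumes a JSON layout like {"openai_api_key": "...", "anthropic_api_key": "..."};
    the real key names depend on how the secret was created.
    """
    data = json.loads(secret_string)
    # Keep only keys that look like provider API keys; never log their values.
    return {k: v for k, v in data.items() if k.endswith("_api_key")}

# In the deployed Guidance, the SecretString would come from a call such as
# boto3's secretsmanager get_secret_value; a literal stands in here.
sample = '{"openai_api_key": "sk-abc", "anthropic_api_key": "sk-def", "region": "us-east-1"}'
creds = parse_provider_secret(sample)
print(sorted(creds))  # ['anthropic_api_key', 'openai_api_key']
```

Fetching the secret at startup (or on rotation) rather than baking credentials into container images keeps least-privilege IAM roles as the only path to these values.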

Read the Security whitepaper

Amazon ECS and Amazon EKS provide container orchestration, automatically handling task placement and recovery across multiple Availability Zones for LiteLLM proxy and API/middleware containers. Amazon ElastiCache enables multi-tenant distribution of application settings and prompt caching. Together, these services enable highly available applications that can maintain operational SLAs even if individual components fail, offering auto-recovery capabilities.

Read the Reliability whitepaper

ElastiCache enhances performance by providing sub-millisecond latency for frequently accessed data through in-memory caching. ALB effectively distributes incoming application traffic across multiple targets based on advanced routing rules and health checks. Amazon ECS on Fargate and Amazon EKS provide efficient, on-demand infrastructure for running application containers, with auto scaling based on workload demands. LiteLLM's native integration with ElastiCache and Amazon RDS significantly reduces database load and improves application response times by serving cached content and routing requests efficiently.
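Prompt caching follows a cache-aside pattern: check the cache before calling the model, and store the response on a miss. In the architecture, ElastiCache plays the cache role; the sketch below substitutes an in-memory dict so it runs standalone (function names and the key scheme are illustrative):

```python
import hashlib

_cache = {}  # stand-in for ElastiCache; the real service adds TTL-based expiry

def cache_key(model, prompt):
    # Hash model + prompt into a fixed-size identifier safe to use as a cache key.
    return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()

def cached_completion(model, prompt, call_llm):
    """Cache-aside: return a cached response if present, else call the LLM and store it."""
    key = cache_key(model, prompt)
    if key in _cache:
        return _cache[key], True   # cache hit: no LLM call needed
    response = call_llm(model, prompt)
    _cache[key] = response
    return response, False         # cache miss: response stored for next time

calls = []
def fake_llm(model, prompt):
    calls.append(prompt)
    return f"echo: {prompt}"

print(cached_completion("m", "hi", fake_llm))  # ('echo: hi', False) - miss
print(cached_completion("m", "hi", fake_llm))  # ('echo: hi', True)  - hit
print(len(calls))                              # 1: the repeat request never reached the model
```

Serving repeats from the cache is what cuts both response latency and the database/provider load described above.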

Read the Performance Efficiency whitepaper

Amazon RDS offers automated backups, patching, and scaling, reducing operational overhead. Reserved instances and savings plans can significantly reduce costs for predictable workloads compared to on-demand pricing. Amazon ECS and Amazon EKS allow you to run containers on efficient Amazon Elastic Compute Cloud (Amazon EC2) instances, such as AWS Graviton-based instances, or on serverless Fargate infrastructure. This helps optimize compute costs by right-sizing resources and paying only for what you use.

Read the Cost Optimization whitepaper

Amazon EKS and Amazon ECS container orchestration engines enable multiple applications to share underlying compute resources (including efficient compute EC2 instances), maximizing resource utilization and reducing idle capacity. As a managed service, Amazon Bedrock eliminates the need for dedicated GPU infrastructure by sharing pre-trained models across multiple users. This shared infrastructure approach reduces the overall hardware resource footprint and energy consumption compared to running separate dedicated environments.

Read the Sustainability whitepaper

Disclaimer

The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.