
Guidance for Low Latency, High Throughput Inference using Efficient Compute on Amazon EKS

Overview

This Guidance demonstrates how to deploy a machine learning inference architecture on Amazon Elastic Kubernetes Service (Amazon EKS). It addresses the basic implementation requirements, as well as ways you can pack thousands of unique PyTorch deep learning (DL) models into a scalable architecture. PyTorch is an open-source machine learning framework that can help accelerate your machine learning journey from prototyping to deployment. We also explore a mix of Amazon Elastic Compute Cloud (Amazon EC2) instance families to develop an optimal design using efficient compute (such as AWS Graviton and AWS Inferentia) that lets you scale inference efficiently and cost-effectively.
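To illustrate the model-packing idea, here is a minimal sketch of a multi-model server that lazily loads TorchScript models on first request. It assumes one TorchScript file per model in an Amazon S3 bucket; the MODEL_BUCKET variable, the /predict route, and the caching scheme are illustrative assumptions, not the Guidance's actual code.

```python
# Minimal multi-model serving sketch (illustrative, not the Guidance's code).
# Assumes TorchScript model files are stored in an S3 bucket, one per model.
import os

import boto3
import torch
from fastapi import FastAPI, HTTPException

MODEL_BUCKET = os.environ.get("MODEL_BUCKET", "my-model-bucket")  # hypothetical bucket
s3 = boto3.client("s3")
app = FastAPI()
_models: dict[str, torch.jit.ScriptModule] = {}  # lazy in-memory cache of loaded models

def get_model(name: str) -> torch.jit.ScriptModule:
    """Download and cache a TorchScript model on first use."""
    if name not in _models:
        local_path = f"/tmp/{name}.pt"
        s3.download_file(MODEL_BUCKET, f"{name}.pt", local_path)
        _models[name] = torch.jit.load(local_path).eval()
    return _models[name]

@app.post("/predict/{model_name}")
def predict(model_name: str, features: list[float]):
    try:
        model = get_model(model_name)
    except Exception:
        raise HTTPException(status_code=404, detail=f"model {model_name} not found")
    with torch.no_grad():
        output = model(torch.tensor(features).unsqueeze(0))
    return {"prediction": output.squeeze(0).tolist()}
```

Because models are loaded on demand and cached, a single pod can serve many distinct models without holding all of them in memory at startup.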

How it works

Infrastructure

This infrastructure diagram shows how to set up an Amazon Elastic Kubernetes Service (Amazon EKS) cluster that is compatible with this Guidance. Optionally, you can use a pre-existing Amazon EKS cluster. To learn more about running inference workloads on this infrastructure, open the Architecture tab.

This architecture diagram illustrates a low-latency, high-throughput inference solution on AWS using Amazon Elastic Kubernetes Service (Amazon EKS). It shows interaction flows with Git repositories, the kubectl CLI, application APIs, load balancing, Auto Scaling groups, IAM, KMS, and managed cluster VPCs containing EKS control planes and compute nodes across multiple Availability Zones for scalable, efficient compute.
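For orientation, a minimal sketch of creating a compatible cluster with the AWS SDK for Python (boto3) might look like the following. The subnet IDs, role ARN, and Region are placeholders, and the Guidance provisions its infrastructure with its own templates rather than this call.

```python
# Illustrative sketch of creating an EKS cluster with boto3. Replace the
# placeholder role ARN and subnet IDs with values from your account.
import boto3

eks = boto3.client("eks", region_name="us-west-2")

response = eks.create_cluster(
    name="inference-cluster",
    roleArn="arn:aws:iam::111122223333:role/eks-cluster-role",  # placeholder
    resourcesVpcConfig={
        "subnetIds": ["subnet-aaaa1111", "subnet-bbbb2222"],  # spans two AZs
        "endpointPublicAccess": True,
        "endpointPrivateAccess": True,
    },
)
print(response["cluster"]["status"])  # typically "CREATING"
```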

Architecture

This diagram shows a simple, scalable, and highly available architecture for running machine learning (ML) inference workloads on AWS. It uses a standard Amazon Elastic Kubernetes Service (Amazon EKS) infrastructure that can be deployed across multiple Availability Zones for high availability. For instructions to set up an Amazon EKS cluster compatible with this Guidance, open the Infrastructure tab.

Architecture diagram illustrating a low-latency, high-throughput inference deployment using efficient compute on Amazon EKS. The workflow includes users, an automation framework, a Git repository, Amazon ECR, Amazon S3 or another model repository, Elastic Load Balancing, and compute node groups with multiple EC2 instance types (Inf2, G4dn, C5, C7g) across Availability Zones.
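To make the mixed node-group idea concrete, here is a hedged boto3 sketch that registers one managed node group per instance family so the scheduler can place inference pods on the most suitable compute. All names, subnets, and ARNs are placeholders, and the exact instance families you choose may differ.

```python
# Hypothetical sketch: one managed node group per instance family.
import boto3

eks = boto3.client("eks", region_name="us-west-2")

for name, instance_types, ami_type in [
    ("inf2-nodes", ["inf2.xlarge"], "AL2_x86_64"),       # AWS Inferentia2
    ("c7g-nodes", ["c7g.2xlarge"], "AL2_ARM_64"),        # AWS Graviton (arm64)
    ("g4dn-nodes", ["g4dn.xlarge"], "AL2_x86_64_GPU"),   # NVIDIA GPU
]:
    eks.create_nodegroup(
        clusterName="inference-cluster",
        nodegroupName=name,
        scalingConfig={"minSize": 1, "maxSize": 10, "desiredSize": 2},
        subnets=["subnet-aaaa1111", "subnet-bbbb2222"],  # placeholders
        instanceTypes=instance_types,
        amiType=ami_type,
        nodeRole="arn:aws:iam::111122223333:role/eks-node-role",  # placeholder
    )
```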

Well-Architected Pillars

The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.

Amazon EKS, Amazon ECR, and a test automation framework are used in this Guidance to enhance your operational excellence. They help you visualize, customize, and understand serving ML models with the FastAPI framework, giving you the flexibility to choose the Amazon EKS compute node instances that best optimize performance and costs. Amazon EKS and Amazon ECR are managed Kubernetes and image repository services, respectively, and fully support API-based automation of all phases of the machine learning operations (MLOps) cycle. We also show how you can automatically deploy and run a large number of customized machine learning models, and automate load and scale testing of their performance using an automation framework, as sketched below.
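As a rough sketch of that kind of automation (not the Guidance's own framework), the following uses the official Kubernetes Python client to template one Deployment per model. The image URI and model names are hypothetical.

```python
# Sketch: generate one Kubernetes Deployment per model from a template.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
apps = client.AppsV1Api()

IMAGE = "111122223333.dkr.ecr.us-west-2.amazonaws.com/model-server:latest"  # placeholder

def deployment_for(model_name: str) -> client.V1Deployment:
    labels = {"app": f"serve-{model_name}"}
    container = client.V1Container(
        name="server",
        image=IMAGE,
        env=[client.V1EnvVar(name="MODEL_NAME", value=model_name)],
        ports=[client.V1ContainerPort(container_port=8080)],
    )
    return client.V1Deployment(
        metadata=client.V1ObjectMeta(name=f"serve-{model_name}", labels=labels),
        spec=client.V1DeploymentSpec(
            replicas=2,
            selector=client.V1LabelSelector(match_labels=labels),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels=labels),
                spec=client.V1PodSpec(containers=[container]),
            ),
        ),
    )

for model in ["resnet50", "bert-base", "churn-v3"]:  # hypothetical model list
    apps.create_namespaced_deployment(namespace="default", body=deployment_for(model))
```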

Read the Operational Excellence whitepaper

Amazon EKS, Amazon VPC, IAM roles and policies, and Amazon ECR work in tandem to protect your information and systems. The Amazon EKS cluster resources are deployed into a VPC, which logically isolates them from the public internet. A VPC supports a variety of security features, such as security groups and network access control lists (ACLs), which control inbound and outbound traffic to resources, while IAM roles and policies limit access through authorization. The Amazon ECR image registry provides additional container-level security features, such as vulnerability scanning.
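For example, a small boto3 sketch can turn on scan-on-push for an Amazon ECR repository, one of the container-level security features mentioned above; the repository name is a placeholder.

```python
# Illustrative: enable vulnerability scanning on push for an ECR repository.
import boto3

ecr = boto3.client("ecr", region_name="us-west-2")

ecr.put_image_scanning_configuration(
    repositoryName="model-server",  # placeholder repository name
    imageScanningConfiguration={"scanOnPush": True},
)
```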

Read the Security whitepaper

Amazon EKS and Amazon ECR are used throughout this Guidance to help your workloads perform their intended functions correctly and consistently. Amazon EKS deploys the Kubernetes control plane (the instances that control how, when, and where your containers run) and the compute plane (the instances where your containers run) across multiple Availability Zones (AZs) in AWS Regions, so both planes remain available even if one AZ goes down. Elastic Load Balancing (ELB) routes application traffic to healthy nodes. Additionally, the Amazon EKS cluster components send metrics to Amazon CloudWatch, where alarms can be configured to alert you when certain thresholds are crossed.
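A hedged sketch of such an alarm follows, assuming Container Insights (or a similar agent) publishes node CPU metrics to CloudWatch; the namespace, metric name, and SNS topic ARN are illustrative.

```python
# Sketch: alarm when average node CPU utilization stays above 80% for 10 minutes.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")

cloudwatch.put_metric_alarm(
    AlarmName="inference-node-cpu-high",
    Namespace="ContainerInsights",          # assumes Container Insights metrics
    MetricName="node_cpu_utilization",
    Dimensions=[{"Name": "ClusterName", "Value": "inference-cluster"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-west-2:111122223333:ops-alerts"],  # placeholder
)
```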

Read the Reliability whitepaper

Amazon ECR, Amazon EKS, and Amazon EC2 are used in this Guidance to support a structured and streamlined allocation of IT and computing resources. The compute nodes within the Amazon EKS cluster (which are Amazon EC2 instances) can be scaled up and down based on the application's workload requirements while conducting tests, as sketched below. Moreover, Amazon ECR and Amazon EKS are highly available services, optimized for the scalability and performance of containerized applications. This Guidance uses those and other services (such as Amazon S3 and open-source tools hosted on GitHub) to monitor and optimize the performance characteristics of machine learning inference workloads through customization and automation.
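For instance, a managed node group's scaling bounds can be adjusted programmatically before or after a load test; this boto3 sketch uses placeholder cluster and node group names.

```python
# Illustrative: resize a managed node group's scaling bounds for a load test.
import boto3

eks = boto3.client("eks", region_name="us-west-2")

eks.update_nodegroup_config(
    clusterName="inference-cluster",   # placeholder
    nodegroupName="c7g-nodes",         # placeholder
    scalingConfig={"minSize": 2, "maxSize": 20, "desiredSize": 8},
)
```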

Read the Performance Efficiency whitepaper

Amazon ECR is a managed service that optimizes the costs of both storing and serving the container images for applications deployed on Amazon EKS. The compute nodes of the Amazon EKS cluster can scale up or down based on projected workloads when performing tests. Also, Amazon EKS node groups can be scaled efficiently, helping you identify the most cost-efficient compute node configuration for running ML inference at scale.

Read the Cost Optimization whitepaper

Amazon EKS, with its Amazon EC2 compute node instances deployed into the VPC, and Amazon ECR do not use custom hardware, meaning you do not need to purchase or manage any physical servers. Instead, this Guidance uses managed services that run on AWS infrastructure. Furthermore, by supporting energy-efficient processor instance types, such as AWS Graviton processors, this architecture improves sustainability. Running your workloads on Graviton-based Amazon EC2 instances can deliver the same performance with fewer resources, decreasing your overall resource footprint.
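As a sketch of how workloads land on Graviton nodes, the following uses the Kubernetes Python client to pin a Deployment to arm64 nodes via a nodeSelector; the Deployment and image names are illustrative and assume an arm64-compatible image.

```python
# Sketch: schedule an inference Deployment onto Graviton (arm64) nodes.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

labels = {"app": "serve-arm64"}
deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="serve-arm64", labels=labels),
    spec=client.V1DeploymentSpec(
        replicas=2,
        selector=client.V1LabelSelector(match_labels=labels),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels=labels),
            spec=client.V1PodSpec(
                node_selector={"kubernetes.io/arch": "arm64"},  # Graviton nodes
                containers=[
                    client.V1Container(name="server", image="model-server:arm64")
                ],
            ),
        ),
    ),
)
apps.create_namespaced_deployment(namespace="default", body=deployment)
```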

Read the Sustainability whitepaper

Deploy with confidence

Everything you need to launch this Guidance in your account is right here

We'll walk you through it

Dive deep into the implementation guide for additional customization options and service configurations to tailor to your specific needs.

Open guide

Let's make it happen

Ready to deploy? Review the sample code on GitHub for detailed instructions to deploy this Guidance as-is or customize it to fit your needs.

Go to sample code

Disclaimer

The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.