Guidance for Low Latency, High Throughput Inference using Efficient Compute on Amazon EKS

This Guidance demonstrates how to deploy a machine learning inference architecture on Amazon Elastic Kubernetes Service (Amazon EKS). It covers the basic implementation requirements as well as ways to pack thousands of unique PyTorch deep learning (DL) models into a scalable architecture. PyTorch is an open-source machine learning framework that can help accelerate your machine learning journey from prototyping to deployment. The Guidance also explores a mix of Amazon Elastic Compute Cloud (Amazon EC2) instance families to develop an optimal design using efficient compute (such as AWS Graviton and AWS Inferentia) that allows you to scale inference efficiently and cost-effectively.

Please note: [Disclaimer]

Architecture Diagram

Download the architecture diagram PDF

Infrastructure
This infrastructure diagram shows how to set up an Amazon Elastic Kubernetes Service (Amazon EKS) cluster that is compatible with this Guidance. Optionally, a pre-existing Amazon EKS cluster can be used. To learn more about running inference workloads on this infrastructure, open the Architecture tab.

Optional
To deploy this Guidance, you need a provisioned Amazon Elastic Kubernetes Service (Amazon EKS) cluster. These steps show how to provision an Amazon EKS cluster using the “provision” part of the project code.

Step 1
An administrator or DevOps user obtains the Infrastructure as Code (IaC) with the Amazon EKS specification from the Git repository.

Step 2
The administrator or DevOps user starts provisioning the Amazon Elastic Compute Cloud (Amazon EC2) management instance using the AWS CloudFormation code obtained from the Git repository.

Step 3
The management instance userData script starts the Amazon EKS cluster resource deployment against the target AWS environment (using the eksctl command and the cluster specification).

Step 4
The required AWS Identity and Access Management (IAM) roles, policies, and AWS Key Management Service (AWS KMS) keys are created.

Step 5
The Amazon EKS virtual private cloud (VPC) for the control plane component is deployed.

Step 6
The Amazon EKS cluster control plane components are deployed into the Amazon EKS VPC. The cluster control plane is provisioned across multiple Availability Zones and fronted by Elastic Load Balancing (ELB).

Step 7
The cluster VPC is deployed for the Amazon EKS compute plane.

Step 8
Public and private subnets and other networking components are deployed in the cluster VPCs.

Step 9
The Amazon EKS compute plane node groups, containing Amazon Elastic Compute Cloud (Amazon EC2) node instances in Auto Scaling groups, are deployed into the cluster VPC and join the Amazon EKS cluster.

Step 10
The Amazon EKS cluster is available for application deployment. The Kubernetes API is accessible to command line interface (CLI) clients and applications through an ELB.
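The cluster specification consumed by eksctl in Step 3 can be sketched as follows. This is a minimal illustration, not the project's actual specification: the cluster name, Region, node group names, instance sizes, and scaling limits are all hypothetical placeholders. The sketch builds an eksctl ClusterConfig document with one managed node group per instance family and prints it as JSON.

```python
import json


def build_cluster_config(name, region, node_groups):
    """Return an eksctl ClusterConfig document as a plain dict.

    node_groups maps a node group name to a tuple of
    (instance_type, min_size, max_size, desired_capacity).
    """
    return {
        "apiVersion": "eksctl.io/v1alpha5",
        "kind": "ClusterConfig",
        "metadata": {"name": name, "region": region},
        "managedNodeGroups": [
            {
                "name": group_name,
                "instanceType": instance_type,
                "minSize": min_size,
                "maxSize": max_size,
                "desiredCapacity": desired,
                # Place the nodes in the private subnets of the cluster VPC.
                "privateNetworking": True,
            }
            for group_name, (instance_type, min_size, max_size, desired)
            in node_groups.items()
        ],
    }


# All values below are placeholders chosen for illustration only.
config = build_cluster_config(
    name="inference-demo",
    region="us-west-2",
    node_groups={
        "graviton": ("c7g.4xlarge", 1, 4, 1),
        "inferentia": ("inf2.xlarge", 1, 4, 1),
    },
)
print(json.dumps(config, indent=2))
```

JSON is used here only to keep the sketch free of third-party dependencies. In practice the specification is usually kept as a YAML file and passed to eksctl with its --config-file option.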

Architecture
This diagram provides a simple, scalable, and highly available architecture for running machine learning (ML) inference workloads on AWS. It uses a standard Amazon Elastic Kubernetes Service (Amazon EKS) infrastructure that can be deployed across multiple Availability Zones for high availability. For instructions to set up an Amazon EKS cluster compatible with this Guidance, open the Infrastructure tab.

Step 1
The Amazon EKS cluster has several compute node groups with one Amazon Elastic Compute Cloud (Amazon EC2) instance family per node group. Different node groups can use different instance types, such as instances based on AWS Graviton processors (c7g) or AWS Inferentia processors (inf2), deployed across Availability Zones (AZs).

Step 2
Using an automation framework, users build the natural language processing (NLP) models, the serving application, and the machine learning (ML) framework dependencies into container images. These images are uploaded to Amazon Elastic Container Registry (Amazon ECR). Decoupling the model container images from the model data reduces the size of the model container images.

Step 3
Using the automation framework, the model container images, customized for each compute node instance type, are obtained from the respective Amazon ECR repositories. They are deployed to the Amazon EKS cluster using generated deployment manifests through the Kubernetes API, which is exposed through Elastic Load Balancing (ELB).
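As a rough illustration of what "generated deployment manifests" might look like, the sketch below builds one Kubernetes Deployment per processor type and pins each to its node group with the well-known node.kubernetes.io/instance-type node label. The account ID, Region, repository name, tags, instance sizes, and port are hypothetical, and the Guidance's own automation framework may generate its manifests differently.

```python
import json


def ecr_image_uri(account_id, region, repository, tag):
    """Build an Amazon ECR image URI from its parts."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


def build_deployment(name, image, instance_type, replicas=1, port=8080):
    """Return a Kubernetes Deployment manifest pinned to one instance type."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # Schedule pods only onto nodes of the matching type.
                    "nodeSelector": {
                        "node.kubernetes.io/instance-type": instance_type
                    },
                    "containers": [
                        {
                            "name": "model-server",
                            "image": image,
                            "ports": [{"containerPort": port}],
                        }
                    ],
                },
            },
        },
    }


# Hypothetical mapping from an image tag to the instance type it targets.
TARGETS = {"graviton": "c7g.4xlarge", "inf2": "inf2.xlarge"}

manifests = [
    build_deployment(
        name=f"model-server-{tag}",
        image=ecr_image_uri("111122223333", "us-west-2", "model-server", tag),
        instance_type=instance_type,
    )
    for tag, instance_type in TARGETS.items()
]
print(json.dumps(manifests, indent=2))
```

Each manifest could then be applied with kubectl or a Kubernetes client library. A real Inferentia deployment would typically also request accelerator devices in the container's resource limits, which is omitted here for brevity.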

Step 4
Upon initialization, ML model application containers download the model artifacts from the model repository, such as Amazon Simple Storage Service (Amazon S3) or another repository. This component of the architecture decouples the model data from its service definition. The ML inference services are then available in the Amazon EKS cluster.
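The download-on-initialization pattern in this step can be sketched as follows. This is an assumption-laden illustration, not the Guidance's code: the function names, cache layout, and example URI are hypothetical. The default downloader uses the boto3 S3 client's download_file call, and it is injected as a parameter so the caching logic can be exercised without AWS access.

```python
import os
from urllib.parse import urlparse


def parse_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc or not parsed.path.strip("/"):
        raise ValueError(f"not a valid s3:// URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def _boto3_download(bucket, key, destination):
    # Imported lazily so this module loads even where boto3 is not installed.
    import boto3

    boto3.client("s3").download_file(bucket, key, destination)


def ensure_model(uri, cache_dir, download=_boto3_download):
    """Download the model artifact once and return its local path.

    The container image carries only the serving code; the model data is
    fetched at startup, so one image can serve many different models.
    """
    bucket, key = parse_s3_uri(uri)
    destination = os.path.join(cache_dir, bucket, key)
    if not os.path.exists(destination):
        os.makedirs(os.path.dirname(destination), exist_ok=True)
        download(bucket, key, destination)
    return destination


# Hypothetical usage at container startup (requires AWS credentials):
# model_path = ensure_model("s3://example-bucket/models/model-0.pt", "/tmp/models")
```

Because the artifact location is passed in at startup, the same container image can be reused for every model, and only the URI in the service definition changes.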

Step 5
Load testing of the deployed ML inference services is performed using containerized test clients, which are deployed from images in the Amazon ECR repository. The clients send simultaneous requests to the ML model service pool running in the Amazon EKS cluster. Performance test result metrics are then collected and aggregated.
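A test client of this kind can be approximated with a thread pool that issues concurrent requests and then aggregates latency percentiles and throughput. The sketch below is a simplified stand-in for the Guidance's test clients: the function names are hypothetical, and send_request is any zero-argument callable supplied by the caller (for example, one that posts a payload to a model service endpoint).

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(send_request):
    """Run one request and return (succeeded, latency_in_seconds)."""
    start = time.perf_counter()
    try:
        send_request()
        ok = True
    except Exception:
        ok = False
    return ok, time.perf_counter() - start


def run_load_test(send_request, total_requests, concurrency):
    """Send requests from `concurrency` worker threads and summarize results."""
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(
            pool.map(lambda _: timed_call(send_request), range(total_requests))
        )
    wall_time = time.perf_counter() - wall_start

    latencies = [latency for ok, latency in results if ok]
    summary = {
        "requests": total_requests,
        "errors": sum(1 for ok, _ in results if not ok),
        # Attempted requests per second over the whole run.
        "throughput_rps": total_requests / wall_time,
    }
    if len(latencies) >= 2:
        # quantiles(n=100) returns 99 cut points; index 49 is the median.
        cuts = statistics.quantiles(latencies, n=100)
        summary.update(p50_s=cuts[49], p90_s=cuts[89], p99_s=cuts[98])
    return summary


if __name__ == "__main__":
    # Stand-in for a real inference call: sleep for roughly 10 ms.
    print(run_load_test(lambda: time.sleep(0.01), total_requests=200, concurrency=16))
```

Threads are a reasonable fit here because each request spends most of its time waiting on the network. Running several such clients as separate pods lets the aggregate load exceed what a single client process can generate.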

Well-Architected Pillars

The AWS Well-Architected Framework helps you understand the pros and cons of the decisions you make when building systems in the cloud. The six pillars of the Framework allow you to learn architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable systems. Using the AWS Well-Architected Tool, available at no charge in the AWS Management Console, you can review your workloads against these best practices by answering a set of questions for each pillar.

The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.

Operational Excellence

Amazon EKS, Amazon ECR, and a test automation framework are used in this Guidance to enhance your operational excellence. This Guidance helps you visualize, customize, and understand the concept of serving ML models using a FastAPI framework, and it gives you the flexibility to choose the Amazon EKS compute node instances that best optimize performance and costs. Amazon EKS and Amazon ECR are managed Kubernetes and image repository services, respectively, and fully support API-based automation of all phases of the machine learning operations (MLOps) cycle. This Guidance also shows how you can automatically deploy and run a large number of customized machine learning models, as well as automate load and scale testing of those models' performance using an automation framework.

Read the Operational Excellence whitepaper
Security

Amazon EKS, Amazon VPC, IAM roles and policies, and Amazon ECR work in tandem to protect your information and systems. The Amazon EKS cluster resources are deployed into a VPC that logically isolates them from the public internet. A VPC supports a variety of security features, such as security groups and network access control lists (ACLs), which control inbound and outbound traffic to resources. IAM roles and policies provide authorization to limit access. The Amazon ECR image registry provides additional container-level security features, such as vulnerability scanning.

Read the Security whitepaper
Reliability

Amazon EKS and Amazon ECR are used throughout this Guidance to help your workloads perform their intended functions correctly and consistently. Amazon EKS deploys the Kubernetes control plane (the instances that control how, when, and where your containers run) and the compute plane (the instances where your containers run) across multiple Availability Zones (AZs) in an AWS Region. This helps keep both the control and compute planes available even if one AZ goes down. Also, Elastic Load Balancing (ELB) routes application traffic to functional nodes. Additionally, the Amazon EKS cluster components send metrics to Amazon CloudWatch, where alarms can be configured to send alerts when certain thresholds are crossed.

Read the Reliability whitepaper
Performance Efficiency

Amazon ECR, Amazon EKS, and Amazon EC2 are used in this Guidance to support a structured and streamlined allocation of IT and computing resources. The compute nodes within the Amazon EKS cluster (which are Amazon EC2 instances) can be scaled up and down based on the application's workload requirements while conducting the tests. Moreover, Amazon ECR and Amazon EKS are highly available services, optimized for the scalability and performance of containerized applications. This Guidance uses those and other resources (such as Amazon S3 and open-source software hosted on GitHub) to monitor and optimize the performance characteristics of machine learning inference workloads through customization and automation.

Read the Performance Efficiency whitepaper
Cost Optimization

Amazon ECR is a managed service that optimizes the costs of both storing and serving container image applications that are deployed on Amazon EKS. The compute nodes of the Amazon EKS cluster can scale up or down, based on projected workloads, when performing tests. Also, Amazon EKS node groups can be efficiently scaled, helping you to identify the most cost-efficient compute node configuration for running ML inferences at scale.
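One way to compare node configurations is to convert each one's hourly price and measured throughput into a cost per million inferences. The sketch below shows that arithmetic only. The node group names, prices, and throughput figures are made-up placeholders, not measurements or published prices, and should be replaced with your own load test results and current pricing.

```python
def cost_per_million(hourly_price_usd, throughput_rps):
    """Cost in USD of one million inferences on a fully utilized node."""
    if throughput_rps <= 0:
        raise ValueError("throughput_rps must be positive")
    inferences_per_hour = throughput_rps * 3600
    return hourly_price_usd / inferences_per_hour * 1_000_000


# Placeholder figures: (hourly price in USD, sustained requests per second).
candidates = {
    "node-group-a": (0.50, 400.0),
    "node-group-b": (0.80, 1000.0),
}

costs = {
    name: cost_per_million(price, rps)
    for name, (price, rps) in candidates.items()
}
cheapest = min(costs, key=costs.get)

for name, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{name}: ${cost:.4f} per million inferences")
print(f"lowest cost per inference: {cheapest}")
```

With these placeholder numbers, the node group with the higher hourly price still comes out cheaper per inference because its throughput is proportionally higher, which is why price and throughput need to be evaluated together.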

Read the Cost Optimization whitepaper
Sustainability

Amazon EKS, with the Amazon EC2 compute node instances deployed into the VPC, and Amazon ECR do not require custom hardware, meaning you do not need to purchase or manage any physical servers. Instead, this Guidance uses managed services that run on the AWS infrastructure. Furthermore, by supporting the use of energy-efficient processor instance types, such as those based on AWS Graviton processors, this architecture provides increased sustainability. Running workloads on Graviton-based Amazon EC2 instances can improve their performance while using fewer resources, thereby decreasing your overall resource footprint.

Read the Sustainability whitepaper

Implementation Resources

A detailed guide is provided for you to experiment with and use within your AWS account. Each stage of building the Guidance, including deployment, usage, and cleanup, is examined to prepare it for deployment.

The sample code is a starting point. It is industry validated, prescriptive but not definitive, and a peek under the hood to help you begin.

Open implementation guide

Open sample code on GitHub
