AWS Open Source Blog

Running TorchServe on Amazon Elastic Kubernetes Service

This article was contributed by Josiah Davis, Charles Frenzel, and Chen Wu.

TorchServe is a model serving library that makes it easy to deploy and manage PyTorch models at scale in production environments. It removes the heavy lifting of deploying and serving PyTorch models with Kubernetes. TorchServe is built and maintained by AWS in collaboration with Facebook and is available as part of the PyTorch open source project. It delivers lightweight model serving with low latency, so you can deploy your models for high-performance inference. TorchServe runs in any machine learning environment, including Amazon Elastic Kubernetes Service (EKS), Amazon’s managed Kubernetes service. It also provides a Management API that lets you easily register new model versions, which are then immediately available for predictions through the Inference API. For more details, see the TorchServe GitHub repository and the documentation.

In this post, we will demonstrate how to deploy TorchServe on an Amazon EKS cluster for inference. This allows you to quickly deploy a pre-trained machine learning model as a scalable, fault-tolerant web service for low latency inference.

EKS workflow

Getting started

To get started, you will first need to install the required packages on your local machine or on an Amazon Elastic Compute Cloud (Amazon EC2) instance. To learn more, see Getting Started with Amazon EC2. If you are using your local machine, you must first configure your AWS credentials; the AWS CLI documentation explains how to do this. If you are using an Amazon EC2 instance, you must attach an AWS Identity and Access Management (IAM) role with the permissions described in this post; you can learn more about IAM roles in the documentation. This post was tested on an Amazon EC2 G4 instance.

Before beginning to set up the Amazon EKS cluster, you must first install the required command-line tools. To follow the steps in this post, you will need to have Docker, AWS Command Line Interface (AWS CLI), kubectl, eksctl, and AWS IAM Authenticator installed to deploy TorchServe to Amazon EKS. Refer to the GitHub repo for installation instructions.
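Once the tools are installed, a quick way to confirm that each one is available on your PATH is to check its version (the exact output will vary with your installation):

docker --version
aws --version
kubectl version --client
eksctl version
aws-iam-authenticator version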

Set up environment variables

To configure the setup process, you must first set the following global environment variables. These variables are used to pre-populate the Amazon EKS templates so that everything can be set up automatically via manifest files.

export AWS_ACCOUNT=<ACCOUNT ID>
export AWS_REGION=<AWS REGION>
export K8S_MANIFESTS_DIR=<Absolute path to store manifests>
export AWS_CLUSTER_NAME=<Name for the AWS EKS cluster>
export PT_SERVE_NAME=<Name of TorchServe in the EKS>
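As an illustration, the values might look like the following (these are placeholders, not real account details; substitute your own):

export AWS_ACCOUNT=123456789012            # your 12-digit AWS account ID
export AWS_REGION=us-west-2                # Region where the cluster will run
export K8S_MANIFESTS_DIR=$HOME/eks-manifests
export AWS_CLUSTER_NAME=torchserve-eks
export PT_SERVE_NAME=torchserve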

Set up Git repository

First, git clone the GitHub code repository.

git clone https://github.com/aws-samples/torchserve-eks
cd torchserve-eks

The directory structure of the Git repository is illustrated below.

├── LICENSE                                 
├── README.md
├── cloud_watch_util.sh                     # Script to set up CloudWatch logs
├── delete_cluster.sh                       # Script to tear down the EKS cluster
├── img
│   ├── EKSCTL.png
│   └── TorchServeOnAWS.png
├── installation.md                         # How to install command line tools
├── instructions.md                         # Step-by-step setup instructions
├── pt_serve_util.sh                        # Script to auto-gen manifest files
└── template                                # A directory with all template files
    ├── cloud_watch_policy.json             # IAM CloudWatch policy template            
    ├── cluster.yaml                        # EKS cluster manifest template
    ├── eks_ami_policy.json                 # IAM user policy template 
    └── pt_inference.yaml                   # TorchServe manifest template

Creating EKS manifest files

Once you have all dependencies installed, you will produce manifest files for Amazon EKS to create the cluster and for Kubernetes to deploy your TorchServe service. These files are in YAML format, and the GitHub code repository provides example YAML templates and a bash script to generate them automatically. Run the pt_serve_util.sh bash script to auto-generate the manifest files in the directory specified by the environment variable $K8S_MANIFESTS_DIR.

./pt_serve_util.sh

This bash script generates the manifest files based on the environment variables set in the previous step. The files include an IAM policy, an EKS cluster manifest for the underlying infrastructure, and a TorchServe manifest for the Kubernetes Service and Deployment.

The pt_serve_util.sh bash script accomplishes the following tasks (a simplified sketch of these steps follows the list):

  • Checks that the command-line tools, such as the AWS CLI, kubectl, eksctl, and aws-iam-authenticator, are installed properly
  • Checks that all the environment variables listed above are set properly
  • Generates cluster.yaml and pt_inference.yaml in directory $K8S_MANIFESTS_DIR
  • Updates the eks_ami_policy.json IAM policy file with environment variables AWS_ACCOUNT and AWS_REGION
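The snippet below is a simplified sketch of these steps, not the script’s exact contents; it assumes bash and the envsubst utility, and the output location of the updated policy file is an assumption:

# Check that the required command-line tools are installed
for tool in aws kubectl eksctl aws-iam-authenticator; do
  command -v "$tool" >/dev/null || { echo "missing required tool: $tool"; exit 1; }
done

# Check that the required environment variables are set
for var in AWS_ACCOUNT AWS_REGION K8S_MANIFESTS_DIR AWS_CLUSTER_NAME PT_SERVE_NAME; do
  [ -n "${!var}" ] || { echo "environment variable $var is not set"; exit 1; }
done

# Substitute the environment variables into the templates
mkdir -p "${K8S_MANIFESTS_DIR}"
envsubst < template/cluster.yaml        > "${K8S_MANIFESTS_DIR}/cluster.yaml"
envsubst < template/pt_inference.yaml   > "${K8S_MANIFESTS_DIR}/pt_inference.yaml"
envsubst < template/eks_ami_policy.json > "${K8S_MANIFESTS_DIR}/eks_ami_policy.json"  # output path is an assumption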

Set up IAM roles and policies

You’ll need a sufficiently permissive IAM user policy to create the underlying infrastructure for TorchServe’s EKS service and deployment. This policy should include the eksctl minimum IAM permissions as well as permission to retrieve the Amazon EKS-optimized AMI ID.

For more information, refer to the Adding and Removing IAM Identity Permissions documentation.
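One possible way to attach the generated policy, assuming you are running as an IAM user (the user name and policy name below are placeholders, and the path to eks_ami_policy.json depends on where the script wrote it):

aws iam put-user-policy \
  --user-name <YOUR_IAM_USER> \
  --policy-name torchserve-eks-setup \
  --policy-document file://template/eks_ami_policy.json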

A single-node EKS cluster on GPU

eksctl is a command-line tool for creating clusters on Amazon EKS. Under the hood, it drives AWS CloudFormation, with options passed in through cluster.yaml. In this post, we make use of eksctl’s GPU support, using a single G4 instance, with the cluster name and Region taken from the global variables set earlier. You can find the additional configuration options available with eksctl in the documentation.

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: ${AWS_CLUSTER_NAME}
  region: ${AWS_REGION}

nodeGroups:
  - name: ng-1
    instanceType: g4dn.xlarge
    desiredCapacity: 1

Use kubectl to create a Service and Deployment

The TorchServe manifest file is applied with kubectl, the CLI for controlling Kubernetes clusters, and targets GPU inference. The manifest file, pt_inference.yaml, contains definitions for both the Service and the Deployment. The Service section of the file opens port 8080 for the Inference API and port 8081 for the Model Management API, and specifies the service type LoadBalancer. Note that if you do not specify the service type, it defaults to ClusterIP, in which case the service is only accessible from within the same VPC. The Deployment section of the file sets the replica count, specifies the container image from which to build the deployment, exposes the same ports, and applies resource limits.

After running pt_serve_util.sh, the Kubernetes application names are populated in pt_inference.yaml. Inside the YAML file, under the Deployment section, image points directly to the TorchServe image published on Docker Hub. Once the script runs, the template placeholder for the Kubernetes Service name, your_service_name, is replaced with the value of the environment variable ${PT_SERVE_NAME}.

---
kind: Service
apiVersion: v1
metadata:
  name: your_service_name
  labels:
    app: your_service_name
spec:
  ports:
  - name: preds
    port: 8080
    targetPort: ts
  - name: mdl
    port: 8081
    targetPort: ts-management
  type: LoadBalancer
  selector:
    app: your_service_name
---
kind: Deployment
apiVersion: apps/v1
metadata:
  name: your_service_name
  labels:
    app: your_service_name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: your_service_name
  template:
    metadata:
      labels:
        app: your_service_name
    spec:
      containers:
      - name: your_service_name
        image: "pytorch/torchserve:latest-gpu"
        ports:
        - name: ts
          containerPort: 8080
        - name: ts-management
          containerPort: 8081
        imagePullPolicy: IfNotPresent
        resources:
          limits:
            cpu: 4
            memory: 4Gi
            nvidia.com/gpu: 1
          requests:
            cpu: "1"
            memory: 1Gi

Subscribe to EKS-optimized AMI with GPU support in the AWS Marketplace

To run Amazon EKS with a GPU, you must first subscribe to the Amazon EKS-optimized AMI with GPU support from the AWS Marketplace console using your AWS account. The Amazon EKS-optimized AMI with GPU support builds on top of the standard Amazon EKS-optimized AMI and is configured to serve as the base image for P2, P3, and G4 instances in Amazon EKS clusters. Following the link and choosing Subscribe ensures that the EKS node creation step succeeds.

AWS Marketplace console

Creating an EKS cluster

Now that the required command-line tools and a permissive policy are set up, you can begin creating your cluster. For this post, we use eksctl, which launches an automation script based on AWS CloudFormation and the pre-configured YAML file cluster.yaml to stand up the underlying infrastructure on which TorchServe will run. The YAML file specifies the cluster name, Region, and instance type. In this tutorial, you will run only a single node, but you can edit the file further based on your needs, as described in the eksctl documentation. Run the command below to build an EKS cluster with a single-node EC2 instance:

eksctl create cluster -f ${K8S_MANIFESTS_DIR}/cluster.yaml
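Cluster creation can take 15 minutes or more. When eksctl finishes, it updates your kubeconfig automatically, so you can verify that the cluster and its node are up:

kubectl get nodes -o wide
eksctl get cluster --region ${AWS_REGION}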

Running the TorchServe container on EKS

Install NVIDIA device plugin for Kubernetes

Because the pre-trained PyTorch model will be making use of a GPU, you will need to install the NVIDIA device plugin.

With kubectl set up, enter the following command:

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/master/deployments/static/nvidia-device-plugin.yml

You can then verify that the plugin successfully installed by running the following command:

kubectl get daemonset -n kube-system
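To confirm that the node now advertises its GPU as an allocatable resource, you can also run the following (note the escaped dot in the resource name):

kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"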

Deploy pods to EKS cluster

Next, you will create a namespace for the Kubernetes resources and then apply the Kubernetes manifest file to the cluster. The manifest file, pt_inference.yaml, creates the Kubernetes Deployment and Service for the TorchServe pods. In particular, it points to the TorchServe container image registered in Docker Hub and exposes ports 8080 and 8081, from which the service is queryable once it is live.

NAMESPACE=pt-inference; kubectl create namespace ${NAMESPACE}
kubectl -n ${NAMESPACE} apply -f ${K8S_MANIFESTS_DIR}/pt_inference.yaml

After this is complete, you can confirm that the deployment is set up and in service by running the following command:

kubectl get pods -n ${NAMESPACE}
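You can also inspect the Service to see the external hostname of the load balancer (it may take a few minutes for the endpoint to become reachable):

kubectl get svc -n ${NAMESPACE}
# The EXTERNAL-IP column shows the load balancer hostname once it has been provisioned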

Set up logging on Amazon CloudWatch

Run the following script to enable Amazon CloudWatch log groups:

./cloud_watch_util.sh

The cloud_watch_util.sh bash script accomplishes the following tasks:

  • Uses eksctl to obtain the IAM role of the Amazon EKS cluster nodes and saves it in the environment variable NODE_INSTANCE_ROLE_NAME
  • Updates the cloud_watch_policy.json IAM policy file with the environment variables AWS_ACCOUNT and AWS_REGION
  • Attaches the inline policies defined in cloud_watch_policy.json to the EKS cluster role $NODE_INSTANCE_ROLE_NAME
  • Deploys Container Insights on EKS by setting up the CloudWatch agent and Fluentd DaemonSet

Once the bash script executes successfully, perform the Inference on the endpoint step below. Then, in the AWS Management Console, navigate to CloudWatch, Logs, Log groups.

Here, we can check TorchServe logs at:

/aws/containerinsights/${AWS_CLUSTER_NAME}/application/${PT_SERVE_NAME}*

These log entries are either performance-related (e.g., CPU utilization) or access-related, such as inference requests.
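If you prefer the command line to the console, the same log groups can be listed with the AWS CLI, and with AWS CLI v2 you can tail them directly (the log group names below assume the default Container Insights naming):

aws logs describe-log-groups --log-group-name-prefix "/aws/containerinsights/${AWS_CLUSTER_NAME}" --region ${AWS_REGION}

# AWS CLI v2 only
aws logs tail "/aws/containerinsights/${AWS_CLUSTER_NAME}/application" --follow --region ${AWS_REGION}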

Register models with TorchServe

Get the external IP for the service and store it in a variable:

EXTERNAL_IP=`kubectl get svc -n ${NAMESPACE} -o jsonpath='{.items[0].status.loadBalancer.ingress[0].hostname}'`

Here, we register a publicly available model. For more details on the required contents of the model file, read the docs for the model-archiver utility, which is provided with TorchServe.

response=$(curl --write-out %{http_code} --silent --output /dev/null --retry 5 -X POST "http://${EXTERNAL_IP}:8081/models?url=https://torchserve.s3.amazonaws.com/mar_files/resnet-18.mar&initial_workers=1&synchronous=true")

if [ ! "$response" == 200 ]
then
    echo "failed to register model with torchserve"
else
    echo "successfully registered model with torchserve"
fi
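Once registration succeeds, you can ask the Management API to describe the model, including how many workers are currently serving it:

curl http://${EXTERNAL_IP}:8081/models/resnet-18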

Note: If you do not specify a LoadBalancer as the type, the default type will be the ClusterIP and the endpoint will only be accessible within the internal VPC. In that case, you can use port forwarding as follows:

kubectl port-forward -n ${NAMESPACE} `kubectl get pods -n ${NAMESPACE} --selector=app=${PT_SERVE_NAME} -o jsonpath='{.items[0].metadata.name}'` 8080:8080 8081:8081 &
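With port forwarding in place, replace http://${EXTERNAL_IP} with http://localhost in the registration and inference commands, since ports 8080 and 8081 are then mapped to your local machine.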

Inference on the endpoint

There are multiple ways to invoke inference on the cluster. In this post, we query it directly with curl, as demonstrated in TorchServe’s model serving example.

# Save the image locally
curl -O https://raw.githubusercontent.com/pytorch/serve/master/docs/images/kitten_small.jpg

# Send the image for inference
curl -X POST http://${EXTERNAL_IP}:8080/predictions/resnet-18 -T kitten_small.jpg

# List out models currently registered
curl -X GET http://${EXTERNAL_IP}:8081/models/

Running the above should return ImageNet classes in a JSON format.

[
  {
    "tiger_cat": 0.46933549642562866
  },
  {
    "tabby": 0.4633878469467163
  },
  {
    "Egyptian_cat": 0.06456148624420166
  },
  {
    "lynx": 0.0012828214094042778
  },
  {
    "plastic_bag": 0.00023323034110944718
  }
]
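The Management API can also scale a model’s workers or unregister it entirely when it is no longer needed; for example:

# Scale the number of workers serving the model
curl -X PUT "http://${EXTERNAL_IP}:8081/models/resnet-18?min_worker=2&synchronous=true"

# Unregister the model
curl -X DELETE http://${EXTERNAL_IP}:8081/models/resnet-18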

Cleaning up

To remove the cluster completely and tear down the associated infrastructure, run the following command:

./delete_cluster.sh
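The delete_cluster.sh script in the repository handles the teardown for you; conceptually, it is roughly equivalent to asking eksctl to delete the cluster defined by the same manifest (shown here as an assumption rather than the script’s exact contents):

eksctl delete cluster -f ${K8S_MANIFESTS_DIR}/cluster.yaml --wait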

Conclusion

This post showed how to set up TorchServe on Amazon EKS using a variety of related command-line tools, such as kubectl and eksctl. Although demonstrated with a single model on a single-node cluster, this type of deployment scales to multiple nodes and extends to more advanced deployments. For example, you can stack models on top of each other on a single node with TorchServe to reduce cost and increase resource utilization. Moreover, you can subdivide the GPU into multiple containers using bin packing to distribute the workload and schedule containers across the namespace. TorchServe makes it easier to deploy and manage these types of workloads, and much more.

Josiah Davis

Josiah Davis is a Senior Data Scientist with AWS where he engages with customers to solve applied problems in Machine Learning. Outside of work, he enjoys reading and travelling with his family. He holds a master's degree in Statistics from UC Berkeley.

Charles Frenzel

Charles is a Senior Data Scientist for Professional Services based in Tokyo, Japan. He works directly with AWS customers to build machine learning models for production. In his spare time he enjoys biking with his children, kettlebell training, and drinking matcha tea.

Chen Wu

Dr. Chen Wu is a Principal Applied Scientist at AWS based in Western Australia. Chen works directly with customers to solve their data science and machine learning problems in various industries such as logistics, mining, automotive, transportation, pharmacology, digital design, and manufacturing. Prior to joining AWS, Chen worked in the field of astronomy and high-performance computing.