Reduce ML inference costs on Amazon SageMaker for PyTorch models using Amazon Elastic Inference
Today, we are excited to announce that you can now use Amazon Elastic Inference to accelerate inference and reduce inference costs for PyTorch models in both Amazon SageMaker and Amazon EC2.
PyTorch is a popular deep learning framework that uses dynamic computational graphs. This allows you to easily develop deep learning models with imperative and idiomatic Python code. Inference is the process of making predictions using a trained model. For deep learning applications that use frameworks such as PyTorch, inference accounts for up to 90% of compute costs. Selecting the right instance for inference can be challenging because deep learning models require different amounts of GPU, CPU, and memory resources. Optimizing for one of these resources on a standalone GPU instance usually leads to under-utilization of other resources. Therefore, you might pay for unused resources.
Amazon Elastic Inference solves this problem by enabling you to attach the right amount of GPU-powered inference acceleration to any Amazon SageMaker or EC2 instance, or Amazon ECS task. You can choose any CPU instance in AWS that is best suited to your application’s overall compute and memory needs, and separately attach the right amount of GPU-powered inference acceleration needed to satisfy your application’s latency requirements. This allows you to use resources more efficiently and lowers inference costs. Today, PyTorch joins TensorFlow and Apache MXNet as a deep learning framework supported by Elastic Inference. The PyTorch version supported as of this writing is 1.3.1.
Amazon SageMaker is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy machine learning (ML) models quickly on deep learning frameworks such as TensorFlow, Apache MXNet, and PyTorch. Amazon SageMaker makes it easy to generate predictions by providing everything you need to deploy machine learning models in production and monitor model quality.
This post demonstrates how you can use Elastic Inference to lower costs and improve latency for your PyTorch models on Amazon SageMaker.
TorchScript: Bridging the gap between research and production
We now discuss TorchScript, which is a way to create serializable and optimizable models from PyTorch code. You must convert your models to TorchScript in order to use Elastic Inference with PyTorch.
PyTorch’s use of dynamic computational graphs greatly simplifies the model development process. However, this paradigm presents unique challenges for production model deployment. In a production context, it is beneficial to have a static graph representation of the model. This not only enables you to use the model in Python-less environments, but also allows for performance and memory optimizations.
TorchScript bridges this gap by providing the ability to compile and export models to a Python-free graph-based representation. You can run your models in any production environment by converting PyTorch models into TorchScript. TorchScript also performs just-in-time graph-level optimizations, providing a performance boost over standard PyTorch.
To use Elastic Inference with PyTorch, you have to convert your models into TorchScript format and use the inference API for Elastic Inference. This post provides an example of how to compile models into TorchScript and benchmark end-to-end inference latency with Elastic Inference-enabled PyTorch. This post concludes by comparing performance and cost metrics for a variety of instance and accelerator combinations to standalone CPU and GPU instances.
Compiling and serializing models with TorchScript
Scripting a model is usually the preferred method of compiling to TorchScript because it preserves all model logic. However, as of this writing, the set of scriptable models with PyTorch 1.3.1 is smaller than the set of traceable models. Your model may be traceable but not scriptable, or not traceable at all. You may need to modify your model code to make it compatible with TorchScript.
Due to the way that Elastic Inference currently handles control-flow operations in PyTorch 1.3.1, inference latency may be suboptimal for scripted models that contain many conditional branches. Try both tracing and scripting to see how your model performs with Elastic Inference. With the 1.3.1 release, a traced model likely performs better than its scripted version.
For more information, see the Introduction to TorchScript tutorial on the PyTorch website.
Scripting performs direct analysis of the source code to construct a computation graph and preserve control flow. The following example shows how to compile a model using scripting. It uses TorchVision’s pre-trained weights for ResNet-18. You can save the resulting scripted model to a file and then load it with torch.jit.load using Elastic Inference-enabled PyTorch. See the following code:
Tracing uses a sample input to record the operations performed when you run the model on that input. This means that control-flow might be erased because you are compiling the graph by tracing the code with just a single input. For example, a model definition might have code to pad images of a particular size x. If you trace the model with an image of a different size y, future inputs of size x fed to the traced model are not padded. This happens because not all code paths were executed while tracing with the given input.
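The difference between the two approaches is easiest to see on a toy module. The following is a minimal sketch that runs on standard PyTorch and does not involve Elastic Inference. PadToMinWidth and its min_width parameter are hypothetical names invented for this illustration and are not part of any model in this post. The module is traced with a wide sample input, so the padding branch is never executed during tracing.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PadToMinWidth(nn.Module):
    """Hypothetical toy module: zero-pads the last dimension up to min_width."""

    def __init__(self, min_width: int = 8):
        super().__init__()
        self.min_width = min_width

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        width = x.size(-1)
        if width < self.min_width:
            # Branch taken only for inputs narrower than min_width.
            x = F.pad(x, [0, self.min_width - width])
        return torch.relu(x)


module = PadToMinWidth().eval()
narrow = torch.ones(1, 4)   # needs padding
wide = torch.ones(1, 10)    # does not need padding

# Scripting compiles the source code, so the if-branch is preserved.
scripted = torch.jit.script(module)

# Tracing records only the operations run for the sample input.
# The wide sample never enters the padding branch, so it is not recorded.
# PyTorch may emit a TracerWarning here about the data-dependent branch.
traced = torch.jit.trace(module, wide)

eager_shape = tuple(module(narrow).shape)
scripted_shape = tuple(scripted(narrow).shape)
traced_shape = tuple(traced(narrow).shape)
print(eager_shape, scripted_shape, traced_shape)
```

In this sketch the eager and scripted modules pad the narrow input to width 8, while the traced module returns it at its original width of 4, because the padding branch was never recorded.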
The following example shows how to compile a model using tracing with a randomized tensor input. It also uses TorchVision’s pre-trained weights for ResNet-18. You must use the torch.jit.optimized_execution context block with a second parameter for device ordinal to use traced models with Elastic Inference. This modified function definition, which accepts two parameters, is only available through the Elastic Inference-enabled PyTorch framework.
If you are tracing your model with the standard PyTorch framework, omit the torch.jit.optimized_execution block. You can still save the resulting traced model to a file and load it with torch.jit.load using Elastic Inference-enabled PyTorch. See the following code:
Saving and loading a compiled model
The output of tracing and scripting is a ScriptModule, which is the TorchScript analog of standard PyTorch’s nn.Module. Serializing and deserializing a TorchScript module is as easy as calling torch.jit.save() and torch.jit.load(), respectively. This is the JIT analog of saving and loading a standard PyTorch model using torch.save() and torch.load(). See the following code:
Saved TorchScript models are not bound to specific classes and code directories, unlike saved standard PyTorch models. You can directly load saved TorchScript models without instantiating the model class first. This allows you to use TorchScript models in environments without Python.
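As a rough illustration of the save and load round trip, the following sketch scripts a small module, writes it to a temporary file, and loads it back with standard PyTorch. TinyClassifier and the file name tiny_scripted.pt are hypothetical stand-ins chosen for this example. A real workflow would use your own model and path.

```python
import os
import tempfile

import torch
import torch.nn as nn


class TinyClassifier(nn.Module):
    """Hypothetical stand-in for a real model such as ResNet-18."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.fc(x), dim=-1)


model = TinyClassifier().eval()
scripted = torch.jit.script(model)

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "tiny_scripted.pt")  # hypothetical file name
    torch.jit.save(scripted, path)
    # Loading needs only the file, not the TinyClassifier class definition.
    restored = torch.jit.load(path)

sample = torch.rand(1, 4)
with torch.no_grad():
    outputs_match = torch.allclose(scripted(sample), restored(sample))
print(outputs_match)
```

The call to torch.jit.load takes only the file path, which is what makes it possible to load the model in a process that never imports the original class.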
End-to-end inference benchmarking in Amazon SageMaker with Elastic Inference PyTorch
This post walks you through the process of benchmarking Elastic Inference-enabled PyTorch inference latency for DenseNet-121 using an Amazon SageMaker hosted endpoint. DenseNet-121 is a convolutional neural network (CNN) that has achieved state-of-the-art results in image classification on a variety of datasets. Its architecture is loosely based on ResNet, another popular CNN for image classification.
Amazon SageMaker hosting makes it possible to deploy your models to HTTPS endpoints, which makes your model available to perform inference via HTTP requests.
This walkthrough uses an EC2 instance as the client for launching and interacting with Amazon SageMaker hosted endpoints. This client instance does not have an accelerator attached; you will launch an endpoint that provisions a hosting instance with an accelerator attached. To complete the walkthrough, you must first complete the following prerequisites:
- Create an IAM role and add the AmazonSageMakerFullAccess policy. This policy gives permissions to use Amazon Elastic Inference and Amazon SageMaker.
  - Make sure that sagemaker.amazonaws.com is a trusted entity under “Trust Relationships”.
  - For more information, see Configuring an Instance Role with an Elastic Inference Policy, and Amazon SageMaker Roles.
- Launch an m5.large CPU EC2 instance.
  - Use either the Linux or Ubuntu Deep Learning AMI (DLAMI) v27.
This post uses the built-in Elastic Inference-enabled PyTorch Conda environment from the DLAMI, only to access the Amazon SageMaker SDK and save DenseNet-121 weights using PyTorch 1.3.1. When Amazon SageMaker Notebook support is released, you may use the Notebook kernel instead.
The hosted instance and accelerator use Elastic Inference-enabled PyTorch through the AWS DL Container. Your choice of environment for the client instance is only to facilitate easy usage of the Amazon SageMaker SDK and save model weights using PyTorch 1.3.1.
Complete the following steps:
- Log in to the instance that you created.
- Activate the Elastic Inference-enabled PyTorch Conda environment with the following command:
- Create an empty file called script.py. This file serves as the entry point for the hosted container. The file is empty in order to trigger the default handlers. The default predict_fn is available for both standard PyTorch and Elastic Inference-enabled PyTorch. The default model_fn, however, is only implemented for Elastic Inference-enabled PyTorch. If you are benchmarking standard PyTorch, then you need to implement your own model_fn.
- In the same directory, create a script called create_sm_tarball.py with the following code:
This script creates a tarball following the naming convention that Amazon SageMaker uses (model.pt by default). The model weights for DenseNet-121 are ImageNet-pretrained and pulled from TorchVision. For more information, see Using PyTorch with the SageMaker Python SDK.
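The packaging part of this step can be sketched with the Python standard library alone. This is not the script from this post. It is a minimal illustration under assumptions: the archive name model.tar.gz and the source file name densenet121_traced.pt are made up for the example, and placeholder bytes stand in for a file that you would normally write with torch.jit.save.

```python
import os
import tarfile
import tempfile


def make_model_tarball(model_path: str, tarball_path: str) -> None:
    """Pack a serialized TorchScript file into a gzip tarball as model.pt."""
    with tarfile.open(tarball_path, "w:gz") as archive:
        # arcname places the file at the archive root under the expected name.
        archive.add(model_path, arcname="model.pt")


with tempfile.TemporaryDirectory() as workdir:
    # Placeholder bytes stand in for a real file written by torch.jit.save.
    model_path = os.path.join(workdir, "densenet121_traced.pt")  # hypothetical name
    with open(model_path, "wb") as handle:
        handle.write(b"placeholder for TorchScript bytes")

    tarball_path = os.path.join(workdir, "model.tar.gz")  # assumed archive name
    make_model_tarball(model_path, tarball_path)

    # List the archive contents to confirm the file name inside the tarball.
    with tarfile.open(tarball_path, "r:gz") as archive:
        members = archive.getnames()

print(members)
```

Using arcname keeps the name inside the archive independent of the name of the file on disk, so the serialized model can be stored under any local name.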
- Run the script to create the tarball with the following command:
- Create a script called create_sm_endpoint.py with the following code:
You need to modify the script to include your AWS account ID, region, and IAM role ARN. The script uses your previously created tarball and blank entry point script to provision an Amazon SageMaker hosted endpoint. This example code benchmarks an ml.c5.large hosting instance with an ml.eia2.medium accelerator attached.
You do not have to provide the image directly in order to create an endpoint, but this post does so for clarity. For more information about available Docker containers for other frameworks, see Deep Learning Containers Images.
- Run the script to create a hosted endpoint with ml.c5.large and ml.eia2.medium attached, using the following command:
- Go to the SageMaker console and wait for your endpoint to finish deploying. This should take approximately 10 minutes. You are now ready to invoke the endpoint to run inference.
- Create a script called benchmark_sm_endpoint.py with the following code:
The script uses a tensor of size 1 x 3 x 224 x 224 (standard in image classification). It first runs a series of 100 warmup inferences, and then runs 1000 inferences. Latency percentiles are only reported from these 1000 inferences.
This post uses the latency metric ModelLatency. This metric is emitted to Amazon CloudWatch and captures inference latency within the Amazon SageMaker system. For more information, see Monitor Amazon SageMaker with Amazon CloudWatch.
You must compile your model with TorchScript and save it as model.pt in the tarball.
When you invoke an endpoint with the accelerator attached to the hosting instance, Amazon SageMaker invokes the default model_fn and predict_fn. If you are using PyTorch in Amazon SageMaker without an accelerator, you need to provide your own implementation of model_fn through the entry point script.
The default model_fn uses torch.jit.load('model.pt') to load the model weights because it assumes that you previously serialized the model with TorchScript and adhered to the file name convention. When an accelerator is attached, the default predict_fn uses the torch.jit.optimized_execution block, which specifies that the model should be optimized to run on the attached Elastic Inference accelerator. Otherwise, predict_fn does inference in the standard PyTorch way. Note that multi-attach is not supported for Amazon SageMaker as of this writing. Thus, the device ordinal is always set to
If you decide to implement your own predict_fn while using Elastic Inference, you must remember to use the torch.jit.optimized_execution context, or your inference will run entirely on the hosting instance and will not use the attached accelerator. For more information, see Using PyTorch with the SageMaker Python SDK.
The default handlers are available on GitHub.
- Run the benchmark script with the following command:
You should see output similar to the following:
Selecting the right instance
When you deploy new inference workloads, you have many instance types to choose from. You should consider the following key parameters:
- Memory – You need to select a host instance and accelerator combination that provides sufficient CPU and accelerator memory for your application. You can lower bound the runtime memory requirement as the sum of your input tensor sizes and model size. However, runtime memory usage is usually significantly higher than this lower bound for any model, and also varies by framework. You should only use this guideline to help roughly guide your choice of CPU instance and Elastic Inference accelerator(s).
- Latency requirements – After you have a set of host instances and accelerators with sufficient memory, you can further narrow down your choices to those that satisfy the application’s latency requirements. This post considers latency per inference as the key metric to assess performance. Inference throughput in images or words processed per unit time is another commonly used metric.
- Cost – After you have a set of hardware combinations that satisfy both your memory and latency requirements, you can optimize cost efficiency by selecting the combination that gives the lowest price per inference. You can compute this metric as (price per second × average latency per inference call in seconds). To make the numbers more concrete, this post provides cost per 100,000 inferences. You can compare cost efficiency for your workload and pick the optimal hardware by doing this for each hardware combination. This post uses the price per hour for the US West (Oregon) region.
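The cost metric above can be written as a small helper. The following sketch assumes inferences run one after another on a single endpoint, so the billed time is the number of inferences multiplied by the average latency. The function name cost_per_inferences is hypothetical. The hourly prices and average latencies are the values listed for the two Elastic Inference options in the results table in this post.

```python
def cost_per_inferences(price_per_hour: float, avg_latency_ms: float,
                        num_inferences: int = 100_000) -> float:
    """Cost in dollars of num_inferences sequential calls on one endpoint."""
    price_per_second = price_per_hour / 3600.0
    avg_latency_s = avg_latency_ms / 1000.0
    return price_per_second * avg_latency_s * num_inferences


# Hourly price and average latency taken from the results table in this post.
c5_cost = cost_per_inferences(price_per_hour=0.29, avg_latency_ms=47.85)
m5_cost = cost_per_inferences(price_per_hour=0.30, avg_latency_ms=52.07)
print(f"ml.c5.large + ml.eia2.medium: ${c5_cost:.3f}")
print(f"ml.m5.large + ml.eia2.medium: ${m5_cost:.3f}")
```

With these inputs the results agree with the published costs per 100,000 inferences to within about one cent. The small difference may come from the hourly prices in the table being rounded to two decimal places.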
You are now ready to apply this process to select the optimal instance for running DenseNet-121. First, assess the memory and CPU requirements of your applications, and shortlist a subset of host instances and accelerators that satisfy those requirements.
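For the memory assessment, the lower bound described earlier (input tensor sizes plus model size) can be estimated as follows. This is a rough sketch that assumes float32 inputs. The 32 MiB model size is a made-up placeholder rather than a measured value for DenseNet-121, so substitute the size of your own serialized model.

```python
from math import prod

BYTES_PER_FLOAT32 = 4


def memory_lower_bound_bytes(input_shapes, model_size_bytes: int) -> int:
    """Sum of float32 input tensor sizes plus the serialized model size."""
    input_bytes = sum(prod(shape) * BYTES_PER_FLOAT32 for shape in input_shapes)
    return input_bytes + model_size_bytes


# One 1 x 3 x 224 x 224 float32 input, as used by the benchmark in this post.
# The 32 MiB model size is a made-up placeholder; measure your own model.pt.
bound = memory_lower_bound_bytes([(1, 3, 224, 224)], model_size_bytes=32 * 1024**2)
print(f"Lower bound: {bound / 1024**2:.1f} MiB")
```

Treat the result only as a floor. Actual runtime memory usage is usually significantly higher and varies by framework.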
Next, look at latency performance. This post used the same tensor input and TorchVision ImageNet pretrained weights for DenseNet-121 on each instance. We ran 1,000 inferences on the model using this input, collected latency per run, and reported the average latencies and the 90th percentile latencies (P90 latencies). This post requires the P90 latency to be less than 80 milliseconds; that is, 90% of all inference calls should have a latency lower than 80 ms.
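The measurement procedure (warmup calls, timed calls, then average and P90) can be sketched as follows. This is an illustration rather than the benchmark script from this post. It measures client-side wall-clock time, whereas this post reports the ModelLatency metric from CloudWatch. The fake_invoke function is a hypothetical placeholder for a real endpoint call, and the percentile helper uses the nearest-rank definition, which is one of several common conventions.

```python
import math
import statistics
import time


def benchmark(invoke, warmup: int = 100, runs: int = 1000):
    """Call invoke() warmup times untimed, then return runs latencies in ms."""
    for _ in range(warmup):
        invoke()
    latencies_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        invoke()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


def percentile(values, pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


def fake_invoke():
    """Placeholder for a real endpoint call; does trivial local work."""
    sum(range(1000))


latencies = benchmark(fake_invoke, warmup=10, runs=200)
average_ms = statistics.mean(latencies)
p90_ms = percentile(latencies, 90)
meets_threshold = p90_ms < 80.0  # the P90 requirement used in this post
print(f"average={average_ms:.4f} ms, P90={p90_ms:.4f} ms, ok={meets_threshold}")
```

Excluding the warmup calls keeps one-time costs, such as lazy initialization, out of the reported percentiles.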
We attached Amazon Elastic Inference accelerators to three types of CPU host instances and ran the preceding performance test for each. The following table lists the price per hour, average latency per inference call, and cost per 100,000 inferences. All combinations below meet the latency threshold.
|Client Instance Type|Elastic Inference Accelerator Type|Cost per Hour|Inference Latency (P90) [ms]|Average Inference Latency [ms]|Cost per 100,000 Inferences (with Average Latency)|
|---|---|---|---|---|---|
You can see the effect of different host instances on latency. For the same accelerator type, using more powerful host instances does not improve latency significantly. However, attaching a larger accelerator lowers latency because the model runs on the accelerator, and a larger accelerator has more resources such as GPU compute and memory. You should choose the cheapest host instance type that provides enough CPU memory for your application. An ml.m5.large or ml.c5.large is sufficient for many use cases, but not all.
Based on the preceding criteria, this post chose the two lowest cost options that met the latency requirement, which are ml.c5.large with ml.eia2.medium, and ml.m5.large with ml.eia2.medium. Both are viable for this use case.
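The selection rule used here (keep the combinations that meet the P90 requirement, then take the cheapest per inference) can be expressed as a short sketch. Candidate and cheapest_within_latency are hypothetical names introduced for this illustration. The two candidates use the P90 latencies and costs per 100,000 inferences listed for the Elastic Inference options in the results table in this post.

```python
from typing import NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    name: str
    p90_ms: float
    cost_per_100k: float


def cheapest_within_latency(candidates: Sequence[Candidate],
                            max_p90_ms: float) -> Optional[Candidate]:
    """Return the lowest-cost candidate whose P90 latency is under the limit."""
    eligible = [c for c in candidates if c.p90_ms < max_p90_ms]
    return min(eligible, key=lambda c: c.cost_per_100k) if eligible else None


# Values copied from the two Elastic Inference rows of the results table in this post.
candidates = [
    Candidate("ml.c5.large + ml.eia2.medium", p90_ms=51.74, cost_per_100k=0.38),
    Candidate("ml.m5.large + ml.eia2.medium", p90_ms=55.41, cost_per_100k=0.44),
]
best = cheapest_within_latency(candidates, max_p90_ms=80.0)
print(best.name if best else "no candidate meets the latency requirement")
```

Filtering on latency before minimizing cost mirrors the order of the criteria in this section: memory first, then latency, then cost.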
Comparing different instances for inference in SageMaker
This post also collected latency and cost performance data for standalone CPU and GPU host instances and compared it against the preceding Elastic Inference benchmarks. The standalone CPU instances used were ml.c5.xlarge, ml.c5.4xlarge, ml.m5.xlarge, and ml.m5.4xlarge. The standalone GPU instances used were ml.p3.2xlarge, ml.g4dn.xlarge, ml.g4dn.2xlarge, and ml.g4dn.4xlarge.
The following aggregate table shows cost performance data for Elastic Inference-enabled options followed by standalone instance options.
|Instance Type|Cost per Hour|Inference Latency (P90) [ms]|Average Inference Latency [ms]|Cost per 100,000 Inferences (with Average Latency)|
|---|---|---|---|---|
|ml.c5.large + ml.eia2.medium|$0.29|51.74|47.85|$0.38|
|ml.m5.large + ml.eia2.medium|$0.30|55.41|52.07|$0.44|
To better understand the value proposition that Elastic Inference offers over standalone CPU and GPU instances, you can visualize this latency and cost efficiency data side-by-side for each hardware type. The following bar chart plots the cost per 100,000 inferences, and the line graph plots the P90 inference latency in milliseconds. Bars in dark gray are instances with Elastic Inference accelerators, bars in green are standalone GPU instances, and bars in blue are standalone CPU instances.
As expected, the CPU instances perform poorly when compared to the GPU instances. The ml.g4dn.xlarge instance is about seven times faster than the CPU instances. None of the standalone CPU instances satisfy the P90 latency threshold of 80 ms.
However, these CPU instances perform much better with Elastic Inference attached, because they benefit from GPU acceleration. The ml.c5.large instance with ml.eia2.medium speeds up inference by nearly three times over standalone CPU instances. However, standalone GPU instances still fare better than CPU instances with Elastic Inference attached; ml.g4dn.xlarge is a little more than twice as fast as ml.c5.large with ml.eia2.medium. Note that ml.g4dn.xlarge, ml.g4dn.2xlarge, and ml.g4dn.4xlarge instances have roughly equal latencies with negligible variation. All three ml.g4dn instances have the same GPU, but the larger ml.g4dn instances have more vCPUs and memory resources. For DenseNet-121, increasing vCPU and memory resources does not improve inference latency.
Both Elastic Inference and standalone GPU instances meet the latency requirements.
Regarding cost, ml.c5.large with ml.eia2.medium stands out. Although ml.c5.large with ml.eia2.medium does not have the lowest price per hour, it has the lowest cost per 100,000 inferences. For more information about pricing per hour, see Amazon SageMaker Pricing.
You can conclude that instances that cost less per hour don’t necessarily also cost less per inference. This is because their latency per inference could be higher. Likewise, instances that achieve lower latency per inference might not have a lower cost per inference. The ml.m5.xlarge and ml.c5.xlarge CPU instances have the lowest price per hour, but still cost more per inference than most of the Elastic Inference and standalone GPU options. The larger ml.m5.4xlarge and ml.c5.4xlarge instances have higher latencies, cost more per hour, and therefore cost more per inference than all the Elastic Inference options. Standalone GPU instances achieve the best latencies across the board due to high compute parallelization, which CUDA operations exploit. However, Elastic Inference has the lowest cost per inference.
With Amazon Elastic Inference, you get the best of both worlds. You get most of the parallelization and inference speed-up that GPUs offer, and see greater cost-effectiveness than both CPU and GPU standalone instances. Furthermore, you can decouple your host instance from your inference acceleration hardware, which lets you optimize your hardware for vCPU, memory, and all other resources that your application requires.
The preceding tests demonstrate that ml.c5.large with ml.eia2.medium is the lowest cost option that met the latency criterion and memory usage requirements for running DenseNet-121.
The latency metric used by this post (ModelLatency emitted in CloudWatch Metrics) measures latency within Amazon SageMaker. This latency metric does not account for latencies from your application to Amazon SageMaker. Make sure to account for these latencies when benchmarking your applications.
Amazon Elastic Inference is a low-cost and flexible solution for PyTorch inference workloads on Amazon SageMaker. You can get GPU-like inference acceleration and remain more cost-effective than both standalone Amazon SageMaker GPU and CPU instances, by attaching Elastic Inference accelerators to an Amazon SageMaker instance. For more information, see What Is Amazon Elastic Inference?
About the Authors
David Fan is a software engineer with AWS AI. He is passionate about advancing the state-of-the-art in computer vision and deep learning research, and reducing the computational and domain knowledge barriers that prevent large-scale production use of AI research. In his spare time, he likes to do Kaggle competitions and keep up with arXiv papers.
Srinivas Hanabe is a principal product manager with AWS AI for Elastic Inference. Prior to this role, he was the PM lead for Amazon VPC. Srinivas loves running long distance, reading books on a variety of topics, spending time with his family, and is a career mentor.