AWS Machine Learning Blog

Reduce inference costs on Amazon EC2 for PyTorch models with Amazon Elastic Inference

You can now use Amazon Elastic Inference to accelerate inference and reduce inference costs for PyTorch models in both Amazon SageMaker and Amazon EC2.

PyTorch is a popular deep learning framework that uses dynamic computational graphs. This allows you to easily develop deep learning models with imperative and idiomatic Python code. Inference is the process of making predictions using a trained model. For deep learning applications that use frameworks such as PyTorch, inference accounts for up to 90% of compute costs. Selecting the right instance for inference can be challenging because deep learning models require different amounts of GPU, CPU, and memory resources. Optimizing for one of these resources on a standalone GPU instance usually leads to under-utilization of other resources. Therefore, you might pay for unused resources.

Elastic Inference solves this problem by enabling you to attach the right amount of GPU-powered inference acceleration to any Amazon SageMaker instance type, EC2 instance type, or Amazon ECS task. You can choose any CPU instance in AWS that is best suited to your application’s overall compute and memory needs, and separately attach the right amount of GPU-powered inference acceleration to satisfy your application’s latency requirements. This allows you to use resources more efficiently and lowers inference costs. PyTorch joins TensorFlow and Apache MXNet as an Elastic Inference-supported deep learning framework. The released version as of this writing is 1.3.1.

This post demonstrates how to lower costs and improve latency for your PyTorch models using Amazon EC2 instances with Elastic Inference. For information about lowering costs with Amazon SageMaker instead, see Reduce ML inference costs on Amazon SageMaker for PyTorch models using Amazon Elastic Inference.

TorchScript: Bridging the gap between research and production

We now discuss TorchScript, which is a way to create serializable and optimizable models from PyTorch code. You must convert your models to TorchScript in order to use Elastic Inference with PyTorch.

PyTorch’s use of dynamic computational graphs greatly simplifies the model development process. However, this paradigm presents unique challenges for production model deployment. In a production context, it is beneficial to have a static graph representation of the model. This not only enables you to use the model in Python-less environments, but also allows for performance and memory optimizations.

TorchScript bridges this gap by providing the ability to compile and export models to a Python-free graph-based representation. You can run your models in any production environment by converting PyTorch models into TorchScript. TorchScript also performs just-in-time graph-level optimizations, providing a performance boost over standard PyTorch.

To use Elastic Inference with PyTorch, you must convert your models into TorchScript format and use the inference API for Elastic Inference. This post provides an example of how to compile models into TorchScript and benchmark end-to-end inference latency with Elastic Inference-enabled PyTorch. This post concludes by comparing performance and cost metrics for a variety of instance and accelerator combinations to standalone CPU and GPU instances.

Compiling and serializing models with TorchScript

You can compile a PyTorch model into TorchScript using either tracing or scripting. Both produce a computation graph, but differ in how they do so.

Scripting a model is usually the preferred method of compiling to TorchScript because it preserves all model logic. However, as of this writing, the set of scriptable models with PyTorch 1.3.1 is smaller than the set of traceable models. Your model may be traceable, but not scriptable—or not traceable at all. You may need to modify your model code to make it compatible with TorchScript.

Due to the way that Elastic Inference currently handles control-flow operations in PyTorch 1.3.1, inference latency may be suboptimal for scripted models that contain many conditional branches. Try both tracing and scripting to see how your model performs with Elastic Inference. With the 1.3.1 release, a traced model likely performs better than its scripted version.

For more information, see the Introduction to TorchScript tutorial on the PyTorch website.


Scripting performs direct analysis of the source code to construct a computation graph and preserve control-flow. The following code example shows how to compile a model using scripting. It uses TorchVision’s pre-trained weights for ResNet-18. You can save the resulting scripted model to a file and then load it with torch.jit.load using Elastic Inference-enabled PyTorch. See the following code:

import torchvision, torch

# Call eval() to set model to inference mode
model = torchvision.models.resnet18(pretrained=True).eval()
scripted_model = torch.jit.script(model)


Tracing uses a sample input to record the operations performed when you run the model on that input. This means that control-flow might be erased because you are compiling the graph by tracing the code with just a single input. For example, a model definition might have code to pad images of a particular size x. If you trace the model with an image of a different size y, future inputs of size x fed to the traced model are not padded. This happens because not all code paths were executed while tracing with the given input.
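The padding example above can be reproduced with a toy module. The following is a minimal sketch that uses standard PyTorch rather than the Elastic Inference-enabled build, so it omits the torch.jit.optimized_execution block. PadToMinWidth and its min_width parameter are hypothetical names introduced only for illustration.

```python
import warnings

import torch
import torch.nn.functional as F


class PadToMinWidth(torch.nn.Module):
    """Hypothetical module: zero-pads the last dimension up to min_width."""

    def __init__(self, min_width: int = 4):
        super().__init__()
        self.min_width = min_width

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Data-dependent branch: taken only for inputs narrower than min_width.
        if x.size(-1) < self.min_width:
            x = F.pad(x, [0, self.min_width - x.size(-1)])
        return x * 2.0


model = PadToMinWidth().eval()
wide = torch.ones(1, 6)    # does not need padding
narrow = torch.ones(1, 2)  # needs padding

# Scripting analyzes the source code, so the branch is kept in the graph.
scripted = torch.jit.script(model)

# Tracing records only the operations executed for the sample input.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # tracing may warn about the Python branch
    traced = torch.jit.trace(model, wide)

scripted_shape = tuple(scripted(narrow).shape)
traced_shape = tuple(traced(narrow).shape)
print("scripted:", scripted_shape, "traced:", traced_shape)
```

Because the model was traced with the wide input, the padding branch should be absent from the traced graph, so the narrow input passes through the traced model unpadded while the scripted model still pads it.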

The following example shows how to compile a model using tracing with a randomized tensor input. It also uses TorchVision’s pre-trained weights for ResNet-18. You must use the torch.jit.optimized_execution context block with a second parameter for device ordinal to use traced models with Elastic Inference. This modified function definition, which accepts two parameters, is only available through the Elastic Inference-enabled PyTorch framework.

If you are tracing your model with the standard PyTorch framework, omit the torch.jit.optimized_execution block. You can still save the resulting traced model to a file and load it with torch.jit.load using Elastic Inference-enabled PyTorch. See the following code:

# ImageNet pre-trained models take inputs of this size.
x = torch.rand(1,3,224,224)
# Call eval() to set model to inference mode
model = torchvision.models.resnet18(pretrained=True).eval()

# Required when using Elastic Inference
with torch.jit.optimized_execution(True, {'target_device': 'eia:0'}):
    traced_model = torch.jit.trace(model, x)

Saving and loading a compiled model

The output of tracing and scripting is a ScriptModule, which is the TorchScript analog of standard PyTorch’s nn.Module. Serializing and deserializing a TorchScript module is as easy as calling torch.jit.save() and torch.jit.load(), respectively. This is the JIT analog of saving and loading a standard PyTorch model using torch.save() and torch.load(). See the following code:

torch.jit.save(traced_model, '')
torch.jit.save(scripted_model, '')

traced_model = torch.jit.load('')
scripted_model = torch.jit.load('')

Saved TorchScript models are not bound to specific classes and code directories, unlike saved standard PyTorch models. You can directly load saved TorchScript models without instantiating the model class first. This allows you to use TorchScript models in environments without Python.
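As a rough illustration of the save and load round trip, the following sketch uses standard PyTorch and a temporary directory. The Doubler module and the file name doubler_scripted.pt are hypothetical and chosen only for this example.

```python
import os
import tempfile

import torch


class Doubler(torch.nn.Module):
    """Hypothetical toy module used only to demonstrate serialization."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * 2


scripted = torch.jit.script(Doubler().eval())

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "doubler_scripted.pt")
    # Serialize the ScriptModule to disk.
    torch.jit.save(scripted, path)
    # Loading needs only the file, not the Doubler class definition.
    restored = torch.jit.load(path)

round_trip_ok = torch.equal(restored(torch.ones(3)), torch.full((3,), 2.0))
print("Round trip preserved behavior:", round_trip_ok)
```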

End-to-end inference benchmarking with Elastic Inference-enabled PyTorch

This post walks you through the process of benchmarking Elastic Inference-enabled PyTorch inference latency for OpenAI’s generative pre-training (GPT) model in Amazon EC2. GPT is an unsupervised transformer model that has achieved state-of-the-art results in multiple language tasks.


To complete the walkthrough, you must first complete the following prerequisites:

  • Configure a VPC security group that allows all inbound traffic to ports 22 and 443, and allows all outbound traffic. For more information, see Configuring Your Security Groups for Elastic Inference.
  • Create a VPC endpoint using the Elastic Inference VPC service. For more information, see Configuring AWS PrivateLink Endpoint Services.
    • Take note of the VPC that you create the endpoint for. You create your instance later using the same VPC.
    • Select all Availability Zones in which you want to use Elastic Inference.
  • Create an IAM role with permissions to use Elastic Inference. For more information, see Configuring an Instance Role with an Elastic Inference Policy.
  • Launch an m5.large CPU instance and attach one eia2.xlarge accelerator.
    • Use either the Linux or Ubuntu Deep Learning AMI (DLAMI) v27.
    • Use the VPC and security group you configured earlier.

This post uses the built-in Conda environments from the DLAMI. However, as with all other Elastic Inference-supported frameworks, you may use Elastic Inference-enabled PyTorch through other means. Docker container options are available through AWS DL Containers. If you are not using the DLAMI, you can also build an environment using the Elastic Inference PyTorch pip wheel from the Amazon S3 bucket.


Complete the following steps:

  1. Log in to the instance that you created.
  2. Use the built-in EI Tool to get the device ordinal number of all attached Elastic Inference accelerators. See the following command:
    /opt/amazon/ei/ei_tools/bin/ei describe-accelerators --json

    For more information about the EI Tool, see Monitoring Elastic Inference Accelerators.

    You should see output similar to the following code:

      {
        "ei_client_version": "1.6.2",
        "time": "Fri Mar 6 03:09:38 2019",
        "attached_accelerators": 1,
        "devices": [
          {
            "ordinal": 0,
            "type": "eia2.xlarge",
            "id": "eia-56e1f73d4ab54b9e9389b0e535c905ec",
            "status": "healthy"
          }
        ]
      }

    If you have attached multiple accelerators to your client instance, this command returns multiple devices, starting with an ordinal of 0. Use the device ordinal of your desired Elastic Inference accelerator to run inference.

  3. Activate the Elastic Inference-enabled PyTorch Conda environment with the following command:
    source activate amazonei_pytorch_p36
  4. Install the transformers library—which you use to fetch pre-trained weights for OpenAI-GPT—with the following command:
    pip install transformers==2.3.0
  5. Create a script with the following content. This script uses pre-trained weights for GPT, a popular unsupervised pre-trained language model. The script loads the model, traces it with tokenized text to convert it into TorchScript, and saves the compiled model to disk. It then loads the model, performs 1,000 inferences, and reports the latency distribution. See the following code:
    import numpy as np
    import os
    import time
    import torch
    # Make sure to pip install the transformers package
    def nlp_input(tokenizer):
      tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'tokenizer', tokenizer)
      sample_text = 'PyTorch is a Deep Learning Framework'
      indexed_tokens = tokenizer.encode(sample_text)
      return torch.tensor([indexed_tokens])
    token = nlp_input('openai-gpt')
    if not os.path.exists(''):
      # eval() toggles inference mode
      model = torch.hub.load('huggingface/pytorch-transformers', 'model', 'openai-gpt').eval()
      print('Compiling model ...')
      # Compile model to TorchScript via tracing
      # Here we would like to use the first accelerator, so we use ordinal 0.
      with torch.jit.optimized_execution(True, {'target_device': 'eia:0'}):
        # You can trace with any input
        model = torch.jit.trace(model, token)
      # Serialize model
      torch.jit.save(model, '')
    print('Loading model ...')
    model = torch.jit.load('')
    # Perform 1000 inferences. Make sure to disable autograd and use EI execution context
    latencies = []
    for i in range(1000):
      with torch.no_grad():
        with torch.jit.optimized_execution(True, {'target_device': 'eia:0'}):
          start = time.time()
          _ = model(token)
          end = time.time()
          latencies.append((end - start) * 1000)
    # First inference is long due to overhead from setting up the service.
    # We discard it and look at all other inference latencies
    latencies = np.array(latencies[1:])
    print('Mean latency (ms): {:.4f}'.format(np.mean(latencies)))
    print('P50 latency (ms): {:.4f}'.format(np.percentile(latencies, 50)))
    print('P90 latency (ms): {:.4f}'.format(np.percentile(latencies, 90)))
    print('P95 latency (ms): {:.4f}'.format(np.percentile(latencies, 95)))

    You don’t have to save and load your model. You can compile your model and perform inferences directly. The benefit of saving your model is that it saves time for future inference jobs.

    Remember to use the torch.jit.optimized_execution code block. This is an inference API unique to Elastic Inference-enabled PyTorch, and you need to use it to trigger inference on the attached accelerator. If you fail to use this execution context correctly, your inferences run entirely on the client instance and fail to utilize the accelerator.

    The Elastic Inference-enabled PyTorch framework accepts two parameters for this context, whereas the standard PyTorch framework accepts only one parameter. The second parameter specifies the accelerator device ordinal. You should set target_device to the device’s ordinal number, not its ID. Ordinals are numbered beginning with 0.

  6. Set the device ordinal to use the first attached accelerator. See the following script:

    You should see the following output:

    Using Amazon Elastic Inference Client Library Version: 1.6.2
    Number of Elastic Inference Accelerators Available: 1
    Elastic Inference Accelerator ID: eia-56e1f73d4ab54b9e9389b0e535c905ec
    Elastic Inference Accelerator Type: eia2.xlarge
    Elastic Inference Accelerator Ordinal: 0
    Mean latency (ms): 10.2795
    P50 latency (ms): 9.76729
    P90 latency (ms): 10.7727
    P95 latency (ms): 13.0613
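Step 2 reads the device ordinal by eye. If you prefer to pick it up programmatically, the following sketch parses output in the shape shown in that step and builds the target_device string for each healthy accelerator. The helper name healthy_target_devices is hypothetical, and the JSON layout is assumed from the sample output in this post rather than from a formal schema.

```python
import json

# Sample output in the shape shown in step 2. In practice, capture the
# output of `ei describe-accelerators --json` instead of hard-coding it.
sample_output = """
{
  "ei_client_version": "1.6.2",
  "time": "Fri Mar 6 03:09:38 2019",
  "attached_accelerators": 1,
  "devices": [
    {
      "ordinal": 0,
      "type": "eia2.xlarge",
      "id": "eia-56e1f73d4ab54b9e9389b0e535c905ec",
      "status": "healthy"
    }
  ]
}
"""


def healthy_target_devices(describe_json):
    """Return 'eia:<ordinal>' strings for accelerators reported as healthy."""
    report = json.loads(describe_json)
    return [
        "eia:{}".format(device["ordinal"])
        for device in report["devices"]
        if device["status"] == "healthy"
    ]


target_devices = healthy_target_devices(sample_output)
print(target_devices)
```

Each returned string is intended for the target_device entry passed to torch.jit.optimized_execution, which expects the ordinal rather than the accelerator ID.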

Selecting the right instance

When you deploy new inference workloads, you have many instance types to choose from. You should consider the following key parameters:

  • Memory – You need to select a client instance and accelerator combination that provides sufficient CPU and accelerator memory for your application. You can lower bound the runtime memory requirement as the sum of your input tensor sizes and model size. However, runtime memory usage is usually significantly higher than this lower bound for any model, and also varies by framework. You should only use this guideline to help roughly guide your choice of CPU instance and Elastic Inference accelerators.
  • Latency requirements – After you have a set of client instances and accelerators with sufficient memory, you can further narrow down your choices to those that satisfy the application’s latency requirements. This post considers latency per inference as the key metric to assess performance. Inference throughput in images or words processed per unit time is another commonly used metric.
  • Cost – After you have a set of hardware combinations that satisfy both your memory and latency requirements, you can optimize cost efficiency by selecting the combination that gives the lowest price per inference. You can compute this metric as (price / second * average latency per inference call). To make numbers more concrete, this post provides cost per 100,000 inferences. You can compare cost efficiency for your workload and pick the optimal hardware by doing this for each hardware combination. This post uses the price per hour for the US West (Oregon) Region.
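To make the cost formula concrete, the following sketch computes the cost per 100,000 inferences from an hourly price and an average latency. It uses the m5.large with eia2.medium figures reported in this post ($0.22 per hour, 10.28 ms average latency) and assumes that inferences run back to back, so the instance is billed only for time spent on inference. The function name is hypothetical.

```python
def cost_per_inferences(price_per_hour, avg_latency_ms, num_inferences=100_000):
    """Cost of num_inferences sequential calls at the given hourly price."""
    price_per_second = price_per_hour / 3600.0
    latency_seconds = avg_latency_ms / 1000.0
    return price_per_second * latency_seconds * num_inferences


# m5.large + eia2.medium: $0.22 per hour, 10.28 ms average latency
cost = cost_per_inferences(0.22, 10.28)
print("Cost per 100,000 inferences: ${:.2f}".format(cost))
```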

You are now ready to apply this process to select the optimal instance for running GPT. First, assess the memory and CPU requirements of your applications, and shortlist a subset of client instances and accelerators that satisfy those requirements.

Next, look at latency performance. We ran 1,000 inferences for the same input on each instance by using torch.hub pre-trained weights for OpenAI GPT. To create the input, we tokenized a six-word sentence to create an input tensor of shape (1, 7). We collected the latency of each run and reported the average latency and the 90th percentile latency (P90 latency). This post requires the P90 latency to be less than 15 milliseconds; that is, 90% of all inference calls should have a latency lower than 15 ms.
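The P90 criterion can be checked with a few lines of code. The following sketch uses a hand-written, linearly interpolated percentile (the same interpolation that np.percentile applies by default) and made-up latencies. The values and the helper name are illustrative assumptions, not measurements from this post.

```python
import math


def percentile(values, q):
    """Linearly interpolated percentile for q in [0, 100]."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Made-up latencies in milliseconds, for illustration only.
latencies_ms = [9.1, 9.4, 9.6, 9.8, 10.0, 10.3, 10.9, 11.5, 13.2, 21.7]

p90 = percentile(latencies_ms, 90)
meets_threshold = p90 < 15.0
print("P90 latency (ms): {:.2f}, meets 15 ms threshold: {}".format(p90, meets_threshold))
```

Note that a single slow call (21.7 ms here) does not by itself violate a P90 threshold, which is why a percentile is a more forgiving criterion than the maximum latency.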

We attached Elastic Inference accelerators to four types of CPU client instances and ran the preceding performance test for each. The following table lists the price per hour, P90 latency per inference call, average latency per inference call and cost per 100,000 inferences. All combinations meet the latency threshold. Note that we use the average inference latencies to compute the cost per 100,000 inferences.

Client Instance Type | Elastic Inference Accelerator Type | Cost per Hour | Inference Latency (P90) [ms] | Average Inference Latency [ms] | Cost per 100,000 Inferences (Based on Average Latencies)
m5.large | eia2.medium | $0.22 | 10.77 | 10.28 | $0.06
m5.large | eia2.large | $0.34 | 9.02 | 8.72 | $0.08
m5.large | eia2.xlarge | $0.44 | 8.78 | 8.50 | $0.10
m5.xlarge | eia2.medium | $0.31 | 9.88 | 9.99 | $0.09
m5.xlarge | eia2.large | $0.43 | 9.17 | 8.83 | $0.11
m5.xlarge | eia2.xlarge | $0.53 | 8.99 | 8.64 | $0.13
c5.large | eia2.medium | $0.21 | 9.89 | 10.03 | $0.06
c5.large | eia2.large | $0.33 | 9.05 | 8.77 | $0.08
c5.large | eia2.xlarge | $0.43 | 8.93 | 8.63 | $0.10
c5.xlarge | eia2.medium | $0.29 | 10.04 | 10.07 | $0.08
c5.xlarge | eia2.large | $0.41 | 8.93 | 8.59 | $0.10
c5.xlarge | eia2.xlarge | $0.51 | 8.92 | 8.59 | $0.12


You can now examine the effect of different client instances on latency. For the same accelerator type, using more powerful client instances does not improve latency significantly. However, attaching a larger accelerator lowers latency because the model runs on the accelerator, and a larger accelerator has more resources such as GPU compute and memory. You should choose the cheapest client instance type that provides enough CPU memory for your application. An m5.large or a c5.large is sufficient for many use cases, but not all.

From the table above, all the options with Amazon Elastic Inference meet the latency criteria. Both m5.large and c5.large with eia2.medium cost the same per inference call. This post chose c5.large with eia2.medium because its P90 latency was lower than m5.large with eia2.medium. We also chose c5.large with eia2.large because it has a lower latency than c5.large with eia2.medium and nearly the same latency as m5.large with eia2.large. However, note that m5.large provides twice as much CPU memory for essentially the same price.

Comparing different EC2 instances for inference

This post also collected latency and cost performance data for standalone CPU and GPU host instances and compared against the preceding Elastic Inference benchmarks. The standalone CPU instances were c5.xlarge, c5.4xlarge, m5.xlarge, and m5.4xlarge. The standalone GPU instances were p3.2xlarge, g4dn.xlarge, g4dn.2xlarge, and g4dn.4xlarge.

The following aggregate table shows the Elastic Inference-enabled options and standalone instance options.

Instance Type | Cost per Hour | P90 Inference Latency [ms] | Average Inference Latency [ms] | Cost per 100,000 Inferences (Based on Average Latencies)
c5.large + eia2.medium | $0.21 | 9.89 | 10.03 | $0.06
c5.large + eia2.large | $0.33 | 9.05 | 8.77 | $0.08
g4dn.xlarge | $0.53 | 5.97 | 5.92 | $0.09
g4dn.2xlarge | $0.76 | 5.95 | 5.90 | $0.12
g4dn.4xlarge | $1.20 | 5.98 | 5.96 | $0.20
c5.xlarge | $0.17 | 49.85 | 49.46 | $0.24
m5.xlarge | $0.19 | 55.20 | 54.45 | $0.29
c5.4xlarge | $0.68 | 17.52 | 17.38 | $0.33
m5.4xlarge | $0.77 | 17.07 | 16.90 | $0.37
p3.2xlarge | $3.06 | 9.60 | 9.31 | $0.82

To better understand the value proposition that Elastic Inference offers over standalone CPU and GPU instances, you can visualize this latency and cost efficiency data side-by-side for each instance type. The following bar chart plots the cost per 100,000 inferences, while the line graph plots the P90 inference latency in milliseconds. Bars in dark gray are instances with Elastic Inference accelerators, bars in green are standalone GPU instances, and bars in blue are standalone CPU instances.

Analyzing latency

As expected, the CPU instances perform poorly when compared to the GPU instances. Based on P90 latency, the g4dn.xlarge instance is roughly three to nine times as fast as the standalone CPU instances. None of the standalone CPU instances satisfy the P90 latency threshold of 15 milliseconds.

However, these CPU instances perform much better with Elastic Inference attached because they benefit from GPU acceleration. The c5.large instance with eia2.medium is at least 1.7 times as fast, and up to 5.6 times as fast, as standalone CPU instances. However, the standalone g4dn GPU instances still fare better than CPU instances with Elastic Inference attached; g4dn.xlarge is about 1.7 times as fast as c5.large with eia2.medium and about 1.5 times as fast as c5.large with eia2.large. Note that the g4dn.xlarge, g4dn.2xlarge, and g4dn.4xlarge instances have roughly equal latencies with negligible variation. All three g4dn instances have the same GPU, but the larger g4dn instances have more vCPUs and memory resources. For this GPT model, increasing vCPU and memory resources does not improve inference latency.
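The speedup figures quoted above follow directly from the P90 latencies in the aggregate table. The following short sketch shows the arithmetic; the dictionary simply restates the table values for c5.large with eia2.medium and the four standalone CPU instances.

```python
# P90 latencies in milliseconds, copied from the aggregate table.
p90_ms = {
    "c5.large + eia2.medium": 9.89,
    "c5.xlarge": 49.85,
    "m5.xlarge": 55.20,
    "c5.4xlarge": 17.52,
    "m5.4xlarge": 17.07,
}

baseline = p90_ms["c5.large + eia2.medium"]

# Speedup of the Elastic Inference option over each standalone CPU instance.
speedups = {
    name: latency / baseline
    for name, latency in p90_ms.items()
    if name != "c5.large + eia2.medium"
}

min_speedup = min(speedups.values())
max_speedup = max(speedups.values())
print("Speedup range: {:.1f}x to {:.1f}x".format(min_speedup, max_speedup))
```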

Analyzing cost

Regarding cost, c5.large with eia2.medium stands out. Although c5.large with eia2.medium does not have the lowest price per hour, it has the lowest cost per 100,000 inferences. For more information about pricing, see Amazon Elastic Inference pricing and Amazon EC2 Pricing.

You can conclude that instances that cost less per hour don’t necessarily also cost less per inference, because their latency per inference could be higher. Likewise, instances that achieve lower latency per inference might not have a lower cost per inference. The m5.xlarge and c5.xlarge CPU instances have the lowest price per hour, but still cost more per inference than all Elastic Inference options and all g4dn GPU instances. The larger m5.4xlarge and c5.4xlarge instances have higher latencies and cost more per hour than the Elastic Inference options, and therefore cost more per inference. The g4dn GPU instances achieve the best latencies due to high compute parallelization, which CUDA operations exploit. However, Elastic Inference has the lowest cost per inference.

With Elastic Inference, you get the best of both worlds. You get most of the parallelization and inference speedup that GPUs offer, and see greater cost-effectiveness than both CPU and GPU standalone instances. Furthermore, you have the flexibility to decouple your host instance and inference acceleration hardware, which allows you to flexibly optimize your hardware for vCPU, memory, and all other resources that your application requires.

The preceding tests demonstrate that c5.large with eia2.medium is the lowest-cost option that meets the latency criterion and runtime memory usage requirements for running OpenAI’s GPT model.
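The selection process used in this post (filter by the latency threshold, then minimize cost per inference) can be sketched in a few lines. The data below restates the aggregate table; the tuple layout and variable names are illustrative choices, and the memory check is assumed to have been done beforehand.

```python
P90_THRESHOLD_MS = 15.0

# (instance, P90 latency in ms, cost per 100,000 inferences in USD),
# copied from the aggregate table.
options = [
    ("c5.large + eia2.medium", 9.89, 0.06),
    ("c5.large + eia2.large", 9.05, 0.08),
    ("g4dn.xlarge", 5.97, 0.09),
    ("g4dn.2xlarge", 5.95, 0.12),
    ("g4dn.4xlarge", 5.98, 0.20),
    ("c5.xlarge", 49.85, 0.24),
    ("m5.xlarge", 55.20, 0.29),
    ("c5.4xlarge", 17.52, 0.33),
    ("m5.4xlarge", 17.07, 0.37),
    ("p3.2xlarge", 9.60, 0.82),
]

# Keep only the options that satisfy the latency requirement.
eligible = [option for option in options if option[1] < P90_THRESHOLD_MS]

# Among those, pick the lowest cost per 100,000 inferences.
best = min(eligible, key=lambda option: option[2])
print("Lowest-cost option under {} ms: {}".format(P90_THRESHOLD_MS, best[0]))
```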


Elastic Inference is a low-cost and flexible solution for PyTorch inference workloads on Amazon EC2. You can achieve GPU-like inference acceleration and remain more cost-effective than both standalone GPU and CPU instances by attaching Elastic Inference accelerators to a CPU client instance. For more information, see What Is Amazon Elastic Inference?



About the Authors

David Fan is a software engineer with AWS AI. He is passionate about advancing the state-of-the-art in computer vision and deep learning research, and reducing the computational and domain knowledge barriers that prevent large-scale production use of AI research. In his spare time, he likes to do Kaggle competitions and keep up with arXiv papers.




Srinivas Hanabe is a principal product manager with AWS AI for Elastic Inference. Prior to this role, he was the PM lead for Amazon VPC. Srinivas loves running long distance, reading books on a variety of topics, spending time with his family, and is a career mentor.