AWS Machine Learning Blog

Model serving with Amazon Elastic Inference

Amazon Elastic Inference (EI) is a service that allows you to attach low-cost GPU-powered acceleration to Amazon EC2 and Amazon SageMaker instances. EI reduces the cost of running deep learning inference by up to 75%. Model Server for Apache MXNet (MMS) enables deployment of MXNet- and ONNX-based models for inference at scale. In this blog post, we’ll explore using MMS running on a general purpose EC2 instance with an Elastic Inference Accelerator (EIA) attached.

What is Model Server for Apache MXNet (MMS)?

MMS is an open-source model serving framework designed to simplify the task of serving deep learning models for inference at scale. After training deep learning models using Apache MXNet, MMS makes it easy to deploy the trained model for inference at scale in a production environment. The following architectural diagram shows a standard MMS scalable architecture.

What is Amazon Elastic Inference?

In deep learning applications, inference drives as much as 90 percent of the compute costs of the application. Accelerated GPU instances are oversized for inference because inference happens on a single input in real time that consumes only a small amount of GPU compute. GPU compute capacity usage might not be 100 percent even at peak load, which is wasteful and costly.

EI allows you to attach just the right amount of GPU-powered inference acceleration to any EC2 or Amazon SageMaker instance and save up to 75 percent. With EI, you can now choose the instance type that is best suited to the overall CPU and memory needs of your application, and then you can separately configure the amount of inference acceleration that you need to use resources efficiently and to reduce costs.

Setting up Amazon Elastic Inference with EC2

Starting up an EC2 instance with an attached EI accelerator requires some pre-configuration steps to setup your AWS account. Then, we can launch an instance with an accelerator by following the instructions in the Elastic Inference documentation. Here we’ll use a plain Ubuntu Amazon Machine Image (AMI), and configure it for our needs.

Serving ResNet-152 model with Elastic Inference

After launching an EC2 instance with an EI accelerator, we install MMS and the EI enabled version of MXNet.

# Install MMS 
$ pip install --user mxnet-model-server

# Install EIA enabled MXNet
$ wget
$ pip install --user amazonei_mxnet-1.3.0-py2.py3-none-manylinux1_x86_64.whl

Next, let’s start MMS with a ResNet-152 model-archive that is already configured for EI. To build your own model-archive with EIA support refer to Custom Service and Model Serving with Elastic Inference documentation. We configure MMS using a curl command to set the number of workers to 1, and to use synchronous creation of workers.

# Start MMS with resnet-152
$ mxnet-model-server --start --models resnet-152=

# Scale to a single worker, make scale request call synchronous (by default the call is asynchronous)
$ curl -X PUT ''

Notice that we set min_worker=1 in the previous curl command to configure MMS. This is important for MMS on EI because inference calls to the EI accelerator are sequential, so we don’t gain much benefit from using more than one worker. Now the model is ready for some inference requests. To classify an image of a kitten, execute the following commands:

# Download Kitten image for classification
$ curl -O

# Run inference request using EIA
$ curl -X POST -T kitten.jpg

This yields the following prediction results:

      "probability": 0.7148934602737427,
      "class": "n02123045 tabby, tabby cat"
      "probability": 0.22877734899520874,
      "class": "n02123159 tiger cat"
      "probability": 0.04032360762357712,
      "class": "n02124075 Egyptian cat"
      "probability": 0.008370809257030487,
      "class": "n02127052 lynx, catamount"
      "probability": 0.0006728142034262419,
      "class": "n02129604 tiger, Panthera tigris"

You can see that ResNet-152 identified the tabby cat using Elastic Inference hosted with MMS.

Cost and performance benefits of serving with Elastic Inference

In model serving, performance is measured in terms of throughput (number of inferences per second) and latency (time for each inference). For CPU or GPU instances, the best latency is achieved with a single worker, but higher throughput is achieved by using one worker for each vCPU or GPU. We started with an m5.large instance and evaluated both throughput and latency optimized configurations. We achieved a throughput of 4.5 requests per second, and 244ms latency. The cost efficiency of this configuration for 100,000 requests is $0.59. Compare this to the throughput of 41.42 on a p3.2xlarge, 23ms latency, and cost efficiency of $2.05. This can be expensive for production use cases. So, let’s look at how Amazon Elastic Inference can help.

EI accelerators are currently available in three sizes: eia1.medium, eia1.large, and eia1.xlarge. Each has from 1 to 4 GB of memory and from 8 to 32 TFLOPS of compute.

We evaluated performance of this model with an eia1.medium accelerator and observed that larger instance sizes didn’t materially improve latency performance. Next, we tested the m5.large with different EI accelerator sizes. Inference calls with an eia1.large accelerator had lower latency than an eia1.medium, but were not more cost efficient. We use eia1.medium since it provides a good balance between performance and cost for the ResNet-152 model use case.

When optimizing for throughout, the m5.large + eia1.medium instance is 46 percent as fast a regular p3.2xlarge but is 84 percent more cost efficient. It is also 30 percent more cost efficient than a c5.xlarge, while also delivering 1.9x the throughput. The m5.large + eia1.medium provides 4.2x more throughput than just the m5.large alone, with 45 percent better cost efficiency. The following figure shows results when optimizing for maximum throughput.

When optimizing for latency, the m5.large + eia1.medium instance is 43 percent more cost efficient than a c5.xlarge instance, while having 2.4x lower latency. Compared to the m5.large alone, adding an eia1.medium improves cost efficiency by 44 percent with 4.4x better latency. The p3.2xlarge instance has 2.3x lower latency but costs 5.6x more than an m5.large + eia1.medium. The following figure shows results when optimizing for minimum latency.


To recap, in this blog post we showed that Elastic Inference provides the best cost efficiency for running inference at scale while optimizing for latency or throughput.

Learn more about MMS and contribute

Using Elastic Inference is one among many possibilities of hosting models with MMS. To learn more about MMS, start with our examples and documentation in the repository’s model zoo and documentation folder.

We welcome community participation including questions, requests, and contributions, as we continue to improve MMS. If you are using MMS already, we welcome your feedback via the repository’s GitHub issues. Head over to awslabs/mxnet-model-server to get started!


The following table shows the specific configuration settings we used, and raw results we presented in this post.

Table 1: Raw Performance and Cost Results for ResNet-152.

Instance Type Optimized for # of workers *  # of OMP threads ** Hourly cost Cost Efficiency Latency (ms) Throughput (requests/sec)
m5.large Throughput 2 1 $ 0.10 $ 0.59 444 4.50
m5.large Latency 1 1 $ 0.10 $ 0.65 244 4.08
m5.large + eia.medium Both 1 1

$ 0.23


$ 0.33 52 19.08
c5.xlarge Throughput 4 1 $ 0.17 $ 0.47 400 9.95
c5.xlarge Latency 1 4 $ 0.17 $ 0.64 134 7.38
p3.2xlarge Both 1 N/A $ 3.06 $ 2.05 23 41.42

* Number of workers for the model –This can be tuned with management API of MMS.
** OMP_NUM_THREADS per worker – This indicates number of OpenMP threads per worker.

Typically, MMS automatically scales the number of workers to the number of vCPUs (for CPU instances) or to the number of GPUs (for GPU instances). However, we found for EI that using more than one worker didn’t give us any more throughput, and reduced the inference latency significantly. Thus, we use a single worker to get the best performance on EI. But for CPU or GPU instances, we achieve higher throughput by setting the number of workers to a larger value.

Notice that an additional configuration setting is needed to optimize for latency or throughput on CPU instances by setting the environment variable OMP_NUM_THREADS. MMS automatically sets this variable to 1 by default to optimize for throughput. However, we set this to 4 for c5.xlarge to optimize for latency. But your results might vary depending on the instance you choose and your optimization strategy.

About the Authors

Rakesh Vasudevan is a Software Development Engineer with AWS Deep Learning.He is passionate about  building scalable deep learning systems. In spare time, he enjoys gaming, cricket and hanging out with friends and family.





Denis Davydenko is an Engineering Manager with AWS Deep Learning. He focuses on building Deep Learning tools that enable developers and scientists to build intelligent applications. In his spare time he enjoys spending time with his family, playing poker and video games.




Sam Skalicky is a Software Engineer with AWS Deep Learning and enjoys building heterogeneous high performance computing systems. He is an avid coffee enthusiast and avoids hiking at all costs.