AWS Machine Learning Blog

Announcing availability of Inf1 instances in Amazon SageMaker for high performance and cost-effective machine learning inference

Amazon SageMaker is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy machine learning (ML) models quickly. Tens of thousands of customers, including Intuit, Voodoo, ADP, Cerner, Dow Jones, and Thomson Reuters, use Amazon SageMaker to remove the heavy lifting from each step of the ML process.

When it comes to deploying ML models for real-time prediction, Amazon SageMaker provides you with a large selection of AWS instance types, from small CPU instances to multi-GPU instances. This lets you find the right cost/performance ratio for your prediction infrastructure. Today we announce the availability of Inf1 instances in Amazon SageMaker to deliver high performance, low latency, and cost-effective inference.

A primer on Amazon EC2 Inf1 instances

The Amazon EC2 Inf1 instances were launched at AWS re:Invent 2019. Inf1 instances are powered by AWS Inferentia, a custom chip built from the ground up by AWS to accelerate machine learning inference workloads. When compared to G4 instances, Inf1 instances offer up to three times the inferencing throughput and up to 45% lower cost per inference.

Inf1 instances are available in multiple sizes, with 1, 4, or 16 AWS Inferentia chips. Each AWS Inferentia chip contains four NeuronCores. Each NeuronCore implements a high-performance systolic array matrix multiply engine, which massively speeds up typical deep learning operations such as convolution and transformers. NeuronCores are also equipped with a large on-chip cache, which helps cut down on external memory accesses and saves I/O time in the process.

When several AWS Inferentia chips are available on an Inf1 instance, you can partition a model across them and store it entirely in cache memory. Alternatively, to serve multi-model predictions from a single Inf1 instance, you can partition the NeuronCores of an AWS Inferentia chip across several models.
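Outside of Amazon SageMaker, one way to set up this kind of partitioning with the AWS Neuron SDK (in early Neuron releases) is the NEURONCORE_GROUP_SIZES environment variable, which splits the NeuronCores of a chip into groups so that each group serves its own compiled model. The following is a minimal sketch, assuming a Neuron-enabled TensorFlow 1.15 installation on the Inf1 instance; the model directories are placeholders.

import os

# Hypothetical sketch: split the four NeuronCores of one AWS Inferentia chip
# into two groups of two, so two compiled models can be served side by side.
# The variable must be set before the Neuron-enabled framework is imported.
os.environ['NEURONCORE_GROUP_SIZES'] = '2,2'

import tensorflow as tf

# Each Neuron-compiled SavedModel loaded in this process is then served from
# its own NeuronCore group (directory names below are placeholders).
model_a = tf.contrib.predictor.from_saved_model('./compiled_model_a')
model_b = tf.contrib.predictor.from_saved_model('./compiled_model_b')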

To run machine learning models on Inf1 instances, you need to compile them to a hardware-optimized representation using the AWS Neuron SDK. Since the launch of Inf1 instances, AWS has released five versions of the AWS Neuron SDK focused on performance improvements and new features, with more planned on a regular cadence. For example, image classification (ResNet-50) performance has improved by more than 2X, from 1100 to 2300 images/sec on a single AWS Inferentia chip. This performance improvement translates to 45% lower cost per inference compared to G4 instances. Support for object detection models was also added, starting with Single Shot Detector (SSD), with Mask R-CNN coming soon.
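As a point of reference, compiling a model directly with the AWS Neuron SDK (outside of Amazon SageMaker) is a single call. Here is a minimal sketch using the tensorflow-neuron package for TensorFlow 1.15; the directory names are placeholders, and as shown next, Amazon SageMaker Neo performs this step for you when you deploy on Inf1 instances in Amazon SageMaker.

import tensorflow.neuron as tfn

# Compile a regular TensorFlow SavedModel into a Neuron-optimized SavedModel
# (directory names are placeholders)
tfn.saved_model.compile('./saved_model', './saved_model_neuron')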

Let's now look at how you can compile, load, and run models on ml.inf1 instances in Amazon SageMaker.

Using Inf1 instances in Amazon SageMaker

Compiling and deploying models for Inf1 instances in Amazon SageMaker is straightforward thanks to Amazon SageMaker Neo. The AWS Neuron SDK is integrated with Amazon SageMaker Neo to run your model optimally on Inf1 instances in Amazon SageMaker. You only need to complete the following steps:

  1. Train your model as usual.
  2. Compile your model for the Inf1 architecture with Amazon SageMaker Neo.
  3. Deploy your model on Inf1 instances in Amazon SageMaker.

In the following example use case, you train a simple TensorFlow image classifier on the MNIST dataset, as in this sample notebook on GitHub. The training code looks something like the following:

from sagemaker.tensorflow import TensorFlow
mnist_estimator = TensorFlow(entry_point='mnist.py', ...)
mnist_estimator.fit(inputs)
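The constructor arguments are elided above. As a rough illustration only, a complete estimator definition might look like the following sketch; the role, instance type, and data location are assumptions to replace with your own values (parameter names follow version 1.x of the SageMaker Python SDK, which this post uses).

import sagemaker
from sagemaker.tensorflow import TensorFlow

role = sagemaker.get_execution_role()      # IAM role used by the training job

mnist_estimator = TensorFlow(
    entry_point='mnist.py',                # your training script
    role=role,
    train_instance_count=1,
    train_instance_type='ml.c5.xlarge',    # assumed training instance type
    framework_version='1.15.0',            # matches the version used for compilation below
    py_version='py3',
    script_mode=True)

# Assumed location of the MNIST training data in Amazon S3
inputs = 's3://your-bucket/data/mnist'
mnist_estimator.fit(inputs)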

To compile the model for an Inf1 instance, you make a single API call and select ml_inf1 as the deployment target. See the following code:

# S3 bucket where the compiled model is saved
output_path = '/'.join(mnist_estimator.output_path.split('/')[:-1])

# Compile the model for Inf1 instances
optimized_estimator = mnist_estimator.compile_model(
    target_instance_family='ml_inf1',
    input_shape={'data': [1, 784]},  # batch size 1, 28x28 pixels flattened
    output_path=output_path,
    framework='tensorflow',
    framework_version='1.15.0')

Once the model has been compiled, you deploy it on an Inf1 instance in Amazon SageMaker using the optimized estimator returned by Amazon SageMaker Neo. Under the hood, when creating the inference endpoint, Amazon SageMaker automatically selects a container with the Neo Deep Learning Runtime, a lightweight runtime that loads and invokes the optimized model for inference.

optimized_predictor = optimized_estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.inf1.xlarge')

That’s it! After you deploy the model, you can invoke the endpoint and receive predictions in real time with low latency. You can find a full example on GitHub.
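For example, invoking the endpoint from the predictor returned by deploy() could look like the following sketch. The random input is a placeholder standing in for a real 28x28 MNIST image flattened to 784 values, matching the input_shape used during compilation, and the exact payload format accepted by the endpoint depends on the serving container.

import numpy as np

# Placeholder input: one flattened 28x28 image (replace with real MNIST pixels)
sample = np.random.rand(1, 784).astype('float32')

# Send the request to the Inf1-backed endpoint and print the prediction
response = optimized_predictor.predict(sample)
print(response)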

Getting Started

Inf1 instances in Amazon SageMaker are available in four sizes: ml.inf1.xlarge, ml.inf1.2xlarge, ml.inf1.6xlarge, and ml.inf1.24xlarge. Machine learning models developed using the TensorFlow and MXNet frameworks can be compiled with Amazon SageMaker Neo to run optimally on Inf1 instances, and then deployed on Inf1 instances in Amazon SageMaker for real-time inference. You can start using Inf1 instances in Amazon SageMaker today in the US East (N. Virginia) and US West (Oregon) Regions.


About the Author

Julien Simon is an Artificial Intelligence & Machine Learning Evangelist for EMEA. He focuses on helping developers and enterprises bring their ideas to life.