Amazon Elastic Inference allows you to attach just the right amount of GPU-powered inference acceleration to any Amazon EC2 or Amazon SageMaker instance type. This means you can now choose the instance type that is best suited to the overall compute, memory, and storage needs of your application, and then separately configure the amount of inference acceleration that you need.
Integrated with Amazon SageMaker and Amazon EC2
There are two ways to run inference workloads on AWS: deploy your model on Amazon SageMaker for a fully managed experience, or run it on Amazon EC2 instances and manage it yourself. Amazon Elastic Inference is integrated to work seamlessly with Amazon SageMaker and Amazon EC2, allowing you to add inference acceleration in both scenarios. With Amazon SageMaker, you can specify the desired amount of inference acceleration when you create your model's HTTPS endpoint, and with Amazon EC2, when you launch your instance.
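As a minimal sketch of the two options above, the following Python builds the request parameters one might pass to boto3's `create_endpoint_config` (SageMaker) and `run_instances` (EC2). The helper function names are hypothetical, and the model name, AMI ID, instance types, and accelerator type strings are illustrative placeholders rather than values taken from this page. The actual AWS calls are left as comments because they need credentials and an existing model or AMI.

```python
def sagemaker_endpoint_config(model_name, instance_type, accelerator_type):
    """Build kwargs for SageMaker create_endpoint_config with an accelerator."""
    return {
        "EndpointConfigName": f"{model_name}-config",
        "ProductionVariants": [
            {
                "VariantName": "AllTraffic",
                "ModelName": model_name,
                "InstanceType": instance_type,
                "InitialInstanceCount": 1,
                # The accelerator is chosen separately from the instance type.
                "AcceleratorType": accelerator_type,
            }
        ],
    }


def ec2_launch_params(image_id, instance_type, accelerator_type):
    """Build kwargs for EC2 run_instances with one attached accelerator."""
    return {
        "ImageId": image_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        "ElasticInferenceAccelerators": [{"Type": accelerator_type, "Count": 1}],
    }


# Placeholder values, assumed for illustration only.
sm_params = sagemaker_endpoint_config("my-model", "ml.m5.large", "ml.eia1.medium")
ec2_params = ec2_launch_params("ami-EXAMPLE", "c5.large", "eia1.medium")

# With credentials configured, the requests could be sent like this:
# import boto3
# boto3.client("sagemaker").create_endpoint_config(**sm_params)
# boto3.client("ec2").run_instances(**ec2_params)
```

In both cases the accelerator is a separate field from the instance type, which reflects the idea that compute and inference acceleration are sized independently.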
TensorFlow and Apache MXNet support
Amazon Elastic Inference is designed to be used with AWS’s enhanced versions of TensorFlow Serving and Apache MXNet. These enhancements enable the frameworks to automatically detect the presence of inference accelerators, optimally distribute the model operations between the accelerator’s GPU and the instance’s CPU, and securely control access to your accelerators using AWS Identity and Access Management (IAM) policies. The enhanced TensorFlow Serving and MXNet libraries are provided automatically in Amazon SageMaker and the AWS Deep Learning AMIs, so you don't have to make any code changes to deploy your models in production. You can also download them separately by following the instructions in the Amazon Elastic Inference documentation.
Open Neural Network Exchange (ONNX) format support
ONNX is an open format that makes it possible to train a model in one deep learning framework and then transfer it to another for inference. This allows you to take advantage of the relative strengths of different frameworks. For example, with ONNX you can benefit from the flexibility of PyTorch to build and train your model, and then transfer it to Apache MXNet so that it can efficiently run inference at massive scale. ONNX is integrated into PyTorch, MXNet, Chainer, Caffe2, and Microsoft Cognitive Toolkit, and there are connectors for many other frameworks including TensorFlow. To use ONNX models with Amazon Elastic Inference, your trained models need to be transferred to the AWS-optimized version of Apache MXNet for production deployment.
Choice of single or mixed precision operations
Amazon Elastic Inference accelerators support both single-precision (32-bit floating point) operations and mixed-precision (16-bit floating point) operations. Single precision provides an extremely large numerical range and fine resolution for representing the parameters used by your model. However, most models don’t need this much precision, and computing every value at 32 bits costs performance unnecessarily. Mixed-precision operations let you halve the number of bits used per value, trading some range and precision for up to 8x greater inference performance.
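To make the trade-off between 32-bit and 16-bit values concrete, here is a small sketch using only Python's standard `struct` module, which can pack floats in both IEEE 754 single (`f`) and half (`e`) formats. It illustrates storage size and rounding error for one example value; it does not call or measure Elastic Inference itself, and the sample weight `0.1` is an arbitrary choice.

```python
import struct


def round_trip(value, fmt):
    """Pack a Python float into the given struct format and unpack it again."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


weight = 0.1  # arbitrary example parameter value

# Storage cost per value in each format.
fp32_bytes = struct.calcsize("<f")
fp16_bytes = struct.calcsize("<e")

# Rounding error introduced by storing the value in each format.
fp32_error = abs(round_trip(weight, "<f") - weight)
fp16_error = abs(round_trip(weight, "<e") - weight)

print(f"fp32: {fp32_bytes} bytes, rounding error {fp32_error:.3e}")
print(f"fp16: {fp16_bytes} bytes, rounding error {fp16_error:.3e}")
```

The 16-bit value takes half the storage and carries a larger, but for many models tolerable, rounding error.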
Available in multiple amounts of acceleration
Amazon Elastic Inference is available in multiple throughput sizes ranging from 1 to 32 trillion floating point operations per second (TFLOPS) per accelerator, making it efficient for accelerating a wide range of inference models including computer vision, natural language processing, and speech recognition. Compared to standalone Amazon EC2 P3 instances that start at 125 TFLOPS (the smallest P3 instance available), Amazon Elastic Inference starts at a single TFLOPS per accelerator. This allows you to scale up inference acceleration in more appropriate increments. You can also select from larger accelerator sizes, up to 32 TFLOPS per accelerator, for more complex models.
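The idea of picking the smallest accelerator that covers a workload can be sketched as follows. The size table is an assumption based on the EIA1 accelerator family (type name, single-precision TFLOPS, mixed-precision TFLOPS) and should be checked against current AWS documentation; the `smallest_accelerator` helper is hypothetical and not part of any AWS SDK.

```python
# Assumed size table: (type name, FP32 TFLOPS, FP16 TFLOPS), smallest first.
ACCELERATORS = [
    ("eia1.medium", 1, 8),
    ("eia1.large", 2, 16),
    ("eia1.xlarge", 4, 32),
]


def smallest_accelerator(required_tflops, mixed_precision=True):
    """Return the smallest accelerator type meeting the requirement, or None."""
    for name, fp32_tflops, fp16_tflops in ACCELERATORS:
        capacity = fp16_tflops if mixed_precision else fp32_tflops
        if capacity >= required_tflops:
            return name
    return None  # no single accelerator in the table is large enough


# Example: a model estimated (hypothetically) to need about 10 mixed-precision TFLOPS.
choice = smallest_accelerator(10)
print(choice)
```

Because the table is ordered from smallest to largest, the first match is also the smallest size that meets the requirement.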
Auto Scaling support
Amazon Elastic Inference can be part of the same Amazon EC2 Auto Scaling group you use to scale your Amazon SageMaker and Amazon EC2 instances. When EC2 Auto Scaling adds EC2 instances to meet the demands of your application, each new instance launches with its own attached accelerator. Similarly, when Auto Scaling removes EC2 instances as demand goes down, the accelerators attached to those instances are removed with them. This makes it easy to scale your inference acceleration alongside your application’s compute capacity.