AWS’s vision is to make deep learning pervasive for everyday developers and to democratize access to cutting-edge infrastructure through a low-cost, pay-as-you-go usage model. AWS Inferentia is Amazon's first custom silicon designed to accelerate deep learning workloads and is part of a long-term strategy to deliver on this vision. AWS Inferentia is designed to provide high-performance inference in the cloud, to drive down the total cost of inference, and to make it easy for developers to integrate machine learning into their business applications.
The AWS Neuron software development kit (SDK) consists of a compiler, run-time, and profiling tools that help optimize the performance of workloads for AWS Inferentia. Developers can take complex neural network models that have been built and trained on popular frameworks such as TensorFlow, PyTorch, and MXNet, and deploy them on AWS Inferentia-based Amazon EC2 Inf1 instances. You can continue to use the same ML frameworks you use today and migrate your models onto Inf1 with minimal code changes and without tie-in to vendor-specific solutions.
Each AWS Inferentia chip supports up to 128 TOPS (trillions of operations per second) of performance with up to 16 Inferentia chips per EC2 Inf1 instance. Inferentia is optimized for maximizing throughput for small batch sizes, which is especially beneficial for applications that have strict latency requirements such as voice generation and search.
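The two figures above can be combined into a rough upper bound for a whole instance. The sketch below is only arithmetic on the stated peak numbers; it assumes peak throughput adds up linearly across chips, which real workloads will not necessarily achieve, and the function name is a hypothetical helper, not part of any AWS tool.

```python
# Peak figures quoted in the text above.
TOPS_PER_CHIP = 128            # "up to 128 TOPS" per Inferentia chip
MAX_CHIPS_PER_INSTANCE = 16    # "up to 16 Inferentia chips" per Inf1 instance


def peak_instance_tops(chips: int, tops_per_chip: int = TOPS_PER_CHIP) -> int:
    """Theoretical peak TOPS for an instance with `chips` Inferentia chips.

    Assumes perfect linear scaling across chips (an idealization).
    """
    if not 1 <= chips <= MAX_CHIPS_PER_INSTANCE:
        raise ValueError(f"chips must be between 1 and {MAX_CHIPS_PER_INSTANCE}")
    return chips * tops_per_chip


if __name__ == "__main__":
    for n in (1, 4, 16):
        print(f"{n:>2} chip(s): up to {peak_instance_tops(n)} TOPS")
```

Under that assumption, a fully populated instance tops out at 16 × 128 = 2,048 TOPS.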
AWS Inferentia features a large amount of on-chip memory, which can be used for caching large models instead of storing them off-chip. This significantly reduces inference latency, since Inferentia’s processing cores, called Neuron Cores, have high-speed access to models that are stored in on-chip memory and are not limited by off-chip memory bandwidth.
Ease of use
Developers can train models using popular frameworks such as TensorFlow, PyTorch, and MXNet, and easily deploy them to AWS Inferentia-based Inf1 instances using the AWS Neuron SDK. AWS Inferentia supports FP16, BF16, and INT8 data types. Furthermore, Inferentia can take a 32-bit trained model and automatically run it at the speed of a 16-bit model using BFloat16.
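To make the FP32-to-BFloat16 relationship concrete: BF16 keeps the sign bit, the full 8-bit exponent, and the top 7 mantissa bits of an FP32 value, so a conversion amounts to keeping the upper 16 bits. The sketch below illustrates the number format only. It is not Inferentia's or Neuron's implementation; the round-to-nearest-even step is an assumption, the helper names are hypothetical, and NaN/infinity inputs are not handled.

```python
import struct


def fp32_to_bf16_bits(x: float) -> int:
    """Return the 16-bit BF16 pattern for a finite FP32-representable value.

    Uses round-to-nearest-even on the 16 discarded mantissa bits
    (an assumed rounding mode for this illustration).
    """
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    # Bias of 0x7FFF plus the lowest kept bit implements ties-to-even.
    rounding = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding) >> 16) & 0xFFFF


def bf16_bits_to_fp32(b: int) -> float:
    """Expand a 16-bit BF16 pattern back to a Python float via FP32."""
    (x,) = struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))
    return x


if __name__ == "__main__":
    for value in (1.0, 3.14159, 1e-3):
        b = fp32_to_bf16_bits(value)
        print(f"{value!r:>10} -> 0x{b:04X} -> {bf16_bits_to_fp32(b)!r}")
```

Because the exponent field is unchanged, BF16 covers the same dynamic range as FP32 and gives up only precision, which is why a model trained in 32-bit can be cast down without rescaling its weights.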
Amazon EC2 Inf1 Instances Powered by AWS Inferentia
Amazon EC2 Inf1 instances based on AWS Inferentia chips deliver up to 2.3x higher throughput and up to 70% lower cost per inference than comparable current-generation GPU-based Amazon EC2 instances. Inf1 instances feature up to 16 AWS Inferentia chips, the latest custom 2nd generation Intel® Xeon® Scalable processors, and up to 100 Gbps networking to enable high-throughput inference. The easiest and quickest way to get started with Inf1 instances is via Amazon SageMaker, a fully managed service that enables developers to build, train, and deploy machine learning models quickly. Developers using containerized applications can also use Amazon Elastic Kubernetes Service (EKS) to deploy Inf1 instances.
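Cost per inference depends on both hourly instance price and sustained throughput, so it can be useful to see how the two combine. The sketch below is a generic calculation, not an AWS pricing tool. The prices and throughput values are hypothetical placeholders chosen only to make the arithmetic easy to follow; substitute measured throughput and current on-demand prices for a real comparison.

```python
def cost_per_million_inferences(hourly_price_usd: float,
                                throughput_per_sec: float) -> float:
    """USD cost of one million inferences at a sustained throughput."""
    if hourly_price_usd < 0 or throughput_per_sec <= 0:
        raise ValueError("price must be >= 0 and throughput must be > 0")
    inferences_per_hour = throughput_per_sec * 3600
    return hourly_price_usd / inferences_per_hour * 1_000_000


def relative_cost_reduction(baseline: float, candidate: float) -> float:
    """Fraction by which `candidate` is cheaper than `baseline`."""
    return 1.0 - candidate / baseline


if __name__ == "__main__":
    # Hypothetical placeholder numbers, NOT real prices or benchmarks.
    baseline = cost_per_million_inferences(1.00, 1000.0)
    candidate = cost_per_million_inferences(0.69, 2300.0)
    saving = relative_cost_reduction(baseline, candidate)
    print(f"baseline : ${baseline:.4f} per million inferences")
    print(f"candidate: ${candidate:.4f} per million inferences")
    print(f"reduction: {saving:.0%}")
```

With these placeholder inputs, 2.3x the throughput at 0.69x the hourly price works out to a 70% lower cost per inference, which shows why the throughput and cost figures need to be read together.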
AWS Neuron SDK
AWS Neuron is a software development kit (SDK) for running machine learning inference using AWS Inferentia chips. It consists of a compiler, run-time, and profiling tools that enable developers to run high-performance, low-latency inference using AWS Inferentia-based Amazon EC2 Inf1 instances. Using Neuron, developers can easily train their machine learning models on any popular framework such as TensorFlow, PyTorch, and MXNet, and run them optimally on EC2 Inf1 instances. You can continue to use the same ML frameworks you use today and migrate your software onto Inf1 instances with minimal code changes and without tie-in to vendor-specific solutions. The AWS Neuron SDK comes pre-installed in AWS Deep Learning AMIs, as well as in AWS Deep Learning Containers, making it easy to get started with Inf1 instances.
Blogs and Articles
Patrick Moorhead, May 13, 2020
James Hamilton, Nov 28, 2018