Demand for deep learning acceleration is growing rapidly across a wide range of applications. Applications such as personalized search recommendations, dynamic pricing, and automated customer support are growing in sophistication and becoming more expensive to run in production. As more applications embed machine learning capabilities, a higher percentage of workloads need acceleration, including those that require low-latency, real-time performance. These applications benefit from infrastructure optimized to execute machine learning algorithms.
AWS’s vision is to make deep learning pervasive for everyday developers and to democratize access to cutting-edge hardware through a low-cost, pay-as-you-go usage model. AWS Inferentia is a big step toward, and a commitment to, delivering on this vision. AWS Inferentia is designed to provide high inference performance in the cloud, to drive down the total cost of inference, and to make it easy for you to integrate machine learning into your standard application features and capabilities.
Each AWS Inferentia chip supports up to 128 TOPS (trillions of operations per second) of performance at low power, which makes it possible to place multiple chips in a single EC2 instance. AWS Inferentia supports the FP16, BF16, and INT8 data types. Furthermore, Inferentia can take a model trained in 32-bit precision and run it at the speed of a 16-bit model using BFloat16.
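To make the BFloat16 point concrete, the sketch below shows how a 32-bit float maps onto the BF16 format: BF16 keeps the sign bit, all 8 exponent bits, and the top 7 mantissa bits of the FP32 encoding, so the value range is preserved while precision is reduced. This is only an illustration of the number format, assuming simple truncation. It is not a description of how Inferentia or the Neuron SDK performs the conversion (real hardware typically rounds rather than truncates), and the function names `fp32_to_bf16_bits` and `bf16_bits_to_fp32` are hypothetical.

```python
import struct


def fp32_to_bf16_bits(x: float) -> int:
    """Return the 16-bit BF16 pattern for x by truncating its FP32 encoding.

    Assumption: plain truncation (drop the low 16 bits). Hardware usually
    rounds to nearest instead; this keeps the sketch minimal.
    """
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    return bits32 >> 16


def bf16_bits_to_fp32(bits16: int) -> float:
    """Expand a 16-bit BF16 pattern back to a float by zero-filling the low bits."""
    (value,) = struct.unpack("<f", struct.pack("<I", (bits16 & 0xFFFF) << 16))
    return value


if __name__ == "__main__":
    for sample in (1.0, 3.14159, -2.0, 1e38):
        bits = fp32_to_bf16_bits(sample)
        print(f"{sample!r:>10} -> 0x{bits:04X} -> {bf16_bits_to_fp32(bits)!r}")
```

Because the exponent field is unchanged, a very large value such as 1e38 survives the round trip with a small relative error, whereas it would overflow in FP16. That preserved range is the property that makes it practical to run a model trained in FP32 with BF16 arithmetic.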
AWS Inferentia features a large amount of on-chip memory that can be used to cache large models, removing the need to store them off-chip. This significantly lowers inference latency, because Inferentia’s processing cores (Neuron Cores) have high-speed access to models and are not limited by the chip’s off-chip memory bandwidth.
Ease of use
AWS Inferentia comes with the AWS Neuron software development kit (SDK), which enables complex neural network models created and trained in popular frameworks to be executed on AWS Inferentia-based EC2 Inf1 instances. Neuron consists of a compiler, a runtime, and profiling tools, and is pre-integrated into popular machine learning frameworks, including TensorFlow, PyTorch, and MXNet, to deliver optimal performance on EC2 Inf1 instances.