AWS Neuron is a software development kit (SDK) for running machine learning inference on AWS Inferentia chips. It consists of a compiler, a runtime, and profiling tools that enable developers to run high-performance, low-latency inference on AWS Inferentia-based Amazon EC2 Inf1 instances. With Neuron, developers can take machine learning models trained in any popular framework, such as TensorFlow, PyTorch, or MXNet, and run them optimally on Amazon EC2 Inf1 instances. You can continue to use the same ML frameworks you use today and migrate your software onto Inf1 instances with minimal code changes and without being tied to vendor-specific solutions.
The fastest and easiest way to get started with Inf1 instances is Amazon SageMaker, a fully managed service that enables data scientists and developers to build, train, and deploy machine learning models. Developers who prefer to manage their own machine learning workflows will find AWS Neuron easy to integrate into their existing and future workflows as it is natively integrated with popular frameworks including TensorFlow, PyTorch, and MXNet. Neuron is pre-installed in AWS Deep Learning AMIs as well as in AWS Deep Learning Containers. Customers who are using containerized applications can deploy Neuron using Amazon ECS, Amazon EKS, or their native container engine of choice.
Easy to use
The AWS Neuron SDK is integrated with popular frameworks such as TensorFlow, PyTorch, and MXNet. It is preinstalled in AWS Deep Learning AMIs and AWS Deep Learning Containers so that customers can quickly get started with high-performance, cost-effective inference on Amazon EC2 Inf1 instances, which feature AWS Inferentia chips.
The AWS Neuron SDK enables efficient programming of, and runtime access to, the Inferentia chips. It provides advanced capabilities such as Auto Casting, which automatically converts FP32 (32-bit floating point) models that are optimized for accuracy to BF16 (bfloat16, a 16-bit floating-point format) to maximize processing throughput. Developers can further improve performance with Neuron features such as model parallelism, batching, and NeuronCore Groups, which provide data parallelism by running the same or different models in parallel for maximum throughput.
Flexibility and choice
Because Neuron is integrated with common machine learning frameworks, developers can deploy their existing models to EC2 Inf1 instances with minimal code changes. This gives them the freedom to maintain hardware portability and take advantage of the latest technologies without being tied to vendor-specific software libraries. With Neuron, developers can deploy many commonly used machine learning models, such as single shot detector (SSD) and ResNet for image recognition and classification, and Transformer and BERT for natural language processing and translation. Additionally, Neuron's support for the Hugging Face model repository lets customers compile and run inference with pretrained or fine-tuned models by changing just a single line of code.
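As a rough illustration of the "single line" change in PyTorch, the sketch below swaps the usual `torch.jit.trace` call for `torch.neuron.trace`, which becomes available once the `torch_neuron` package is imported. The helper name `compile_for_inf1` and the default output path are hypothetical, and the exact options accepted by the trace call may differ between Neuron releases, so treat this as a starting point rather than a reference implementation.

```python
def compile_for_inf1(model, example_inputs, out_path="model_neuron.pt"):
    """Compile a PyTorch model for Inferentia and save the result.

    Hypothetical helper: the name and default path are illustrative only.
    """
    # Imported lazily so this sketch can be loaded on machines that do not
    # have PyTorch or the Neuron SDK installed.
    import torch
    import torch_neuron  # noqa: F401  (registers the torch.neuron namespace)

    model.eval()
    # The Neuron-specific line: on CPU or GPU this would typically be
    # torch.jit.trace(model, example_inputs).
    model_neuron = torch.neuron.trace(model, example_inputs=example_inputs)
    model_neuron.save(out_path)
    return out_path
```

On an Inf1 instance, the saved file would typically be loaded with `torch.jit.load(out_path)` after importing `torch_neuron`, and then called like any other TorchScript module.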
AWS Neuron automatically optimizes neural-net compute to execute intensive tasks on Inferentia and other tasks on the CPU to increase overall performance.
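The partitioning idea described above can be pictured with a small, self-contained sketch. It is not how the Neuron compiler is implemented: the operator names and the `SUPPORTED_ON_ACCELERATOR` set are hypothetical, and the real decision about what runs where is made by Neuron itself. The sketch only shows the general shape of the technique, which is to group consecutive operators by the device that will execute them.

```python
from itertools import groupby

# Hypothetical set of operators the accelerator can run. The real list is
# determined by the Neuron compiler, not by user code.
SUPPORTED_ON_ACCELERATOR = {"conv2d", "matmul", "relu", "softmax"}


def partition(ops):
    """Group consecutive ops by target device, preserving execution order."""

    def target(op):
        return "inferentia" if op in SUPPORTED_ON_ACCELERATOR else "cpu"

    return [(device, list(group)) for device, group in groupby(ops, key=target)]


if __name__ == "__main__":
    graph = ["decode_jpeg", "conv2d", "relu", "matmul", "softmax", "top_k"]
    for device, segment in partition(graph):
        print(device, segment)
```

In this toy graph, the compute-heavy middle section forms one accelerator segment, while image decoding and the final top-k selection stay on the CPU.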
AWS Neuron takes high-precision FP32 trained models and autocasts them to BF16 for high-throughput inference, at the lower cost and higher speed of a 16-bit data type.
NeuronCore Pipeline enables high-throughput model parallelism for latency-sensitive applications such as natural language processing by sharding a compute graph across multiple NeuronCores, caching the model parameters in each core’s on-chip memory, and then streaming inference requests across the cores in a pipelined manner.
NeuronCore Groups enable developers to deploy multiple models concurrently, running a different model on each group to make optimal use of hardware resources.
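To see what the FP32-to-BF16 conversion means numerically, the sketch below converts a value in plain Python. BF16 keeps the sign bit, the 8 exponent bits, and the top 7 mantissa bits of an FP32 value, so it preserves FP32's range while giving up precision. Round-to-nearest-even is assumed here for illustration; the rounding mode Neuron actually applies is not stated in this document.

```python
import struct


def fp32_to_bf16_bits(x: float) -> int:
    """Return the 16-bit BF16 pattern for x (round-to-nearest-even assumed)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # NaN: keep it a NaN instead of letting rounding turn it into infinity.
    if (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF):
        return ((bits >> 16) | 0x0040) & 0xFFFF
    # Add half of the discarded range, plus one more when the kept LSB is odd,
    # so that exact ties round to the even neighbor.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bf16_bits_to_fp32(b: int) -> float:
    """Widen a BF16 bit pattern back to a Python float via FP32."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]


def autocast(x: float) -> float:
    """Round-trip a value through BF16 to show the precision that remains."""
    return bf16_bits_to_fp32(fp32_to_bf16_bits(x))


if __name__ == "__main__":
    print(autocast(3.14159))  # 3.140625
```

The round trip keeps roughly two to three significant decimal digits, which is why a model tuned for accuracy in FP32 should be validated after casting.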
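The pipelining described above can be sketched as a small simulation. This is a conceptual model only: layers are plain Python functions, a "core" is just a list of layers, and the names `shard_layers` and `run_pipeline` are hypothetical. It shows the two steps named in the text, namely splitting the graph into contiguous stages and then streaming requests so that several cores work on different requests during the same time step.

```python
from typing import Callable, List, Sequence, Tuple

Layer = Callable[[float], float]


def shard_layers(layers: Sequence[Layer], num_cores: int) -> List[List[Layer]]:
    """Split layers into num_cores contiguous stages of near-equal size."""
    if not 1 <= num_cores <= len(layers):
        raise ValueError("need 1 <= num_cores <= number of layers")
    base, extra = divmod(len(layers), num_cores)
    stages, start = [], 0
    for core in range(num_cores):
        size = base + (1 if core < extra else 0)
        stages.append(list(layers[start:start + size]))
        start += size
    return stages


def run_pipeline(
    stages: Sequence[Sequence[Layer]], requests: Sequence[float]
) -> Tuple[List[float], int]:
    """Stream requests through the stages; return (outputs, time steps used)."""
    values = list(requests)
    n, s = len(values), len(stages)
    steps = n + s - 1 if n else 0
    for t in range(steps):
        # At step t, stage k holds request t - k, so every stage is working
        # on a different request at the same time.
        for k, stage in enumerate(stages):
            req = t - k
            if 0 <= req < n:
                for layer in stage:
                    values[req] = layer(values[req])
    return values, steps


if __name__ == "__main__":
    layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
    print(run_pipeline(shard_layers(layers, 2), [1, 2, 3]))  # ([1, 9, 25], 4)
```

With 2 stages and 3 requests, the pipeline finishes in 4 steps instead of the 6 needed if each request had to clear both stages before the next one started.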
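As a hedged sketch of how groups might be configured, the helper below checks a proposed set of group sizes against the number of NeuronCores available (4 per chip, per the chip description in this document) and formats them as a comma-separated string. Neuron framework releases for Inf1 have read such a string from the `NEURONCORE_GROUP_SIZES` environment variable; treat both the variable name and the format as assumptions to verify against the Neuron version you have installed. The helper name `group_sizes_env` is hypothetical.

```python
import os

CORES_PER_CHIP = 4  # each AWS Inferentia chip has 4 NeuronCores


def group_sizes_env(group_sizes, num_chips=1):
    """Validate NeuronCore group sizes and format them as 'a,b,c'."""
    total = CORES_PER_CHIP * num_chips
    if not group_sizes or any(size < 1 for size in group_sizes):
        raise ValueError("each group needs at least one NeuronCore")
    if sum(group_sizes) > total:
        raise ValueError(
            f"groups need {sum(group_sizes)} cores, only {total} available"
        )
    return ",".join(str(size) for size in group_sizes)


if __name__ == "__main__":
    # One 2-core group for a larger model and two 1-core groups for smaller
    # ones, all on a single chip. The variable name is an assumption.
    os.environ["NEURONCORE_GROUP_SIZES"] = group_sizes_env([2, 1, 1])
    print(os.environ["NEURONCORE_GROUP_SIZES"])  # 2,1,1
```

Each model loaded afterwards would then be placed on the next available group, so the order in which models are loaded matters under this assumed scheme.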
AWS Neuron optimizes workloads on the Inferentia chip to achieve maximal utilization on small batches, which enables high performance for applications that have strict response time requirements.
How it works
AWS Inferentia chips
AWS Inferentia is a machine learning inference chip designed and built by AWS to deliver high performance at low cost. Each AWS Inferentia chip has 4 NeuronCores and supports FP16, BF16, and INT8 data types. AWS Inferentia chips also feature a large amount of on-chip memory that can be used to cache large models, which is especially beneficial for models that require frequent memory access.
Amazon EC2 Inf1 Instances
Amazon EC2 Inf1 instances, based on AWS Inferentia chips, deliver up to 2.3x higher throughput and up to 70% lower cost per inference than comparable current generation GPU-based Amazon EC2 instances. Inf1 instances feature up to 16 AWS Inferentia chips, the latest custom 2nd generation Intel® Xeon® Scalable processors, and up to 100 Gbps networking to enable high-throughput inference.