Amazon SageMaker Neo

Run ML models anywhere with up to 25x better performance

Amazon SageMaker Neo enables developers to optimize machine learning (ML) models for inference on SageMaker in the cloud and supported devices at the edge.

ML inference is the process of using a trained machine learning model to make predictions. After training a model for high accuracy, developers often spend a lot of time and effort tuning the model for high performance. For inference in the cloud, developers often turn to large instances with ample memory and powerful processing capabilities, at higher cost, to achieve better throughput. For inference on edge devices with limited compute and memory, developers often spend months hand-tuning the model to achieve acceptable performance within the device's hardware constraints.

Amazon SageMaker Neo automatically optimizes machine learning models for inference on cloud instances and edge devices to run faster with no loss in accuracy. You start with a machine learning model already built with DarkNet, Keras, MXNet, PyTorch, TensorFlow, TensorFlow-Lite, ONNX, or XGBoost and trained in Amazon SageMaker or anywhere else. Then you choose your target hardware platform, which can be a SageMaker hosting instance or an edge device based on processors from Ambarella, Apple, ARM, Intel, MediaTek, Nvidia, NXP, Qualcomm, RockChip, or Texas Instruments. With a single click, SageMaker Neo optimizes the trained model and compiles it into an executable. The compiler uses a machine learning model to choose the optimizations that extract the best available performance from your model on the cloud instance or edge device. You then deploy the model as a SageMaker endpoint or on supported edge devices and start making predictions.
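
As an illustration, a minimal sketch of this workflow with the SageMaker Python SDK might look like the following; the S3 paths, IAM role, entry-point script, framework version, input shape, and job name are placeholders, and the exact parameters depend on your model and SDK version.

    # Minimal sketch using the SageMaker Python SDK (placeholder values throughout).
    from sagemaker.pytorch import PyTorchModel

    role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder IAM role

    # A model already trained and packaged as model.tar.gz in S3 (placeholder path).
    pytorch_model = PyTorchModel(
        model_data="s3://my-bucket/model.tar.gz",
        role=role,
        entry_point="inference.py",      # placeholder inference script
        framework_version="1.8",
        py_version="py3",
    )

    # Ask SageMaker Neo to compile the model for a chosen target instance family.
    compiled_model = pytorch_model.compile(
        target_instance_family="ml_c5",
        input_shape={"input0": [1, 3, 224, 224]},
        output_path="s3://my-bucket/compiled/",
        role=role,
        framework="pytorch",
        framework_version="1.8",
        job_name="resnet18-neo-c5",      # placeholder job name
    )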

For inference in the cloud, SageMaker Neo speeds up inference and reduces cost by creating an inference optimized container in SageMaker hosting. For inference at the edge, SageMaker Neo saves developers months of manual tuning by automatically tuning the model for the selected operating system and processor hardware.

Amazon SageMaker Neo uses Apache TVM and partner-provided compilers and acceleration libraries to deliver the best available performance for a given model and hardware target. AWS contributes the compiler code to the Apache TVM project and the runtime code to the Neo-AI open-source project, under the Apache Software License, to enable processor vendors and device makers to innovate rapidly on a common compact runtime.

How it works

(Diagram: How Amazon SageMaker Neo works)

Benefits

Improve performance up to 25x

Amazon SageMaker Neo automatically optimizes machine learning models to perform up to 25x faster with no loss in accuracy. SageMaker Neo uses the tool chain best suited for your model and target hardware platform while providing a simple standard API for model compilation.

Less than 1/10 the runtime footprint

The Amazon SageMaker Neo runtime consumes as little as 1/10 the footprint of a deep learning framework such as TensorFlow or PyTorch. Instead of installing the framework on your target hardware, you load the compact Neo runtime library into your ML application. And unlike a compact framework such as TensorFlow-Lite, the Neo runtime can run a model trained in any of the frameworks supported by the Neo compiler.
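
For example, a minimal sketch of loading a compiled model on a device with the open-source DLR runtime from the Neo-AI project could look like this; the model directory, input name, and input shape are placeholders that depend on the model you compiled.

    # Minimal sketch using the open-source DLR runtime (neo-ai-dlr); the path and
    # input name below are placeholders.
    import numpy as np
    import dlr

    # Directory containing the Neo-compiled model artifacts on the device.
    model = dlr.DLRModel("/opt/ml/compiled_model", dev_type="cpu")

    # The input name ("data") and shape depend on the compiled model.
    image = np.random.rand(1, 3, 224, 224).astype("float32")
    outputs = model.run({"data": image})
    print(outputs[0].shape)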

Faster time to production

Amazon SageMaker Neo makes it easy to prepare your model for deployment on virtually any hardware platform with only a few clicks in the Amazon SageMaker console. You get all the benefits of manual tuning with none of the effort.

Key Features

Optimizes inference without compromising accuracy
Amazon SageMaker Neo uses research-led techniques in machine learning compilers to optimize your model for the target hardware. Applying these systematic optimization techniques automatically, SageMaker Neo speeds up your models with no loss in accuracy.

Supports popular machine learning frameworks
Amazon SageMaker Neo converts a model from the framework-specific format of DarkNet, Keras, MXNet, PyTorch, TensorFlow, TensorFlow-Lite, ONNX, or XGBoost into a common representation, optimizes the computations, and generates a hardware-specific executable for the target SageMaker hosting instance or edge device.
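
As a rough illustration, a compilation job can also be started through the low-level AWS API with boto3; the job name, role ARN, S3 paths, framework, input shape, and target device below are placeholders to adapt to your own model.

    # Minimal sketch of the low-level compilation API via boto3 (placeholder values).
    import boto3

    sm = boto3.client("sagemaker")

    sm.create_compilation_job(
        CompilationJobName="resnet50-neo-example",          # placeholder job name
        RoleArn="arn:aws:iam::123456789012:role/NeoRole",   # placeholder IAM role
        InputConfig={
            "S3Uri": "s3://my-bucket/model.tar.gz",         # trained model artifact
            "DataInputConfig": '{"input_1": [1, 224, 224, 3]}',
            "Framework": "TENSORFLOW",
        },
        OutputConfig={
            "S3OutputLocation": "s3://my-bucket/compiled/",
            "TargetDevice": "jetson_nano",                  # or a cloud family such as ml_c5
        },
        StoppingCondition={"MaxRuntimeInSeconds": 900},
    )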

Provides a compact runtime with standard APIs
The Amazon SageMaker Neo runtime occupies 1 MB of storage and 2 MB of memory, many times smaller than the storage and memory footprint of a framework, while providing a simple common API to run a compiled model originating in any supported framework.

Supports popular target platforms
The Amazon SageMaker Neo runtime is supported on Android, Linux, and Windows operating systems and on processors from Ambarella, ARM, Intel, Nvidia, NXP, Qualcomm, and Texas Instruments. SageMaker Neo also converts PyTorch and TensorFlow models to the Core ML format for deployment on macOS, iOS, iPadOS, watchOS, and tvOS on Apple devices.

Inference optimized containers for Amazon SageMaker hosting instances
For inference in the cloud, Amazon SageMaker Neo provides inference optimized containers that include MXNet, PyTorch, and TensorFlow integrated with the Neo runtime. Previously, SageMaker Neo would fail to compile models that used unsupported operators. Now, SageMaker Neo optimizes each model to the extent that the compiler supports its operators and uses the framework to run the remaining, uncompiled portions of the model. As a result, you can run any MXNet, PyTorch, or TensorFlow model in the inference optimized containers while getting better performance for the models that can be compiled.
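
Continuing the earlier SDK sketch, deploying the compiled model into one of these inference optimized containers is a single call; the instance type shown is a placeholder and should match the instance family used at compile time.

    # Minimal sketch, continuing from the compiled_model object in the earlier example.
    predictor = compiled_model.deploy(
        initial_instance_count=1,
        instance_type="ml.c5.xlarge",   # should correspond to the ml_c5 compile target
    )

    # The endpoint is then invoked like any other SageMaker endpoint, e.g.
    # predictor.predict(payload) with a payload in the model's expected format.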

Model partitioning for heterogeneous hardware
Amazon SageMaker Neo takes advantage of partner-provided accelerator libraries to deliver the best available performance for a deep learning model on heterogeneous hardware platforms that pair a hardware accelerator with a CPU. Acceleration libraries such as Ambarella CV Tools, Nvidia TensorRT, and Texas Instruments TIDL each support a specific set of functions and operators. SageMaker Neo automatically partitions your model so that the part with operators supported by the accelerator runs on the accelerator while the rest of the model runs on the CPU. In this way, SageMaker Neo makes the most of the hardware accelerator, increasing the types of models that can run on the hardware and improving the performance of the model to the extent that its operators are supported by the accelerator.

Support for Amazon SageMaker Inf1 instances
Amazon SageMaker Neo now compiles models for Amazon SageMaker Inf1 instance targets. SageMaker hosting provides a managed service for inference on Inf1 instances, which are based on the AWS Inferentia chip. SageMaker Neo exposes the standard model compilation API while using the Inferentia-specific Neuron compiler under the hood, simplifying the task of preparing a model for deployment on SageMaker Inf1 instances while delivering the best available performance and cost savings of Inf1.
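
As a rough sketch, targeting Inf1 reuses the same compile call shown earlier, pointed at the ml_inf1 instance family; the values below are placeholders, and the supported framework versions depend on the Neuron toolchain.

    # Minimal sketch, reusing the pytorch_model and role placeholders from the
    # earlier example but targeting the Inferentia-based ml_inf1 family.
    compiled_inf1_model = pytorch_model.compile(
        target_instance_family="ml_inf1",
        input_shape={"input0": [1, 3, 224, 224]},
        output_path="s3://my-bucket/compiled-inf1/",   # placeholder S3 path
        role=role,
        framework="pytorch",
        framework_version="1.8",
        job_name="resnet18-neo-inf1",                  # placeholder job name
    )

    # Host the compiled model on an Inf1 instance in SageMaker hosting.
    predictor = compiled_inf1_model.deploy(
        initial_instance_count=1,
        instance_type="ml.inf1.xlarge",
    )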
