AWS Machine Learning Blog

Amazon SageMaker Neo makes it easier to get faster inference for more ML models with NVIDIA TensorRT

Amazon SageMaker Neo now uses the NVIDIA TensorRT acceleration library to speed up machine learning (ML) models on NVIDIA Jetson devices at the edge and on AWS g4dn and p3 instances in the AWS Cloud. Neo compiles models from TensorFlow, TFLite, MXNet, PyTorch, ONNX, and DarkNet to make optimal use of NVIDIA GPUs, giving you the best available performance from the hardware across a broader range of frameworks and models, without requiring you to learn the nitty-gritty details of deep learning frameworks and acceleration libraries.
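
For example, you can start a compilation job for one of the Jetson targets with the AWS SDK for Python (Boto3). In this sketch, the job name, role ARN, S3 paths, and input shape are placeholders that you would replace with your own values.

```python
import boto3

sm = boto3.client("sagemaker")

# Placeholder job name, IAM role, S3 locations, and input shape.
sm.create_compilation_job(
    CompilationJobName="ssd-mobilenet-jetson-xavier",
    RoleArn="arn:aws:iam::123456789012:role/SageMakerNeoRole",
    InputConfig={
        "S3Uri": "s3://my-bucket/models/ssd-mobilenet.tar.gz",
        "DataInputConfig": '{"data": [1, 3, 512, 512]}',
        "Framework": "MXNET",
    },
    OutputConfig={
        "S3OutputLocation": "s3://my-bucket/compiled/",
        "TargetDevice": "jetson_xavier",
    },
    StoppingCondition={"MaxRuntimeInSeconds": 900},
)
```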

The NVIDIA TensorRT library supports a subset of operators commonly used in deep learning models. Previously, Neo used TensorRT only when the entire computational graph of the model and all its operators could be accelerated by the library. As a result, the use of the library was limited mostly to image classification models.

Now, Neo takes advantage of TensorRT for all models, even when a model contains operators that the library doesn’t support. Neo does this by partitioning the model into sub-graphs, handing the sub-graphs whose operators TensorRT supports to TensorRT and the remaining sub-graphs to Apache TVM. Then, at runtime, Neo uses a new heterogeneous execution mechanism to run both types of sub-graphs within the same runtime.
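
Open-source Apache TVM exposes a similar partition-and-compile flow through its TensorRT integration. The sketch below is illustrative only; the exact function names and return values have varied between TVM releases, and `mod` and `params` are assumed to come from one of TVM’s framework front ends.

```python
import tvm
from tvm import relay
from tvm.relay.op.contrib.tensorrt import partition_for_tensorrt

# Assume `mod` and `params` were imported from a framework front end, e.g.:
#   mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

# Annotate TensorRT-supported operators and split the graph into
# TensorRT sub-graphs and plain TVM sub-graphs. (Recent TVM releases
# return just the partitioned module; older releases also returned a
# config dict to pass into the PassContext.)
mod = partition_for_tensorrt(mod, params)

# TensorRT sub-graphs are handed to the TensorRT codegen; everything
# else is lowered to CUDA by TVM. A single runtime module runs both.
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="cuda", params=params)
```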

With this approach, Neo automatically uses TensorRT to accelerate computation-heavy operations that the accelerator library supports, such as convolutions, while generating highly performant CUDA code through Apache TVM for all other operations. As a result, Neo delivers better performance for more models than either NVIDIA TensorRT or Apache TVM alone.

The Neo team generalized this approach into a mechanism we call Bring Your Own Codegen. It allows us to easily extend this work to new hardware partners, who can bring their own accelerator libraries to take advantage of the wide range of frameworks and models covered by Neo while improving performance to the full extent possible on their hardware.
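
In the open-source TVM version of this flow, a hardware partner declares which operators its library supports by registering annotation functions on a codegen target. The snippet below is only a sketch: the target name is made up, and the callback signature has changed across TVM releases.

```python
import tvm

# Hypothetical codegen name for a partner accelerator library.
@tvm.ir.register_op_attr("nn.conv2d", "target.partner_accel")
def _conv2d_supported(expr):
    # Claim only float32 convolutions; anything else stays with TVM/CUDA.
    return expr.args[0].checked_type.dtype == "float32"
```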

Performance highlights

The following table summarizes inference latency by platform, framework, and model.

Platform       Framework   Model                  Latency
Jetson Xavier  TensorFlow  SSD MobileNetV2 COCO   49.71 ms
Jetson Xavier  MXNet       SSD MobileNet 1.0 VOC  19.7 ms
Jetson Xavier  PyTorch     YOLOv4                 68.76 ms
Jetson Xavier  DarkNet     YOLOv3 Tiny            18.98 ms
Jetson Nano    TensorFlow  SSD MobileNetV2 COCO   223.69 ms
Jetson Nano    MXNet       SSD MobileNet 1.0 VOC  131.58 ms
Jetson Nano    DarkNet     YOLOv3 Tiny            41.73 ms

Conclusion

We’re very excited to offer this new integration with TensorRT, which allows you to speed up inference for your ML models. To get started with Amazon SageMaker Neo for NVIDIA Jetson or AWS g4dn, p3, and p2 instances, see Amazon SageMaker Neo.
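
Once a compilation job finishes, the output artifact can be loaded on the target device with DLR, the open-source runtime used for Neo-compiled models. In this sketch, the model directory, input name, and input shape are placeholders.

```python
import numpy as np
import dlr

# Directory containing the artifacts extracted from the Neo compilation output.
model = dlr.DLRModel("/home/nvidia/compiled-model", dev_type="gpu", dev_id=0)

# Placeholder input; use your model's actual input name and shape.
image = np.random.rand(1, 3, 512, 512).astype("float32")
outputs = model.run({"data": image})
```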


About the Author

Trevor Morris is a Software Engineer at AWS AI working on compiler technology and optimization for machine learning inference. He focuses on improving performance for GPUs, with previous experience at NVIDIA.