Amazon SageMaker Neo makes it easier to get faster inference for more ML models with NVIDIA TensorRT

Amazon SageMaker Neo now uses the NVIDIA TensorRT acceleration library to speed up machine learning (ML) models on NVIDIA Jetson devices at the edge and on AWS g4dn and p3 instances in the AWS Cloud. Neo compiles models from TensorFlow, TFLite, MXNet, PyTorch, ONNX, and DarkNet to make optimal use of NVIDIA GPUs. This gives you the best available performance from the hardware for a broader range of frameworks and models, without needing to learn the details of deep learning frameworks and acceleration libraries.

The NVIDIA TensorRT library supports a subset of operators commonly used in deep learning models. Previously, Neo used TensorRT only when the entire computational graph of the model and all its operators could be accelerated by the library. As a result, the use of the library was limited mostly to image classification models.

Now, Neo takes advantage of TensorRT for all models, even when a model contains operators that the library doesn’t support. Neo does this by partitioning the model into two types of sub-graphs: TensorRT handles the sub-graphs made up of operators it supports, and Apache TVM handles the rest. Then, at runtime, Neo uses a new heterogeneous execution mechanism to run both types of sub-graphs within the same runtime.
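To make the partitioning idea concrete, the following is a minimal sketch in Python. It is not Neo's implementation: it assumes the model is a simple linear chain of operators (real models are graphs), and the `ACCELERATED_OPS` set, the `Subgraph` class, and the `partition` and `execution_plan` functions are hypothetical names made up for illustration. The actual list of supported operators is defined by TensorRT.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List

# Hypothetical set of operators the accelerator library can run.
# The real supported-operator list comes from TensorRT, not from this sketch.
ACCELERATED_OPS = {"conv2d", "relu", "batch_norm", "max_pool2d"}


@dataclass
class Subgraph:
    backend: str      # "tensorrt" or "tvm"
    ops: List[str]    # operators in execution order


def partition(ops: List[str]) -> List[Subgraph]:
    """Group consecutive operators by the backend that can run them."""
    subgraphs = []
    for supported, group in groupby(ops, key=lambda op: op in ACCELERATED_OPS):
        backend = "tensorrt" if supported else "tvm"
        subgraphs.append(Subgraph(backend, list(group)))
    return subgraphs


def execution_plan(subgraphs: List[Subgraph]) -> List[str]:
    """Describe, in order, which backend each sub-graph is dispatched to."""
    return [f"{sg.backend}: {' -> '.join(sg.ops)}" for sg in subgraphs]


# A toy detection-style model: the convolutions are accelerated,
# while the unsupported operator falls back to the other backend.
model = ["conv2d", "relu", "non_max_suppression", "conv2d"]
for step in execution_plan(partition(model)):
    print(step)
```

In this toy model, the unsupported operator in the middle splits the chain into three sub-graphs, so a single runtime has to hand intermediate results back and forth between the two backends.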

With this approach, Neo automatically uses TensorRT to accelerate the computation-heavy operations that the library supports, such as convolutions, and uses Apache TVM to generate highly performant CUDA code for all other operations. As a result, Neo delivers better performance for more models than either NVIDIA TensorRT or Apache TVM alone.

The Neo team generalized this approach into a mechanism we call Bring Your Own Codegen. It allows us to easily extend this work to new hardware partners, who can bring their own accelerator libraries to take advantage of the wide range of frameworks and models covered by Neo while improving performance to the full extent possible on their hardware.

Performance highlights

The following table lists the inference latency for several models, grouped by platform and framework.

Platform	Framework	Model	Latency
Jetson Xavier	TensorFlow	SSD MobilenetV2 COCO	49.71 ms
Jetson Xavier	MXNet	SSD MobileNet 1.0 VOC	19.7 ms
Jetson Xavier	PyTorch	YoloV4	68.76 ms
Jetson Xavier	DarkNet	YoloV3 Tiny	18.98 ms
Jetson Nano	TensorFlow	SSD MobilenetV2 COCO	223.69 ms
Jetson Nano	MXNet	SSD MobileNet 1.0 VOC	131.58 ms
Jetson Nano	DarkNet	YoloV3 Tiny	41.73 ms

Conclusion

We’re very excited to offer this new integration with TensorRT, which allows you to speed up inference for your ML models. To get started with Amazon SageMaker Neo for NVIDIA Jetson or AWS g4dn, p3, and p2 instances, see Amazon SageMaker Neo.
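As a starting point, the following sketch shows one way to prepare a Neo compilation job with the AWS SDK for Python (boto3) and its `create_compilation_job` call. The job name, role ARN, S3 locations, input name, and input shape are placeholders, and the helper functions `build_compilation_request` and `submit` are hypothetical names used only for this example. The defaults assume an MXNet model compiled for a g4dn instance; Jetson targets may need additional compiler options, so check the SageMaker Neo documentation for your device.

```python
import json


def build_compilation_request(job_name, role_arn, model_s3_uri, output_s3_uri,
                              framework="MXNET", target_device="ml_g4dn",
                              input_shape=None):
    """Build the keyword arguments for a SageMaker Neo compilation job."""
    # Placeholder input name and shape; replace with your model's real input.
    input_shape = input_shape or {"data": [1, 3, 224, 224]}
    return {
        "CompilationJobName": job_name,
        "RoleArn": role_arn,
        "InputConfig": {
            "S3Uri": model_s3_uri,
            "DataInputConfig": json.dumps(input_shape),
            "Framework": framework,
        },
        "OutputConfig": {
            "S3OutputLocation": output_s3_uri,
            "TargetDevice": target_device,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 900},
    }


def submit(request):
    """Send the request to SageMaker (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so building the request needs no AWS SDK
    client = boto3.client("sagemaker")
    return client.create_compilation_job(**request)


# All identifiers below are placeholders, not real resources.
request = build_compilation_request(
    job_name="neo-tensorrt-example",
    role_arn="arn:aws:iam::123456789012:role/ExampleNeoRole",
    model_s3_uri="s3://example-bucket/model/model.tar.gz",
    output_s3_uri="s3://example-bucket/compiled/",
)
# submit(request)  # uncomment to start the job in your own account
```

Keeping the request construction separate from the API call makes it easy to inspect or log the job configuration before anything is submitted.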

About the Author

Trevor Morris is a Software Engineer at AWS AI working on compiler technology and optimization for machine learning inference. He focuses on improving performance for GPUs, with previous experience at NVIDIA.

AWS Machine Learning Blog
