
Triton Inference Server

By: NVIDIA Latest Version: 21.06

Product Overview

Triton Inference Server is an open source inference serving software that lets teams deploy trained AI models from any framework on GPU or CPU infrastructure. It is designed to simplify and scale inference serving.
Triton Inference Server supports all major frameworks, including TensorFlow, TensorRT, PyTorch, and ONNX Runtime, as well as custom framework backends. It gives AI researchers and data scientists the freedom to choose the right framework for each model.
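Each model is served from a model repository directory, with a `config.pbtxt` that names the backend via its `platform` field. A minimal sketch of the layout (the model and tensor names below are illustrative, not taken from this listing):

```
model_repository/
├── densenet_onnx/
│   ├── 1/
│   │   └── model.onnx
│   └── config.pbtxt
└── resnet50_torch/
    ├── 1/
    │   └── model.pt
    └── config.pbtxt
```

and a corresponding `config.pbtxt` for the ONNX Runtime model:

```
name: "densenet_onnx"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  { name: "data_0", data_type: TYPE_FP32, dims: [ 3, 224, 224 ] }
]
output [
  { name: "fc6_1", data_type: TYPE_FP32, dims: [ 1000 ] }
]
```

Swapping frameworks is largely a matter of changing `platform` (e.g. `tensorrt_plan`, `pytorch_libtorch`, `tensorflow_savedmodel`) and the model file.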

Also available as a Docker container, it integrates with Kubernetes for orchestration and scaling and exports Prometheus metrics for monitoring. It helps IT/DevOps streamline model deployment in production.
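The metrics endpoint (port 8002 by default) serves standard Prometheus text format, so any Prometheus-compatible scraper can consume it. As a sketch of what that format looks like, here is a small hand-rolled parser; the `nv_*` metric names follow Triton's naming convention, but the sample values are made up, not output from a live server:

```python
# Illustrative Prometheus text-format payload, as Triton exposes on its
# metrics port. Values here are invented for the example.
SAMPLE = """\
# HELP nv_inference_request_success Number of successful inference requests
# TYPE nv_inference_request_success counter
nv_inference_request_success{model="densenet_onnx",version="1"} 42
nv_gpu_utilization{gpu_uuid="GPU-0"} 0.35
"""

def parse_metrics(text):
    """Parse Prometheus text format into {name: [(labels_dict, value), ...]}.

    A simplified sketch: ignores HELP/TYPE comments and assumes label
    values contain no embedded commas.
    """
    metrics = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        name_labels, raw_value = line.rsplit(" ", 1)
        if "{" in name_labels:
            name, raw_labels = name_labels.split("{", 1)
            labels = dict(pair.split("=", 1)
                          for pair in raw_labels.rstrip("}").split(","))
            labels = {k: v.strip('"') for k, v in labels.items()}
        else:
            name, labels = name_labels, {}
        metrics.setdefault(name, []).append((labels, float(raw_value)))
    return metrics

parsed = parse_metrics(SAMPLE)
```

In production this parsing is handled by Prometheus itself; the sketch only shows the shape of the data Triton exports.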
NVIDIA Triton Inference Server can load models from local storage or Amazon S3. As models are continuously retrained with new data, developers can update them without restarting the inference server and without disrupting the application.
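A minimal launch sketch, assuming the NGC container tag matching this listing's version and a hypothetical S3 bucket; with `--model-control-mode=poll`, Triton periodically re-scans the repository and picks up new model versions without a restart:

```shell
docker run --gpus=all --rm \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  nvcr.io/nvidia/tritonserver:21.06-py3 \
  tritonserver \
    --model-repository=s3://my-bucket/model_repository \
    --model-control-mode=poll \
    --repository-poll-secs=60
```

Ports 8000, 8001, and 8002 are Triton's defaults for HTTP, gRPC, and metrics respectively.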

Triton Inference Server runs multiple models, from the same or different frameworks, concurrently on a single GPU using CUDA streams. In a multi-GPU server, it automatically creates an instance of each model on each GPU. Together, these features increase GPU utilization without extra coding from the user.
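Instance placement can also be controlled explicitly in a model's `config.pbtxt` via `instance_group`. A sketch (the counts and GPU indices are illustrative):

```
instance_group [
  {
    count: 2         # two execution instances of this model
    kind: KIND_GPU
    gpus: [ 0, 1 ]   # placed on each of GPUs 0 and 1
  }
]
```

Raising `count` lets several requests execute the same model concurrently on one GPU, which helps keep the GPU busy under bursty load.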

The inference server supports both low-latency real-time inference and batch inference to maximize GPU/CPU utilization. It also has built-in support for streaming audio input for streaming inference, and for model ensembles, i.e., pipelines of models.
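An ensemble is itself declared as a model with `platform: "ensemble"`, wiring the output of one step to the input of the next. A sketch with hypothetical `preprocess` and `classifier` models (tensor names are illustrative):

```
name: "preprocess_classify"
platform: "ensemble"
max_batch_size: 8
input [ { name: "RAW_IMAGE", data_type: TYPE_UINT8, dims: [ -1 ] } ]
output [ { name: "SCORES", data_type: TYPE_FP32, dims: [ 1000 ] } ]
ensemble_scheduling {
  step [
    {
      model_name: "preprocess"
      model_version: -1
      input_map  { key: "INPUT",  value: "RAW_IMAGE" }
      output_map { key: "OUTPUT", value: "image_tensor" }
    },
    {
      model_name: "classifier"
      model_version: -1
      input_map  { key: "INPUT",  value: "image_tensor" }
      output_map { key: "OUTPUT", value: "SCORES" }
    }
  ]
}
```

Clients call the ensemble like any single model; Triton passes intermediate tensors between steps internally, avoiding round-trips through the client.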





Delivery Methods

  • Container
