Posted On: Nov 27, 2023

Today, Amazon SageMaker launched a new version (0.25.0) of the Large Model Inference (LMI) Deep Learning Container (DLC) with support for NVIDIA’s TensorRT-LLM library. With these upgrades, customers can easily access state-of-the-art tooling to optimize Large Language Models (LLMs) on SageMaker. The Amazon SageMaker LMI TensorRT-LLM DLC reduces latency by 33% on average and improves throughput by 60% on average for the Llama2-70B, Falcon-40B, and CodeLlama-34B models, compared to the previous version.

LLMs have recently seen unprecedented growth in popularity across a broad spectrum of applications. However, these models are often too large to fit on a single accelerator or GPU device, making it difficult to achieve low-latency inference at scale. Amazon SageMaker offers LMI DLCs to help customers maximize the utilization of available resources and improve performance. The latest LMI DLCs offer continuous batching support for inference requests to improve throughput, efficient inference collective operations to improve latency, and the latest TensorRT-LLM library from NVIDIA to maximize performance on GPUs. The LMI TensorRT-LLM DLC offers a low-code interface that simplifies compilation with TensorRT-LLM, requiring only the model ID and optional model parameters; all of the heavy lifting required to build a TensorRT-LLM optimized model is managed by the LMI DLC. Customers can also leverage the latest quantization techniques (GPTQ, AWQ, SmoothQuant) with LMI DLCs.
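
As an illustration of this low-code workflow, the sketch below shows roughly how a model could be deployed with the LMI TensorRT-LLM DLC using the SageMaker Python SDK. The container image URI, environment-variable names, model ID, and instance type are placeholder assumptions rather than values from this announcement; consult the Large Model Inference DLC documentation for the exact configuration.

```python
# Illustrative sketch only: deploying an LLM with the SageMaker LMI TensorRT-LLM DLC.
# The image URI, environment-variable names, model ID, and instance type below are
# placeholders; see the LMI DLC documentation for the exact values for your Region.
import sagemaker
from sagemaker import Model

role = sagemaker.get_execution_role()   # IAM role with SageMaker permissions
session = sagemaker.Session()

# Hypothetical LMI TensorRT-LLM container image URI (Region and tag are placeholders).
image_uri = "763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.25.0-tensorrtllm"

model = Model(
    image_uri=image_uri,
    role=role,
    env={
        # Low-code configuration: a model ID plus optional tuning parameters.
        "OPTION_MODEL_ID": "TheBloke/Llama-2-70B-fp16",  # placeholder model ID
        "OPTION_TENSOR_PARALLEL_DEGREE": "8",            # shard the model across 8 GPUs
        "OPTION_MAX_ROLLING_BATCH_SIZE": "64",           # continuous batching size
    },
    sagemaker_session=session,
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.p4d.24xlarge",                 # placeholder multi-GPU instance
    container_startup_health_check_timeout=1800,     # allow time for TensorRT-LLM compilation
)
```

With a configuration like this, the container handles the TensorRT-LLM engine build at startup, and the resulting endpoint can be invoked through the returned predictor or the SageMaker runtime API.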

These new LMI DLCs are supported in all AWS Regions where SageMaker is available. For detailed steps on how to get started, please see the AWS ML blog, the Large Model Inference DLC documentation, and the sample notebook.