Posted On: Mar 18, 2024

You can now achieve even better price-performance for large language models (LLMs) running on NVIDIA accelerated computing infrastructure by using Amazon SageMaker with the newly integrated NVIDIA NIM inference microservices. SageMaker is a fully managed service that makes it easy to build, train, and deploy machine learning (ML) models and LLMs, and NIM, part of the NVIDIA AI Enterprise software platform, provides high-performance AI containers for LLM inference.

When deploying LLMs for generative AI use cases at scale, customers often use NVIDIA GPU-accelerated instances and advanced frameworks like NVIDIA Triton Inference Server and NVIDIA TensorRT-LLM to accelerate and optimize LLM performance. Now, customers using Amazon SageMaker with NVIDIA NIM can deploy optimized LLMs quickly, reducing deployment time from days to minutes.

NIM offers inference-optimized containers for a variety of popular LLMs. Models supported out of the box include Llama 2 (7B, 13B, and 70B), Mistral-7B-Instruct, Mixtral-8x7B, NVIDIA Nemotron-3 8B and 43B, StarCoder, and StarCoderPlus, all of which use pre-built NVIDIA TensorRT™ engines. These models are curated with optimal hyperparameters to ensure performant deployment on NVIDIA GPUs. For other models, NIM also gives you tools to create GPU-optimized versions. To get started, use the NIM container available through the NVIDIA API catalog and deploy it on Amazon SageMaker by creating an inference endpoint, as sketched below.
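For illustration, here is a minimal sketch of deploying a NIM container to a SageMaker real-time endpoint with the SageMaker Python SDK. The container image URI, the NGC_API_KEY environment variable, the instance type, the endpoint name, and the request payload schema are all placeholders and assumptions, not values confirmed in this announcement; refer to the launch blog for the exact image and configuration for your model.

```python
import sagemaker
from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # works inside SageMaker; otherwise pass an IAM role ARN

# Wrap the NIM container as a SageMaker Model.
# The image URI and NGC_API_KEY below are illustrative placeholders.
nim_model = Model(
    image_uri="<account-id>.dkr.ecr.<region>.amazonaws.com/nim-llama2-7b:latest",
    role=role,
    env={"NGC_API_KEY": "<your-ngc-api-key>"},  # assumed credential for the NVIDIA API catalog
    sagemaker_session=session,
)

# Create a real-time inference endpoint on an NVIDIA GPU-accelerated instance.
nim_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",  # example GPU instance type
    endpoint_name="nim-llama2-7b",
)

# Invoke the endpoint with a JSON payload; the schema depends on the NIM container.
predictor = Predictor(
    endpoint_name="nim-llama2-7b",
    sagemaker_session=session,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)
response = predictor.predict({"prompt": "Write a haiku about GPUs.", "max_tokens": 128})
print(response)
```

Once the endpoint is no longer needed, it can be removed with predictor.delete_endpoint() to stop incurring instance charges.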

NIM containers are accessible in all AWS regions where Amazon SageMaker is available. To learn more, see our launch blog.