Artificial Intelligence
How Rufus scales conversational shopping experiences to millions of Amazon customers with Amazon Bedrock
Our team at Amazon builds Rufus, an AI-powered shopping assistant that delivers intelligent, conversational experiences to delight our customers. More than 250 million customers have used Rufus this year. Monthly users are up 140% year over year (YoY) and interactions are up 210% YoY. Additionally, customers who use Rufus during a shopping journey are 60% more likely to […]
How Amazon scaled Rufus by building multi-node inference using AWS Trainium chips and vLLM
In this post, Amazon shares how it developed a multi-node inference solution for Rufus, its generative AI shopping assistant, using AWS Trainium chips and vLLM to serve large language models at scale. The solution combines a leader/follower orchestration model, hybrid parallelism strategies, and a multi-node inference unit abstraction layer built on Amazon ECS to deploy models across multiple nodes while maintaining high performance and reliability.
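As a rough illustration of the hybrid parallelism the post describes, here is a minimal vLLM sketch. This is not the Rufus production stack (which adds the leader/follower orchestration and the ECS-based multi-node abstraction); the model name and parallel degrees are placeholders, and pipeline-parallel support depends on your vLLM version and backend.

```python
# Minimal sketch of hybrid parallelism with vLLM -- illustrative only, not the
# Rufus production stack. Model name and parallel degrees are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",  # placeholder model
    tensor_parallel_size=8,             # shard each layer's weights across 8 devices
    pipeline_parallel_size=2,           # split layers across 2 nodes
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Which espresso machine fits a small kitchen?"], params)
print(outputs[0].outputs[0].text)
```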
Introducing AWS Batch Support for Amazon SageMaker Training jobs
AWS Batch now seamlessly integrates with Amazon SageMaker Training jobs. In this post, we discuss the benefits of managing and prioritizing ML training jobs to use hardware efficiently for your business. We also walk you through how to get started using this new capability and share suggested best practices, including the use of SageMaker training plans.
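For context, below is a minimal sketch of the kind of SageMaker Training job the new integration can queue and prioritize, written with the SageMaker Python SDK. The training script, IAM role, instance type, and S3 input are placeholders; the AWS Batch side (service environments and job queues) is configured separately and is not shown here.

```python
# Hedged sketch of a SageMaker Training job; script, role, instance type, and
# S3 input are placeholders. AWS Batch can queue and prioritize jobs like this.
import sagemaker
from sagemaker.pytorch import PyTorch

session = sagemaker.Session()
estimator = PyTorch(
    entry_point="train.py",  # placeholder training script
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    instance_count=1,
    instance_type="ml.p4d.24xlarge",
    framework_version="2.1",
    py_version="py310",
    sagemaker_session=session,
)
estimator.fit({"training": "s3://my-bucket/train/"})  # placeholder S3 input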
Amazon Bedrock Marketplace now includes NVIDIA models: Introducing NVIDIA Nemotron-4 NIM microservices
At AWS re:Invent 2024, we are excited to introduce Amazon Bedrock Marketplace, a new capability within Amazon Bedrock that serves as a centralized hub for discovering, testing, and implementing foundation models (FMs). In this post, we discuss the advantages and capabilities of Amazon Bedrock Marketplace and Nemotron models, and how to get started.
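As a hedged sketch of what invocation can look like after you deploy a Marketplace model: Converse-compatible models are called through the Bedrock Runtime, passing the deployed endpoint ARN as the model ID. The ARN, Region, and prompt below are placeholders.

```python
# Hedged sketch: invoking a deployed Bedrock Marketplace endpoint with the
# Converse API. The endpoint ARN and Region are placeholders.
import boto3

runtime = boto3.client("bedrock-runtime", region_name="us-east-1")
response = runtime.converse(
    modelId="arn:aws:sagemaker:us-east-1:123456789012:endpoint/nemotron-demo",  # placeholder
    messages=[{"role": "user", "content": [{"text": "Summarize NIM microservices."}]}],
    inferenceConfig={"maxTokens": 256, "temperature": 0.5},
)
print(response["output"]["message"]["content"][0]["text"])
```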
Scaling Rufus, the Amazon generative AI-powered conversational shopping assistant with over 80,000 AWS Inferentia and AWS Trainium chips, for Prime Day
In this post, we dive into the Rufus inference deployment using AWS chips and how this enabled one of the most demanding events of the year—Amazon Prime Day.
Amazon SageMaker inference launches faster auto scaling for generative AI models: up to 6x faster scale-up detection
Today, we are excited to announce a new capability in Amazon SageMaker inference that can help you reduce the time it takes for your generative artificial intelligence (AI) models to scale automatically. This feature can detect the need to scale model copies up to 6x faster than traditional mechanisms. You can […]
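For reference, the faster scale-up detection applies on top of ordinary auto scaling configuration. Below is a minimal sketch of attaching a target-tracking policy to an endpoint variant with Application Auto Scaling; the endpoint name, capacities, and target value are placeholders.

```python
# Hedged sketch: target-tracking auto scaling for a SageMaker endpoint variant.
# Endpoint name, capacity bounds, and target value are placeholders.
import boto3

aas = boto3.client("application-autoscaling")
resource_id = "endpoint/my-genai-endpoint/variant/AllTraffic"  # placeholder

aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=8,
)
aas.put_scaling_policy(
    PolicyName="genai-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 10.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
    },
)
```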
Optimize price-performance of LLM inference on NVIDIA GPUs using the Amazon SageMaker integration with NVIDIA NIM microservices
NVIDIA NIM microservices now integrate with Amazon SageMaker, allowing you to deploy industry-leading large language models (LLMs) and optimize model performance and cost. You can deploy state-of-the-art LLMs in minutes instead of days using technologies such as NVIDIA TensorRT, NVIDIA TensorRT-LLM, and NVIDIA Triton Inference Server on NVIDIA accelerated instances hosted by SageMaker. NIM, part […]
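A minimal sketch of what a NIM deployment on SageMaker can look like, assuming you have a NIM container image from NVIDIA NGC and a SageMaker execution role (both placeholders below); the instance type and endpoint name are illustrative.

```python
# Hedged sketch: deploying a NIM container image to a SageMaker real-time
# endpoint. Image URI, role, and NGC credential are placeholders.
import sagemaker
from sagemaker.model import Model

model = Model(
    image_uri="<nim-container-image-uri-from-ngc>",  # placeholder NIM image
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    env={"NGC_API_KEY": "<your-ngc-api-key>"},  # placeholder credential
    sagemaker_session=sagemaker.Session(),
)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",  # NVIDIA accelerated instance
    endpoint_name="nim-llm-endpoint",
)
```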
Reduce model deployment costs by 50% on average using the latest features of Amazon SageMaker
As organizations deploy models to production, they are constantly looking for ways to optimize the performance of their foundation models (FMs) running on the latest accelerators, such as AWS Inferentia and GPUs, so they can reduce costs and decrease response latency to provide the best experience to end users. However, some FMs don’t fully utilize […]
Minimize real-time inference latency by using Amazon SageMaker routing strategies
Amazon SageMaker makes it straightforward to deploy machine learning (ML) models for real-time inference and offers a broad selection of ML instances spanning CPUs and accelerators such as AWS Inferentia. Because SageMaker is a fully managed service, you can scale your model deployments, minimize inference costs, and manage your models more effectively in production with reduced operational […]
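One routing strategy the post covers, least outstanding requests, is selected on the endpoint configuration's production variant. Below is a minimal boto3 sketch; the config, model, and variant names are placeholders for resources created beforehand.

```python
# Hedged sketch: choosing the least-outstanding-requests routing strategy on a
# SageMaker production variant. Names are placeholders.
import boto3

sm = boto3.client("sagemaker")
sm.create_endpoint_config(
    EndpointConfigName="llm-config-lor",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "my-llm-model",  # placeholder, created beforehand
        "InstanceType": "ml.g5.2xlarge",
        "InitialInstanceCount": 2,
        "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
    }],
)
```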
Deploy Falcon-40B with large model inference DLCs on Amazon SageMaker
Last week, Technology Innovation Institute (TII) launched TII Falcon LLM, an open-source foundational large language model (LLM). Trained on 1 trillion tokens with Amazon SageMaker, Falcon boasts top-notch performance (#1 on the Hugging Face leaderboard at the time of writing) while being comparatively lightweight and less expensive to host than other LLMs such as LLaMA-65B. In […]
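A hedged sketch of an LMI-style deployment for Falcon-40B follows; the container version, environment-based configuration, and instance type are illustrative and vary by LMI release, so treat this as the shape of the approach rather than the post's exact steps.

```python
# Hedged sketch: deploying Falcon-40B with a large model inference (LMI)
# container on SageMaker. Container version, env config, role, and instance
# type are illustrative placeholders.
import sagemaker
from sagemaker import image_uris
from sagemaker.model import Model

session = sagemaker.Session()
image_uri = image_uris.retrieve(
    framework="djl-deepspeed", region=session.boto_region_name, version="0.22.1"
)
model = Model(
    image_uri=image_uri,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    env={
        "HF_MODEL_ID": "tiiuae/falcon-40b",    # pull weights from Hugging Face
        "OPTION_TENSOR_PARALLEL_DEGREE": "4",  # shard across 4 GPUs
    },
    sagemaker_session=session,
)
model.deploy(initial_instance_count=1, instance_type="ml.g5.24xlarge")
```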