Introducing latency-optimized inference for foundation models in Amazon Bedrock

Posted on: Dec 2, 2024

Latency-optimized inference for foundation models in Amazon Bedrock is now available in public preview, delivering faster response times and improved responsiveness for AI applications. Currently, these new inference options support Anthropic's Claude 3.5 Haiku model and Meta's Llama 3.1 405B and 70B models, offering reduced latency compared to standard models without compromising accuracy. As verified by Anthropic, with latency-optimized inference in Amazon Bedrock, Claude 3.5 Haiku runs faster on AWS than anywhere else. Additionally, with latency-optimized inference in Bedrock, Llama 3.1 405B and 70B run faster on AWS than on any other major cloud provider.

As more customers move their generative AI applications to production, optimizing the end-user experience becomes crucial, particularly for latency-sensitive applications such as real-time customer service chatbots and interactive coding assistants. Through purpose-built AI chips like AWS Trainium2 and advanced software optimizations, Amazon Bedrock gives customers more options to optimize inference for a particular use case. Accessing these capabilities requires no additional setup or model fine-tuning, so existing applications can be enhanced with faster response times immediately.
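For example, latency-optimized inference can be requested per call. The sketch below is a minimal illustration using the Bedrock Runtime Converse API via boto3; the `performanceConfig` request field and the cross-region inference profile ID shown are assumptions to verify against the Amazon Bedrock documentation for your account and Region.

```python
# Minimal sketch: requesting latency-optimized inference through the
# Bedrock Runtime Converse API (boto3). The performanceConfig field and
# the inference profile ID below are assumptions -- confirm both in the
# Amazon Bedrock documentation before use.
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-2")  # US East (Ohio)

response = client.converse(
    # Assumed cross-region inference profile ID for Claude 3.5 Haiku.
    modelId="us.anthropic.claude-3-5-haiku-20241022-v1:0",
    messages=[
        {"role": "user", "content": [{"text": "Summarize this support ticket in one sentence."}]}
    ],
    # Request the latency-optimized variant instead of standard inference.
    performanceConfig={"latency": "optimized"},
)

print(response["output"]["message"]["content"][0]["text"])
```

Because the optimization is requested with a single per-call field rather than a different endpoint or model artifact, an existing application would only need to add that one setting to its request.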

Latency-optimized inference is available for Anthropic's Claude 3.5 Haiku and Meta's Llama 3.1 405B and 70B in the US East (Ohio) Region via cross-region inference. To get started, visit the Amazon Bedrock console. For more information about Amazon Bedrock and its capabilities, visit the Amazon Bedrock product page, pricing page, and documentation.
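As a rough way to observe the effect on a given workload, one might time the same request with standard and optimized settings. The snippet below is a hedged sketch reusing the assumed `performanceConfig` field above, with an assumed cross-region inference profile ID for Llama 3.1 70B; a single trial is not a benchmark, and meaningful comparisons would require many repeated requests.

```python
# Rough timing comparison between standard and latency-optimized inference.
# The profile ID and performanceConfig values are assumptions -- confirm
# them against the Amazon Bedrock documentation before relying on this.
import time
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-2")
MODEL_ID = "us.meta.llama3-1-70b-instruct-v1:0"  # assumed cross-region profile ID
MESSAGES = [{"role": "user", "content": [{"text": "List three uses for a paperclip."}]}]

for latency_mode in ("standard", "optimized"):
    start = time.perf_counter()
    client.converse(
        modelId=MODEL_ID,
        messages=MESSAGES,
        performanceConfig={"latency": latency_mode},
    )
    elapsed = time.perf_counter() - start
    print(f"{latency_mode}: {elapsed:.2f}s")
```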