Amazon SageMaker launches faster auto-scaling for Generative AI models

Posted on: Jul 25, 2024

We are excited to announce a new capability in Amazon SageMaker Inference that reduces the time it takes for Generative AI models to scale automatically. Customers can now use sub-minute metrics to significantly reduce overall scaling latency for their AI models, improving the responsiveness of their Generative AI applications as demand fluctuates.

With this capability, customers get two new high-resolution CloudWatch metrics - ConcurrentRequestsPerModel and ConcurrentRequestsPerModelCopy - that enable faster auto-scaling. These metrics are emitted at a 10-second interval and provide a more accurate representation of the load on an endpoint because they track the actual concurrency, i.e. the number of in-flight inference requests being processed by the model. Customers can create auto-scaling policies on these high-resolution metrics to scale models deployed on SageMaker endpoints; when the thresholds defined in those policies are reached, Amazon SageMaker begins adding new instances or model copies in under a minute. This lets customers optimize both performance and cost-efficiency for their inference workloads on SageMaker.
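As a rough sketch of how such a policy might be wired up, the snippet below uses the Application Auto Scaling API via boto3 to register an endpoint variant as a scalable target and attach a target-tracking policy on the new ConcurrentRequestsPerModel metric. The endpoint name, variant name, capacity bounds, target value, and cooldowns are illustrative assumptions, not values prescribed by this announcement.

```python
# Sketch: target-tracking auto-scaling on the new high-resolution metric.
# Endpoint name, variant name, capacities, target, and cooldowns below are
# illustrative assumptions for an existing SageMaker endpoint.
import boto3

ENDPOINT_NAME = "my-genai-endpoint"  # hypothetical endpoint
VARIANT_NAME = "AllTraffic"          # hypothetical production variant
resource_id = f"endpoint/{ENDPOINT_NAME}/variant/{VARIANT_NAME}"

aas = boto3.client("application-autoscaling")

# Register the endpoint variant as a scalable target (1 to 4 instances).
aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Track ConcurrentRequestsPerModel, one of the new 10-second metrics:
# scale out when average in-flight requests per model exceed the target.
aas.put_scaling_policy(
    PolicyName="concurrent-requests-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 5.0,  # assumed target concurrency per model
        "CustomizedMetricSpecification": {
            "MetricName": "ConcurrentRequestsPerModel",
            "Namespace": "AWS/SageMaker",
            "Dimensions": [
                {"Name": "EndpointName", "Value": ENDPOINT_NAME},
                {"Name": "VariantName", "Value": VARIANT_NAME},
            ],
            "Statistic": "Average",
        },
        "ScaleInCooldown": 300,  # assumed cooldowns, in seconds
        "ScaleOutCooldown": 60,
    },
)
```

A policy on ConcurrentRequestsPerModelCopy would follow the same pattern, targeting the model-copy scalable dimension of the corresponding resource instead of the variant's instance count.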

This new capability is available for accelerator instance families (g4dn, g5, g6, p2, p3, p4d, p4de, p5, inf1, inf2, trn1n, trn1) in all AWS Regions where Amazon SageMaker Inference is available, except the China and AWS GovCloud (US) Regions. To learn more, see the AWS ML blog and visit our documentation.