Amazon SageMaker introduces Scale Down to Zero for AI inference to help customers save costs

Posted on: Nov 25, 2024

We are excited to announce Scale Down to Zero, a new capability in Amazon SageMaker Inference that allows endpoints to scale to zero instances during periods of inactivity. This feature can significantly reduce costs for running inference using AI models, making it particularly beneficial for applications with variable traffic patterns such as chatbots, content moderation systems, and other generative AI use cases.

With Scale Down to Zero, customers can configure their SageMaker inference endpoints to automatically scale to zero instances when not in use, then quickly scale back up when traffic resumes. This capability is effective for scenarios with predictable traffic patterns, intermittent inference traffic, and development/testing environments. Implementing Scale Down to Zero is simple with SageMaker Inference Components. Customers can configure auto-scaling policies through the AWS SDK for Python (Boto3), SageMaker Python SDK, or the AWS Command Line Interface (AWS CLI). The process involves setting up an endpoint with managed instance scaling enabled, configuring scaling policies, and creating CloudWatch alarms to trigger scaling actions.
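The three steps above can be sketched with Boto3 roughly as follows. This is a minimal illustration under stated assumptions, not the documented procedure: the helper names, policy names, target value, cooldown, and instance type are hypothetical, and the `NoCapacityInvocationFailures` alarm metric and `SageMakerInferenceComponentInvocationsPerCopy` predefined metric should be checked against the current SageMaker and Application Auto Scaling documentation before use.

```python
def build_scale_to_zero_requests(component_name, max_copies=2, max_instances=2,
                                 instance_type="ml.g5.xlarge"):
    """Build request payloads for scaling one inference component to zero.

    All names and numeric values here are illustrative assumptions.
    """
    target = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"inference-component/{component_name}",
        "ScalableDimension": "sagemaker:inference-component:DesiredCopyCount",
    }
    return {
        # Step 1: an endpoint variant with managed instance scaling enabled and
        # a minimum of zero instances (goes in ProductionVariants of
        # sagemaker.create_endpoint_config).
        "production_variant": {
            "VariantName": "AllTraffic",
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            "ManagedInstanceScaling": {
                "Status": "ENABLED",
                "MinInstanceCount": 0,
                "MaxInstanceCount": max_instances,
            },
        },
        # Step 2a: allow the component's copy count to drop to zero.
        "scalable_target": {**target, "MinCapacity": 0, "MaxCapacity": max_copies},
        # Step 2b: target tracking scales copies with traffic, down to zero.
        "target_tracking_policy": {
            **target,
            "PolicyName": f"{component_name}-target-tracking",
            "PolicyType": "TargetTrackingScaling",
            "TargetTrackingScalingPolicyConfiguration": {
                "PredefinedMetricSpecification": {
                    "PredefinedMetricType":
                        "SageMakerInferenceComponentInvocationsPerCopy",
                },
                "TargetValue": 5.0,  # assumed invocations per copy
            },
        },
        # Step 2c: step scaling adds one copy when the alarm below fires.
        "step_scaling_policy": {
            **target,
            "PolicyName": f"{component_name}-scale-out-from-zero",
            "PolicyType": "StepScaling",
            "StepScalingPolicyConfiguration": {
                "AdjustmentType": "ChangeInCapacity",
                "MetricAggregationType": "Maximum",
                "Cooldown": 60,
                "StepAdjustments": [
                    {"MetricIntervalLowerBound": 0, "ScalingAdjustment": 1},
                ],
            },
        },
        # Step 3: CloudWatch alarm on requests that arrive while no copy runs.
        "alarm": {
            "AlarmName": f"{component_name}-no-capacity",
            "Namespace": "AWS/SageMaker",
            "MetricName": "NoCapacityInvocationFailures",  # assumed metric name
            "Dimensions": [
                {"Name": "InferenceComponentName", "Value": component_name},
            ],
            "Statistic": "Maximum",
            "Period": 60,
            "EvaluationPeriods": 1,
            "Threshold": 1,
            "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        },
    }


def apply_scaling_requests(requests):
    """Send the scaling payloads to AWS (needs boto3 and valid credentials)."""
    import boto3  # imported here so the builder above runs without boto3

    autoscaling = boto3.client("application-autoscaling")
    cloudwatch = boto3.client("cloudwatch")
    autoscaling.register_scalable_target(**requests["scalable_target"])
    autoscaling.put_scaling_policy(**requests["target_tracking_policy"])
    step = autoscaling.put_scaling_policy(**requests["step_scaling_policy"])
    # Wire the alarm to the step policy so it triggers the scale-out action.
    cloudwatch.put_metric_alarm(AlarmActions=[step["PolicyARN"]],
                                **requests["alarm"])
```

Keeping the payloads separate from the API calls lets the configuration be reviewed or unit tested without AWS credentials. In this sketch the target tracking policy handles scaling in, including to zero, while the alarm-driven step policy handles the first scale-out, since a component with zero copies has no per-copy traffic metric to react to.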

Scale Down to Zero is now generally available in all AWS Regions where Amazon SageMaker is supported. To learn more about implementing Scale Down to Zero and optimizing costs for generative AI deployments, please visit our documentation page.