Posted On: May 10, 2023

Today, we are excited to announce general availability of Provisioned Concurrency support for Amazon SageMaker Serverless Inference. Provisioned Concurrency allows you to deploy models on serverless endpoints with predictable performance and high scalability. You can add Provisioned Concurrency to your serverless endpoints, and SageMaker will keep the endpoints warm for the configured amount of Provisioned Concurrency so that they are ready to respond to requests instantaneously. Provisioned Concurrency is ideal for customers who have predictable, low-throughput traffic.

With on-demand serverless endpoints, if your endpoint receives no traffic for a while and then suddenly receives new requests, it can take some time for the endpoint to spin up the compute resources needed to process them. This is called a cold start. A cold start can also occur if the number of concurrent requests exceeds the endpoint's current concurrent request usage. To reduce variability in your latency profile, you can optionally enable Provisioned Concurrency for your serverless endpoints. With Provisioned Concurrency, your serverless endpoints are always ready and can instantaneously serve bursts in traffic up to the configured amount of Provisioned Concurrency, without any cold starts.

You can enable Provisioned Concurrency for serverless endpoints from the AWS console, AWS SDKs, or the AWS Command Line Interface (AWS CLI). Provisioned Concurrency for SageMaker Serverless Inference is generally available in all AWS Regions where SageMaker Serverless Inference is generally available.
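
As an illustration of the SDK route, the following is a minimal sketch using the AWS SDK for Python (boto3), assuming a SageMaker model has already been created. The model name, endpoint configuration name, memory size, and concurrency values are hypothetical placeholders, and the helper functions are illustrative rather than part of any AWS library. The sketch assumes Provisioned Concurrency is set through the `ProvisionedConcurrency` field of `ServerlessConfig` in `create_endpoint_config`, and that it should not exceed `MaxConcurrency`.

```python
def build_serverless_variant(model_name, memory_mb=2048,
                             max_concurrency=10, provisioned_concurrency=2):
    """Build a production variant dict with Provisioned Concurrency.

    All default values here are illustrative assumptions, not recommendations.
    """
    # Assumption: provisioned concurrency must not exceed max concurrency.
    if provisioned_concurrency > max_concurrency:
        raise ValueError("provisioned_concurrency must be <= max_concurrency")
    return {
        "VariantName": "AllTraffic",
        "ModelName": model_name,
        "ServerlessConfig": {
            "MemorySizeInMB": memory_mb,
            "MaxConcurrency": max_concurrency,
            "ProvisionedConcurrency": provisioned_concurrency,
        },
    }


def create_serverless_endpoint_config(config_name, variant):
    """Create the endpoint configuration (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    client = boto3.client("sagemaker")
    return client.create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=[variant],
    )


# Hypothetical model name; replace with an existing SageMaker model.
variant = build_serverless_variant("my-model")
print(variant["ServerlessConfig"])
```

Calling `create_serverless_endpoint_config("my-serverless-config", variant)` and then creating an endpoint from that configuration would apply the setting. Keeping the request construction separate from the API call makes it possible to check the configuration locally before sending anything to AWS.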