Posted On: Aug 20, 2021

We are introducing Amazon SageMaker Asynchronous Inference, a new inference option in Amazon SageMaker that queues incoming requests and processes them asynchronously. This option is ideal for requests with large payloads (up to 1GB), long processing times (up to 15 minutes), or both, that need to be processed as they arrive. Asynchronous inference lets you save on costs by autoscaling the instance count to zero when there are no requests to process, so you pay only when your endpoint is processing requests.

With the introduction of asynchronous inference, Amazon SageMaker provides three options to deploy trained machine learning models for generating inferences on new data. Real-time inference is suitable for workloads where payload sizes are up to 6MB and low latency, on the order of milliseconds or seconds, is required. Batch transform is ideal for offline predictions on large batches of data that are available upfront. The new asynchronous inference option is ideal for workloads where request sizes are large (up to 1GB) and inference processing times are on the order of minutes (up to 15 minutes). An example asynchronous inference workload is running predictions on high-resolution images generated from a mobile device at different intervals during the day, with responses provided within minutes of receiving each request. For use cases that can tolerate a cold start penalty of a few minutes, you can optionally scale the endpoint instance count down to zero when there are no outstanding requests and scale back up as new requests arrive, so that you pay only for the time the endpoint is actively processing requests.
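The scale-to-zero behavior described above is configured through Application Auto Scaling by registering the endpoint variant with a minimum capacity of zero and scaling on the request backlog. A minimal sketch follows; it builds the request payloads as plain dictionaries (the endpoint name, variant name, and policy name are placeholders), with the actual boto3 calls shown in comments:

```python
# Hypothetical sketch: allow a SageMaker endpoint variant to scale down to
# zero instances via Application Auto Scaling. All names are placeholders.

def scalable_target_request(endpoint_name, variant_name, max_instances):
    """Build a RegisterScalableTarget request with MinCapacity=0."""
    return {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": 0,  # allow scale-to-zero when no requests are queued
        "MaxCapacity": max_instances,
    }

def backlog_scaling_policy(endpoint_name, variant_name, target_backlog):
    """Target-tracking policy on the per-instance request backlog."""
    return {
        "PolicyName": "async-backlog-scaling",  # placeholder name
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_backlog,
            "CustomizedMetricSpecification": {
                "MetricName": "ApproximateBacklogSizePerInstance",
                "Namespace": "AWS/SageMaker",
                "Dimensions": [
                    {"Name": "EndpointName", "Value": endpoint_name}
                ],
                "Statistic": "Average",
            },
        },
    }

# With boto3, these dicts would be passed through, roughly:
#   client = boto3.client("application-autoscaling")
#   client.register_scalable_target(**scalable_target_request(...))
#   client.put_scaling_policy(**backlog_scaling_policy(...))
target = scalable_target_request("my-async-endpoint", "AllTraffic", 4)
print(target["MinCapacity"])  # → 0
```

With `MinCapacity` set to 0, Application Auto Scaling can remove all instances when the backlog is empty, at the cost of the cold start penalty mentioned above when traffic resumes.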

Creating an asynchronous inference endpoint is similar to creating a real-time endpoint. You can use your existing Amazon SageMaker models and only need to specify additional asynchronous-inference-specific configuration parameters when creating your endpoint configuration. To invoke the endpoint, place the request payload in Amazon S3 and provide a pointer to the payload as part of the invocation request. Upon invocation, Amazon SageMaker enqueues the request for processing and returns an output location in the response. Once processing completes, Amazon SageMaker places the inference result in the previously returned Amazon S3 location. You can optionally choose to receive success or error notifications via Amazon Simple Notification Service (Amazon SNS).
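The two asynchronous-specific pieces, the endpoint configuration block and the invocation request, can be sketched as follows. This is a minimal illustration, not a definitive implementation: the bucket, topic ARN, and endpoint names are placeholders, and the request payloads are built as plain dictionaries with the corresponding boto3 calls shown in comments:

```python
# Hypothetical sketch of the asynchronous-inference-specific parts of an
# endpoint configuration and an invocation request. Names are placeholders.

def async_inference_config(output_s3_uri,
                           success_topic_arn=None,
                           error_topic_arn=None):
    """Build the AsyncInferenceConfig block for CreateEndpointConfig."""
    config = {
        # Where SageMaker writes inference results.
        "OutputConfig": {"S3OutputPath": output_s3_uri},
        # How many requests each instance processes concurrently.
        "ClientConfig": {"MaxConcurrentInvocationsPerInstance": 4},
    }
    # Optional SNS notifications on success or error.
    notification = {}
    if success_topic_arn:
        notification["SuccessTopic"] = success_topic_arn
    if error_topic_arn:
        notification["ErrorTopic"] = error_topic_arn
    if notification:
        config["OutputConfig"]["NotificationConfig"] = notification
    return config

def invoke_request(endpoint_name, input_s3_uri):
    """Build an InvokeEndpointAsync request: only an S3 pointer is sent."""
    return {
        "EndpointName": endpoint_name,
        "InputLocation": input_s3_uri,  # payload stays in S3
        "ContentType": "application/json",
    }

# With boto3, roughly:
#   sm = boto3.client("sagemaker")
#   sm.create_endpoint_config(
#       EndpointConfigName=..., ProductionVariants=[...],
#       AsyncInferenceConfig=async_inference_config(...))
#   runtime = boto3.client("sagemaker-runtime")
#   resp = runtime.invoke_endpoint_async(**invoke_request(...))
#   resp["OutputLocation"] is where the inference result will appear.
cfg = async_inference_config("s3://my-bucket/async-output/")
print(cfg["OutputConfig"]["S3OutputPath"])  # → s3://my-bucket/async-output/
```

Because the invocation returns immediately with only the output location, the caller polls that S3 location (or subscribes to the SNS topics) rather than holding a connection open for the duration of processing.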

For a detailed description of how to create, invoke, and monitor asynchronous inference endpoints, please read our documentation, which also contains a sample notebook to help you get started. For pricing information, please visit the Amazon SageMaker pricing page. Amazon SageMaker Asynchronous Inference is generally available in all commercial AWS Regions where Amazon SageMaker is available except Asia Pacific (Osaka), Europe (Milan), and Africa (Cape Town).