Posted On: Mar 16, 2021
Amazon SageMaker now supports deploying multiple containers on real-time endpoints for low-latency inference and invoking them independently for each request. This new capability lets you run up to five different machine learning (ML) models and frameworks on a single endpoint and save up to 80% in costs. It is ideal when you have multiple ML models with similar resource needs and individual models don't have enough traffic to utilize the full capacity of the endpoint instances, such as a set of ML models that are invoked infrequently or at different times, or dev/test endpoints.
To use this feature, specify the list of containers along with the trained models that should be deployed on an endpoint, and select the "Direct" inference execution mode, which tells SageMaker that the models will be accessed independently. To run inference against a specific model, invoke the endpoint and specify the name of the target container in the request header. In direct invocation mode, you can secure inference requests to each container by specifying IAM condition keys, and you can view per-container metrics in Amazon CloudWatch.
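A minimal sketch of that flow with boto3 (Python); the model name, endpoint name, image URIs, S3 paths, role ARN, and container hostnames below are placeholder assumptions, not values from this announcement:

```python
import json

# Up to five containers, each with its own framework image and model
# artifacts (image URIs, model data paths, and hostnames are placeholders).
containers = [
    {
        "ContainerHostname": "tensorflow-model",  # name used to target this container
        "Image": "<account>.dkr.ecr.<region>.amazonaws.com/tf-serving:latest",
        "ModelDataUrl": "s3://my-bucket/models/tf/model.tar.gz",
    },
    {
        "ContainerHostname": "pytorch-model",
        "Image": "<account>.dkr.ecr.<region>.amazonaws.com/torchserve:latest",
        "ModelDataUrl": "s3://my-bucket/models/pt/model.tar.gz",
    },
]

# "Direct" tells SageMaker that each container is invoked independently.
inference_execution_config = {"Mode": "Direct"}


def deploy_and_invoke():
    """Create the multi-container model and invoke one container directly.

    Requires AWS credentials and a deployed endpoint; shown here only to
    illustrate the call shapes.
    """
    import boto3

    sm = boto3.client("sagemaker")
    sm.create_model(
        ModelName="my-multi-container-model",  # placeholder name
        ExecutionRoleArn="arn:aws:iam::123456789012:role/SageMakerRole",
        Containers=containers,
        InferenceExecutionConfig=inference_execution_config,
    )
    # ...create_endpoint_config and create_endpoint as usual...

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName="my-multi-container-endpoint",
        TargetContainerHostname="pytorch-model",  # routes the request to one container
        ContentType="application/json",
        Body=json.dumps({"inputs": [1, 2, 3]}),
    )
    return response["Body"].read()
```

The `sagemaker:TargetContainerHostname` condition key can then restrict which principals may invoke which container, and CloudWatch reports invocation metrics per container hostname.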
You can also execute the containers on a multi-container endpoint sequentially for each inference (i.e., as an Inference Pipeline) if you want to pre- or post-process requests, or if you want to run a set of ML models in sequence. This is already supported as the default behavior of multi-container endpoints, and can also be enabled explicitly by setting the inference execution mode to "Serial."
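Switching to pipeline-style execution only changes the execution config and makes container order meaningful; a sketch under the same placeholder assumptions as above:

```python
# "Serial" chains the containers: each container's output becomes the next
# container's input, e.g. preprocess -> model -> postprocess.
inference_execution_config = {"Mode": "Serial"}

# The containers list is ordered; requests flow through it front to back
# (hostnames, images, and S3 paths are placeholders).
pipeline_containers = [
    {"ContainerHostname": "preprocess", "Image": "<ecr-uri>/preprocess:latest"},
    {
        "ContainerHostname": "model",
        "Image": "<ecr-uri>/xgboost:latest",
        "ModelDataUrl": "s3://my-bucket/models/xgb/model.tar.gz",
    },
    {"ContainerHostname": "postprocess", "Image": "<ecr-uri>/postprocess:latest"},
]
```

In this mode clients invoke the endpoint without naming a target container; SageMaker forwards each request through the chain and returns the final container's response.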