Posted On: Nov 29, 2023

We are excited to announce new capabilities in Amazon SageMaker that help customers reduce model deployment costs by 50% on average and achieve 20% lower inference latency on average. Customers can deploy multiple models to the same instance to better utilize the underlying accelerators. SageMaker actively monitors the instances that are processing inference requests and intelligently routes requests based on which instances are available.

These features are available for SageMaker real-time inference, which makes it easy to deploy ML models. You can now create one or more InferenceComponents and deploy them to a SageMaker endpoint. An InferenceComponent abstracts your ML model and enables you to assign CPUs, GPUs, or Neuron accelerators, as well as scaling policies, per model. SageMaker intelligently places each model across the instances behind the endpoint to maximize utilization and reduce costs. Each model can be scaled independently, all the way down to zero copies, which frees up hardware resources on the instance for other models to use. Each model also emits its own metrics and logs to help you monitor and debug any issues. In addition, we added a new Least Outstanding Requests routing algorithm, which distributes requests more evenly across instances and reduces end-to-end latency.
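To illustrate how these pieces fit together, here is a minimal sketch using the boto3 SageMaker APIs: an endpoint config with Least Outstanding Requests routing, an inference component that reserves per-model compute on the shared instances, and an invocation that targets that component. The endpoint name, model name, IAM role, instance type, and resource figures below are placeholder assumptions for illustration only.

```python
import boto3

sagemaker = boto3.client("sagemaker")
runtime = boto3.client("sagemaker-runtime")

# Endpoint config defines only the instances; models are attached later as
# inference components. Routing uses the Least Outstanding Requests strategy.
sagemaker.create_endpoint_config(
    EndpointConfigName="my-endpoint-config",  # placeholder name
    ExecutionRoleArn="arn:aws:iam::123456789012:role/MySageMakerRole",  # placeholder role
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "InstanceType": "ml.g5.12xlarge",  # placeholder instance type
            "InitialInstanceCount": 1,
            "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
        }
    ],
)
sagemaker.create_endpoint(
    EndpointName="my-endpoint", EndpointConfigName="my-endpoint-config"
)

# An inference component wraps one model and reserves a slice of the
# instance's accelerators, CPU cores, and memory for it.
sagemaker.create_inference_component(
    InferenceComponentName="my-model-component",  # placeholder name
    EndpointName="my-endpoint",
    VariantName="AllTraffic",
    Specification={
        "ModelName": "my-sagemaker-model",  # an existing SageMaker Model (placeholder)
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 1,
            "NumberOfCpuCoresRequired": 2,
            "MinMemoryRequiredInMb": 8192,
        },
    },
    RuntimeConfig={"CopyCount": 1},  # number of copies of this model to run
)

# Invoke a specific model on the shared endpoint by naming its component.
response = runtime.invoke_endpoint(
    EndpointName="my-endpoint",
    InferenceComponentName="my-model-component",
    ContentType="application/json",
    Body=b'{"inputs": "example payload"}',
)
print(response["Body"].read())
```

Because each inference component declares its own compute requirements and copy count, additional models can be deployed to the same endpoint with further create_inference_component calls and scaled independently of one another.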

These new features are generally available in: Asia Pacific (Tokyo, Seoul, Mumbai, Singapore, Sydney, Jakarta), Canada (Central), Europe (Frankfurt, Stockholm, Ireland, London), Middle East (UAE), South America (Sao Paulo), US East (N. Virginia, Ohio), and US West (Oregon).

Learn more by visiting our documentation page and our product page.