Posted On: Sep 6, 2023
Amazon SageMaker Multi-Model Endpoints (MME) is a fully managed capability that lets customers deploy thousands of models on a single SageMaker endpoint and reduce costs. Until today, MME did not support PyTorch models deployed using TorchServe. Now, customers can use MME to deploy thousands of PyTorch models with TorchServe and reduce inference costs.
Customers are increasingly building ML models using PyTorch to achieve business outcomes. To deploy these models, they use TorchServe on CPU/GPU instances to meet desired latency and throughput goals. However, costs can add up when customers deploy 10+ models. With MME support for TorchServe, customers can deploy thousands of PyTorch-based models on a single SageMaker endpoint. Behind the scenes, MME runs multiple models on a single instance and dynamically loads and unloads models across multiple instances based on incoming traffic. With this feature, customers can save costs by sharing the instances behind an endpoint across thousands of models and paying only for the number of instances used.
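As an illustrative sketch, a request selects one of the models hosted on an MME by passing the TargetModel parameter to the SageMaker runtime's invoke_endpoint API; if that model is not already loaded, MME loads it on demand. The endpoint name, archive name, and payload below are hypothetical:

```python
import boto3

runtime = boto3.client("sagemaker-runtime")

# Invoke one specific model archive hosted on the multi-model endpoint.
response = runtime.invoke_endpoint(
    EndpointName="torchserve-mme-endpoint",  # hypothetical endpoint name
    TargetModel="model-42.tar.gz",           # which hosted model to run; loaded on demand
    ContentType="application/json",
    Body=b'{"inputs": [1.0, 2.0, 3.0]}',
)
print(response["Body"].read())
```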
This feature supports PyTorch models that use the SageMaker TorchServe Inference Container on all machine learning optimized CPU instances and on single-GPU instances in the ml.g4dn, ml.g5, ml.p2, and ml.p3 families. It is available in all AWS Regions supported by Amazon SageMaker.
To get started, create an MME endpoint with the instance type of your choice using our APIs or the SageMaker Python SDK. To learn more, visit our documentation page on MME for TorchServe and read our launch blog.
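For illustration, here is a minimal sketch using the SageMaker Python SDK's MultiDataModel class; the S3 prefix, role ARN, resource names, and framework version are placeholder assumptions, so substitute values valid in your account and Region:

```python
import sagemaker
from sagemaker.multidatamodel import MultiDataModel

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # hypothetical role ARN

# Look up the SageMaker TorchServe (PyTorch) inference container for this Region.
image_uri = sagemaker.image_uris.retrieve(
    framework="pytorch",
    region=session.boto_region_name,
    version="2.0",            # example version; pick one supported in your Region
    py_version="py310",
    instance_type="ml.g5.2xlarge",
    image_scope="inference",
)

# Every model archive under this S3 prefix becomes invokable on the one endpoint.
mme = MultiDataModel(
    name="torchserve-mme",                                  # hypothetical model name
    model_data_prefix="s3://my-bucket/torchserve-models/",  # hypothetical S3 prefix
    image_uri=image_uri,
    role=role,
    sagemaker_session=session,
)

predictor = mme.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    endpoint_name="torchserve-mme-endpoint",
)
```

New model archives uploaded under model_data_prefix afterwards can be invoked on the same endpoint without redeploying it.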