What is Amazon SageMaker Inference?
With Amazon SageMaker AI, you can deploy ML models, including foundation models (FMs), to serve inference requests at the best price-performance for any use case. From low latency and high throughput to long-running inference, you can use SageMaker AI for all your inference needs. SageMaker AI is a fully managed service that integrates with MLOps tools, so you can scale your model deployments, reduce inference costs, manage models more effectively in production, and reduce operational burden.
Benefits of SageMaker Model Deployment
Wide range of inference options
Real-Time Inference: persistent endpoints for low-latency, interactive predictions.
Serverless Inference: on-demand compute with no instances to manage, billed per use, for intermittent or unpredictable traffic.
Asynchronous Inference: queued requests for large payloads or long processing times, with results delivered to Amazon S3.
Batch Transform: offline inference over entire datasets stored in Amazon S3.
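These options map to different deploy-time configurations in the SageMaker Python SDK. The sketch below shows each one; the container image, model artifact, S3 paths, and IAM role are placeholders, and the snippet is illustrative rather than a production recipe.

```python
# Minimal sketch of the four inference options using the SageMaker Python SDK
# (v2.x). Image URI, model artifact, S3 paths, and IAM role are placeholders.
from sagemaker.model import Model
from sagemaker.serverless import ServerlessInferenceConfig
from sagemaker.async_inference import AsyncInferenceConfig


def make_model():
    # Fresh Model object per deployment so each option gets its own resources.
    return Model(
        image_uri="<inference-container-image-uri>",  # placeholder
        model_data="s3://<bucket>/model.tar.gz",      # placeholder
        role="<execution-role-arn>",                  # placeholder
    )


# Real-Time Inference: a persistent endpoint on dedicated instances.
realtime_predictor = make_model().deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
)

# Serverless Inference: no instances to manage; capacity scales with traffic.
serverless_predictor = make_model().deploy(
    serverless_inference_config=ServerlessInferenceConfig(
        memory_size_in_mb=2048,
        max_concurrency=5,
    ),
)

# Asynchronous Inference: requests are queued and results are written to S3.
async_predictor = make_model().deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    async_inference_config=AsyncInferenceConfig(
        output_path="s3://<bucket>/async-output/",
    ),
)

# Batch Transform: offline inference over a dataset stored in S3.
transformer = make_model().transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://<bucket>/batch-output/",
)
transformer.transform(data="s3://<bucket>/batch-input/", content_type="text/csv")
```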
Scalable and cost-effective inference options
Single-model endpoints
One model hosted in a container on dedicated instances, or on serverless compute, for low latency and high throughput.
Multiple models on a single endpoint
Host multiple models on the same instance to better utilize the underlying accelerators, reducing deployment costs by up to 50%. You can control scaling policies for each FM separately, making it easier to adapt to model usage patterns while optimizing infrastructure costs.
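As an illustration, a multi-model endpoint is invoked like any other endpoint, with an extra parameter naming the model artifact to use. The endpoint name and artifact name below are placeholders, and the endpoint must be backed by a multi-model-capable container.

```python
# Minimal sketch of invoking a multi-model endpoint with boto3.
import boto3

runtime = boto3.client("sagemaker-runtime")

# TargetModel names the artifact (relative to the endpoint's S3 model prefix)
# that should serve this request; SageMaker loads it onto the instance on demand.
response = runtime.invoke_endpoint(
    EndpointName="my-multi-model-endpoint",  # placeholder endpoint name
    TargetModel="model-a.tar.gz",            # placeholder artifact name
    ContentType="text/csv",
    Body=b"0.5,1.2,3.4",
)
print(response["Body"].read())
```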
Serial inference pipelines
Multiple containers sharing dedicated instances and executing in sequence. You can use an inference pipeline to combine preprocessing, prediction, and post-processing data science tasks.
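A sketch of a two-step pipeline built with the SageMaker Python SDK's PipelineModel follows. The scikit-learn preprocessor, XGBoost predictor, artifact paths, scripts, role, and framework versions are illustrative assumptions.

```python
# Minimal sketch of a serial inference pipeline: preprocessing, then prediction.
from sagemaker.pipeline import PipelineModel
from sagemaker.sklearn import SKLearnModel
from sagemaker.xgboost import XGBoostModel

preprocessor = SKLearnModel(
    model_data="s3://<bucket>/preprocessor/model.tar.gz",  # placeholder
    role="<execution-role-arn>",                           # placeholder
    entry_point="preprocess.py",                           # placeholder script
    framework_version="1.2-1",
)

predictor_model = XGBoostModel(
    model_data="s3://<bucket>/xgboost/model.tar.gz",       # placeholder
    role="<execution-role-arn>",                           # placeholder
    entry_point="inference.py",                            # placeholder script
    framework_version="1.7-1",
)

# Containers run in order on the same instances; a post-processing model
# could be appended to the list in the same way.
pipeline = PipelineModel(
    models=[preprocessor, predictor_model],
    role="<execution-role-arn>",                           # placeholder
)
pipeline.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")
```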
Support for most machine learning frameworks and model servers
Amazon SageMaker inference supports built-in algorithms and prebuilt Docker images for some of the most common machine learning frameworks, such as TensorFlow, PyTorch, ONNX, and XGBoost. If none of the prebuilt Docker images serve your needs, you can build your own container for use with CPU-backed multi-model endpoints. SageMaker inference also supports the most popular model servers, such as TensorFlow Serving, TorchServe, NVIDIA Triton Inference Server, and the AWS Multi Model Server.
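For example, deploying with a prebuilt PyTorch (TorchServe) container looks like the sketch below. The artifact path, role, and handler script are placeholders, and the framework and Python versions should be checked against the currently available container images.

```python
# Minimal sketch of deploying with a prebuilt PyTorch (TorchServe) container.
from sagemaker.pytorch import PyTorchModel

pytorch_model = PyTorchModel(
    model_data="s3://<bucket>/model.tar.gz",  # placeholder
    role="<execution-role-arn>",              # placeholder
    entry_point="inference.py",               # placeholder handler script
    framework_version="2.1",
    py_version="py310",
)

# SageMaker resolves the matching prebuilt container image, so no custom
# Docker build is required.
predictor = pytorch_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",
)
```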
Amazon SageMaker AI offers specialized deep learning containers (DLCs), libraries, and tooling for model parallelism and large model inference (LMI) to help you improve the performance of foundation models. With these options, you can deploy models, including FMs, quickly for virtually any use case.
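A sketch of hosting a foundation model with an LMI container is shown below. The image URI, Hugging Face model ID, tensor-parallel degree, and instance type are assumptions; look up the current LMI DLC image for your Region and size the instance to your model.

```python
# Minimal sketch of large model inference (LMI) hosting with a DLC.
from sagemaker.model import Model

lmi_model = Model(
    image_uri="<lmi-dlc-image-uri>",               # placeholder LMI container image
    role="<execution-role-arn>",                   # placeholder
    env={
        "HF_MODEL_ID": "<hugging-face-model-id>",  # placeholder model to load
        "OPTION_TENSOR_PARALLEL_DEGREE": "4",      # assumption: shard across 4 GPUs
    },
)

# Large models typically need a multi-GPU instance and a longer startup
# health-check window while weights download and load.
predictor = lmi_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
    container_startup_health_check_timeout=900,
)
```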