Amazon SageMaker Model Deployment

Easily deploy and manage machine learning (ML) models for inference

Why SageMaker Model Deployment?

Amazon SageMaker makes it easy to deploy ML models, including foundation models (FMs), to make inference requests at the best price-performance for any use case. From low latency (a few milliseconds) and high throughput (millions of transactions per second) to long-running inference for use cases such as natural language processing and computer vision, you can use SageMaker for all your inference needs. SageMaker is a fully managed service that integrates with MLOps tools, so you can scale your model deployment, reduce inference costs, manage models more effectively in production, and reduce operational burden.


Benefits of SageMaker Model Deployment

From low latency (a few milliseconds) and high throughput (hundreds of thousands of requests per second) to long-running inference for use cases such as natural language processing and computer vision, you can use Amazon SageMaker for all your inference needs.
Amazon SageMaker offers more than 70 instance types with varying levels of compute and memory on high-performing infrastructure, or you can choose Amazon SageMaker Serverless Inference to easily scale to thousands of models per endpoint. You can use autoscaling to shut down instances when there is no usage, preventing idle capacity and reducing inference costs (a minimal autoscaling sketch follows below).
As a fully managed service, Amazon SageMaker takes care of setting up and managing instances, software version compatibilities, and patching versions. With built-in integration with MLOps features, it helps offload the operational overhead of deploying, scaling, and managing ML models while getting them to production faster.
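
To illustrate the autoscaling behavior described above, here is a minimal sketch, assuming an existing instance-based endpoint named my-endpoint with a production variant named AllTraffic (both placeholder names). It registers the variant with Application Auto Scaling through boto3 and attaches a target-tracking policy on invocations per instance.

    import boto3

    # Application Auto Scaling manages SageMaker endpoint variant capacity.
    autoscaling = boto3.client("application-autoscaling")

    # Hypothetical endpoint/variant names; substitute your own.
    resource_id = "endpoint/my-endpoint/variant/AllTraffic"

    # Register the variant's instance count as a scalable target (1-4 instances).
    autoscaling.register_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        MinCapacity=1,
        MaxCapacity=4,
    )

    # Track roughly 100 invocations per instance per minute;
    # scale out quickly, scale in conservatively.
    autoscaling.put_scaling_policy(
        PolicyName="invocations-target-tracking",
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": 100.0,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleOutCooldown": 60,
            "ScaleInCooldown": 300,
        },
    )

Note that MinCapacity keeps a floor of one warm instance here; with Serverless Inference, capacity scales down automatically when traffic stops.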

Wide range of options for every use case

Broad range of inference options

Choose from four inference options to match your latency, payload size, and traffic patterns; a minimal deployment sketch for each option follows the list below.

Real-Time Inference

Low latency and ultra-high throughput for use cases with steady traffic patterns.

Serverless Inference

Low latency and high throughput for use cases with intermittent traffic patterns.

Asynchronous Inference

Low latency for use cases with large payloads (up to 1 GB) or long processing times (up to 15 minutes).

Batch Transform

Offline inference on data batches for use cases with large datasets.
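
As a concrete illustration of the four options above, here is a minimal sketch using the SageMaker Python SDK. The container image, model artifact location, IAM role, and S3 paths are placeholder assumptions; in practice you would pick the one deployment style that matches your workload.

    from sagemaker.model import Model
    from sagemaker.serverless import ServerlessInferenceConfig
    from sagemaker.async_inference import AsyncInferenceConfig

    def make_model():
        # Fresh Model object per deployment style (placeholder image, artifact, role).
        return Model(
            image_uri="<ecr-image-uri>",
            model_data="s3://my-bucket/model.tar.gz",
            role="<execution-role-arn>",
        )

    # Real-Time Inference: dedicated instances for steady traffic.
    realtime_predictor = make_model().deploy(
        initial_instance_count=1,
        instance_type="ml.c5.xlarge",
    )

    # Serverless Inference: capacity follows intermittent traffic.
    serverless_predictor = make_model().deploy(
        serverless_inference_config=ServerlessInferenceConfig(
            memory_size_in_mb=2048,
            max_concurrency=5,
        ),
    )

    # Asynchronous Inference: requests are queued; results land in S3.
    async_predictor = make_model().deploy(
        initial_instance_count=1,
        instance_type="ml.m5.xlarge",
        async_inference_config=AsyncInferenceConfig(
            output_path="s3://my-bucket/async-results/",
        ),
    )

    # Batch Transform: offline inference over a whole dataset in S3.
    transformer = make_model().transformer(
        instance_count=1,
        instance_type="ml.m5.xlarge",
        output_path="s3://my-bucket/batch-results/",
    )
    transformer.transform(
        data="s3://my-bucket/batch-input/",
        content_type="text/csv",
    )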

Scalable and cost-effective deployment options

Amazon SageMaker provides scalable and cost-effective ways to deploy large numbers of ML models. By hosting multiple models on a single endpoint, you can deploy thousands of models on shared infrastructure, improving cost-effectiveness while providing the flexibility to use models as often as you need them. Multiple models on a single endpoint are supported on both CPU and GPU instance types, allowing you to reduce inference costs by up to 50%.
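
Here is a minimal sketch of a multi-model endpoint using boto3, assuming a container image that supports multi-model hosting and an S3 prefix holding several model artifacts (all names below are placeholders).

    import boto3

    sm = boto3.client("sagemaker")
    runtime = boto3.client("sagemaker-runtime")

    # Placeholder assumptions: image URI, role, and an S3 prefix that holds
    # many model artifacts (model-a.tar.gz, model-b.tar.gz, ...).
    sm.create_model(
        ModelName="shared-multi-model",
        ExecutionRoleArn="<execution-role-arn>",
        PrimaryContainer={
            "Image": "<ecr-image-uri>",
            "Mode": "MultiModel",                      # host many models in one container
            "ModelDataUrl": "s3://my-bucket/models/",  # S3 prefix, not a single artifact
        },
    )

    sm.create_endpoint_config(
        EndpointConfigName="shared-multi-model-config",
        ProductionVariants=[{
            "VariantName": "AllTraffic",
            "ModelName": "shared-multi-model",
            "InstanceType": "ml.m5.xlarge",
            "InitialInstanceCount": 1,
        }],
    )
    sm.create_endpoint(
        EndpointName="shared-multi-model-endpoint",
        EndpointConfigName="shared-multi-model-config",
    )

    # Once the endpoint is InService, route each request to a specific
    # model artifact under the S3 prefix.
    response = runtime.invoke_endpoint(
        EndpointName="shared-multi-model-endpoint",
        TargetModel="model-a.tar.gz",   # loaded on demand, cached on the instance
        ContentType="text/csv",
        Body=b"1.0,2.0,3.0",
    )

Each TargetModel artifact is loaded from the S3 prefix on first use and cached on the instance, which is how thousands of models can share the same infrastructure.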

Single-model endpoints

One model in a container, hosted on dedicated instances or on serverless capacity, for low latency and high throughput.

Multi-model endpoints