Amazon SageMaker Model Deployment

Easily deploy and manage machine learning (ML) models for inference

Deploy models in production to serve inference for any use case.

Achieve optimal inference performance and cost.

Use MLOps-ready deployment to reduce operational burden.

Amazon SageMaker makes it easy to deploy ML models to make predictions (also known as inference) at the best price-performance for any use case. It provides a broad selection of ML infrastructure and model deployment options to help meet all your ML inference needs. It is a fully managed service and integrates with MLOps tools, so you can scale your model deployment, reduce inference costs, manage models more effectively in production, and reduce operational burden.

Wide range of options for every use case

Broad range of inference options

From low latency (a few milliseconds) and high throughput (hundreds of thousands of requests per second) to long-running inference for use cases such as natural language processing and computer vision, you can use Amazon SageMaker for all your inference needs. A short deployment sketch follows the four options below.

Real-time Inference

Low latency and ultra-high throughput for use cases with steady traffic patterns.

Serverless Inference

Low latency and high throughput for use cases with intermittent traffic patterns.

Asynchronous Inference

Near real-time inference for use cases with large payloads (up to 1 GB) or long processing times (up to 15 minutes).

Batch Transform

Offline inference on data batches for use cases with large datasets.
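
As a concrete illustration, the minimal sketch below uses the SageMaker Python SDK to serve one trained model two ways: a persistent real-time endpoint and an offline Batch Transform job. The container image URI, S3 paths, and IAM role are placeholders you would replace with your own.

```python
from sagemaker.model import Model

# Placeholders: substitute your own role, container image, and model artifact.
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"
model = Model(
    image_uri="<inference-container-image-uri>",
    model_data="s3://my-bucket/model.tar.gz",
    role=role,
)

# Real-time inference: a persistent, low-latency endpoint on dedicated instances.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.c5.xlarge",
)

# Batch Transform: offline inference over a dataset stored in Amazon S3.
transformer = model.transformer(
    instance_count=1,
    instance_type="ml.c5.xlarge",
    output_path="s3://my-bucket/batch-output/",
)
transformer.transform(
    data="s3://my-bucket/batch-input/",
    content_type="text/csv",
)
```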

Flexible deployment endpoint options

Amazon SageMaker provides scalable and cost-effective ways to deploy large numbers of ML models. With SageMaker’s multi-model endpoints and multi-container endpoints, you can deploy thousands of models on a single endpoint, improving cost-effectiveness while keeping the flexibility to invoke any model whenever you need it. A sketch of a multi-model endpoint follows the options below.

Single-model endpoints

One model on a container hosted on dedicated instances or serverless for low latency and high throughput.

Multi-model endpoints

Multiple models sharing a single container hosted on dedicated instances for cost-effectiveness.

Multi-container endpoints

Multiple containers sharing dedicated instances for models that use different frameworks.

Serial inference pipelines

Multiple containers sharing dedicated instances and executing in a sequence.
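
As an illustrative sketch, the SageMaker Python SDK's MultiDataModel class deploys a multi-model endpoint that serves every model artifact stored under a shared S3 prefix. The endpoint name, prefix, container image, and role below are placeholders.

```python
from sagemaker.multidatamodel import MultiDataModel

# Placeholders: the prefix should hold one model.tar.gz per model, and the
# container image must support SageMaker multi-model endpoints.
mme = MultiDataModel(
    name="my-multi-model-endpoint",
    model_data_prefix="s3://my-bucket/models/",
    image_uri="<multi-model-capable-container-uri>",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
)
predictor = mme.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")

# Each request names the model to invoke; SageMaker loads it on demand
# and caches it in memory on the shared instances.
predictor.predict(data="1.0,2.0,3.0", target_model="model-a.tar.gz")
```

Because models are loaded lazily from S3 on first invocation and cached afterward, packing many infrequently used models onto one endpoint is what makes this pattern cost-effective.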

Supports most machine learning frameworks and model servers

Amazon SageMaker inference supports built-in algorithms and prebuilt Docker images for some of the most common machine learning frameworks, such as Apache MXNet, TensorFlow, and PyTorch, or you can bring your own containers. SageMaker inference also supports the most popular model servers, such as TensorFlow Serving, TorchServe, NVIDIA Triton, and AWS Multi Model Server. With these options, you can deploy models quickly for virtually any use case.
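
For example, a prebuilt framework container lets you deploy a trained PyTorch artifact in a few lines; the artifact path, handler script, role, and version strings below are illustrative placeholders.

```python
from sagemaker.pytorch import PyTorchModel

# Placeholders: point model_data at your artifact and entry_point at a script
# defining the model_fn/predict_fn handlers the model server will call.
model = PyTorchModel(
    model_data="s3://my-bucket/model.tar.gz",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    entry_point="inference.py",
    framework_version="2.1",
    py_version="py310",
)
predictor = model.deploy(initial_instance_count=1, instance_type="ml.g4dn.xlarge")
```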

Achieve high inference performance at low cost

Deploy models on the most high-performing infrastructure or go serverless

Amazon SageMaker offers more than 70 instance types with varying levels of compute and memory, including Amazon EC2 Inf1 instances based on AWS Inferentia, high-performance ML inference chips designed and built by AWS, and GPU instances such as Amazon EC2 G4dn. Or, choose Amazon SageMaker Serverless Inference to easily scale to thousands of models per endpoint, millions of transactions per second (TPS) of throughput, and sub-10 millisecond overhead latencies.
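
As a minimal sketch, deploying the model object from the earlier sketches serverlessly only requires swapping the instance arguments for a serverless configuration; the memory size and concurrency values here are illustrative.

```python
from sagemaker.serverless import ServerlessInferenceConfig

# Illustrative values: memory ranges from 1024 to 6144 MB; max_concurrency
# caps concurrent invocations before requests are throttled.
serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=2048,
    max_concurrency=10,
)
predictor = model.deploy(serverless_inference_config=serverless_config)
```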

Automatic inference instance selection and load testing

Amazon SageMaker Inference Recommender helps you choose the best available compute instance and configuration to deploy machine learning models for optimal inference performance and cost. SageMaker Inference Recommender automatically selects the compute instance type, instance count, container parameters, and model optimizations for inference to maximize performance and minimize cost.
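
A minimal sketch of starting a default Inference Recommender job with boto3, assuming a model already registered in SageMaker Model Registry; the job name, role, and model package ARN are placeholders.

```python
import boto3

sm = boto3.client("sagemaker")

# JobType "Default" produces instance recommendations; "Advanced" runs
# custom load tests against a traffic pattern you define.
sm.create_inference_recommendations_job(
    JobName="my-recommender-job",
    JobType="Default",
    RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    InputConfig={
        "ModelPackageVersionArn": (
            "arn:aws:sagemaker:us-east-1:123456789012:model-package/my-model/1"
        ),
    },
)

# Retrieve the ranked instance recommendations once the job completes.
results = sm.describe_inference_recommendations_job(JobName="my-recommender-job")
```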

Auto scaling for elasticity

You can use scaling policies to automatically scale the underlying compute resources to accommodate fluctuations in inference requests. With auto scaling, you can shut down instances when there is no usage to prevent idle capacity and reduce inference cost.
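
For example, a target-tracking policy registered through Application Auto Scaling can hold invocations per instance near a target value; the endpoint name, variant, capacity bounds, and target below are illustrative.

```python
import boto3

aas = boto3.client("application-autoscaling")
resource_id = "endpoint/my-endpoint/variant/AllTraffic"  # placeholder names

# Register the endpoint variant as a scalable target (1 to 4 instances).
aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Scale in and out to hold roughly 70 invocations per instance per minute.
aas.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
    },
)
```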

Reduce operational burden and accelerate time to value

Fully managed model hosting and management

As a fully managed service, Amazon SageMaker takes care of setting up and managing instances, software version compatibilities, and patching versions. It also provides built-in metrics and logs for endpoints that you can use to monitor and receive alerts.
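
As a sketch, the built-in endpoint metrics are published to Amazon CloudWatch and can be queried directly; the endpoint and variant names are placeholders.

```python
import boto3
from datetime import datetime, timedelta

cw = boto3.client("cloudwatch")

# Average model latency for the endpoint over the last hour, in 5-minute bins.
stats = cw.get_metric_statistics(
    Namespace="AWS/SageMaker",
    MetricName="ModelLatency",
    Dimensions=[
        {"Name": "EndpointName", "Value": "my-endpoint"},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)
```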

Built-in integration with MLOps features

Amazon SageMaker model deployment features are natively integrated with MLOps capabilities, including SageMaker Pipelines (workflow automation and orchestration), SageMaker Projects (CI/CD for ML), SageMaker Feature Store (feature management), SageMaker Model Registry (model and artifact catalog to track lineage and support automated approval workflows), SageMaker Clarify (bias detection), and SageMaker Model Monitor (model and concept drift detection). As a result, whether you deploy one model or tens of thousands, SageMaker helps off-load the operational overhead of deploying, scaling, and managing ML models while getting them to production faster.
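
For instance, cataloging a model version in SageMaker Model Registry is a single call against the model object from the earlier sketches; the group name, content types, and approval status below are illustrative.

```python
# Register a versioned model package; deployment can then be gated on the
# package's approval status in an automated CI/CD workflow.
model_package = model.register(
    model_package_group_name="my-model-group",
    content_types=["text/csv"],
    response_types=["text/csv"],
    inference_instances=["ml.m5.xlarge"],
    transform_instances=["ml.m5.xlarge"],
    approval_status="PendingManualApproval",
)
```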

Customer success

Hugging Face

"Transformers have changed machine learning, and Hugging Face has been driving their adoption across companies, starting with natural language processing, and now, with audio and computer vision. The new frontier for machine learning teams across the world is to deploy large and powerful models in a cost-effective manner. We tested Amazon SageMaker Serverless Inference and were able to significantly reduce costs for intermittent traffic workloads, while abstracting the infrastructure. We’ve enabled Hugging Face models to work out-of-the-box with SageMaker Serverless Inference, helping customers reduce their machine learning costs even further." 

Jeff Boudier, Director of Product – Hugging Face

Bazaarvoice

"Bazaarvoice leverages machine learning to moderate user-generated content to enable a seamless shopping experience for our clients in a timely and trustworthy manner. Operating at a global scale over a diverse client base, however, requires a large variety of models, many of which are either infrequently used or need to scale quickly due to significant bursts in content. Amazon SageMaker Serverless Inference provides the best of both worlds: it scales quickly and seamlessly during bursts in content and reduces costs for infrequently used models."

Lou Kratz, Ph.D., Principal Research Engineer – Bazaarvoice

Shaped

Shaped, a Y Combinator-backed startup, provides APIs that personalize the user experience for all media types.

"Each of our clients requires many inference endpoints that can scale to millions of recommendation requests per day. Using SageMaker Serverless Inference we plan to deploy hundreds of scalable model endpoints with different clients globally in a cost-effective and flexible way.”

Tullie Murrell, CEO – Shaped

TruConnect

TruConnect, a free mobile service for low-income households, leverages machine learning to improve its customer onboarding process.

"In collaboration with Quantiphi (AWS partner), we deployed an intelligent document processing solution on Amazon SageMaker Serverless Inference to automatically extract information from customer documents to confirm eligibility for public assistance programs. As a result, we accelerated our onboarding process by 3x times. With zero human intervention, we now spend less than a minute per document, saving multiple hours previously spent scanning each document."

Travlin McCormack, CTO – TruConnect

LinkSquares

"LinkSquares uses machine learning to accelerate the contract review process by automatically extracting 115+ terms from legal contracts. Using Amazon SageMaker Serverless Inference, we dynamically allocate compute resources appropriate to variable demand, reducing our operating cost by 90% and increasing overall system stability and responsiveness to bursting demands."

Andrew Leverone, SVP of Product and Engineering – LinkSquares

Deploy your first model on SageMaker