Amazon SageMaker Deployment

Easily deploy and manage machine learning models at scale

Amazon SageMaker provides a broad selection of machine learning (ML) infrastructure and model deployment options to help meet your needs, whether real time or batch. When you deploy a model for real-time or asynchronous inference, SageMaker creates a persistent endpoint that your applications can call to make ML predictions (also known as inference). It supports the entire spectrum of inference, from low latency (a few milliseconds) and high throughput (hundreds of thousands of inference requests per second) to long-running inference for use cases such as natural language processing (NLP) and computer vision (CV). Whether you bring your own models and containers or use those provided by AWS, you can implement MLOps best practices using SageMaker to reduce the operational burden of managing ML models at scale.

Easily deploy ML models for any use case

Wide selection of infrastructure for virtually every need

Amazon SageMaker offers more than 70 instance types with varying levels of compute and memory, including Amazon EC2 Inf1 instances based on AWS Inferentia, high-performance ML inference chips designed and built by AWS, and GPU instances such as Amazon EC2 G4dn. It can help meet the performance and cost requirements of almost any use case, including real time, high throughput, and batch inference.

Real-time inference for single-digit millisecond overhead latency

Amazon SageMaker allows developers to access deployed models through application programming interface (API) requests with response times as low as a few milliseconds. This supports use cases requiring real-time responses, such as ad serving, fraud detection, and personalized product recommendations.

SageMaker Real-time Inference
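For example, deploying a model to a real-time endpoint and invoking it might look like the following minimal sketch using the SageMaker Python SDK and boto3; the container image, model artifact path, role ARN, and endpoint name are hypothetical placeholders.

    import boto3
    from sagemaker.model import Model

    # Package a trained model artifact with its serving container
    # (all names and paths here are hypothetical placeholders).
    model = Model(
        image_uri="<inference-container-image-uri>",
        model_data="s3://my-bucket/model.tar.gz",
        role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    )

    # Create a persistent HTTPS endpoint managed by SageMaker.
    model.deploy(
        initial_instance_count=1,
        instance_type="ml.g4dn.xlarge",
        endpoint_name="my-realtime-endpoint",
    )

    # Make a low-latency prediction request from an application.
    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName="my-realtime-endpoint",
        ContentType="application/json",
        Body=b'{"features": [1.0, 2.0, 3.0]}',
    )
    print(response["Body"].read())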

Serverless inference for intermittent usage patterns

For use cases with intermittent and unpredictable usage patterns, Amazon SageMaker Serverless Inference (preview) allows you to deploy ML models on pay-per-use pricing without worrying about servers or clusters. When deploying your model, simply select the serverless option, and Amazon SageMaker automatically provisions, scales, and turns off compute capacity based on the volume of inference requests, so you don’t need to manage complex scaling policies and forecast traffic demand up front. With SageMaker Serverless Inference, you pay only for the compute capacity used to run the inference requests, billed by the millisecond, and the amount of data processed—you are not charged for periods of no traffic.

SageMaker Serverless Inference
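The main difference from a real-time deployment is that you pass a serverless configuration instead of an instance type and count. A minimal sketch with the SageMaker Python SDK, assuming hypothetical image, artifact, role, and endpoint names:

    from sagemaker.model import Model
    from sagemaker.serverless import ServerlessInferenceConfig

    model = Model(
        image_uri="<inference-container-image-uri>",
        model_data="s3://my-bucket/model.tar.gz",
        role="<execution-role-arn>",
    )

    # No instance type or count: SageMaker provisions, scales, and turns
    # off compute capacity based on the volume of inference requests.
    model.deploy(
        endpoint_name="my-serverless-endpoint",
        serverless_inference_config=ServerlessInferenceConfig(
            memory_size_in_mb=2048,  # memory allocated to the endpoint
            max_concurrency=5,       # concurrent invocations before throttling
        ),
    )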

Asynchronous inference for large models with long-running processing times

For inference requests that take longer to finish, Amazon SageMaker offers asynchronous inference, a queue-based system that supports processing times up to 15 minutes and payloads up to 1 gigabyte. Asynchronous inference is best suited for use cases that need near real-time results but use large CV or NLP models with long processing times. It separates your inference requests from your backend application and web servers so that long-running requests don't block your calling application. Asynchronous inference can also scale your model serving instance count down to zero, optimizing your costs when you aren't making inference requests.

SageMaker Asynchronous Inference
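The sketch below deploys an asynchronous endpoint and queues a request against it, assuming hypothetical S3 paths, image URI, and role. The call returns immediately with the S3 location where the result will be written once processing completes.

    import boto3
    from sagemaker.model import Model
    from sagemaker.async_inference import AsyncInferenceConfig

    model = Model(
        image_uri="<inference-container-image-uri>",
        model_data="s3://my-bucket/model.tar.gz",
        role="<execution-role-arn>",
    )

    # Completed results are written to the S3 output path.
    model.deploy(
        initial_instance_count=1,
        instance_type="ml.g4dn.xlarge",
        endpoint_name="my-async-endpoint",
        async_inference_config=AsyncInferenceConfig(
            output_path="s3://my-bucket/async-results/",
        ),
    )

    # The request references an input payload staged in S3 (up to 1 GB)
    # and is queued rather than processed inline.
    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint_async(
        EndpointName="my-async-endpoint",
        InputLocation="s3://my-bucket/async-inputs/payload.json",
    )
    print(response["OutputLocation"])  # where the result will appear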

Fully managed inference pipelines

Deploy several models to make predictions in a sequence, with each model using as its input the inference from the previous model. Inference pipelines are fully managed and allow any combination of pretrained Amazon SageMaker built-in algorithms and your own custom algorithms. You can also use an inference pipeline in a more general way to combine preprocessing, inference, and post-processing data science tasks.
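As a rough sketch with the SageMaker Python SDK (container images, artifact paths, and the role are hypothetical), an inference pipeline is defined by listing models in the order they should run; each container's output becomes the next container's input on a single endpoint.

    from sagemaker.model import Model
    from sagemaker.pipeline import PipelineModel

    # A preprocessing container (for example, a feature transformer)
    # followed by the model that makes the actual prediction.
    preprocessor = Model(
        image_uri="<preprocessing-image-uri>",
        model_data="s3://my-bucket/preprocessor.tar.gz",
        role="<execution-role-arn>",
    )
    predictor_model = Model(
        image_uri="<inference-image-uri>",
        model_data="s3://my-bucket/model.tar.gz",
        role="<execution-role-arn>",
    )

    # Containers execute in sequence behind one fully managed endpoint.
    pipeline = PipelineModel(
        name="preprocess-then-predict",
        role="<execution-role-arn>",
        models=[preprocessor, predictor_model],
    )
    pipeline.deploy(initial_instance_count=1, instance_type="ml.c5.xlarge")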

Achieve the best inference performance and cost

Automatic inference instance selection and load testing

Amazon SageMaker Inference Recommender helps you choose the best available compute instance and configuration to deploy ML models for optimal inference performance and cost. It automatically selects the compute instance type, instance count, container parameters, and model optimizations to maximize performance and minimize cost. You can run Inference Recommender from SageMaker Studio, the AWS Command Line Interface (CLI), or the AWS SDK and get deployment recommendations within minutes. You can then deploy your model to one of the recommended instances or run a fully managed load test on a set of instance types you choose, without worrying about testing infrastructure. You can review the load test results in SageMaker Studio and evaluate the tradeoffs between latency, throughput, and cost to select the optimal deployment configuration for your use case.
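A minimal sketch of a recommendation job using the AWS SDK for Python (boto3) follows; the job name, role, and model package ARN are hypothetical placeholders, and the model is assumed to be versioned in SageMaker Model Registry.

    import boto3

    sm = boto3.client("sagemaker")

    # "Default" returns instance recommendations quickly; "Advanced" runs
    # a custom load test against instance types you specify.
    sm.create_inference_recommendations_job(
        JobName="my-recommender-job",
        JobType="Default",
        RoleArn="<execution-role-arn>",
        InputConfig={
            "ModelPackageVersionArn": (
                "arn:aws:sagemaker:us-east-1:123456789012:"
                "model-package/my-model/1"
            ),
        },
    )

    # Each recommendation pairs an instance configuration with observed
    # latency, throughput, and cost metrics.
    job = sm.describe_inference_recommendations_job(JobName="my-recommender-job")
    for rec in job.get("InferenceRecommendations", []):
        print(rec["EndpointConfiguration"]["InstanceType"], rec["Metrics"])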

Cost-effective deployment with multi-model endpoints and multi-container endpoints

Amazon SageMaker provides scalable and cost-effective ways to deploy large numbers of ML models. SageMaker multi-model endpoints let you host thousands of models behind a single endpoint, loading each model on demand, while multi-container endpoints let you serve several distinct containers from one endpoint. Both improve cost-effectiveness by sharing infrastructure across models while preserving the flexibility to invoke each model as often as you need.
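The sketch below deploys a multi-model endpoint with the SageMaker Python SDK and routes a request to one specific model. It assumes a container that supports multi-model endpoints and hypothetical names and S3 paths; every model archive stored under the prefix becomes invocable.

    import boto3
    from sagemaker.multidatamodel import MultiDataModel

    # One endpoint serves every artifact stored under the S3 prefix;
    # models are loaded into memory on demand.
    mme = MultiDataModel(
        name="my-multi-model",
        model_data_prefix="s3://my-bucket/models/",
        image_uri="<mme-capable-container-image-uri>",
        role="<execution-role-arn>",
    )
    mme.deploy(
        initial_instance_count=1,
        instance_type="ml.m5.xlarge",
        endpoint_name="my-multi-model-endpoint",
    )

    # TargetModel selects which artifact (relative to the prefix)
    # handles this request.
    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName="my-multi-model-endpoint",
        TargetModel="model-a.tar.gz",
        ContentType="application/json",
        Body=b'{"features": [1.0, 2.0]}',
    )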

Auto scaling for elasticity

You can use scaling policies to automatically scale the underlying compute resources to accommodate fluctuations in inference requests. With auto scaling, you can shut down instances when there is no usage to prevent idle capacity and reduce inference cost.
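SageMaker endpoint scaling is driven by Application Auto Scaling. A minimal target-tracking sketch with boto3 (endpoint and variant names are hypothetical) that adds or removes instances to keep invocations per instance near a target value:

    import boto3

    autoscaling = boto3.client("application-autoscaling")
    resource_id = "endpoint/my-realtime-endpoint/variant/AllTraffic"

    # Register the endpoint variant's instance count as a scalable target.
    autoscaling.register_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        MinCapacity=1,
        MaxCapacity=4,
    )

    # Track a built-in metric: scale out when invocations per instance
    # exceed the target, scale in when traffic drops.
    autoscaling.put_scaling_policy(
        PolicyName="invocations-target-tracking",
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": 100.0,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    )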

Reduce operational burden and accelerate time to value

Fully managed service

As a fully managed service, Amazon SageMaker takes care of setting up and managing instances, maintaining software version compatibility, and applying patches. It also provides built-in metrics and logs for endpoints that you can use for monitoring and alerting.
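For example, the built-in per-endpoint metrics (such as Invocations and ModelLatency) are published to the AWS/SageMaker CloudWatch namespace and can be queried with boto3; the endpoint and variant names below are hypothetical.

    import boto3
    from datetime import datetime, timedelta

    cloudwatch = boto3.client("cloudwatch")

    # Average model latency (reported in microseconds) over the past
    # hour, in five-minute buckets.
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/SageMaker",
        MetricName="ModelLatency",
        Dimensions=[
            {"Name": "EndpointName", "Value": "my-realtime-endpoint"},
            {"Name": "VariantName", "Value": "AllTraffic"},
        ],
        StartTime=datetime.utcnow() - timedelta(hours=1),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=["Average"],
    )
    print(stats["Datapoints"])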

Rapid and reliable deployment

Amazon SageMaker allows you to implement advanced deployment capabilities for ML, such as A/B testing, canary, batch, and blue/green deployments. Using these techniques, data scientists can rapidly release model updates, avoid downtime during deployment, and benchmark the performance of their models in development, test, and production environments.
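One way these techniques map onto SageMaker primitives is through production variants: two model versions share an endpoint, traffic is split by weight, and weights can be shifted without downtime. A sketch with boto3, where all model, endpoint, and variant names are hypothetical:

    import boto3

    sm = boto3.client("sagemaker")

    # Split traffic 90/10 between the current model and a candidate.
    sm.create_endpoint_config(
        EndpointConfigName="ab-test-config",
        ProductionVariants=[
            {
                "VariantName": "current",
                "ModelName": "my-model-a",      # created via create_model
                "InstanceType": "ml.m5.xlarge",
                "InitialInstanceCount": 1,
                "InitialVariantWeight": 0.9,
            },
            {
                "VariantName": "candidate",
                "ModelName": "my-model-b",
                "InstanceType": "ml.m5.xlarge",
                "InitialInstanceCount": 1,
                "InitialVariantWeight": 0.1,
            },
        ],
    )
    sm.create_endpoint(
        EndpointName="ab-test-endpoint",
        EndpointConfigName="ab-test-config",
    )

    # Once the candidate proves out, shift all traffic to it in place.
    sm.update_endpoint_weights_and_capacities(
        EndpointName="ab-test-endpoint",
        DesiredWeightsAndCapacities=[
            {"VariantName": "current", "DesiredWeight": 0.0},
            {"VariantName": "candidate", "DesiredWeight": 1.0},
        ],
    )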

Built-in integration with MLOps features

Amazon SageMaker model deployment features are natively integrated with MLOps capabilities, including SageMaker Pipelines (workflow automation and orchestration), SageMaker Projects (CI/CD for ML), SageMaker Feature Store (feature management), SageMaker Model Registry (model and artifact catalog to track lineage and support automated approval workflows), SageMaker Clarify (bias detection), and SageMaker Model Monitor (model and concept drift detection). As a result, whether you deploy one model or tens of thousands, SageMaker helps off-load the operational overhead of deploying, scaling, and managing ML models while getting them to production faster.
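For instance, registering a new model version in SageMaker Model Registry with the SageMaker Python SDK might look like the sketch below; the group name, image URI, artifact path, and role are hypothetical placeholders. A downstream CI/CD flow (for example, a SageMaker Projects pipeline) can then gate deployment on the approval status.

    from sagemaker.model import Model

    model = Model(
        image_uri="<inference-container-image-uri>",
        model_data="s3://my-bucket/model.tar.gz",
        role="<execution-role-arn>",
    )

    # Each register() call creates a new, lineage-tracked version in the
    # model package group; approval status can gate automated deployment.
    model.register(
        model_package_group_name="my-model-group",
        content_types=["application/json"],
        response_types=["application/json"],
        inference_instances=["ml.m5.xlarge"],
        transform_instances=["ml.m5.xlarge"],
        approval_status="PendingManualApproval",
    )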

Customer success

iFood

"iFood, a leading player in online food delivery in Latin America fulfilling over 60 million orders each month, uses machine learning to make restaurant recommendations to its customers ordering online. We have been using Amazon SageMaker for our machine learning models to build high-quality applications throughout our business. With Amazon SageMaker Serverless Inference we expect to be able to deploy even faster and scale models without having to worry about selecting instances or keeping the endpoint active when there is no traffic. With this, we also expect to see a cost reduction to run these services." 

Ivan Lima, Director of Machine Learning & Data Engineering at iFood

Qualtrics

"Amazon SageMaker Inference Recommender improves the efficiency of our MLOps teams with the tools required to test and deploy machine learning models at scale. With SageMaker Inference Recommender, our team can define latency and throughput requirements and quickly deploy these models faster, while also meeting our budget and production criteria."

Samir Joshi, ML Engineer at Qualtrics

Loka

"Loka, a machine learning consulting firm, helps its clients harness and build ML into their products across a wide range of use cases to deliver better customer experiences. We spend a lot of time and effort optimizing models, tuning servers, and testing instance types to deliver performant, scalable, and cost effective ML environments for our clients. Now using Amazon SageMaker Inference Recommender, our engineers are able to get an ML model deployed to production within minutes from any location.”

Bobby Mukherjee, CEO of Loka

Holmusk

"Holmusk, a digital health company, launched its FoodDX app to help people improve their diet and health. Our food image recognition algorithms need low latency to ensure our users get the right diet recommendations at the right time. To achieve low latency, we were over-provisioning GPUs, which was expensive. Using Amazon SageMaker Inference Recommender, we can now easily conduct load tests across different instances and determine an instance configuration within hours to reduce our compute costs significantly while maintaining latency requirements. This is a huge win for our team and lets our ML scientists focus on creating algorithms to help people live healthier lives rather than managing infrastructure."

Sai Subramanian, Chief Technology Officer, Holmusk

Deploy your first model on SageMaker