Amazon SageMaker Deployment

Easily deploy and manage machine learning models at scale

Amazon SageMaker provides the broadest selection of machine learning (ML) infrastructure and model deployment options to meet the needs of your use case, whether real-time or batch, so you can easily deploy your ML models at scale. Once you deploy a model, SageMaker creates a persistent endpoint that you can integrate into your applications to make ML predictions (also known as inference). SageMaker supports the entire spectrum of inference requirements, from low-latency (a few milliseconds), high-throughput workloads (hundreds of thousands of inference requests per second) to long-running inference for use cases such as natural language processing (NLP) and computer vision (CV). Whether you bring your own models and containers or use those provided by AWS, SageMaker lets you apply MLOps best practices to automatically provision, secure, scale, and manage model deployment, reducing the operational burden of managing ML models.
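
As a brief illustration, the sketch below uses the SageMaker Python SDK to deploy a trained model artifact to a persistent real-time endpoint. The container image URI, S3 model path, IAM role, and endpoint name are illustrative placeholders, not prescribed values.

```python
# Minimal sketch: deploy a trained model to a real-time SageMaker endpoint.
# All names, paths, and ARNs below are hypothetical placeholders.
import sagemaker
from sagemaker.model import Model

session = sagemaker.Session()

model = Model(
    image_uri="<your-inference-container-image-uri>",              # e.g., a prebuilt AWS container
    model_data="s3://your-bucket/model/model.tar.gz",              # hypothetical model artifact in S3
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",  # hypothetical execution role
    sagemaker_session=session,
)

# SageMaker provisions the instances and creates a persistent HTTPS endpoint.
model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",       # any supported instance type
    endpoint_name="my-realtime-endpoint",
)
```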

Easily deploy ML models for any use case

Wide selection of infrastructure to meet the needs of every use case

Amazon SageMaker offers more than 70 instance types with varying levels of compute and memory, including Amazon EC2 Inf1 instances based on AWS Inferentia (high-performance ML inference chips designed and built by AWS) and GPU instances such as Amazon EC2 G4dn, so you can meet the performance and cost requirements of any use case, including real-time, high-throughput, and batch inference.

Single-digit millisecond overhead latency for real-time inference

Amazon SageMaker allows developers to access deployed models through application programming interface (API) requests with response times as low as a few milliseconds. This supports use cases requiring real-time responses, such as ad serving, fraud detection, and personalized product recommendations.
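
For example, here is a minimal sketch of calling a deployed endpoint with boto3; the endpoint name and CSV payload are hypothetical and depend on how your model was built.

```python
# Minimal sketch: real-time inference against a deployed SageMaker endpoint.
import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="my-realtime-endpoint",  # hypothetical endpoint name
    ContentType="text/csv",
    Body="5.1,3.5,1.4,0.2",               # one feature row; format depends on your model
)

# The model's prediction is returned in the response body.
print(response["Body"].read().decode("utf-8"))
```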

Large models with long-running processing times

For inference requests that take longer to complete, Amazon SageMaker offers asynchronous inference, a queue-based system that supports processing times of up to 15 minutes and payloads of up to 1 gigabyte. Asynchronous inference is best suited for use cases that do not require an immediate response but use large CV or NLP models. With asynchronous inference, requests are decoupled from your backend application and web servers, so long-running inference requests do not block your calling application. Asynchronous inference can also scale your model-serving instance count down to zero, optimizing your costs when you don't need to make inference requests.
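
Assuming an endpoint that was deployed with an asynchronous inference configuration, a request is submitted by pointing SageMaker at a payload in Amazon S3; the call returns immediately and the result lands in S3 when processing finishes. The endpoint name and S3 locations below are illustrative.

```python
# Minimal sketch: submit a request to a hypothetical asynchronous endpoint.
import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint_async(
    EndpointName="my-async-endpoint",                            # hypothetical
    InputLocation="s3://your-bucket/async-inputs/request.json",  # payload of up to 1 GB
    ContentType="application/json",
)

# The call returns immediately; poll this S3 location (or subscribe to an
# SNS notification) to pick up the result when processing completes.
print(response["OutputLocation"])
```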

Fully managed inference pipelines

Deploy several models that make predictions in sequence, with each model using the previous model's output as its input. Inference pipelines are fully managed and support any combination of pretrained Amazon SageMaker built-in algorithms and your own custom algorithms. You can also use an inference pipeline more generally to combine preprocessing, inference, and post-processing data science tasks.
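
A minimal sketch using the SageMaker Python SDK's PipelineModel, chaining a hypothetical preprocessing container with a prediction container behind a single endpoint; the container images, S3 paths, and IAM role are placeholders.

```python
# Minimal sketch: a two-step inference pipeline on one endpoint.
from sagemaker.model import Model
from sagemaker.pipeline import PipelineModel

role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # hypothetical

# Two hypothetical models: a feature transformer and a predictor.
preprocess = Model(
    image_uri="<preprocessing-container-image-uri>",
    model_data="s3://your-bucket/preprocess/model.tar.gz",
    role=role,
)
predict = Model(
    image_uri="<inference-container-image-uri>",
    model_data="s3://your-bucket/predict/model.tar.gz",
    role=role,
)

# Requests pass through the containers in order; each model's output
# becomes the next model's input.
pipeline = PipelineModel(
    name="preprocess-then-predict",
    role=role,
    models=[preprocess, predict],
)
pipeline.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    endpoint_name="my-pipeline-endpoint",
)
```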

Achieve the best inference performance and cost

Cost-effective deployment with multi-model endpoints and multi-container endpoints

Amazon SageMaker provides scalable and cost-effective ways to deploy large numbers of ML models. SageMaker multi-model endpoints and multi-container endpoints let you deploy thousands of models on a single endpoint, improving cost-effectiveness while giving you the flexibility to use each model as often as you need it.
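
A minimal sketch of invoking a multi-model endpoint with boto3: one endpoint hosts many artifacts in S3, and the TargetModel parameter selects which one serves each request. All names below are hypothetical.

```python
# Minimal sketch: route a request to one of many models on a single endpoint.
import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="my-multi-model-endpoint",    # hypothetical
    TargetModel="customer-1234/model.tar.gz",  # loaded on demand from the endpoint's S3 prefix
    ContentType="text/csv",
    Body="5.1,3.5,1.4,0.2",
)
print(response["Body"].read().decode("utf-8"))
```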

Autoscaling for elasticity

You can use autoscaling policies to automatically scale the underlying compute resources to accommodate fluctuations in inference requests. With autoscaling, you can shut down instances when there is no usage to prevent idle capacity and reduce inference cost.
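
A minimal sketch of attaching a target-tracking autoscaling policy to an endpoint variant through Application Auto Scaling; the endpoint name, variant name, capacity bounds, and target value are illustrative.

```python
# Minimal sketch: autoscale a hypothetical endpoint variant on invocation rate.
import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/my-realtime-endpoint/variant/AllTraffic"

# Register the variant's instance count as a scalable target.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Add or remove instances to hold roughly 1,000 invocations
# per instance per minute.
autoscaling.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 1000.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
    },
)
```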

Reduce operational burden and accelerate time to value

Fully managed service

As a fully managed service, Amazon SageMaker takes care of setting up and managing instances, ensuring software version compatibility, and applying patches. SageMaker also provides built-in metrics and logs for endpoints that you can use for monitoring and alerting.
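
A minimal sketch of reading one of the built-in CloudWatch metrics that SageMaker emits for endpoints, in this case average model latency over the past hour; the endpoint and variant names are illustrative.

```python
# Minimal sketch: pull a built-in endpoint metric from CloudWatch.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/SageMaker",
    MetricName="ModelLatency",  # reported in microseconds
    Dimensions=[
        {"Name": "EndpointName", "Value": "my-realtime-endpoint"},  # hypothetical
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,                 # 5-minute buckets
    Statistics=["Average"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])
```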

Rapid and reliable deployment

Amazon SageMaker allows you to implement advanced deployment capabilities for ML, such as A/B testing, canary, batch, and blue/green deployments. Using these techniques, data scientists can rapidly release model updates, avoid downtime during deployment, and benchmark the performance of their models in development, test, and production environments.
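
As one example of these techniques, the sketch below sets up a simple A/B test using production variants, splitting traffic 90/10 between two previously created models; all model, configuration, and endpoint names are hypothetical.

```python
# Minimal sketch: A/B test two model versions behind a single endpoint.
import boto3

sm = boto3.client("sagemaker")

sm.create_endpoint_config(
    EndpointConfigName="ab-test-config",
    ProductionVariants=[
        {
            "VariantName": "model-a",
            "ModelName": "my-model-v1",    # hypothetical, created earlier
            "InstanceType": "ml.m5.xlarge",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.9,   # ~90% of traffic
        },
        {
            "VariantName": "model-b",
            "ModelName": "my-model-v2",    # hypothetical challenger model
            "InstanceType": "ml.m5.xlarge",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.1,   # ~10% of traffic
        },
    ],
)
sm.create_endpoint(
    EndpointName="ab-test-endpoint",
    EndpointConfigName="ab-test-config",
)
```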

Built-in integration with MLOps features

Amazon SageMaker model deployment features are integrated with MLOps capabilities, including SageMaker Pipelines (workflow automation and orchestration), SageMaker Projects (CI/CD for ML), SageMaker Feature Store (feature management), SageMaker Model Registry (model and artifact catalog to track lineage and support automated approval workflows), SageMaker Clarify (bias detection), and SageMaker Model Monitor (model and concept drift detection). As a result, whether you deploy one model or tens of thousands of models, SageMaker helps offload the operational overhead of deploying, scaling, and managing ML models while getting them to production faster.
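
A minimal sketch of one such integration: registering a model version in SageMaker Model Registry so that an approval workflow or CI/CD pipeline can later promote it to production. The model package group, container image, and S3 path are placeholders.

```python
# Minimal sketch: register a model version in SageMaker Model Registry.
import boto3

sm = boto3.client("sagemaker")

sm.create_model_package(
    ModelPackageGroupName="churn-models",  # hypothetical group
    ModelPackageDescription="Candidate model from a pipeline run",
    InferenceSpecification={
        "Containers": [
            {
                "Image": "<inference-container-image-uri>",
                "ModelDataUrl": "s3://your-bucket/model/model.tar.gz",
            }
        ],
        "SupportedContentTypes": ["text/csv"],
        "SupportedResponseMIMETypes": ["text/csv"],
    },
    ModelApprovalStatus="PendingManualApproval",  # gate for approval workflows
)
```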

Deploy your first model on SageMaker