AWS Marketplace

Rightsizing Amazon SageMaker endpoints

As AWS consultants, Victor and I often get asked about recommendations on the right instance configuration to use for real-time inference. Finding the correct instance size to host your trained machine learning (ML) models might be a challenging task. However, choosing the right instance and auto scaling configuration can help reduce model serving costs without any disruptions to your end users. In this blog post, Victor and I will show how to choose the right instance for your inference endpoints, so you can perform conscientious analysis rather than take a guesstimate approach based on prior expertise or trial and error.

To illustrate how to identify the right instance type for deploying your ML models for real-time inference, we use this pretrained Image Classification model. This model is trained on the ImageNet dataset and available in the AWS ML Marketplace. We host this model on multiple endpoints and use the Locust framework to load test the endpoints. To find the right instance, we test each endpoint by increasing the requests per second (RPS) until the endpoint fails to respond to at least one percent of the total incoming requests. You can follow the process by running the Jupyter Notebook in the aws-sagemaker-examples repository on GitHub.
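The stopping rule described above (ramp up the load until at least one percent of requests fail) can be sketched in a few lines. This is a minimal illustration, not the notebook's code: `LoadStep` and `max_sustainable_rps` are hypothetical names, and the sample numbers are made up.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class LoadStep:
    """One step of a ramp-up test (hypothetical record, not a Locust type)."""
    rps: float      # requests per second offered during this step
    requests: int   # total requests sent during this step
    failures: int   # requests that failed or timed out


def max_sustainable_rps(steps: Iterable[LoadStep],
                        max_failure_ratio: float = 0.01) -> Optional[float]:
    """Return the highest RPS reached before the failure ratio hits the threshold."""
    best = None
    for step in sorted(steps, key=lambda s: s.rps):
        ratio = step.failures / step.requests if step.requests else 0.0
        if ratio >= max_failure_ratio:
            break  # at least 1% of requests failed: stop ramping
        best = step.rps
    return best


# Illustrative, made-up measurements
steps = [LoadStep(5, 300, 0), LoadStep(10, 600, 2), LoadStep(20, 1200, 30)]
print(max_sustainable_rps(steps))
```

With these sample steps, the 20 RPS step fails 2.5 percent of its requests, so the last step that stays under the threshold is the 10 RPS one.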

Background concepts

  • Amazon API Gateway is a fully managed service that makes it easy for developers to create, publish, maintain, monitor, and secure APIs at scale.
  • AWS Lambda is a serverless compute service that lets you run code without provisioning or managing servers, creating workload-aware cluster scaling logic, maintaining event integrations, or managing runtimes.
  • Locust is an open-source load testing tool where your application user behavior can be expressed in regular Python code. Locust must be installed in the Notebook instance where you are running the tests.
  • AWS ML Marketplace helps you find pretrained ML models that can be quickly deployed on Amazon SageMaker. By pretraining the ML models for you, solutions in AWS Marketplace take care of the heavy lifting, helping you deliver AI- and ML-powered features faster and at a lower cost.

Solution architecture

The following architecture shows how the model is deployed on SageMaker endpoints and how the load tests are executed through requests made via Amazon API Gateway and AWS Lambda. The diagram shows the launch of an Amazon SageMaker notebook instance in an AWS account. That is then used to deploy multiple real-time SageMaker endpoints. You then deploy the infrastructure to access the endpoints via the internet, through Amazon API Gateway and a Lambda function in the same account. You perform load tests using the Locust framework.

0.1: Launch Amazon SageMaker Notebook instance and install required packages
0.2: Set up required IAM permissions for SageMaker execution role
0.3: Update account limits in AWS console
1.1: Set up your notebook environment by loading the specified libraries
1.2: Identify and download (if necessary) your datasets for testing
1.3: Subscribe to the PyTorch ResNet50 ML Model from AWS Marketplace
1.4: Deploy the infrastructure components for load testing
2.1: Deploy multiple Amazon SageMaker endpoints
2.2: Test endpoints with a sample payload
2.3: Perform load tests

AWS architecture for SageMaker endpoint load testing

As the architecture suggests, we use multiple endpoints (CPU and GPU) to find the optimal endpoint instance type that can serve your application requirements.

Solution overview

The implementation of this solution and load testing involves the following steps:

  • Step 0: Prerequisites:
    • Authenticate into your AWS account, create an Amazon SageMaker notebook instance and create a copy of this sample Jupyter Notebook.
    • Update your account limits to be able to deploy the instances required for the load tests.
    • Update your notebook role permissions.
  • Step 1: Set up the model and the endpoint
  • Step 2: Run the load tests, review the results, and identify the best endpoint configuration.
  • Step 3: To avoid additional costs, clean up and delete resources.

The example we explain in this blog post uses the PyTorch ResNet 50 AWS Marketplace model. However, the code provided in the repository enables you to use other AWS Marketplace models as well as your custom machine learning models.

Solution walkthrough: Rightsizing Amazon SageMaker endpoints

Step 0: Prerequisites

  1. To create a classic notebook instance, follow these instructions. After the instance is up and running, open Jupyter and clone the repository as follows:
    git clone
  2. Open the right_size_your_sagemaker_endpoints folder and open the Right-sizing your Amazon SageMaker Endpoints notebook.
  3. For the code to run properly, you must configure your Amazon SageMaker execution role with Administrator permissions on the account or with the following AWS managed policies:
    • AmazonSageMakerFullAccess (this policy is attached by default when you create a notebook execution role)
    • IAMFullAccess
    • AmazonAPIGatewayAdministrator
    • AWSPriceListServiceFullAccess
    • AWSLambda_FullAccess
  4. You also must update your account limits to create the endpoints with the type of instances you’d like to test. Request account quota updates from your AWS account. For detailed instructions on how to do that, see Requesting a quota increase. To run the supporting notebook end-to-end, you will need the following instance sizes and corresponding instance counts.
    • ml.c5.xlarge – 1
    • ml.c5.2xlarge – 1
    • ml.m5.large – 1
    • ml.m5.xlarge – 1
    • ml.p2.xlarge – 1
    • ml.p3.2xlarge – 1
    • ml.g4dn.xlarge – 3

Step 1: Setting up the model and endpoint

Step 1.1: Set up the environment

To set up the environment to ensure the notebook runs without any errors, import all necessary libraries, including Python packages and custom scripts. To do that, in the notebook that you opened in the Prerequisites section, run the cell under Step 1.1: Set up environment.

import os
import re
import json
import time
import boto3
import random
import requests
import base64
import sagemaker
import pandas as pd
from pprint import pprint
from IPython.display import Image
from sagemaker import ModelPackage
from sagemaker import get_execution_role

# Import from helper functions
from api_helper import create_infra, delete_infra
from sagemaker_helper import deploy_endpoints, clean_up_endpoints
from load_test_helper import run_load_tests, generate_plots, get_min_max_instances, generate_latency_plot

# Define the boto3 clients/resources that will be used later on
sm_client = boto3.client('sagemaker')

# Define session variables
sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
account_id = sagemaker_session.account_id()
role = get_execution_role()

The first block imports libraries that are commonly used in data science.

In the second block, you are importing custom helper functions. These are provided to you to execute the load tests more consistently and quickly. You can explore the functions and add any modifications if necessary.

In the final block, the session variables are defined. These are eventually used when invoking the SageMaker APIs.

Step 1.2: Identify and prepare data for load testing

For our tests, we use the PyTorch ResNet50 model, which we deployed from AWS Marketplace. This model has been trained on the ImageNet dataset, which contains more than 20,000 categories with several hundred images of each category.

To run load testing on the ML model and reduce memory usage, we used a single image contained in the repository, plants.jpg. If you want to test the model with other images or types of data, download your own dataset. Do remember to update the scripts to read the type of data you provide and to load it from the updated location. Specifically, you must update the scripts and

Step 1.3: Subscribe to the PyTorch ResNet50 ML Model in AWS Marketplace

To use the model from AWS Marketplace, follow these instructions:

    1. Open the following ML models in separate tabs. Note that both ML models have application/x-image as the input type.
    2. Subscribe to both of the ML models. You can skip this step if you are using your own model. If you’re using your own model, ensure that it is loaded on Amazon SageMaker as a model or model package that can be readily deployed to an endpoint.
    3. Create a dynamic model object that can be configured for load testing. Because you want to compare model performance across different types of instances, create both the GPU and CPU versions of the model. For each model package listing, identify the model package ARN by following these steps:
      • Choose Continue to Configuration.
      • Copy the Product Amazon Resource Name (ARN). Your ARNs should look similar to the following ARNs.

model_arn_cpu = "arn:aws:sagemaker:us-east-1:<account>:model-package/pytorch-ic-wide-resnet50-2-cpu-6a1d8d24bbc97d8de3e39de7e74b3293"
model_arn_gpu = "arn:aws:sagemaker:us-east-1:<account>:model-package/pytorch-ic-wide-resnet50-2-gpu-445fe358cb7a3a0d92861174cf00c113"

    4. Create CPU and GPU models on Amazon SageMaker from the given model ARNs. This code is also included in the example notebook. To create the models, run the following cell in the notebook.

from sagemaker import ModelPackage
# The arguments below are assumed from the variables defined earlier; check the notebook for the exact call
pytorch_model_cpu = ModelPackage(role=role, model_package_arn=model_arn_cpu, sagemaker_session=sagemaker_session)
pytorch_model_gpu = ModelPackage(role=role, model_package_arn=model_arn_gpu, sagemaker_session=sagemaker_session)

Step 1.4: Set up Lambda function and an API Gateway

To simulate and test the functioning of the model endpoint in a production environment, implement an API Gateway endpoint and an AWS Lambda function. These will direct the requests to the SageMaker endpoint.

To set up the infrastructure suggested for the load test, use the create_infra() helper function. This creates a Lambda function to invoke the endpoint and an API Gateway in front of the Lambda function. It also outputs the API gateway URL that you can use later to request the predictions via POST requests. To set up the infrastructure, run the following cell in the notebook.

project_name = "right-size-endpoints"
api_url = create_infra(project_name, account_id, region)
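To make the request path concrete, here is a minimal sketch of what a Lambda function sitting between API Gateway and a SageMaker endpoint might look like. It is not the function that create_infra() deploys. The `make_handler` factory is a hypothetical name, the proxy-style event format is an assumption, and so is the idea that the `data` field carries the Python string form of the image bytes. The `invoke_endpoint` call and its `EndpointName`, `ContentType`, and `Body` parameters are those of the SageMaker Runtime client.

```python
import ast
import json


def make_handler(runtime_client, content_type="application/x-image"):
    """Build a Lambda handler that forwards a request to a SageMaker endpoint.

    `runtime_client` is expected to behave like boto3.client("sagemaker-runtime").
    """
    def handler(event, context):
        # Assumption: API Gateway proxy integration delivers the POST body as a JSON string
        body = json.loads(event["body"])
        # Assumption: 'data' holds str(bytes), that is, a bytes literal such as "b'...'"
        image = ast.literal_eval(body["data"])
        response = runtime_client.invoke_endpoint(
            EndpointName=body["endpoint"],
            ContentType=content_type,
            Body=image,
        )
        prediction = response["Body"].read().decode("utf-8")
        return {"statusCode": 200, "body": prediction}

    return handler
```

Inside a real Lambda function, you would create the handler once at module level by passing it `boto3.client("sagemaker-runtime")`. Injecting the client this way also lets you exercise the handler locally with a stub.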

Step 2: Load testing

Depending on which model you are using and what your requirements are, you can do your load testing in two different ways.

  1. If you are a beginner, test the endpoint’s performance for all supported instance types and settle on the best instance type configuration based on the results.
  2. For a more seasoned user or if there are too many supported instances, perform a focused test on the set of instances that you would like to test.

For simplicity, we are showing how to test the endpoint on all supported instances for the selected model. However, the notebook also helps you perform a more strategic test using the Semi-automatic testing section.

Step 2.1: Create endpoints

After setting up the infrastructure, deploy the endpoints with the preselected instances. The notebook has a user-defined function that automatically deploys the model to all supported endpoints. The function receives a dictionary listing the endpoint types and number of instances per endpoint as input. It also receives the model objects that are expected to be deployed in each endpoint. To deploy the endpoints, run the following cell in the notebook.

endpoints = deploy_endpoints(endpoints_dict, pytorch_model_cpu, pytorch_model_gpu)
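The dictionary passed to deploy_endpoints is not shown in this post, so the following is only a sketch of one plausible shape: instance type mapped to instance count. The `endpoint_name` and `deployment_plan` helpers, the `right-size` prefix, and the GPU family list are illustrative assumptions. The name check reflects the SageMaker rule that endpoint names contain only alphanumeric characters and hyphens and are at most 63 characters long.

```python
import re

# Hypothetical shape for endpoints_dict: instance type -> instance count
endpoints_dict = {
    "ml.c5.xlarge": 1,
    "ml.c5.2xlarge": 1,
    "ml.m5.large": 1,
    "ml.m5.xlarge": 1,
    "ml.p2.xlarge": 1,
    "ml.p3.2xlarge": 1,
    "ml.g4dn.xlarge": 1,
}

GPU_FAMILIES = ("p2", "p3", "g4dn")


def endpoint_name(prefix: str, instance_type: str) -> str:
    """Derive an endpoint name; dots in the instance type become hyphens."""
    name = f"{prefix}-{instance_type}".replace(".", "-")
    if not re.fullmatch(r"[a-zA-Z0-9](-*[a-zA-Z0-9])*", name) or len(name) > 63:
        raise ValueError(f"invalid endpoint name: {name}")
    return name


def deployment_plan(instances: dict, prefix: str = "right-size") -> list:
    """List (endpoint name, instance type, count, model flavor) per endpoint."""
    plan = []
    for instance_type, count in instances.items():
        family = instance_type.split(".")[1]
        flavor = "gpu" if family in GPU_FAMILIES else "cpu"
        plan.append((endpoint_name(prefix, instance_type), instance_type, count, flavor))
    return plan


for row in deployment_plan(endpoints_dict):
    print(row)
```

The flavor column indicates which of the two model objects (CPU or GPU) would be deployed to each endpoint.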

Step 2.2: Test endpoints with a sample payload

After deployment, test the endpoint before starting the load test. This confirms that the model was deployed correctly and that the endpoint is responding as expected. In the example notebook, there is a step to test if your endpoints are on.

To test that your endpoints are working as expected, run the following cell in the notebook.

input_file = "plants.jpg"
with open(input_file, 'rb') as f:
    # Assumed completion: read the raw image bytes
    image = f.read()

# Run the lambda function once for each endpoint
# and check for HTTP 200 response
for ep in endpoints:
    # Create a payload for Lambda - input variables are the image and endpoint name
    payload = {
        'data': str(image),
        'endpoint': ep}
    # Assumed completion: POST the payload to the API Gateway URL returned by create_infra()
    response = requests.post(api_url, json=payload)
    print(f"Endpoint {ep}: {response}")

An HTTP 200 response for all the requests confirms that the model endpoints are working as expected.

Step 2.3: Execute load tests

To execute load tests, use the run_load_tests() function that calls a short bash script that runs Locust.

You must pass the list of endpoints to test to the function run_load_tests(). The load test generates multiple comma-separated values (.csv) files with the test results and stores them in a folder with the name results-YYYY-MM-DD-HH-MM-SS. Inside this folder are the individual test results for each endpoint instance.

To run the tests, run the following cell in the notebook.

results_folder = run_load_tests(api_url, endpoints)
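The helper hides the Locust invocation inside a bash script. As a rough sketch of what such an invocation could look like, the following builds a headless Locust command line and the timestamped results folder name. The function names, the default user count, spawn rate, run time, and the `locustfile.py` file name are all assumptions. The flags are standard options in recent Locust releases (older releases used a different name for the spawn rate option).

```python
from datetime import datetime
from pathlib import Path


def results_folder_name(now: datetime) -> str:
    """Folder name in the results-YYYY-MM-DD-HH-MM-SS format."""
    return now.strftime("results-%Y-%m-%d-%H-%M-%S")


def locust_command(api_url, endpoint, out_dir, users=100, spawn_rate=10,
                   run_time="2m", locustfile="locustfile.py"):
    """Build a headless Locust invocation that writes CSV results for one endpoint."""
    # Locust appends suffixes such as _stats.csv to this prefix
    csv_prefix = (Path(out_dir) / endpoint / endpoint).as_posix()
    return [
        "locust",
        "-f", locustfile,
        "--headless",
        "--host", api_url,
        "--users", str(users),
        "--spawn-rate", str(spawn_rate),
        "--run-time", run_time,
        "--csv", csv_prefix,
    ]


folder = results_folder_name(datetime.now())
print(" ".join(locust_command("https://example.com/prod", "my-endpoint", folder)))
```

The resulting list can be passed to `subprocess.run`. How the endpoint name reaches the Locust user class (for example, through an environment variable) is left open here.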

Background on Locust load tests

The run_load_tests function performs load tests and organizes the resulting files into a single folder. Locust saves up to four files for each load test, identified by their suffixes, in a directory structure similar to the following.

├── results-timestamp
│   ├── endpoint-name
│   │   ├── endpoint-name_failures.csv
│   │   ├── endpoint-name_stats.csv
│   │   ├── endpoint-name_stats_history.csv
│   │   └── endpoint-name_exceptions.csv (optional)
│   ├── endpoint-name ...

The first two files contain the failures and stats for the whole test run, with a row for every stats entry and an aggregated row. The stats history gets new rows with the current, 10-second sliding window stats, appended during the whole test run. You can find more information on the files and how to increase or decrease the interval of writing stats in the Retrieve test statistics in CSV format documentation.
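As an illustration of how the aggregated row of a stats file can be read, here is a small sketch using only the standard library. The `summarize_stats` name is hypothetical, and the column names (`Name`, `Request Count`, `Failure Count`, `Requests/s`) are the ones written by recent Locust versions. Check the header of your own files, because older versions use different names. The sample data is made up and includes only the columns used here.

```python
import csv
import io


def summarize_stats(stats_csv_text: str) -> dict:
    """Read the 'Aggregated' row of a Locust *_stats.csv file."""
    for row in csv.DictReader(io.StringIO(stats_csv_text)):
        if row["Name"] == "Aggregated":
            requests = int(row["Request Count"])
            failures = int(row["Failure Count"])
            return {
                "requests": requests,
                "failures": failures,
                "failure_ratio": failures / requests if requests else 0.0,
                "rps": float(row["Requests/s"]),
            }
    raise ValueError("no Aggregated row found")


# Made-up sample; real files contain more columns
sample = (
    "Type,Name,Request Count,Failure Count,Requests/s\n"
    "POST,/,1000,5,19.8\n"
    ",Aggregated,1000,5,19.8\n"
)
print(summarize_stats(sample))
```

To process a real file, read its contents with `Path(...).read_text()` and pass the text to the function.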

In step 2.4, you plot the maximum number of RPS handled by each endpoint. For further understanding, check the files generated for exceptions, failures, and users or requests generated.

Step 2.4: Performance vs. price plot

To provide the best value to customers, AWS constantly updates instance prices. The Price List Service API (Query API) and AWS Price List API (Bulk API) enable you to query for the prices of AWS services using either JSON for the Price List Service API or HTML for the AWS Price List API. To query the instance prices in real time, your Amazon SageMaker execution role must have permissions to access the service. Refer to the Prerequisites section.
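For orientation, the boto3 `pricing` client's `get_products` call returns a `PriceList` whose entries are JSON strings. The sketch below shows one way to pull the on-demand USD price out of a single entry. The `hourly_usd_prices` name is hypothetical, the sample entry is a trimmed, made-up item that only mimics the nesting of a real response, and the filters needed to select a specific SageMaker instance type are not shown.

```python
import json


def hourly_usd_prices(price_list_item: str) -> list:
    """Extract the on-demand USD prices from one Price List entry (a JSON string)."""
    item = json.loads(price_list_item)
    prices = []
    for term in item["terms"]["OnDemand"].values():
        for dimension in term["priceDimensions"].values():
            prices.append(float(dimension["pricePerUnit"]["USD"]))
    return prices


# Trimmed, made-up entry; the keys TERM and DIM stand in for generated identifiers
sample_item = json.dumps({
    "product": {},
    "terms": {"OnDemand": {"TERM": {"priceDimensions": {"DIM": {
        "unit": "Hrs", "pricePerUnit": {"USD": "0.7360000000"}}}}}},
})
print(hourly_usd_prices(sample_item))
```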

After the load tests run in step 2.3, analyze the results. For an easy understanding of how the different instances perform, it is better to visualize them in a single plot.

In this case, the GPU instances perform significantly better for image classification. The lowest-performing GPU instance, the ml.p2.xlarge, handles about 20 RPS for US$1.125 per hour. In contrast, the most expensive CPU instance provides around three RPS at US$0.40 per hour. Scaled linearly, that CPU instance would cost US$3.33 per hour to serve 25 RPS, or 2.5 times more.

Within the GPU instances, the ml.g4dn.xlarge instance is the industry’s most cost-effective GPU instance for deploying ML models that are graphics-intensive, such as image classification and object detection. At US$0.70 an hour, it performs twice as well as the next cheapest option, the ml.p2.xlarge.

Choose the CPU or the GPU option depending on your expected load, for example whether you need to serve 10 RPS or fewer. If you go the GPU route, the G4 instance clearly emerges as the winner.
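The price comparison above rests on simple arithmetic, sketched below under the assumption that throughput scales linearly with instance count. The function name is hypothetical. The figures are the approximate ones quoted in this post, with the ml.g4dn.xlarge throughput derived from the statement that it performs twice as well as the ml.p2.xlarge.

```python
def hourly_cost_for_load(target_rps: float, rps_per_instance: float,
                         price_per_hour: float) -> float:
    """Hourly cost to serve target_rps, assuming linear scaling (fractional instances allowed)."""
    return target_rps / rps_per_instance * price_per_hour


# Most expensive CPU instance: about 3 RPS at US$0.40 per hour
cpu_cost = hourly_cost_for_load(25, 3, 0.40)
print(f"CPU: US${cpu_cost:.2f} per hour for 25 RPS")

# Price per unit of throughput: (max RPS, price per hour), approximate values
instances = {
    "most expensive CPU": (3, 0.40),
    "ml.p2.xlarge": (20, 1.125),
    "ml.g4dn.xlarge": (40, 0.70),
}
for name, (rps, price) in instances.items():
    print(f"{name}: US${price / rps:.4f} per hour for each RPS served")
```

Normalizing price by throughput in this way makes instance types with very different hourly prices directly comparable.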

The following scatter graph shows the comparison in performance between the different instance types by plotting the maximum RPS on the y axis against the instance price per hour on the x axis. Each instance size is depicted with a colored dot.

Plot showing pricing of different instance types versus their performance in requests per second

Optional: Latency metrics

Optionally, if latency matters for your application, visualize the latency results. To do that, run the generate_latency_plot function, which the notebook imports from load_test_helper. This plots the average, minimum, and maximum response times for each endpoint.

The minimum response times are close to one second for all endpoints, regardless of the instance type. However, the average response time is quickest for the three GPU instances, with the ml.g4dn.xlarge and ml.p3.2xlarge having the lowest average response times.

Remember that the response time includes model processing time and the API Gateway and Lambda processing time. If you require a faster response for your application, use the Locust result file to examine the response times and choose accordingly.

The following image shows the minimum, maximum and average response times for the different instance types. Latency in seconds is shown on the y axis, and instance sizes are on the x axis.

Plot showing latency metrics for response times.

Step 2.5 shows how to programmatically obtain the recommended instance type and autoscaling configuration.

Step 2.5: Optimize endpoint configuration

Now that you have run load tests with all the possible endpoints, you must decide on the final endpoint configuration for your use case. Choose this based on the number of average and maximum RPS expected for your application. A user-defined function helps you obtain the ideal configuration for your model serving. This function bases its recommendation on your required performance in terms of minimum and maximum value of RPS. Run the following cell in the notebook to get the recommended configuration.

get_min_max_instances(results, min_requests_per_second, max_requests_per_second)

This function assumes linear scaling based on tests performed while calculating the minimum and maximum number of instances needed to serve your expected requests per second. The code calculates the price for each instance type and instance count and returns the least expensive option.
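The calculation described above can be sketched as follows. This is not the implementation of get_min_max_instances. The `cheapest_configuration` name, the shape of `results`, and the choice to rank options by their cost at peak load are assumptions, and the throughput and price figures are the approximate ones quoted earlier in this post.

```python
import math


def cheapest_configuration(results: dict, min_rps: float, max_rps: float) -> dict:
    """Pick the instance type with the lowest hourly cost at peak load.

    `results` maps instance type -> (max RPS per instance, price per hour).
    """
    options = []
    for instance_type, (rps_per_instance, price) in results.items():
        # Linear scaling: round up to whole instances, with at least one
        min_count = max(1, math.ceil(min_rps / rps_per_instance))
        max_count = max(1, math.ceil(max_rps / rps_per_instance))
        options.append({
            "instance_type": instance_type,
            "min_instances": min_count,
            "max_instances": max_count,
            "max_price_per_hour": max_count * price,
        })
    return min(options, key=lambda o: o["max_price_per_hour"])


# Approximate, illustrative figures
results = {
    "most-expensive-cpu": (3, 0.40),
    "ml.p2.xlarge": (20, 1.125),
    "ml.g4dn.xlarge": (40, 0.70),
}
best = cheapest_configuration(results, min_rps=10, max_rps=60)
print(best)
```

The minimum and maximum instance counts map naturally onto the lower and upper bounds of an auto scaling configuration.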

After you get a recommended configuration, test the endpoints with the given instance count configurations to ensure that they can serve your requirements.

If your application or endpoint gets spiky loads, it is a good idea to set up an auto scaling configuration. For more information, see Automatically Scale Amazon SageMaker Models. Additionally, for other best practices on ML deployment, see Deployment Best Practices.
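As a sketch of what such an auto scaling configuration involves, the following builds the keyword arguments for the two Application Auto Scaling calls, `register_scalable_target` and `put_scaling_policy`. The `scaling_requests` name, the policy name, the endpoint name, and the 0.7 headroom factor are assumptions. `AllTraffic` is the default variant name used by the SageMaker Python SDK, and the predefined metric counts invocations per instance per minute, hence the conversion from RPS.

```python
def scaling_requests(endpoint_name, variant_name, min_instances, max_instances,
                     rps_per_instance, headroom=0.7):
    """Build keyword arguments for the two Application Auto Scaling calls."""
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    target = {**common, "MinCapacity": min_instances, "MaxCapacity": max_instances}
    policy = {
        **common,
        "PolicyName": f"{endpoint_name}-invocations-target",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            # Convert load-tested RPS to invocations per minute, leaving some headroom
            "TargetValue": rps_per_instance * 60 * headroom,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    }
    return target, policy


# 33 RPS per instance is illustrative (66 RPS spread over two instances)
target, policy = scaling_requests("my-endpoint", "AllTraffic", 1, 2, rps_per_instance=33)
print(target)
print(policy)
```

With `client = boto3.client("application-autoscaling")`, the configuration would be applied by calling `client.register_scalable_target(**target)` followed by `client.put_scaling_policy(**policy)`.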

The following chart shows the maximum supported RPS on the y axis against the price per hour for the instance type on the x axis. It shows that an endpoint with two ml.g4dn.xlarge instances can withstand a maximum load of 66 RPS at a price of US$1.472 per hour.

Graph showing pricing versus performance of G4dn instances

Step 3: Clean up resources

To avoid incurring additional costs, delete the resources created by doing the following:

  1. To stop the endpoints launched, use the clean_up_endpoints() function included in the example notebook. This function receives the list of endpoints and deletes them for you.
  2. Delete the infrastructure, including the Lambda function and API Gateway and the model objects created.
  3. Ensure that you do not have any deployable models created from the model package or using the algorithm. Then unsubscribe from the model in AWS Marketplace.
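If you prefer to see what endpoint cleanup involves, here is a minimal sketch, not the notebook's clean_up_endpoints() implementation. The `delete_endpoints` name is hypothetical. It relies on the `describe_endpoint`, `delete_endpoint`, and `delete_endpoint_config` calls of the boto3 SageMaker client, which is passed in as `sm_client`.

```python
def delete_endpoints(sm_client, endpoint_names):
    """Delete each endpoint and its endpoint configuration; return the names deleted."""
    deleted = []
    for name in endpoint_names:
        # Look up the configuration name before the endpoint is removed
        config_name = sm_client.describe_endpoint(EndpointName=name)["EndpointConfigName"]
        sm_client.delete_endpoint(EndpointName=name)
        sm_client.delete_endpoint_config(EndpointConfigName=config_name)
        deleted.append(name)
    return deleted
```

In the notebook you could call it as `delete_endpoints(sm_client, endpoints)`, using the client defined in Step 1.1. Model objects, the Lambda function, and the API Gateway still need to be removed separately.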


In this blog post, Victor and I demonstrated how to find the right cost- and performance-optimized instance type. We also showed the best practice of testing your endpoints against your expected requirements before hosting them for real-time predictions in a production environment. For spiky loads, we recommended implementing an automatic scaling configuration based on Amazon CloudWatch metrics. This avoids errors when the number of requests increases and avoids unnecessary costs when the number of requests decreases.

About the authors

Durga Sury is an ML Solutions Architect in the Amazon SageMaker Service SA team. She is passionate about making machine learning accessible to everyone. Prior to AWS, she helped non-profit and government agencies derive insights from their data to improve education outcomes. In her spare time, she loves motorcycle rides and hiking with her four-year-old husky.
Victor Jaramillo is a Data Scientist in AWS Professional Services. He has a PhD in Mechatronics, specialized in Deep Learning and Artificial Intelligence. Throughout his experience as a researcher, lecturer, and consultant, he has built solid analytic skills and expertise in Data Science for industrial applications. In his free time, he enjoys riding his motorcycle and DIY motorcycle mechanics.