AWS Machine Learning Blog

Load test and optimize an Amazon SageMaker endpoint using automatic scaling

Once you have trained, optimized and deployed your machine learning (ML) model, the next challenge is to host it in such a way that consumers can easily invoke and get predictions from it. Many customers have consumers who are either external or internal to their organizations and want to use the model for predictions (ML inference). These consumers might not understand your ML stack and would want a simple API that can give them predictions in real time or in batch mode. Amazon SageMaker lets you host your model and provides an endpoint that consumers can invoke by a secure and simple API call using an HTTPS request. Many customers are curious about ways to invoke such endpoints with appropriate input that the model expects, and they are curious about the scalability and high availability of these endpoints.

This blog post describes how to invoke an Amazon SageMaker endpoint from the web and how to load test the model to find the right instance size and instance count for serving the endpoint. With automatic scaling in Amazon SageMaker, you can ensure your model's elasticity and availability and optimize cost by selecting the right metrics to monitor and react to.

Creating the endpoint for a model

I used a prebuilt and pre-trained model in this blog post that uses decision trees, a non-parametric supervised learning method for classification and regression. The model uses the iris dataset in the UCI Machine Learning Repository and predicts flower type based on sepal and petal length and width. I have already created a model endpoint for this classification model, but I encourage you to create your own endpoint. If you need help, see this GitHub repository.

A number of Jupyter notebooks, including the scikit_bring_your_own notebook in the GitHub repo described previously, are available to you on the SageMaker instance at the following path:

/SageMaker Examples/advanced_functionality/scikit_bring_your_own

Open the scikit_bring_your_own notebook and execute all of the cells except the last one, which deletes the endpoint. Pay attention to the instance type ml.m4.xlarge in notebook cells where the model is being deployed, as shown in the following image. Amazon SageMaker doesn’t support automatic scaling for burstable instances such as t2. You would normally use the t2 instance types as part of your development and light testing.

Following the steps in the notebook creates an endpoint named decision-trees-sample-yyyy-mm-dd-hh-mm-ss-mss. It also creates a variant named AllTraffic under Endpoint runtime settings on the Endpoints page of the Amazon SageMaker console. The variant has only one instance of the ml.m4.xlarge type. Take note of the endpoint name as shown in the following image.

Invoking the endpoint from the web

Once you have created the endpoint, you need a way to invoke it outside a notebook. There are different ways you can invoke your endpoint and the model expects appropriate input (the model signature) when you invoke it. These input parameters can be in a file format such as .csv and .libsvm, an audio file, an image, or a video clip. The endpoint in this post expects .csv input. You will use AWS Lambda and Amazon API Gateway to format the input request and invoke the endpoint from the web. The following diagram shows the infrastructure’s architecture at this point.

If you want to use my code and setup, sign in to your AWS account and launch this AWS CloudFormation stack.

The stack requires the name of the endpoint as input. It creates a Lambda function named InvokeDecisionTreeSMEP and an API Gateway endpoint named DecisionTreeAPI to front the model’s endpoint. After the stack has been created successfully, you can make a test call in the API Gateway console or with a tool such as Postman. The following image shows sample output from making a test call using Postman.
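The Lambda function that the stack creates is not reproduced in this post. As a rough sketch of what such a function might look like, the following Python handler forwards a CSV request body to the endpoint through the SageMaker runtime API. The environment variable name ENDPOINT_NAME, the response shape, and the assumption that API Gateway uses the Lambda proxy integration (so the payload arrives in event["body"]) are illustrative choices, not details of the actual stack.

```python
import json
import os


def predict(runtime_client, endpoint_name, csv_row):
    """Send one CSV row to the endpoint and return the decoded prediction."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",  # the endpoint in this post expects CSV input
        Body=csv_row,
    )
    # The response body is a stream of bytes; decode it and drop the newline.
    return response["Body"].read().decode("utf-8").strip()


def lambda_handler(event, context):
    """Assumed proxy-integration handler: passes the request body to SageMaker."""
    import boto3  # imported lazily; the Lambda Python runtime bundles boto3

    runtime_client = boto3.client("sagemaker-runtime")
    endpoint_name = os.environ["ENDPOINT_NAME"]  # hypothetical variable name
    prediction = predict(runtime_client, endpoint_name, event["body"])
    return {"statusCode": 200, "body": json.dumps({"prediction": prediction})}
```

Keeping the call to invoke_endpoint in its own function makes it easy to exercise the logic with a stub client before wiring it to API Gateway.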

As you increase the number of concurrent prediction requests, at some point the endpoint responds more slowly and eventually errors out for some requests. Automatically scaling the endpoint avoids these problems and improves prediction throughput. The following diagram shows the architecture with automatic scaling.

When the endpoint scales out, Amazon SageMaker automatically spreads instances across multiple Availability Zones. This provides Availability Zone level fault tolerance and protects from an individual instance failure.

Load testing the endpoint

So far, you have invoked the endpoint with a single request or a few requests on the web. The next step is to load it with a large number of requests and observe the results. You can use an open-source tool named serverless-artillery to generate this load. It runs hundreds or thousands of Lambda functions to generate a massive number of requests per second (RPS) for a given HTTP or HTTPS endpoint. If you want to use a different tool, skip to the next section.

Before setting up serverless-artillery, make sure that you have Node.js version 4 or later and serverless installed. If you don’t, see AWS – Installation in the Serverless Framework documentation.

Install and deploy serverless-artillery:

npm install -g serverless-artillery

Configure serverless-artillery. This command creates a local copy of the deployment assets for modification and deployment of serverless-artillery resources:

slsart configure

Create a Lambda function for load testing:

slsart deploy

Create a local serverless-artillery script that you can customize for your load requirements:

slsart script

This command creates a file named script.yml in your local directory. Open the file for editing and add the following code. Get the target URL from the output of the AWS CloudFormation stack. The comment in this code has an example of a target URL:

config:
  target: "<your API Gateway endpoint here>"
# target: ""
  ensure:
    maxErrorRate: 0
  phases:
    -
      duration: 3600
      arrivalRate: 200
      rampTo: 600
      name: "warm up"
    -
      duration: 3600
      arrivalRate: 600
      rampTo: 800
      name: "Max load"
scenarios:
  -
    flow:
      -
        post:
          headers:
            content-type: "text/csv"
          url: "/DummyStage/DT"
          body: "5,3.5,1.3,0.3"
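Before launching a test of this size, it can help to estimate how many requests the script will send. The following Python sketch does that arithmetic for the two phases in script.yml, assuming that each phase ramps linearly from arrivalRate to rampTo over its duration. The helper name requests_in_phase is illustrative.

```python
def requests_in_phase(duration_s, arrival_rate, ramp_to=None):
    """Approximate requests sent in one phase, assuming a linear ramp."""
    end_rate = arrival_rate if ramp_to is None else ramp_to
    # Area under a straight line from arrival_rate to end_rate.
    return duration_s * (arrival_rate + end_rate) / 2


# Phase values copied from script.yml in this post.
phases = [
    {"name": "warm up", "duration": 3600, "arrivalRate": 200, "rampTo": 600},
    {"name": "Max load", "duration": 3600, "arrivalRate": 600, "rampTo": 800},
]

total = 0
for phase in phases:
    count = requests_in_phase(
        phase["duration"], phase["arrivalRate"], phase["rampTo"]
    )
    total += count
    print(f'{phase["name"]}: about {count:,.0f} requests')
print(f"total: about {total:,.0f} requests")
```

Under that assumption, the two-hour run sends roughly 1.44 million requests in the first phase and 2.52 million in the second, or close to 4 million in total. Each request passes through API Gateway and Lambda, so keep that volume in mind when estimating the cost of a test.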

Deciding when to automatically scale

If the endpoint has only a moderate load, you can run it on a single instance and still get good performance. Use automatic scaling to ensure high availability during traffic fluctuations without having to provision for peak. For production workloads, use at least two instances. Because Amazon SageMaker automatically spreads the endpoint instances across multiple Availability Zones, a minimum of two instances ensures high availability and provides individual fault tolerance.

To determine the scaling policy for automatic scaling in Amazon SageMaker, test for how much load (RPS) the endpoint can sustain. Then configure automatic scaling and observe how the model behaves when it scales out. Expected behavior is lower latency and fewer or no errors with automatic scaling. For the decision-tree model in this post, a single ml.m4.xlarge instance hosting the endpoint can handle 25,000 requests per minute (400–425 RPS) with a latency of less than 15 milliseconds and without a significant (greater than 5%) number of errors. For more information on load testing a variant, see Load Testing for Variant Automatic Scaling in the Amazon SageMaker Developer Guide.
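The scaling metric is expressed per minute, while the load script is expressed per second, so it is easy to mix up the two units. The following Python sketch shows the conversion and a check against the 5% error threshold mentioned above. The function names are illustrative.

```python
def per_minute_to_rps(requests_per_minute):
    """Convert a per-minute request count to requests per second."""
    return requests_per_minute / 60.0


def rps_to_per_minute(rps):
    """Convert requests per second to requests per minute."""
    return rps * 60.0


def is_significant_error_rate(errors, invocations, threshold=0.05):
    """True when errors exceed the threshold share of invocations."""
    if invocations == 0:
        return False
    return errors / invocations > threshold


# 25,000 requests per minute is about 416.7 RPS, inside the 400-425 RPS range.
print(round(per_minute_to_rps(25000), 1))
# The 800 RPS peak in script.yml corresponds to 48,000 requests per minute.
print(rps_to_per_minute(800))
```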

Running the load test

Run the load test in two scenarios:

  • BEFORE: No automatic scaling configured for the endpoint variant
  • AFTER: Automatic scaling configured for the endpoint variant
    Use the following parameters for the test:

    • Minimum instance count: 1
    • Maximum instance count: 4
    • Scaling policy: Target value of 25000 for SageMakerVariantInvocationsPerInstance (per minute)
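If you prefer to configure these parameters programmatically rather than on the console, the following Python sketch shows one way to do it with the Application Auto Scaling API in boto3. It assumes the variant is named AllTraffic, as described earlier. The policy name InvocationsPerInstanceTarget and the helper function names are made up for illustration, and cooldown settings are left at their defaults.

```python
def _resource_id(endpoint_name, variant_name):
    """Resource identifier format for a SageMaker endpoint variant."""
    return f"endpoint/{endpoint_name}/variant/{variant_name}"


def scalable_target_params(endpoint_name, variant_name="AllTraffic",
                           min_capacity=1, max_capacity=4):
    """Arguments for register_scalable_target (instance count limits)."""
    return {
        "ServiceNamespace": "sagemaker",
        "ResourceId": _resource_id(endpoint_name, variant_name),
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": min_capacity,
        "MaxCapacity": max_capacity,
    }


def scaling_policy_params(endpoint_name, variant_name="AllTraffic",
                          target_value=25000.0):
    """Arguments for put_scaling_policy (target tracking on invocations)."""
    return {
        "PolicyName": "InvocationsPerInstanceTarget",  # hypothetical name
        "ServiceNamespace": "sagemaker",
        "ResourceId": _resource_id(endpoint_name, variant_name),
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_value,  # invocations per instance per minute
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
            },
        },
    }


def apply_autoscaling(endpoint_name):
    """Register the variant and attach the policy (requires AWS credentials)."""
    import boto3  # imported lazily so the helpers above work without it

    client = boto3.client("application-autoscaling")
    client.register_scalable_target(**scalable_target_params(endpoint_name))
    client.put_scaling_policy(**scaling_policy_params(endpoint_name))
```

Calling apply_autoscaling with your endpoint name between the BEFORE and AFTER runs would set up the AFTER scenario.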

In the directory where you installed serverless-artillery, invoke the load test. The following command generates the load specified in the script.yml configuration file that you updated:

slsart invoke

Next, load the endpoint for a duration of two hours with an initial rate of 200 RPS, ramping up to 600 RPS and eventually peaking at 800 RPS. These parameters are already configured in script.yml. During each test, capture the ModelLatency, Invocation4XXErrors, and Invocation5XXErrors metrics in Amazon CloudWatch. When the BEFORE test finishes, configure automatic scaling for the endpoint and repeat the test for the AFTER scenario. If you need help with configuring automatic scaling, see Auto Scaling Is Now Available for Amazon SageMaker on the AWS News Blog.

Note: While running a performance or load test, it is sometimes necessary to kill the test before it is complete. For example, you might consider the test complete when the target endpoint can no longer handle the current load, and want to stop it as soon as you notice excessive latency or errors.

You can run the following command to kill the performance test:

slsart kill --region <region-used-for-deploy>

The command terminates the running load test and removes the deployed assets. Wait approximately 5 minutes before redeploying.

Analyzing the test results

Now you can review the metrics in CloudWatch. On the Amazon SageMaker console, choose your endpoint. Under Monitor, choose View Invocation metrics, as shown in the following image.

The following image provides the recommended configuration for the CloudWatch metrics. However, feel free to explore the results with the level of granularity that you want.
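The same metrics can also be pulled outside the console. The following Python sketch queries per-minute average ModelLatency for the endpoint with boto3, assuming the variant is named AllTraffic. CloudWatch reports ModelLatency in microseconds, so the sketch converts it to milliseconds. The function names are illustrative.

```python
from datetime import datetime, timedelta, timezone


def metric_query(endpoint_name, metric_name, statistic, start, end,
                 variant_name="AllTraffic"):
    """Arguments for get_metric_statistics at one-minute granularity."""
    return {
        "Namespace": "AWS/SageMaker",
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "StartTime": start,
        "EndTime": end,
        "Period": 60,
        "Statistics": [statistic],
    }


def fetch_average_latency_ms(endpoint_name, hours=2):
    """Return (timestamp, latency in ms) pairs for the last few hours."""
    import boto3  # imported lazily; requires AWS credentials at call time

    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)
    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_statistics(
        **metric_query(endpoint_name, "ModelLatency", "Average", start, end)
    )
    points = sorted(response["Datapoints"], key=lambda p: p["Timestamp"])
    # ModelLatency is reported in microseconds; convert to milliseconds.
    return [(p["Timestamp"], p["Average"] / 1000.0) for p in points]
```

For the error metrics, the same query helper should work with a metric name such as Invocation5XXErrors and the Sum statistic.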

Observe the differences between the two scenarios under the same load. The following graphs report the number of invocations and invocations per instance. In the BEFORE scenario, because only one instance is serving all requests, the lines merge. In the AFTER scenario, the load pattern and Invocations (the blue line) remain the same, but InvocationsPerInstance (the red line) changes as the endpoint scales out. The endpoint scales out as the load exceeds 25,000 requests per minute per instance (the target for the scaling policy).
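To build some intuition for how InvocationsPerInstance relates to the instance count, the following Python sketch estimates how many instances are needed to keep each instance at or below the 25,000 invocations per minute target, within the minimum of 1 and maximum of 4 used in this test. It is only a back-of-the-envelope estimate: actual target tracking reacts to CloudWatch alarms and cooldown periods, so the real instance count can lag or differ.

```python
import math


def estimated_instances(invocations_per_minute, target_per_instance=25000,
                        min_instances=1, max_instances=4):
    """Instances needed to keep per-instance load at or below the target."""
    needed = math.ceil(invocations_per_minute / target_per_instance)
    # Clamp to the configured minimum and maximum instance counts.
    return max(min_instances, min(max_instances, needed))


# Arrival rates (RPS) taken from the phases in script.yml.
for rps in (200, 600, 800):
    print(rps, "RPS ->", estimated_instances(rps * 60), "instance(s)")
```

By this estimate, one instance covers the 200 RPS starting rate, and two instances are enough for both the 600 RPS and the 800 RPS levels.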

The following graphs report average ModelLatency per minute. In the BEFORE scenario, the average latency per minute remains under 10 milliseconds for the first 45 minutes of the test. Latency exceeds 300 milliseconds as the endpoint gets more than 25,000 requests per minute. In the AFTER scenario, latency per minute increases until 45 minutes into the test, when the load exceeds 25,000 requests per minute and the endpoint scales out. Then latency per minute drops. Pay attention to the latency numbers on the vertical axis in both scenarios. In the AFTER scenario, the average ModelLatency per minute remains under 16 milliseconds.

The following graphs report the errors from the endpoint. In the BEFORE scenario, as the load on the endpoint exceeds 28,000 requests per minute, the endpoint can’t keep up with the incoming prediction requests. It starts to error out for some requests. In the AFTER scenario, the endpoint scales out and reports no errors.

Wrapping up

After you have explored the Amazon SageMaker instance, the model’s endpoint, the API Gateway endpoint, the Lambda functions, and the other resources covered in this post, you can do the following:

  • Keep exploring! Find other ways to invoke your endpoint. Run more load testing scenarios. See how the model responds under different load conditions and how automatic scaling improves performance and throughput.
  • Delete the resources that you created. Clean up resources that you’re no longer using to avoid incurring further deployment costs. See Step 4: Clean up in the Amazon SageMaker Developer Guide and Deleting a Stack on the AWS CloudFormation Console in the AWS CloudFormation User Guide.

Note: The latency that CloudWatch reports for Amazon SageMaker doesn’t include the latency introduced by API Gateway and Lambda. If you want your model endpoint to be available to your customers on the internet, this is one easy way to go. For internal customers such as data scientists and analysts in your organization, you could also invoke the SageMaker model endpoint directly. Leave me some feedback/comments, and I’ll be happy to cover this in a future post for you!


Automatically scaling an Amazon SageMaker model endpoint helps in improving its throughput (number of invocations) and performance (number of predictions). When you host your endpoint for predictions, load test it and automatically scale on a CloudWatch metric based on how many requests per minute the endpoint can sustain. Automatically scaling an endpoint also helps in improving customer experience because consumers will not get errors when the endpoint is loaded beyond its peak capacity.

I want to thank my colleagues Guy Ernest, Nic Waldron, and Sang Kong for their contributions to this post.


[1] Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.

About the Author

BK Chaurasiya is a Solutions Architect with Amazon Web Services. He provides technical guidance, design advice, and thought leadership to some of the largest and most successful AWS customers and partners.