How do I troubleshoot latency with my Amazon SageMaker endpoint?

When I invoke my Amazon SageMaker endpoint, I experience high latency.

Short description

Use Amazon CloudWatch to monitor the latency metrics ModelLatency and OverheadLatency for a SageMaker endpoint that serves a single model.

ModelLatency is the amount of time that a model takes to respond to an inference request, as viewed by SageMaker. This duration includes the local communication time for the model to send the request and fetch the response. It also includes the completion time of the inference inside the model container.
OverheadLatency is the amount of time that SageMaker overhead adds to the response to an invocation request. It is measured from when SageMaker receives a request until it returns a response to the client, minus ModelLatency.
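To make the two metrics concrete, the following is a minimal sketch of reading them with boto3. It assumes the metrics are published in the AWS/SageMaker namespace with the EndpointName and VariantName dimensions and are reported in microseconds. The function names and the "AllTraffic" variant name are placeholders for illustration, so substitute the values for your own endpoint.

```python
from datetime import datetime, timedelta, timezone


def latency_query(endpoint_name, variant_name, metric_name, minutes=60):
    """Build keyword arguments for CloudWatch get_metric_statistics."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/SageMaker",
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": 60,  # one datapoint per minute
        "Statistics": ["Average", "Maximum"],
    }


def to_milliseconds(microseconds):
    """Convert a latency value from microseconds to milliseconds."""
    return microseconds / 1000.0


def print_latency(endpoint_name, variant_name="AllTraffic"):
    """Print ModelLatency and OverheadLatency for the last hour."""
    import boto3  # imported here so the helpers above work without boto3

    cloudwatch = boto3.client("cloudwatch")
    for metric in ("ModelLatency", "OverheadLatency"):
        response = cloudwatch.get_metric_statistics(
            **latency_query(endpoint_name, variant_name, metric)
        )
        # Datapoints are not guaranteed to arrive in time order.
        for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
            print(
                metric,
                point["Timestamp"].isoformat(),
                f"avg={to_milliseconds(point['Average']):.1f} ms",
                f"max={to_milliseconds(point['Maximum']):.1f} ms",
            )
```

Calling print_latency("my-endpoint") with valid AWS credentials shows which of the two metrics dominates, which tells you which part of the Resolution section applies.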

When you use a SageMaker multi-model endpoint, the following additional metrics are available in CloudWatch:

ModelLoadingWaitTime: The amount of time that an invocation request waits for the target model to download or load, before performing inference.
ModelDownloadingTime: The amount of time to download the model from Amazon Simple Storage Service (Amazon S3).
ModelLoadingTime: The amount of time to load the model through the container.
ModelCacheHit: The number of InvokeEndpoint requests that are sent to the endpoint for which the model was already loaded.

Multi-model endpoints load and unload models throughout their lifetime. You can use the LoadedModelCount CloudWatch metric to view the number of loaded models for an endpoint.
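The multi-model metrics above can be pulled in a single CloudWatch call. The sketch below assumes the same AWS/SageMaker namespace and EndpointName and VariantName dimensions as the single-model metrics. The statistic chosen for each metric is an illustrative choice rather than a requirement, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Metric name -> statistic to request. Averaging ModelCacheHit is commonly
# read as the share of requests that found the model already loaded.
MME_METRICS = {
    "ModelLoadingWaitTime": "Average",
    "ModelDownloadingTime": "Average",
    "ModelLoadingTime": "Average",
    "ModelCacheHit": "Average",
    "LoadedModelCount": "Maximum",
}


def mme_metric_queries(endpoint_name, variant_name, period=300):
    """Build the MetricDataQueries list for CloudWatch get_metric_data."""
    dimensions = [
        {"Name": "EndpointName", "Value": endpoint_name},
        {"Name": "VariantName", "Value": variant_name},
    ]
    return [
        {
            "Id": f"m{index}",  # query IDs must start with a lowercase letter
            "Label": name,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/SageMaker",
                    "MetricName": name,
                    "Dimensions": dimensions,
                },
                "Period": period,
                "Stat": stat,
            },
            "ReturnData": True,
        }
        for index, (name, stat) in enumerate(MME_METRICS.items())
    ]


def fetch_mme_metrics(endpoint_name, variant_name="AllTraffic", hours=1):
    """Return {metric name: [(timestamp, value), ...]} for the last hours."""
    import boto3  # imported here so the query builder works without boto3

    end = datetime.now(timezone.utc)
    response = boto3.client("cloudwatch").get_metric_data(
        MetricDataQueries=mme_metric_queries(endpoint_name, variant_name),
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
    )
    return {
        result["Label"]: list(zip(result["Timestamps"], result["Values"]))
        for result in response["MetricDataResults"]
    }
```

A low ModelCacheHit average together with a high ModelLoadingWaitTime suggests that requests are frequently waiting for models to be downloaded or loaded.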

Resolution

High ModelLatency

To reduce this latency, take any of the following actions:

Benchmark the model outside of a SageMaker endpoint to test performance.
If SageMaker Neo supports your model, then you can compile the model. SageMaker Neo optimizes models to run up to twice as fast, with less than a tenth of the memory footprint and no loss in accuracy.
If AWS Inferentia supports your model, then you can compile the model for Inferentia. This offers up to three times higher throughput and up to 45% lower cost per inference compared to AWS GPU-based instances.
If you use a CPU instance and the model supports GPU acceleration, then switch to a GPU instance to add GPU acceleration.
Note: The inference code might affect the model latency depending on how the code handles the inference. Any delays in code increase the latency.
An overused endpoint might cause higher model latency. To dynamically increase and decrease the number of instances that are available for an endpoint, add auto scaling to the endpoint.
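As one possible way to add auto scaling, the sketch below registers an endpoint variant with Application Auto Scaling and attaches a target tracking policy on invocations per instance. The policy name, cooldown values, and function names are assumptions chosen for illustration. Pick a target value based on load testing of your own model.

```python
def scaling_requests(endpoint_name, variant_name, min_instances,
                     max_instances, invocations_per_instance):
    """Build the two Application Auto Scaling requests for one variant."""
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    target = dict(common, MinCapacity=min_instances, MaxCapacity=max_instances)
    policy = dict(
        common,
        PolicyName=f"{endpoint_name}-invocations-target",  # hypothetical name
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": float(invocations_per_instance),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleOutCooldown": 60,   # seconds; illustrative value
            "ScaleInCooldown": 300,   # seconds; illustrative value
        },
    )
    return target, policy


def enable_auto_scaling(endpoint_name, variant_name="AllTraffic",
                        min_instances=1, max_instances=4,
                        invocations_per_instance=100):
    """Register the variant as a scalable target and attach the policy."""
    import boto3  # imported here so the request builder works without boto3

    client = boto3.client("application-autoscaling")
    target, policy = scaling_requests(
        endpoint_name, variant_name, min_instances,
        max_instances, invocations_per_instance,
    )
    client.register_scalable_target(**target)
    client.put_scaling_policy(**policy)
```

Keeping the request construction separate from the API calls makes it easy to review the exact parameters before you apply them to a production endpoint.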

High OverheadLatency

Multiple factors might contribute to OverheadLatency. These factors include the payload size of requests and responses, the request frequency, and the authentication or authorization of the request.

The first invocation of an endpoint might have higher latency because of a cold start. This behavior is expected for the first invocation requests. To avoid this issue, send test requests to the endpoint to pre-warm it. Note that infrequent requests might also lead to an increase in OverheadLatency.
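Pre-warming can be as simple as sending a few test requests and timing them. The following is a minimal sketch. The warm_up and make_invoker helpers are hypothetical names, and the payload and content type are assumptions that must match what your model container accepts.

```python
import time


def warm_up(invoke, attempts=5, pause_seconds=1.0):
    """Call invoke() several times and return the elapsed seconds per call."""
    timings = []
    for attempt in range(attempts):
        start = time.perf_counter()
        invoke()
        timings.append(time.perf_counter() - start)
        if attempt < attempts - 1:
            time.sleep(pause_seconds)  # space the test requests out
    return timings


def make_invoker(endpoint_name, payload, content_type="application/json"):
    """Return a zero-argument function that sends one test request."""
    import boto3  # imported here so warm_up can be used without boto3

    runtime = boto3.client("sagemaker-runtime")

    def invoke():
        response = runtime.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType=content_type,
            Body=payload,
        )
        response["Body"].read()  # consume the response so timing is complete

    return invoke
```

For example, warm_up(make_invoker("my-endpoint", b'{"inputs": [1, 2, 3]}')) returns one timing per request. If the first timing is much larger than the rest, the difference is consistent with a cold start.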
