How do I troubleshoot latency with my Amazon SageMaker endpoint?
Last updated: 2022-10-14
I'm experiencing high latency when invoking my Amazon SageMaker endpoint.
You can use Amazon CloudWatch to monitor the latency metrics ModelLatency and OverheadLatency for a SageMaker endpoint that serves a single model.
- ModelLatency is the amount of time taken for a model to respond to an inference request as viewed by SageMaker. This duration includes the local communication time to send the request and fetch the response as well as the amount of time taken to complete the inference inside the model container.
- OverheadLatency is the additional time that SageMaker itself takes to respond to an invocation request. It's measured from when SageMaker receives the request until it returns the response, minus ModelLatency.
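As a starting point, you can pull these metrics programmatically rather than through the console. The following sketch (the endpoint and variant names are placeholders, and the AWS SDK for Python is assumed) retrieves the p99 of a latency metric for the last hour. Note that both metrics are reported in microseconds.

```python
import datetime

def us_to_ms(microseconds):
    # ModelLatency and OverheadLatency are reported in microseconds.
    return microseconds / 1000.0

def fetch_p99_latency(endpoint_name, variant_name, metric_name="ModelLatency"):
    # Pull the p99 of a SageMaker latency metric for the last hour.
    # endpoint_name and variant_name are placeholders for your own values.
    import boto3  # AWS SDK for Python; the call below needs AWS credentials

    cloudwatch = boto3.client("cloudwatch")
    now = datetime.datetime.utcnow()
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/SageMaker",
        MetricName=metric_name,
        Dimensions=[
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        StartTime=now - datetime.timedelta(hours=1),
        EndTime=now,
        Period=300,  # one datapoint per 5 minutes
        ExtendedStatistics=["p99"],
    )
    return [
        (point["Timestamp"], us_to_ms(point["ExtendedStatistics"]["p99"]))
        for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"])
    ]
```

Comparing the p99 of ModelLatency against OverheadLatency for the same window shows whether time is being spent inside the model container or in SageMaker's request handling.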
When you use a SageMaker multi-model endpoint, the following additional metrics are available in CloudWatch:
- ModelLoadingWaitTime is the amount of time that an invocation request has waited for the target model to be downloaded or loaded before performing inference.
- ModelDownloadingTime is the amount of time taken to download the model from Amazon Simple Storage Service (Amazon S3).
- ModelLoadingTime is the amount of time taken for the container to load the model.
- ModelCacheHit is the number of InvokeEndpoint requests sent to the endpoint for which the target model was already loaded.
Multi-model endpoints load and unload models throughout their lifetime. You can view the number of loaded models for an endpoint using the LoadedModelCount metric that's published in CloudWatch.
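Because each invocation records 1 for a cache hit and 0 for a miss, the Average statistic of ModelCacheHit over a window approximates the cache hit ratio. A sketch, again with placeholder endpoint and variant names:

```python
import datetime

def average(datapoints):
    # Mean of CloudWatch "Average" datapoints; returns None when there is no data.
    values = [point["Average"] for point in datapoints]
    return sum(values) / len(values) if values else None

def cache_hit_ratio(endpoint_name, variant_name):
    # The Average statistic of ModelCacheHit approximates the fraction of
    # invocations for which the target model was already loaded.
    import boto3  # AWS SDK for Python; the call below needs AWS credentials

    cloudwatch = boto3.client("cloudwatch")
    now = datetime.datetime.utcnow()
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/SageMaker",
        MetricName="ModelCacheHit",
        Dimensions=[
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        StartTime=now - datetime.timedelta(hours=1),
        EndTime=now,
        Period=300,  # one datapoint per 5 minutes
        Statistics=["Average"],
    )
    return average(response["Datapoints"])
```

A low hit ratio together with a high ModelLoadingWaitTime suggests that the working set of models doesn't fit in memory, so requests frequently wait for models to be downloaded and loaded.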
ModelLatency is the time spent for SageMaker to send an invocation request to your model container and then receive the result. This means that you can do the following to reduce this latency:
- Benchmark the model outside of a SageMaker endpoint to test performance.
- If your model is supported by SageMaker Neo, you can compile the model. SageMaker Neo optimizes models to run up to twice as fast with less than a tenth of the memory footprint with no loss in accuracy.
- If your model is supported by AWS Inferentia, you can compile the model for Inferentia, which offers up to three times higher throughput and up to 45% lower cost per inference compared to AWS GPU-based instances.
- If you're using a CPU instance and the model supports GPU acceleration, you can use a GPU instance or Amazon Elastic Inference to add GPU acceleration to an instance.
Note: The inference code can affect the model latency depending on how the code handles the inference. Any delays in code increase the latency.
- If an endpoint is overloaded, model latency might increase. You can configure auto scaling on the endpoint to dynamically increase and decrease the number of instances that serve it.
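The auto scaling step above can be sketched with the Application Auto Scaling API. The endpoint name, variant name, capacity limits, and the target of 100 invocations per instance below are illustrative values, not recommendations:

```python
def variant_resource_id(endpoint_name, variant_name):
    # Application Auto Scaling identifies a production variant with this format.
    return f"endpoint/{endpoint_name}/variant/{variant_name}"

def enable_autoscaling(endpoint_name, variant_name, min_capacity=1, max_capacity=4):
    import boto3  # AWS SDK for Python; the calls below need AWS credentials

    autoscaling = boto3.client("application-autoscaling")
    resource_id = variant_resource_id(endpoint_name, variant_name)

    # Register the variant's instance count as a scalable target.
    autoscaling.register_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        MinCapacity=min_capacity,
        MaxCapacity=max_capacity,
    )

    # Track invocations per instance; scale out when the target is exceeded.
    autoscaling.put_scaling_policy(
        PolicyName="InvocationsTargetTracking",  # illustrative policy name
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": 100.0,  # invocations per instance (illustrative)
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleOutCooldown": 60,   # seconds before another scale-out
            "ScaleInCooldown": 300,   # seconds before scaling back in
        },
    )
```

Scaling out doesn't lower the latency of a single request, but it prevents queuing delays when the endpoint receives more traffic than one instance can serve.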
OverheadLatency is the time that SageMaker takes to respond to a request, excluding ModelLatency. Multiple factors can contribute to OverheadLatency, including the payload size of requests and responses, the request frequency, and the authentication or authorization of the request.
The first invocation of an endpoint might show increased latency because of a cold start; this is expected. To avoid this issue, you can send test requests to the endpoint to pre-warm it. Note that infrequent requests can also lead to an increase in OverheadLatency.
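Pre-warming can be as simple as timing a few representative requests after the endpoint comes in service. The endpoint name and payload below are placeholders, and the client-side timing includes network round-trip time, so it's an upper bound on ModelLatency plus OverheadLatency:

```python
import time

def mean_ms(timings_ms):
    # Average of client-side timings in milliseconds; None when empty.
    return sum(timings_ms) / len(timings_ms) if timings_ms else None

def warm_up(endpoint_name, payload, count=5):
    # Send a few test requests to absorb the cold-start penalty, recording
    # the client-side latency of each. payload is a placeholder request body.
    import boto3  # AWS SDK for Python; the call below needs AWS credentials

    runtime = boto3.client("sagemaker-runtime")
    timings_ms = []
    for _ in range(count):
        start = time.perf_counter()
        runtime.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="application/json",  # adjust to your container's format
            Body=payload,
        )
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return timings_ms
```

The first timing is typically much higher than the rest; once the timings converge, the endpoint is warm.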