Why does my Amazon SageMaker endpoint go into the failed state when I create or update an endpoint?

Last updated: 2022-11-21

I want to troubleshoot why the creation or update of my Amazon SageMaker endpoint has failed.

Resolution

When the creation or update of your SageMaker endpoint fails, SageMaker provides the reason for the failure. Use either of the following options to review this reason:

  • Check the endpoint in the SageMaker Console. The reason for the failure is reported in the console.
  • Run the AWS Command Line Interface (AWS CLI) command describe-endpoint, and then check the FailureReason field in the output.

Note: If you receive errors when running AWS CLI commands, make sure that you’re using the most recent version of the AWS CLI.
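
The same FailureReason field can also be read programmatically. The following is a minimal sketch that assumes the AWS SDK for Python (Boto3) is installed and credentials are configured. The helper name get_failure_reason and the endpoint name my-endpoint are hypothetical.

```python
def get_failure_reason(sagemaker_client, endpoint_name):
    """Return (status, failure_reason) for a SageMaker endpoint.

    failure_reason is None when the response has no FailureReason field,
    which is the case for endpoints that have not failed.
    """
    response = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
    return response["EndpointStatus"], response.get("FailureReason")


# Example usage (needs boto3 and AWS credentials; "my-endpoint" is a placeholder):
#   import boto3
#   status, reason = get_failure_reason(boto3.client("sagemaker"), "my-endpoint")
#   print(status, reason)
```

The client is passed in as an argument so that the helper can be exercised with a stub before it's pointed at a real account.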

The following are some of the failure reasons and their resolution methods.

Unable to provision requested ML compute capacity due to InsufficientInstanceCapacity error

You might get the following error when you try to create an endpoint:

Unable to provision requested ML compute capacity due to InsufficientInstanceCapacity error

This error occurs when AWS doesn't have sufficient capacity to provision the instances requested for your endpoint.

You can resolve this error by trying one or more of the following approaches:

  • Wait for a few minutes and try again because capacity can shift frequently.
  • If you use multiple instances for your endpoint, try to create the endpoint with a smaller number of instances. If you have Auto Scaling configured, SageMaker can scale out or in as required and as capacity permits.
  • Try a different instance type that supports your workload. After creating an endpoint, update the endpoint with the desired instance type. Because SageMaker uses a blue/green deployment method to maximize availability, you can transition to a new instance type without affecting your current production workloads.
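
The last approach, moving an existing endpoint to a different instance type, might look like the following Boto3 sketch. It assumes that the endpoint already exists and is in service. The helper name switch_instance_type, the variant name AllTraffic, and all argument values are illustrative assumptions, not values from this article.

```python
def switch_instance_type(sagemaker_client, endpoint_name, model_name,
                         new_config_name, instance_type, instance_count=1):
    """Create a new endpoint configuration and point the endpoint at it."""
    # A new endpoint configuration describes the desired instance type.
    sagemaker_client.create_endpoint_config(
        EndpointConfigName=new_config_name,
        ProductionVariants=[
            {
                "VariantName": "AllTraffic",  # assumed variant name
                "ModelName": model_name,
                "InitialInstanceCount": instance_count,
                "InstanceType": instance_type,
            }
        ],
    )
    # Updating the endpoint starts the transition to the new configuration.
    return sagemaker_client.update_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=new_config_name,
    )


# Example usage (needs boto3 and AWS credentials; all names are placeholders):
#   import boto3
#   switch_instance_type(boto3.client("sagemaker"), "my-endpoint", "my-model",
#                        "my-endpoint-config-v2", "ml.m5.xlarge")
```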

The container for production variant <variant> did not pass the ping health check. Please check CloudWatch logs for this endpoint.

Containers used for SageMaker endpoints must implement a web server that responds to the /invocations and /ping endpoints. When you create an endpoint, SageMaker starts sending periodic GET requests to the /ping endpoint after the container starts.

At a minimum, a container must respond with an HTTP 200 OK status code and an empty body to indicate that it's ready to accept inference requests. This error occurs when SageMaker doesn't get consistent responses from the container within four minutes after the container starts up. Because the container doesn't pass the health check, SageMaker doesn't consider the endpoint healthy and marks the endpoint as Failed.
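
To illustrate the contract described above, the following is a minimal sketch of a web server that answers /ping with an empty 200 OK response and accepts POST requests on /invocations. It uses only the Python standard library. The names InferenceHandler and make_server are hypothetical, and the /invocations handler echoes the request body as a stand-in for real inference code. A production container would typically use a full model server instead.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class InferenceHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/ping":
            # Health check: 200 OK with an empty body means "ready".
            self.send_response(200)
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            self.send_error(404)

    def do_POST(self):
        if self.path == "/invocations":
            length = int(self.headers.get("Content-Length", 0))
            payload = self.rfile.read(length)
            body = payload  # placeholder: echo the input instead of predicting
            self.send_response(200)
            self.send_header("Content-Type", "application/octet-stream")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)


def make_server(host="0.0.0.0", port=8080):
    """Build the server; SageMaker sends requests to port 8080 in the container."""
    return HTTPServer((host, port), InferenceHandler)


# Example usage inside the container's serve entry point:
#   make_server().serve_forever()
```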

The health check might also fail when you use one of the AWS Deep Learning Containers images. These images use either TorchServe or Multi Model Server to serve models, and these servers implement the HTTP endpoints for inference and health checks. Both servers check whether the model is loaded before they respond to SageMaker with a 200 OK response. If the server can't confirm that the model is loaded, then the health check fails. A model might fail to load for many reasons, including high memory usage.

The corresponding error messages are logged to Amazon CloudWatch Logs for the endpoint. Errors raised by code that's loaded into the endpoint (for example, model_fn for PyTorch) are also written to these logs. To increase the verbosity of these logs, set the SAGEMAKER_CONTAINER_LOG_LEVEL environment variable for the model to one of the Python logging levels.

A health check request must receive a response within two seconds to be successful. Be sure to test the response by starting your model container locally and sending a GET request to the container to check the response.
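
The local test described above can be scripted. The following sketch assumes that the container is already running locally and listening on port 8080. The helper name check_ping is hypothetical. It sends a GET request to /ping and reports whether a 200 response arrived within the two-second limit.

```python
import http.client
import time
import urllib.error
import urllib.request


def check_ping(base_url="http://localhost:8080", timeout=2.0):
    """Send GET /ping and return (ok, status, elapsed_seconds).

    ok is True only for a 200 response received within the timeout.
    status is None when no HTTP response was received at all.
    """
    # The target is local, so skip any HTTP proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    status = None
    start = time.monotonic()
    try:
        with opener.open(base_url + "/ping", timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        status = exc.code  # the server answered, but with an error status
    except (OSError, http.client.HTTPException):
        pass  # connection refused, timed out, or malformed response
    elapsed = time.monotonic() - start
    return (status == 200 and elapsed <= timeout), status, elapsed


# Example usage, with the container already running locally:
#   ok, status, elapsed = check_ping()
#   print(f"ok={ok} status={status} elapsed={elapsed:.2f}s")
```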

Failed to extract model data archive for container

SageMaker expects a TAR file with the model data for use in your endpoint. After SageMaker downloads the TAR file, the data archive is extracted. This error might occur if SageMaker can't extract this data archive. For example, SageMaker can't extract the data archive if the model artifact contains symbolic links for files located in the TAR file.

When you create an endpoint, make sure that the model artifacts don't include symbolic links within the TAR file. To check if the TAR file includes symbolic links, extract the model data, and then run the following command from the directory that contains the extracted artifacts:

find . -type l -ls

This command searches the current directory and all of its subdirectories, and then returns every symbolic link that it finds. Replace each returned link with an actual copy of the file that the link points to.
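
As an alternative to extracting the archive and running find, the archive can be inspected directly. The following sketch uses the Python standard library tarfile module to list symbolic links without extracting anything. The helper name find_symlinks and the file name model.tar.gz are assumptions for illustration.

```python
import tarfile


def find_symlinks(archive_path):
    """Return (name, target) pairs for every symbolic link in a TAR archive."""
    # "r:*" lets tarfile detect the compression (for example, gzip) by itself.
    with tarfile.open(archive_path, "r:*") as archive:
        return [
            (member.name, member.linkname)
            for member in archive.getmembers()
            if member.issym()
        ]


# Example usage ("model.tar.gz" is a placeholder path):
#   for name, target in find_symlinks("model.tar.gz"):
#       print(f"{name} -> {target}")
```

An empty result means that the archive contains no symbolic links.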

CannotStartContainerError

This error occurs when SageMaker can't start the container that serves inference requests.

When SageMaker starts the endpoint, your container is started with the following command:

docker run <image-id> serve

When this command is run, your container must start the serving process.

To troubleshoot this error, use local mode in the SageMaker Python SDK, or run your inference image directly with the docker run command. Local mode loads your model in a way that's similar to a SageMaker endpoint. However, Docker doesn't load the model unless you configure the command or container to do so. You can use a command similar to the following to run your container locally with your model files mounted:

docker run -v $(pwd)/test_dir:/opt/ml -p 8080:8080 --rm ${image} serve
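
The preceding command mounts the local test_dir directory at /opt/ml in the container. SageMaker hosting makes model artifacts available under /opt/ml/model, so the local directory needs a matching model subdirectory. The following sketch prepares that layout from a model archive. The helper name prepare_local_model_dir and the archive name model.tar.gz are assumptions. Extract only archives that you trust.

```python
import tarfile
from pathlib import Path


def prepare_local_model_dir(archive_path, test_dir="test_dir"):
    """Extract a model archive into <test_dir>/model and return that path."""
    model_dir = Path(test_dir) / "model"
    model_dir.mkdir(parents=True, exist_ok=True)
    # "r:*" lets tarfile detect the compression (for example, gzip) by itself.
    with tarfile.open(archive_path, "r:*") as archive:
        archive.extractall(model_dir)
    return model_dir


# Example usage ("model.tar.gz" is a placeholder path):
#   prepare_local_model_dir("model.tar.gz")
#   # Then run the docker command shown above from the same directory.
```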
