Why does my Amazon SageMaker notebook instance get stuck in the Pending state, and then fail?
Last updated: 2022-11-21
When I create or start an Amazon SageMaker notebook instance, the instance enters the Pending state. The notebook instance appears to be stuck in this state, and then it fails.
The Pending status means that SageMaker is creating the notebook instance. If any step in the creation process fails, SageMaker attempts to create the notebook again. This is why a notebook might stay in the Pending state longer than expected. If SageMaker still can't create the notebook instance, the status eventually changes to Failed.
Confirm the failure reason
- To see a pop-up window that shows a shortened version of the failure reason, pause on Failed in the Status column.
- To see the full failure reason, choose the name of the notebook instance. The failure reason appears at the top of the Notebook instance settings section.
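The same information is available outside the console. The following is a minimal sketch that assumes the AWS CLI is installed and configured with permission to call DescribeNotebookInstance. The function name notebook_failure_reason and the instance name my-notebook are placeholders for illustration, not part of any AWS tooling.

```shell
# Print the current status and the full failure reason for one notebook instance.
# Usage: notebook_failure_reason my-notebook   ("my-notebook" is a placeholder)
notebook_failure_reason() {
  aws sagemaker describe-notebook-instance \
    --notebook-instance-name "$1" \
    --query '[NotebookInstanceStatus, FailureReason]' \
    --output text
}
```

The FailureReason field is only populated when the instance is in the Failed state, so for a healthy instance the second value is typically empty or None.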
Use the failure reason to troubleshoot the root cause.
"fatal: unable to access 'https://github.com/aws-samples/amazon-sagemaker-notebook-instance-lifecycle-config-samples/': Failed to connect to github.com port 443: Connection timed out"
This error happens when the networking configuration for the notebook instance doesn't support the domain name or connection for the external Git repository.
Important: Notebook instances that are deployed in a Virtual Private Cloud (VPC) don't automatically inherit custom route tables, like subnet route tables for VPC peering connections. If you need a custom route table, create a lifecycle configuration script that adds the route on startup. For more information, see Understanding Amazon SageMaker notebook instance networking configurations and advanced routing options.
To validate that the Git connection is active and that you can connect to the repository from a notebook instance, create a new notebook instance without an associated Git repository. Then, open the Jupyter console and run the following commands in a terminal session:
1. Resolve the hostname of the server with a DNS lookup tool such as dig.
If the answer section of the output is empty, the notebook instance couldn't resolve the hostname. For a hostname that resolves, such as github.com, the answer section looks like this:
;; ANSWER SECTION:
github.com. 16 IN A 22.214.171.124
2. If the answer section of the output contains a response, domain name resolution works. Run the following command to test the connection to the hostname:
curl -v your-git-repo-url:443
git pull https://your-git-repo-url
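The checks above can be chained so that the first failing layer (DNS, network, or Git) is reported. This is a rough sketch rather than an official procedure. It assumes dig, curl, and git are available in the notebook terminal, the function names repo_host and check_git_connectivity are made up for illustration, and git ls-remote stands in for git pull because it needs no local clone.

```shell
# Extract the hostname from an HTTPS repository URL.
repo_host() {
  rh_url="${1#*://}"             # drop the scheme (https://)
  rh_url="${rh_url%%/*}"         # drop the path
  rh_url="${rh_url##*@}"         # drop any user@ prefix
  printf '%s\n' "${rh_url%%:*}"  # drop any :port suffix
}

# Run the DNS, connection, and Git checks in order; stop at the first failure.
check_git_connectivity() {
  cg_url="$1"
  cg_host="$(repo_host "$cg_url")"

  # Step 1: DNS. "dig +short" prints only the answer records, so an
  # empty result (or a non-zero exit) means the name did not resolve.
  if ! cg_answer="$(dig +short "$cg_host")" || [ -z "$cg_answer" ]; then
    echo "DNS: could not resolve $cg_host" >&2
    return 1
  fi

  # Step 2: connection to port 443, with a timeout so the check cannot hang.
  if ! curl --silent --output /dev/null --connect-timeout 10 "https://$cg_host:443"; then
    echo "Network: could not connect to $cg_host on port 443" >&2
    return 2
  fi

  # Step 3: Git itself. ls-remote only lists refs and changes nothing locally.
  if ! git ls-remote "$cg_url" >/dev/null 2>&1; then
    echo "Git: could not read from $cg_url" >&2
    return 3
  fi

  echo "OK"
}
```

For example, check_git_connectivity https://your-git-repo-url prints OK when all three checks pass. A return code of 1 points to DNS settings in the VPC, and 2 points to routing, security groups, or a missing route to the internet.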
"Lifecycle Configuration failed"
If a lifecycle configuration script runs for longer than five minutes, it fails, and the notebook instance is neither created nor started. For suggestions on how to decrease script runtime, see Customize a notebook instance using a lifecycle configuration script. To troubleshoot issues with the script, check the Amazon CloudWatch logs for the lifecycle configuration:
- Log group: /aws/sagemaker/NotebookInstances
- Log stream: notebook-instance-name/LifecycleConfigOnStart or notebook-instance-name/LifecycleConfigOnCreate
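These logs can also be read from a terminal. The sketch below assumes the AWS CLI is configured with permission to read CloudWatch Logs. The helper names lifecycle_log_stream and lifecycle_log_messages are illustrative only, and my-notebook is a placeholder instance name.

```shell
LOG_GROUP="/aws/sagemaker/NotebookInstances"

# Build the log stream name for a notebook instance.
# $1 is the notebook instance name; $2 is the script type: OnStart or OnCreate.
lifecycle_log_stream() {
  printf '%s/LifecycleConfig%s\n' "$1" "$2"
}

# Print the messages that the lifecycle configuration script wrote.
# Usage: lifecycle_log_messages my-notebook OnStart
lifecycle_log_messages() {
  aws logs get-log-events \
    --log-group-name "$LOG_GROUP" \
    --log-stream-name "$(lifecycle_log_stream "$1" "$2")" \
    --start-from-head \
    --query 'events[].message' \
    --output text
}
```

The last few messages before the log stops usually show which command in the script was still running when the time limit was reached.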
"This Notebook Instance type 'ml.m4.xlarge' is temporarily unavailable. We apologize for the inconvenience. Please try again in a few minutes, or try a different instance type."
This error happens when Amazon Elastic Compute Cloud (Amazon EC2) doesn't have enough available capacity for the instance type that you selected. Capacity varies based on the demand for that instance type in that Region at that time. Try the request again later to see if capacity levels have changed. Or, choose a different instance type.
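"Try again later" can be scripted as a bounded retry loop. The helper below is a generic sketch, not an AWS feature. It assumes you supply a command that exits non-zero when an attempt fails, and the name retry_with_delay is invented for this example.

```shell
# Run a command until it succeeds, waiting between attempts.
# Usage: retry_with_delay MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry_with_delay() {
  rw_max="$1"
  rw_delay="$2"
  shift 2
  rw_attempt=1
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$rw_attempt" -ge "$rw_max" ]; then
      echo "Giving up after $rw_attempt attempts" >&2
      return 1
    fi
    echo "Attempt $rw_attempt failed; retrying in ${rw_delay}s" >&2
    sleep "$rw_delay"
    rw_attempt=$((rw_attempt + 1))
  done
}
```

For instance, retry_with_delay 3 300 ./start_and_wait.sh would make up to three attempts five minutes apart, where start_and_wait.sh is a hypothetical script of your own that starts the instance and exits non-zero if it ends up in the Failed state. Keeping the attempt count small avoids hammering the API when capacity is unavailable for a long time; at that point a different instance type is the better option.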
HTTP 500 internal errors
An HTTP 500 error indicates that an unexpected error occurred while creating the notebook instance. To rule out transient issues, try creating the notebook instance again.