How do I troubleshoot issues when I bring my custom container to Amazon SageMaker for training or inference?


I want to troubleshoot issues when I bring my custom container to Amazon SageMaker for training or inference.

Short description

You can customize your container images in SageMaker using one of the following approaches:

Extend a pre-built SageMaker container: Use this approach if you need to customize your environment or framework by adding additional functionalities. With this approach, you don't have to build the container image from scratch because the deep learning libraries are already predefined.

Bring your own container: Use this approach when you have an already existing image for processing data, model training, or real-time inference with additional features and safety requirements that aren't currently supported by pre-built SageMaker images.

Build a container image from scratch: If you have a custom algorithm and don't have a custom container image yet, then it's a best practice to use this approach.

With any of these approaches, the errors that you get are most often related to an incorrectly built container image. Therefore, be sure that the container image is configured correctly.

Resolution

Extend a pre-built SageMaker container

  • Be sure that the environment variables SAGEMAKER_SUBMIT_DIRECTORY and SAGEMAKER_PROGRAM are set in the Dockerfile.
  • Be sure that you install the required additional libraries in your Dockerfile, as in the following example:
# SageMaker PyTorch image
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.5.1-cpu-py36-ubuntu16.04
ENV PATH="/opt/ml/code:${PATH}"

# this environment variable is used by the SageMaker PyTorch container to determine our user code directory.
ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code

# install the libraries using pip
COPY requirements.txt ./requirements.txt
RUN pip install -r requirements.txt

# /opt/ml and all subdirectories are used by SageMaker; store your user code in the /code subdirectory.
COPY cifar10.py /opt/ml/code/cifar10.py
# Defines cifar10.py as the script entry point
ENV SAGEMAKER_PROGRAM cifar10.py
  • After the image builds successfully, run the container in local mode, as in the following sketch, to make sure that the image works as expected.
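
The following is a minimal sketch of a local mode test that uses the SageMaker Python SDK. The image URI, IAM role ARN, and data path are placeholders for your own values, and the sketch assumes that Docker and the SageMaker Python SDK are installed on the machine where you run it.

from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="111122223333.dkr.ecr.us-east-1.amazonaws.com/my-extended-pytorch:latest",
    role="arn:aws:iam::111122223333:role/MySageMakerExecutionRole",
    instance_count=1,
    instance_type="local",  # runs the container locally instead of provisioning training instances
)

# fit() starts the training container locally and streams its logs to your terminal
estimator.fit("file:///home/ec2-user/SageMaker/cifar10-data")

If the container fails in local mode, the logs point to the misconfiguration before you launch a full training job.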

For more information, see Extend a prebuilt container.

Bring your own container

Be sure that you installed the appropriate SageMaker toolkit library for training or inference. These toolkits define the location of the code and other resources, along with the entry point that contains the code that runs when the container starts. When you create a SageMaker training job or inference endpoint, SageMaker creates the following directories:

/opt/ml
    ├── input
    │
    ├── model
    │
    ├── code
    │
    ├── output
    │
    └── failure

When you run a training job, the /opt/ml/input directory contains information about the data channel that's used to access the data stored in Amazon Simple Storage Service (Amazon S3). The training script (train.py), along with its dependencies, is stored in /opt/ml/code. Be sure that the script writes the final model to the /opt/ml/model directory when the training job completes.
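
The following is a minimal sketch of a training script that follows this directory layout. The model logic, the model.bin file name, and the "training" channel name are placeholders.

# train.py
import json
import os
import sys
import traceback

INPUT_DIR = "/opt/ml/input"
MODEL_DIR = "/opt/ml/model"
OUTPUT_DIR = "/opt/ml/output"

def main():
    # SageMaker writes the hyperparameters of the training job to this file
    with open(os.path.join(INPUT_DIR, "config/hyperparameters.json")) as f:
        hyperparameters = json.load(f)

    # Data from the "training" channel is downloaded to this directory
    training_dir = os.path.join(INPUT_DIR, "data/training")

    # ... train the model on the files in training_dir ...

    # Write the final model to /opt/ml/model so that SageMaker uploads it to Amazon S3
    with open(os.path.join(MODEL_DIR, "model.bin"), "wb") as f:
        f.write(b"serialized model placeholder")

if __name__ == "__main__":
    try:
        main()
        sys.exit(0)  # an exit code of 0 marks the training job as Completed
    except Exception:
        # The contents of /opt/ml/output/failure appear in the failure reason of the job description
        with open(os.path.join(OUTPUT_DIR, "failure"), "w") as f:
            f.write(traceback.format_exc())
        sys.exit(1)  # a non-zero exit code marks the training job as Failed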

When you host a trained model on SageMaker to make inferences, the model is stored in /opt/ml/model, and the inference code (inference.py) is stored in /opt/ml/code.
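
The following is a minimal sketch of the handler functions that the SageMaker Inference Toolkit and the prebuilt framework containers look for in the inference script. The model.bin file name and the JSON request format are placeholders.

# inference.py
import json
import os

def model_fn(model_dir):
    # Load the model artifact that SageMaker extracts into /opt/ml/model
    with open(os.path.join(model_dir, "model.bin"), "rb") as f:
        return f.read()  # replace with real model deserialization

def input_fn(request_body, request_content_type):
    # Convert the request payload into the object that predict_fn expects
    if request_content_type == "application/json":
        return json.loads(request_body)
    raise ValueError("Unsupported content type: " + str(request_content_type))

def predict_fn(input_data, model):
    # Replace with a real prediction from the loaded model
    return {"prediction": 0}

def output_fn(prediction, response_content_type):
    # Serialize the prediction into the response payload
    return json.dumps(prediction)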

For more information, see Adapting your own Docker container to work with SageMaker.

Build a container from scratch

  • To make sure that the container runs as an executable, use the exec form of the ENTRYPOINT instruction in your Dockerfile:
ENTRYPOINT ["python", "cifar10.py"]
  • For a training job, the training script must exit with 0 if the training is successful and a non-zero exit code if the training is unsuccessful.
  • Be sure that the final model is written to /opt/ml/model, and all the dependencies and artifacts are stored in /opt/ml/output. If a training job fails, the script must write the failure information to /opt/ml/output/failure.
  • When you create an inference endpoint, be sure that the model is saved as FILENAME.tar.gz. The container must respond to HTTP POST requests on /invocations for inference and HTTP GET requests on /ping for the endpoint health check, as shown in the following sketch.
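
The following is a minimal sketch of such an inference server. It assumes Flask as the web framework, but any server that answers on port 8080 inside the container works; the prediction logic is a placeholder.

# serve.py
import json
from flask import Flask, Response, request

app = Flask(__name__)

@app.route("/ping", methods=["GET"])
def ping():
    # Return 200 when the container is healthy and ready to serve requests
    return Response(status=200)

@app.route("/invocations", methods=["POST"])
def invocations():
    payload = json.loads(request.data)
    result = {"prediction": 0}  # replace with a real model prediction
    return Response(json.dumps(result), status=200, mimetype="application/json")

if __name__ == "__main__":
    # SageMaker sends inference requests to port 8080 inside the container
    app.run(host="0.0.0.0", port=8080)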

For more information, see Create a container with your own algorithms and models.


Related information

Use the Amazon SageMaker local mode to train on your notebook instance
