How do I troubleshoot issues when I bring my custom container to Amazon SageMaker for training or inference?


I want to troubleshoot issues when I bring my custom container to Amazon SageMaker for training or inference.

Short description

You can customize your container images in SageMaker using one of the following approaches:

Extend a pre-built SageMaker container: Use this approach if you need to customize your environment or framework by adding additional functionalities. With this approach, you don't have to build the container image from scratch because the deep learning libraries are already predefined.

Bring your own container: Use this approach when you have an already existing image for processing data, model training, or real-time inference with additional features and safety requirements that aren't currently supported by pre-built SageMaker images.

Build a container image from scratch: If you have a custom algorithm and don't have a custom container image yet, then it's a best practice to use this approach.

With any of these approaches, the errors that you get are most often related to an incorrectly built container image. Therefore, be sure that the container image is configured correctly.

Resolution

Extend a pre-built SageMaker container

  • Be sure that the environment variables SAGEMAKER_SUBMIT_DIRECTORY and SAGEMAKER_PROGRAM are set in the Dockerfile.
  • Be sure that you install the required additional libraries in your Dockerfile, as in the following example:
# SageMaker PyTorch image
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.5.1-cpu-py36-ubuntu16.04
ENV PATH="/opt/ml/code:${PATH}"

# this environment variable is used by the SageMaker PyTorch container to determine our user code directory.
ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code

# install the libraries using pip
COPY requirements.txt ./requirements.txt
RUN pip install -r requirements.txt

# /opt/ml and all subdirectories are used by SageMaker; store your user code in the /code subdirectory.
COPY cifar10.py /opt/ml/code/cifar10.py
# Defines cifar10.py as the script entry point
ENV SAGEMAKER_PROGRAM cifar10.py
  • After the image builds successfully, run the container in local mode, as in the following sketch, to make sure that the image works as expected.
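
The following is a minimal sketch of a local mode test that uses the SageMaker Python SDK. The image URI, IAM role ARN, and data path are placeholders for your own values, and the sketch assumes that Docker and the SageMaker Python SDK are installed on the machine where you run it.

from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="111122223333.dkr.ecr.us-east-1.amazonaws.com/my-extended-pytorch:latest",
    role="arn:aws:iam::111122223333:role/MySageMakerExecutionRole",
    instance_count=1,
    instance_type="local",  # runs the container locally instead of provisioning training instances
)

# fit() starts the training container locally and streams its logs to your terminal
estimator.fit("file:///home/ec2-user/SageMaker/cifar10-data")

If the container fails in local mode, the logs point to the misconfiguration before you launch a full training job.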

For more information, see Extend a prebuilt container.

Bring your own container

Be sure that you installed the appropriate SageMaker toolkit library for training or inference. These toolkits define the location of the code and other resources, along with the entry point that contains the code that runs when the container starts. When you create a SageMaker training job or inference endpoint, SageMaker creates the following directories:

/opt/ml
    ├── input
    │
    ├── model
    │
    ├── code
    │
    ├── output
    │
    └── failure

When you run a training job, the /opt/ml/input directory contains information about the data channel that's used to access the data stored in Amazon Simple Storage Service (Amazon S3). The training script (train.py), along with its dependencies, is stored in /opt/ml/code. Be sure that the script writes the final model to the /opt/ml/model directory when the training job completes.
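
The following is a minimal sketch of a training script that follows this directory layout. The model logic, the model.bin file name, and the "training" channel name are placeholders.

# train.py
import json
import os
import sys
import traceback

INPUT_DIR = "/opt/ml/input"
MODEL_DIR = "/opt/ml/model"
OUTPUT_DIR = "/opt/ml/output"

def main():
    # SageMaker writes the hyperparameters of the training job to this file
    with open(os.path.join(INPUT_DIR, "config/hyperparameters.json")) as f:
        hyperparameters = json.load(f)

    # Data from the "training" channel is downloaded to this directory
    training_dir = os.path.join(INPUT_DIR, "data/training")

    # ... train the model on the files in training_dir ...

    # Write the final model to /opt/ml/model so that SageMaker uploads it to Amazon S3
    with open(os.path.join(MODEL_DIR, "model.bin"), "wb") as f:
        f.write(b"serialized model placeholder")

if __name__ == "__main__":
    try:
        main()
        sys.exit(0)  # an exit code of 0 marks the training job as Completed
    except Exception:
        # The contents of /opt/ml/output/failure appear in the failure reason of the job description
        with open(os.path.join(OUTPUT_DIR, "failure"), "w") as f:
            f.write(traceback.format_exc())
        sys.exit(1)  # a non-zero exit code marks the training job as Failed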

When you host a trained model on SageMaker to make inferences, the model is stored in /opt/ml/model, and the inference code (inference.py) is stored in /opt/ml/code.
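
The following is a minimal sketch of the handler functions that the SageMaker Inference Toolkit and the prebuilt framework containers look for in the inference script. The model.bin file name and the JSON request format are placeholders.

# inference.py
import json
import os

def model_fn(model_dir):
    # Load the model artifact that SageMaker extracts into /opt/ml/model
    with open(os.path.join(model_dir, "model.bin"), "rb") as f:
        return f.read()  # replace with real model deserialization

def input_fn(request_body, request_content_type):
    # Convert the request payload into the object that predict_fn expects
    if request_content_type == "application/json":
        return json.loads(request_body)
    raise ValueError("Unsupported content type: " + str(request_content_type))

def predict_fn(input_data, model):
    # Replace with a real prediction from the loaded model
    return {"prediction": 0}

def output_fn(prediction, response_content_type):
    # Serialize the prediction into the response payload
    return json.dumps(prediction)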

For more information, see Adapting your own Docker container to work with SageMaker.

Build a container from scratch

  • To make sure that the container runs as an executable, use the exec form of the ENTRYPOINT instruction in your Dockerfile:
ENTRYPOINT ["python", "cifar10.py"]
  • For a training job, the training script must exit with 0 if the training is successful and a non-zero exit code if the training is unsuccessful.
  • Be sure that the final model is written to /opt/ml/model, and all the dependencies and artifacts are stored in /opt/ml/output. If a training job fails, the script must write the failure information to /opt/ml/output/failure.
  • When you create an inference endpoint, be sure that the model is saved as FILENAME.tar.gz. The container must respond to HTTP POST requests on /invocations for inference and HTTP GET requests on /ping for the endpoint health check, as shown in the following sketch.
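
The following is a minimal sketch of such an inference server. It assumes Flask as the web framework, but any server that answers on port 8080 inside the container works; the prediction logic is a placeholder.

# serve.py
import json
from flask import Flask, Response, request

app = Flask(__name__)

@app.route("/ping", methods=["GET"])
def ping():
    # Return 200 when the container is healthy and ready to serve requests
    return Response(status=200)

@app.route("/invocations", methods=["POST"])
def invocations():
    payload = json.loads(request.data)
    result = {"prediction": 0}  # replace with a real model prediction
    return Response(json.dumps(result), status=200, mimetype="application/json")

if __name__ == "__main__":
    # SageMaker sends inference requests to port 8080 inside the container
    app.run(host="0.0.0.0", port=8080)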

For more information, see Create a container with your own algorithms and models.


Related information

Use the Amazon SageMaker local mode to train on your notebook instance
