How can I troubleshoot the InternalServerError response on Amazon SageMaker?

Last updated: 2020-09-24

When I run a training job or make a prediction in Amazon SageMaker, the request fails with the response "HTTP Error 500: Internal Server Error" or "InternalServerError: We encountered an internal error. Please try again." How can I find the root cause of the error?

Resolution

Training jobs

Retry the request. If TrainingJobStatus still shows Failed, review the FailureReason field to determine why the job failed. To see the full error stack, check the Amazon CloudWatch logs by following the steps in the Inference requests section.
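
The status check above can also be scripted. The following is a minimal sketch, not a definitive implementation. It assumes a client that behaves like boto3.client("sagemaker") and uses its describe_training_job call. The helper name summarize_training_failure and the job name my-training-job are hypothetical and used only for illustration.

```python
def summarize_training_failure(sagemaker_client, job_name):
    """Return (TrainingJobStatus, FailureReason) for a training job.

    sagemaker_client is assumed to behave like boto3.client("sagemaker").
    FailureReason is None when the response does not include one.
    """
    response = sagemaker_client.describe_training_job(TrainingJobName=job_name)
    status = response["TrainingJobStatus"]
    reason = response.get("FailureReason")
    return status, reason


# Example call (requires boto3, credentials, and a configured Region;
# "my-training-job" is a placeholder):
#   import boto3
#   status, reason = summarize_training_failure(
#       boto3.client("sagemaker"), "my-training-job")
#   print(status, reason)
```

Passing the client in as an argument keeps the helper easy to exercise with a stub before you point it at a real account.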

Inference requests

Check the CloudWatch logs associated with the endpoint to determine the root cause. The same steps apply to training jobs:

1.    Open the SageMaker console.

2.    Choose Training jobs or Endpoints, depending on where the error occurred.

3.    Choose the name of the training job or endpoint.

4.    In the Monitor section, choose View logs to open the CloudWatch console.

5.    In the CloudWatch console, choose the log stream for the training job or endpoint:

For training jobs, the log stream is located in the /aws/sagemaker/TrainingJobs CloudWatch log group. The stream name starts with the training job name (for example, imageclassfi-job-2020-04-05/xxxxx).

For endpoints, the log stream is located in the /aws/sagemaker/Endpoints/endpoint_name CloudWatch log group. The stream name starts with the variant name (for example, AllTraffic/i-xxxxxxx).

6.    Review the logs to find the detailed error message. To simplify this process, you can add debugging code in your training job or inference script. For example, print the job status or the actual value of your dataset. Then, look for that printed message in the CloudWatch logs.

The following is an example of inference code for debugging an endpoint. You can use this code to confirm that the predict() function was called. The code also prints the data variable, which shows the actual values that are passed to the endpoint. In this example, MYDEBUG is the keyword to search for in the CloudWatch log stream.

import flask
import pandas as pd
from io import StringIO

def predict():
    data = None
    print("MYDEBUG: Predict function called")
    # Convert the CSV request body to a pandas DataFrame
    if flask.request.content_type == 'text/csv':
        data = flask.request.data.decode('utf-8')
        s = StringIO(data)
        data = pd.read_csv(s, header=None)
        # Print the first rows of the actual dataset
        print("MYDEBUG: Printing data")
        print(data.head(10))

7.    Use the information in the logs to determine the root cause of the error.
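
The log search in steps 5 and 6 can also be done programmatically. The following is a minimal sketch, assuming a client that behaves like boto3.client("logs") and using its filter_log_events call. The helper names endpoint_log_group and find_debug_messages are hypothetical, as is the endpoint name my-endpoint. The log group layout and the AllTraffic variant prefix come from the steps above.

```python
TRAINING_LOG_GROUP = "/aws/sagemaker/TrainingJobs"


def endpoint_log_group(endpoint_name):
    """Build the CloudWatch log group name for a SageMaker endpoint."""
    return f"/aws/sagemaker/Endpoints/{endpoint_name}"


def find_debug_messages(logs_client, log_group, stream_prefix, keyword="MYDEBUG"):
    """Return log messages that contain keyword.

    Only streams whose names start with stream_prefix are searched
    (a training job name, or a variant name such as "AllTraffic/").
    logs_client is assumed to behave like boto3.client("logs").
    """
    kwargs = {
        "logGroupName": log_group,
        "logStreamNamePrefix": stream_prefix,
        # Double quotes make the filter pattern match the exact term.
        "filterPattern": f'"{keyword}"',
    }
    messages = []
    while True:
        page = logs_client.filter_log_events(**kwargs)
        messages.extend(event["message"] for event in page.get("events", []))
        token = page.get("nextToken")
        if not token:
            return messages
        # Results can span several pages; request the next one.
        kwargs["nextToken"] = token


# Example call (requires boto3 and credentials; "my-endpoint" is a placeholder):
#   import boto3
#   logs = boto3.client("logs")
#   for line in find_debug_messages(
#           logs, endpoint_log_group("my-endpoint"), "AllTraffic/"):
#       print(line)
```

For a training job, pass TRAINING_LOG_GROUP as the log group and the training job name as the stream prefix.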

