How can I troubleshoot the InternalServerError response on Amazon SageMaker?

Last updated: 2022-10-25

When I run an Amazon SageMaker processing, training, or prediction job, the request fails with the response "HTTP Error 500: Internal Server Error" or "InternalServerError: We encountered an internal error. Please try again".

Resolution

If your SageMaker job or inference request to an endpoint fails with "InternalServerError", then retry the request. A retry usually succeeds when the failure was caused by a transient issue.
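One way to automate the retry is a small wrapper with exponential backoff. The following is a minimal sketch, not an official AWS utility: retry_with_backoff and looks_like_internal_error are hypothetical helper names, and the error check only matches the message text quoted in this article.

```python
import random
import time


def retry_with_backoff(call, is_retryable, max_attempts=4, base_delay=1.0,
                       sleep=time.sleep):
    """Call `call()` until it succeeds, retrying only retryable errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception as error:
            # Give up on non-transient errors or when attempts are exhausted.
            if attempt == max_attempts or not is_retryable(error):
                raise
            # Exponential backoff with jitter: wait up to 1s, 2s, 4s, ...
            delay = base_delay * 2 ** (attempt - 1)
            sleep(random.uniform(0, delay))


def looks_like_internal_error(error):
    """Hypothetical check: match the error text quoted in this article."""
    text = str(error)
    return "InternalServerError" in text or "Internal Server Error" in text
```

To use the wrapper, pass a zero-argument callable that sends the request, for example a lambda that wraps your endpoint invocation. If the error persists after the final attempt, the original exception is raised so that you can continue with the log review.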

If the failure still occurs, then follow these steps to review the job or endpoint's logs on Amazon CloudWatch.

Review CloudWatch logs

Check the CloudWatch logs associated with the SageMaker resource to determine the root cause:

1.    Open the SageMaker console.

2.    Choose the relevant resource under Processing, Training, or Inference.

3.    Choose the name of the endpoint, processing, or training job.

4.    In the Monitoring section, choose View logs to open the CloudWatch console.

5.    In the CloudWatch console, choose the log stream for the job or endpoint.

6.    If there's no log stream, or if the log stream is empty, confirm that the resource's execution role has a policy with the following permissions:

{
 "Effect": "Allow",
 "Action": [
 "cloudwatch:PutMetricData",
 "logs:CreateLogStream",
 "logs:PutLogEvents",
 "logs:CreateLogGroup",
 "logs:DescribeLogStreams",
 "ecr:GetAuthorizationToken"
 ],
 "Resource": "*"
}

7.    Review the logs to find the error message.
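The check for a missing log stream in the preceding steps can also be scripted. The following is a sketch under stated assumptions: the helper names are hypothetical, logs_client is assumed to behave like boto3.client("logs"), and the log group names are the ones that SageMaker typically uses. Verify them in your account.

```python
def sagemaker_log_group(resource_type, endpoint_name=None):
    """Return the CloudWatch log group that SageMaker typically writes to."""
    if resource_type == "endpoint":
        return f"/aws/sagemaker/Endpoints/{endpoint_name}"
    groups = {
        "training": "/aws/sagemaker/TrainingJobs",
        "processing": "/aws/sagemaker/ProcessingJobs",
    }
    return groups[resource_type]


def list_log_streams(logs_client, log_group, name_prefix=None):
    """Return log stream names in the group (first page of results only).

    `logs_client` is assumed to behave like boto3.client("logs").
    """
    kwargs = {"logGroupName": log_group}
    if name_prefix:
        # For jobs, stream names usually start with the job name.
        kwargs["logStreamNamePrefix"] = name_prefix
    response = logs_client.describe_log_streams(**kwargs)
    return [stream["logStreamName"] for stream in response.get("logStreams", [])]
```

An empty list for a job that has already started suggests that the execution role might be missing the logging permissions shown in step 6.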

Add debugging code to your inference script (optional)

To simplify the log review process, you can add debugging code to your inference script. The following is an example of inference code for debugging an endpoint. Use this code to confirm that the predict() function was called. The code also prints the data variable so that you can see the actual value that's passed to the endpoint. In this example, MYDEBUG is the keyword to search for in the CloudWatch log stream.

import io

import flask
import pandas as pd


def predict():
    data = None
    # flush=True writes the message immediately instead of buffering it
    print("MYDEBUG: Predict function called", flush=True)
    # Convert from CSV to pandas
    if flask.request.content_type == 'text/csv':
        data = flask.request.data.decode('utf-8')
        s = io.StringIO(data)
        data = pd.read_csv(s, header=None)
        # Print the first rows of the actual data set
        print("MYDEBUG: Printing data", flush=True)
        print(data.head(10), flush=True)
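After the debugging lines are in the logs, you can search for the keyword from a script instead of the console. The following is a minimal sketch: find_debug_messages is a hypothetical helper, and logs_client is assumed to behave like boto3.client("logs"), whose filter_log_events call accepts a filter pattern.

```python
def find_debug_messages(logs_client, log_group, keyword="MYDEBUG", limit=50):
    """Return log messages that contain `keyword` (first page of results only).

    `logs_client` is assumed to behave like boto3.client("logs").
    """
    response = logs_client.filter_log_events(
        logGroupName=log_group,
        # Quote the keyword so that it is matched as a literal term.
        filterPattern=f'"{keyword}"',
        limit=limit,
    )
    return [event["message"] for event in response.get("events", [])]
```

If the search returns no messages after you send a request, the request might not be reaching the predict() function at all.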

Troubleshooting other common causes of "InternalServerError"

Resource utilization

A SageMaker job might fail with "InternalServerError" if the job's container exhausts the instance's CPU, memory, or disk. To check resource utilization, review the instance's CPUUtilization, MemoryUtilization, and DiskUtilization metrics in CloudWatch.

To review the instance metrics, follow these steps:

1.    Open the SageMaker console.

2.    Under Processing or Training, choose the job type that matches your job.

3.    Choose the job name.

4.    In the Monitoring section, choose View instance metrics to open the CloudWatch console.

If the metrics show that the job is close to the instance's limits, then switch to a larger instance type, or attach a larger storage volume to the existing instance.
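The same metrics can be read programmatically. The following is a sketch, assuming that cloudwatch_client behaves like boto3.client("cloudwatch") and that training job instance metrics are published in the /aws/sagemaker/TrainingJobs namespace with a Host dimension of the form job-name/algo-1. The function name, the default namespace, and the host suffix are illustrative assumptions, so confirm them against the metrics that you see in the console.

```python
from datetime import datetime, timedelta, timezone


def peak_utilization(cloudwatch_client, job_name, metric_name, hours=24,
                     namespace="/aws/sagemaker/TrainingJobs",
                     host_suffix="algo-1"):
    """Return the highest value of a metric over the last `hours`, or None.

    `cloudwatch_client` is assumed to behave like boto3.client("cloudwatch").
    """
    end = datetime.now(timezone.utc)
    response = cloudwatch_client.get_metric_statistics(
        Namespace=namespace,
        MetricName=metric_name,
        Dimensions=[{"Name": "Host", "Value": f"{job_name}/{host_suffix}"}],
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
        Period=300,  # five-minute buckets
        Statistics=["Maximum"],
    )
    points = response.get("Datapoints", [])
    # None means that no data points were returned for the time window.
    return max((point["Maximum"] for point in points), default=None)
```

For example, calling the function with "MemoryUtilization" or "DiskUtilization" and getting a value near 100 suggests that the job is running out of that resource.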

Missing EC2 permissions in the SageMaker execution role

A SageMaker job might fail with "InternalServerError" when the job's execution role is missing required Amazon Elastic Compute Cloud (Amazon EC2) permissions. If you specify a VpcConfig object in your SageMaker job, then confirm that the job's SageMaker execution role has a policy with the following permissions:

{
 "Effect": "Allow",
 "Action": [
 "ec2:CreateNetworkInterface",
 "ec2:CreateNetworkInterfacePermission",
 "ec2:DeleteNetworkInterface",
 "ec2:DeleteNetworkInterfacePermission",
 "ec2:DescribeNetworkInterfaces",
 "ec2:DescribeVpcs",
 "ec2:DescribeDhcpOptions",
 "ec2:DescribeSubnets",
 "ec2:DescribeSecurityGroups"
 ],
 "Resource": "*"
}
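To compare a role's policy against this list, you can run a quick local check on the policy JSON. The following is a simplified sketch: missing_actions is a hypothetical helper that only reads Allow statements and their Action entries. It ignores Deny statements, NotAction, Resource, and Condition, so treat the result as a hint and not as a full IAM evaluation.

```python
from fnmatch import fnmatchcase

# Actions listed in this article for jobs that specify a VpcConfig.
REQUIRED_VPC_ACTIONS = [
    "ec2:CreateNetworkInterface",
    "ec2:CreateNetworkInterfacePermission",
    "ec2:DeleteNetworkInterface",
    "ec2:DeleteNetworkInterfacePermission",
    "ec2:DescribeNetworkInterfaces",
    "ec2:DescribeVpcs",
    "ec2:DescribeDhcpOptions",
    "ec2:DescribeSubnets",
    "ec2:DescribeSecurityGroups",
]


def missing_actions(policy_document, required_actions):
    """Return the required actions that no Allow statement appears to cover."""
    statements = policy_document.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]
    allowed = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        allowed.extend(actions)
    # Action names are case-insensitive and can contain wildcards (ec2:Describe*).
    return [
        required for required in required_actions
        if not any(fnmatchcase(required.lower(), pattern.lower())
                   for pattern in allowed)
    ]
```

An empty result means that every listed action is matched by at least one Allow statement in the policy document that you passed in.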

For more information, see SageMaker roles.