How can I troubleshoot the InternalServerError response on Amazon SageMaker?


When I run an Amazon SageMaker processing, training, or prediction job, the request fails with the response "HTTP Error 500: Internal Server Error" or "InternalServerError: We encountered an internal error. Please try again".

Resolution

If your SageMaker job or inference request to an endpoint fails with "InternalServerError", then retry the request. If a transient issue caused the failure, then the retry usually succeeds.
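The following is a minimal sketch of a retry with exponential backoff and jitter. It isn't part of any SageMaker API. The names retry_with_backoff and looks_like_internal_error are hypothetical helpers, and the text match against the error message is an assumption based on the error strings quoted in this article. Note that the AWS SDKs already retry some errors automatically, so a wrapper like this is useful only when those built-in retries are exhausted.

```python
import random
import time


def retry_with_backoff(call, is_retryable, max_attempts=4, base_delay=1.0,
                       sleep=time.sleep):
    """Run call() and retry it while is_retryable(error) returns True."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception as error:
            # Give up on the last attempt or on errors that a retry won't fix.
            if attempt == max_attempts or not is_retryable(error):
                raise
            # Exponential backoff with full jitter: wait a random time of up to
            # base_delay * 1, 2, 4, ... seconds before the next attempt.
            delay = base_delay * (2 ** (attempt - 1))
            sleep(random.uniform(0, delay))


def looks_like_internal_error(error):
    """Hypothetical check: match the error text quoted in this article."""
    text = str(error)
    return "InternalServerError" in text or "Internal Server Error" in text
```

To use the sketch, wrap the failing request in a function with no arguments, for example retry_with_backoff(lambda: send_request(), looks_like_internal_error), where send_request is a placeholder for your own call.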

If the failure still occurs, then follow these steps to review the job or endpoint's logs on Amazon CloudWatch.

Review CloudWatch logs

Check the CloudWatch logs associated with the SageMaker resource to determine the root cause:

1. Open the SageMaker console.

2. Choose the relevant resource under Processing, Training, or Inference.

3. Choose the name of the endpoint, processing, or training job.

4. In the Monitoring section, choose View logs to open the CloudWatch console.

5. In the CloudWatch console, choose the log stream for the job or endpoint.

6. If there's no log stream, or if the log stream is empty, confirm that the resource's execution role has a policy with the following permissions:

{
    "Effect": "Allow",
    "Action": [
        "cloudwatch:PutMetricData",
        "logs:CreateLogStream",
        "logs:PutLogEvents",
        "logs:CreateLogGroup",
        "logs:DescribeLogStreams",
        "ecr:GetAuthorizationToken"
    ],
    "Resource": "*"
}

7. Review the logs to find the error message.
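As an alternative to the console, the log review can be scripted. The following sketch assumes the default SageMaker log groups (/aws/sagemaker/TrainingJobs, /aws/sagemaker/ProcessingJobs, and /aws/sagemaker/Endpoints/ followed by the endpoint name) and uses the CloudWatch Logs filter_log_events operation. The helper names log_group_for and find_log_messages are hypothetical, and the logs client is passed in so that the functions don't depend on how you create it.

```python
def log_group_for(resource_type, endpoint_name=None):
    """Return the CloudWatch log group that SageMaker writes to by default."""
    if resource_type == "endpoint":
        return f"/aws/sagemaker/Endpoints/{endpoint_name}"
    groups = {
        "training": "/aws/sagemaker/TrainingJobs",
        "processing": "/aws/sagemaker/ProcessingJobs",
    }
    return groups[resource_type]


def find_log_messages(logs_client, log_group, pattern, stream_prefix=None):
    """Collect all log messages that match a CloudWatch Logs filter pattern."""
    kwargs = {"logGroupName": log_group, "filterPattern": pattern}
    if stream_prefix:
        # Job log streams start with the job name, so the prefix narrows
        # the search to a single job.
        kwargs["logStreamNamePrefix"] = stream_prefix
    messages = []
    while True:
        page = logs_client.filter_log_events(**kwargs)
        messages.extend(event["message"] for event in page.get("events", []))
        token = page.get("nextToken")
        if not token:
            return messages
        kwargs["nextToken"] = token


# Example usage (assumes boto3 is installed and credentials are configured;
# "my-training-job" is a placeholder job name):
#   import boto3
#   logs = boto3.client("logs")
#   for line in find_log_messages(logs, log_group_for("training"),
#                                 "Error", stream_prefix="my-training-job"):
#       print(line)
```

The filter pattern is case sensitive, so you might need to run the search more than one time with different terms, such as "Error" and "Traceback".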

Add debugging code to your inference script (optional)

To simplify the log review process, you can add debugging code to your inference script. The following is an example of inference code for debugging an endpoint. You can use this code to confirm that you correctly called the predict() function. The code also prints the data variable that shows you the actual value that's passed to the endpoint. In this example, MYDEBUG is the keyword to search for in the CloudWatch log stream.

import io

import flask
import pandas as pd

def predict():
    data = None
    print("MYDEBUG: Predict function called")
    # Convert from CSV to pandas
    if flask.request.content_type == 'text/csv':
        data = flask.request.data.decode('utf-8')
        # io.StringIO replaces the Python 2 StringIO.StringIO class
        s = io.StringIO(data)
        data = pd.read_csv(s, header=None)
        # Print the first rows of the data that the endpoint received
        print("MYDEBUG: Printing data")
        print(data.head(10))

Troubleshooting other common causes of "InternalServerError"

Resource utilization

A SageMaker job might fail with "InternalServerError" if the job's container on the instance uses up the instance's resources. You can view resource utilization by reviewing the instance's CPUUtilization, MemoryUtilization, and DiskUtilization metrics in CloudWatch.

To review the instance metrics, follow these steps:

1. Open the SageMaker console.

2. In the Processing/Training Jobs section, choose Processing/Training.

3. Choose the job name.

4. In the Monitoring section, choose View instance metrics to open the CloudWatch console.

If the metrics show that the job is exhausting the instance's CPU, memory, or disk, then switch to a larger instance type or attach a larger storage volume to the instance.
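The same metrics can also be read programmatically. The following sketch assumes that the job publishes instance metrics in the /aws/sagemaker/TrainingJobs namespace (or /aws/sagemaker/ProcessingJobs for processing jobs) with a Host dimension in the form job-name/algo-1. It uses the CloudWatch get_metric_statistics operation. The function name peak_utilization is hypothetical, and the CloudWatch client is passed in as an argument.

```python
METRICS = ("CPUUtilization", "MemoryUtilization", "DiskUtilization")


def peak_utilization(cloudwatch_client, job_name, start, end,
                     namespace="/aws/sagemaker/TrainingJobs", host="algo-1"):
    """Return the highest reported value of each metric, or None if no data."""
    peaks = {}
    for metric in METRICS:
        response = cloudwatch_client.get_metric_statistics(
            Namespace=namespace,
            MetricName=metric,
            Dimensions=[{"Name": "Host", "Value": f"{job_name}/{host}"}],
            StartTime=start,
            EndTime=end,
            Period=60,  # one data point per minute
            Statistics=["Maximum"],
        )
        values = [point["Maximum"] for point in response.get("Datapoints", [])]
        peaks[metric] = max(values) if values else None
    return peaks


# Example usage (assumes boto3 is installed and credentials are configured;
# "my-training-job" is a placeholder job name):
#   import boto3
#   from datetime import datetime, timedelta, timezone
#   end = datetime.now(timezone.utc)
#   start = end - timedelta(hours=6)
#   print(peak_utilization(boto3.client("cloudwatch"), "my-training-job",
#                          start, end))
```

A peak that stays near the instance's limit for memory or disk is a hint that the container ran out of that resource. For jobs that run on more than one instance, repeat the call with host set to algo-2, algo-3, and so on.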

Missing EC2 permissions in the SageMaker execution role

A SageMaker job might fail with "InternalServerError" when the Amazon Elastic Compute Cloud (Amazon EC2) permissions in its execution role aren't properly configured. When you specify a VpcConfig object in your SageMaker job, confirm that the job's SageMaker execution role has a policy with the following permissions:

{
    "Effect": "Allow",
    "Action": [
        "ec2:CreateNetworkInterface",
        "ec2:CreateNetworkInterfacePermission",
        "ec2:DeleteNetworkInterface",
        "ec2:DeleteNetworkInterfacePermission",
        "ec2:DescribeNetworkInterfaces",
        "ec2:DescribeVpcs",
        "ec2:DescribeDhcpOptions",
        "ec2:DescribeSubnets",
        "ec2:DescribeSecurityGroups"
    ],
    "Resource": "*"
}

For more information, see SageMaker roles.
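To compare a policy document against the preceding list, you can use a rough check like the following sketch. The function name missing_actions is hypothetical, and the check is deliberately simplified: it looks only at Allow statements and their Action elements, and it ignores Deny statements, NotAction, Resource, Condition, permissions boundaries, and organization-level policies. Treat an empty result as a hint, not as proof that the role has access.

```python
from fnmatch import fnmatchcase

REQUIRED_EC2_ACTIONS = [
    "ec2:CreateNetworkInterface",
    "ec2:CreateNetworkInterfacePermission",
    "ec2:DeleteNetworkInterface",
    "ec2:DeleteNetworkInterfacePermission",
    "ec2:DescribeNetworkInterfaces",
    "ec2:DescribeVpcs",
    "ec2:DescribeDhcpOptions",
    "ec2:DescribeSubnets",
    "ec2:DescribeSecurityGroups",
]


def _as_list(value):
    """IAM lets Statement and Action be a single item or a list."""
    return [value] if isinstance(value, (str, dict)) else list(value)


def missing_actions(policy_document, required_actions):
    """Return the required actions that no Allow statement mentions."""
    allowed_patterns = []
    for statement in _as_list(policy_document.get("Statement", [])):
        if statement.get("Effect") == "Allow":
            allowed_patterns.extend(_as_list(statement.get("Action", [])))
    # Action names are case insensitive and can contain * and ? wildcards.
    return [
        action for action in required_actions
        if not any(fnmatchcase(action.lower(), pattern.lower())
                   for pattern in allowed_patterns)
    ]
```

For example, a policy that allows only "ec2:Describe*" and "ec2:CreateNetworkInterface" is reported as missing ec2:CreateNetworkInterfacePermission, ec2:DeleteNetworkInterface, and ec2:DeleteNetworkInterfacePermission.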

Related information

Logging and monitoring

CreateProcessingJob API: Execution role permissions

SageMaker jobs and endpoint metrics

Connect SageMaker Studio Notebooks in a VPC to external resources
