How do I troubleshoot errors when running Amazon SageMaker training jobs?

I want to troubleshoot errors when running Amazon SageMaker training jobs.

Resolution

A SageMaker training job can fail for several reasons. To identify the cause, check the failure reason on the SageMaker console or in the response of the DescribeTrainingJob API call. Then use the following troubleshooting steps based on the error that the failed training job reports.
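As a minimal sketch of the API route, the following Python helper reads the status and failure reason from a DescribeTrainingJob response. It assumes a boto3 SageMaker client is passed in. The function name get_failure_details and the job name used afterward are hypothetical and chosen only for illustration.

```python
def get_failure_details(sagemaker_client, job_name):
    """Return the status fields of a training job from DescribeTrainingJob.

    sagemaker_client is assumed to be a boto3 SageMaker client,
    for example boto3.client("sagemaker").
    """
    response = sagemaker_client.describe_training_job(TrainingJobName=job_name)
    return {
        "status": response["TrainingJobStatus"],
        "secondary_status": response.get("SecondaryStatus"),
        # FailureReason is only present in the response when the job failed.
        "failure_reason": response.get("FailureReason"),
    }
```

With boto3 installed and AWS credentials configured, a call such as get_failure_details(boto3.client("sagemaker"), "example-training-job-name") should return a dictionary whose failure_reason entry is None for a job that did not fail.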

Internal Server Error

If your SageMaker training job fails with an Internal Server Error, retry the job to rule out a transient issue. If the job fails again, review the training job logs in Amazon CloudWatch. The logs are in the log group /aws/sagemaker/TrainingJobs, in a log stream with a name similar to the following:

example-training-job-name/algo-example-instance-number-in-cluster-example-epoch-timestamp
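The same logs can be read programmatically. The following sketch assumes a boto3 CloudWatch Logs client and uses the log group and the job-name prefix described above. The function name fetch_training_logs and the max_events_per_stream parameter are hypothetical. The sketch doesn't handle pagination, so it reads only the first page of log streams and a limited number of recent events from each stream.

```python
LOG_GROUP = "/aws/sagemaker/TrainingJobs"


def fetch_training_logs(logs_client, job_name, max_events_per_stream=100):
    """Return {log_stream_name: [message, ...]} for one training job.

    logs_client is assumed to be a boto3 CloudWatch Logs client,
    for example boto3.client("logs").
    """
    # Each instance in the training cluster writes to its own stream,
    # and every stream name starts with "<job name>/".
    streams = logs_client.describe_log_streams(
        logGroupName=LOG_GROUP,
        logStreamNamePrefix=job_name + "/",
    )["logStreams"]

    messages = {}
    for stream in streams:
        name = stream["logStreamName"]
        # startFromHead=False requests the most recent events, which is
        # usually where the error that ended the job appears.
        events = logs_client.get_log_events(
            logGroupName=LOG_GROUP,
            logStreamName=name,
            startFromHead=False,
            limit=max_events_per_stream,
        )["events"]
        messages[name] = [event["message"] for event in events]
    return messages
```

For a job that ran on more than one instance, the result contains one entry per instance, so you can compare the final messages from each host.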

Also review job metrics, such as CPUUtilization, MemoryUtilization, and DiskUtilization, to check whether the failure was caused by resource exhaustion.
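To check these metrics without the console, a sketch along the following lines can help. It assumes a boto3 CloudWatch client, the /aws/sagemaker/TrainingJobs metric namespace, and a Host dimension of the form job-name/algo-1. The function name peak_utilization and its host and hours parameters are hypothetical, and the default lookback window is an arbitrary choice.

```python
from datetime import datetime, timedelta, timezone

NAMESPACE = "/aws/sagemaker/TrainingJobs"
METRICS = ("CPUUtilization", "MemoryUtilization", "DiskUtilization")


def peak_utilization(cloudwatch_client, job_name, host="algo-1", hours=24):
    """Return {metric_name: peak value or None} for one training instance.

    cloudwatch_client is assumed to be a boto3 CloudWatch client,
    for example boto3.client("cloudwatch").
    """
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)
    peaks = {}
    for metric in METRICS:
        datapoints = cloudwatch_client.get_metric_statistics(
            Namespace=NAMESPACE,
            MetricName=metric,
            Dimensions=[{"Name": "Host", "Value": f"{job_name}/{host}"}],
            StartTime=start,
            EndTime=end,
            Period=300,  # five-minute buckets keep the datapoint count small
            Statistics=["Maximum"],
        )["Datapoints"]
        # None means that no datapoints were found in the time window.
        peaks[metric] = max((p["Maximum"] for p in datapoints), default=None)
    return peaks
```

Memory or disk peaks close to 100 percent suggest that the instance ran out of that resource. On instances with multiple CPUs, CPUUtilization can be reported above 100 percent, so interpret that value relative to the number of CPUs.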

To access the training job logs and job metrics, complete the following steps:

  1. Open the SageMaker console.
  2. Choose Training jobs.
  3. Choose the name of the training job that you want to review.
  4. In the Monitor section, choose View logs.
  5. In the Monitor section, review the graphs of instance utilization.

If the job is exhausting the available resources, switch to a larger instance type, or attach a larger storage volume to the instance.
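When you rerun the job through the API, both of these settings are part of the ResourceConfig structure of the training job. The following sketch assumes that you start from the ResourceConfig returned by DescribeTrainingJob and pass the result to a new CreateTrainingJob request. The helper name with_larger_resources and the instance types in the usage note are illustrative only.

```python
import copy


def with_larger_resources(resource_config, instance_type=None, volume_size_gb=None):
    """Return a copy of a ResourceConfig dict with larger resources.

    The input dict is left unchanged so that the original job settings
    stay available for comparison.
    """
    updated = copy.deepcopy(resource_config)
    if instance_type is not None:
        updated["InstanceType"] = instance_type
    if volume_size_gb is not None:
        current = updated.get("VolumeSizeInGB", 0)
        if volume_size_gb < current:
            # Guard against shrinking the volume by mistake.
            raise ValueError(
                f"New volume size {volume_size_gb} GB is smaller than {current} GB"
            )
        updated["VolumeSizeInGB"] = volume_size_gb
    return updated
```

For example, with_larger_resources({"InstanceType": "ml.m5.xlarge", "InstanceCount": 1, "VolumeSizeInGB": 30}, instance_type="ml.m5.4xlarge", volume_size_gb=100) returns a new dictionary with the larger instance type and volume and the same instance count.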

For more information, see Monitoring training job metrics (SageMaker console).


Related information

Monitor and analyze training jobs using Amazon CloudWatch metrics

Logs for built-in algorithms

AWS OFFICIAL. Updated a year ago