How do I troubleshoot the "Command failed with exit code" error in AWS Glue?

Last updated: 2021-08-24

My AWS Glue extract, transform, and load (ETL) job fails with the error "Command failed with exit code".

Short description

You get this error when one or more of the following conditions are true:

  • The driver or executor in your job has run out of memory.
  • Your ETL script has code-related issues.
  • The AWS Glue IAM role lacks the required permissions to access the script path.

When your job run fails due to one of the preceding reasons, the respective error logs are written to Amazon CloudWatch.
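As a minimal sketch of retrieving those logs, the following Python function searches one job run's error log streams for a given term. It assumes that the job writes to the default error log group /aws-glue/jobs/error and that the log stream names begin with the job run ID. The function name, its parameters, and the run ID in the usage comment are hypothetical. The CloudWatch Logs client is passed in, so the function can also be exercised without AWS credentials.

```python
def fetch_error_messages(logs_client, run_id, term,
                         log_group="/aws-glue/jobs/error"):
    """Return log messages for one job run that contain `term`.

    Assumption: the job uses the default error log group and its log
    stream names start with the job run ID.
    """
    paginator = logs_client.get_paginator("filter_log_events")
    pages = paginator.paginate(
        logGroupName=log_group,
        logStreamNamePrefix=run_id,
        # Double quotes make CloudWatch Logs match the term literally.
        # `term` itself must not contain a double quote.
        filterPattern=f'"{term}"',
    )
    return [event["message"]
            for page in pages
            for event in page.get("events", [])]


# Example usage (requires boto3 and AWS credentials; the run ID is a placeholder):
#   import boto3
#   logs = boto3.client("logs")
#   for line in fetch_error_messages(logs, "jr_0123456789abcdef", "OutOfMemoryError"):
#       print(line)
```

If the job uses a custom log group, pass its name through the log_group parameter.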

Resolution

Use the appropriate troubleshooting steps based on your use case.
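To decide which of the following sections applies, it can help to match the log text against the error strings that this article quotes. The following Python sketch does only that. The function name and the cause labels are shorthand for illustration, not AWS terminology, and a match is a hint rather than a diagnosis.

```python
# Error strings quoted in this article, paired with a short cause label.
SIGNATURES = [
    ("java.lang.OutOfMemoryError: Java heap space", "driver out of memory"),
    ("Container killed by YARN for exceeding memory limits", "executor out of memory"),
    ("AmazonS3Exception: Access Denied", "IAM role cannot read the script path"),
    ("java.lang.NoSuchMethodError", "JAR or Spark version conflict"),
    ("java.lang.ExceptionInInitializerError", "JAR or Spark version conflict"),
    ("RuntimeError", "Spark SQL semantic exception"),
]


def likely_causes(log_text):
    """Return the cause labels whose error string appears in log_text."""
    causes = []
    for needle, cause in SIGNATURES:
        if needle in log_text and cause not in causes:
            causes.append(cause)
    return causes


print(likely_causes(
    "RuntimeError: FAILED: SemanticException [Error 10006]: Partition not found"
))
```

An empty result means that none of the quoted strings appear, in which case the logs need to be read directly.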

The AWS Glue job fails with the error "Command failed with exit code 1" and the CloudWatch logs show the error "java.lang.OutOfMemoryError: Java heap space"

The "java.lang.OutOfMemoryError: Java heap space" error indicates that a driver process in your job is running out of memory. To confirm whether the out-of-memory (OOM) exception is caused by the driver or an executor, see Debugging OOM exceptions and job abnormalities. To debug an OOM exception caused by the driver, see How do I resolve the "java.lang.OutOfMemoryError: Java heap space" error in AWS Glue? and Debugging a driver OOM exception.

The AWS Glue job fails with the error "Command failed with exit code 1" and the CloudWatch logs show the error "Container killed by YARN for exceeding memory limits"

This error indicates that an executor caused the OOM exception. To debug an OOM exception caused by an executor, see Debugging an executor OOM exception.

The AWS Glue job fails with the error "Command failed with exit code 10"

Check the CloudWatch logs for the job to find errors related to executors. This error usually occurs during the shuffle stage of a Spark job. For example, the error occurs when a repartition action is called and the executor that performs the shuffle runs out of memory. Monitor the executors for straggler tasks during data shuffle operations. For more information, see Debugging demanding stages and straggler tasks.
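A straggler is a task that runs much longer than the other tasks in the same stage. As a rough sketch, assuming that you have already collected per-task durations for one shuffle stage (for example, from the Spark UI), the following Python function flags tasks whose duration exceeds a multiple of the stage median. The function name, the input format, and the default factor of 3 are assumptions for illustration, not AWS Glue settings.

```python
from statistics import median


def find_stragglers(task_seconds, factor=3.0):
    """Return (task_id, seconds) pairs that ran longer than factor * median.

    task_seconds maps a task ID to its duration in seconds for one stage.
    Results are sorted with the slowest task first.
    """
    if not task_seconds:
        return []
    typical = median(task_seconds.values())
    slow = [(task_id, secs) for task_id, secs in task_seconds.items()
            if secs > factor * typical]
    return sorted(slow, key=lambda item: item[1], reverse=True)


# Hypothetical durations for four tasks in one shuffle stage.
print(find_stragglers({"t1": 10, "t2": 12, "t3": 11, "t4": 95}))
```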

The AWS Glue job fails with the error "Command failed with exit code 1" and doesn't start

Check the CloudWatch job logs for errors related to Amazon Simple Storage Service (Amazon S3). The error logs might look similar to the following:

com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: xxxxxxxxxxxxx)

This error occurs when the AWS Glue IAM role doesn't have the required permission to access the AWS Glue ETL script from the Amazon S3 path. Review the permissions required by an AWS Glue IAM role to access the script location path. Be sure that these necessary permissions are attached to the role.
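One way to review the permissions is to simulate the request with the IAM policy simulator API. The following Python sketch assumes that you already know the role ARN and the S3 URI of the script. The function names and the values in the usage comment are hypothetical. The simulation evaluates only the role's identity-based policies, so a bucket policy can still deny access even when the result is positive.

```python
from urllib.parse import urlparse


def s3_uri_to_arn(s3_uri):
    """Convert s3://bucket/key to an S3 object ARN.

    Assumption: the standard "aws" partition. Other partitions use a
    different ARN prefix.
    """
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {s3_uri!r}")
    return f"arn:aws:s3:::{parsed.netloc}{parsed.path}"


def role_can_read_script(iam_client, role_arn, script_s3_uri):
    """Return True if the role's policies allow s3:GetObject on the script."""
    response = iam_client.simulate_principal_policy(
        PolicySourceArn=role_arn,
        ActionNames=["s3:GetObject"],
        ResourceArns=[s3_uri_to_arn(script_s3_uri)],
    )
    results = response["EvaluationResults"]
    # An empty result list is treated as "not allowed".
    return bool(results) and all(r["EvalDecision"] == "allowed" for r in results)


# Example usage (requires boto3 and AWS credentials; values are placeholders):
#   import boto3
#   iam = boto3.client("iam")
#   print(role_can_read_script(
#       iam,
#       "arn:aws:iam::111122223333:role/MyGlueRole",
#       "s3://my-bucket/scripts/job.py",
#   ))
```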

The AWS Glue job fails with the error "Command failed with exit code 1" and the CloudWatch logs show the error "Exception in thread "main" java.lang.NoSuchMethodError" or "Exception in thread "main" java.lang.ExceptionInInitializerError"

These exceptions indicate a JAR dependency conflict or a Spark version conflict. Check the job's JAR executable and any extra JAR files passed to the job for conflicting dependencies.
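One common sign of a dependency conflict is the same class being packaged in more than one JAR. As a minimal sketch, assuming that the job's JAR files have been downloaded to local paths, the following Python function lists the class files that appear in more than one of them. The function name is hypothetical. A duplicate is a lead to investigate rather than proof of a conflict, and this check doesn't detect a Spark version mismatch.

```python
import zipfile
from collections import defaultdict


def duplicate_classes(jar_paths):
    """Map each class file found in more than one JAR to the JARs that contain it."""
    owners = defaultdict(list)
    for jar_path in jar_paths:
        # A JAR is a ZIP archive, so the standard zipfile module can read it.
        with zipfile.ZipFile(jar_path) as jar:
            for name in set(jar.namelist()):
                if name.endswith(".class"):
                    owners[name].append(jar_path)
    return {name: jars for name, jars in owners.items() if len(jars) > 1}


# Example usage with placeholder file names:
#   for cls, jars in duplicate_classes(["app.jar", "extra-lib.jar"]).items():
#       print(cls, "is in", jars)
```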

The AWS Glue job fails with "Command failed with exit code 1" and the CloudWatch logs show the "RuntimeError" error

The "RuntimeError" error indicates that the Spark SQL passed in the ETL script raised a semantic exception.

For example:

RuntimeError: FAILED: SemanticException [Error 10006]: Partition not found

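To troubleshoot this error, review the AWS Glue job logs to find the SQL statement that failed. Then, check that statement for syntax errors and for references to objects, such as partitions, that don't exist.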

Note: This error might also occur for other reasons, such as networking issues. Therefore, the resolution steps aren't limited to those provided in this article.