Why is my AWS Glue ETL job running for a long time?

Last updated: 2021-08-20

My AWS Glue job is running for a long time.

-or-

My AWS Glue straggler task is taking a long time to complete.

Short description

Some common reasons why your AWS Glue jobs take a long time to complete are the following:

  • Large datasets
  • Non-uniform distribution of data in the datasets
  • Uneven distribution of tasks across the executors
  • Resource under-provisioning

Resolution

Enable metrics

AWS Glue provides Amazon CloudWatch metrics that give you information about the executors and the amount of work done by each executor. You can enable CloudWatch metrics on your AWS Glue job by doing one of the following:

Using a special parameter: Add the following argument to your AWS Glue job. This parameter enables the collection of job profiling metrics for your job run. These metrics are available on the AWS Glue console and the CloudWatch console.

Key: --enable-metrics

Using the AWS Glue console: To enable metrics on an existing job, do the following:

  1. Open the AWS Glue console.
  2. In the navigation pane, choose Jobs.
  3. Select the job that you want to enable metrics for.
  4. Choose Action, and then choose Edit job.
  5. Under Monitoring options, select Job metrics.
  6. Choose Save.

Using the API: Use the AWS Glue UpdateJob API, adding --enable-metrics to the DefaultArguments parameter, to enable metrics on an existing job.

Note: AWS Glue 2.0 doesn't use YARN, which reports these metrics. This means that you can't get some of the executor metrics, such as numberMaxNeededExecutors and numberAllExecutors, for AWS Glue 2.0.

Enable continuous logging

If you enable continuous logging in your AWS Glue job, then the real-time driver and executor logs are pushed to CloudWatch every five seconds. With this real-time logging information, you can get more details on the running job. For more information, see Enabling continuous logging for AWS Glue jobs.
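As one way to turn this on, you can add the following special parameter to the job, in the same way as --enable-metrics above:

```
Key: --enable-continuous-cloudwatch-log
Value: true
```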

Check the driver and executor logs

In the driver logs, check for tasks that ran for a long time before they were completed. For example:

2021-04-15 10:53:54,484 ERROR executionlogs:128 - g-7dd5eec38ff57a273fcaa35f289a99ecc1be6901:2021-04-15 10:53:54,484 INFO [task-result-getter-1] scheduler.TaskSetManager (Logging.scala:logInfo(54)): Finished task 0.0 in stage 7.0 (TID 139) in 4538 ms on 10.117.101.76 (executor 10) (13/14)
...
2021-04-15 12:11:30,692 ERROR executionlogs:128 - g-7dd5eec38ff57a273fcaa35f289a99ecc1be6901:2021-04-15 12:11:30,692 INFO [task-result-getter-3] scheduler.TaskSetManager (Logging.scala:logInfo(54)): Finished task 13.0 in stage 7.0 (TID 152) in 4660742 ms on 10.117.97.97 (executor 11) (14/14)

In these logs, you can see that a single task took roughly 78 minutes (4,660,742 ms) to complete. Use this information to review why that particular task is taking a long time. You can do so by using the Apache Spark web UI. The Spark UI provides well-structured information for different stages, tasks, and executors.
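When the driver log is large, scanning it by hand is tedious. The following sketch pulls the task durations out of the TaskSetManager lines shown above and reports the slow ones; the threshold value is an arbitrary choice for illustration.

```python
import re

# Matches Spark TaskSetManager completion lines such as:
# "Finished task 13.0 in stage 7.0 (TID 152) in 4660742 ms on ..."
PATTERN = re.compile(r"Finished task (\S+) in stage (\S+) \(TID (\d+)\) in (\d+) ms")


def long_tasks(log_lines, threshold_ms=600_000):
    """Return (task, stage, tid, duration_ms) for tasks over the threshold."""
    hits = []
    for line in log_lines:
        m = PATTERN.search(line)
        if m:
            task, stage, tid, ms = m.groups()
            if int(ms) >= threshold_ms:
                hits.append((task, stage, int(tid), int(ms)))
    return hits
```

Running this over the two sample lines above flags only TID 152, the task that took about 78 minutes.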

Enable the Spark UI

You can use the Spark UI to troubleshoot Spark jobs that run for a long time. By launching the Spark history server and enabling the Spark UI logs, you can get information on the stages and tasks. You can use the logs to learn how the tasks are executed by the workers. You can enable Spark UI using the AWS Glue console or the AWS Command Line Interface (AWS CLI). For more information, see Enabling the Apache Spark web UI for AWS Glue jobs.
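For example, you can enable the Spark UI logs by adding the following special parameters to the job (the S3 path shown here is a placeholder for your own event-logs location):

```
Key: --enable-spark-ui
Value: true

Key: --spark-event-logs-path
Value: s3://example-bucket/spark-event-logs/
```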

After the job is complete, you might see driver logs similar to the following:

ERROR executionlogs:128 - example-task-id:example-timeframe INFO [pool-2-thread-1] s3n.MultipartUploadOutputStream (MultipartUploadOutputStream.java:close(414)): close closed:false s3://dox-example-bucket/spark-application-1626828545941.inprogress

After analyzing the logs for the job, you can launch the Spark history server either on an Amazon Elastic Compute Cloud (Amazon EC2) instance or using Docker. Open the UI, and navigate to the Executors tab to check whether a particular executor is running for a longer time. If so, the uneven distribution of work and under-utilization of available resources might be caused by data skew in the dataset. In the Stages tab, you can get more information and statistics on the stages that took a long time. You can find details on whether these stages involved shuffle spills, which are expensive and time-consuming.

Capacity planning for data processing units (DPUs)

If all executors contribute equally to the job, but the job still takes a long time to complete, then consider adding more workers to your job to improve the speed. DPU capacity planning can help you to avoid the following:

  • Under-provisioning, which might result in slower execution time
  • Over-provisioning, which incurs higher costs without reducing execution time

From CloudWatch metrics, you can get information on the number of executors used currently and the maximum number of executors needed. The number of DPUs needed depends on the number of input partitions and the worker type requested.

Keep the following in mind when you define the number of input partitions:

  • If the Amazon Simple Storage Service (Amazon S3) files aren't splittable, then the number of partitions is equal to the number of input files.
  • If the Amazon S3 files are splittable, and the data is unstructured/semi-structured, then the number of partitions is equal to the total file size / 64 MB. If the size of each file is less than 64 MB, then the number of partitions is equal to the number of files.
  • If the Amazon S3 files are splittable and the data is structured, then the number of partitions is equal to the total file size / 128 MB.
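The rules above can be sketched as a small helper. This is a rough estimate under the stated assumptions (64 MB blocks for unstructured/semi-structured data, 128 MB for structured data, one partition minimum per file):

```python
import math

MB = 1024 * 1024


def input_partitions(file_sizes_bytes, splittable, structured):
    """Estimate the number of input partitions from the rules above.

    file_sizes_bytes: sizes of the Amazon S3 input files, in bytes.
    """
    if not splittable:
        # Non-splittable files: one partition per file.
        return len(file_sizes_bytes)
    block = 128 * MB if structured else 64 * MB
    total = sum(file_sizes_bytes)
    # Files smaller than the block size still occupy one partition each.
    return max(math.ceil(total / block), len(file_sizes_bytes))
```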

To calculate the optimal number of DPUs, suppose, for example, that the number of input partitions is 428. Then, you can calculate the maximum number of executors needed with the following formula:

Maximum number of executors needed = Number of input partitions / Number of tasks per executor = 428/4 = 107

Keep in mind the following:

  • The Standard worker type supports 4 tasks per executor
  • G.1X supports 8 tasks per executor
  • G.2X supports 16 tasks per executor

The Standard worker type has two executors per node, and one of those executors acts as the Spark driver. Therefore, you need 107 + 1 = 108 executors.

The number of DPUs needed = (No. of executors / No. of executors per node) + 1 DPU = (108/2) + 1 = 55.
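The worked example above can be expressed as a short helper, for the Standard worker type only (4 tasks per executor, 2 executors per node, plus 1 DPU for the master):

```python
import math


def optimal_dpus_standard(input_partitions, tasks_per_executor=4):
    """DPU estimate from the worked example, for the Standard worker type."""
    max_executors = math.ceil(input_partitions / tasks_per_executor)  # 428/4 = 107
    executors = max_executors + 1       # one extra executor for the Spark driver
    executors_per_node = 2              # Standard worker: 2 executors per node
    return math.ceil(executors / executors_per_node) + 1  # + 1 DPU for the master
```

With 428 input partitions, this reproduces the result above: 55 DPUs.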