Why is my AWS Glue ETL job running for a long time?

Last updated: 2021-08-20

My AWS Glue job is running for a long time.

-or-

My AWS Glue straggler task is taking a long time to complete.

Short description

Some common reasons why your AWS Glue jobs take a long time to complete are the following:

  • Large datasets
  • Non-uniform distribution of data in the datasets
  • Uneven distribution of tasks across the executors
  • Resource under-provisioning

Resolution

Enable metrics

AWS Glue provides Amazon CloudWatch metrics that give you information about the executors and the amount of work done by each executor. You can enable CloudWatch metrics on your AWS Glue job by doing one of the following:

Using a special parameter: Add the following argument to your AWS Glue job. This parameter enables the collection of job profiling metrics for your job run. These metrics are available on the AWS Glue console and the CloudWatch console.

Key: --enable-metrics

Using the AWS Glue console: To enable metrics on an existing job, do the following:

  1. Open the AWS Glue console.
  2. In the navigation pane, choose Jobs.
  3. Select the job that you want to enable metrics for.
  4. Choose Action, and then choose Edit job.
  5. Under Monitoring options, select Job metrics.
  6. Choose Save.

Using the API: Use the AWS Glue UpdateJob API, adding --enable-metrics to the DefaultArguments parameter, to enable metrics on an existing job.

Note: AWS Glue 2.0 doesn't use YARN, which reports these metrics. This means that you can't get some of the executor metrics, such as numberMaxNeededExecutors and numberAllExecutors, for AWS Glue 2.0.

Enable continuous logging

If you enable continuous logging in your AWS Glue job, then the real-time driver and executor logs are pushed to CloudWatch every five seconds. With this real-time logging information, you can get more details on the running job. For more information, see Enabling continuous logging for AWS Glue jobs.
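As one way to turn this on, you can add the following special parameter to the job, in the same way as --enable-metrics above:

```
Key: --enable-continuous-cloudwatch-log
Value: true
```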

Check the driver and executor logs

In the driver logs, check for tasks that ran for a long time before they were completed. For example:

2021-04-15 10:53:54,484 ERROR executionlogs:128 - g-7dd5eec38ff57a273fcaa35f289a99ecc1be6901:2021-04-15 10:53:54,484 INFO [task-result-getter-1] scheduler.TaskSetManager (Logging.scala:logInfo(54)): Finished task 0.0 in stage 7.0 (TID 139) in 4538 ms on 10.117.101.76 (executor 10) (13/14)
...
2021-04-15 12:11:30,692 ERROR executionlogs:128 - g-7dd5eec38ff57a273fcaa35f289a99ecc1be6901:2021-04-15 12:11:30,692 INFO [task-result-getter-3] scheduler.TaskSetManager (Logging.scala:logInfo(54)): Finished task 13.0 in stage 7.0 (TID 152) in 4660742 ms on 10.117.97.97 (executor 11) (14/14)

In these logs, you can see that a single task took roughly 78 minutes (4,660,742 ms) to complete. Use this information to review why that particular task is taking a long time. You can do so by using the Apache Spark web UI. The Spark UI provides well-structured information for different stages, tasks, and executors.
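When the driver log is large, scanning it by hand is tedious. The following sketch pulls the task durations out of the TaskSetManager lines shown above and reports the slow ones; the threshold value is an arbitrary choice for illustration.

```python
import re

# Matches Spark TaskSetManager completion lines such as:
# "Finished task 13.0 in stage 7.0 (TID 152) in 4660742 ms on ..."
PATTERN = re.compile(r"Finished task (\S+) in stage (\S+) \(TID (\d+)\) in (\d+) ms")


def long_tasks(log_lines, threshold_ms=600_000):
    """Return (task, stage, tid, duration_ms) for tasks over the threshold."""
    hits = []
    for line in log_lines:
        m = PATTERN.search(line)
        if m:
            task, stage, tid, ms = m.groups()
            if int(ms) >= threshold_ms:
                hits.append((task, stage, int(tid), int(ms)))
    return hits
```

Running this over the two sample lines above flags only TID 152, the task that took about 78 minutes.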

Enable the Spark UI

You can use the Spark UI to troubleshoot Spark jobs that run for a long time. By launching the Spark history server and enabling the Spark UI logs, you can get information on the stages and tasks. You can use the logs to learn how the tasks are executed by the workers. You can enable Spark UI using the AWS Glue console or the AWS Command Line Interface (AWS CLI). For more information, see Enabling the Apache Spark web UI for AWS Glue jobs.
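For example, you can enable the Spark UI logs by adding the following special parameters to the job (the S3 path shown here is a placeholder for your own event-logs location):

```
Key: --enable-spark-ui
Value: true

Key: --spark-event-logs-path
Value: s3://example-bucket/spark-event-logs/
```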

After the job is complete, you might see driver logs similar to the following:

ERROR executionlogs:128 - example-task-id:example-timeframe INFO [pool-2-thread-1] s3n.MultipartUploadOutputStream (MultipartUploadOutputStream.java:close(414)): close closed:false s3://dox-example-bucket/spark-application-1626828545941.inprogress

After analyzing the logs for the job, you can launch the Spark history server either on an Amazon Elastic Compute Cloud (Amazon EC2) instance or using Docker. Open the UI, and navigate to the Executors tab to check whether a particular executor is running for a longer time. If so, the uneven distribution of work and under-utilization of available resources might be caused by data skew in the dataset. In the Stages tab, you can get more information and statistics on the stages that took a long time. You can find details on whether these stages involved shuffle spills, which are expensive and time-consuming.

Capacity planning for data processing units (DPUs)

If all executors contribute equally to the job, but the job still takes a long time to complete, then consider adding more workers to your job to improve the speed. DPU capacity planning can help you to avoid the following:

  • Under-provisioning, which might result in slower execution time
  • Over-provisioning, which incurs higher costs without reducing execution time

From CloudWatch metrics, you can get information on the number of executors used currently and the maximum number of executors needed. The number of DPUs needed depends on the number of input partitions and the worker type requested.

Keep the following in mind when you define the number of input partitions:

  • If the Amazon Simple Storage Service (Amazon S3) files aren't splittable, then the number of partitions is equal to the number of input files.
  • If the Amazon S3 files are splittable, and the data is unstructured/semi-structured, then the number of partitions is equal to the total file size / 64 MB. If the size of each file is less than 64 MB, then the number of partitions is equal to the number of files.
  • If the Amazon S3 files are splittable and the data is structured, then the number of partitions is equal to the total file size / 128 MB.
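The rules above can be sketched as a small helper. This is a rough estimate under the stated assumptions (64 MB blocks for unstructured/semi-structured data, 128 MB for structured data, one partition minimum per file):

```python
import math

MB = 1024 * 1024


def input_partitions(file_sizes_bytes, splittable, structured):
    """Estimate the number of input partitions from the rules above.

    file_sizes_bytes: sizes of the Amazon S3 input files, in bytes.
    """
    if not splittable:
        # Non-splittable files: one partition per file.
        return len(file_sizes_bytes)
    block = 128 * MB if structured else 64 * MB
    total = sum(file_sizes_bytes)
    # Files smaller than the block size still occupy one partition each.
    return max(math.ceil(total / block), len(file_sizes_bytes))
```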

To calculate the optimal number of DPUs, suppose, for example, that the number of input partitions is 428. Then, you can calculate the maximum number of executors needed with the following formula:

Maximum number of executors needed = Number of input partitions / Number of tasks per executor = 428/4 = 107

Keep in mind the following:

  • The Standard worker type supports 4 tasks per executor
  • G.1X supports 8 tasks per executor
  • G.2X supports 16 tasks per executor

The Standard worker type has two executors per node, and one of those executors acts as the Spark driver. Therefore, you need 107 + 1 = 108 executors.

The number of DPUs needed = (No. of executors / No. of executors per node) + 1 DPU = (108/2) + 1 = 55.
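The worked example above can be expressed as a short helper, for the Standard worker type only (4 tasks per executor, 2 executors per node, plus 1 DPU for the master):

```python
import math


def optimal_dpus_standard(input_partitions, tasks_per_executor=4):
    """DPU estimate from the worked example, for the Standard worker type."""
    max_executors = math.ceil(input_partitions / tasks_per_executor)  # 428/4 = 107
    executors = max_executors + 1       # one extra executor for the Spark driver
    executors_per_node = 2              # Standard worker: 2 executors per node
    return math.ceil(executors / executors_per_node) + 1  # + 1 DPU for the master
```

With 428 input partitions, this reproduces the result above: 55 DPUs.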