Why does my AWS Glue ETL job fail with the error "Container killed by YARN for exceeding memory limits"?

4 minute read

My AWS Glue extract, transform, and load (ETL) job fails with the error "Container killed by YARN for exceeding memory limits".

Short description

The most common causes for this error are the following:

Memory-intensive operations, such as joining large tables or processing datasets with a skew in the distribution of specific column values, exceeding the memory threshold of the underlying Spark cluster
Fat partitions of data that consume more than the memory that's assigned to the respective executor
Large files that can't be split resulting in large in-memory partitions

Resolution

Use one or more of the following solution options to resolve this error:

Upgrade the worker type from G.1x to G.2x that has higher memory configurations. For more information on specifications of worker types, see the Worker type section in Defining job properties for Spark jobs.
To learn more about migrating from AWS Glue 2.0 to 3.0, see Migrating from AWS Glue 2.0 to AWS Glue 3.0.
You can also review the following table for information on worker type specifications:


AWS Glue versions 1.0 and 2.0
Standard	spark.executor.memory: 5g spark.driver.memory: 5g spark.executor.cores: 4
G.1x	spark.executor.memory: 10g spark.driver.memory: 10g spark.executor.cores: 8
G.2x	spark.executor.memory: 20g spark.driver.memory: 20g spark.executor.cores: 16
AWS Glue version 3.0
Standard	spark.executor.memory: 5g spark.driver.memory: 5g spark.executor.cores: 4
G.1x	spark.executor.memory: 10g spark.driver.memory: 10g spark.executor.cores: 4
G.2x	spark.executor.memory: 20g spark.driver.memory: 20g spark.executor.cores: 8

If the error persists after upgrading the worker type, then increase the number of executors for the job. Each executor has a certain number of cores. This number determines the number of partitions that can be processed by the executor. Spark configurations for the data processing units (DPUs) are defined based on the worker type.
Be sure that data is properly parallelized so that executors can be used evenly before any shuffle operation, such as joins. You can repartition the data across all the executors. You can do so by including the following commands for AWS Glue DynamicFrame and Spark DataFrame in your ETL job, respectively.

dynamicFrame.repartition(totalNumberOfExecutorCores)

dataframe.repartition(totalNumberOfExecutorCores)

Using job bookmarks allows only the newly written files to be processed by the AWS Glue job. This reduces the number of files processed by the AWS Glue job and alleviate memory issues. Bookmarks store the metadata about the files processed in the previous run. In the subsequent run, the job compares the timestamp and then decides whether to process these files again. For more information, see Tracking processed data using job bookmarks.
When connecting to a JDBC table, Spark opens only one concurrent connection by default. The driver tries to download the whole table at once in a single Spark executor. This might take longer and even cause out of memory errors for the executor. Instead, you can set specific properties of your JDBC table to instruct AWS Glue to read data in parallel through DynamicFrame. For more information, see Reading from JDBC tables in parallel. Or, you can achieve parallel reads from JDBC through Spark DataFrame. For more information, see Spark DataFrame parallel reads from JDBC, and review properties, such as partitionColumn, lowerBound, upperBound, and numPartitions.
Avoid using user-defined functions in your ETL job, especially when combining Python/Scala code with Spark's functions and methods. For example, avoid using Spark's df.count() for verifying empty DataFrames within if/else statements or for loops. Instead, use better performant function, such as df.schema() or df.rdd.isEmpty().
Test the AWS Glue job on a development endpoint and optimize the ETL code accordingly.
If none of the preceding solution options work, split the input data into chunks or partitions. Then, run multiple AWS Glue ETL jobs instead of running one big job. For more information, see Workload partitioning with bounded execution.

Related information

Debugging OOM exceptions and job abnormalities

Best practices to scale Apache Spark jobs and partition data with AWS Glue

Optimize memory management in AWS Glue

Topics

Analytics

Relevant content

AWS Glue ETL“Failed to delete key: target_folder/_temporary” caused by S3
cell
asked 4 years ago
Runaway glue jobs leading to exception Task allocated capacity limit being exceeded ? "Glue ErrorCode 400 InvalidInputException"
Accepted Answer
rePost-User-9055820
asked 2 years ago
AWS ECS Tasks are killed by OOM killer without hitting memory limit
icaliskanoglu
asked a year ago
ETL job failing with weird error
Accepted Answer
Mark
asked 6 months ago
Kill yarn jobs that not running automatically
Accepted Answer
Mark
asked 6 months ago
How do I troubleshoot the "Command failed with exit code" error in AWS Glue?
AWS OFFICIALUpdated 3 years ago
How do I resolve the error "Container killed by YARN for exceeding memory limits" in Spark on Amazon EMR?
AWS OFFICIALUpdated a year ago
Why does my AWS Glue job return the 403 Access Denied error?
AWS OFFICIALUpdated 2 years ago
Why is my AWS Glue job failing with the error "Exit status: -100. Diagnostics: Container released on a *lost* node"?
AWS OFFICIALUpdated 3 years ago
Push down queries when using the Google BigQuery Connector for AWS Glue
EXPERT
Fabrizio@AWS
published 2 years ago