Why does my AWS Glue ETL job fail with the error "Container killed by YARN for exceeding memory limits"?
Last updated: 2021-07-21
My AWS Glue extract, transform, and load (ETL) job fails with the error "Container killed by YARN for exceeding memory limits".
The most common causes for this error are the following:
- Memory-intensive operations, such as joining large tables or processing datasets with a skew in the distribution of specific column values, exceeding the memory threshold of the underlying Spark cluster
- Oversized ("fat") data partitions that consume more memory than is assigned to the executor that processes them
- Large files that can't be split, resulting in large in-memory partitions
Use one or more of the following solution options to resolve this error:
- Upgrade the worker type from G.1X to G.2X, which has a larger memory configuration. For more information on worker type specifications, see the Worker type section in Defining job properties for Spark jobs.
- If the error persists after upgrading the worker type, then increase the number of executors for the job. Each executor has a certain number of cores, which determines how many partitions the executor can process in parallel. Spark configurations for the data processing units (DPUs) are defined based on the worker type.
- Be sure that data is properly parallelized so that executors are used evenly before any shuffle operation, such as a join. You can repartition the data across all the executors by calling the repartition operation on your AWS Glue DynamicFrame or Spark DataFrame in your ETL job.
- Use job bookmarks so that only newly written files are processed by the AWS Glue job. This reduces the number of files processed by the AWS Glue job and alleviates memory issues. Bookmarks store metadata about the files processed in the previous run. In the subsequent run, the job compares the timestamps and then decides whether to process those files again. For more information, see Tracking processed data using job bookmarks.
- When connecting to a JDBC table, Spark opens only one concurrent connection by default. The driver tries to download the whole table at once into a single Spark executor. This can be slow and can even cause out-of-memory errors on that executor. Instead, you can set specific properties of your JDBC table to instruct AWS Glue to read data in parallel through DynamicFrame. For more information, see Reading from JDBC tables in parallel. Or, you can achieve parallel reads from JDBC through Spark DataFrame. For more information, see Spark DataFrame parallel reads from JDBC, and review properties such as partitionColumn, lowerBound, upperBound, and numPartitions.
- Avoid using user-defined functions in your ETL job, especially when combining Python/Scala code with Spark's functions and methods. For example, avoid using Spark's df.count() to verify that a DataFrame is empty within if/else statements or for loops. Instead, use a more performant alternative, such as df.schema or df.rdd.isEmpty().
- Test the AWS Glue job on a development endpoint and optimize the ETL code accordingly.
- If none of the preceding solution options work, split the input data into chunks or partitions. Then, run multiple AWS Glue ETL jobs instead of running one big job. For more information, see Workload partitioning with bounded execution.