How do I resolve the "java.lang.OutOfMemoryError: Java heap space" error in AWS Glue?

Last updated: 2020-06-19

My AWS Glue job fails with "Command failed with exit code 1" and Amazon CloudWatch Logs shows the "java.lang.OutOfMemoryError: Java heap space" error.

Short description

The "java.lang.OutOfMemoryError: Java heap space" error indicates that a driver or executor process is running out of memory (OOM). To determine whether the driver or an executor causes the OOM exception, see Debugging OOM exceptions and job abnormalities. The following resolution is for driver OOM exceptions only.

Resolution

Driver OOM exceptions commonly happen when an Apache Spark job reads a large number of small files from Amazon Simple Storage Service (Amazon S3). Resolve driver OOM exceptions with DynamicFrames using one or more of the following methods.

Grouping

When you enable the grouping feature, tasks process multiple files instead of individual files. For more information, see Fix the processing of multiple files using grouping.
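As a minimal sketch of how grouping might be enabled with from_options, the following helper builds connection options that use the groupFiles and groupSize keys. The function names, the s3://input_path placeholder, and the 1 MB group size are illustrative assumptions, not values from this article. Tune groupSize for your dataset.

```python
def grouped_s3_options(paths, group_size_bytes=1048576):
    """Build S3 connection options that ask AWS Glue to group small files.

    Hypothetical helper for illustration; the default group size is an
    assumed starting point, not a recommendation.
    """
    return {
        "paths": list(paths),
        "recurse": True,
        # Group files that belong to the same S3 partition into one read task.
        "groupFiles": "inPartition",
        # Target size of each group in bytes, passed as a string.
        "groupSize": str(group_size_bytes),
    }


def read_grouped(glue_context, paths, fmt="json"):
    """Read S3 data into a DynamicFrame with grouping enabled.

    glue_context is assumed to be the GlueContext created by the job script.
    """
    return glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options=grouped_s3_options(paths),
        format=fmt,
    )
```

In a job script, this could be called as read_grouped(glueContext, ["s3://input_path"]).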

Enable useS3ListImplementation

When AWS Glue lists files, it creates a file index in driver memory. When you set useS3ListImplementation to True, as shown in the following examples, AWS Glue doesn't cache the list of files in memory all at once. Instead, AWS Glue caches the list in batches, so the driver is less likely to run out of memory.

Here's an example of how to enable useS3ListImplementation with from_catalog:

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="database",
    table_name="table",
    additional_options={"useS3ListImplementation": True},
    transformation_ctx="datasource0",
)

Here's an example of how to enable useS3ListImplementation with from_options:

datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://input_path"],
        "useS3ListImplementation": True,
        "recurse": True,
    },
    format="json",
)

The useS3ListImplementation feature is an implementation of the Amazon S3 ListKeys operation, which splits large result sets into multiple responses. It's a best practice to use useS3ListImplementation with job bookmarks.

Additional troubleshooting

If grouping and useS3ListImplementation don't resolve driver OOM exceptions, try the following:

  • Use CloudWatch Logs and CloudWatch metrics to analyze driver memory. Set up CloudWatch alarms to alert you when specific thresholds are breached in your job.
  • Avoid using actions like collect and count. These actions collect results on the driver, which can cause driver OOM exceptions.
  • Analyze your dataset and select the right worker type for your job. Consider scaling up to G.1X or G.2X.
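
The CloudWatch alarm mentioned in the first item could be sketched as follows with boto3. This example assumes that job metrics are enabled for the job and that driver heap usage is published as glue.driver.jvm.heap.usage in the Glue namespace with the JobName, JobRunId, and Type dimensions. Verify those names against the metrics that your job actually emits. The alarm name, the 0.8 threshold, and the function names are hypothetical.

```python
def driver_heap_alarm_params(job_name, threshold=0.8):
    """Build put_metric_alarm arguments for a driver heap usage alarm.

    Assumes the metric reports heap usage as a fraction between 0 and 1.
    """
    return {
        "AlarmName": f"{job_name}-driver-heap-usage",  # hypothetical name
        "Namespace": "Glue",
        "MetricName": "glue.driver.jvm.heap.usage",
        "Dimensions": [
            {"Name": "JobName", "Value": job_name},
            {"Name": "JobRunId", "Value": "ALL"},
            {"Name": "Type", "Value": "gauge"},
        ],
        "Statistic": "Average",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
    }


def create_driver_heap_alarm(job_name, threshold=0.8):
    """Create the alarm in the account and Region of the default credentials."""
    import boto3  # imported here so the parameter builder works without boto3

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(**driver_heap_alarm_params(job_name, threshold))
```

To receive notifications, you might also pass an AlarmActions list that contains an Amazon SNS topic ARN.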