How do I resolve the "java.lang.OutOfMemoryError: Java heap space" error in AWS Glue?

Last updated: 2019-09-12

My AWS Glue job fails with "Command failed with exit code 1" and Amazon CloudWatch logs show the "java.lang.OutOfMemoryError: Java heap space" error.

Short Description

The "java.lang.OutOfMemoryError: Java heap space" error indicates that a driver process is running out memory. For more information about diagnosing driver out of memory (OOM) exceptions, see Debugging a Driver OOM Exception.

Driver OOM exceptions commonly happen when an Apache Spark job reads a large number of small files from Amazon Simple Storage Service (Amazon S3). Use the grouping feature or enable useS3ListImplementation in your DynamicFrame to resolve the root cause.

Resolution

Use one or more of the following methods to resolve driver OOM exceptions with DynamicFrames.

Grouping

When the grouping feature is enabled, tasks process multiple files instead of individual files. For more information, see Fix the Processing of Multiple Files Using Grouping.
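Here's a minimal sketch of enabling grouping through connection_options with from_options. The S3 path and the groupSize value are placeholders; tune groupSize (in bytes) for your dataset:

datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://input_path"],    # placeholder path
        "recurse": True,
        "groupFiles": "inPartition",     # group small files within each S3 partition
        "groupSize": "1048576"           # target group size in bytes (1 MB here; adjust for your data)
    },
    format="json"
)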

Enable useS3ListImplementation

When AWS Glue lists files, it creates a file index in driver memory. When you set useS3ListImplementation to True, as shown in the following examples, AWS Glue doesn't cache the list of files in memory all at once. Instead, the list is cached in batches, so the driver is less likely to run out of memory.

Here's an example of how to enable useS3ListImplementation with from_catalog:

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="database",
    table_name="table",
    additional_options={"useS3ListImplementation": True},
    transformation_ctx="datasource0"
)

Here's an example of how to enable useS3ListImplementation with from_options:

datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://input_path"],
        "useS3ListImplementation": True,
        "recurse": True
    },
    format="json"
)

The useS3ListImplementation feature is an implementation of the Amazon S3 ListKeys operation, which splits large result sets into multiple responses. It's a best practice to use useS3ListImplementation with job bookmarks.
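If your job uses bookmarks, include a transformation_ctx on the read so that the bookmark can track its state. Here's a sketch that combines useS3ListImplementation with a transformation_ctx (the S3 path is a placeholder):

datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://input_path"],
        "useS3ListImplementation": True,
        "recurse": True
    },
    format="json",
    transformation_ctx="datasource0"   # lets job bookmarks track this read
)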

Additional troubleshooting

If grouping and useS3ListImplementation don't resolve driver OOM exceptions, try the following:

  • Use CloudWatch logs and CloudWatch metrics to analyze driver memory. Set up CloudWatch alarms to alert you when specific thresholds are breached in your job.
  • Avoid using actions like collect and count. These actions bring results back to the driver, which can cause driver OOM exceptions. Write results to Amazon S3 instead, as shown in the sketch after this list.
  • Analyze your dataset and select the right worker type for your job. You might need to scale up to G.1X or G.2X.
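
For example, instead of calling collect to inspect results on the driver, write the output to Amazon S3 and examine it there. This is a minimal sketch; the output path and format are placeholders:

glueContext.write_dynamic_frame.from_options(
    frame=datasource0,
    connection_type="s3",
    connection_options={"path": "s3://output_path"},   # placeholder output location
    format="parquet"
)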