How can I configure an AWS Glue ETL job to output larger files?

Last updated: 2020-03-16

I want to configure an AWS Glue ETL job to output a small number of large files instead of a large number of small files. How do I do that?

Resolution

Use one or both of the following methods to reduce the number of output files for an AWS Glue ETL job.

Increase the value of the groupSize parameter

Grouping is automatically enabled when you use dynamic frames and when the Amazon Simple Storage Service (Amazon S3) dataset has more than 50,000 files. The default groupSize value is 1 MB. Increase this value to create fewer, larger output files. For more information, see Reading Input Files in Larger Groups.

In the following example, groupSize is set to 10485760 bytes (10 MB):

dyf = glueContext.create_dynamic_frame_from_options(
    "s3",
    {'paths': ["s3://awsexamplebucket/"], 'groupFiles': 'inPartition', 'groupSize': '10485760'},
    format="json"
)
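
If your job reads from the AWS Glue Data Catalog instead of directly from Amazon S3, you can pass the same grouping options through additional_options. The following is a minimal sketch; the database and table names are placeholders:

dyf = glueContext.create_dynamic_frame.from_catalog(
    database="example_database",    # placeholder database name
    table_name="example_table",     # placeholder table name
    additional_options={
        "groupFiles": "inPartition",
        "groupSize": "10485760"     # 10 MB target group size
    }
)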

Use coalesce()

Use an Apache Spark .coalesce() operation to reduce the number of Spark output partitions before writing to Amazon S3. This reduces the number of output files. Keep in mind:

  • coalesce() moves data across the cluster to combine partitions, which can significantly increase the job run time.
  • If you specify a very small number of partitions, the job might fail. For example, if you run coalesce(1), Spark tries to put all of the data into a single partition, which can cause disk space or out-of-memory issues on that worker.

Note: You can also use repartition() to decrease the number of partitions. However, repartition() performs a full shuffle of all the data. The coalesce() operation combines existing partitions to minimize the amount of data that is shuffled.
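
If you do need repartition(), a minimal sketch using the underlying Spark DataFrame looks like the following; the variable names are placeholders, and toDF()/fromDF() round-trips the data through a Spark DataFrame:

from awsglue.dynamicframe import DynamicFrame

# Convert to a Spark DataFrame, repartition (full shuffle), then convert back
repartitioned_df = dynamic_frame.toDF().repartition(20)
repartitioned_dynamic_frame = DynamicFrame.fromDF(repartitioned_df, glueContext, "repartitioned_dynamic_frame")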

To decrease the number of Spark partitions using the .coalesce() operation:

1.    Check the current number of partitions:

dynamic_frame.getNumPartitions()

2.    Run coalesce(). Example:

dynamic_frame_with_fewer_partitions = dynamic_frame.coalesce(20)
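
Putting the steps together, the following is a minimal sketch that coalesces the data and then writes it back to Amazon S3 (the output path is a placeholder). With formats such as JSON or CSV, Spark typically writes roughly one output file per partition:

# Coalesce to fewer partitions, then write the result to Amazon S3
dynamic_frame_with_fewer_partitions = dynamic_frame.coalesce(20)

glueContext.write_dynamic_frame.from_options(
    frame=dynamic_frame_with_fewer_partitions,
    connection_type="s3",
    connection_options={"path": "s3://awsexamplebucket/output/"},  # placeholder output path
    format="json"
)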
