How can I configure an AWS Glue ETL job to output larger files?

Last updated: 2020-03-16

I want to configure an AWS Glue ETL job to output a small number of large files instead of a large number of small files. How do I do that?

Resolution

Use one or both of the following methods to reduce the number of output files for an AWS Glue ETL job.

Increase the value of the groupSize parameter

Grouping is automatically enabled when you use dynamic frames and when the Amazon Simple Storage Service (Amazon S3) dataset has more than 50,000 files. The default groupSize value is 1 MB. Increase this value to create fewer, larger output files. For more information, see Reading Input Files in Larger Groups.

In the following example, groupSize is set to 10485760 bytes (10 MB):

dyf = glueContext.create_dynamic_frame_from_options(
    "s3",
    {'paths': ["s3://awsexamplebucket/"], 'groupFiles': 'inPartition', 'groupSize': '10485760'},
    format="json"
)
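
If your job reads from the AWS Glue Data Catalog instead of directly from Amazon S3, you can pass the same grouping options through additional_options. The following is a minimal sketch; the database and table names are placeholders:

dyf = glueContext.create_dynamic_frame.from_catalog(
    database="example_database",    # placeholder database name
    table_name="example_table",     # placeholder table name
    additional_options={
        "groupFiles": "inPartition",
        "groupSize": "10485760"     # 10 MB target group size
    }
)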

Use coalesce()

Use an Apache Spark .coalesce() operation to reduce the number of Spark output partitions before writing to Amazon S3. This reduces the number of output files. Keep in mind:

  • coalesce() moves data across the cluster to combine partitions, which can significantly increase the job run time.
  • If you specify a very small number of partitions, the job might fail. For example, if you run coalesce(1), Spark tries to put all of the data into a single partition, which can cause disk space or out-of-memory issues on that worker.

Note: You can also use repartition() to decrease the number of partitions. However, repartition() performs a full shuffle of all the data. The coalesce() operation combines existing partitions to minimize the amount of data that is shuffled.
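
If you do need repartition(), a minimal sketch using the underlying Spark DataFrame looks like the following; the variable names are placeholders, and toDF()/fromDF() round-trips the data through a Spark DataFrame:

from awsglue.dynamicframe import DynamicFrame

# Convert to a Spark DataFrame, repartition (full shuffle), then convert back
repartitioned_df = dynamic_frame.toDF().repartition(20)
repartitioned_dynamic_frame = DynamicFrame.fromDF(repartitioned_df, glueContext, "repartitioned_dynamic_frame")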

To decrease the number of Spark partitions using the .coalesce() operation:

1.    Check the current number of partitions:

dynamic_frame.getNumPartitions()

2.    Run coalesce(). Example:

dynamic_frame_with_fewer_partitions = dynamic_frame.coalesce(20)
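
Putting the steps together, the following is a minimal sketch that coalesces the data and then writes it back to Amazon S3 (the output path is a placeholder). With formats such as JSON or CSV, Spark typically writes roughly one output file per partition:

# Coalesce to fewer partitions, then write the result to Amazon S3
dynamic_frame_with_fewer_partitions = dynamic_frame.coalesce(20)

glueContext.write_dynamic_frame.from_options(
    frame=dynamic_frame_with_fewer_partitions,
    connection_type="s3",
    connection_options={"path": "s3://awsexamplebucket/output/"},  # placeholder output path
    format="json"
)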
