How can I run an AWS Glue job on a specific partition in Amazon S3?

Last updated: 2019-10-10

How can I run an AWS Glue job on a specific partition in an Amazon Simple Storage Service (Amazon S3) location?

Short Description

To filter on partitions in the AWS Glue Data Catalog, use a pushdown predicate. Unlike Filter transforms, pushdown predicates allow you to filter on partitions without having to list and read all the files in your dataset.

Resolution

Create an AWS Glue job and specify the pushdown predicate when you create the DynamicFrame. In the following example, the job processes data in the s3://awsexamplebucket/product_category=Video partition only:

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database = "testdata",
    table_name = "sampletable",
    transformation_ctx = "datasource0",
    push_down_predicate = "(product_category == 'Video')")

Here's an example of a pushdown predicate that filters by date. In this example, the job processes data in the s3://awsexamplebucket/year=2019/month=08/day=02 partition only:

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database = "testdata",
    table_name = "sampletable",
    transformation_ctx = "datasource0",
    push_down_predicate = "(year == '2019' and month == '08' and day == '02')")

Here's an example of a pushdown predicate that filters by date for non-Hive style partitions. When partition folders don't use the key=value naming convention, AWS Glue assigns default partition keys (partition_0, partition_1, and so on) in the order that the folders appear in the path. In this example, the job processes data in the s3://awsexamplebucket/2019/07/03 partition only:

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database = "testdata",
    table_name = "sampletable",
    transformation_ctx = "datasource0",
    push_down_predicate = "(partition_0 == '2019' and partition_1 == '07' and partition_2 == '03')")
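Pushdown predicate strings use Spark SQL expression syntax, so you can also build them programmatically instead of hard-coding them. Here's a minimal sketch (the variable names and date values are illustrative) that produces the same predicate string as the earlier Hive-style date example:

```python
# Build a pushdown predicate string for a Hive-style date partition.
# Values are zero-padded to match the partition folder names
# (for example, month=08 and day=02).
year, month, day = 2019, 8, 2
predicate = f"(year == '{year}' and month == '{month:02d}' and day == '{day:02d}')"
print(predicate)  # (year == '2019' and month == '08' and day == '02')
```

You can then pass the resulting string as the push_down_predicate argument of create_dynamic_frame.from_catalog. Note that partition values are strings in the Data Catalog, so keep the single quotes around each value in the predicate.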
