I want to run an AWS Glue job on a specific partition in an Amazon Simple Storage Service (Amazon S3) location.
Short description
To filter on partitions in the AWS Glue Data Catalog, use a pushdown predicate. Unlike Filter transforms, which run after the data is loaded into a DynamicFrame, pushdown predicates let you filter on partitions without having to list and read all the files in your dataset.
Resolution
Create an AWS Glue job, and then specify the pushdown predicate in the push_down_predicate parameter when you create the DynamicFrame. In the following example, the job processes data in only the s3://awsexamplebucket/product_category=Video partition:
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "testdata", table_name = "sampletable", transformation_ctx = "datasource0", push_down_predicate = "(product_category == 'Video')")
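If the partition values come from job arguments or configuration rather than being hard-coded, the predicate string can be assembled in plain Python before it is passed to push_down_predicate. The following is a minimal sketch, not part of the AWS Glue API: the helper name equality_predicate is hypothetical, and it assumes that every partition key is compared as a string, as in the example above.

```python
def equality_predicate(partitions: dict) -> str:
    """Build a predicate such as "(product_category == 'Video')".

    Hypothetical helper: assumes string-typed partition keys and
    rejects values that contain a single quote instead of escaping them.
    """
    if not partitions:
        raise ValueError("at least one partition key is required")
    clauses = []
    for key, value in partitions.items():
        text = str(value)
        if "'" in text:
            raise ValueError(f"unsupported quote in value for {key!r}")
        clauses.append(f"{key} == '{text}'")
    # Join the clauses with "and" and wrap them in parentheses,
    # matching the predicate format used in this article.
    return "(" + " and ".join(clauses) + ")"


predicate = equality_predicate({"product_category": "Video"})
print(predicate)
```

The resulting string can then be supplied as the push_down_predicate argument of create_dynamic_frame.from_catalog.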
In the following example, the pushdown predicate filters by date. The job processes data in only the s3://awsexamplebucket/year=2019/month=08/day=02 partition:
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "testdata", table_name = "sampletable", transformation_ctx = "datasource0", push_down_predicate = "(year == '2019' and month == '08' and day == '02')")
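For jobs that process a different day on each run, the date predicate can be derived from a date object. The following is a minimal sketch under the assumption that the table uses the year, month, and day partition keys shown above, with zero-padded string values. The function name date_predicate is hypothetical.

```python
from datetime import date


def date_predicate(d: date) -> str:
    """Build a pushdown predicate for year/month/day partition keys.

    Hypothetical helper: values are zero-padded so that they match
    partition paths such as year=2019/month=08/day=02.
    """
    return (
        f"(year == '{d.year:04d}' "
        f"and month == '{d.month:02d}' "
        f"and day == '{d.day:02d}')"
    )


predicate = date_predicate(date(2019, 8, 2))
print(predicate)  # (year == '2019' and month == '08' and day == '02')
```

Zero-padding matters here because the comparison is made against the partition values as strings, so '8' would not match a partition stored as '08'.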
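In the following example, the pushdown predicate filters by date for non-Hive-style partitions. Because the paths don't contain key=value pairs, the partition columns have the positional names partition_0, partition_1, and partition_2. The job processes data in only the s3://awsexamplebucket/2019/07/03 partition: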
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "testdata", table_name = "sampletable", transformation_ctx = "datasource0", push_down_predicate = "(partition_0 == '2019' and partition_1 == '07' and partition_2 == '03')")
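For non-Hive-style layouts, the predicate can also be derived from the partition path itself. The following is a minimal sketch that assumes that the partition columns are named partition_0, partition_1, and so on in path order, as in the example above. The helper name positional_predicate is hypothetical, and the path is given relative to the table's root location.

```python
def positional_predicate(relative_path: str) -> str:
    """Map a path such as "2019/07/03" to partition_N clauses.

    Hypothetical helper: assumes that the catalog table uses the
    positional column names partition_0, partition_1, ... in path order.
    """
    # Split on "/" and drop empty segments caused by leading or trailing slashes.
    segments = [s for s in relative_path.split("/") if s]
    if not segments:
        raise ValueError("the path must contain at least one segment")
    clauses = [f"partition_{i} == '{value}'" for i, value in enumerate(segments)]
    return "(" + " and ".join(clauses) + ")"


predicate = positional_predicate("2019/07/03")
print(predicate)
```

For the path 2019/07/03, this produces the same predicate string as the one used in the preceding example.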