Why is my AWS Glue crawler not adding new partitions to the table?

My AWS Glue crawler doesn't add new partitions to the table.

Short description

When the crawler scans the source data files under a new partition, the crawler compares the following attributes of the source files with those of the existing table:

File format
Compression type
Schema
Structure of Amazon Simple Storage Service (Amazon S3) partitions

If any of these attributes differ between the partition and the table, then the crawler skips the partition and doesn't add it to the metadata. A difference in the name, sequence, or number of partition keys in the Amazon S3 path is treated as a change in the partition schema or structure.
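To make the partition structure check concrete, the following is a minimal sketch that compares the Hive-style partition key names in an S3 prefix with a table's partition keys. It only illustrates the name, sequence, and number comparison described above; it isn't the crawler's actual implementation. The helper names and the table keys (year, month, day) are hypothetical.

```python
def partition_keys(s3_prefix: str) -> list:
    """Return the Hive-style partition key names (key=value segments) in path order."""
    segments = s3_prefix.strip("/").split("/")
    return [seg.split("=", 1)[0] for seg in segments if "=" in seg]


def matches_table(s3_prefix: str, table_keys: list) -> bool:
    """True only if the key names, their order, and their count all match."""
    return partition_keys(s3_prefix) == list(table_keys)


# Hypothetical partition keys of the existing table.
table_keys = ["year", "month", "day"]

base = "doc-example-bucket/doc-example-path/doc-example-table/"
print(matches_table(base + "year=2021/month=01/day=05/", table_keys))   # True
print(matches_table(base + "year=2021/month=01/sday=05/", table_keys))  # False: different key name
print(matches_table(base + "month=01/year=2021/day=05/", table_keys))   # False: different order
print(matches_table(base + "year=2021/month=01/", table_keys))          # False: different count
```

Running a check like this over the prefixes in your bucket before you start the crawler can show which folders are likely to be skipped.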

Resolution

Troubleshoot the issue

Check the crawler logs to identify the issue:

Open the AWS Glue console.
In the navigation pane, choose Crawlers.
Select the crawler, and then choose the Logs link to view the logs on the CloudWatch console.
Review the logs to check if the crawler skipped the new partition.

For example, suppose that the log includes entries that look similar to the following:

Folder partition keys do not match table partition keys, skipped folder: doc-example-bucket/doc-example-path/doc-example-table/year=2021/month=01/sday=05/

This entry suggests that the partition structure for the Amazon S3 location doesn't match the partition keys defined for the table. This might happen when the partition structure isn't consistent across the table source location.
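If the log contains many entries, it can help to extract the skipped folders programmatically. The following is a minimal sketch that assumes the message format matches the example entry above; real log messages might be worded differently. The function name and the regular expression are hypothetical. How you retrieve the messages (for example, by exporting them from the CloudWatch console) is left open.

```python
import re

# Assumes the "skipped folder: <path>" wording from the example log entry.
SKIPPED_FOLDER = re.compile(r"skipped folder:\s*(\S+)")


def skipped_folders(messages):
    """Return the folder paths from log messages that report a skipped folder."""
    folders = []
    for message in messages:
        match = SKIPPED_FOLDER.search(message)
        if match:
            folders.append(match.group(1))
    return folders


messages = [
    "Folder partition keys do not match table partition keys, skipped folder: "
    "doc-example-bucket/doc-example-path/doc-example-table/year=2021/month=01/sday=05/",
    "INFO : Created table doc-example-table in database doxtest_db",
]
for folder in skipped_folders(messages):
    print(folder)
```

The resulting list of paths shows which prefixes to compare against the table's partition keys.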

If the AWS Glue crawler creates multiple tables, then the log entries look similar to the following:

INFO : Created table doc-example-table in database doxtest_db

If you see similar logs, then compare the schema and partition structure of the location of these tables with those of the original table.

Resolve the issue

Based on the information from the CloudWatch logs, consider one or more of the following solution options:

If the issue is caused by inconsistent partition structure, then make the structure consistent by renaming the S3 path manually or programmatically.
If the partition is skipped because of a mismatch in file format, compression type, or schema, and the data doesn't need to be included in the intended table, then consider the following:

Use an exclude pattern to skip any unwanted files.
Move the unwanted file to a different location.

If your data has different schemas in some input files and similar schemas in other input files, then combine compatible schemas when you create the crawler. On the Configure the crawler's output page, under Grouping behavior for S3 data (optional), select Create a single schema for each S3 path. When this setting is turned on and the data is compatible, then the crawler ignores the similarity of specific schemas when evaluating S3 objects in the specified include path. For more information, see How to create a single schema for each Amazon S3 include path.
If the crawler is creating multiple tables, then see How can I prevent the AWS Glue crawler from creating multiple tables?
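The exclude pattern and single-schema options above can also be set through the AWS Glue API instead of the console. The following is a minimal sketch that builds the arguments for the boto3 update_crawler call. The crawler name, S3 path, and exclude pattern are hypothetical placeholders, and the helper function is illustrative only. Note that passing Targets or Configuration replaces the crawler's existing values for those fields, so include any settings that you want to keep.

```python
import json


def crawler_update_args(name, s3_path, exclusions):
    """Build keyword arguments for the Glue update_crawler API call."""
    # Corresponds to "Create a single schema for each S3 path" in the console.
    configuration = {
        "Version": 1.0,
        "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
    }
    return {
        "Name": name,
        "Targets": {
            # Exclusions are glob patterns evaluated relative to the include path.
            "S3Targets": [{"Path": s3_path, "Exclusions": list(exclusions)}]
        },
        "Configuration": json.dumps(configuration),
    }


args = crawler_update_args(
    "doc-example-crawler",                                         # hypothetical crawler name
    "s3://doc-example-bucket/doc-example-path/doc-example-table/",
    ["doc-example-unwanted/**"],                                   # hypothetical exclude pattern
)
print(json.dumps(args, indent=2))

# To apply the settings (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("glue").update_crawler(**args)
```

Review the printed arguments before you apply them, and run the crawler again afterward to confirm that the new partitions are added.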
