How can I prevent the AWS Glue crawler from creating multiple tables?
Last updated: 2021-07-30
Why is the AWS Glue crawler creating multiple tables from my source data, and how can I prevent that from happening?
The AWS Glue crawler creates multiple tables when your source data files don't use the same:
- Format (such as CSV, Parquet, or JSON)
- Compression type (such as SNAPPY, gzip, or bzip2)
- Structure of Amazon Simple Storage Service (Amazon S3) partitions
Check the crawler logs
Check the crawler logs to identify the files that are causing the crawler to create multiple tables:
1. Open the AWS Glue console.
2. In the navigation pane, choose Crawlers.
3. Select the crawler, and then choose the Logs link to view the logs on the Amazon CloudWatch console.
4. If AWS Glue created multiple tables during the previous crawler run, then the log includes entries, such as the following:
[439d6bb5-ce7b-4fb7-9b4d-805346a37f88] INFO : Created table 2_part_00000_24cab769_750d_4ef0_9663_0cc6228ac858_c000_snappy_parquet in database glue
[439d6bb5-ce7b-4fb7-9b4d-805346a37f88] INFO : Created table 2_part_00000_3518b196_caf5_481f_ba4f_3e968cbbdd67_c000_snappy_parquet in database glue
[439d6bb5-ce7b-4fb7-9b4d-805346a37f88] INFO : Created table 2_part_00000_6d2fffc2_a893_4531_89fa_72c6224bb2d6_c000_snappy_parquet in database glue
The log entries include the names of the files that are causing the crawler to create multiple tables.
Prevent creation of multiple tables
To prevent the creation of multiple tables, consider one or more of the following solution options:
- Confirm that the all the data files use the same schema, format, and compression type. If some files use different schemas (for example, schema A says that field X is type INT, and schema B says that field X is type BOOL), you can do either of the following:
- Run an AWS Glue extract, transform, and load (ETL) job to read the outlier data using the from_options function. Then, transform the outlier data types to the correct or most common data types in your source.
- Use Amazon Athena to manually create the table using the existing table DDL. Then, run an AWS Glue crawler to update the table metadata using the crawler setting to inherit schema from table.
- If your data has different schemas in some input files and similar schemas in some other files, then you can combine compatible schemas when you create the crawler. On the Configure the crawler's output page, under Grouping behavior for S3 data (optional), select Create a single schema for each S3 path. When this setting is activated, and the data is compatible, the crawler ignores the similarity of specific schemas when evaluating Amazon S3 objects in the specified include path. For more information, see How to create a single schema for each Amazon S3 include path.
- If your input files have different Amazon S3 structures/paths, then the crawler creates multiple tables. For example, suppose that the crawler crawls an S3 path s3://doc-example-bucket/doc-example-key/doc-example-table with a partition structure similar to the following:
Then, the crawler creates multiple tables because of inconsistent partition structure.
To avoid this issue, you can do the following:
- Make the schema consistent either manually or programmatically. In the given example, you can either delete the s3 files without the partition dt=xxxx-xx-xx, or add the partition for files doc-example-file4.csv and doc-example-file5.csv.
- Use an exclude pattern to skip any unwanted files or folders.
- When using CSV data, be sure that you're using headers consistently. If some of your files have headers and some don't, then the crawler creates multiple tables.