How can I prevent the AWS Glue crawler from creating multiple tables?

Last updated: 2019-10-31

Why is the AWS Glue crawler creating multiple tables from my source data, and how can I prevent that from happening?

Short Description

The AWS Glue crawler creates multiple tables when your source data doesn't use the same:

  • Format (such as CSV, Parquet, or JSON)
  • Compression type (such as SNAPPY, gzip, or bzip2)
  • Schema

Resolution

Check the crawler logs to identify the files that are causing the crawler to create multiple tables:

1. Open the AWS Glue console.

2. In the navigation pane, choose Crawlers.

3. Choose the Logs link to view the logs on the Amazon CloudWatch console.

4. If AWS Glue created multiple tables during the previous crawler run, the log includes entries like this:

[439d6bb5-ce7b-4fb7-9b4d-805346a37f88] INFO : Created table 2_part_00000_24cab769_750d_4ef0_9663_0cc6228ac858_c000_snappy_parquet in database glue
[439d6bb5-ce7b-4fb7-9b4d-805346a37f88] INFO : Created table 2_part_00000_3518b196_caf5_481f_ba4f_3e968cbbdd67_c000_snappy_parquet in database glue
[439d6bb5-ce7b-4fb7-9b4d-805346a37f88] INFO : Created table 2_part_00000_6d2fffc2_a893_4531_89fa_72c6224bb2d6_c000_snappy_parquet in database glue

The table names in these log entries identify the files that are causing the crawler to create multiple tables. To prevent this from happening:

  • Confirm that these files use the same schema, format, and compression type as the rest of your source data. If some files use different schemas (for example, schema A says field X is type INT, and schema B says field X is type BOOL), run an AWS Glue ETL job to transform the outlier data types to the correct or most common type in your source data. Or, use Amazon Athena to create the table manually with the existing table DDL, and then run an AWS Glue crawler to update the table metadata.
  • If your data has different but similar schemas, you can combine compatible schemas when you create the crawler. On the Configure the crawler's output page, under Grouping behavior for S3 data (optional), choose Create a single schema for each S3 path. When this setting is enabled, and when the data is compatible, the crawler ignores the similarity of the specific schemas when evaluating Amazon S3 objects in the specified include path. For more information, see How to Create a Single Schema for Each Amazon S3 Include Path.
  • When using CSV data, be sure that you're using headers consistently. If some of your files have headers and some don't, the crawler creates multiple tables.
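The grouping behavior described above can also be set programmatically. As a sketch using boto3 (the crawler name below is a placeholder), the console option maps to the crawler's Configuration field, with TableGroupingPolicy set to CombineCompatibleSchemas:

```python
import json

def single_schema_configuration():
    """Return the crawler Configuration JSON string that groups
    compatible schemas into a single table per S3 include path."""
    return json.dumps({
        "Version": 1.0,
        "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
    })

# Applying it to an existing crawler (requires AWS credentials; not run here):
# import boto3
# boto3.client("glue").update_crawler(
#     Name="my-crawler",  # placeholder crawler name
#     Configuration=single_schema_configuration(),
# )
```

The same Configuration string can be passed to create_crawler when defining a new crawler.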
