How can I prevent the AWS Glue crawler from creating multiple tables?

Last updated: 2021-04-12

Why is the AWS Glue crawler creating multiple tables from my source data, and how can I prevent that from happening?

Short description

The AWS Glue crawler creates multiple tables when your source data doesn't use the same:

  • Format (such as CSV, Parquet, or JSON)
  • Compression type (such as SNAPPY, gzip, or bzip2)
  • Schema

Resolution

Check the crawler logs to identify the files that are causing the crawler to create multiple tables:

1.    Open the AWS Glue console.

2.    In the navigation pane, choose Crawlers.

3.    Select the crawler, and then choose the Logs link to view the logs on the Amazon CloudWatch console.

4.    If AWS Glue created multiple tables during the previous crawler run, then the log includes entries such as the following:

[439d6bb5-ce7b-4fb7-9b4d-805346a37f88]
 INFO : Created table 
2_part_00000_24cab769_750d_4ef0_9663_0cc6228ac858_c000_snappy_parquet in
 database glue
[439d6bb5-ce7b-4fb7-9b4d-805346a37f88]
 INFO : Created table 
2_part_00000_3518b196_caf5_481f_ba4f_3e968cbbdd67_c000_snappy_parquet in
 database glue
[439d6bb5-ce7b-4fb7-9b4d-805346a37f88]
 INFO : Created table 
2_part_00000_6d2fffc2_a893_4531_89fa_72c6224bb2d6_c000_snappy_parquet in
 database glue

The log entries include the names of the files that are causing the crawler to create multiple tables. To prevent the creation of multiple tables, do the following:

  • Confirm that the files use the same schema, format, and compression type as the rest of your source data. If some files use different schemas (for example, schema A says field X is type INT, and schema B says field X is type BOOL), you can do either of the following:
    • Run an AWS Glue ETL job to read the outlier data using the from_options function, and then transform the outlier data types to the correct or most common data types in your source.
    • Use Amazon Athena to manually create the table using the existing table DDL. Then, run an AWS Glue crawler to update the table metadata using the crawler setting to inherit schema from table.
  • If your data has different but similar schemas, you can combine compatible schemas when you create the crawler. On the Configure the crawler's output page, under Grouping behavior for S3 data (optional), select Create a single schema for each S3 path. When this setting is enabled and the data is compatible, the crawler ignores the similarity of the specific schemas when evaluating Amazon S3 objects in the specified include path. For more information, see How to create a single schema for each Amazon S3 include path.
  • When using CSV data, be sure that you're using headers consistently. If some of your files have headers and some don't, then the crawler creates multiple tables.

Did this article help?


Do you need billing or technical support?