Why is the AWS Glue crawler running for a long time?

Last updated: 2019-09-24

The AWS Glue crawler has been running for several hours or longer, and it still hasn't identified the schema in my data store. Why is the crawler taking so long, and how can I speed it up?

Short Description

Here are some common causes of long crawler run times:

  • Frequently adding new data: During the first crawler run, the crawler reads the first megabyte of each file to infer the schema. During subsequent runs, the crawler lists all files in the target, including files that were crawled during the first run, but reads the first megabyte of new files only. Because files that were read in a previous run aren't read again, subsequent runs are often faster. However, when you add many files or folders to your data store between runs, the listing step grows and the run time increases with each run.
  • Crawling compressed files: Compressed files take longer to crawl. That's because the crawler must download each file and decompress it before it can read the first megabyte or list the file.
    Note: For Apache Parquet, Apache Avro, and Apache ORC files, the crawler doesn't read the first megabyte. Instead, it reads the metadata stored in each file.

Resolution

Before you start troubleshooting, consider whether you need a crawler at all. You need one only if you want to create a table in the AWS Glue Data Catalog and use that table in an ETL job or in a downstream service such as Amazon Athena. For ETL jobs, you can instead use from_options to read the data directly from the data store and then apply transformations to the resulting DynamicFrame. When you do this, you don't need a crawler in your ETL pipeline. If you determine that a crawler makes sense for your use case, use one or more of the following methods to reduce crawler run times.
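For example, an ETL job might read JSON data directly from Amazon S3 and transform the resulting DynamicFrame, with no crawler or Data Catalog table involved. The following is a minimal sketch; the S3 path, format, and mapping are hypothetical placeholders:

    # Minimal sketch: read data directly from S3 with from_options,
    # skipping the crawler and the Data Catalog.
    # The bucket path, format, and mapping below are hypothetical.
    from awsglue.context import GlueContext
    from awsglue.transforms import ApplyMapping
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    # Read JSON files straight from the data store.
    dynamic_frame = glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://my-example-bucket/data/"]},
        format="json",
    )

    # Apply transformations to the DynamicFrame as usual.
    mapped = ApplyMapping.apply(
        frame=dynamic_frame,
        mappings=[("id", "string", "id", "string")],
    )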

Use an exclude pattern

An exclude pattern tells the crawler to skip certain files or paths. Exclude patterns reduce the number of files that the crawler must list, which means that the crawler runs faster. For example, use an exclude pattern to skip metadata files and files or folders that have already been crawled. For more information, including examples of exclude patterns, see Include and Exclude Patterns.
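If you manage crawlers programmatically, you can set exclude patterns on the crawler's S3 target. The following boto3 sketch shows one way to do this; the crawler name, IAM role, database, S3 path, and patterns are hypothetical placeholders:

    # Minimal sketch: create a crawler whose S3 target excludes
    # metadata files and an already-crawled partition.
    # All names, ARNs, paths, and patterns are hypothetical.
    import boto3

    glue = boto3.client("glue")

    glue.create_crawler(
        Name="example-crawler",
        Role="arn:aws:iam::123456789012:role/example-glue-role",
        DatabaseName="example_db",
        Targets={
            "S3Targets": [
                {
                    "Path": "s3://my-example-bucket/data/",
                    # Glob-style exclude patterns: skip metadata files
                    # and a partition that was already crawled.
                    "Exclusions": ["**/_metadata", "year=2017/**"],
                }
            ]
        },
    )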

Run multiple crawlers

Instead of running one crawler on the entire data store, consider running several crawlers on smaller portions of it. Multiple crawlers that each run for a short time finish sooner than one crawler that runs for a long time, because they run in parallel. For example, assume that your data is partitioned by year and that each partition contains a large amount of data. If you run a separate crawler on each partition (each year), the crawlers finish faster.
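For example, if you have defined one crawler per year partition, you can start them together so that they run in parallel. The following boto3 sketch assumes crawlers named sales-crawler-<year> already exist; the names and years are hypothetical:

    # Minimal sketch: start one crawler per year partition so the
    # crawlers run in parallel. The crawler names and years are
    # hypothetical, and the crawlers are assumed to already exist.
    import boto3

    glue = boto3.client("glue")

    for year in (2017, 2018, 2019):
        # Each crawler targets a single partition, for example
        # s3://my-example-bucket/sales/year=2017/.
        glue.start_crawler(Name=f"sales-crawler-{year}")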

Combine smaller files to create larger ones

It takes more time to crawl a large number of small files than a small number of large files. That's because the crawler must list each file and read the first megabyte of each new one, so the total number of files, not just the total amount of data, drives the run time.
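One way to combine small files is a simple ETL job that reads them, repartitions the data, and writes out a smaller number of larger files for the crawler to process. The following is a minimal sketch; the S3 paths, formats, and partition count are hypothetical placeholders:

    # Minimal sketch: compact many small files into fewer, larger ones.
    # S3 paths, formats, and the partition count are hypothetical.
    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    # Read the many small source files directly from S3.
    frame = glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://my-example-bucket/small-files/"]},
        format="json",
    )

    # Repartition so that the write step produces a small number of
    # larger output files instead of many tiny ones.
    compacted = frame.repartition(10)

    glue_context.write_dynamic_frame.from_options(
        frame=compacted,
        connection_type="s3",
        connection_options={"path": "s3://my-example-bucket/compacted/"},
        format="parquet",
    )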