Why is the AWS Glue crawler running for a long time?

My AWS Glue crawler has been running for several hours or longer, and it still can't identify the schema in my data store.

Short description

Here are some common causes of long crawler run times:

**Frequently adding new data:** During the first crawler run, the crawler reads the first megabyte of each file to infer the schema. During subsequent runs, the crawler lists all files in the target, including files that were crawled in earlier runs, but reads the first megabyte of only the new files. Because the crawler doesn't reread files that were read in the previous run, subsequent runs are often faster. If the incremental crawl feature is activated, the crawler reads only new data in subsequent runs. However, when you add a lot of files or folders to your data store between crawler runs, the run time increases each time.
**Crawling compressed files:** Compressed files take longer to crawl. That's because the crawler must download the file and decompress it before reading the first megabyte or listing the file.
Note: For Apache Parquet, Apache Avro, and Apache ORC files, the crawler doesn't read the first megabyte. Instead, the crawler reads the metadata stored in each file.
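
The incremental crawl option described above corresponds to the crawler's recrawl policy in the AWS Glue API. The following is a minimal sketch, assuming boto3 is available and using a hypothetical crawler name. It only builds the arguments for the update_crawler call and leaves the call itself commented out. The sketch also assumes that an incremental crawl expects both schema change behaviors to be LOG, so check that requirement against the current AWS Glue documentation before relying on it.

```python
def incremental_crawl_settings(crawler_name):
    """Build keyword arguments for the boto3 Glue client's update_crawler call."""
    return {
        "Name": crawler_name,
        # Crawl only folders that were added since the last crawler run.
        "RecrawlPolicy": {"RecrawlBehavior": "CRAWL_NEW_FOLDERS_ONLY"},
        # Assumption: incremental crawls log schema changes instead of applying them.
        "SchemaChangePolicy": {"UpdateBehavior": "LOG", "DeleteBehavior": "LOG"},
    }


settings = incremental_crawl_settings("my-crawler")  # hypothetical crawler name

# To apply the settings (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("glue").update_crawler(**settings)
```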

Resolution

Before you start troubleshooting, consider whether you need to run a crawler at all. You need a crawler only when you must create a table in the AWS Glue Data Catalog and use that table in an extract, transform, and load (ETL) job or a downstream service, such as Amazon Athena. For ETL jobs, you can use the create_dynamic_frame.from_options method to read the data directly from the data store, and then apply transformations to the resulting DynamicFrame. When you do this, you don't need a crawler in your ETL pipeline. If you determine that a crawler makes sense for your use case, use one or more of the following methods to reduce crawler run times.
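
As an illustration of reading data without a crawler, the following sketch wraps the create_dynamic_frame.from_options call of a GlueContext. The helper names, the bucket path, and the JSON default format are assumptions made for this example. The code expects to run inside an AWS Glue job, where the awsglue library is available.

```python
def s3_read_args(paths, data_format):
    """Build keyword arguments for create_dynamic_frame.from_options on S3 data."""
    return {
        "connection_type": "s3",
        # "recurse" asks Glue to read files in all subfolders of each path.
        "connection_options": {"paths": list(paths), "recurse": True},
        "format": data_format,
    }


def read_without_crawler(glue_context, paths, data_format="json"):
    """Read S3 data into a DynamicFrame without a Data Catalog table."""
    return glue_context.create_dynamic_frame.from_options(
        **s3_read_args(paths, data_format)
    )


# Inside an AWS Glue job, glue_context is typically created like this:
#   from pyspark.context import SparkContext
#   from awsglue.context import GlueContext
#   glue_context = GlueContext(SparkContext.getOrCreate())
#   frame = read_without_crawler(glue_context, ["s3://amzn-s3-demo-bucket/input/"])
```

Because the schema is inferred when the job reads the data, no Data Catalog table is created, so downstream services such as Athena can't query the data through the catalog.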

Use an exclude pattern

An exclude pattern tells the crawler to skip certain files or paths. Exclude patterns reduce the number of files that the crawler must list, which makes the crawler run faster. For example, use an exclude pattern to skip metadata files and files that have already been crawled. For more information, including examples of exclude patterns, see Include and exclude patterns.
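
When you define a crawler through the API, exclude patterns go in the Exclusions list of each Amazon S3 target. The following is a minimal sketch that builds such a target for the boto3 create_crawler or update_crawler call. The bucket, the folder names, and the two patterns are hypothetical; adapt them to your own layout.

```python
def s3_target(path, exclusions=()):
    """Build one entry for Targets["S3Targets"] in a crawler definition."""
    target = {"Path": path}
    if exclusions:
        # Glob patterns, evaluated relative to the include path above.
        target["Exclusions"] = list(exclusions)
    return target


# Hypothetical layout: skip an archive folder and a metadata folder.
target = s3_target(
    "s3://amzn-s3-demo-bucket/sales/",
    exclusions=["archive/**", "_metadata/**"],
)
targets = {"S3Targets": [target]}

# To apply (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("glue").update_crawler(Name="my-crawler", Targets=targets)
```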

Use the sample size feature

The AWS Glue crawler supports the sample size feature. With this feature, you specify the number of files in each leaf folder that the crawler reads. When this feature is turned on, the crawler randomly selects some files in each leaf folder to crawl instead of crawling all the files in the dataset. If you already know your data formats and know that the schemas in your folders don't change, then use the sampling crawler. Turning on this feature can significantly reduce the crawler run time.
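
In the AWS Glue API, the sample size is set per Amazon S3 target through the SampleSize field. The sketch below builds such a target and validates the value first. The 1 to 249 range is taken from the API reference as understood at the time of writing and should be treated as an assumption, as should the bucket path.

```python
def sampled_s3_target(path, sample_size):
    """Build an S3 target that crawls only sample_size files per leaf folder."""
    # Assumption: the API accepts an integer between 1 and 249.
    if not 1 <= sample_size <= 249:
        raise ValueError("sample_size must be between 1 and 249")
    return {"Path": path, "SampleSize": sample_size}


# Hypothetical bucket path; crawl at most 5 files in each leaf folder.
target = sampled_s3_target("s3://amzn-s3-demo-bucket/events/", 5)
targets = {"S3Targets": [target]}
```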

Run multiple crawlers

Instead of running one crawler on the entire data store, consider running multiple crawlers. Running multiple crawlers for a short amount of time is better than running one crawler for a long time. For example, assume that you are partitioning your data by year, and that each partition contains a large amount of data. If you run a different crawler on each partition (each year), the crawlers complete faster.
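
The per-year approach above could be scripted roughly as follows. This is a sketch under several assumptions: the data is laid out as one folder per year directly under a base path, and the crawler name prefix, role, and database name are hypothetical. The create_and_start helper expects a boto3 Glue client.

```python
def crawler_configs(base_path, years, role_arn, database, name_prefix="crawler"):
    """Build one create_crawler argument set per year folder."""
    base = base_path.rstrip("/")
    return [
        {
            "Name": f"{name_prefix}-{year}",
            "Role": role_arn,
            "DatabaseName": database,
            # Each crawler targets a single year folder only.
            "Targets": {"S3Targets": [{"Path": f"{base}/{year}/"}]},
        }
        for year in years
    ]


def create_and_start(glue_client, configs):
    """Create each crawler, then start it; the crawlers run concurrently."""
    for config in configs:
        glue_client.create_crawler(**config)
        glue_client.start_crawler(Name=config["Name"])


# Usage (requires boto3 and AWS credentials; all names are hypothetical):
#   import boto3
#   configs = crawler_configs(
#       "s3://amzn-s3-demo-bucket/sales/", [2021, 2022, 2023],
#       "arn:aws:iam::111122223333:role/MyGlueRole", "sales_db",
#   )
#   create_and_start(boto3.client("glue"), configs)
```

Check how the resulting tables are named and organized in the Data Catalog, because each crawler infers tables from its own include path.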

Combine smaller files to create larger ones

It takes more time to crawl a large number of small files than a small number of large files. That's because the crawler must list each file and must read the first megabyte of each new file.
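
One way to plan the consolidation is to group small files into batches of roughly a target size, and then write each batch as a single larger file. The following sketch covers only the planning step, and the file names and sizes are made up for illustration. How the batches are actually merged depends on the file format. For example, Parquet files can't be concatenated byte by byte and must be rewritten by a tool that understands the format.

```python
def plan_batches(files, target_bytes):
    """Greedily group (name, size) pairs into batches of at most target_bytes.

    A single file larger than target_bytes gets a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for name, size in files:
        # Close the current batch if adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Hypothetical file listing: (object key, size in bytes).
files = [("a.json", 40), ("b.json", 40), ("c.json", 40), ("d.json", 100), ("e.json", 10)]
batches = plan_batches(files, target_bytes=100)
```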

Related information

How do I resolve the "Unable to infer schema" exception in AWS Glue?

Why does the AWS Glue crawler classify my fixed-width data file as UNKNOWN when I use a built-in classifier to parse the file?
