Posted On: Oct 14, 2022

AWS Glue includes crawlers based on Amazon S3 Event Notifications, a capability that makes discovering datasets simpler by scanning only the data referenced by events in Amazon S3. The Glue crawler extracts the data schema and automatically populates the AWS Glue Data Catalog, keeping the metadata current. Crawling datasets based on S3 events reduces the time to insight by making newly ingested data quickly available for analysis with your favorite analytics and machine learning tools.

Today we are extending this support to incremental crawling and updating of catalog tables created by non-crawler methods, such as API calls executed inside data pipelines. With this feature, incremental crawling can now be offloaded from data pipelines to a scheduled Glue crawler, limiting each crawl to the incremental S3 events.
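As a minimal sketch, a crawler configured this way targets existing Data Catalog tables and uses event mode so each run only processes the S3 events queued since the last run. The crawler name, IAM role, database, table, queue ARN, and schedule below are placeholder assumptions; the parameter shapes follow the AWS Glue CreateCrawler API.

```python
# Placeholder configuration for an incremental, event-based crawler over
# existing Data Catalog tables (all names and ARNs are hypothetical).
crawler_params = {
    "Name": "incremental-catalog-crawler",
    "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",
    "Targets": {
        "CatalogTargets": [
            {
                "DatabaseName": "sales_db",   # existing catalog database
                "Tables": ["orders"],         # table created via API calls
                "EventQueueArn": "arn:aws:sqs:us-east-1:123456789012:s3-events-queue",
            }
        ]
    },
    # CRAWL_EVENT_MODE restricts each run to objects referenced in S3 events
    "RecrawlPolicy": {"RecrawlBehavior": "CRAWL_EVENT_MODE"},
    "SchemaChangePolicy": {
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "LOG",  # required when targeting catalog tables
    },
    "Schedule": "cron(0/30 * * * ? *)",  # e.g. run every 30 minutes
}

# The dict would be passed to the Glue API, e.g.:
#   boto3.client("glue").create_crawler(**crawler_params)
```

Keeping the schedule frequent is cheap here: a run that finds an empty queue stops immediately.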

To accomplish incremental crawling, customers can configure Amazon S3 Event Notifications to be sent to an Amazon Simple Queue Service (SQS) queue. Customers can then use the SQS queue as a source to identify changes, and can schedule or run the Glue crawler with Glue Data Catalog tables as a target. With each run of the crawler, the SQS queue is inspected for new events. If no new events are found, the crawler stops. If events are found in the queue, the crawler inspects their respective folders, processes them through built-in classifiers (for CSV, JSON, Avro, XML, etc.), and determines the changes. The crawler then updates the Glue Data Catalog with new information, such as newly added or deleted partitions or columns. This feature reduces the cost and time to crawl large and frequently changing Amazon S3 data.
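The S3-to-SQS wiring above can be sketched as a bucket notification configuration. The bucket name, prefix, and queue ARN are placeholder assumptions; the structure follows the S3 PutBucketNotificationConfiguration API.

```python
# Placeholder S3 Event Notification configuration routing object events to
# the SQS queue the crawler reads (all names and ARNs are hypothetical).
notification_config = {
    "QueueConfigurations": [
        {
            "Id": "glue-crawler-events",
            "QueueArn": "arn:aws:sqs:us-east-1:123456789012:s3-events-queue",
            # Capture both new and deleted objects so the crawler can add
            # and remove partitions accordingly
            "Events": ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"],
            "Filter": {
                "Key": {
                    "FilterRules": [
                        # Limit events to the dataset's prefix
                        {"Name": "prefix", "Value": "sales/orders/"}
                    ]
                }
            },
        }
    ]
}

# Applied with, e.g.:
#   boto3.client("s3").put_bucket_notification_configuration(
#       Bucket="my-data-bucket",
#       NotificationConfiguration=notification_config,
#   )
```

Note that the SQS queue's access policy must also allow the S3 bucket to send messages to it.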

This feature is available in all commercial AWS Regions where AWS Glue is available; see the AWS Region Table. To learn more, read the blog post and visit the AWS Glue crawler documentation.