Posted On: Oct 15, 2021

AWS Glue includes crawlers, a capability that makes discovering datasets simpler by scanning data in Amazon S3 and relational databases, extracting their schema, and automatically populating the AWS Glue Data Catalog, which keeps the metadata current. This reduces the time to insight by making newly ingested data quickly available for analysis with your favorite analytics and machine learning tools.

When configuring an AWS Glue crawler to discover data in Amazon S3, you can choose between a full scan, where all objects in a given path are processed every time the crawler runs, and an incremental scan, where only the objects in newly added folders are processed. A full scan is useful when changes to the table are non-deterministic and can affect any object or partition. An incremental scan is useful when new partitions, or folders, are added to the table. For large, frequently changing tables, the incremental mode can be enhanced to reduce the time the crawler takes to determine which objects changed.
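As an illustration of the two scan modes, the following is a minimal sketch of how the choice might be expressed as a request for the Glue `CreateCrawler` API. The helper name `build_crawler_request` and all argument values are hypothetical; the `RecrawlPolicy` field and its `CRAWL_EVERYTHING` and `CRAWL_NEW_FOLDERS_ONLY` values come from the Glue API. The sketch only builds the request dictionary, so it runs without AWS credentials.

```python
def build_crawler_request(name, role_arn, database, s3_path, incremental):
    """Build keyword arguments for Glue's CreateCrawler call.

    Hypothetical helper for illustration only: it selects the recrawl
    behavior for a full scan or an incremental (new folders only) scan.
    """
    # CRAWL_EVERYTHING reprocesses every object under the path on each run;
    # CRAWL_NEW_FOLDERS_ONLY limits the run to folders added since the last crawl.
    behavior = "CRAWL_NEW_FOLDERS_ONLY" if incremental else "CRAWL_EVERYTHING"
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        "RecrawlPolicy": {"RecrawlBehavior": behavior},
    }


# Example values below are placeholders, not real resources.
full_scan = build_crawler_request(
    "sales-full", "arn:aws:iam::123456789012:role/ExampleGlueRole",
    "example_db", "s3://example-bucket/sales/", incremental=False,
)
incremental_scan = build_crawler_request(
    "sales-incremental", "arn:aws:iam::123456789012:role/ExampleGlueRole",
    "example_db", "s3://example-bucket/sales/", incremental=True,
)
```

With boto3 installed and credentials configured, such a dictionary could be passed as `boto3.client("glue").create_crawler(**full_scan)`. Depending on the other crawler settings, an incremental crawl may require additional options (for example, a schema change policy) that this sketch omits.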

Today we are launching support for Amazon S3 Event Notifications as a source for AWS Glue crawlers to incrementally update AWS Glue Data Catalog tables. Customers can configure Amazon S3 Event Notifications to be sent to an Amazon Simple Queue Service (SQS) queue, which the crawler uses to identify newly added or deleted objects. On each run, the crawler inspects the SQS queue for new events. If none are found, the crawler stops. If events are found, the crawler inspects their respective folders and processes the new objects. This new mode reduces the cost and time a crawler needs to update large and frequently changing tables.
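The setup described above has two parts: an S3 bucket notification that delivers object events to an SQS queue, and a crawler that reads from that queue. The following is a minimal sketch of both request payloads. The helper names, bucket prefix, ARNs, and other values are hypothetical placeholders; the `QueueConfigurations` structure belongs to the S3 `PutBucketNotificationConfiguration` API, and `EventQueueArn` with the `CRAWL_EVENT_MODE` recrawl behavior belongs to the Glue `CreateCrawler` API. The code only builds dictionaries, so it runs without AWS credentials.

```python
def build_queue_notification(queue_arn, prefix):
    """Build an S3 NotificationConfiguration that sends object-created and
    object-removed events under `prefix` to an SQS queue (hypothetical helper)."""
    return {
        "QueueConfigurations": [
            {
                "QueueArn": queue_arn,
                "Events": ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"],
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "prefix", "Value": prefix}]}
                },
            }
        ]
    }


def build_event_crawler_request(name, role_arn, database, s3_path, queue_arn):
    """Build CreateCrawler arguments for a crawler that uses the SQS queue
    to find changed objects instead of listing the whole path (hypothetical helper)."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {
            "S3Targets": [{"Path": s3_path, "EventQueueArn": queue_arn}]
        },
        "RecrawlPolicy": {"RecrawlBehavior": "CRAWL_EVENT_MODE"},
    }


# Placeholder ARN and names, for illustration only.
QUEUE_ARN = "arn:aws:sqs:us-east-1:123456789012:example-crawler-events"

notification = build_queue_notification(QUEUE_ARN, "sales/")
crawler_request = build_event_crawler_request(
    "sales-events", "arn:aws:iam::123456789012:role/ExampleGlueRole",
    "example_db", "s3://example-bucket/sales/", QUEUE_ARN,
)
```

With boto3, these could be applied as `boto3.client("s3").put_bucket_notification_configuration(Bucket="example-bucket", NotificationConfiguration=notification)` and `boto3.client("glue").create_crawler(**crawler_request)`. The SQS queue also needs an access policy that allows Amazon S3 to send messages to it, and the crawler role needs permission to read from the queue; neither is shown here.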

AWS Glue crawler support for Amazon S3 Event Notifications is available in all regions where AWS Glue is available; see the AWS Region Table for details. To learn more, visit the AWS Glue crawler documentation.