AWS Big Data Blog

Design patterns: Set up AWS Glue Crawlers using S3 event notifications

The AWS Well-Architected Data Analytics Lens provides a set of guiding principles for analytics applications on AWS. One of the best practices it describes is to build a central Data Catalog to store, share, and track metadata changes. AWS Glue provides a Data Catalog to fulfill this requirement. AWS Glue also provides crawlers that automatically discover datasets stored in multiple source systems, including Amazon Redshift, Amazon DynamoDB, Amazon Simple Storage Service (Amazon S3), MongoDB, Amazon DocumentDB (with MongoDB compatibility), and various other data stores using JDBC. A crawler extracts the schemas of tables from these sources and stores the information in the AWS Glue Data Catalog. You can run a crawler on demand or on a schedule.

When you schedule a crawler to discover data in Amazon S3, you can choose to crawl all folders or crawl new folders only. In the first mode, every time the crawler runs, it scans data in every folder under the root path it was configured to crawl. This can be slow for large tables because on every run, the crawler must list all objects and then compare metadata to identify new objects. In the second mode, commonly referred to as incremental crawls, every time the crawler runs, it processes only the S3 folders that were added since the last crawl. Incremental crawls can reduce runtime and cost when used with datasets that append new objects with a consistent schema on a regular basis.

AWS Glue also supports incremental crawls using Amazon S3 Event Notifications. You can configure Amazon S3 Event Notifications to be sent to an Amazon Simple Queue Service (Amazon SQS) queue, which the crawler uses to identify the newly added or deleted objects. With each run of the crawler, the SQS queue is inspected for new events; if none are found, the crawler stops. If events are found in the queue, the crawler inspects their respective folders and processes the new objects. This mode reduces the cost and runtime needed to keep large and frequently changing tables up to date.

In this post, we present two design patterns to create a crawler pipeline using this new feature. A crawler pipeline refers to components required to implement incremental crawling using Amazon S3 Event Notifications.

Crawler pipeline design patterns

We define design patterns for the crawler pipeline based on a simple question: do I have any applications other than the crawler that consume S3 event notifications?

If the answer is no, you can send event notifications directly to an SQS queue that has no other consumers. The crawler consumes events from the queue.

If you have multiple applications that want to consume the event notifications, send the notifications directly to an Amazon Simple Notification Service (Amazon SNS) topic, and then broadcast them to an SQS queue. If you have an application or microservice that consumes notifications, you can subscribe it to the SNS topic. This way, you can populate metadata in the Data Catalog while still supporting use cases around the files ingested into the S3 bucket.

The following are some considerations for these options:

  • S3 event notifications can only be sent to standard Amazon SNS topics; Amazon SNS FIFO topics are not supported. Refer to Amazon S3 Event Notifications for more details.
  • Similarly, S3 event notifications sent to Amazon SNS can only be forwarded to standard SQS queues and not Amazon SQS FIFO queues. For more information, see FIFO topics example use case.
  • The AWS Identity and Access Management (IAM) role used by the crawler needs to include an IAM policy for Amazon SQS. We provide an example policy later in this post.

Let’s take a deeper look into each design pattern to understand the architecture and its pros and cons.

Option 1: Publish events to an SQS queue

The following diagram represents a design pattern where S3 event notifications are published directly to a standard SQS queue. First, you configure an SQS queue as a target for S3 event notifications on the S3 bucket where the table you want to crawl is stored. Next, attach a policy to the queue that includes permissions for Amazon S3 to send messages to Amazon SQS and permissions for the crawler IAM role to read and delete messages from Amazon SQS. This approach is useful when the SQS queue is used only for incremental crawling and no other application or service depends on it. The crawler removes events from the queue after they are processed, so they’re not available to other applications. The following diagram illustrates this architecture.

Figure 1: Crawler pipeline using Amazon SQS queue

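If you want to script this pattern, the following boto3 sketch shows the two pieces of wiring: a queue access policy that lets Amazon S3 send messages, and an event notification configuration on the bucket. The bucket name, queue name, Region, account number, and prefix are hypothetical placeholders, not values from this post.

import json
import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

# Hypothetical names used for illustration only
bucket = "my-table-bucket"
queue_name = "crawler-events-queue"
queue_arn = "arn:aws:sqs:us-east-1:123456789012:crawler-events-queue"
queue_url = sqs.get_queue_url(QueueName=queue_name)["QueueUrl"]

# Allow Amazon S3 to send messages to the queue
queue_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "s3.amazonaws.com"},
        "Action": "sqs:SendMessage",
        "Resource": queue_arn,
        "Condition": {"ArnLike": {"aws:SourceArn": f"arn:aws:s3:::{bucket}"}},
    }],
}
sqs.set_queue_attributes(QueueUrl=queue_url, Attributes={"Policy": json.dumps(queue_policy)})

# Send object-create events under the table's prefix to the queue
s3.put_bucket_notification_configuration(
    Bucket=bucket,
    NotificationConfiguration={
        "QueueConfigurations": [{
            "QueueArn": queue_arn,
            "Events": ["s3:ObjectCreated:*"],
            "Filter": {"Key": {"FilterRules": [{"Name": "prefix", "Value": "my_table/"}]}},
        }]
    },
)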

Option 2: Publish events to an SNS topic and forward to an SQS queue

The following diagram represents a design pattern where S3 event notifications are sent to an SNS topic, which then forwards them to an SQS queue for the crawler to consume. First, you configure an SNS topic as a target for S3 event notifications on the S3 bucket where the table you want to crawl is stored. Next, attach a policy to the topic that includes permissions for Amazon S3 to send messages to Amazon SNS. Then, create an SQS queue and subscribe it to the SNS topic so that it receives the S3 events. Finally, attach a policy to the queue that includes permissions for Amazon SNS to publish messages to Amazon SQS and permissions for the crawler IAM role to read and delete messages from Amazon SQS. This approach is useful when other applications also depend on the S3 event notifications. For more information about fanout capabilities in Amazon SNS, see Fanout S3 Event Notifications to Multiple Endpoints.


Figure 2: Crawler pipeline using Amazon SNS topic and Amazon SQS queue
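
As a minimal sketch of the fanout piece, subscribing the crawler's queue to the topic looks like the following; the ARNs are hypothetical placeholders, and the sections that follow build each resource step by step.

import boto3

sns = boto3.client("sns")

# Hypothetical ARNs used for illustration only
topic_arn = "arn:aws:sns:us-east-1:123456789012:s3_event_notifications_topic"
queue_arn = "arn:aws:sqs:us-east-1:123456789012:s3_event_notifications_queue"

# The crawler's queue is one subscriber; other applications can subscribe
# their own SQS queues, Lambda functions, or HTTPS endpoints to the same topic.
sns.subscribe(TopicArn=topic_arn, Protocol="sqs", Endpoint=queue_arn)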

Solution overview

It’s common to have multiple applications consuming S3 event notifications, so in this post we demonstrate how to implement the second design pattern using Amazon SNS and Amazon SQS.

We create the following AWS resources:

  • S3 bucket – The location where table data is stored. Event notifications are enabled.
  • SNS topic and access policy – Amazon S3 sends event notifications to the SNS topic. The topic must have a policy that gives permissions to Amazon S3.
  • SQS queue and access policy – The SNS topic publishes messages to the SQS queue. The queue must have a policy that gives the SNS topic permission to write messages.
  • Three IAM policies – The policies are as follows:
    • SQS queue policy – Lets the crawler consume messages from the SQS queue.
    • S3 policy – Lets the crawler read files from the S3 bucket.
    • AWS Glue crawler policy – Lets the crawler make changes to the AWS Glue Data Catalog.
  • IAM role – The IAM role used by the crawler. This role uses the three preceding policies.
  • AWS Glue crawler – Crawls the table’s objects and updates the AWS Glue Data Catalog.
  • AWS Glue database – The database in the Data Catalog.
  • AWS Glue table – The crawler creates a table in the Data Catalog.

In the following sections, we walk you through the steps to create your resources and test the solution.

Create an S3 bucket and set up a folder

To create your Amazon S3 resources, complete the following steps:

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Choose Create bucket.
  3. For Bucket name, enter s3-event-notifications-bucket-<random-number>.
  4. Select Block all public access.
  5. Choose Create bucket.
  6. In the buckets list, choose the bucket you created and choose Create folder.
  7. For Folder name, enter nyc_data.
  8. Choose Create folder.

Create an IAM policy with permissions on Amazon S3

To create your IAM policy with Amazon S3 permissions, complete the following steps:

  1. On the IAM console, choose Policies.
  2. Choose Create policy.
  3. On the JSON tab, enter the policy code from s3_event_notifications_iam_policy_s3.json.
  4. Update the S3 bucket name.
  5. Choose Next: Tags.
  6. Choose Next: Review.
  7. For Name, enter s3_event_notifications_iam_policy_s3.
  8. Choose Create policy.
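
The contents of s3_event_notifications_iam_policy_s3.json aren't reproduced in this post. If you prefer to script this step, the following boto3 sketch creates a roughly equivalent policy; the exact statements are an assumption (read access to the bucket), and the bucket name is the placeholder from earlier.

import json
import boto3

iam = boto3.client("iam")

# Assumed policy content: read-only access to the table's bucket.
# Replace the bucket name with the one you created.
s3_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::s3-event-notifications-bucket-<random-number>",
            "arn:aws:s3:::s3-event-notifications-bucket-<random-number>/*",
        ],
    }],
}

iam.create_policy(
    PolicyName="s3_event_notifications_iam_policy_s3",
    PolicyDocument=json.dumps(s3_policy),
)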

Create an IAM policy with permissions on Amazon SQS

To create your IAM policy with Amazon SQS permissions, complete the following steps:

  1. On the IAM console, choose Policies.
  2. Choose Create policy.
  3. On the JSON tab, enter the policy code from s3_event_notifications_iam_policy_sqs.json.
  4. Update the AWS account number.
  5. Choose Next: Tags.
  6. Choose Next: Review.
  7. For Name, enter s3_event_notifications_iam_policy_sqs.
  8. Choose Create policy.
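
Again, the contents of s3_event_notifications_iam_policy_sqs.json aren't reproduced here. A policy along the following lines lets the crawler read and delete messages from the queue; the action list is an assumption, and the account number and Region are placeholders.

import json
import boto3

iam = boto3.client("iam")

# Assumed policy content: the crawler reads, deletes, and inspects
# messages on the event queue. Replace the account number and Region.
sqs_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "sqs:ReceiveMessage",
            "sqs:DeleteMessage",
            "sqs:GetQueueAttributes",
            "sqs:GetQueueUrl",
            "sqs:PurgeQueue",
        ],
        "Resource": "arn:aws:sqs:us-east-1:<account-number>:s3_event_notifications_queue",
    }],
}

iam.create_policy(
    PolicyName="s3_event_notifications_iam_policy_sqs",
    PolicyDocument=json.dumps(sqs_policy),
)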

Create an IAM role for the crawler

To create the IAM role for the AWS Glue crawler, complete the following steps:

  1. On the IAM console, choose Roles.
  2. Choose Create role.
  3. For Choose a use case, choose Glue.
  4. Choose Next: Permissions.
  5. Attach the two policies you just created: s3_event_notifications_iam_policy_s3 and s3_event_notifications_iam_policy_sqs.
  6. Attach the AWS managed policy AWSGlueServiceRole.
  7. Choose Next: Tags.
  8. Choose Next: Review.
  9. For Role name, enter s3_event_notifications_crawler_iam_role.
  10. Review to confirm that all three policies are attached.
  11. Choose Create role.
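
If you prefer to create the role programmatically, the following sketch sets the AWS Glue trust policy and attaches the three policies. It assumes the policy names above and derives your account number at runtime.

import json
import boto3

iam = boto3.client("iam")
account_id = boto3.client("sts").get_caller_identity()["Account"]
role_name = "s3_event_notifications_crawler_iam_role"

# Trust policy that lets AWS Glue assume the role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(RoleName=role_name, AssumeRolePolicyDocument=json.dumps(trust_policy))

# Attach the two custom policies and the AWS managed AWSGlueServiceRole policy
for policy_arn in [
    f"arn:aws:iam::{account_id}:policy/s3_event_notifications_iam_policy_s3",
    f"arn:aws:iam::{account_id}:policy/s3_event_notifications_iam_policy_sqs",
    "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
]:
    iam.attach_role_policy(RoleName=role_name, PolicyArn=policy_arn)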

Create an SNS topic

To create your SNS topic, complete the following steps:

  1. On the Amazon SNS console, choose Topics.
  2. Choose Create topic.
  3. For Type, choose Standard (FIFO isn’t supported).
  4. For Name, enter s3_event_notifications_topic.
  5. Choose Create topic.
  6. On the Access policy tab, choose Advanced.
  7. Enter the policy contents from s3_event_notifications_sns_topic_access_policy.json.
  8. Update the account number and S3 bucket.
  9. Choose Create topic.
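
The contents of s3_event_notifications_sns_topic_access_policy.json aren't shown in this post; the key statement allows Amazon S3 to publish to the topic on behalf of your bucket. The following sketch applies an assumed equivalent policy to an existing topic; the topic ARN, account number, and bucket name are placeholders.

import json
import boto3

sns = boto3.client("sns")

# Hypothetical topic ARN; replace the Region and account number
topic_arn = "arn:aws:sns:us-east-1:<account-number>:s3_event_notifications_topic"

# Assumed access policy content: allow Amazon S3 to publish events
# from your bucket to this topic.
topic_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "s3.amazonaws.com"},
        "Action": "SNS:Publish",
        "Resource": topic_arn,
        "Condition": {
            "ArnLike": {"aws:SourceArn": "arn:aws:s3:::s3-event-notifications-bucket-<random-number>"}
        },
    }],
}

sns.set_topic_attributes(
    TopicArn=topic_arn,
    AttributeName="Policy",
    AttributeValue=json.dumps(topic_policy),
)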

Create an SQS queue

To create your SQS queue, complete the following steps:

  1. On the Amazon SQS console, choose Create a queue.
  2. For Type, choose Standard.
  3. For Name, enter s3_event_notifications_queue.
  4. Keep the remaining settings at their default.
  5. On the Access policy tab, choose Advanced.
  6. Enter the policy contents from s3_event_notifications_sqs_queue_policy.json.
  7. Update the account number.
  8. Choose Create queue.
  9. On the SNS subscription tab, choose Subscribe to SNS topic.
  10. Choose the topic you created, s3_event_notifications_topic.
  11. Choose Save.
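
The queue's access policy (s3_event_notifications_sqs_queue_policy.json) isn't reproduced here either; its purpose is to let the SNS topic deliver messages to the queue. The following sketch applies an assumed equivalent policy and creates the subscription; the ARNs are placeholders.

import json
import boto3

sqs = boto3.client("sqs")
sns = boto3.client("sns")

# Hypothetical ARNs; replace the Region and account number
queue_arn = "arn:aws:sqs:us-east-1:<account-number>:s3_event_notifications_queue"
topic_arn = "arn:aws:sns:us-east-1:<account-number>:s3_event_notifications_topic"
queue_url = sqs.get_queue_url(QueueName="s3_event_notifications_queue")["QueueUrl"]

# Assumed access policy content: allow the SNS topic to write to the queue
queue_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "sns.amazonaws.com"},
        "Action": "sqs:SendMessage",
        "Resource": queue_arn,
        "Condition": {"ArnEquals": {"aws:SourceArn": topic_arn}},
    }],
}
sqs.set_queue_attributes(QueueUrl=queue_url, Attributes={"Policy": json.dumps(queue_policy)})

# Subscribe the queue to the topic so it receives the S3 event notifications
sns.subscribe(TopicArn=topic_arn, Protocol="sqs", Endpoint=queue_arn)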

Create event notifications on the S3 bucket

To create event notifications for your S3 bucket, complete the following steps:

  1. Navigate to the Properties tab of the S3 bucket you created.
  2. In the Event notifications section, choose Create event notification.
  3. For Event name, enter crawler_event.
  4. For Prefix, enter nyc_data/.
  5. For Event types, select All object create events.
  6. For Destination, choose SNS topic and choose the topic s3_event_notifications_topic.
  7. Choose Save changes.
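
You can also configure the notification programmatically. The following sketch sends all object-create events under the nyc_data/ prefix to the SNS topic; the bucket name and topic ARN are placeholders you should replace.

import boto3

s3 = boto3.client("s3")

# Replace with your bucket name and topic ARN
bucket = "s3-event-notifications-bucket-<random-number>"
topic_arn = "arn:aws:sns:us-east-1:<account-number>:s3_event_notifications_topic"

# Send object-create events under the nyc_data/ prefix to the SNS topic
s3.put_bucket_notification_configuration(
    Bucket=bucket,
    NotificationConfiguration={
        "TopicConfigurations": [{
            "Id": "crawler_event",
            "TopicArn": topic_arn,
            "Events": ["s3:ObjectCreated:*"],
            "Filter": {"Key": {"FilterRules": [{"Name": "prefix", "Value": "nyc_data/"}]}},
        }]
    },
)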

Create a crawler

To create your AWS Glue crawler, complete the following steps:

  1. On the AWS Glue console, choose Crawlers.
  2. Choose Add crawler.
  3. For Crawler name, enter s3_event_notifications_crawler.
  4. Choose Next.
  5. For Crawler source type, choose Data stores.
  6. For Repeat crawls of S3 data stores, choose Crawl changes identified by Amazon S3 events.
  7. Choose Next.
  8. For Include path, enter the S3 path of the nyc_data folder in the bucket you created (for example, s3://s3-event-notifications-bucket-<random-number>/nyc_data/).
  9. For Include SQS ARN, add your Amazon SQS ARN.

Including a dead-letter queue is optional; we skip it for this post. Dead-letter queues help you isolate problematic event notifications that a crawler can’t process successfully. To understand the general benefits of dead-letter queues and how they receive messages from the main SQS queue, refer to Amazon SQS dead-letter queues.

  1. Choose Next.
  2. When asked to add another data store, choose No.
  3. For IAM role, select “Choose an existing role” and choose the role you created earlier, s3_event_notifications_crawler_iam_role.
  4. Choose Next.
  5. For Frequency, choose Run on demand.
  6. Choose Next.
  7. Under Database, choose Add database.
  8. For Database name, enter s3_event_notifications_database.
  9. Choose Create.
  10. Choose Next.
  11. Choose Finish to create your crawler.
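
You can create an equivalent crawler with the AWS Glue API. The following boto3 sketch assumes the role, database, bucket, and queue created earlier; the bucket name, Region, and account number are placeholders, and the schema change policy shown is one reasonable default rather than a setting called out in this post.

import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="s3_event_notifications_crawler",
    Role="s3_event_notifications_crawler_iam_role",
    DatabaseName="s3_event_notifications_database",
    Targets={
        "S3Targets": [{
            "Path": "s3://s3-event-notifications-bucket-<random-number>/nyc_data/",
            "EventQueueArn": "arn:aws:sqs:us-east-1:<account-number>:s3_event_notifications_queue",
            # A DlqEventQueueArn can also be set here; we skip the dead-letter queue in this post
        }]
    },
    # Crawl only the folders referenced by S3 events instead of listing the full path
    RecrawlPolicy={"RecrawlBehavior": "CRAWL_EVENT_MODE"},
    SchemaChangePolicy={"UpdateBehavior": "UPDATE_IN_DATABASE", "DeleteBehavior": "LOG"},
)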

Test the solution

The following steps show how adding new objects triggers an event notification that propagates to Amazon SQS, which the crawler uses on subsequent runs. For sample data, we use NYC taxi records from January and February 2020.

  1. Download the following datasets:
    1. green_tripdata_2020-01.csv
    2. green_tripdata_2020-02.csv
  2. On the Amazon S3 console, navigate to the bucket you created earlier.
  3. Navigate to the nyc_data folder you created earlier.
  4. Create a subfolder called dt=202001.

This sends a notification to the SNS topic, and a message is sent to the SQS queue.

  1. In the folder dt=202001, upload the file green_tripdata_2020-01.csv.
  2. To validate that this step generated an S3 event notification, navigate to the queue on the Amazon SQS console.
  3. Choose Send and receive messages.
  4. Under Receive messages, Messages available should show as 1.
  5. Return to the Crawlers page on the AWS Glue console and select the crawler s3_event_notifications_crawler.
  6. Choose Run crawler. After a few seconds, the crawler status changes to Starting and then to Running. The crawler should complete in 1–2 minutes and display a success message.
  7. Confirm that a new table, nyc_data, is in your database.
  8. Choose the table to verify its schema.

The dt column is marked as a partition key.

  1. Choose View partitions to see partition details.
  2. To validate that the crawler consumed this event, navigate to the queue on the Amazon SQS console and choose Send and receive messages.
  3. Under Receive messages, Messages available should show as 0.
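
If you prefer to check the queue from code rather than the console, a quick look at the approximate message count shows whether the crawler has drained the queue.

import boto3

sqs = boto3.client("sqs")

queue_url = sqs.get_queue_url(QueueName="s3_event_notifications_queue")["QueueUrl"]
attrs = sqs.get_queue_attributes(
    QueueUrl=queue_url,
    AttributeNames=["ApproximateNumberOfMessages"],
)

# Expect 1 right after an upload and 0 after the crawler has consumed the event
print(attrs["Attributes"]["ApproximateNumberOfMessages"])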

Now upload another file and see how the S3 event triggers a crawler to run.

  1. On the Amazon S3 console, in your nyc_data folder, create the subfolder dt=202002.
  2. Upload the file green_tripdata_2020-02.csv to this subfolder.
  3. Run the crawler again and wait for the success message.
  4. Return to the AWS Glue table and choose View partitions to see a new partition added.

Clean up

When you’re finished evaluating this feature, delete the SNS topic, SQS queue, AWS Glue crawler, and the S3 bucket and its objects to avoid any further charges.

Conclusion

In this post, we discussed a new way for AWS Glue crawlers to use S3 Event Notifications to reduce the time and cost needed to incrementally process table data updates in the AWS Glue Data Catalog. We discussed two design patterns to implement this approach. The first pattern publishes events directly to an SQS queue, which is useful when only the crawler needs these events. The second pattern publishes events to an SNS topic, which forwards them to an SQS queue for the crawler to process; this is useful when other applications also depend on these events. We then walked through how to implement the second design pattern to incrementally crawl your data. Incremental crawls using S3 event notifications reduce the runtime and cost of your crawlers for large and frequently changing tables.

Let us know your feedback in the comments section. Happy crawling!


About the Authors

Pradeep Patel is a Sr. Software Engineer at AWS Glue. He is passionate about helping customers solve their problems by using the power of the AWS Cloud to deliver highly scalable and robust solutions. In his spare time, he loves to hike and play with web applications.

Vivek Shrivastava is a Principal Data Architect, Data Lake in AWS Professional Services. He is a big data enthusiast and holds 13 AWS Certifications. He is passionate about helping customers build scalable and high-performance data analytics solutions in the cloud. In his spare time, he loves reading and exploring areas for home automation.

Ravi Itha is a Sr. Data Architect at AWS. He works with customers to design and implement data lakes, analytics, and microservices on AWS. He is an open-source committer and has published more than a dozen solutions using AWS CDK, AWS Glue, AWS Lambda, AWS Step Functions, Amazon ECS, Amazon MQ, Amazon SQS, Amazon Kinesis Data Streams, and Amazon Kinesis Data Analytics for Apache Flink. His solutions can be found at his GitHub handle. Outside of work, he is passionate about books, cooking, movies, and yoga.