AWS Big Data Blog

Building an event-driven application with AWS Lambda and the Amazon Redshift Data API

Eventdriven applications are becoming popular with many customers, where applications run in response to events. A primary benefit of this architecture is the decoupling of producer and consumer processes, allowing greater flexibility in application design and building decoupled processes.

An example of an even-driven application is an automated workflow being triggered by an event, which runs a series of transformations in the data warehouse. At the end of this workflow, another event gets initiated to notify end-users about the completion of those transformations and that they can start analyzing the transformed dataset.

In this post, we explain how you can easily design a similar event-driven application with Amazon Redshift, AWS Lambda, and Amazon EventBridge. In response to a scheduled event defined in EventBridge, this application automatically triggers a Lambda function to run a stored procedure performing extract, load, and transform (ELT) operations in an Amazon Redshift data warehouse, using its out-of-the-box integration with the Amazon Redshift Data API. This stored procedure copies the source data from Amazon Simple Storage Service (Amazon S3) to Amazon Redshift and aggregates the results. When complete, it sends an event to EventBridge, which triggers a Lambda function to send notification to end-users through Amazon Simple Notification Service (Amazon SNS) to inform them about the availability of updated data in Amazon Redshift.

This event-driven server-less architecture offers greater extensibility and simplicity, making it easier to maintain and faster to release new features, and also reduces the impact of changes. It also simplifies adding other components or third-party products to the application without many changes.

Prerequisites

As a prerequisite for creating the application in this post, you need to set up an Amazon Redshift cluster and associate it with an AWS Identity and Access Management (IAM) role. For more information, see Getting Started with Amazon Redshift.

Solution overview

The following architecture diagram highlights the end-to-end solution, which you can provision automatically with an AWS CloudFormation template.

The workflow includes the following steps:

  1. The EventBridge rule EventBridgeScheduledEventRule is initiated based on a cron schedule.
  2. The rule triggers the Lambda function LambdaRedshiftDataApiETL, with the action run_sql as an input parameter. The Python code for the Lambda function is available in the GitHub repo.
  3. The function performs an asynchronous call to the stored procedure run_elt_process in Amazon Redshift, performing ELT operations using the Amazon Redshift Data API.
  4. The stored procedure uses the Amazon S3 location event-driven-app-with-lambda-redshift/nyc_yellow_taxi_raw/ as the data source for the ELT process. We have pre-populated this with the NYC Yellow Taxi public dataset for the year 2015 to test this solution.
  5. When the stored procedure is complete, the EventBridge rule EventBridgeRedshiftEventRule is triggered automatically to capture the event based on the source parameter redshift-data from the Amazon Redshift Data API.
  6. The rule triggers the Lambda function LambdaRedshiftDataApiETL, with the action notify as an input parameter.
  7. The function uses the SNS topic RedshiftNotificationTopicSNS to send an automated email notification to end-users that the ELT process is complete.

The Amazon Redshift database objects required for this solution are provisioned automatically by the Lambda function LambdaSetupRedshiftObjects as part of the CloudFormation template initiation by invoking the function LambdaRedshiftDataApiETL, which creates the following objects in Amazon Redshift:

  • Table nyc_yellow_taxi, which we use to copy the New York taxi dataset from Amazon S3
  • Materialized view nyc_yellow_taxi_volume_analysis, providing an aggregated view of table
  • Stored procedure run_elt_process to take care of data transformations

The Python code for this function is available in the GitHub repo.

We also use the IAM role LambdaRedshiftDataApiETLRole for the Lambda function and  LambdaRedshiftDataApiETL to allow the following permissions:

  • Federate to the Amazon Redshift cluster through getClusterCredentials permission, avoiding password credentials
  • Initiate queries in the Amazon Redshift cluster through redshift-data API calls
  • Log with Amazon CloudWatch for troubleshooting purposes
  • Send notifications through Amazon SNS

A sample IAM role for this function is available in the GitHub repo.

Lambda is a key service in this solution because it initiates queries in Amazon Redshift using the redshift-data client. Based on the input parameter action, this function can asynchronously initiate Structured Query Language (SQL) statements in Amazon Redshift, thereby avoiding chances of timing out in case of long-running SQL statements[MOU1] [MOU2]   [MOU1]I think we should put reference to Redshift Data API and highlight that there is no need to configure drivers and connections [MOU2]done. It can also publish custom notifications through Amazon SNS. Also, it uses the Amazon Redshift Data API temporary credentials functionality, which allows it to communicate with Amazon Redshift using IAM permissions without the need of any password-based authentication. With the Data API, you also don’t need to configure drivers and connections for your Amazon Redshift cluster, because it’s handled automatically.

Deploying the CloudFormation template

When your Amazon Redshift cluster is set up, use the provided CloudFormation template to automatically create all required resources for this solution in your AWS account. For more information, see Getting started with AWS CloudFormation.

The template requires you to provide the following parameters:

  • RedshiftClusterIdentifier – Cluster identifier for your Amazon Redshift cluster.
  • DbUsername – Amazon Redshift database user name that has access to run the SQL script.
  • DatabaseName – Name of the Amazon Redshift database where the SQL script runs.
  • RedshiftIAMRoleARN – ARN of the IAM role associated with the Amazon Redshift cluster.
  • NotificationEmailId – Email to send event notifications through Amazon SNS.
  • ExecutionSchedule – Cron expression to schedule the ELT process through an EventBridge rule.
  • SqlText – SQL text to run as part of the ELT process. Don’t change the default value call run_elt_process(); if you want to test this solution with the test dataset provided for this post.

The following screenshot shows the stack details on the AWS CloudFormation console.

Testing the pipeline

After setting up the architecture, you should have an automated pipeline to trigger based on the schedule you defined in the EventBridge rule’s cron expression. You can view the CloudWatch logs and troubleshoot issues in the Lambda function. The following screenshot shows the logs for our setup.

You can also view the query status on the Amazon Redshift console, which allows you to view detailed execution plans for the queries you ran. Although the stored procedure may take around 6 minutes to complete, the Lambda function finishes in seconds. This is primarily because the execution from Lambda on Amazon Redshift was asynchronous. Therefore, the function is complete after initiating the process in Amazon Redshift without caring about the query completion.

When this process is complete, you receive the email notification that the ELT process is complete.

You may then view the updated data in your business intelligence tool, like Amazon QuickSight, or query data directly in Amazon Redshift Query Editor (see the following screenshot) to view the most recent data processed by this event-driven architecture.

Conclusion

The Amazon Redshift Data API enables you to painlessly interact with Amazon Redshift and enables you to build event-driven and cloud-native applications. We demonstrated how to build an event-driven application with Amazon Redshift, Lambda, and EventBridge. For more information about the Data API, see Using the Amazon Redshift Data API to interact with Amazon Redshift clusters and Using the Amazon Redshift Data API.


About the Authors

Manash Deb is a Senior Analytics Specialist Solutions Architect. He has worked in different database and data warehousing technologies for more than 15 years.

 

 

 

Debu Panda, a senior product manager at AWS, is an industry leader in analytics, application platform, and database technologies. He has more than 20 years of experience in the IT industry and has published numerous articles on analytics, enterprise Java, and databases and has presented at multiple conferences. He is lead author of the EJB 3 in Action (Manning Publications 2007, 2014) and Middleware Management (Packt).

 

Fei Peng is a Software Dev Engineer working in the Amazon Redshift team.