AWS Big Data Blog
Accessing external components using Amazon Redshift Lambda UDFs
Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse. It makes it simple and cost-effective to analyze all your data using standard SQL, your existing ETL (extract, transform, and load), business intelligence (BI), and reporting tools. Tens of thousands of customers use Amazon Redshift to process exabytes of data per day and power analytics workloads.
Previously, you could create custom scalar user-defined functions (UDFs) using either a SQL SELECT statement or a Python program. With the general availability of the Amazon Redshift Lambda UDF, you can also create functions backed by AWS Lambda code. These Lambda UDFs enable you to interact and integrate with external components outside of Amazon Redshift. You can write Lambda UDFs in any supported programming language, such as Java, Go, PowerShell, Node.js, C#, Python, Ruby, or a custom runtime. This functionality enables new Amazon Redshift use cases, including but not limited to the following:
- Data enrichment from external data stores such as Amazon DynamoDB, Amazon RDS for MySQL, and Amazon ElastiCache
- Data enrichment from external APIs such as the Melissa Global Address Web API
- Data masking and tokenization from external providers such as Protegrity
- Conversion of legacy UDFs written in other languages such as C, C++, and Java
This post describes how the Amazon Redshift Lambda UDF works and walks you through creating your first Amazon Redshift Lambda UDF.
Solution architecture
Amazon Redshift Lambda UDFs are architected to perform efficiently and securely. When you run an Amazon Redshift Lambda UDF, each slice in the Amazon Redshift cluster accumulates the applicable rows of arguments and sends those rows in batches to your Lambda function in parallel. The data is transferred between the Amazon Redshift compute nodes and function in a separate isolated network that is inaccessible by clients. For Amazon Redshift to reconstruct the results, the function receives the input data in an ordered nested array of values and is expected to return the results in a similarly ordered array. Because each Lambda invocation runs in parallel, the result returns to the compute nodes quickly. The compute nodes further process the data (if needed) and return the results to the leader node and SQL client. The following diagram illustrates this architecture.
Solution overview
For this post, we explore the use case of data enrichment from an external data source, using DynamoDB. Our function accepts the DynamoDB table, key field, and key value, and returns a JSON string containing the document associated with that key. In our use case, the transaction data is loaded into Amazon Redshift via a pipeline that is batch loaded from the POS system but contains only the CustomerId. The customer metadata is used by the web application and is stored in DynamoDB. The goal is to run the following SQL statement to look up customer details by CustomerId and retrieve the most up-to-date information:
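The lookup query might look like the following sketch; the udf_dynamodb_lookup function and the transactions table are created later in this walkthrough:

```sql
SELECT t.CustomerId,
       udf_dynamodb_lookup('Customer', 'id', t.CustomerId) AS Customer
FROM transactions t
ORDER BY t.CustomerId;
```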
The following table shows the expected results, in which the customer information is returned when it’s available and NULL is returned when it’s not.
| CustomerId | Customer |
|---|---|
| 0 | |
| 1 | {"lname": "Doe", "id": "1", "fname": "John"} |
| 2 | {"lname": "Doe", "id": "2", "fname": "Jane"} |
| 3 | {"lname": "Salazar", "id": "3", "fname": "Carlos"} |
| 4 | {"lname": "Ramirez", "id": "4", "fname": "Diego"} |
| 12 | |
To implement this solution, we create the following:
- The DynamoDB and Amazon Redshift tables
- The Lambda function with an AWS Identity and Access Management (IAM) role that has access to the DynamoDB table
- The Amazon Redshift UDF mapped to the Lambda function with an IAM role that has access to run the function
Creating the DynamoDB and Amazon Redshift tables
Create a DynamoDB table containing the following customer data:
| id | fname | lname |
|---|---|---|
| 1 | John | Doe |
| 2 | Jane | Doe |
| 3 | Carlos | Salazar |
| 4 | Diego | Ramirez |
| 5 | Mary | Major |
| 6 | Richard | Roe |
To create a table in Amazon Redshift with transactions data that refers to a CustomerId, enter the following code:
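A minimal version of the transactions table might look like the following sketch; the columns other than CustomerId are illustrative assumptions, and the sample rows use the CustomerId values from the expected results above:

```sql
CREATE TABLE transactions (
  TransactionId int,           -- assumed column
  CustomerId    varchar(20),   -- key used for the DynamoDB lookup
  Amount        decimal(10,2)  -- assumed column
);

INSERT INTO transactions VALUES
  (1, '0',  50.00),
  (2, '1',  97.52),
  (3, '2',  15.00),
  (4, '3', 120.99),
  (5, '4',   9.99),
  (6, '12', 10.00);
```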
Creating the Lambda function
The Lambda function receives an input event with metadata about the invocation. The arguments attribute is an ordered nested array of input values. For our use case, you can expect the arguments to be sent as follows:
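A simplified sketch of the event payload follows; the request_id value is a placeholder, and metadata fields beyond those shown are omitted:

```json
{
  "request_id": "23FF1F97-F28A-44AA-AB67-266ED976BF40",
  "num_records": 4,
  "arguments": [
    ["Customer", "id", 0],
    ["Customer", "id", 1],
    ["Customer", "id", 2],
    ["Customer", "id", 12]
  ]
}
```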
This function is written generically so that any table and field combination can be used for the lookup. For example, if I need to enrich my data with Store metadata, I run the same function: udf_dynamodb_lookup('Store', 'id', StoreId).
The code within the Lambda function needs to traverse the input parameters and, for each row, retrieve the corresponding record from DynamoDB. The function is expected to return an output containing metadata such as success, results, and error_msg. The success attribute indicates whether the function was successful. In the case of failure, you can use error_msg to pass a custom message to the user about the error. The results attribute is an ordered array of output values. For our use case, the output is as follows:
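A sketch of the response payload, matching the four-row input above (missing keys come back as null, which Amazon Redshift surfaces as NULL):

```json
{
  "success": true,
  "num_records": 4,
  "results": [
    null,
    "{\"lname\": \"Doe\", \"id\": \"1\", \"fname\": \"John\"}",
    "{\"lname\": \"Doe\", \"id\": \"2\", \"fname\": \"Jane\"}",
    null
  ]
}
```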
The following code is for the Lambda function Lambda_DynamoDB_Lookup, which takes the preceding input arguments containing the CustomerId values and produces the output results containing the corresponding customer metadata when available. The error handling populates error_msg if the table doesn’t exist, but returns None if a particular key doesn’t exist.
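A sketch of the function along those lines follows; the exact structure is an assumption, and the lookup logic is factored into a helper so it can be exercised without an AWS connection:

```python
import json

def lookup_rows(arguments, get_item):
    """Build the ordered results array Amazon Redshift expects.

    arguments -- nested array from the Lambda event; each inner list is
                 one row of UDF inputs: [table_name, key_field, key_value]
    get_item  -- callable(key_field, key_value) returning the DynamoDB
                 item as a dict, or None when the key does not exist
    """
    results = []
    for row in arguments:
        item = get_item(row[1], row[2])
        # Missing keys map to None, which Redshift surfaces as NULL
        results.append(json.dumps(item) if item is not None else None)
    return results

def lambda_handler(event, context):
    # boto3 is imported here so lookup_rows stays testable without the AWS SDK
    import boto3
    try:
        table_name = event["arguments"][0][0]
        table = boto3.resource("dynamodb").Table(table_name)
        table.load()  # raises if the table does not exist
        get_item = lambda field, value: table.get_item(
            Key={field: str(value)}).get("Item")
        return json.dumps({"success": True,
                           "results": lookup_rows(event["arguments"], get_item)})
    except Exception as exc:
        # On failure, error_msg is passed back to the SQL client
        return json.dumps({"success": False, "error_msg": str(exc)})
```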
For the Lambda function to query the DynamoDB Customer table, you need to modify the execution role and grant the DescribeTable and GetItem privileges. Add the following policy to the IAM role, replacing the account number with your AWS account number:
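A policy along these lines grants the two actions; the Region and account number shown are placeholders:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:DescribeTable",
        "dynamodb:GetItem"
      ],
      "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/Customer"
    }
  ]
}
```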
For the Amazon Redshift cluster to invoke the Lambda function you created, you need to associate the function with an IAM role that Amazon Redshift can assume and that has the InvokeFunction privilege. Add the following policy to the IAM role associated with Amazon Redshift, replacing the account number with your AWS account number. If you don’t already have an IAM role associated with Amazon Redshift, you can create one.
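A sketch of such a policy; the Region and account number are placeholders:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "lambda:InvokeFunction",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:Lambda_DynamoDB_Lookup"
    }
  ]
}
```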
Creating the Amazon Redshift UDF
You can now register this function with Amazon Redshift. The following statement creates the function udf_dynamodb_lookup within Amazon Redshift, mapped to the earlier Lambda function. Be sure to associate the function with the IAM role modified earlier.
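A sketch of the registration statement; the IAM role name is a placeholder, and VOLATILE is used here because the function reads mutable external data:

```sql
CREATE OR REPLACE EXTERNAL FUNCTION udf_dynamodb_lookup (varchar, varchar, varchar)
RETURNS varchar
VOLATILE
LAMBDA 'Lambda_DynamoDB_Lookup'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole';
```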
Finally, run the function and you should receive the expected results from earlier:
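For example, using the transactions table from this walkthrough:

```sql
SELECT t.CustomerId,
       udf_dynamodb_lookup('Customer', 'id', t.CustomerId) AS Customer
FROM transactions t;
```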
Summary
In this post, I introduced you to Amazon Redshift Lambda UDFs. I also provided a step-by-step guide for creating your first function, which enriches data in Amazon Redshift using DynamoDB. For more information about creating an Amazon Redshift Lambda UDF, see the Amazon Redshift documentation. If you want to share the UDFs you’ve created with other Amazon Redshift customers, please reach out to us.
We look forward to hearing from you about your experience. If you have questions or suggestions, please leave a comment.
About the Author
Rajiv Gupta is a data warehouse specialist solutions architect with Amazon Web Services.