AWS Big Data Blog
Accessing external components using Amazon Redshift Lambda UDFs
Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse. It makes it simple and cost-effective to analyze all your data using standard SQL, your existing ETL (extract, transform, and load), business intelligence (BI), and reporting tools. Tens of thousands of customers use Amazon Redshift to process exabytes of data per day and power analytics workloads.
Previously, you could create custom scalar user-defined functions (UDFs) using either a SQL SELECT statement or a Python program. With the general availability of the Amazon Redshift Lambda UDF, you can also create functions backed by AWS Lambda code. These Lambda UDFs enable you to interact and integrate with external components outside of Amazon Redshift. You can write Lambda UDFs in any supported programming language, such as Java, Go, PowerShell, Node.js, C#, Python, Ruby, or a custom runtime. This functionality enables new Amazon Redshift use cases, including but not limited to the following:
- Data enrichment from external data stores such as Amazon DynamoDB, Amazon RDS for MySQL, and Amazon ElastiCache
- Data enrichment from external APIs such as the Melissa Global Address Web API
- Data masking and tokenization from external providers such as Protegrity
- Conversion of legacy UDFs written in other languages such as C, C++, and Java
This post describes how the Amazon Redshift Lambda UDF works and walks you through creating your first Amazon Redshift Lambda UDF.
Solution architecture
Amazon Redshift Lambda UDFs are architected to perform efficiently and securely. When you run an Amazon Redshift Lambda UDF, each slice in the Amazon Redshift cluster accumulates the applicable rows of arguments and sends those rows in batches to your Lambda function in parallel. The data is transferred between the Amazon Redshift compute nodes and function in a separate isolated network that is inaccessible by clients. For Amazon Redshift to reconstruct the results, the function receives the input data in an ordered nested array of values and is expected to return the results in a similarly ordered array. Because each Lambda invocation runs in parallel, the result returns to the compute nodes quickly. The compute nodes further process the data (if needed) and return the results to the leader node and SQL client. The following diagram illustrates this architecture.
Solution overview
For this post, we explore the use case of data enrichment from an external data source, using DynamoDB. Our function accepts the DynamoDB table, key field, and key value, and returns a JSON string containing the document associated with that key. In our use case, the transaction data is loaded into Amazon Redshift via a pipeline that is batch loaded from the POS system but contains only the CustomerId. The customer metadata is used by the web application and is stored in DynamoDB. The goal is to run the following SQL statement to look up customer details by CustomerId and retrieve the most up-to-date information:
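The lookup query might look like the following sketch; the udf_dynamodb_lookup function and the transactions table are created later in this walkthrough:

```sql
SELECT t.CustomerId,
       udf_dynamodb_lookup('Customer', 'id', t.CustomerId) AS Customer
FROM transactions t
ORDER BY t.CustomerId;
```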
The following table shows the expected results, in which the customer information is returned when it’s available and NULL is returned when it’s not.
| CustomerId | Customer |
|---|---|
| 0 | |
| 1 | {"lname": "Doe", "id": "1", "fname": "John"} |
| 2 | {"lname": "Doe", "id": "2", "fname": "Jane"} |
| 3 | {"lname": "Salazar", "id": "3", "fname": "Carlos"} |
| 4 | {"lname": "Ramirez", "id": "4", "fname": "Diego"} |
| 12 | |
To implement this solution, we create the following:
- The DynamoDB and Amazon Redshift tables
- The Lambda function with an AWS Identity and Access Management (IAM) role that has access to the DynamoDB table
- The Amazon Redshift UDF mapped to the Lambda function with an IAM role that has access to run the function
Creating the DynamoDB and Amazon Redshift tables
Create a DynamoDB table containing the following customer data:
| id | fname | lname |
|---|---|---|
| 1 | John | Doe |
| 2 | Jane | Doe |
| 3 | Carlos | Salazar |
| 4 | Diego | Ramirez |
| 5 | Mary | Major |
| 6 | Richard | Roe |
To create a table in Amazon Redshift with transactions data that refers to a CustomerId, enter the following code:
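A minimal version of the transactions table might look like the following sketch; the columns other than CustomerId are illustrative assumptions, and the sample rows use the CustomerId values from the expected results above:

```sql
CREATE TABLE transactions (
  TransactionId int,           -- assumed column
  CustomerId    varchar(20),   -- key used for the DynamoDB lookup
  Amount        decimal(10,2)  -- assumed column
);

INSERT INTO transactions VALUES
  (1, '0',  50.00),
  (2, '1',  97.52),
  (3, '2',  15.00),
  (4, '3', 120.99),
  (5, '4',   9.99),
  (6, '12', 10.00);
```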
Creating the Lambda function
The Lambda function receives an input event with metadata about the invocation. The arguments attribute is an ordered nested array of input values. For our use case, you can expect the arguments to be sent as follows:
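A simplified sketch of the event payload follows; the request_id value is a placeholder, and metadata fields beyond those shown are omitted:

```json
{
  "request_id": "23FF1F97-F28A-44AA-AB67-266ED976BF40",
  "num_records": 4,
  "arguments": [
    ["Customer", "id", 0],
    ["Customer", "id", 1],
    ["Customer", "id", 2],
    ["Customer", "id", 12]
  ]
}
```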
This function is written generically so that any table and field combination can be used for the lookup. For example, if I need to enrich my data with Store metadata, I run the same function: udf_dynamodb_lookup('Store', 'id', StoreId).
The code within the Lambda function needs to traverse the input parameters and, for each row, retrieve the corresponding record from DynamoDB. The function is expected to return an output containing metadata such as success, results, and error_msg. The success attribute indicates whether the function was successful. In the case of failure, you can use error_msg to pass a custom message to the user about the error. The results attribute is an ordered array of output values. For our use case, the output is as follows:
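A sketch of the response payload, matching the four-row input above (missing keys come back as null, which Amazon Redshift surfaces as NULL):

```json
{
  "success": true,
  "num_records": 4,
  "results": [
    null,
    "{\"lname\": \"Doe\", \"id\": \"1\", \"fname\": \"John\"}",
    "{\"lname\": \"Doe\", \"id\": \"2\", \"fname\": \"Jane\"}",
    null
  ]
}
```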
The following code is for the Lambda function Lambda_DynamoDB_Lookup, which takes the preceding input arguments containing the CustomerId values and produces the output results containing the corresponding customer metadata when available. The error handling populates error_msg if the table doesn’t exist, but returns None if a particular key doesn’t exist.
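A sketch of the function along those lines follows; the exact structure is an assumption, and the lookup logic is factored into a helper so it can be exercised without an AWS connection:

```python
import json

def lookup_rows(arguments, get_item):
    """Build the ordered results array Amazon Redshift expects.

    arguments -- nested array from the Lambda event; each inner list is
                 one row of UDF inputs: [table_name, key_field, key_value]
    get_item  -- callable(key_field, key_value) returning the DynamoDB
                 item as a dict, or None when the key does not exist
    """
    results = []
    for row in arguments:
        item = get_item(row[1], row[2])
        # Missing keys map to None, which Redshift surfaces as NULL
        results.append(json.dumps(item) if item is not None else None)
    return results

def lambda_handler(event, context):
    # boto3 is imported here so lookup_rows stays testable without the AWS SDK
    import boto3
    try:
        table_name = event["arguments"][0][0]
        table = boto3.resource("dynamodb").Table(table_name)
        table.load()  # raises if the table does not exist
        get_item = lambda field, value: table.get_item(
            Key={field: str(value)}).get("Item")
        return json.dumps({"success": True,
                           "results": lookup_rows(event["arguments"], get_item)})
    except Exception as exc:
        # On failure, error_msg is passed back to the SQL client
        return json.dumps({"success": False, "error_msg": str(exc)})
```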
For the Lambda function to query the DynamoDB Customer table, you need to modify the execution role and grant the DescribeTable and GetItem privileges. Add the following policy to the IAM role, replacing the account number with your AWS account number:
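A policy along these lines grants the two actions; the Region and account number shown are placeholders:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:DescribeTable",
        "dynamodb:GetItem"
      ],
      "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/Customer"
    }
  ]
}
```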
For the Amazon Redshift cluster to invoke the Lambda function you created, you need to associate the function with an IAM role that Amazon Redshift can assume and that has the InvokeFunction privilege. Add the following policy to the IAM role associated with Amazon Redshift, replacing the account number with your AWS account number. If you don’t already have an IAM role associated with Amazon Redshift, you can create one.
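A sketch of such a policy; the Region and account number are placeholders:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "lambda:InvokeFunction",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:Lambda_DynamoDB_Lookup"
    }
  ]
}
```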
Creating the Amazon Redshift UDF
You can now register this function with Amazon Redshift. The following statement creates the function udf_dynamodb_lookup within Amazon Redshift, mapped to the earlier Lambda function. Be sure to associate the function with the IAM role modified earlier.
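A sketch of the registration statement; the IAM role name is a placeholder, and VOLATILE is used here because the function reads mutable external data:

```sql
CREATE OR REPLACE EXTERNAL FUNCTION udf_dynamodb_lookup (varchar, varchar, varchar)
RETURNS varchar
VOLATILE
LAMBDA 'Lambda_DynamoDB_Lookup'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole';
```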
Finally, run the function and you should receive the expected results from earlier:
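For example, using the transactions table from this walkthrough:

```sql
SELECT t.CustomerId,
       udf_dynamodb_lookup('Customer', 'id', t.CustomerId) AS Customer
FROM transactions t;
```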
Summary
In this post, I introduced you to Amazon Redshift Lambda UDFs. I also provided a step-by-step guide for creating your first function, which enriches data in Amazon Redshift using DynamoDB. For more information about creating an Amazon Redshift Lambda UDF, see the Amazon Redshift documentation. If you want to share the UDFs you’ve created with other Amazon Redshift customers, please reach out to us.
We look forward to hearing from you about your experience. If you have questions or suggestions, please leave a comment.
About the Author
Rajiv Gupta is a data warehouse specialist solutions architect with Amazon Web Services.