Use SQL queries to define Amazon Redshift datasets in AWS Glue DataBrew

July 2023: This post was reviewed for accuracy.

In the post Data preparation using Amazon Redshift with AWS Glue DataBrew, we saw how to create an AWS Glue DataBrew job using a JDBC connection for Amazon Redshift. In this post, we show you how to create a DataBrew profile job and a recipe job using an Amazon Redshift connection with custom SQL.

DataBrew is a visual data preparation tool that can help you simplify your extract, transform, and load (ETL) process. You can now define a dataset from Amazon Redshift by applying custom SQL statements. Applying a custom SQL statement to a large source table allows you to select, join, and filter the data before cleaning, normalizing, and transforming it in a DataBrew project. Filtering and joining the data from your data source and only bringing in the data you want to transform simplifies the ETL process.

In this post, we demonstrate how to use custom SQL queries to define your Amazon Redshift datasets in DataBrew.

Solution overview

To implement this solution, you complete the following high-level steps:

Create an Amazon Redshift connection.
Create your dataset and use SQL queries to define your Amazon Redshift source datasets.
Create a DataBrew profile job to profile the source data.
Create a DataBrew project and recipe job to transform the data and load it to Amazon Simple Storage Service (Amazon S3).

The following diagram illustrates the architecture for our solution.

Prerequisites

To use this solution, complete the following prerequisite steps:

Have an AWS account.
Create an Amazon Redshift cluster in a private subnet within a VPC as a security best practice.
Because DataBrew commands require that the cluster has access to Amazon S3, make sure you create a gateway VPC endpoint to Amazon S3. The gateway endpoint provides reliable connectivity to Amazon S3 without requiring an internet gateway or NAT device from your VPC.
Enable the enhanced VPC routing in the Amazon Redshift cluster. Enhanced VPC routing forces all Amazon Redshift commands to use the connectivity to the gateway VPC endpoint to Amazon S3 in the same AWS Region as your cluster.
Create a database and tables, and load the sample data in the Amazon Redshift cluster.
Prepare a SQL query to extract the source dataset. You use this SQL query later in this post to create an Amazon Redshift source dataset in DataBrew.
Create an S3 bucket to store data from the profile and recipe jobs. The DataBrew connection temporarily stores intermediate data in Amazon S3.
For our use case, we use a mock dataset. You can download the DDL and data files from GitHub.

Security best practices

Consider the following best practices in order to mitigate security threats:

Review the shared responsibility model when using DataBrew.
Restrict network access for inbound and outbound traffic to least privilege. Take advantage of the routing traffic within the VPC by using an Amazon S3 gateway endpoint and enhanced VPC routing in Amazon Redshift.
Enable the lifecycle policy in Amazon S3 to retain only necessary data, and delete unnecessary data.
Enable Amazon S3 versioning and cross-Region replication for critical datasets to protect against accidental deletes.
Enable server-side encryption using AWS KMS (SSE-KMS) or Amazon S3 (SSE-S3).
DataBrew uses Amazon CloudWatch for logging, so you should update your log retention period to retain logs for the appropriate length of time.

Create an Amazon Redshift connection

In this section, you create a connection in DataBrew to connect to your Amazon Redshift cluster.

On the DataBrew console, choose Datasets in the navigation pane.
On the Connections tab, choose Create connection.
For Connection name, enter a name, such as order-db-connection.
For Connection type, select Amazon Redshift.
Under Connection access, provide the Amazon Redshift cluster name, database name, database user, and database password.
Choose Create connection.

Create your dataset by applying a custom SQL statement to filter the source data

In this section, you create a Amazon Redshift connection, add your custom SQL statement, and validate it. You can also validate your SQL statement directly in your Amazon Redshift cluster by using the Amazon Redshift query editor v2. The purpose of validating the SQL statement is to help you avoid failure in loading your dataset into a project or job. Also, checking the query runtime ensures that it runs in under 3 minutes, avoiding timeouts during project loading. To analyze and improve query performance in Amazon Redshift, see Tuning query performance.

On the DataBrew console, choose Datasets in the navigation pane.
On the Datasets tab, choose Connect new dataset.
For Dataset name, enter a name, such as order-data.
In the left pane, choose Amazon Redshift under Database connections.
Add your Amazon Redshift connection and select Enter custom SQL.
Enter the SQL query and choose Validate SQL.
Under Additional configurations, for Enter S3 destination, provide an S3 destination to temporarily store the intermediate results.
Choose Create dataset.

Create a DataBrew profile job

In this section, you use the newly created Amazon Redshift dataset to create a profile job. Data profiling helps you understand your dataset and plan the data preparation steps needed in running your recipe jobs.

On the DataBrew console, choose Jobs in the navigation pane.
On the Profile jobs tab, choose Create job.
For Job name, enter a name, such as order-data-profile-job.
For Job type¸ select Create a profile job.
Under Job input, choose Browse datasets and choose the dataset you created earlier (order-data).
For Data sample, select Full dataset.
Under Job output settings¸ for S3 location, enter the S3 bucket for the job output files.
For Role name, choose an AWS Identity and Access Management (IAM) role with permission for DataBrew to connect to the data on your behalf. For more information, refer to Adding an IAM role with data resource permissions.
Choose Create and run job.

Check the status of your profile job. A profile output file is created and stored in Amazon S3 upon completion. You can choose View data profile to see more information.

In addition to an output file, DataBrew also provides visualizations. On the Dataset profile overview tab, you can see data visualizations that can help you understand your data better. Next, you can see detailed statistics about your data on the Column statistics tab, illustrated with graphics and charts. You can define data quality rules on the Data quality rules tab, and then see the results from the data quality ruleset that applies to this dataset.

For example, in the following screenshot, the amount column has 2% missing values, as shown on the Column statistics tab. You can provide rules that avoid triggering a recipe job in case of an anomaly. You can also notify the source teams to handle or acknowledge the missing values. DataBrew users can also add steps in the recipe job to handle the anomalies and missing values.

Create a DataBrew project and recipe job

In this section, you start analyzing and transforming your Amazon Redshift dataset in a DataBrew project. The custom SQL statement runs in Amazon Redshift when the project is loaded. Databrew performs read-only access to your source data.

Create a project

To create your project, complete the following steps:

On the DataBrew console, choose Projects in the navigation pane.
Choose Create project.
For Project name, enter a name, such as order-data-proj.
Under Recipe details¸ choose Create new recipe and enter a recipe name, such as order-data-proj-recipe.
For Select a dataset, select My datasets.
Select the dataset you created earlier (order-data).
Under Permissions, for Role name, choose your DataBrew role.
Choose Create project.

DataBrew starts a session, constructs a DataFrame, extracts sample data, infers basic statistics, and displays the sample data in a grid view. You can add steps to build a transformation recipe. As of this writing, DataBrew offers over 350 transformations, with more on the way.

For our example use case, Company ABC has set a target to ship all orders within 7 days after the order date (internal SLA). They want a list of orders that didn’t meet the 7-day SLA for additional investigation. The following sample recipe contains steps to handle the missing values, filter the values by amount, change the date format, calculate the date difference, and filter the values by shipping days. The detailed steps are as follows:

Fill missing values with 0 for the amount column.
Filter values by amount greater than 0.
Change the format of order_timestamp to align with ship_date.
Create a new column called days_for_shipping using the dateTime function DATEDIFF to show the difference between order_timestamp and ship_date in days.
Filter the values by days_for_shipping greater than 7.

Create a recipe job

To create your DataBrew recipe job, complete the following steps:

On the DataBrew console, choose Jobs in the navigation pane.
Choose Create job.
For Job name¸ enter a name, such as SHIPPING-SLA-MISS.
Under Job output settings, configure your Amazon S3 output settings.
For S3 location, enter the location of your output bucket.
For Role name, choose the IAM role that contains permissions for DataBrew to connect on your behalf.
Choose Create and run job.

You can check the status of your job on the Jobs page.

The output file is in Amazon S3 as specified, and your data transformation is now complete.

Clean up

To avoid incurring future charges, we recommend deleting the resources you created during this walkthrough.

Conclusion

In this post, we walked through applying custom SQL statements to an Amazon Redshift data source in your dataset, which you can use in profiling and transformation jobs. You can now focus on building your data transformation steps knowing that you’re working on only the needed data.

To learn more about the various supported data sources for DataBrew, see Connecting to data with AWS Glue DataBrew.

About the authors

Suraj Shivananda is a Solutions Architect at AWS. He has over a decade of experience in Software Engineering, Data and Analytics, DevOps specifically for data solutions, automating and optimizing cloud based solutions. He’s a trusted technical advisor and helps customers build Well Architected solutions on the AWS platform.

Marie Yap is a Principal Solutions Architect for Amazon Web Services based in Hawaii. In this role, she helps various organizations begin their journey to the cloud. She also specializes in analytics and modern data architectures.

Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration, and strategy. He is passionate about technology and enjoys building and experimenting in the analytics and AI/ML space.