Data preparation using Amazon Redshift with AWS Glue DataBrew

July 2023: This post was reviewed for accuracy.

With AWS Glue DataBrew, data analysts and data scientists can easily access and visually explore any amount of data across their organization directly from their Amazon Simple Storage Service (Amazon S3) data lake, Amazon Redshift data warehouse, Amazon Aurora, and other Amazon Relational Database Service (Amazon RDS) databases. You can choose from over 250 built-in functions to merge, pivot, and transpose the data without writing code.

Now, with added support for JDBC-accessible databases, DataBrew also supports additional data stores, including PostgreSQL, MySQL, Oracle, and Microsoft SQL Server. In this post, we use DataBrew to clean data from an Amazon Redshift table, and transform and use different feature engineering techniques to prepare data to build a machine learning (ML) model. Finally, we store the transformed data in an S3 data lake to build the ML model in Amazon SageMaker.

Use case overview

For our use case, we use mock student datasets that contain student details like school, student ID, name, age, student study time, health, country, and marks. The following screenshot shows an example of our data.

The data scientist uses this data to build an ML model that predicts a student’s score in an upcoming annual exam. However, the raw data requires cleaning and transformation first. A data engineer must perform the required data transformation so the data scientist can use the transformed data to build the model in SageMaker.

Solution overview

The following diagram illustrates our solution architecture.

The workflow includes the following steps:

1. Create a JDBC connection for Amazon Redshift and a DataBrew project.
2. DataBrew queries sample student performance data from Amazon Redshift and performs the transformation and feature engineering needed to prepare the data for building the ML model.
3. The DataBrew job writes the final output to our S3 output bucket.
4. The data scientist builds the ML model in SageMaker to predict student marks in an upcoming annual exam.

We cover steps 1–3 in this post.

Prerequisites

To complete this solution, you should have an AWS account.

Prelab setup

Before beginning this tutorial, make sure you have the permissions needed to create the resources used in the solution.

For our use case, we use a mock dataset. You can download the DDL and data files from GitHub.

Create the Amazon Redshift cluster to capture the student performance data.
Set up a security group for Amazon Redshift.
Create a schema called student_schema and a table called study_details. You can use DDLsql to create database objects.
We recommend using the COPY command to load a table in parallel from data files on Amazon S3. However, for this post, you can use study_details.sql to insert the data in the tables.
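As a rough illustration of the COPY approach mentioned above, the following sketch builds a COPY statement for the study_details table. It assumes the data has been exported to Amazon S3 as a CSV file with a header row; the bucket, file name, and IAM role ARN are hypothetical placeholders, not values from this post.

```python
def build_copy_statement(table: str, s3_uri: str, iam_role_arn: str) -> str:
    """Return a Redshift COPY statement for a CSV file that has a header row."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS CSV\n"
        "IGNOREHEADER 1;"
    )


# Hypothetical S3 location and role; replace with your own values.
statement = build_copy_statement(
    table="student_schema.study_details",
    s3_uri="s3://example-bucket/student/study_details.csv",
    iam_role_arn="arn:aws:iam::123456789012:role/example-redshift-copy-role",
)
print(statement)
```

You can run the printed statement in your preferred SQL client for Amazon Redshift. The role must be attached to the cluster and allowed to read the S3 location.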

Create an Amazon Redshift connection

To create your Amazon Redshift connection, complete the following steps:

On the DataBrew console, choose Datasets.
On the Connections tab, choose Create connection.
For Connection name, enter a name (for example, student-db-connection).
For Connection type, select JDBC.
Provide other parameters like the JDBC URL and login credentials.
In the Network options section, choose the VPC, subnet, and security groups of your Amazon Redshift cluster.
Choose Create connection.
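If you prefer to script this step, the following is a minimal sketch of the same settings expressed as an AWS Glue connection definition. It assumes that DataBrew connections are stored as AWS Glue connections whose names carry the AwsGlueDatabrew- prefix; the endpoint, database name, credentials, subnet, and security group shown are hypothetical.

```python
def redshift_jdbc_url(host: str, database: str, port: int = 5439) -> str:
    """Build a JDBC URL for an Amazon Redshift cluster endpoint."""
    return f"jdbc:redshift://{host}:{port}/{database}"


def build_connection_input(name, jdbc_url, username, password,
                           subnet_id, security_group_ids, availability_zone):
    """Build the ConnectionInput structure for the Glue create_connection call."""
    return {
        "Name": name,
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": jdbc_url,
            "USERNAME": username,
            "PASSWORD": password,  # avoid hardcoding real credentials
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": subnet_id,
            "SecurityGroupIdList": list(security_group_ids),
            "AvailabilityZone": availability_zone,
        },
    }


# All values below are hypothetical placeholders.
connection_input = build_connection_input(
    name="AwsGlueDatabrew-student-db-connection",
    jdbc_url=redshift_jdbc_url("example-cluster.abc123.us-east-1.redshift.amazonaws.com", "dev"),
    username="example_user",
    password="example_password",
    subnet_id="subnet-0123456789abcdef0",
    security_group_ids=["sg-0123456789abcdef0"],
    availability_zone="us-east-1a",
)
print(connection_input["ConnectionProperties"]["JDBC_CONNECTION_URL"])
```

With boto3 installed, you could pass the result as boto3.client("glue").create_connection(ConnectionInput=connection_input).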

Create datasets

To create the datasets, complete the following steps:

On the Datasets page of the DataBrew console, choose Connect new dataset.
For Dataset name, enter a name (for example, student).
For Your JDBC source, choose the connection you created (AwsGlueDatabrew-student-db-connection).
Select the study_details table.
For Enter S3 destination, enter an S3 bucket for Amazon Redshift to store the intermediate result.
Choose Create dataset.
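The console steps above correspond roughly to a single DataBrew CreateDataset request. The following sketch builds that request; the bucket name and key prefix for the intermediate results are hypothetical, and the schema-qualified table name is an assumption based on the objects created in the prelab setup.

```python
def build_dataset_request(name, connection_name, table, temp_bucket, temp_prefix):
    """Build keyword arguments for the DataBrew create_dataset call (JDBC source)."""
    return {
        "Name": name,
        "Input": {
            "DatabaseInputDefinition": {
                "GlueConnectionName": connection_name,
                "DatabaseTableName": table,
                # S3 location where intermediate query results are staged
                "TempDirectory": {"Bucket": temp_bucket, "Key": temp_prefix},
            }
        },
    }


request = build_dataset_request(
    name="student",
    connection_name="AwsGlueDatabrew-student-db-connection",
    table="student_schema.study_details",
    temp_bucket="example-databrew-temp-bucket",  # hypothetical bucket
    temp_prefix="redshift-temp/",                # hypothetical prefix
)
print(request)
```

With boto3 installed, you could send it with boto3.client("databrew").create_dataset(**request).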

You can also configure a lifecycle rule to automatically clean up old files from the S3 bucket.
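As one possible way to set up such a rule, the following sketch builds an S3 lifecycle configuration that expires objects under a given prefix after a number of days. The prefix, rule ID, and the 7-day retention period are illustrative assumptions; choose values that suit your environment.

```python
def build_lifecycle_configuration(prefix: str, days: int) -> dict:
    """Build an S3 lifecycle configuration that expires objects under a prefix."""
    return {
        "Rules": [
            {
                "ID": f"expire-{prefix.strip('/')}-after-{days}-days",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


# Hypothetical prefix and retention period.
lifecycle = build_lifecycle_configuration(prefix="redshift-temp/", days=7)
print(lifecycle)
```

With boto3 installed, you could apply it with boto3.client("s3").put_bucket_lifecycle_configuration(Bucket="your-bucket", LifecycleConfiguration=lifecycle). Note that this call replaces any lifecycle configuration already on the bucket.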

Create a project using the datasets

To create your DataBrew project, complete the following steps:

On the DataBrew console, on the Projects page, choose Create project.
For Project Name, enter student-proj.
For Attached recipe, choose Create new recipe.

The recipe name is populated automatically.

For Select a dataset, select My datasets.
Select the student dataset.
For Role name, choose the AWS Identity and Access Management (IAM) role to be used with DataBrew.
Choose Create project.

You can see a success message along with our Amazon Redshift study_details table with 500 rows.

After the project is opened, a DataBrew interactive session is created. DataBrew retrieves sample data based on your sampling configuration selection.
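For reference, here is a hedged sketch of the equivalent DataBrew CreateProject request. It assumes the recipe already exists (the console creates it for you), that the sample is the first 500 rows, and that the IAM role ARN shown is a hypothetical placeholder.

```python
def build_project_request(name, dataset_name, recipe_name, role_arn,
                          sample_size=500, sample_type="FIRST_N"):
    """Build keyword arguments for the DataBrew create_project call."""
    return {
        "Name": name,
        "DatasetName": dataset_name,
        "RecipeName": recipe_name,
        "RoleArn": role_arn,
        # Sampling configuration used for the interactive session
        "Sample": {"Size": sample_size, "Type": sample_type},
    }


request = build_project_request(
    name="student-proj",
    dataset_name="student",
    recipe_name="student-proj-recipe",
    role_arn="arn:aws:iam::123456789012:role/example-databrew-role",  # hypothetical
)
print(request)
```

With boto3 installed, you could send it with boto3.client("databrew").create_project(**request).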

Create a profiling job

DataBrew helps you evaluate the quality of your data by profiling it to understand data patterns and detect anomalies.

To create your profiling job, complete the following steps:

On the DataBrew console, choose Jobs in the navigation pane.
On the Profile jobs tab, choose Create job.
For Job name, enter student-profile-job.
Choose the student dataset.
Provide the S3 location for job output.
For Role name, choose the role to be used with DataBrew.
Choose Create and run job.

Wait for the job to complete.
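If you start the job programmatically, waiting for completion can be sketched as a simple polling loop. The function below takes a DataBrew client as an argument (for example, boto3.client("databrew")) and relies on its start_job_run and describe_job_run calls; the 30-second polling interval is an arbitrary assumption.

```python
import time

# Job run states after which no further progress is expected
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"}


def run_job_and_wait(client, job_name, poll_seconds=30.0, sleep=time.sleep):
    """Start a DataBrew job run and poll until it reaches a terminal state."""
    run_id = client.start_job_run(Name=job_name)["RunId"]
    while True:
        state = client.describe_job_run(Name=job_name, RunId=run_id)["State"]
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)


# Example usage (requires boto3 and AWS credentials):
# import boto3
# final_state = run_job_and_wait(boto3.client("databrew"), "student-profile-job")
```

The sleep parameter exists only so the loop can be exercised without real delays; in normal use you can leave it at its default.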

Choose the Columns statistics tab.

You can see that the age column has some missing values.

You can also see that the study_time_in_hr column has two outliers.

Build a transformation recipe

All ML algorithms use input data to generate outputs. Input data comprises features, usually arranged in structured columns. To work properly, the features need to have specific characteristics. This is where feature engineering comes in. In this section, we apply some feature engineering techniques to prepare our dataset for building the model in SageMaker.

Let’s drop the unnecessary columns from our dataset that aren’t required for model building.

Choose Column and choose Delete.
For Source columns, choose the columns school_name, first_name, and last_name.
Choose Apply.

We know from the profiling report that the age value is missing in two records. Let’s fill in the missing values with the median age of the other records.

Choose Missing and choose Fill with numeric aggregate.
For Source column, choose age.
For Numeric aggregate, choose Median.
For Apply transform to, select All rows.
Choose Apply.
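To make the effect of this step concrete, the following plain-Python sketch fills missing values with the median of the values that are present. It illustrates the technique on a few made-up rows and is not DataBrew’s implementation.

```python
from statistics import median


def fill_missing_with_median(rows, column):
    """Replace None in `column` with the median of the non-missing values."""
    present = [row[column] for row in rows if row[column] is not None]
    fill_value = median(present)
    return [
        {**row, column: fill_value if row[column] is None else row[column]}
        for row in rows
    ]


# Made-up sample rows for illustration only
rows = [{"age": 15}, {"age": None}, {"age": 17}, {"age": 16}]
filled = fill_missing_with_median(rows, "age")
print(filled)
```

The median is less sensitive to extreme values than the mean, which makes it a reasonable default when a column might contain outliers.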

We know from the profiling report that the study_time_in_hr column has two outliers, which we can remove.

Choose Outliers and choose Remove outliers.
For Source column, choose study_time_in_hr.
Select Z-score outliers.
For Standard deviation threshold, choose 3.
Select Remove outliers.
Under Remove outliers, select All outliers.
Under Outlier removal options, select Delete outliers.
Choose Apply.
Choose Delete rows and click Apply.
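The Z-score rule used above can be sketched in plain Python as follows: a value is treated as an outlier when it lies more than the chosen number of standard deviations (3 here) from the column mean. The sketch uses the population standard deviation, which is an assumption; DataBrew’s exact calculation may differ, and the sample values are made up.

```python
from statistics import mean, pstdev


def remove_zscore_outliers(rows, column, threshold=3.0):
    """Drop rows whose value in `column` has an absolute Z-score above `threshold`."""
    values = [row[column] for row in rows]
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        return list(rows)  # all values identical, so nothing is an outlier
    return [row for row in rows if abs(row[column] - mu) / sigma <= threshold]


# Made-up data: 19 typical study times and one extreme value
rows = [{"study_time_in_hr": 3} for _ in range(19)] + [{"study_time_in_hr": 40}]
cleaned = remove_zscore_outliers(rows, "study_time_in_hr")
print(len(rows), "->", len(cleaned))
```

Keep in mind that with very small samples no value can reach a Z-score of 3, so this rule only becomes meaningful once a column has a reasonable number of rows.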

The next step is to convert the categorical value to a numerical value for the gender column.

Choose Mapping and choose Categorical mapping.
For Source column, choose gender.
For Mapping options, select Map top 1 values.
For Map values, select Map values to numeric values.
For M, choose 1.
For Others, choose 2.
For Destination column, enter gender_mapped.
For Apply transform to, select All rows.
Choose Apply.
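The mapping configured above assigns 1 to the value M and 2 to every other value, and writes the result to a new column. A minimal plain-Python sketch of that logic, using made-up rows, could look like this:

```python
def map_top_value(rows, column, top_value, destination, top_code=1, other_code=2):
    """Map `top_value` to `top_code` and any other value to `other_code`."""
    return [
        {**row, destination: top_code if row[column] == top_value else other_code}
        for row in rows
    ]


# Made-up sample rows for illustration only
rows = [{"gender": "M"}, {"gender": "F"}, {"gender": "M"}]
mapped = map_top_value(rows, "gender", top_value="M", destination="gender_mapped")
print(mapped)
```

The original gender column is kept alongside gender_mapped, so you can still drop it later if the model doesn’t need it.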

ML algorithms often can’t work on label data directly and require numeric input variables. One-hot encoding is a technique that converts categorical data whose values have no ordinal relationship to one another into numeric data.

To apply one-hot encoding, complete the following steps:

Choose Encode and choose One-hot encode column.
For Source column, choose health.
For Apply transform to, select All rows.
Choose Apply.
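To illustrate what one-hot encoding produces, the following plain-Python sketch adds one 0/1 column per distinct value of the source column. The generated column names (for example, health_good) and the sample values are assumptions for illustration; the names DataBrew generates may differ.

```python
def one_hot_encode(rows, column):
    """Add one 0/1 indicator column per distinct value found in `column`."""
    categories = sorted({row[column] for row in rows})
    return [
        {**row, **{f"{column}_{cat}": int(row[column] == cat) for cat in categories}}
        for row in rows
    ]


# Made-up sample rows for illustration only
rows = [{"health": "good"}, {"health": "poor"}, {"health": "good"}]
encoded = one_hot_encode(rows, "health")
print(encoded)
```

Because each category gets its own column, the model doesn’t infer an ordering between values, which a single numeric code would imply.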

The following screenshot shows the full recipe that we applied to our dataset before using it to build our model in SageMaker.

Run the DataBrew recipe job on the full data

Now that we have built the recipe, we can create and run a DataBrew recipe job.

On the project details page, choose Create job.
For Job name, enter student-performance.

We use CSV as the output format.

For File type, choose CSV.
For Role name, choose an existing role or create a new one.
Choose Create and run job.
Navigate to the Jobs page and wait for the student-performance job to complete.
Choose the Destination link to navigate to Amazon S3 to access the job output.
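The same job can be described as a DataBrew CreateRecipeJob request. The following sketch builds that request for a project-based job with CSV output; the output bucket, key prefix, and IAM role ARN are hypothetical placeholders.

```python
def build_recipe_job_request(name, project_name, role_arn, bucket, key_prefix,
                             output_format="CSV"):
    """Build keyword arguments for the DataBrew create_recipe_job call."""
    return {
        "Name": name,
        "ProjectName": project_name,
        "RoleArn": role_arn,
        "Outputs": [
            {
                "Location": {"Bucket": bucket, "Key": key_prefix},
                "Format": output_format,
            }
        ],
    }


request = build_recipe_job_request(
    name="student-performance",
    project_name="student-proj",
    role_arn="arn:aws:iam::123456789012:role/example-databrew-role",  # hypothetical
    bucket="example-databrew-output-bucket",                          # hypothetical
    key_prefix="student-performance/",                                # hypothetical
)
print(request)
```

With boto3 installed, you could create the job with boto3.client("databrew").create_recipe_job(**request) and then start it with start_job_run(Name="student-performance").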

Clean up

Delete the following resources that might accrue cost over time:

The Amazon Redshift cluster
The recipe job student-performance
The job output stored in your S3 bucket
The IAM roles created as part of projects and jobs
The DataBrew project student-proj and its associated recipe student-proj-recipe
The DataBrew datasets
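Part of this cleanup can be scripted. The following sketch deletes the DataBrew jobs, project, and datasets through a client passed in as an argument (for example, boto3.client("databrew")). The deletion order shown, with jobs first, then the project, then the datasets, is an assumption intended to avoid removing a resource that another one still references. The Amazon Redshift cluster, recipe, S3 objects, and IAM roles are not covered here.

```python
def delete_databrew_resources(client, job_names, project_name, dataset_names):
    """Delete DataBrew jobs, then the project, then the datasets."""
    for job_name in job_names:
        client.delete_job(Name=job_name)
    client.delete_project(Name=project_name)
    for dataset_name in dataset_names:
        client.delete_dataset(Name=dataset_name)


# Example usage (requires boto3 and AWS credentials):
# import boto3
# delete_databrew_resources(
#     boto3.client("databrew"),
#     job_names=["student-performance", "student-profile-job"],
#     project_name="student-proj",
#     dataset_names=["student"],
# )
```

Review the names carefully before running a script like this, because the deletions can’t be undone.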

Conclusion

In this post, we saw how to create a JDBC connection for an Amazon Redshift data warehouse. We learned how to use this connection to create a DataBrew dataset for an Amazon Redshift table. We also saw how easily we can bring data from Amazon Redshift into DataBrew, seamlessly apply transformations and feature engineering techniques, and run recipe jobs that refresh the transformed data for ML model building in SageMaker.

About the Author

Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration, and strategy. He is passionate about technology and enjoys building and experimenting in the analytics and AI/ML space.

AWS Big Data Blog