AWS Big Data Blog

Simplify Snowflake data loading and processing with AWS Glue DataBrew

July 2023: This post was reviewed for accuracy.

Loading and retrieving data within a single database platform has historically been simpler than performing the same operations across a multi-platform architecture. To simplify bringing in data from multiple database platforms, AWS Glue DataBrew supports connecting to multiple data sources via the AWS Glue Data Catalog. However, this approach requires a Data Catalog database and table, which you may have to create by crawling your data sources.

With the launch of DataBrew database support, you can securely connect to and bring in your data natively from Amazon JDBC-compatible databases and data warehouses (without requiring a Data Catalog or AWS Glue crawler). This feature also lets you connect directly to the popular data warehouse provider Snowflake with a highly simplified user experience. DataBrew is a visual data preparation service that makes it easy to clean and normalize data by providing over 250 pre-built transforms to automate data preparation tasks, all without writing any code.

In this post, we walk you through how to securely connect to your Snowflake database on the DataBrew console.

Solution overview

The following diagram illustrates our solution architecture.

The architecture includes the following components:

  1. We use an AWS Glue Data Catalog connection of CUSTOM type in DataBrew to store the connection information:
    1. JDBC connection URL including parameters db, warehouse, role, and schema.
    2. An AWS Secrets Manager secret ID containing your Snowflake credentials, which must include user and password keys.
    3. Optionally, Amazon Virtual Private Cloud (Amazon VPC) information in case your Snowflake account is AWS PrivateLink enabled.
    4. The Amazon Simple Storage Service (Amazon S3) URI location to the Snowflake JDBC driver and Spark connector JAR files.
    5. The AWS Glue connection name and the Snowflake table name, which are stored as part of the DataBrew dataset.
  2. We use the DataBrew dataset to create a DataBrew project in which DataBrew gets the secret value if the secret ID was provided when creating the connection.
  3. DataBrew queries sample data from Snowflake using the connection and credentials information including the table name. You can apply any of the over 250 built-in transforms on the sample data and build a recipe. You use the recipe to run a data transformation job on the full data.
  4. The output of the recipe job is written to your S3 bucket.

Let’s dive into how to securely bring data from Snowflake into DataBrew. Our solution contains the following steps:

  1. Create a secret (optional).
  2. Download Snowflake Spark and JDBC drivers.
  3. Create a Snowflake connection.
  4. Create your datasets.
  5. Create a DataBrew project using the datasets.
  6. Open a Snowflake project and build a transformation recipe.
  7. Run the DataBrew recipe job on the full data.

We also briefly discuss how to bring data from Amazon S3 back into Snowflake.

Prerequisites

To complete this solution, you should have the following prerequisites:

  • An AWS account with permissions to use DataBrew, AWS Glue, Secrets Manager, and Amazon S3
  • A Snowflake account and a user with access to the tables you want to prepare
  • An S3 bucket to store the driver JAR files and the DataBrew job output

Create a secret

To store the database access credentials securely, DataBrew supports using AWS Secrets Manager with Snowflake connections. In this step, we store the Snowflake credentials as a Secrets Manager secret. If you already have a secret, or would prefer to store the credentials as part of the AWS Glue Data Catalog connection, you can skip this step.

  1. On the Secrets Manager console, choose Secrets.
  2. Choose Store a new secret.
  3. Select Other types of secrets.
  4. Select Secret key/value.
  5. Enter the key as user and enter the Snowflake user name.
  6. Choose Add row.
  7. Enter the key as password and enter the password for the Snowflake user.
  8. Select the encryption key and choose Next.
  9. Enter a name for your secret and a description.
  10. Choose Next.
  11. Configure automatic rotation according to your requirements and choose Next.
  12. Choose Store.

You can now view your secret on the Secrets page.
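
If you prefer to create the secret programmatically, the following AWS SDK for Python (Boto3) sketch stores the same key/value pair. The secret name is an assumption for this post; replace it and the placeholder credentials with your own values.

    import json
    import boto3

    secretsmanager = boto3.client("secretsmanager")

    # Assumed secret name for this post; the user and password values are placeholders.
    response = secretsmanager.create_secret(
        Name="snowflake-databrew-credentials",
        Description="Snowflake credentials for AWS Glue DataBrew",
        SecretString=json.dumps({
            "user": "<snowflake_user_name>",
            "password": "<snowflake_password>",
        }),
    )
    print(response["ARN"])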

Download Snowflake Spark and JDBC drivers

DataBrew supports connecting to Snowflake via the Spark-Snowflake connector in combination with the Snowflake JDBC driver. Together, the Spark and JDBC drivers offer better performance for large jobs. To understand how the Snowflake Spark connector and JDBC driver work, see Overview of the Spark Connector.

DataBrew requires Spark v2.4.3 and Scala v2.11. When searching for the Snowflake Spark and JDBC drivers, make sure both are compatible with these DataBrew Spark and Scala versions. For the latest information on supported versions of the Snowflake Spark connector, see Supported Version – Snowflake Spark Connector.

To find the latest Snowflake drivers that are compatible with the DataBrew environment, complete the following steps:

  1. Go to the Central Repository and download the JAR for the latest version of the Spark connector whose version is suffixed with _2.4. The _2.4 suffix indicates the Spark version that is compatible with the given Snowflake Spark connector.
  2. Check the Spark connector GitHub release page to find the JDBC driver version that is compatible with the Spark connector you downloaded in step 1, then go to the Central Repository and download the JAR for the Snowflake JDBC driver.

For instructions from the Snowflake user guide, see Download the Latest Version of the Snowflake Connector for Spark.
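
Because the connection (created in the next section) can reference the JAR files by their Amazon S3 location, you can stage the downloaded drivers in S3 ahead of time. The following Boto3 sketch assumes hypothetical local file names, bucket, and key prefix; replace them with your own.

    import boto3

    s3 = boto3.client("s3")

    # Hypothetical bucket, prefix, and file names; replace with your own values.
    bucket = "my-databrew-drivers-bucket"
    jars = {
        "spark-snowflake_2.11-<version>-spark_2.4.jar": "drivers/spark-snowflake_2.11-<version>-spark_2.4.jar",
        "snowflake-jdbc-<version>.jar": "drivers/snowflake-jdbc-<version>.jar",
    }
    for local_file, key in jars.items():
        s3.upload_file(local_file, bucket, key)
        print(f"s3://{bucket}/{key}")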

Create a Snowflake connection

To create your Snowflake connection, complete the following steps:

  1. On the DataBrew console, choose Datasets.
  2. On the Connections tab, choose Create connection.
  3. For Connection name, enter a name (for example, my-new-snowflake-connection).
  4. Select External Custom connectors.
  5. For JDBC URL, enter the JDBC URL for your database. For example:
    jdbc:snowflake://<account_name>.snowflakecomputing.com/?db=<database_name>&warehouse=<warehouse_name>&role=<role_name>

    Only the db parameter is required; the schema, warehouse, and role parameters are optional. Parameters must be separated by &.

  6. For Database access credentials, select Connect with Secrets Manager.
  7. For Secrets, enter the secret you created earlier.
    You can also enter your username and password directly, but this is a less secure option because it stores the information in plaintext.

Next, you set up the required connectors and drivers.

  8. For Snowflake Spark connector, select Upload JAR file.
  9. Choose the Spark connector JAR file from your local hard drive.

Alternatively, you can enter the S3 location of the JAR file.

  10. For Custom JDBC driver, select Upload JAR file.
  11. Upload the JDBC JAR file from your local hard drive.

Alternatively, you can enter the S3 location of the JAR file.

  12. For Class name, enter the class name of the supported connector version (for this post, net.snowflake.spark.snowflake).
  13. Under Network options, if your Snowflake account is AWS PrivateLink enabled, you can choose the Amazon VPC with which your Snowflake instance is associated.
  14. Choose Create connection.

You can also use your existing AWS Glue connections to Snowflake. For information about verifying that your connection contains all the required information, see the Database dataset developer guide.
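
To check programmatically that an existing connection has the information DataBrew needs (the JDBC URL, the secret ID or credentials, the connector class name, and the JAR locations), you can read it back with Boto3. This sketch only inspects the connection; the connection name is the one DataBrew created for this post.

    import boto3

    glue = boto3.client("glue")

    # Name of the connection created on the DataBrew console earlier in this post.
    connection = glue.get_connection(
        Name="AwsGlueDatabrew-my-new-snowflake-connection"
    )["Connection"]

    print(connection["ConnectionType"])  # Expected: CUSTOM
    for key, value in connection["ConnectionProperties"].items():
        # Shows the stored JDBC URL, secret ID, connector class name, and JAR locations.
        print(f"{key}: {value}")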

Create datasets

We have two tables in Snowflake: ORDER_DATA and CUSTOMER_PRIMARY. To use these tables, we first need to create a dataset for each table.

The ORDER_DATA table has attributes like order_id, item_id, customer_id, product_id, amount, currency, and time_stamp (the transaction date and time).

The CUSTOMER_PRIMARY table has customer_id, first_name, and last_name attributes.

To create the datasets, complete the following steps:

  1. On the Datasets page of the DataBrew console, choose Connect new dataset.
  2. For Dataset name, enter a name (for example, order-dataset).
  3. Choose Snowflake as the database connection.
  4. Enter the connection you created (AwsGlueDatabrew-my-new-snowflake-connection).
  5. Verify the JDBC URL shown on the console.
  6. For Table name, enter the name in schema.table format (enter only table if you have already added &schema= in the JDBC URL parameters).
  7. Choose Create dataset.
  8. Repeat these steps for the CUSTOMER_PRIMARY table and name the new dataset customer-primary-data.

You can see both datasets are available to use on the Datasets page.
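
You can also create the same datasets with the DataBrew API. The following Boto3 sketch uses the connection and table names from this post; the schema placeholder is an assumption to adjust for your environment.

    import boto3

    databrew = boto3.client("databrew")

    # Connection created earlier; table names follow the schema.table convention described above.
    connection_name = "AwsGlueDatabrew-my-new-snowflake-connection"
    datasets = {
        "order-dataset": "<schema>.ORDER_DATA",
        "customer-primary-data": "<schema>.CUSTOMER_PRIMARY",
    }
    for dataset_name, table_name in datasets.items():
        databrew.create_dataset(
            Name=dataset_name,
            Input={
                "DatabaseInputDefinition": {
                    "GlueConnectionName": connection_name,
                    "DatabaseTableName": table_name,
                }
            },
        )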

Create a project using the datasets

To create your DataBrew project, complete the following steps:

  1. On the DataBrew console, choose Create project.
  2. For Project Name, enter my-snowflake-proj.
  3. For Attached recipe, choose Create new recipe. The recipe name is populated automatically.
  4. For Select a dataset, select My datasets.
  5. For Dataset name, select order-dataset.
  6. For Role name, choose your role to be used with DataBrew.
  7. Choose Create project.

You can see a success message along with a sample of our Snowflake ORDER_DATA with 500 rows.

After the project is opened, a DataBrew interactive session is created. DataBrew retrieves sample data based on your sampling configuration selection.
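
The equivalent API call looks like the following sketch. The role ARN is a placeholder, and unlike the console flow, the API expects the attached recipe to already exist (for example, created beforehand with create_recipe), so treat the recipe name here as an assumption.

    import boto3

    databrew = boto3.client("databrew")

    databrew.create_project(
        Name="my-snowflake-proj",
        DatasetName="order-dataset",
        # The console creates this recipe for you; with the API it must already exist.
        RecipeName="my-snowflake-proj-recipe",
        RoleArn="arn:aws:iam::<account_id>:role/<databrew-role>",  # placeholder
        Sample={"Type": "FIRST_N", "Size": 500},
    )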

Open a Snowflake project and build a transformation recipe

In a DataBrew interactive session, you can cleanse and normalize your data using over 250 built-in transforms. In this post, we use DataBrew to identify premium customers by performing a few transforms and finding customers with order amounts greater than $3,000.

To do this, we perform the following steps:

    1. Batch delete the unneeded columns: ORDER_ID, ITEM_ID, TIME_STAMP, and TRANSACTION_DATE.
    2. Because the AMOUNT column is of the string data type, we convert it to float.
    3. We filter the rows based on an AMOUNT value greater than or equal to $3,000 and add the condition as a recipe step.

Now we have all customer orders with amounts greater than $3,000, which are our premium customers. But we don’t know their names. We join the customer-primary-data dataset to find out the names based on the common column CUSTOMER_ID.

  1. Choose Join.
  2. For Select dataset, choose customer-primary-data.
  3. Choose Next.
  4. For Select join type, select Left join.
  5. For Join keys, choose CUSTOMER_ID for Table A and deselect CUSTOMER_ID for Table B.
  6. Choose Finish.
  7. Choose Schema and move the customer name columns FIRST_NAME and LAST_NAME to the top columns.

The final data shows customer names with orders greater than $3,000.
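
The recipe you build interactively is stored as an ordered list of steps that you can retrieve and publish through the API. This sketch assumes the recipe name that the console generated for the project in this post.

    import boto3

    databrew = boto3.client("databrew")

    # Assumed auto-generated recipe name based on this post's project name.
    recipe_name = "my-snowflake-proj-recipe"

    # List the steps (column deletes, type change, filter, and join) built in the session.
    for step in databrew.describe_recipe(Name=recipe_name)["Steps"]:
        print(step["Action"]["Operation"], step["Action"].get("Parameters", {}))

    # Publish a versioned copy of the recipe so that jobs can reference a fixed version.
    databrew.publish_recipe(Name=recipe_name, Description="Find premium customer orders")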

Run the DataBrew recipe job on the full data

Now that we have built the recipe, we can create and run a DataBrew recipe job.

  1. On the project details page, choose Create job.
  2. For Job name, enter premium-customer-orders.

    DataBrew supports writing job outputs into all file formats supported by Snowflake. For this post, we use Parquet as the output format.
  3. For File type, choose PARQUET.
  4. For Role name, choose an existing role or create a new one.
  5. Choose Create and run job.
  6. Navigate to the Jobs page and wait for the premium-customer-orders job to complete.
  7. Choose the Destination link to navigate to Amazon S3 to access the job output.
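
You can also create and start the same job with the DataBrew API. In the following sketch, the output bucket, key prefix, and role ARN are placeholders; the job reuses the dataset and recipe from the project created earlier and writes Parquet output to Amazon S3.

    import boto3

    databrew = boto3.client("databrew")

    databrew.create_recipe_job(
        Name="premium-customer-orders",
        ProjectName="my-snowflake-proj",  # reuses the project's dataset and recipe
        RoleArn="arn:aws:iam::<account_id>:role/<databrew-role>",  # placeholder
        Outputs=[
            {
                "Location": {"Bucket": "<output-bucket>", "Key": "premium-customer-orders/"},
                "Format": "PARQUET",
            }
        ],
    )

    run = databrew.start_job_run(Name="premium-customer-orders")
    print(run["RunId"])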

Bring data from Amazon S3 back into Snowflake

Snowflake offers several methods to bring data from an S3 data lake back into Snowflake, including ways to automate the process or incorporate it into your existing data pipelines. For more information, see Bulk Loading from Amazon S3 and Automating Snowpipe from Amazon S3.
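
As a minimal illustration of the bulk-loading approach, the following sketch uses the Snowflake Connector for Python to run a COPY INTO statement from an external stage. The connection parameters, stage, and table names are assumptions for this post; see the Snowflake documentation linked above for the full set of options, including automating the load with Snowpipe.

    import snowflake.connector

    # Connection parameters and object names are placeholders for illustration.
    conn = snowflake.connector.connect(
        account="<account_name>",
        user="<snowflake_user_name>",
        password="<snowflake_password>",
        warehouse="<warehouse_name>",
        database="<database_name>",
        schema="<schema_name>",
    )

    # Assumes an external stage (for example, MY_S3_STAGE) already points at the S3 job output.
    conn.cursor().execute(
        """
        COPY INTO PREMIUM_CUSTOMER_ORDERS
        FROM @MY_S3_STAGE/premium-customer-orders/
        FILE_FORMAT = (TYPE = PARQUET)
        MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
        """
    )
    conn.close()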

Clean up

Delete the following resources that might accrue cost over time:

  • The recipe job premium-customer-orders
  • The job output stored in your S3 bucket
  • The IAM roles created as part of projects and jobs
  • The DataBrew project my-snowflake-proj and its associated recipe my-snowflake-proj-recipe
  • The DataBrew datasets and associated AWS Glue connections
  • The uploaded JAR files from your S3 bucket

Conclusion

In this post, we saw how to store credentials securely and create an AWS Glue connection for a Snowflake database. We learned how to use this connection to create a DataBrew dataset for each table, and how to reuse this connection multiple times. We also saw how easily we can bring data from Snowflake into DataBrew and seamlessly apply transformations and run recipe jobs.

DataBrew database support is generally available in the following Regions:

  • US East (Ohio)
  • US East (N. Virginia)
  • US West (N. California)
  • US West (Oregon)
  • Asia Pacific (Mumbai)
  • Asia Pacific (Seoul)
  • Asia Pacific (Singapore)
  • Asia Pacific (Sydney)
  • Asia Pacific (Tokyo)
  • Canada (Central)
  • Europe (Frankfurt)
  • Europe (Ireland)
  • Europe (London)
  • Europe (Paris)
  • Europe (Stockholm)
  • South America (São Paulo)

For more information about JDBC database support in DataBrew, see Database dataset developer guide. For more information about running and automating DataBrew jobs, see Creating, running, and scheduling AWS Glue DataBrew jobs.


About the Authors

Nitya Sheth is a Software Engineer working on AWS Glue DataBrew. He has also worked on AWS Synthetics as well as on user experience implementations for Database, Analytics, and AI AWS consoles. He divides his free time between exploring new hiking places and reading new books.

Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration, and strategy. He is passionate about technology and enjoys building and experimenting in the analytics and AI/ML space.