
Join third-party data in Amazon Redshift with Snowflake using Amazon Athena  

Large enterprises are often spread across different geographies and operate in silos. Decentralized departments make independent decisions about their infrastructure and data requirements, which can lead to disparate data warehouses. Over time, this creates data silos where sharing data across departments becomes difficult. Consolidating data from different teams then becomes a tedious task that may require multiple ETL pipelines, data duplication, and added effort to move data between sources.

One common scenario is subscribing to a third-party dataset through AWS Data Exchange for Amazon Redshift and needing to combine it with data in Snowflake. Traditionally, this means building a pipeline and moving the data into one of the two warehouses so it can be viewed from a single pane of glass. However, an ongoing ETL pipeline adds cost and requires maintenance effort.

Amazon Athena is a serverless analytics service that lets you run ad hoc SQL queries against an Amazon Simple Storage Service (Amazon S3) data lake and 30 different sources. Through connectors, you can query data sources both inside and outside of AWS, including Snowflake, SQL Server, Teradata, Amazon Neptune, and Amazon DocumentDB. You will find all the supported data source connectors in the Athena documentation.
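With a connector in place, Athena addresses a federated source with a three-part name: the data source (catalog) name, the database, and the table. The following is a minimal sketch, where my_connector is a hypothetical data source name you choose when deploying a connector:

-- my_connector is a placeholder for your Athena data source name
select * from "my_connector"."public"."customer" limit 10;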

In this blog post, we demonstrate joining data from Snowflake with data shared from a third party provider via AWS Data Exchange in Amazon Redshift. This solution lets you access and combine data from all these resources without needing to build and maintain complex data pipelines.

Prerequisites

Verify that the following prerequisites are met:

  • An AWS account with access to AWS Data Exchange, Amazon Redshift, Amazon Athena, AWS Lambda, and Amazon S3.
  • An Amazon Redshift cluster with RA3 node types, which are required to consume datashares.
  • A Snowflake account with access to the SNOWFLAKE_SAMPLE_DATA database, which contains the TPC-DS sample data.
  • An Amazon S3 bucket to use as the connector spill bucket, and AWS Secrets Manager secrets that hold your Amazon Redshift and Snowflake credentials.

Solution overview

We start by subscribing to a dataset from AWS Data Exchange. Once the subscription is complete and active, the consumer cluster can access it via the datashare feature. We then create a local copy of the data from the datashare, enabling us to query the data directly from Athena. We also prepare the Snowflake TPC-DS benchmark data and create a subset of the data we need. Finally, we use Athena connectors to connect to and query Amazon Redshift and Snowflake.

Architecture diagram

The following architecture diagram shows the components of the solution:

  1. Third-party data is subscribed to and shared with Amazon Redshift via AWS Data Exchange.
  2. Athena Federated Query queries Amazon Redshift and Snowflake data in place.
  3. Athena uses AWS Lambda based data source connectors to connect each data source to Athena.
  4. In addition to writing query results to the Athena query results location in Amazon S3, data source connectors also write intermediate data to a spill bucket in Amazon S3.

Solution walkthrough

Step 1: Subscribe to AWS Data Exchange TPC-DS data

Step 2: Add data sharing with Amazon Redshift

  • Navigate to the Amazon Redshift datashares console.
  • Choose the Subscriptions tab and select the datashare TPC-DS benchmark data.
  • Choose Connect to database.
  • Select the RA3 node type cluster and then choose Connect.
  • Once the connection has been established, from the top right of the console, choose Create database from datashare.
  • In the Create database from datashare dialog box, specify tpc-ds-benchmark as the database name.
  • Choose Create. You will see a message confirming the successful database creation. You are now ready to run read-only queries on this database.
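If you prefer SQL, you can create the database from the datashare in the Query Editor instead. The following is a minimal sketch; the datashare name, account ID, and namespace GUID are placeholders that you replace with the values shown on the datashare details page in the Amazon Redshift console:

-- placeholders: copy the actual datashare name, account, and namespace
-- from the datashare details page
create database "tpc-ds-benchmark" from datashare tpcds_datashare of account '123456789012' namespace 'producer-namespace-guid';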

Step 3: Connect to the shared database and make a local copy of the data

  • Navigate to the Amazon Redshift Query Editor v2.0.
  • Select the RA3 node type cluster that you connected to in the previous step, and then choose Connect to database. If you have difficulty connecting or this is your first time connecting, refer to Working with Query Editor v2.
  • To access the third-party data through Athena, we create a local table in the Amazon Redshift cluster using a CTAS query of the following form:

create table table_name as select * from database_name.schema_name.table_name;

  • For this example, we use the following query, which is specific to the data in TPC-DS.

create table "dev.public.customer" as select * from "tpc-ds-benchmark"."tpcds1"."customer" ;

Note: Once the table is created, you will find it in the dev database under the public schema of your Amazon Redshift cluster. Because the identifier is quoted, the table is literally named "dev.public.customer" within the public schema.
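To confirm that the local copy was created and is readable, you can run a quick row count from the Query Editor; this is a minimal sketch:

-- the quoted identifier is the literal table name created by the CTAS above
select count(*) from public."dev.public.customer";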

Step 4: Establish connection to Amazon Redshift using Athena

Note: Provide details from your own account, and change the Amazon S3 spill bucket, secret name, security groups, and subnet IDs to match your environment.

Create the Amazon Redshift data source and its connector Lambda function in Athena; the process mirrors the Snowflake connector setup in Step 5. For the JDBC URL in the Lambda function, use the following sample:

redshift://jdbc:redshift://redshift-cluster-1.cxkuxxxxxx.us-east-1.redshift.amazonaws.com:5439/dev?${my-secret-name}

  • Once the Lambda function has been created, navigate to the Athena Query Editor. Under Data source, select the name of the Athena connector that you provided as the data source name in the preceding steps.
  • If the connection is successful, you will see your databases in the Database dropdown.
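As a quick validation, you can query the local copy through the connector from Athena. A minimal sketch, assuming you named the data source redshiftadx (the name used in Step 6):

-- redshiftadx is the data source name assumed throughout this walkthrough
select * from redshiftadx.public."dev.public.customer" limit 10;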

Step 5: Establish connection to Snowflake using Athena

  • We use the TPCDS_SF10TCL dataset in Snowflake for our use case, so verify that you can query it within your Snowflake instance. Then run the following query in Snowflake to create a small subset of the data:

create table date_dim as select * from snowflake_sample_data.TPCDS_SF10TCL.date_dim limit 10;

  • Navigate to Athena data sources.
  • Select Create data source.
  • Choose Snowflake, and then choose Next.
  • Provide a data source name, and select Create Lambda function under Connection details.

Follow the steps for configuring the Lambda function for the Snowflake Athena connector. These are described in Query Snowflake using Athena Federated Query and join with data in your Amazon S3 data lake.

Note: Provide details from your own account and change the Amazon S3 spill bucket, secret name, security groups, and subnet IDs.

  • Once the Lambda function has been created, navigate to the Athena Query Editor. Under Data source, select the name of the Athena connector that you provided as the data source name in the preceding steps.
  • If the connection is successful, you will see your databases in the Database dropdown.
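As with the Amazon Redshift connector, you can validate the Snowflake connection with a simple query from Athena. A minimal sketch, assuming you named the data source snowflakeathenaadx (the name used in Step 6):

-- snowflakeathenaadx is the data source name assumed throughout this walkthrough
select * from snowflakeathenaadx.public."date_dim" limit 10;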

Step 6: Run query in Athena joining data from Amazon Redshift and Snowflake

  • Once we have successfully established connections to the shared data residing in Amazon Redshift and to Snowflake, we can combine data from the two sources.
  • We use the following query to test this out:

select i_brand_id brand_id, i_brand brand from snowflakeathenaadx.public."date_dim", redshiftadx.public."dev.public.item" where d_year=1900 group by i_brand, i_brand_id order by i_brand_id limit 10;
  • You will see successful test results as shown in the following screenshot:

Conclusion

In this blog post, we demonstrated how to subscribe to AWS Data Exchange data for Amazon Redshift and how to combine it with data in your Snowflake data warehouse. The join is carried out using Athena connectors, which let a single query reach both Amazon Redshift and Snowflake. This avoids the additional effort and cost of moving data through ETL pipelines.

Cleanup

To avoid incurring charges, delete the resources you created while following this post and unsubscribe from any paid subscriptions.
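For the warehouse-side objects created in this walkthrough, a minimal cleanup sketch follows; also remember to delete the Athena data sources, their Lambda functions, and the spill bucket contents:

-- Amazon Redshift: drop the local copy and the datashare database
drop table public."dev.public.customer";
drop database "tpc-ds-benchmark";
-- Snowflake: drop the subset table created in Step 5
drop table date_dim;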


About the Authors

Gopal Krishna Bhatia

Gopal is a Senior Data and AI Partner Solutions Architect at Amazon Web Services (AWS). He is a technologist with over 7 years of experience across several domains, including generative AI, data analytics, and databases.

Gitika Vijh

Gitika is an Analytics Partner Solutions Architect at Amazon Web Services (AWS). She works with partners on technical projects, providing architectural guidance and enablement to build their analytics practice.