
Join third-party data in Amazon Redshift with Snowflake using Amazon Athena  

Large enterprises are often spread across different geographies and operate in silos. Decentralized departments make independent decisions about their infrastructure and data requirements, which can lead to disparate data warehouses. Over time, this creates data silos where sharing data across departments becomes difficult. Consolidating data from different teams then becomes a tedious task that may require multiple ETL pipelines, data duplication, and added effort to move data between sources.

One common scenario is subscribing to a third-party dataset through AWS Data Exchange for Amazon Redshift and needing to combine it with data in Snowflake. Traditionally, this means building a pipeline and moving the data into one of the two warehouses so it can be viewed from a single pane of glass. However, an ongoing ETL pipeline adds cost and requires maintenance effort.

Amazon Athena is a serverless analytics service that lets you run ad hoc SQL queries against an Amazon Simple Storage Service (Amazon S3) data lake and 30 different sources. Through connectors, you can query data sources both inside and outside of AWS, including Snowflake, SQL Server, Teradata, Amazon Neptune, and Amazon DocumentDB. You will find all the supported data source connectors in the Athena documentation.
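With a connector in place, Athena addresses a federated source with a three-part name: the data source (catalog) name, the database, and the table. The following is a minimal sketch, where my_connector is a hypothetical data source name you choose when deploying a connector:

-- my_connector is a placeholder for your Athena data source name
select * from "my_connector"."public"."customer" limit 10;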

In this blog post, we demonstrate joining data from Snowflake with data shared from a third party provider via AWS Data Exchange in Amazon Redshift. This solution lets you access and combine data from all these resources without needing to build and maintain complex data pipelines.

Prerequisites

Verify that the following prerequisites are met:

  • An AWS account with access to AWS Data Exchange, Amazon Redshift, Amazon Athena, AWS Lambda, and Amazon S3.
  • An Amazon Redshift cluster with RA3 node types, which are required to consume datashares.
  • A Snowflake account with access to the SNOWFLAKE_SAMPLE_DATA database, which contains the TPC-DS sample data.
  • An Amazon S3 bucket to use as the connector spill bucket, and AWS Secrets Manager secrets that hold your Amazon Redshift and Snowflake credentials.

Solution overview

We start by subscribing to a dataset from AWS Data Exchange. Once the subscription is complete and active, the consumer cluster can access it via the datashare feature. We then create a local copy of the data from the datashare, enabling us to query the data directly from Athena. We also prepare the Snowflake TPC-DS benchmark data and create a subset of the data we need. Finally, we use Athena connectors to connect to and query Amazon Redshift and Snowflake.

Architecture diagram

The following architecture diagram shows the components of the solution:

  1. Third-party data is subscribed to and shared with Amazon Redshift via AWS Data Exchange.
  2. Athena Federated Query queries Amazon Redshift and Snowflake data in place.
  3. Athena uses AWS Lambda based data source connectors to connect each data source to Athena.
  4. In addition to writing query results to the Athena query results location in Amazon S3, data source connectors also write intermediate data to a spill bucket in Amazon S3.

Solution walkthrough

Step 1: Subscribe to AWS Data Exchange TPC-DS data

Step 2: Add data sharing with Amazon Redshift

  • Navigate to the Amazon Redshift datashares console.
  • Choose the Subscriptions tab and select the datashare TPC-DS benchmark data.
  • Choose Connect to database.
  • Select the RA3 node type cluster and then choose Connect.
  • Once the connection has been established, from the top right of the console, choose Create database from datashare.
  • In the Create database from datashare dialog box, specify tpc-ds-benchmark as the database name.
  • Choose Create. You will see a message confirming the successful database creation. You are now ready to run read-only queries on this database.
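If you prefer SQL, you can create the database from the datashare in the Query Editor instead. The following is a minimal sketch; the datashare name, account ID, and namespace GUID are placeholders that you replace with the values shown on the datashare details page in the Amazon Redshift console:

-- placeholders: copy the actual datashare name, account, and namespace
-- from the datashare details page
create database "tpc-ds-benchmark" from datashare tpcds_datashare of account '123456789012' namespace 'producer-namespace-guid';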

Step 3: Connect to the shared database and make a local copy of the data

  • Navigate to the Amazon Redshift Query Editor v2.0.
  • Select the RA3 node type cluster that you connected to in the previous step, and then choose Connect to database. If you have difficulty connecting or this is your first time connecting, refer to Working with Query Editor v2.
  • To access the third-party data through Athena, we create a local table in the Amazon Redshift cluster using a CTAS query of the following form:

create table table_name as select * from database_name.schema_name.table_name;

  • For this example, we use the following query, which is specific to the data in TPC-DS.

create table "dev.public.customer" as select * from "tpc-ds-benchmark"."tpcds1"."customer" ;

Note: Once the table is created, you will find it in the dev database under the public schema of your Amazon Redshift cluster. Because the identifier is quoted, the table is literally named "dev.public.customer" within the public schema.
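To confirm that the local copy was created and is readable, you can run a quick row count from the Query Editor; this is a minimal sketch:

-- the quoted identifier is the literal table name created by the CTAS above
select count(*) from public."dev.public.customer";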

Step 4: Establish connection to Amazon Redshift using Athena

Note: Provide details from your own account, and change the Amazon S3 spill bucket, secret name, security groups, and subnet IDs to match your environment.

Create the Amazon Redshift data source and its connector Lambda function in Athena; the process mirrors the Snowflake connector setup in Step 5. For the JDBC URL in the Lambda function, use the following sample:

redshift://jdbc:redshift://redshift-cluster-1.cxkuxxxxxx.us-east-1.redshift.amazonaws.com:5439/dev?${my-secret-name}

  • Once the Lambda function has been created, navigate to the Athena Query Editor. Under Data source, select the name of the Athena connector that you provided as the data source name in the preceding steps.
  • If the connection is successful, you will see your databases in the Database dropdown.
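As a quick validation, you can query the local copy through the connector from Athena. A minimal sketch, assuming you named the data source redshiftadx (the name used in Step 6):

-- redshiftadx is the data source name assumed throughout this walkthrough
select * from redshiftadx.public."dev.public.customer" limit 10;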

Step 5: Establish connection to Snowflake using Athena

  • We use the TPCDS_SF10TCL dataset in Snowflake for our use case, so verify that you can query it within your Snowflake instance. Then run the following query in Snowflake to create a small subset of the data:

create table date_dim as select * from snowflake_sample_data.TPCDS_SF10TCL.date_dim limit 10;

  • Navigate to Athena data sources.
  • Select Create data source.
  • Choose Snowflake, and then choose Next.
  • Provide a data source name, and select Create Lambda function under Connection details.

Follow the steps for configuring the Lambda function for the Snowflake Athena connector. These are described in Query Snowflake using Athena Federated Query and join with data in your Amazon S3 data lake.

Note: Provide details from your own account and change the Amazon S3 spill bucket, secret name, security groups, and subnet IDs.

  • Once the Lambda function has been created, navigate to the Athena Query Editor. Under Data source, select the name of the Athena connector that you provided as the data source name in the preceding steps.
  • If the connection is successful, you will see your databases in the Database dropdown.
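As with the Amazon Redshift connector, you can validate the Snowflake connection with a simple query from Athena. A minimal sketch, assuming you named the data source snowflakeathenaadx (the name used in Step 6):

-- snowflakeathenaadx is the data source name assumed throughout this walkthrough
select * from snowflakeathenaadx.public."date_dim" limit 10;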

Step 6: Run query in Athena joining data from Amazon Redshift and Snowflake

  • Once we have successfully established connections to the shared data residing in Amazon Redshift and to Snowflake, we can combine data from the two sources.
  • We use the following query to test this out:

select i_brand_id brand_id, i_brand brand from snowflakeathenaadx.public."date_dim", redshiftadx.public."dev.public.item" where d_year=1900 group by i_brand, i_brand_id order by i_brand_id limit 10;
  • You will see successful test results as shown in the following screenshot:

Conclusion

In this blog post, we demonstrated how to subscribe to AWS Data Exchange data for Amazon Redshift and how to combine it with data in your Snowflake data warehouse. The join is carried out using Athena connectors, which let a single query reach both Amazon Redshift and Snowflake. This avoids the additional effort and cost of moving data through ETL pipelines.

Cleanup

To avoid incurring charges, delete the resources you created while following this post and unsubscribe from any paid subscriptions.
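For the warehouse-side objects created in this walkthrough, a minimal cleanup sketch follows; also remember to delete the Athena data sources, their Lambda functions, and the spill bucket contents:

-- Amazon Redshift: drop the local copy and the datashare database
drop table public."dev.public.customer";
drop database "tpc-ds-benchmark";
-- Snowflake: drop the subset table created in Step 5
drop table date_dim;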


About the Authors

Gopal Krishna Bhatia

Gopal is a Senior Data and AI Partner Solutions Architect at Amazon Web Services (AWS). He is a technologist with over 7 years of experience across several domains, including generative AI, data analytics, and databases.

Gitika Vijh

Gitika is an Analytics Partner Solutions Architect at Amazon Web Services (AWS). She works with partners on technical projects, providing architectural guidance and enablement to build their analytics practice.