AWS Big Data Blog

Sharing Amazon Redshift data securely across Amazon Redshift clusters for workload isolation

Amazon Redshift data sharing provides a secure and easy way to share live data for read purposes across Amazon Redshift clusters. Amazon Redshift is a fast, fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing business intelligence (BI) tools. It allows you to run complex analytic queries against terabytes to petabytes of structured data using sophisticated query optimization, columnar storage on high-performance storage, and massively parallel query execution.

In this post, we discuss how to use Amazon Redshift data sharing to achieve workload isolation across diverse analytics use cases and achieve business-critical SLAs. For more information about this new feature, see Announcing Amazon Redshift data sharing (preview).

How to use Amazon Redshift data sharing

Amazon Redshift data sharing allows a producer cluster to share data objects with one or more Amazon Redshift consumer clusters for read purposes without having to copy the data. With this approach, workloads isolated to different clusters can share and collaborate frequently on data to drive innovation and offer value-added analytic services to your internal and external stakeholders. You can share data at many levels, including databases, schemas, tables, views, columns, and user-defined functions, to provide fine-grained access controls that can be tailored for different users and businesses that all need access to Amazon Redshift data.

Data sharing between Amazon Redshift clusters is a two-step process. First, the producer cluster administrator who wants to share data creates an Amazon Redshift data share, a new named object introduced with this release to serve as a unit of sharing. The producer cluster adds the needed database objects such as schemas, tables, and views to the data share and specifies a list of consumer clusters with which to share the data share. Following that, privileged users on consumer clusters create an Amazon Redshift local database reference from the data share made available to them and grant permissions on the database objects to appropriate users and groups. Users and groups can then list the shared objects as part of the standard metadata queries and start querying immediately.
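As a minimal sketch of this two-step flow (the cluster, share, schema, and group names here are illustrative placeholders, not the objects used later in this post), the producer and consumer commands look like the following:

-- On the producer cluster (run by an administrator)
CREATE DATASHARE demo_share;
ALTER DATASHARE demo_share ADD SCHEMA demo_schema;
ALTER DATASHARE demo_share ADD ALL TABLES IN SCHEMA demo_schema;
GRANT USAGE ON DATASHARE demo_share TO NAMESPACE '<consumer-cluster-namespace-guid>';

-- On the consumer cluster (run by a privileged user)
CREATE DATABASE demo_share_db FROM DATASHARE demo_share OF NAMESPACE '<producer-cluster-namespace-guid>';
GRANT USAGE ON DATABASE demo_share_db TO GROUP demo_bi_group;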

Solution overview

For this post, we consider a use case in which the producer cluster is a central ETL cluster hosting enterprise sales data: a 3 TB Cloud DW benchmark dataset based on TPC-DS. This cluster serves multiple BI and data science clusters purpose-built for distinct business groups within the organization. One such group is the sales BI team, who runs BI reports using customer sales data created in the central ETL cluster and joined with the product reviews dataset that they loaded into the BI cluster they manage.

This approach helps the sales BI team separate data lifecycle management of the enterprise sales dataset, owned by the ETL producer, from that of the product reviews data they fully manage in the BI consumer cluster, which simplifies data stewardship. It also adds agility, allows each cluster to be sized independently for workload isolation, and creates a simple cost charge-back model.

As depicted in the following diagram, the central ETL cluster etl_cluster hosts the sales data in a schema named sales and performs transformations to create the semantic layer required for BI reports in a new schema named bi_semantic. We demonstrate how to build the semantic layer later in this post. A superuser in etl_cluster then creates a data share named salesdatashare, adds the bi_semantic schema and all objects in that schema to the data share, and grants usage permissions to the BI consumer cluster named bi_cluster. Keep in mind that a data share is simply a metadata container that represents what data is shared from producer to consumer. No data is actually moved.

The superuser in the BI consumer cluster creates a local database reference named sales_semantic from the data share (step 2 in the preceding diagram). The BI users use the product reviews dataset in the local schema named product_reviews and join with bi_semantic data for reporting purposes (step 3).

You can find the product reviews dataset, along with the script we use in this post to load it into bi_cluster. You can load the Cloud DW benchmark dataset into etl_cluster using this GitHub link. Loading these datasets into the respective Amazon Redshift clusters is outside the scope of this post and is a prerequisite to following the instructions we outline.

The following diagram depicts the cloud DW benchmark data model used.

The following table summarizes the data.

Table Name              Rows
STORE_SALES             8,639,936,081
CUSTOMER_ADDRESS        15,000,000
CUSTOMER                30,000,000
CUSTOMER_DEMOGRAPHICS   1,920,800
ITEM                    360,000
DATE_DIM                73,049

Building a BI semantic layer

A BI semantic layer is a representation of enterprise data in a way that simplifies BI reporting requirements and offers better performance. In our use case, the BI semantic layer transforms sales data to create a customer denormalized dataset and another dataset for all store sales by product in a given year. The following queries are run on the etl_cluster to create the BI semantic layer.

  1. Create a new schema to host BI semantic tables with the following SQL:
    Create schema bi_semantic;
  2. Create a denormalized customer view with select columns required for the sales BI team:
    create view bi_semantic.customer_denorm 
    as
    select
    	c_customer_sk,
    	c_customer_id,
    	c_birth_year,
    	c_birth_country,
    	c_last_review_date_sk,
    	ca_city,
    	ca_state,
    	ca_zip,
    	ca_country,
    	ca_gmt_offset,
    	cd_gender,
    	cd_marital_status,
    	cd_education_status
    from sales.customer c, sales.customer_address ca, sales.customer_demographics cd
    where
    c.c_current_addr_sk=ca.ca_address_sk
    and c.c_current_cdemo_sk=cd.cd_demo_sk;
  3. Create a second view for all product sales with columns required for the BI team:
    create view bi_semantic.product_sales
    as 
    select 
    	i_item_id,
    	i_product_name,
    	i_current_price,
    	i_wholesale_cost,
    	i_brand_id,
    	i_brand,
    	i_category_id,
    	i_category,
    	i_manufact,
    	d_date,
    	d_moy,
    	d_year,
    	d_quarter_name,
    	ss_customer_sk,
    	ss_store_sk,
    	ss_sales_price,
    	ss_list_price,
    	ss_net_profit,
    	ss_quantity,
    	ss_coupon_amt
    from sales.store_sales ss, sales.item i, sales.date_dim d
    where ss.ss_item_sk=i.i_item_sk
    and ss.ss_sold_date_sk=d.d_date_sk;

Sharing data across Amazon Redshift clusters

Now, let’s share the bi_semantic schema in the etl_cluster with the bi_cluster.

  1. Create a data share in the etl_cluster using the following command when connected to the etl_cluster. The producer cluster superuser and database owners can create data share objects. By default, PUBLICACCESSIBLE is false. If the producer cluster is publicly accessible, you can add PUBLICACCESSIBLE = true to the following command:
    CREATE DATASHARE SalesDatashare;
  2. Add the BI semantic views to the data share. To add objects to the data share, add the schema before adding objects. Use ALTER DATASHARE to share the entire schema; to share tables, views, and functions in a given schema; and to share objects from multiple schemas (a sketch of sharing individual objects follows this step):
    ALTER DATASHARE SalesDatashare ADD SCHEMA bi_semantic;
    ALTER DATASHARE SalesDatashare ADD ALL TABLES IN SCHEMA bi_semantic;
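If you don’t want to share the entire schema, you can instead add individual objects to the data share. The following is a minimal sketch that shares only the two views created earlier in this post; in this sketch, we assume views are added with the ADD TABLE clause:
    -- Alternative: share only selected objects from the bi_semantic schema
    ALTER DATASHARE SalesDatashare ADD SCHEMA bi_semantic;
    ALTER DATASHARE SalesDatashare ADD TABLE bi_semantic.customer_denorm;
    ALTER DATASHARE SalesDatashare ADD TABLE bi_semantic.product_sales;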

The next step requires a cluster namespace GUID from the bi_cluster. One way to find the namespace value of a cluster is to run the SQL statement select current_namespace when connected to the bi_cluster. Another way is on the Amazon Redshift console: choose your Amazon Redshift consumer cluster, and find the value under Namespace located in the General information section.
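For example, while connected to bi_cluster:
    -- Returns the namespace GUID of the cluster you're currently connected to
    select current_namespace;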

  3. Add consumers to the data share using the following command:
    GRANT USAGE ON DATASHARE SalesDatashare TO NAMESPACE '1m137c4-1187-4bf3-8ce2-e710b7100eb2';
  4. View the list of objects added to the share using the following command (the share type is outbound on the producer cluster):
    DESC DATASHARE salesdatashare;

The following screenshot shows our list of objects.

Consuming the data share from the consumer BI Amazon Redshift cluster

From the bi_cluster, let’s review, consume, and set permissions on the data share for end-user consumption.

  1. On the consumer BI cluster, view the data shares using the following command as any user:
    SHOW DATASHARES;

The following screenshot shows our results. Consumers should be able to see the objects within the incoming share but not the full list of consumers associated with the share. For more information about querying the metadata of shares, see DESC DATASHARE.
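As a quick sketch, a consumer can also describe an inbound share by qualifying it with the producer's namespace (the GUID below is the example producer namespace used later in this post):
    DESC DATASHARE salesdatashare OF NAMESPACE '45b137c4-1287-4vf3-8cw2-e710b7138nd9';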

  2. Start the consumption by creating a local database from salesdatashare. Cluster users with the permission to do so can create a database from the shares. We use the namespace of the etl_cluster:
    CREATE DATABASE Sales_semantic from DATASHARE SalesDatashare OF NAMESPACE '45b137c4-1287-4vf3-8cw2-e710b7138nd9'; 

Consumers should be able to see databases that they created from the share, along with the databases local to the cluster, at any point by querying SVV_REDSHIFT* tables. Data share objects aren’t available for queries until a local database reference is created using a create database statement.

  3. Run the following command to list the databases in bi_cluster:
    select * from svv_redshift_databases;

The following screenshot shows that both the local and shared databases are listed so that you can explore and navigate metadata for shared datasets.

  4. Grant usage on the database to bi_group, where bi_group is a local Amazon Redshift group with BI users added to that group (a sketch of creating such a group follows this step):
    GRANT USAGE ON DATABASE sales_semantic TO GROUP bi_group;
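If bi_group doesn't exist yet on bi_cluster, a minimal sketch of creating it follows (the user name and password are placeholders):
    -- Create the local BI group and an example user that belongs to it
    CREATE GROUP bi_group;
    CREATE USER bi_user1 PASSWORD 'Placeholder1' IN GROUP bi_group;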

Querying as the BI user

In this section, you connect as a user in the bi_group who has been granted access to the shared data. The user is still connected to the local database on the bi_cluster but can query the shared data via the new cross-database query functionality in Amazon Redshift.

  1. Review the list of objects in the share by running the following SQL:
    SELECT schema_name, table_name, table_type FROM svv_redshift_tables
         where database_name = 'sales_semantic';

The following screenshot shows our results.

  2. Review the list of columns in the customer_denorm view:
    SELECT * FROM  svv_redshift_columns 
       where database_name = 'sales_semantic' and table_name = 'customer_denorm';

The following screenshot shows our results.

  3. Query the shared objects using three-part notation (<database>.<schema>.<view/table>), just like querying any other local database object:
    select count(*) from sales_semantic.bi_semantic.customer_denorm;

The following is your result:

28950139

  4. Analyze the local product reviews data by joining it with the shared customer_denorm data to identify the top ratings by customer state for this BI report:
    SELECT PR.product_category, c.ca_state AS customer_state,
                  count(PR.star_rating) AS cnt
          FROM product_reviews.amazon_reviews PR,               --local data
               sales_semantic.bi_semantic.customer_denorm  C    --shared data
          WHERE  PR.customer_id = C.c_customer_sk
             AND PR.marketplace = 'US'
          GROUP BY 1, 2
          order by cnt desc
          Limit 10;

The following screenshot shows our results.

Adding a data science consumer

Now, let’s assume the company has decided to spin up a data science team to help with new sales strategies, and this team performs analytics on the sales data. The data science team is new and has very different access patterns and SLA requirements compared to the BI team. Thanks to the data sharing feature, onboarding new use cases such as this is easy.

We add a data science consumer cluster named ds_cluster. Because the data science users need access to the data in salesdatashare, the superuser in the etl_cluster can simply grant access to the ds_cluster by adding it as another consumer of the share, without moving any data:

GRANT USAGE ON DATASHARE SalesDatashare TO NAMESPACE '1h137c4-1187-4w53-8de2-e710b7100es2';
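On ds_cluster, a privileged user then consumes the share the same way the BI cluster did. The following is a minimal sketch (the local database and group names are illustrative, and the namespace is the example producer namespace used earlier in this post):

-- Run on ds_cluster: create a local database from the share
CREATE DATABASE sales_ds FROM DATASHARE SalesDatashare OF NAMESPACE '45b137c4-1287-4vf3-8cw2-e710b7138nd9';

-- Grant access to a local data science group
GRANT USAGE ON DATABASE sales_ds TO GROUP ds_group;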

The following diagram shows our updated architecture with the data science consumer (step 4).

This way, multiple clusters of different sizes can access the same dataset and isolate workloads to meet their SLA requirements. Users in these respective clusters are granted access to shared objects to meet their stringent security requirements. The producer keeps control of the data and at any point can remove certain objects from the share or remove access to the share for any of these clusters, and the consumers immediately lose access to the data. Also, as more data is ingested into the producer cluster, the consumer sees transactionally consistent data instantly.
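For example, the producer can remove a single view from the share or revoke a specific consumer cluster's access entirely; the following is a sketch using the names from this post:

-- Remove a single object from the share
ALTER DATASHARE SalesDatashare REMOVE TABLE bi_semantic.product_sales;

-- Or revoke a consumer cluster's access to the entire share
REVOKE USAGE ON DATASHARE SalesDatashare FROM NAMESPACE '1h137c4-1187-4w53-8de2-e710b7100es2';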

Monitoring and security

Amazon Redshift offers comprehensive auditing capabilities using system tables and AWS CloudTrail to allow you to monitor the data sharing permissions and usage across all the consumers and revoke access instantly when necessary. The permissions are granted by the superusers from both the producer and the consumer clusters to define who gets access to what objects, similar to the grant commands used in the earlier scenario. You can use the following commands to audit the usage and activities for the data share.

Track all changes to the data share and the shared database imported from the data share with the following code:

Select username, share_name, recordtime, action, 
         share_object_type, share_object_name 
  from svl_datashare_change_log
   order by recordtime desc;

The following screenshot shows our results.

Track data share access activity (usage), which is relevant only on the producer, with the following code:

Select * from svl_datashare_usage;

The following screenshot shows our results.

Summary

In this post, we showed how Amazon Redshift data sharing provides workload isolation by allowing multiple consumers to use shared live data seamlessly without the need to unload and load data, and we presented a step-by-step guide for securely sharing data from a producer to multiple consumer clusters.


About the Authors

Harsha Tadiparthi is a Specialist Sr. Solutions Architect, AWS Analytics. He enjoys solving complex customer problems in Databases and Analytics and delivering successful outcomes. Outside of work, he loves to spend time with his family, watch movies, and travel whenever possible.

 

 

Harshida Patel is a Specialist Sr. Solutions Architect, Analytics with AWS.