AWS Big Data Blog
Sharing Amazon Redshift data securely across Amazon Redshift clusters for workload isolation
Amazon Redshift data sharing allows for a secure and easy way to share live data for read purposes across Amazon Redshift clusters. Amazon Redshift is a fast, fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing business intelligence (BI) tools. It allows you to run complex analytic queries against terabytes to petabytes of structured data, using sophisticated query optimization, columnar storage on high-performance storage, and massively parallel query runs.
In this post, we discuss how to use Amazon Redshift data sharing to achieve workload isolation across diverse analytics use cases and achieve business-critical SLAs. For more information about this new feature, see Announcing Amazon Redshift data sharing (preview).
How to use Amazon Redshift data sharing
Amazon Redshift data sharing allows a producer cluster to share data objects to one or more Amazon Redshift consumer clusters for read purposes without having to copy the data. With this approach, workloads isolated to different clusters can share and collaborate frequently on data to drive innovation and offer value-added analytic services to your internal and external stakeholders. You can share data at many levels, including databases, schemas, tables, views, columns, and user-defined functions, to provide fine-grained access controls that can be tailored for different users and businesses that all need access to Amazon Redshift data.
Data sharing between Amazon Redshift clusters is a two-step process. First, the producer cluster administrator that wants to share data creates an Amazon Redshift data share, a new named object introduced with this release to serve as a unit of sharing. The producer cluster adds the needed database objects such as schemas, tables, and views to the data share and specifies a list of consumer clusters with which to share the data share. Following that, privileged users on consumer clusters create an Amazon Redshift local database reference from the data share made available to them and grant permissions on the database objects to appropriate users and groups. Users and groups can then list the shared objects as part of the standard metadata queries and start querying immediately.
Solution overview
For this post, we use a use case in which the producer cluster is a central ETL cluster hosting enterprise sales data, a 3 TB Cloud DW benchmark dataset based on the TPC-DS benchmark dataset. This cluster serves multiple BI and data science clusters purpose-built for distinct business groups within the organization. One such group is the sales BI team, who runs BI reports using customer sales data created in the central ETL cluster and joined with the product reviews dataset that they loaded into the BI cluster they manage.
This approach helps the sales BI team isolate data lifecycle management between the enterprise sales dataset in the ETL producer from the product reviews data that they fully manage in the BI consumer cluster to simplify data stewardship. It also allows for agility, allows sizing clusters independently to provide workload isolation, and creates a simple cost charge-back model.
As depicted in the following diagram, the central ETL cluster etl_cluster hosts the sales data in a schema named sales
. We demonstrate how to build the semantic layer later in this post. A superuser in etl_cluster
then creates a data share named salesdatashare
, adds the bi_semantic
schema and all objects in that schema to the data share, and grants usage permissions to the BI consumer cluster named bi_cluster
. Keep in mind that a data share is simply a metadata container and represents what data is shared from producer to consumer. No data is actually moved.
The superuser in the BI consumer cluster creates a local database reference named sales_semantic
from the data share (step 2 in the preceding diagram). The BI users use the product reviews dataset in the local schema named product_reviews
and join with bi_semantic
data for reporting purposes (step 3).
You can find the script in the products review dataset, which we use in this post to load the dataset into bi_cluster. You can load the DW benchmark dataset into etl_cluster
using this github link. Loading these datasets into the respective Amazon Redshift clusters is outside the scope of this post, and is a prerequisite to following the instructions we outline.
The following diagram depicts the cloud DW benchmark data model used.
The following table summarizes the data.
Table Name | Rows |
STORE_SALES |
8,639,936,081 |
CUSTOMER_ADDRESS |
15,000,000 |
CUSTOMER |
30,000,000 |
CUSTOMER_DEMOGRAPHICS |
1,920,800 |
ITEM |
360,000 |
DATE_DIM |
73,049 |
Building a BI semantic layer
A BI semantic layer is a representation of enterprise data in a way that simplifies BI reporting requirements and offers better performance. In our use case, the BI semantic layer transforms sales data to create a customer denormalized dataset and another dataset for all store sales by product in a given year. The following queries are run on the etl_cluster
to create the BI semantic layer.
- Create a new schema to host BI semantic tables with the following SQL:
- Create a denormalized customer view with select columns required for sales BI team:
- Create a second view for all product sales with columns required for BI team:
Sharing data across Amazon Redshift clusters
Now, let’s share the bi_semantic
schema in the etl_cluster
with the bi _cluster
.
- Create a data share in the
etl_cluster
using the following command when connected to theetl_cluster
. The producer cluster superuser and database owners can create data share objects. By default,PUBLICACCESSIBLE
is false. If the producer cluster is publicly accessible, you can addPUBLICACCESSIBLE = true
to the following command:
- Add the BI semantic views to the data share. To add objects to the data share, add the schema before adding objects. Use
ALTER DATASHARE
to share the entire schema; to share tables, views, and functions in a given schema; and to share objects from multiple schemas:
The next step requires a cluster namespace GUID from the bi_cluster
. One way to find the namespace value of a cluster is to run the SQL statement select current_namespace
when connected to the bi_cluster
. Another way is on the Amazon Redshift console: choose your Amazon Redshift consumer cluster, and find the value under Namespace located in the General information section.
- Add consumers to the data share using the following command:
- View the list of the objects added to the share using the following command. The share type is outbound on the producer cluster.
The following screenshot shows our list of objects.
Consuming the data share from the consumer BI Amazon Redshift cluster
From the bi_cluster
, let’s review, consume, and set permissions on the data share for end-user consumption.
- On the consumer BI cluster, view the data shares using the following command as any user:
The following screenshot shows our results. Consumers should be able to see the objects within the incoming share but not the full list of consumers associated with the share. For more information about querying the metadata of shares, see DESC DATASHARE.
- Start the consumption by creating a local database from the
salesdatashare
. Cluster users with the permission to do so can create a database from the shares. We use the namespace from theetl_cluster
.
Consumers should be able to see databases that they created from the share, along with the databases local to the cluster, at any point by querying SVV_REDSHIFT*
tables. Data share objects aren’t available for queries until a local database reference is created using a create database
statement.
- Run the following command to list the databases in
bi_cluster
:
The following screenshot shows that both the local and shared databases are listed so that you can explore and navigate metadata for shared datasets.
- Grant usage on the database to
bi_group
, wherebi_group
is a local Amazon Redshift group with BI users added to that group:
Querying as the BI user
In this section, you connect as a user in the bi_group
who got access to the shared data. The user is still connected to the local database on the bi_cluster
but can query the shared data via the new cross-database query functionality in Amazon Redshift.
- Review the list of objects in the share by running the following SQL:
The following screenshot shows our results.
- Review the list of columns in the
customer_denorm
view::
The following screenshot shows our results.
- Query the shared objects using three-part notation just like querying any other local database object, using a notation
<database>
.<schema>
.<view/table>
:
Following is your result:
28950139
- Analyze the local product reviews data by joining the shared
customer_denorm
data to identify the top ratings by customer states for this BI report:
The following screenshot shows our results.
Adding a data science consumer
Now, let’s assume the company has decided to spin up a data science team to help with new sales strategies, and this team performs analytics on the sales data. The data science team is new and has very different access patterns and SLA requirements compared to the BI team. Thanks to the data sharing feature, onboarding new use cases such as this is easy.
We add a data science consumer cluster named ds_cluster
. Because the data science users need access to data in salesdatashare
, the superuser in the etl_cluster
can simply grant access to the ds_cluster
by adding them as another consumer for the share without moving any data:
The following diagram shows our updated architecture with the data science consumer (step 4).
This way, multiple clusters of different sizes can access the same dataset and isolate workloads to meet their SLA requirements. Users in these respective clusters are granted access to shared objects to meet their stringent security requirements. The producer keeps control of the data and at any point can remove certain objects from the share or remove access to the share for any of these clusters, and the consumers immediately lose access to the data. Also, as more data is ingested into the producer cluster, the consumer sees transactionally consistent data instantly.
Monitoring and security
Amazon Redshift offers comprehensive auditing capabilities using system tables and AWS CloudTrail to allow you to monitor the data sharing permissions and usage across all the consumers and revoke access instantly when necessary. The permissions are granted by the superusers from both the producer and the consumer clusters to define who gets access to what objects, similar to the grant commands used in the earlier scenario. You can use the following commands to audit the usage and activities for the data share.
Track all changes to the data share and the shared database imported from the data share with the following code:
The following screenshot shows our results.
Track data share access activity (usage), which is relevant only on the producer, with the following code:
The following screenshot shows our results.
Summary
Amazon Redshift data sharing provides workload isolation by allowing multiple consumers to share data seamlessly without the need to unload and load data. We also presented a step-by-step guide for securely sharing data from a producer to multiple consumer clusters.
About the Authors
Harsha Tadiparthi is a Specialist Sr. Solutions Architect, AWS Analytics. He enjoys solving complex customer problems in Databases and Analytics and delivering successful outcomes. Outside of work, he loves to spend time with his family, watch movies, and travel whenever possible.
Harshida Patel is a Specialist Sr. Solutions Architect, Analytics with AWS.