AWS Big Data Blog
Improve your ETL performance using multiple Redshift warehouses to write to your data sets
Amazon Redshift is a fast, petabyte-scale, cloud data warehouse that tens of thousands of customers rely on to power their analytics workloads. Thousands of customers use Amazon Redshift read data sharing to enable instant, granular, and fast data access across Redshift provisioned clusters and serverless workgroups. This allows you to scale your read workloads to thousands of concurrent users without having to move or copy the data.
Now, at Amazon Redshift, we are announcing the general availability of multi-data warehouse writes through data sharing. This new capability allows you to achieve better performance for extract, transform, and load (ETL) workloads by using different warehouses of different types and sizes based on your workload needs. Additionally, it helps you keep your ETL jobs running more predictably, because you can split them between warehouses in a few clicks; monitor and control costs, because each warehouse has its own monitoring and cost controls; and foster collaboration, because you can enable different teams to write to another team's databases in just a few clicks.
The data is live and available across all warehouses as soon as it is committed, even when it's written across accounts or Regions. Warehouses can be any combination of ra3 clusters and serverless workgroups.
In this post, we discuss when you should consider using multiple warehouses to write to the same databases, explain how multi-warehouse writes through data sharing works, and walk you through an example on how to use multiple warehouses to write to the same database.
Reasons for using multiple warehouses to write to the same databases
In this section, we discuss some of the reasons why you should consider using multiple warehouses to write to the same database.
Better performance and predictability for mixed workloads
Customers often start with a warehouse sized to fit their initial workload needs. For example, if you need to support occasional user queries and nightly ingestion of 10 million rows of purchase data, a 32 RPU workgroup may be perfectly suited for your needs. However, adding a new hourly ingestion of 400 million rows of user website and app interactions could slow existing users’ response times as the new workload consumes significant resources. You could resize to a larger workgroup so read and write workloads complete quickly without fighting over resources. However, this may provide unneeded power and cost for existing workloads. Also, because workloads share compute, a spike in one workload can affect the ability of other workloads to meet their SLAs.
The following diagram illustrates a single-warehouse architecture.
With the ability to write through datashares, you can now separate the new user website and app interactions ETL into a separate, larger workgroup so that it completes quickly with the performance you need without impacting the cost or completion time of your existing workloads. The following diagram illustrates this multi-warehouse architecture.
The multi-warehouse architecture enables you to have all write workloads complete on time with less combined compute, and subsequently lower cost, than a single warehouse supporting all workloads.
Control and monitor costs
When you use a single warehouse for all your ETL jobs, it can be difficult to understand which workloads are contributing to your costs. For instance, you may have one team running an ETL workload ingesting data from a CRM system while another team is ingesting data from internal operational systems. It’s hard for you to monitor and control the costs for the workloads because queries are running together using the same compute in the warehouse. By splitting the write workloads into separate warehouses, you can separately monitor and control costs while ensuring the workloads can progress independently without resource conflict.
Collaborate on live data with ease
There are times when two teams use different warehouses for data governance, compute performance, or cost reasons, but also at times need to write to the same shared data. For instance, you may have a set of customer 360 tables that need to be updated live as customers interact with your marketing, sales, and customer service teams. When these teams use different warehouses, keeping this data live can be difficult because you may have to build a multi-service ETL pipeline using tools like Amazon Simple Storage Service (Amazon S3), Amazon Simple Notification Service (Amazon SNS), Amazon Simple Queue Service (Amazon SQS), and AWS Lambda to track live changes in each team’s data and ingest it into a single source.
With the ability to write through datashares, you can grant granular permissions on your database objects (for example, SELECT on one table, and SELECT, INSERT, and TRUNCATE on another) to different teams using different warehouses in a few clicks. This enables teams to start writing to the shared objects using their own warehouses. The data is live and available to all warehouses as soon as it is committed, and this even works if the warehouses are using different accounts and regions.
In the following sections, we walk you through how to use multiple warehouses to write to the same databases via data sharing.
Solution overview
We use the following terminology in this solution:
- Namespace – A logical container for database objects, users and roles, their permissions on database objects, and compute (serverless workgroups and provisioned clusters).
- Datashare – The unit of sharing for data sharing. You grant permissions on objects to datashares.
- Producer – The warehouse that creates the datashare, grants permissions on objects to datashares, and grants other warehouses and accounts access to the datashare.
- Consumer – The warehouse that is granted access to the datashare. You can think of consumers as datashare tenants.
This use case involves a customer with two warehouses: a primary warehouse attached to the primary namespace, used for most read and write queries, and a secondary warehouse attached to a secondary namespace that is primarily used to write to the primary namespace. We use the publicly available 10 GB TPCH dataset from AWS Labs, hosted in an S3 bucket. You can copy and paste many of the commands to follow along. Although it's small for a data warehouse, this dataset allows easy functional testing of this feature.
The following diagram illustrates our solution architecture.
We set up the primary namespace by connecting to it via its warehouse, creating a marketing database in it with prod and staging schemas, and creating three tables in the prod schema called region, nation, and af_customer. We then load data into the region and nation tables using the warehouse. We do not ingest data into the af_customer table.
We then create a datashare in the primary namespace. We grant the datashare the ability to create objects in the staging schema and the ability to select, insert, update, and delete from objects in the prod schema. We then grant usage on the datashare to another namespace in the account.
At that point, we connect to the secondary warehouse. In that warehouse, we create a database from the datashare as well as a new user, and we grant permissions on the datashare database to the new user. Then we reconnect to the secondary warehouse as the new user.
We then create a customer table in the datashare's staging schema and copy data from the TPCH 10 customer dataset into the staging table. We insert the staging customer table data into the shared af_customer production table, and then truncate the staging table.
At this point, the ETL is complete and you are able to read the data in the primary namespace, inserted by the secondary ETL warehouse, from both the primary warehouse and the secondary ETL warehouse.
Prerequisites
To follow along with this post, you should have the following prerequisites:
- Two warehouses created on any maintenance track. The warehouses can be a mix of serverless workgroups or ra3 clusters.
- Access to a superuser in both warehouses.
- An AWS Identity and Access Management (IAM) role that is able to ingest data from Amazon Redshift to Amazon S3 (Amazon Redshift creates one by default when you create a cluster or serverless workgroup).
- For cross-account only, you need access to an IAM user or role that is allowed to authorize datashares. For the IAM policy, refer to Sharing datashares.
See our documentation for the most up-to-date information.
Set up the primary namespace (producer)
In this section, we show how to set up the primary (producer) namespace we will use to store our data.
Connect to producer
Complete the following steps to connect to the producer:
- On the Amazon Redshift console, choose Query editor v2 in the navigation pane.
In the query editor v2, you can see all the warehouses you have access to in the left pane. You can expand them to see their databases.
- Connect to your primary warehouse using a superuser.
- Run the following command to create the marketing database:
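A minimal sketch of that command (the database name marketing comes from the walkthrough):

```sql
CREATE DATABASE marketing;
```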
Create the database objects to share
Complete the following steps to create your database objects to share:
- After you create the marketing database, switch your database connection to the marketing database.
You may need to refresh your page to be able to see it.
- Run the commands in the sketch after this list to create the two schemas you intend to share.
- Create the tables to share with the following code. These are standard DDL statements coming from the AWS Labs DDL file with modified table names.
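A sketch covering both of the preceding steps. The schema names come from the walkthrough; the table definitions follow the standard TPC-H layout, so the exact data types are an approximation:

```sql
CREATE SCHEMA prod;
CREATE SCHEMA staging;

-- Standard TPC-H region and nation tables, created in the prod schema
CREATE TABLE prod.region (
    r_regionkey INT NOT NULL,
    r_name      CHAR(25),
    r_comment   VARCHAR(152)
);

CREATE TABLE prod.nation (
    n_nationkey INT NOT NULL,
    n_name      CHAR(25),
    n_regionkey INT,
    n_comment   VARCHAR(152)
);

-- TPC-H customer table, renamed af_customer for this walkthrough
CREATE TABLE prod.af_customer (
    c_custkey    BIGINT NOT NULL,
    c_name       VARCHAR(25),
    c_address    VARCHAR(40),
    c_nationkey  INT,
    c_phone      CHAR(15),
    c_acctbal    DECIMAL(18,4),
    c_mktsegment CHAR(10),
    c_comment    VARCHAR(117)
);
```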
Copy data into the region and nation tables
Run the following commands to copy data from the AWS Labs S3 bucket into the region and nation tables. If you created a cluster while keeping the default created IAM role, you can copy and paste the following commands to load data into your tables:
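A sketch of the COPY commands, assuming the default IAM role and the public Redshift sample-data location for the 10 GB TPC-H files (adjust the S3 path and Region if yours differ):

```sql
COPY prod.region
FROM 's3://redshift-downloads/TPC-H/2.18/10GB/region.tbl'
IAM_ROLE default
DELIMITER '|'
REGION 'us-east-1';

COPY prod.nation
FROM 's3://redshift-downloads/TPC-H/2.18/10GB/nation.tbl'
IAM_ROLE default
DELIMITER '|'
REGION 'us-east-1';
```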
Create the datashare
Create the datashare using the following command:
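A sketch of the command, using marketing_datashare as an illustrative datashare name:

```sql
CREATE DATASHARE marketing_datashare
SET PUBLICACCESSIBLE = TRUE;
```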
The publicaccessible setting specifies whether or not a datashare can be used by consumers with publicly accessible provisioned clusters and serverless workgroups. If your warehouses are not publicly accessible, you can ignore that field.
Grant permissions on schemas to the datashare
To add objects with permissions to the datashare, use the grant syntax, specifying the datashare you’d like to grant the permissions to:
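A sketch of the schema grants described below, again using the illustrative datashare name:

```sql
-- Consumers can use objects in the prod schema
GRANT USAGE ON SCHEMA prod TO DATASHARE marketing_datashare;

-- Consumers can use and create objects in the staging schema
GRANT USAGE, CREATE ON SCHEMA staging TO DATASHARE marketing_datashare;
```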
This allows the datashare consumers to use objects added to the prod schema and to use and create objects added to the staging schema. To maintain backward compatibility, if you use the alter datashare command to add a schema, it will be the equivalent of granting usage on the schema.
Grant permissions on tables to the datashare
Now you can grant access to tables to the datashare using the grant syntax, specifying the permissions and the datashare. The following code grants all privileges on the af_customer table to the datashare:
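A sketch of the table grant (the datashare name is illustrative):

```sql
GRANT ALL ON TABLE prod.af_customer TO DATASHARE marketing_datashare;
```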
To maintain backward compatibility, if you use the alter datashare command to add a table, it will be the equivalent of granting select on the table.
Additionally, we’ve added scoped permissions that allow you to grant the same permission to all current and future objects within the datashare. We add the scoped select permission on the prod schema tables to the datashare:
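A sketch of the scoped grant; verify the exact scoped-permission syntax against the GRANT documentation:

```sql
-- Applies to all current and future tables in the prod schema
GRANT SELECT FOR TABLES IN SCHEMA prod TO DATASHARE marketing_datashare;
```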
After this grant, the customer will have select permissions on all current and future tables in the prod schema. This gives them select access on the region and nation tables.
Grant permissions to the secondary ETL namespace
You can grant permissions to the secondary ETL namespace using the existing syntax. You do this by specifying the namespace ID. You can find the namespace ID on the namespace details page if your secondary ETL namespace is serverless, as part of the namespace ID on the cluster details page if your secondary ETL namespace is provisioned, or by connecting to the secondary ETL warehouse in the query editor v2 and running select current_namespace.
You can then grant access to the other namespace with the following command (change the consumer namespace to the namespace UID of your own secondary ETL warehouse):
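A sketch of the grant, with a placeholder namespace GUID that you replace with your consumer namespace ID:

```sql
GRANT USAGE ON DATASHARE marketing_datashare
TO NAMESPACE '<consumer-namespace-guid>';
```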
Set up the secondary ETL namespace (consumer)
At this point, you’re ready to set up your secondary (consumer) ETL warehouse to start writing to the shared data.
Create a database from the datashare
Complete the following steps to create your database:
- In the query editor v2, switch to the secondary ETL warehouse.
- Run the command show datashares to see the marketing datashare as well as the datashare producer's namespace.
- Use that namespace to create a database from the datashare, as shown in the following code:
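A sketch of both steps, using marketing_ds_db (the datashare database name used later in this post) along with placeholder values for the datashare name and producer namespace:

```sql
SHOW DATASHARES;

CREATE DATABASE marketing_ds_db
WITH PERMISSIONS
FROM DATASHARE marketing_datashare
OF NAMESPACE '<producer-namespace-guid>';
```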
Specifying with permissions allows you to grant granular permissions to individual database users and roles. Without this, if you grant usage permissions on the datashare database, users and roles get all permissions on all objects within the datashare database.
Create a user and grant permissions to that user
Create a user using the CREATE USER command:
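A sketch of the user creation and grants described below. The password is a placeholder, and the exact form of the scoped grants may differ slightly across Redshift versions, so verify against the GRANT documentation:

```sql
CREATE USER data_engineer PASSWORD '<choose-a-strong-password>';

-- Let the user use the datashare database and everything currently in it
GRANT USAGE ON DATABASE marketing_ds_db TO data_engineer;
GRANT ALL ON ALL TABLES IN SCHEMA marketing_ds_db.prod TO data_engineer;
GRANT ALL ON ALL TABLES IN SCHEMA marketing_ds_db.staging TO data_engineer;

-- Scoped grants so permissions also apply to future objects in these schemas
GRANT ALL FOR TABLES IN SCHEMA marketing_ds_db.prod TO data_engineer;
GRANT ALL FOR TABLES IN SCHEMA marketing_ds_db.staging TO data_engineer;
```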
With these grants, you've given the user data_engineer all permissions on all objects in the datashare. Additionally, you've granted all permissions available in the schemas as scoped permissions for data_engineer. Any permissions on any objects added to those schemas will be automatically granted to data_engineer.

At this point, you can continue the steps using either the admin user you're currently signed in as or the data_engineer user.
Options for writing to the datashare database
You can write data to the datashare database in three ways.
Use three-part notation while connected to a local database
Like with read data sharing, you can use three-part notation to reference the datashare database objects. For instance, insert into marketing_ds_db.prod.customer. Note that you can't use multi-statement transactions to write to objects in the datashare database like this.
Connect directly to the datashare database
You can connect directly to the datashare database via the Redshift JDBC, ODBC, or Python driver, in addition to the Amazon Redshift Data API (new). To connect like this, specify the datashare database name in the connection string. This allows you to write to the datashare database using two-part notation and use multi-statement transactions to write to the datashare database. Note that some system and catalog tables are not available this way.
Run the use command
You can now specify that you want to use another database with the command use <database_name>. This allows you to write to the datashare database using two-part notation and use multi-statement transactions to write to the datashare database. Note that some system and catalog tables are not available this way. Also, when querying system and catalog tables, you will be querying the system and catalog tables of the database you are connected to, not the database you are using.
To try this method, run the following command:
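For example, assuming the datashare database created earlier:

```sql
USE marketing_ds_db;
```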
Start writing to the datashare database
In this section, we show how to write to the datashare database using the second and third options we discussed (direct connection or use command). We use the AWS Labs provided SQL to write to the datashare database.
Create a staging table
Create a table within the staging schema, because you've been granted create privileges. We create a table within the datashare's staging schema with the following DDL statement:
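A sketch of the staging DDL, mirroring the TPC-H customer layout used for prod.af_customer (data types are approximate):

```sql
CREATE TABLE staging.customer (
    c_custkey    BIGINT NOT NULL,
    c_name       VARCHAR(25),
    c_address    VARCHAR(40),
    c_nationkey  INT,
    c_phone      CHAR(15),
    c_acctbal    DECIMAL(18,4),
    c_mktsegment CHAR(10),
    c_comment    VARCHAR(117)
);
```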
You can use two-part notation because you used the USE command or directly connected to the datashare database. If not, you need to specify the datashare database name as well.
Copy data into the staging table
Copy the customer TPCH 10 data from the AWS Labs public S3 bucket into the table using the following command:
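A sketch of the COPY, again assuming the default IAM role and the public Redshift TPC-H sample location (adjust the path if yours differs):

```sql
COPY staging.customer
FROM 's3://redshift-downloads/TPC-H/2.18/10GB/customer.tbl'
IAM_ROLE default
DELIMITER '|'
REGION 'us-east-1';
```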
As before, this requires you to have set up the default IAM role when creating this warehouse.
Ingest African customer data to the table prod.af_customer
Run the following command to ingest only the African customer data to the table prod.af_customer:
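A sketch of the insert, filtering on the TPC-H AFRICA region by joining through the nation and region tables:

```sql
INSERT INTO prod.af_customer
SELECT c.*
FROM staging.customer c
JOIN prod.nation n ON c.c_nationkey = n.n_nationkey
JOIN prod.region r ON n.n_regionkey = r.r_regionkey
WHERE r.r_name = 'AFRICA';
```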
This requires you to join on the nation and region tables you have select permission for.
Truncate the staging table
You can truncate the staging table so that you can write to it without recreating it in a future job. The truncate action will run transactionally and can be rolled back if you are connected directly to the datashare database or you are using the use command (even if you’re not using a datashare database). Use the following code:
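For example:

```sql
TRUNCATE staging.customer;
```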
At this point, you've completed ingesting the data to the primary namespace. You can query the af_customer table from both the primary warehouse and the secondary ETL warehouse and see the same data.
Conclusion
In this post, we showed how to use multiple warehouses to write to the same database. This solution has the following benefits:
- You can use provisioned clusters and serverless workgroups of different sizes to write to the same databases
- You can write across accounts and regions
- Data is live and available to all warehouses as soon as it is committed
- Writes work even if the producer warehouse (the warehouse that owns the database) is paused
To learn more about this feature, please see our documentation. Additionally, if you have any feedback, please email us at dsw-feedback@amazon.com.
About the authors
Ryan Waldorf is a Senior Product Manager at Amazon Redshift. Ryan focuses on features that enable customers to define and scale compute including data sharing and concurrency scaling.
Harshida Patel is an Analytics Specialist Principal Solutions Architect with Amazon Web Services (AWS).
Sudipto Das is a Senior Principal Engineer at Amazon Web Services (AWS). He leads the technical architecture and strategy of multiple database and analytics services in AWS with special focus on Amazon Redshift and Amazon Aurora.
Saurav Das is part of the Amazon Redshift Product Management team. He has more than 16 years of experience in working with relational databases technologies and data protection. He has a deep interest in solving customer challenges centered around high availability and disaster recovery.