AWS Big Data Blog
Cross-account data collaboration with Amazon DataZone and AWS analytical tools
Data sharing has become a crucial aspect of driving innovation, contributing to growth, and fostering collaboration across industries. According to this Gartner study, organizations promoting data sharing outperform their peers on most business value metrics. Effective data sharing across an organization depends on a straightforward mechanism for data access and sharing. Organizations that try to share data products across AWS accounts face challenges such as the complexity of managing cross-account permissions and the difficulty of discovering the right data across accounts. Amazon DataZone is a fully managed data management service that customers can use to catalog, discover, share, and govern data stored across Amazon Web Services (AWS).
In this post, we will cover how you can use Amazon DataZone to facilitate data collaboration between AWS accounts.
Solution overview
This solution provides a streamlined way to enable cross-account data collaboration using Amazon DataZone domain association while maintaining security and governance. This post describes the process of using the business data catalog resource of Amazon DataZone to publish data assets so they’re discoverable by other accounts. After they’ve been published, you can query the published assets from another AWS account using analytical tools such as Amazon Athena and the Amazon Redshift query editor, as shown in the following figure.
In this solution (as shown in the preceding figure), the AWS account that contains the data assets is referred to as the producer account. The AWS account that needs to access or use the data from the producer account is referred to as the consumer account. The Amazon DataZone domain is created and managed within the producer account and then the consumer account is associated with that domain.
As part of Amazon DataZone domain association, Amazon DataZone uses AWS Resource Access Manager (AWS RAM) to share the resource. When the producer and consumer AWS accounts are in the same organization within AWS Organizations, the domain association happens automatically. If the producer and consumer AWS accounts are in different organizations, AWS RAM sends an invitation to the consumer AWS account to accept or reject the resource grant.
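The rule in the preceding paragraph can be written as a small check. This is a minimal sketch that only restates the post's description. The function name and the organization ID values are hypothetical, and it assumes that an account outside any organization is treated like an account in a different organization.

```python
from typing import Optional


def association_needs_invitation(
    producer_org_id: Optional[str], consumer_org_id: Optional[str]
) -> bool:
    """Return True when the consumer account must accept an AWS RAM invitation.

    Pass None for an account that does not belong to any organization
    (assumption: such an account always receives an invitation).
    """
    if producer_org_id is None or consumer_org_id is None:
        return True
    # Same organization: the post describes the association as automatic.
    return producer_org_id != consumer_org_id


# Hypothetical organization IDs, for illustration only.
print(association_needs_invitation("o-exampleorg1", "o-exampleorg1"))
print(association_needs_invitation("o-exampleorg1", "o-exampleorg2"))
```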
This solution involves three Amazon DataZone user personas:
- Data administrators: Account owners in both producer and consumer AWS accounts. The data administrators are responsible for creating Amazon DataZone domains, configuring domain associations, and accepting domain associations within the Amazon DataZone domain.
- Data publishers: Users in producer AWS accounts. The data publishers are responsible for creating Amazon DataZone publish projects and environments, producing and publishing data assets, and accepting subscription requests.
- Data subscribers: Users in consumer AWS accounts. The data subscribers are responsible for creating Amazon DataZone subscribe projects and environments, searching for and subscribing to data assets, and querying the data and deriving insights.
Prerequisites
To follow along with the instructions, you will need:
- Two AWS accounts, one serving as the producer and the other serving as the consumer. Create new AWS accounts if necessary.
- An Amazon Redshift provisioned cluster or Amazon Redshift Serverless workgroup in the producer and consumer AWS accounts provisioned by a data administrator.
- A secret in AWS Secrets Manager storing the master user credentials for the Amazon Redshift cluster or workgroup in the producer and consumer AWS accounts.
- The data administrators are responsible for creating secrets.
- The data producers and consumers can obtain the Amazon Resource Name (ARN) of the secrets from the data administrators during the environment or environment profile creation steps.
Amazon DataZone uses Amazon Redshift datashares to share data across clusters and accounts. The following requirements and limitations apply when using Amazon Redshift datashares:
- For cross-account data sharing, both the producer and consumer clusters must be encrypted. For more information about the encryption process, see the Cluster encryption section of the datashare considerations documentation.
- Data sharing is supported only for provisioned RA3 node types (ra3.16xlarge, ra3.4xlarge, and ra3.xlplus) and Amazon Redshift Serverless.
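Before starting the walkthrough, the two datashare requirements above can be checked against what you know about each Redshift target. The following is a minimal sketch rather than an official validation. The RedshiftTarget class, its field names, and the example cluster names are hypothetical. The node type list is copied from this post and may not match the current service documentation.

```python
from dataclasses import dataclass
from typing import List, Optional

# Node types listed in this post as supporting data sharing.
SUPPORTED_NODE_TYPES = {"ra3.16xlarge", "ra3.4xlarge", "ra3.xlplus"}


@dataclass
class RedshiftTarget:
    """Hypothetical description of a cluster or Serverless workgroup."""
    name: str
    serverless: bool
    encrypted: bool
    node_type: Optional[str] = None  # leave as None for Serverless workgroups


def datashare_problems(target: RedshiftTarget) -> List[str]:
    """Return the requirements from this post that the target does not meet."""
    problems = []
    if not target.serverless and target.node_type not in SUPPORTED_NODE_TYPES:
        problems.append(
            f"{target.name}: node type {target.node_type} is not in the supported RA3 list"
        )
    if not target.encrypted:
        problems.append(
            f"{target.name}: must be encrypted for cross-account data sharing"
        )
    return problems


# Hypothetical targets, for illustration only.
producer = RedshiftTarget("producer-cluster", serverless=False, encrypted=True, node_type="ra3.xlplus")
consumer = RedshiftTarget("consumer-cluster", serverless=False, encrypted=False, node_type="dc2.large")
for issue in datashare_problems(producer) + datashare_problems(consumer):
    print(issue)
```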
Walkthrough
The following are the high-level steps to configure cross-account access. We’ve provided step-by-step instructions in the following sections.
- Create an Amazon DataZone domain in the producer account. The data administrator creates an Amazon DataZone domain.
- Request Amazon DataZone domain association from the producer account to the consumer account.
- Accept the domain association request in the consumer account. The data administrator accepts the domain association.
- Add data users to the Amazon DataZone domain.
- Create the necessary publish project for AWS Glue and Amazon Redshift in the producer account.
- Create AWS Glue and Amazon Redshift environments to publish the data assets in the producer account.
- Create and run a data source for AWS Glue and Amazon Redshift to publish assets into the business catalog.
- Create subscribe projects for AWS Glue and Amazon Redshift.
- Create AWS Glue and Amazon Redshift environment profiles and environments in the subscribe project.
- Subscribe to AWS Glue and Amazon Redshift tables. Consume the data using the Athena and Amazon Redshift query editors. This step is performed by the data subscriber.
Create the Amazon DataZone domain in the producer account
Amazon DataZone domains serve as high-level organizational units for assets, users, and projects, facilitating cross-team and cross-account collaboration. This step focuses on creating the Amazon DataZone domain in the producer account.
- Sign in to the producer account AWS Management Console for Amazon DataZone using the data administrator credentials.
- Create an Amazon DataZone domain titled Demo_cross_account_domain using the instructions at create domains.
- On the Create domain screen, select the Quick setup checkbox to automate several configuration steps, saving time and reducing the potential for setup errors. Quick setup enables two default blueprints and creates the default environment profiles for the data lake and data warehouse default blueprints.
Request Amazon DataZone domain association from the producer account to the consumer account
To associate the Amazon DataZone domain with the consumer account, the producer account requests a domain association. This involves providing necessary information about the consumer account and granting appropriate permissions for data access and management.
- Sign in to the Amazon DataZone console of the producer account using the data administrator credentials.
- Navigate to the domain detail page, and then scroll down and select the Associated Accounts tab.
- Enter the IDs of the consumer accounts that you want to associate. Choose Add another account if you want to add more than one account. When you’re satisfied with the list of account IDs, choose Request association.
- Use the latest AWS RAM DataZonePortalReadWrite policy when requesting the account association. This policy allows users in the consumer account to execute Amazon DataZone APIs and to use the data portal interface.
Accept an account association request from an Amazon DataZone domain
This step focuses on accepting the account association request from the Amazon DataZone domain in the consumer account. This allows the consumer account to be linked with the Amazon DataZone domain to enable data sharing and collaboration between the producer and consumer accounts.
- Sign in to the consumer account and go to the Amazon DataZone console in the same AWS Region as the domain. On the Amazon DataZone home page, choose View requests.
- Select the name of the inviting Amazon DataZone domain and choose Review request.
- Choose Accept association. Demo_cross_account_domain should appear with the state Associated on the Associated domains screen.
- Choose the domain for which you want to enable an environment blueprint.
- From the Blueprints list, choose the DefaultDataLake blueprint.
- On the Permissions and resources page, for enabling the DefaultDataLake blueprint, for Glue Manage Access role, specify a new role that grants Amazon DataZone authorization to ingest and manage access to tables in AWS Glue and AWS Lake Formation.
- Repeat steps 4 to 6 to enable the DefaultDataWarehouse blueprint by choosing DefaultDataWarehouse instead of DefaultDataLake.
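When the two accounts are in different organizations, the association arrives in the consumer account as an AWS RAM invitation. The following minimal sketch accepts pending invitations from the producer account with the AWS RAM API instead of the console. It assumes that accepting the RAM invitation is equivalent to the Accept association console step, which this post does not state explicitly. The function name and the account ID in the usage comment are hypothetical, and pagination of the invitation list is not handled.

```python
def accept_pending_invitations(ram_client, sender_account_id):
    """Accept every pending AWS RAM invitation sent by the given account.

    ram_client is expected to behave like boto3.client("ram").
    Returns the ARNs of the invitations that were accepted.
    """
    accepted = []
    response = ram_client.get_resource_share_invitations()
    for invitation in response.get("resourceShareInvitations", []):
        # Skip invitations that were already handled or came from another account.
        if invitation.get("status") != "PENDING":
            continue
        if invitation.get("senderAccountId") != sender_account_id:
            continue
        arn = invitation["resourceShareInvitationArn"]
        ram_client.accept_resource_share_invitation(resourceShareInvitationArn=arn)
        accepted.append(arn)
    return accepted


# Usage (requires boto3 and credentials for the consumer account; run it in
# the same Region as the domain; the account ID below is a placeholder):
#   import boto3
#   accept_pending_invitations(boto3.client("ram"), "111122223333")
```

Passing the client in as a parameter keeps the function easy to exercise with a stub before pointing it at a real account.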
Add data users to the Amazon DataZone domain
To grant the data publisher and data subscriber IAM users access to the Amazon DataZone data portal from the console, use the following steps to add them in the User management section of the Amazon DataZone domain. See Manage users in the Amazon DataZone console for additional details.
- Sign in to the Amazon DataZone console as a data administrator using the producer account.
- Select the Amazon DataZone domain and, in the User management section, choose Add and select Add IAM users.
- On the Add users page, choose Current account, enter the user ARN of the data publisher, and choose Add users.
- Next, choose Associated account, enter the data subscriber user’s ARN, and choose Add users.
Create the publish project for AWS Glue and Amazon Redshift
This step focuses on creating the publish project for AWS Glue and Amazon Redshift in the producer account. The project will be used to publish data from your data sources to the appropriate AWS services.
- Using the producer account, sign in to the Amazon DataZone console as a data publisher.
- Select View domains and select the demo_cross_account_domain.
- Choose the Open data portal link and sign in to the data portal.
- Choose Create New Project and create a project named Glue_Publish_Project for publishing AWS Glue data assets. Create the project under demo_cross_account_domain.
- Create another project named Redshift_Publish_Project for publishing Amazon Redshift data assets, also under demo_cross_account_domain.
Create AWS Glue and Amazon Redshift environments to publish the data assets
In this step, you set up AWS Glue and Amazon Redshift environments in the producer account to share data assets. The required infrastructure, such as the AWS Glue Data Catalog and Redshift cluster for storing data, should already be in place. After setup, this will allow the consumer account to access and use the shared data assets. See Create a new environment for detailed instructions on creating a new environment.
Create the AWS Glue environment and a new AWS Glue table
- In the same Amazon DataZone domain demo_cross_account_domain, choose Browse Project, select the Glue_Publish_Project, and create Glue_Publish_Environment using the default DataLakeProfile.
- Leave the producer_glue_db_name, consumer_glue_db_name and Workgroup_name blank.
- Choose Create Environment and wait for the process to complete.
- After the environment is created, browse the list of available projects and choose Glue_Publish_Project.
- Next, navigate to the Glue_Publish_Environment, and under Analytics tools, choose Amazon Athena to open the Athena query editor.
- Choose Open Athena. Make sure that Glue_Publish_Environment is selected in the Amazon DataZone environment dropdown at the upper right and that, under Data on the left, glue_publish_environment_pub_db is selected as the Database.
- Create a new AWS Glue table for publishing to Amazon DataZone. Run a create table as select (CTAS) query in the Query window to create a new table named mkt_sls_table that contains sample marketing and sales data.
- Go to the Tables and Views section and verify that the mkt_sls_table table was successfully created.
Create the Amazon Redshift publish environment and a new Redshift table
- Staying in the same Amazon DataZone domain demo_cross_account_domain, choose Browse Project and select the Redshift_Publish_Project. Create an Amazon Redshift publish environment named Redshift_Publish_Environment using the default data warehouse profile.
- To configure environment parameters, enter the name of your Amazon Redshift cluster or workgroup, specify the database name and enter the AWS Secrets Manager secret ARN for the Redshift cluster or workgroup. You need to make sure that the secret in Secrets Manager includes the following tags. These tags help Amazon DataZone implement proper access control so that only authorized users within the correct Amazon DataZone project and domain can access the Amazon Redshift resource:
- For an Amazon Redshift cluster: DataZone.rs.cluster: <cluster_name:database name>
- For an Amazon Redshift Serverless workgroup: DataZone.rs.workgroup: <workgroup_name:database_name>
- AmazonDataZoneProject: <projectID>
- AmazonDataZoneDomain: <domainID>
For more information about creating the Redshift database user secret in Secrets Manager, see Storing database credentials in AWS Secrets Manager.
- Note that the database user you provide in Secrets Manager must have superuser permissions. Data publishers should work with the data administrator to get the details of the Redshift cluster or workgroup, database name, and secret ARN.
- The schema is optional.
- Choose Create Environment and wait for the process to complete.
- Verify that the environment is created successfully without errors.
- Browse the list of available projects and select Redshift_Publish_Project. Navigate to Redshift_Publish_Environment.
- Under Analytics tools, choose Amazon Redshift to open the Amazon Redshift query editor.
- Select the Redshift cluster that you want to connect to, choose Save, and then choose Create Connection using temporary credentials with your IAM identity.
- Create a new Redshift table. You can use a CTAS query to create a new table named rs_sls_tbl. Use the provided CTAS script, which creates a table with sample sales data in the datazone_env_redshift_publish_environment schema.
- Make sure that the rs_sls_tbl table is successfully created.
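The secret tags required in the environment parameters step above can be assembled programmatically. The following is a minimal sketch. The function name and the example IDs are hypothetical, and the tag keys are copied exactly as written in this post. Tag keys are case sensitive, so confirm the exact casing against the current Amazon DataZone documentation before applying them.

```python
# Tag keys as written in this post (verify the casing before use).
CLUSTER_TAG_KEY = "DataZone.rs.cluster"
WORKGROUP_TAG_KEY = "DataZone.rs.workgroup"


def build_secret_tags(domain_id, project_id, database, cluster=None, workgroup=None):
    """Return the tag list for the Redshift secret in Secrets Manager format.

    Provide exactly one of cluster (provisioned) or workgroup (Serverless).
    """
    if (cluster is None) == (workgroup is None):
        raise ValueError("provide exactly one of cluster or workgroup")
    if cluster is not None:
        target_tag = {"Key": CLUSTER_TAG_KEY, "Value": f"{cluster}:{database}"}
    else:
        target_tag = {"Key": WORKGROUP_TAG_KEY, "Value": f"{workgroup}:{database}"}
    return [
        target_tag,
        {"Key": "AmazonDataZoneProject", "Value": project_id},
        {"Key": "AmazonDataZoneDomain", "Value": domain_id},
    ]


# Hypothetical IDs and names, for illustration only.
tags = build_secret_tags("dzd_exampledomain", "exampleproject", "dev", cluster="producer-cluster")
for tag in tags:
    print(tag["Key"], "=", tag["Value"])

# Usage (requires boto3; the secret ARN comes from the data administrator):
#   import boto3
#   boto3.client("secretsmanager").tag_resource(SecretId=secret_arn, Tags=tags)
```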
Publish assets into the common business catalog
In this step, you create and run the Amazon DataZone data sources for AWS Glue and Amazon Redshift. You will then publish the data assets from these data sources.
The Amazon DataZone data sources allow you to connect to various data sources, including databases, data warehouses, and data lakes, and ingest metadata into Amazon DataZone. By creating and running these data sources, you can make your data available for analysis, transformation, and sharing within your organization.
After the data sources are set up, you can publish the data assets from these sources to make them accessible to other users and applications. This process involves mapping the data assets to the appropriate business terms and metadata, making sure that the data is properly described and categorized.
Add an AWS Glue data source to publish the new AWS Glue table
- Stay signed in to the Amazon DataZone console of the producer account as a data publisher.
- Choose Select project from the top navigation pane and select the Glue_Publish_Project that you want to add the data source to.
- Select the Glue_Publish_Environment.
- Choose Create data source. Enter glue-publish-datasource as the name.
- Under Data source type, choose AWS Glue.
- Under Select an environment, select Glue_Publish_Environment.
- Under Data selection, select the AWS Glue database glue_publish_environment_pub_db, enter * as your table selection criteria, and then choose Next.
- Leave all other settings as default and choose Next.
- For Run Preference, select Run on demand to ingest metadata from the specified AWS Glue tables into Amazon DataZone.
- Review and choose Create.
- After the data source has been created, choose Run. The mkt_sls_table table will be listed in the inventory and available to publish.
- Select the mkt_sls_table table and review the metadata that was generated. Choose Accept All if you’re satisfied with the metadata.
- Choose Publish Asset. The mkt_sls_table table will be published to the business data catalog, making it discoverable and understandable across your organization.
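The console steps above have an API counterpart. The following minimal sketch builds the request for the Amazon DataZone CreateDataSource operation with the same choices: an AWS Glue data source, the glue_publish_environment_pub_db database, and * as the table selection criteria. The field names follow my reading of the boto3 create_data_source parameters, so verify them against the API reference. The function name and the example IDs are hypothetical, and leaving out a schedule is assumed to correspond to Run on demand.

```python
def glue_data_source_request(domain_id, project_id, environment_id, database, table_filter="*"):
    """Build keyword arguments for the DataZone create_data_source call."""
    return {
        "domainIdentifier": domain_id,
        "projectIdentifier": project_id,
        "environmentIdentifier": environment_id,
        "name": "glue-publish-datasource",
        "type": "GLUE",
        # Keep assets in the inventory so they can be reviewed before publishing.
        "publishOnImport": False,
        "configuration": {
            "glueRunConfiguration": {
                "relationalFilterConfigurations": [
                    {
                        "databaseName": database,
                        "filterExpressions": [
                            {"expression": table_filter, "type": "INCLUDE"}
                        ],
                    }
                ]
            }
        },
    }


# Hypothetical identifiers, for illustration only.
request = glue_data_source_request(
    "dzd_exampledomain", "exampleproject", "exampleenvironment",
    "glue_publish_environment_pub_db",
)
print(request["name"], request["type"])

# Usage (requires boto3 and data publisher credentials in the producer account):
#   import boto3
#   datazone = boto3.client("datazone")
#   created = datazone.create_data_source(**request)
#   datazone.start_data_source_run(
#       domainIdentifier=request["domainIdentifier"],
#       dataSourceIdentifier=created["id"],
#   )
```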
Add an Amazon Redshift data source to publish the new Amazon Redshift table
- Stay signed in to the Amazon DataZone console of the producer account as a data publisher.
- Choose Select project from the top navigation pane and select the Redshift_Publish_Project that you want to add the data source to.
- Choose the Redshift_Publish_Environment.
- Choose Create data source. Enter rs-publish-datasource as the name.
- Under Data source type, select Amazon Redshift.
- Under Select an environment, select Redshift_Publish_Environment.
- Under Redshift Credentials, enter the Redshift cluster and secret details provided by the data administrator.
- Under Data Selection, select the database dev and schema datazone_env_redshift_publish_environment.
- Keep the other settings as default and choose Next.
- For Run Preference, select Run on Demand.
- Choose Save. After the data source is created, choose Run. The data source runs and the rs_sls_tbl table will be listed in the inventory and available to publish.
- Select the rs_sls_tbl table and review the metadata that was generated. Choose Accept All if you are satisfied with the metadata.
- Choose Publish Asset. The rs_sls_tbl table will be published to the business data catalog.
Create subscribe projects for AWS Glue and Amazon Redshift
In this step, you create the projects for subscribing to AWS Glue and Amazon Redshift data assets within your Amazon DataZone domain.
- Sign in to the Amazon DataZone console as a data subscriber IAM user using the consumer account.
- Choose Associated domains and select the demo_cross_account_domain.
- Select the Open data portal link and sign in to the data portal.
- Choose Create New Project and create a project named Glue_Subscribe_Project for subscribing to the AWS Glue data assets.
- Create another project named Redshift_Subscribe_Project for subscribing to the Redshift data assets.
Create AWS Glue and Amazon Redshift environment profiles
In this step, you will set up the environment profiles and environments for AWS Glue and Amazon Redshift in your Amazon DataZone projects. This will allow you to connect and interact with resources across AWS accounts.
The purpose of environment profiles in Amazon DataZone is to streamline the process of environment creation. By using environment profiles, you can preconfigure essential placement information such as AWS account and AWS Region. In this solution, you will configure environment profiles with placement information pointing to your consumer account.
You will also create an Amazon DataZone environment from the profiles you are about to create. This will provision the necessary resources in the consumer account and establish the connections between the Amazon DataZone domain and the consumer account. After the environments are created, you can work with AWS Glue and Amazon Redshift assets seamlessly across different AWS accounts within your Amazon DataZone ecosystem.
Create an AWS Glue profile and environment
- Stay signed in to the consumer account’s Amazon DataZone console as a data subscriber IAM user, select the Environments tab, and then choose Create environment profile.
- Configure the fields as follows:
- Name: Enter glue_subscribe-env-profile.
- Owner: The project where the profile is being created is selected by default in this field. Verify that it’s Glue_Subscribe_Project.
- Blueprint: Select Default Data Lake.
- AWS account parameters: Enter the consumer AWS account number and select the Region.
- Authorized projects: Select All projects.
- Publishing: Select Publish from any database.
- Choose Create Environment Profile.
- On the Create environment page, enter the following:
- Name: Enter glue_subscribe_environment.
- Verify that the Environment profile is set to glue_subscribe-env-profile.
- (Optional) Parameters: Enter the Producer glue db name, Consumer glue db name, and Workgroup name.
- Choose Create environment.
- It takes a few minutes for the environment to be created. Verify that the environment creation is successful without any errors.
Create a Redshift environment profile and environment
- Staying in the consumer account’s Amazon DataZone management console as a data subscriber IAM user, navigate to the Redshift_Subscribe_Project you created previously.
- Select the Environments tab and then choose Create environment profile.
- Configure the fields as follows:
- Name: Enter redshift_subscribe-env-profile.
- Owner: Verify that Project is set to Redshift_Subscribe_Project.
- Blueprint: Select Default Data Warehouse.
- Parameter set: Select Enter my own.
- AWS account parameters: Enter the consumer AWS account number and select the Region.
- Parameters: Select either Amazon Redshift Cluster or Amazon Redshift Serverless in the consumer account.
- AWS Secret ARN: Enter the AWS Secrets Manager secret ARN for the Redshift cluster or workgroup. You need to make sure that the secret in Secrets Manager includes the following tags. These tags help Amazon DataZone implement proper access control so that only authorized users within the correct Amazon DataZone project and domain can access the Amazon Redshift resource:
- AmazonDataZoneDomain: [Domain_ID]
- AmazonDataZoneProject: [Project_ID]
For more information about creating the Redshift database user secret in Secrets Manager, see Storing database credentials in AWS Secrets Manager.
Note that the database user you provide in AWS Secrets Manager must have superuser permissions. Data subscribers should work with the data administrator to get the details of the Redshift cluster or workgroup, database name, and secret ARN.
- Redshift cluster name: Enter the name of the Amazon Redshift cluster or Amazon Redshift Serverless workgroup.
- Database name: Enter the name of the database within the selected Amazon Redshift cluster or Amazon Redshift Serverless workgroup.
- Authorized projects: Select All projects.
- Publishing: Select Publish any schema.
- Choose Create environment profile.
- Create an environment from this profile:
- Name: Enter redshift_subscribe_environment.
- Verify that the Environment profile is set to redshift_subscribe-env-profile.
- Choose Create Environment.
It takes a few minutes for the environment to be created. Verify that the environment creation is successful without any errors.
Subscribe to the AWS Glue and Redshift tables
In this step, you will subscribe to the AWS Glue and Amazon Redshift tables published by the data producer.
Subscribe to the AWS Glue table
- Sign in to the Amazon DataZone console of the consumer account using the data subscriber credentials and navigate to the Glue_Subscribe_Project you created previously.
- Search for the Market Sales Table in the Search bar.
- Select the Market Sales Table and choose Subscribe.
- In the Subscribe pop-up window, provide the following information:
- Project: Enter the name of the project that you want to subscribe to the asset. By default this will be Glue_Subscribe_Project.
- Enter a justification for your subscription request.
- Choose Subscribe.
- Switch to the data publisher role to approve the subscription request. After choosing Approve, switch back to the data subscriber role.
- Select the Glue_Subscribe_Project and choose Subscribed Assets. Verify that the Market Sales Table is added to your environment.
- Navigate to the Amazon Athena query editor using the link in the project’s home page.
- Choose OPEN AMAZON ATHENA.
- You will be routed automatically to the Athena console. Make sure that the Amazon DataZone Environment is set to glue_subscribe_environment.
- For Database, select glue_subscribe_environment_sub_db.
- You should see the mkt_sls_table table in the Tables list. Preview the table by choosing the three-dot menu next to the table name and selecting Preview Table.
- Review the table preview results. You will be able to see all the sales-related data from the mkt_sls_table table.
Subscribe to the Redshift table
- Stay signed in to the Amazon DataZone management console as the data subscriber. Choose Select project from the top navigation pane and select the Redshift_Subscribe_Project.
- Search for Sales Table in the search bar, select the Sales Table, and choose Subscribe.
- In the Subscribe pop-up window, provide the following information:
- Project: Enter the name of the project that you want to subscribe to the asset. By default this will be Redshift_Subscribe_Project.
- Enter a justification for your subscription request.
- Choose Subscribe.
- Switch to the data publisher role, which is the producer of the Sales Table, and choose Approve.
- After the subscription request is approved, switch back to the data subscriber role.
- Select the Redshift_Subscribe_Project and choose Subscribed Assets. After the Sales Table is added to your environment, you can query the data in the table.
- Select the Amazon Redshift link in the right side panel of the project home page and navigate to the Amazon Redshift query editor.
- Select Open Amazon Redshift and the Redshift query editor v2 will open in a new tab.
- In the query editor, right-click your Amazon DataZone environment’s Amazon Redshift cluster and select Create a connection.
- Select Temporary credentials using your IAM identity for authentication.
- Enter the name of the Amazon DataZone environment’s database to create the connection.
- Choose Create connection.
- You can now view the Redshift table rs_sls_tbl in the datazone_env_redshift_subscribe_environment schema.
- Run a query against rs_sls_tbl to make sure the data is accessible. The result previews the sales data from the table.
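The verification query in the last step can be as simple as a limited SELECT. The following minimal sketch builds one. The schema and table names come from this post, while the helper name, the identifier check, and the default row limit are assumptions added for illustration.

```python
import re

# Accept only plain SQL identifiers so the names can be interpolated safely.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def preview_query(schema, table, limit=10):
    """Return a SELECT statement that previews the first rows of a table."""
    for name in (schema, table):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unexpected identifier: {name!r}")
    if limit <= 0:
        raise ValueError("limit must be positive")
    return f"SELECT * FROM {schema}.{table} LIMIT {int(limit)};"


query = preview_query("datazone_env_redshift_subscribe_environment", "rs_sls_tbl")
print(query)
```

Run the printed statement in the Redshift query editor v2 connection that you created for the Amazon DataZone environment.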
Clean up
To avoid unnecessary future charges, follow these steps:
- Delete the Amazon DataZone project if you created it as part of this post.
- Delete the Amazon DataZone domain if you created it as part of this post.
- Delete the Redshift clusters and the Redshift secrets in both the producer and consumer accounts if you created them as part of this post.
Summary
Organizations often face significant challenges when trying to share data products across multiple AWS accounts. These challenges stem from the complexity of configuring proper cross-account access permissions and roles while maintaining robust data governance and security controls.
You can use the solution described in this post to publish and consume data across AWS accounts while keeping reliable access and consistent data governance in place. By combining the power of AWS Glue and Amazon Redshift, you can unlock valuable insights and accelerate your data-driven decision-making processes.
In this post, you followed a step-by-step guide to set up cross-account data sharing using Amazon DataZone domain association. You learned how to publish data assets from a producer account. You also learned how to subscribe to and query the published assets from a consumer account. You can optionally use AWS Lake Formation access monitoring to view permissions and data access activities. AWS Lake Formation uses AWS CloudTrail for historical analysis and CloudTrail retains logs for 90 days by default.
Now that you’re familiar with the elements involved in cross-account data sharing using Amazon DataZone and your choice of analytical tool, you’re ready to try it with multiple accounts.
About the Authors
Arun Pradeep Selvaraj is a Senior Solutions Architect at AWS. Arun is passionate about working with his customers and stakeholders on digital transformations and innovation in the cloud while continuing to learn, build and reinvent. He is creative, fast-paced, deeply customer-obsessed, and uses the working backwards process to build modern architectures to help customers solve their unique challenges. Connect with him on LinkedIn.
Piyush Mattoo is a Senior Solution Architect for the Financial Services Data Provider segment at Amazon Web Services. He’s a software technology leader with over a decade of experience building scalable and distributed software systems to enable business value through the use of technology. He has an educational background in Computer Science with a master’s degree in computer and information science from University of Massachusetts. He is based out of Southern California, and his current interests include camping and nature walks.
Mani Yamaraja is a Senior Customer Solutions Manager for the Financial Services Data Provider segment at Amazon Web Services. He has over a decade of experience working with financial services customers, enabling their digital transformation journey. Mani adopts a customer-centric approach and provides technology solutions working backwards from the customer’s business goals. He is passionate about the financial services industry and helps customers accelerate their cloud-based transformation using the proven mechanisms of AWS.