AWS Big Data Blog

Streamline your data governance by deploying Amazon DataZone with the AWS CDK

Managing data across diverse environments can be a complex and daunting task. Amazon DataZone simplifies this so you can catalog, discover, share, and govern data stored across AWS, on premises, and third-party sources.

Many organizations manage vast amounts of data assets owned by various teams, creating a complex landscape that poses challenges for scalable data management. These organizations require a robust infrastructure as code (IaC) approach to deploy and manage their data governance solutions. In this post, we explore how to deploy Amazon DataZone using the AWS Cloud Development Kit (AWS CDK) to achieve seamless, scalable, and secure data governance.

Overview of solution

By using IaC with the AWS CDK, organizations can efficiently deploy and manage their data governance solutions. This approach provides scalability, security, and seamless integration across all teams, allowing for consistent and automated deployments.

The AWS CDK is a framework for defining cloud IaC and provisioning it through AWS CloudFormation. Developers can use any of the supported programming languages to define reusable cloud components known as constructs. A construct is a reusable and programmable component that represents AWS resources. The AWS CDK translates the high-level constructs defined by you into equivalent CloudFormation templates. AWS CloudFormation provisions the resources specified in the template, streamlining the usage of IaC on AWS.

Amazon DataZone core components are the building blocks to create a comprehensive end-to-end solution for data management and data governance. The following are the Amazon DataZone core components. For more details, see Amazon DataZone terminology and concepts.

  • Amazon DataZone domain – You can use an Amazon DataZone domain to organize your assets, users, and their projects. By associating additional AWS accounts with your Amazon DataZone domains, you can bring together your data sources.
  • Data portal – The data portal is outside the AWS Management Console. This is a browser-based web application where different users can catalog, discover, govern, share, and analyze data in a self-service fashion.
  • Business data catalog – You can use this component to catalog data across your organization with business context and enable everyone in your organization to find and understand data quickly.
  • Projects – In Amazon DataZone, projects are business use case-based groupings of people, assets (data), and tools used to simplify access to AWS analytics.
  • Environments – Within Amazon DataZone projects, environments are collections of zero or more configured resources on which a given set of AWS Identity and Access Management (IAM) principals (for example, users with a contributor permissions) can operate.
  • Amazon DataZone data source – In Amazon DataZone, you can publish an AWS Glue Data Catalog data source or Amazon Redshift data source.
  • Publish and subscribe workflows – You can use these automated workflows to secure data between producers and consumers in a self-service manner and make sure that everyone in your organization has access to the right data for the right purpose.

We use an AWS CDK app to demonstrate how to create and deploy core components of Amazon DataZone in an AWS account. The following diagram illustrates the primary core components that we create.

In addition to the core components deployed with the AWS CDK, we provide a custom resource module to create Amazon DataZone components such as glossaries, glossary terms, and metadata forms, which are not supported by AWS CDK constructs (at the time of writing).

Prerequisites

The following local machine prerequisites are required before starting:

Deploy the solution

Complete the following steps to deploy the solution:

  1. Clone the GitHub repository and go to the root of your downloaded repository folder:
    git clone https://github.com/aws-samples/amazon-datazone-cdk-example.git
    cd amazon-datazone-cdk-example
  2. Install local dependencies:
    $ npm ci ### this will install the packages configured in package-lock.json
  3. Sign in to your AWS account using the AWS CLI by configuring your credential file (replace <PROFILE_NAME> with the profile name of your deployment AWS account):
    $ export AWS_PROFILE=<PROFILE_NAME>
  4. Bootstrap the AWS CDK environment (this is a one-time activity and not needed if your AWS account is already bootstrapped):
    $ npm run cdk bootstrap
  5. Run the script to replace the placeholders for your AWS account and AWS Region in the config files:
    $ ./scripts/prepare.sh <<YOUR_AWS_ACCOUNT_ID>> <<YOUR_AWS_REGION>>

The preceding command will replace the AWS_ACCOUNT_ID_PLACEHOLDER and AWS_REGION_PLACEHOLDER values in the following config files:

  • lib/config/project_config.json
  • lib/config/project_environment_config.json
  • lib/constants.ts

Next, you configure your Amazon DataZone domain, project, business glossary, metadata forms, and environments with your data source.

  1. Go to the file lib/constants.ts. You can keep the DOMAIN_NAME provided or update it as needed.
  2. Go to the file lib/config/project_config.json. You can keep the example values for projectName and projectDescription or update them. An example value for projectMembers has also been provided (as shown in the following code snippet). Update the value of the memberIdentifier parameter with an IAM role ARN of your choice that you would like to be the owner of this project.
    "projectMembers": [
                {
                    "memberIdentifier": "arn:aws:iam::AWS_ACCOUNT_ID_PLACEHOLDER:role/Admin",
                    "memberIdentifierType": "UserIdentifier"
                }
            ]
  3. Go to the file lib/config/project_glossary_config.json. An example business glossary and glossary terms are provided for the projects; you can keep them as is or update them with your project name, business glossary, and glossary terms.
  4. Go to the lib/config/project_form_config.json file. You can keep the example metadata forms provided for the projects or update your project name and metadata forms.
  5. Go to the lib/config/project_enviornment_config.json file. Update EXISTING_GLUE_DB_NAME_PLACEHOLDER with the existing AWS Glue database name in the same AWS account where you are deploying the Amazon DataZone core components with the AWS CDK. Make sure you have at least one existing AWS Glue table in this AWS Glue database to publish as a data source within Amazon DataZone. Replace DATA_SOURCE_NAME_PLACEHOLDER and DATA_SOURCE_DESCRIPTION_PLACEHOLDER with your choice of Amazon DataZone data source name and description. An example of a cron schedule has been provided (see the following code snippet). This is the schedule for your data source run; you can keep the same or update it.
    "Schedule":{
       "schedule":"cron(0 7 * * ? *)"
    }

Next, you update the trust policy of the AWS CDK deployment IAM role to deploy a custom resource module.

  1. On the IAM console, update the trust policy of the IAM role for your AWS CDK deployment that starts with cdk-hnb659fds-cfn-exec-role- by adding the following permissions. Replace ${ACCOUNT_ID} and ${REGION} with your specific AWS account and Region.
         {
             "Effect": "Allow",
             "Principal": {
                 "Service": "lambda.amazonaws.com"
             },
             "Action": "sts:AssumeRole",
             "Condition": {
                 "ArnLike": {
                     "aws:SourceArn": [
                         
                         "arn:aws:lambda:${REGION}:{ACCOUNT_ID}:function:DataZonePreqStack-GlossaryLambda*",
                         "arn:aws:lambda:${REGION}:{ACCOUNT_ID}:function:DataZonePreqStack-GlossaryTermLambda*",
                         "arn:aws:lambda:${REGION}:{ACCOUNT_ID}:function:DataZonePreqStack-FormLambda*"
                     ]
                 }
             }
         }

Now you can configure data lake administrators in Lake Formation.

  1. On the Lake Formation console, choose Administrative roles and tasks in the navigation pane.
  2. Under Data lake administrators, choose Add and add the IAM role for AWS CDK deployment that starts with cdk-hnb659fds-cfn-exec-role- as an administrator.

This IAM role needs permissions in Lake Formation to create resources, such as an AWS Glue database. Without these permissions, the AWS CDK stack deployment will fail.

  1. Deploy the solution:
    $ npm run cdk deploy --all
  2. During deployment, enter y if you want to deploy the changes for some stacks when you see the prompt Do you wish to deploy these changes (y/n)?.
  3. After the deployment is complete, sign in to your AWS account and navigate to the AWS CloudFormation console to verify that the infrastructure deployed.

You should see a list of the deployed CloudFormation stacks, as shown in the following screenshot.

  1. Open the Amazon DataZone console in your AWS account and open your domain.
  2. Open the data portal URL available in the Summary section.
  3. Find your project in the data portal and run the data source job.

This is a one-time activity if you want to publish and search the data source immediately within Amazon DataZone. Otherwise, wait for the data source runs according to the cron schedule mentioned in the preceding steps.

Troubleshooting

If you get the message "Domain name already exists under this account, please use another one (Service: DataZone, Status Code: 409, Request ID: 2d054cb0-0 fb7-466f-ae04-c53ff3c57c9a)" (RequestToken: 85ab4aa7-9e22-c7e6-8f00-80b5871e4bf7, HandlerErrorCode: AlreadyExists), change the domain name under lib/constants.ts and try to deploy again.

If you get the message "Resource of type 'AWS::IAM::Role' with identifier 'CustomResourceProviderRole1' already exists." (RequestToken: 17a6384e-7b0f-03b3 -1161-198fb044464d, HandlerErrorCode: AlreadyExists), this means you’re accidentally trying to deploy everything in the same account but a different Region. Make sure to use the Region you configured in your initial deployment. For the sake of simplicity, the DataZonePreReqStack is in one Region in the same account.

If you get the message “Unmanaged asset” Warning in the data asset on your datazone project, you must explicitly provide Amazon DataZone with Lake Formation permissions to access tables in this external AWS Glue database. For instructions, refer to Configure Lake Formation permissions for Amazon DataZone.

Clean up

To avoid incurring future charges, delete the resources. If you have already shared the data source using Amazon DataZone, then you have to remove those manually first in the Amazon DataZone data portal because the AWS CDK isn’t able to automatically do that.

  1. Unpublish the data within the Amazon DataZone data portal.
  2. Delete the data asset from the Amazon DataZone data portal.
  3. From the root of your repository folder, run the following command:
    $ npm run cdk destroy --all
  4. Delete the Amazon DataZone created databases in AWS Glue. Refer to the tips to troubleshoot Lake Formation permission errors in AWS Glue if needed.
  5. Remove the created IAM roles from Lake Formation administrative roles and tasks.

Conclusion

Amazon DataZone offers a comprehensive solution for implementing a data mesh architecture, enabling organizations to address advanced data governance challenges effectively. Using the AWS CDK for IaC streamlines the deployment and management of Amazon DataZone resources, promoting consistency, reproducibility, and automation. This approach enhances data organization and sharing across your organization.

Ready to streamline your data governance? Dive deeper into Amazon DataZone by visiting the Amazon DataZone User Guide. To learn more about the AWS CDK, explore the AWS CDK Developer Guide.


About the Authors

Bandana Das is a Senior Data Architect at Amazon Web Services and specializes in data and analytics. She builds event-driven data architectures to support customers in data management and data-driven decision-making. She is also passionate about enabling customers on their data management journey to the cloud.

Gezim Musliaj is a Senior DevOps Consultant with AWS Professional Services. He is interested in various things CI/CD, data, and their application in the field of IoT, massive data ingestion, and recently MLOps and GenAI.

Sameer Ranjha is a Software Development Engineer on the Amazon DataZone team. He works in the domain of modern data architectures and software engineering, developing scalable and efficient solutions.

Sindi Cali is an Associate Consultant with AWS Professional Services. She supports customers in building data-driven applications in AWS.

Bhaskar Singh is a Software Development Engineer on the Amazon DataZone team. He has contributed to implementing AWS CloudFormation support for Amazon DataZone. He is passionate about distributed systems and dedicated to solving customers’ problems.