This Guidance shows how you can build a data mesh architecture on AWS to implement a decentralized, domain-driven approach to data management. It gives domain teams the ownership and agility to deliver valuable data products, fostering better decision-making, personalized experiences, and operational efficiency. The Guidance addresses how AWS services, users, and key resources can meet the data security challenges that come with distributed, decentralized ownership in a typical data mesh design. With this Guidance, disparate data sources are united and linked through centrally managed data sharing and governance guidelines, so you retain control over how shared data is accessed, who accesses it, and the format in which it is accessed.
Please note: See the Disclaimer section at the end of this Guidance.
Architecture Diagram
Overview
This architecture diagram illustrates an overview of a data mesh design that allows for distributed data ownership and control while providing centralized data sharing and governance to address security challenges. The subsequent diagram highlights the essential AWS services used in implementing this design pattern.
Step 1
Multiple data producer accounts exist across different business domains and teams.
Step 2
Data producers collect and transform data to generate shareable data assets, which mainly consist of a technical metadata catalog, databases, and scalable storage. Data producers are responsible for curating the data assets and keeping them current.
Step 3
One central governance account acts as a bridge between data producers and data consumers. It does not store the actual data.
Step 4
Data stewards maintain and enrich the enterprise data catalog across accounts with business metadata. Data admins create the necessary permissions for data producers to register data assets and for data consumers to access data.
Step 5
The central governance account maintains the enterprise data catalog and enriches the business catalog with the corresponding access policies and encryption keys.
Step 6
The central governance account stores all logs, including access logs and shared data object logs, and supports audit reports.
Step 7
Multiple data consumer accounts exist across different business domains and teams.
Step 8
Data consumer accounts search the enterprise data catalog, request access to data assets, and bring their own compute resources to analyze the data once access is granted.
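To make the bridge role of the central governance account (steps 3 through 6) more concrete, the following minimal boto3 sketch shows one way a central catalog account could allow a producer account to register data assets, using an AWS Glue Data Catalog resource policy. The account IDs, region, and actions are hypothetical placeholders, not part of this Guidance's sample code.

```python
import json
import boto3

# Hypothetical placeholder account IDs and region for this sketch.
GOVERNANCE_ACCOUNT = "111111111111"
PRODUCER_ACCOUNT = "222222222222"
REGION = "us-east-1"

glue = boto3.client("glue", region_name=REGION)

# Resource policy on the central catalog that lets the producer
# account create and update databases and tables (register assets).
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{PRODUCER_ACCOUNT}:root"},
            "Action": ["glue:CreateDatabase", "glue:CreateTable", "glue:UpdateTable"],
            "Resource": f"arn:aws:glue:{REGION}:{GOVERNANCE_ACCOUNT}:*",
        }
    ],
}

glue.put_resource_policy(PolicyInJson=json.dumps(policy))
```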
Architecture and core AWS services
This architecture diagram shows the pivotal AWS services that allow the various components of this Guidance to function seamlessly within the data mesh architecture on AWS.
Step 1
Data producer users or roles authenticate through AWS Identity and Access Management (IAM) and/or single sign-on (SSO) providers, such as Okta and Azure Active Directory (Azure AD), integrated through AWS IAM Identity Center. Appropriate policies are attached to allow them to publish data assets.
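As a rough illustration of step 1, the following boto3 sketch creates an IAM policy that a data producer role might carry; the policy name, bucket, and action list are hypothetical, not prescribed by this Guidance.

```python
import json
import boto3

iam = boto3.client("iam")

# Illustrative permissions a data producer role might need to publish
# assets: write curated objects and register catalog tables.
publish_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": "arn:aws:s3:::example-producer-curated-bucket/*",
        },
        {
            "Effect": "Allow",
            "Action": ["glue:CreateTable", "glue:UpdateTable", "glue:GetDatabase"],
            "Resource": "*",
        },
    ],
}

iam.create_policy(
    PolicyName="ExampleDataProducerPublishPolicy",
    PolicyDocument=json.dumps(publish_policy),
)
```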
Step 2
Data assets that are ready to share are saved in scalable data stores such as Amazon Simple Storage Service (Amazon S3) and Amazon Redshift.
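A minimal sketch of step 2, assuming a hypothetical bucket and KMS key alias: the producer uploads a curated asset to Amazon S3 with KMS encryption at rest (the key itself is covered in step 6).

```python
import boto3

s3 = boto3.client("s3")

# Upload a curated, shareable data asset to the producer's S3 bucket,
# encrypting it at rest with a customer managed KMS key.
with open("orders.parquet", "rb") as f:
    s3.put_object(
        Bucket="example-producer-curated-bucket",
        Key="sales/2024/orders.parquet",
        Body=f,
        ServerSideEncryption="aws:kms",
        SSEKMSKeyId="alias/example-data-mesh-key",
    )
```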
Step 4
Amazon DataZone and AWS Lake Formation use the AWS Glue Data Catalog and Amazon Redshift to generate shareable technical metadata.
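The following sketch illustrates the kind of technical metadata registration that step 4 relies on: creating a database and table in the AWS Glue Data Catalog. The names, columns, and S3 location are hypothetical.

```python
import boto3

glue = boto3.client("glue")

# Register technical metadata for the shared asset in the Glue Data
# Catalog so Amazon DataZone and Lake Formation can expose it.
glue.create_database(DatabaseInput={"Name": "sales"})
glue.create_table(
    DatabaseName="sales",
    TableInput={
        "Name": "orders",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "order_id", "Type": "string"},
                {"Name": "amount", "Type": "double"},
            ],
            "Location": "s3://example-producer-curated-bucket/sales/2024/",
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    },
)
```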
Step 5
Data steward and data admin users or roles authenticate through IAM and/or SSO providers integrated through IAM Identity Center. Appropriate policies are attached to allow them to manage access.
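One hedged example of the administrative setup behind step 5: designating a data admin role as a Lake Formation administrator so it can manage grants. The role ARN is a placeholder.

```python
import boto3

lf = boto3.client("lakeformation")

# Designate the data admin role as a Lake Formation administrator so
# it can grant producers and consumers access to catalog resources.
lf.put_data_lake_settings(
    DataLakeSettings={
        "DataLakeAdmins": [
            {"DataLakePrincipalIdentifier": "arn:aws:iam::111111111111:role/ExampleDataAdmin"}
        ]
    }
)
```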
Step 6
AWS Key Management Service (AWS KMS) encrypts the data at rest and in transit. AWS Secrets Manager holds secrets such as database credentials.
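A minimal sketch of step 6, creating a customer managed KMS key and storing database credentials in Secrets Manager; the key description, secret name, and values are placeholders.

```python
import boto3

kms = boto3.client("kms")
secrets = boto3.client("secretsmanager")

# Create a customer managed key for encrypting shared data assets.
key = kms.create_key(Description="Example data mesh encryption key")

# Store database credentials centrally, encrypted with that key.
secrets.create_secret(
    Name="example/redshift/producer-credentials",
    SecretString='{"username": "producer_user", "password": "REPLACE_ME"}',
    KmsKeyId=key["KeyMetadata"]["KeyId"],
)
```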
Step 7
Lake Formation grants consumer users and roles access to producer data stored in Amazon Redshift. The Amazon DataZone domain enriches the metadata stored in the Data Catalog by adding business metadata.
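To illustrate the cross-account grant in step 7, this sketch grants a hypothetical consumer account SELECT on a shared Data Catalog table through Lake Formation; the account IDs and table names are placeholders.

```python
import boto3

lf = boto3.client("lakeformation")

# Grant the consumer account SELECT on the producer's shared table.
# For cross-account grants, Lake Formation shares the resource through AWS RAM.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "333333333333"},  # consumer account
    Resource={
        "Table": {
            "CatalogId": "111111111111",  # central governance account
            "DatabaseName": "sales",
            "Name": "orders",
        }
    },
    Permissions=["SELECT"],
)
```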
Step 8
All access logs are available in Lake Formation, Amazon CloudWatch, and AWS CloudTrail, which users can use for monitoring and auditing.
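As one possible audit query for step 8, the sketch below pulls recent Lake Formation events from CloudTrail with boto3.

```python
import boto3

cloudtrail = boto3.client("cloudtrail")

# Pull recent Lake Formation data-access events for an audit report.
events = cloudtrail.lookup_events(
    LookupAttributes=[
        {"AttributeKey": "EventSource", "AttributeValue": "lakeformation.amazonaws.com"}
    ],
    MaxResults=50,
)
for event in events["Events"]:
    print(event["EventTime"], event["EventName"], event.get("Username", "-"))
```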
Step 9
Data consumer users and roles authenticate through IAM and/or SSO providers integrated through IAM Identity Center.
Step 10
Consumers further refine access using fine-grained Lake Formation permissions, as sketched below. They also use the Amazon DataZone domain to search for data assets based on metadata.
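The fine-grained control mentioned in step 10 could look like the following Lake Formation data cells filter, which limits a consumer to specific rows and columns; all identifiers here are illustrative.

```python
import boto3

lf = boto3.client("lakeformation")

# Narrow the consumer's access to specific rows and columns with a
# Lake Formation data cells filter.
lf.create_data_cells_filter(
    TableData={
        "TableCatalogId": "111111111111",
        "DatabaseName": "sales",
        "TableName": "orders",
        "Name": "orders_us_only",
        "RowFilter": {"FilterExpression": "region = 'US'"},
        "ColumnNames": ["order_id", "amount"],
    }
)
```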
Step 11
Consumers bring their own compute services. For instance, data scientists use Amazon SageMaker for machine learning (ML) transformations and Amazon Bedrock for generative artificial intelligence (AI) applications. Data engineers use AWS Glue and Amazon EMR for data transformation. Data analysts use Amazon Athena for analysis, and business intelligence analysts use Amazon QuickSight for data visualization.
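To round out step 11, here is a minimal sketch of a consumer analyst querying the shared table with Athena; the database, table, and results bucket are hypothetical.

```python
import boto3

athena = boto3.client("athena")

# A consumer analyst queries the shared table with standard SQL;
# results land in the consumer's own S3 bucket.
response = athena.start_query_execution(
    QueryString="SELECT order_id, amount FROM sales.orders LIMIT 10",
    QueryExecutionContext={"Database": "sales"},
    ResultConfiguration={"OutputLocation": "s3://example-consumer-results/"},
)
print("Query started:", response["QueryExecutionId"])
```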
Well-Architected Pillars
The AWS Well-Architected Framework helps you understand the pros and cons of the decisions you make when building systems in the cloud. The six pillars of the Framework allow you to learn architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable systems. Using the AWS Well-Architected Tool, available at no charge in the AWS Management Console, you can review your workloads against these best practices by answering a set of questions for each pillar.
The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.
Operational Excellence
CloudWatch provides comprehensive visibility into your resources and services, enabling proactive monitoring, quick troubleshooting, and prompt incident response. CloudTrail allows you to audit your AWS account, supporting governance and compliance through detailed activity logs. Use these services to maintain the operational excellence of your architecture and respond effectively to events and incidents.
Security
Prioritize the security of your data and resources with IAM and AWS KMS. IAM allows you to centrally manage fine-grained permissions, specifying who or what can access your AWS services and resources. AWS KMS, on the other hand, allows you to create and manage the encryption keys that encrypt your data at rest and in transit, preserving the confidentiality and integrity of your sensitive information.
Reliability
Safeguard the reliability of your data and applications with Amazon S3 and Data Catalog. Amazon S3 is designed to provide high durability and availability, automatically replicating your data across multiple Availability Zones. The Data Catalog serves as a centralized metadata repository, helping you maintain a consistent and reliable view of your data sources across different data stores.
Performance Efficiency
Optimize the performance of your data processing and analytics with Amazon Redshift and Athena. Amazon Redshift is a fully managed, massively parallel processing (MPP) data warehouse service that helps you make fast and cost-effective business decisions. Athena, a serverless interactive query service, allows you to analyze data directly in Amazon S3 using standard SQL without the need to manage any infrastructure.
Cost Optimization
As a fully managed, serverless service, Amazon S3 eliminates the need to provision and manage infrastructure, reducing the associated costs. Use the various storage classes offered by Amazon S3, including the Amazon S3 Intelligent-Tiering storage class, S3 Standard, S3 Standard-IA, and S3 Glacier, to match your data storage and access requirements with the most cost-effective options.
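As a hedged example of matching storage class to access patterns, the sketch below adds an S3 lifecycle rule that moves objects into S3 Intelligent-Tiering; the bucket name and rule ID are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Transition objects to S3 Intelligent-Tiering shortly after creation
# so access-pattern-based tiering controls storage cost automatically.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-producer-curated-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-curated-data",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "Transitions": [{"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}],
            }
        ]
    },
)
```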
Sustainability
Amazon DataZone helps reduce data redundancy, enforces data governance policies, and facilitates secure data sharing, leading to optimized storage usage and a reduced environmental impact. By centralizing your data and enabling collaborative data sharing, you can minimize the need for data duplication across your organization, contributing to a more sustainable data environment.
Implementation Resources
A detailed implementation guide is provided for you to experiment with and use within your AWS account. It walks through each stage of the Guidance, including deployment, usage, and cleanup.
The sample code is a starting point. It is industry validated, prescriptive but not definitive, and a peek under the hood to help you begin.
Disclaimer
The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.
References to third-party services or organizations in this Guidance do not imply an endorsement, sponsorship, or affiliation between Amazon or AWS and the third party. Guidance from AWS is a technical starting point, and you can customize your integration with third-party services when you deploy the architecture.