This Guidance demonstrates how you can manage and share data to help drive your organization's sustainability initiatives. With a growing number of data sources for tracking your organization's environmental impact, it becomes challenging to discover these assets, assess their validity, and extract value from them across multiple teams. This Guidance provides a streamlined framework for enterprise data management that takes into consideration data quality, security, cataloging, and lineage—allowing you to seamlessly share applicable datasets. With more reliable data, organizations can address use cases such as calculating estimated carbon emissions more accurately, assessing climate risk, or understanding the organization's biodiversity impact. With centralized access to key data assets and proper data governance, you can make informed decisions to achieve your environmental goals more efficiently.
Please note: [Disclaimer]
Architecture Diagram
-
Overview
-
User access
-
Data discovery
-
Automated data asset registration
-
Overview
-
This architecture diagram illustrates how applications can consume and produce data assets, incorporating key data management concepts to quickly discover, share, and extract value from data across your organization. The subsequent tabs cover user access, data discovery, and automated data asset registration workflows tailored for sustainability use cases.
Step 1
Data is stored in various types of data stores, within and/or outside of AWS. These data stores contain data assets that represent physical data objects (such as a database table or a file) and house both the source and target datasets in the data fabric.
Step 2
Technical metadata is automatically imported into the data catalog for data assets that existed before the implementation of the data fabric.
Step 3
The data owners maintain business metadata for their data assets in the data catalog to enrich the data with business context, such as descriptions for dataset columns, tags, and domain- or enterprise-wide business glossary terms.
Step 4
The data consumers search the data catalog for data assets using technical and/or business metadata. The metadata pertaining to data quality and data lineage establishes trust in how data assets can be used.
Step 5
The data consumers request access to the relevant data assets from the data owner, who can either grant or deny the request.
Step 6
The data products perform extract, transform, and load (ETL), data profiling, and data quality operations to create new curated data assets that enable data-driven use cases for the data consumers.
Step 7
Data assets created by the data products are registered in the data catalog with the corresponding metadata. -
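Steps 4 and 5 above can be sketched with the AWS SDK for Python (boto3). The domain ID and search text below are placeholder assumptions, not values from this Guidance, and the actual call requires AWS credentials, so the sketch separates building the Amazon DataZone search request from sending it:

```python
import json

DOMAIN_ID = "dzd_example123"  # placeholder Amazon DataZone domain ID

def build_search_request(search_text: str, domain_id: str = DOMAIN_ID) -> dict:
    """Keyword-search request for the Amazon DataZone Search API; matches
    assets on their technical and business metadata."""
    return {
        "domainIdentifier": domain_id,
        "searchScope": "ASSET",   # search data assets (not glossary terms)
        "searchText": search_text,
        "maxResults": 10,
    }

def search_catalog(search_text: str) -> list:
    """Run the search; boto3 is imported lazily because the call needs credentials."""
    import boto3
    client = boto3.client("datazone")
    return client.search(**build_search_request(search_text)).get("items", [])

print(json.dumps(build_search_request("carbon emissions"), indent=2))
```

The consumer would then review the returned assets' quality and lineage metadata before requesting a subscription, as in Step 5.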
User access
-
This architecture diagram shows how to manage user access to the data catalog.
Step 1
AWS IAM Identity Center manages all users for both Amazon DataZone and the other APIs.
Step 2
Amazon API Gateway uses an Amazon Cognito authorizer. The corresponding user pool uses IAM Identity Center as its identity provider.
Step 3
Amazon DataZone integrates directly with IAM Identity Center for user management. -
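As a minimal sketch of how a client presents identity to the APIs in this flow: the caller obtains a JWT from the Amazon Cognito user pool and sends it in the `Authorization` header, which the API Gateway Cognito authorizer validates. The URL and token below are placeholders:

```python
import urllib.request

# Placeholder endpoint -- the real URL comes from your API Gateway deployment.
API_URL = "https://example.execute-api.us-east-1.amazonaws.com/prod/assets"

def build_authorized_request(url: str, token: str) -> urllib.request.Request:
    """Attach a Cognito-issued JWT for the API Gateway Cognito authorizer to validate."""
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )

# urllib.request.urlopen(build_authorized_request(API_URL, jwt)) would then
# call the API with the authenticated identity.
```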
Data discovery
-
This architecture diagram shows how to search, discover, and request access to data assets in the data catalog.
Step 1
Users explore the data catalog through the search functionality in Amazon DataZone. Assets can be searched for by their associated metadata.
Step 2
Data lineage for each asset is stored in an instance of OpenLineage Marquez. Marquez is deployed as an Amazon Elastic Container Service (Amazon ECS) container fronted by an Application Load Balancer. Users can view the data lineage of assets through Marquez.
Step 3
From the data catalog, the data consumer requests read-only access to a desired dataset from the data asset owner.
Step 4
Asset owners approve or deny subscription requests to individual assets that they have published to the catalog.
Step 5
Once an asset owner approves a user’s subscription request, the user can access the asset through Amazon Athena, for assets registered as AWS Glue tables, or through the Amazon Redshift Data API for Amazon Redshift tables. -
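Querying a subscribed AWS Glue table through Athena can be sketched with boto3 as follows; the database, table, and output-location names are illustrative placeholders:

```python
def build_athena_query(database: str, table: str, output_s3: str) -> dict:
    """Parameters for athena.start_query_execution against a subscribed
    AWS Glue table. All names are illustrative placeholders."""
    return {
        "QueryString": f'SELECT * FROM "{database}"."{table}" LIMIT 10',
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3},
    }

def run_query(params: dict) -> str:
    """Submit the query; boto3 is imported lazily because the call needs credentials."""
    import boto3
    return boto3.client("athena").start_query_execution(**params)["QueryExecutionId"]
```

Results land in the configured Amazon S3 output location, where the consumer can retrieve them once the query completes.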
Automated data asset registration
-
This architecture diagram shows how to manage data asset registration with profiling, transformation, quality assertion, and lineage tracking.
Step 1
Data is placed into Amazon Simple Storage Service (Amazon S3) or Amazon Redshift.
Step 2
A data owner or data product invokes an API Gateway API backed by AWS Lambda in the hub account. The API body includes information on the data location, transformation logic, profiling specifications, and data quality assertions required in later steps. The API writes an event to an Amazon EventBridge event bus, which replicates it to an event bus in the spoke account.
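A minimal sketch of this registration payload, forwarded as an EventBridge event: every field name in `detail`, plus the source and bus names, is a hypothetical placeholder for the contract that this Guidance's API actually defines.

```python
import json

def build_registration_event(s3_path: str) -> dict:
    """One EventBridge entry carrying a registration request. All field,
    source, and bus names are hypothetical placeholders."""
    detail = {
        "dataLocation": s3_path,
        "transformation": {"recipeName": "normalize-emissions"},
        "profiling": {"enabled": True},
        "dataQuality": {"rules": ['IsComplete "facility_id"']},  # DQDL-style assertion
    }
    return {
        "Source": "datafabric.registration",
        "DetailType": "AssetRegistrationRequested",
        "Detail": json.dumps(detail),
        "EventBusName": "data-fabric-hub",
    }

def publish(entry: dict) -> None:
    """Put the event on the hub bus; boto3 is imported lazily (needs credentials)."""
    import boto3
    boto3.client("events").put_events(Entries=[entry])
```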
Step 3
The event in the spoke account invokes an AWS Step Functions workflow. The workflow creates an AWS Glue connection to the Amazon Redshift or Amazon S3 data source.
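The connection creation in this step might look like the following boto3 sketch. The connection name, JDBC URL, and secret ARN are placeholders; referencing an AWS Secrets Manager secret keeps credentials out of the connection definition.

```python
def build_redshift_connection(name: str, jdbc_url: str, secret_arn: str) -> dict:
    """ConnectionInput for glue.create_connection; all values are placeholders."""
    return {
        "Name": name,
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": jdbc_url,
            "SECRET_ID": secret_arn,  # resolve credentials from Secrets Manager
        },
    }

def create_connection(connection_input: dict) -> None:
    """Create the AWS Glue connection; boto3 is imported lazily (needs credentials)."""
    import boto3
    boto3.client("glue").create_connection(ConnectionInput=connection_input)
```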
Step 4
AWS Glue DataBrew performs data transformations through a recipe job.
Step 5
An AWS Glue crawler infers the schema of the resulting dataset and creates an AWS Glue table.
Step 6
An AWS Glue DataBrew profile job computes profile statistics for the table.
Step 7
AWS Glue evaluates the data quality with user-defined assertions.
Step 8
The resulting data lineage is summarized in the event and sent back to the hub account through EventBridge.
Step 9
The EventBridge event bus in the hub account invokes another Step Functions workflow.
Step 10
The new asset is imported into Amazon DataZone by creating and running a data source.
Step 11
The lineage for the asset is published to EventBridge, which invokes an Amazon ECS deployment to register the lineage in a deployment of OpenLineage Marquez.
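The lineage registration in Step 11 can be sketched as an OpenLineage RunEvent posted to Marquez's `/api/v1/lineage` endpoint. The namespace, job, dataset, and producer values below are placeholders:

```python
import json
import uuid
from datetime import datetime, timezone

def build_lineage_event(job_name: str, output_dataset: str) -> dict:
    """A minimal OpenLineage RunEvent; namespace and producer are placeholders."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},  # OpenLineage requires a UUID run ID
        "job": {"namespace": "data-fabric", "name": job_name},
        "inputs": [],
        "outputs": [{"namespace": "data-fabric", "name": output_dataset}],
        "producer": "https://example.com/data-fabric-registration",
    }

def post_to_marquez(event: dict, marquez_url: str) -> None:
    """POST the event to Marquez; add retries and error handling in practice."""
    import urllib.request
    req = urllib.request.Request(
        f"{marquez_url}/api/v1/lineage",
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)
```

Marquez then renders this run and its output dataset in the lineage graph that users browse in the data discovery workflow.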
Get Started
Try out this Guidance
Deploy this Guidance
Well-Architected Pillars
The AWS Well-Architected Framework helps you understand the pros and cons of the decisions you make when building systems in the cloud. The six pillars of the Framework allow you to learn architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable systems. Using the AWS Well-Architected Tool, available at no charge in the AWS Management Console, you can review your workloads against these best practices by answering a set of questions for each pillar.
The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.
-
Operational Excellence
Amazon CloudWatch provides centralized monitoring and observability, which tracks operational metrics and logs across services. This integrated visibility into your workload health and performance helps you identify issues and troubleshoot problems, allowing you to continuously improve processes and procedures for efficient operations.
-
Security
Cognito, AWS Identity and Access Management (IAM), and IAM Identity Center help you implement secure authentication and authorization mechanisms. Cognito provides user authentication and authorization for the application APIs, while IAM policies and roles control access to resources based on the principle of least privilege. IAM Identity Center simplifies managing user identities across the components of this Guidance, enabling centralized identity management.
-
Reliability
An Application Load Balancer, Lambda, EventBridge, and Amazon S3 work in tandem so that your workloads perform their intended functions correctly and consistently. For example, the Application Load Balancer distributes traffic to the application containers, providing high availability. EventBridge replicates events across accounts for reliable event delivery, while the automatic scaling of Lambda handles varying workloads without disruption. And as the root data source, Amazon S3 provides highly durable and available storage.
-
Performance Efficiency
The services selected for this Guidance help you both monitor performance and maintain efficient workloads. Specifically, Athena and the Amazon Redshift Data API provide efficient querying of data assets. AWS Glue DataBrew and crawlers automate data transformation and cataloging, improving overall efficiency. Amazon Redshift Serverless scales compute resources elastically, allowing high-performance data processing without over-provisioning resources. Lastly, Amazon S3 offers high data throughput for efficient querying.
-
Cost Optimization
To optimize costs, this Guidance uses serverless services that automatically scale based on demand, ensuring that you only pay for the resources you use. For example, EventBridge eliminates the need for polling-based architectures, reducing compute costs, and Amazon Redshift Serverless automatically scales compute based on demand, charging only for resources consumed during processing.
-
Sustainability
The serverless services of this Guidance work together to reduce the need for always-on infrastructure, lowering the overall environmental impact of the workload. For example, Amazon Redshift Serverless automatically scales to the required demand, provisioning only the necessary compute resources and minimizing idle resources and their associated energy usage.
Related Content
Streamline your ESG Reporting with AWS Sustainability Data Fabric and Accenture
Disclaimer
The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.
References to third-party services or organizations in this Guidance do not imply an endorsement, sponsorship, or affiliation between Amazon or AWS and the third party. Guidance from AWS is a technical starting point, and you can customize your integration with third-party services when you deploy the architecture.