This Guidance demonstrates how you can manage and share data to help drive your organization's sustainability initiatives. With a growing number of data sources for tracking your organization's environmental impact, it becomes challenging to discover these assets, assess their validity, and extract value from them across multiple teams. This Guidance provides a streamlined framework for enterprise data management that takes into consideration data quality, security, cataloging, and lineage—allowing you to seamlessly share applicable datasets. With more reliable data, organizations can address use cases such as calculating estimated carbon emissions more accurately, assessing climate risk, or understanding the organization's biodiversity impact. With centralized access to key data assets and proper data governance, you can make informed decisions to achieve your environmental goals more efficiently.
Please note: [Disclaimer]
Architecture Diagram
-
Overview
-
User access
-
Data discovery
-
Automated data asset registration
-
Overview
-
This architecture diagram illustrates how applications can consume and produce data assets, incorporating key data management concepts to quickly discover, share, and extract value from data across your organization. The subsequent tabs cover user access, data discovery, and automated data asset registration workflows tailored for sustainability use cases.
Step 1
Data is stored in various types of data stores, within and/or outside of AWS. These data stores contain data assets that represent physical data objects (such as a database table or a file) and house both the source and target datasets in the data fabric.
Step 2
Technical metadata is automatically imported into the data catalog for data assets that existed before the implementation of the data fabric.
Step 3
The data owners maintain business metadata for their data assets in the data catalog to enrich the data with business context, such as descriptions for dataset columns, tags, and domain- or enterprise-wide business glossary terms.
Step 4
The data consumers search the data catalog for data assets using technical and/or business metadata. The metadata pertaining to data quality and data lineage establishes trust in how data assets can be used.
Step 5
The data consumers request access to the relevant data assets from the data owner, who can either grant or deny the request.
Step 6
The data products perform extract, transform, and load (ETL), data profiling, and data quality operations to create new curated data assets that enable data-driven use cases for the data consumers.
Step 7
Data assets created by the data products are registered in the data catalog with the corresponding metadata. -
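Steps 4 and 5 above can be sketched with the AWS SDK for Python (boto3). The domain ID and search text below are placeholder assumptions, not values from this Guidance, and the actual call requires AWS credentials, so the sketch separates building the Amazon DataZone search request from sending it:

```python
import json

DOMAIN_ID = "dzd_example123"  # placeholder Amazon DataZone domain ID

def build_search_request(search_text: str, domain_id: str = DOMAIN_ID) -> dict:
    """Keyword-search request for the Amazon DataZone Search API; matches
    assets on their technical and business metadata."""
    return {
        "domainIdentifier": domain_id,
        "searchScope": "ASSET",   # search data assets (not glossary terms)
        "searchText": search_text,
        "maxResults": 10,
    }

def search_catalog(search_text: str) -> list:
    """Run the search; boto3 is imported lazily because the call needs credentials."""
    import boto3
    client = boto3.client("datazone")
    return client.search(**build_search_request(search_text)).get("items", [])

print(json.dumps(build_search_request("carbon emissions"), indent=2))
```

The consumer would then review the returned assets' quality and lineage metadata before requesting a subscription, as in Step 5.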
User access
-
This architecture diagram shows how to manage user access to the data catalog.
Step 1
AWS IAM Identity Center manages all users for both Amazon DataZone and the other APIs.
Step 2
Amazon API Gateway uses an Amazon Cognito authorizer. The corresponding user pool uses IAM Identity Center as its identity provider.
Step 3
Amazon DataZone integrates directly with IAM Identity Center for user management. -
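As a minimal sketch of how a client presents identity to the APIs in this flow: the caller obtains a JWT from the Amazon Cognito user pool and sends it in the `Authorization` header, which the API Gateway Cognito authorizer validates. The URL and token below are placeholders:

```python
import urllib.request

# Placeholder endpoint -- the real URL comes from your API Gateway deployment.
API_URL = "https://example.execute-api.us-east-1.amazonaws.com/prod/assets"

def build_authorized_request(url: str, token: str) -> urllib.request.Request:
    """Attach a Cognito-issued JWT for the API Gateway Cognito authorizer to validate."""
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )

# urllib.request.urlopen(build_authorized_request(API_URL, jwt)) would then
# call the API with the authenticated identity.
```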
Data discovery
-
This architecture diagram shows how to search, discover, and request access to data assets in the data catalog.
Step 1
Users explore the data catalog through the search functionality in Amazon DataZone. Assets can be searched for by their associated metadata.
Step 2
Data lineage for each asset is stored in an instance of OpenLineage Marquez. Marquez is deployed as an Amazon Elastic Container Service (Amazon ECS) container fronted by an Application Load Balancer. Users can view the data lineage of assets through Marquez.
Step 3
From the data catalog, the data consumer requests read-only access to a desired dataset from the data asset owner.
Step 4
Asset owners approve or deny subscription requests to individual assets that they have published to the catalog.
Step 5
Once an asset owner approves a user’s subscription request, the user can access the asset through Amazon Athena, for assets registered as AWS Glue tables, or through the Amazon Redshift Data API for Amazon Redshift tables. -
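Querying a subscribed AWS Glue table through Athena can be sketched with boto3 as follows; the database, table, and output-location names are illustrative placeholders:

```python
def build_athena_query(database: str, table: str, output_s3: str) -> dict:
    """Parameters for athena.start_query_execution against a subscribed
    AWS Glue table. All names are illustrative placeholders."""
    return {
        "QueryString": f'SELECT * FROM "{database}"."{table}" LIMIT 10',
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3},
    }

def run_query(params: dict) -> str:
    """Submit the query; boto3 is imported lazily because the call needs credentials."""
    import boto3
    return boto3.client("athena").start_query_execution(**params)["QueryExecutionId"]
```

Results land in the configured Amazon S3 output location, where the consumer can retrieve them once the query completes.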
Automated data asset registration
-
This architecture diagram shows how to manage data asset registration with profiling, transformation, quality assertion, and lineage tracking.
Step 1
Data is placed into Amazon Simple Storage Service (Amazon S3) or Amazon Redshift.
Step 2
A data owner or data product invokes an API Gateway API backed by AWS Lambda in the hub account. The API body includes information on the data location, transformation logic, profiling specifications, and data quality assertions required in later steps. The API writes an event to an Amazon EventBridge event bus, which replicates it to an event bus in the spoke account.
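A minimal sketch of this registration payload, forwarded as an EventBridge event: every field name in `detail`, plus the source and bus names, is a hypothetical placeholder for the contract that this Guidance's API actually defines.

```python
import json

def build_registration_event(s3_path: str) -> dict:
    """One EventBridge entry carrying a registration request. All field,
    source, and bus names are hypothetical placeholders."""
    detail = {
        "dataLocation": s3_path,
        "transformation": {"recipeName": "normalize-emissions"},
        "profiling": {"enabled": True},
        "dataQuality": {"rules": ['IsComplete "facility_id"']},  # DQDL-style assertion
    }
    return {
        "Source": "datafabric.registration",
        "DetailType": "AssetRegistrationRequested",
        "Detail": json.dumps(detail),
        "EventBusName": "data-fabric-hub",
    }

def publish(entry: dict) -> None:
    """Put the event on the hub bus; boto3 is imported lazily (needs credentials)."""
    import boto3
    boto3.client("events").put_events(Entries=[entry])
```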
Step 3
The event in the spoke account invokes an AWS Step Functions workflow. The workflow creates an AWS Glue connection to the Amazon Redshift or Amazon S3 data source.
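The connection creation in this step might look like the following boto3 sketch. The connection name, JDBC URL, and secret ARN are placeholders; referencing an AWS Secrets Manager secret keeps credentials out of the connection definition.

```python
def build_redshift_connection(name: str, jdbc_url: str, secret_arn: str) -> dict:
    """ConnectionInput for glue.create_connection; all values are placeholders."""
    return {
        "Name": name,
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": jdbc_url,
            "SECRET_ID": secret_arn,  # resolve credentials from Secrets Manager
        },
    }

def create_connection(connection_input: dict) -> None:
    """Create the AWS Glue connection; boto3 is imported lazily (needs credentials)."""
    import boto3
    boto3.client("glue").create_connection(ConnectionInput=connection_input)
```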
Step 4
AWS Glue DataBrew performs data transformations through a recipe job.
Step 5
An AWS Glue crawler infers the schema of the resulting dataset and creates an AWS Glue table.
Step 6
An AWS Glue DataBrew profile job computes profile statistics for the table.
Step 7
AWS Glue evaluates the data quality with user-defined assertions.
Step 8
The resulting data lineage is summarized in the event and sent back to the hub account through EventBridge.
Step 9
The EventBridge event bus in the hub account invokes another Step Functions workflow.
Step 10
The new asset is imported into Amazon DataZone by creating and running a data source.
Step 11
The lineage for the asset is published to EventBridge, which invokes an Amazon ECS deployment to register the lineage in a deployment of OpenLineage Marquez.
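The lineage registration in Step 11 can be sketched as an OpenLineage RunEvent posted to Marquez's `/api/v1/lineage` endpoint. The namespace, job, dataset, and producer values below are placeholders:

```python
import json
import uuid
from datetime import datetime, timezone

def build_lineage_event(job_name: str, output_dataset: str) -> dict:
    """A minimal OpenLineage RunEvent; namespace and producer are placeholders."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},  # OpenLineage requires a UUID run ID
        "job": {"namespace": "data-fabric", "name": job_name},
        "inputs": [],
        "outputs": [{"namespace": "data-fabric", "name": output_dataset}],
        "producer": "https://example.com/data-fabric-registration",
    }

def post_to_marquez(event: dict, marquez_url: str) -> None:
    """POST the event to Marquez; add retries and error handling in practice."""
    import urllib.request
    req = urllib.request.Request(
        f"{marquez_url}/api/v1/lineage",
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)
```

Marquez then renders this run and its output dataset in the lineage graph that users browse in the data discovery workflow.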
Get Started
Try out this Guidance
Deploy this Guidance
Well-Architected Pillars
The AWS Well-Architected Framework helps you understand the pros and cons of the decisions you make when building systems in the cloud. The six pillars of the Framework allow you to learn architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable systems. Using the AWS Well-Architected Tool, available at no charge in the AWS Management Console, you can review your workloads against these best practices by answering a set of questions for each pillar.
The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.
-
Operational Excellence
Amazon CloudWatch provides centralized monitoring and observability, which tracks operational metrics and logs across services. This integrated visibility into your workload health and performance helps you identify issues and troubleshoot problems, allowing you to continuously improve processes and procedures for efficient operations.
-
Security
Cognito, AWS Identity and Access Management (IAM), and IAM Identity Center help you implement secure authentication and authorization mechanisms. Cognito provides user authentication and authorization for the application APIs, while IAM policies and roles control access to resources based on the principle of least privilege. IAM Identity Center simplifies managing user identities across the components of this Guidance, enabling centralized identity management.
-
Reliability
An Application Load Balancer, Lambda, EventBridge, and Amazon S3 work in tandem so that your workloads perform their intended functions correctly and consistently. For example, the Application Load Balancer distributes traffic to the application containers, providing high availability. EventBridge replicates events across accounts for reliable event delivery, while the automatic scaling of Lambda handles varying workloads without disruption. And as the root data source, Amazon S3 provides highly durable and available storage.
-
Performance Efficiency
The services selected for this Guidance help you both monitor performance and maintain efficient workloads. Specifically, Athena and the Amazon Redshift Data API provide efficient querying of data assets. AWS Glue DataBrew and crawlers automate data transformation and cataloging, improving overall efficiency. Amazon Redshift Serverless scales compute resources elastically, allowing high-performance data processing without over-provisioning resources. Lastly, Amazon S3 offers high data throughput for efficient querying.
-
Cost Optimization
To optimize costs, this Guidance uses serverless services that automatically scale based on demand, ensuring that you only pay for the resources you use. For example, EventBridge eliminates the need for polling-based architectures, reducing compute costs, and Amazon Redshift Serverless automatically scales compute based on demand, charging only for resources consumed during processing.
-
Sustainability
The serverless services of this Guidance work together to reduce the need for always-on infrastructure, lowering the overall environmental impact of the workload. For example, Amazon Redshift Serverless automatically scales to the required demand, provisioning only the necessary compute resources and minimizing idle resources and their associated energy usage.
Related Content
Streamline your ESG Reporting with AWS Sustainability Data Fabric and Accenture
Disclaimer
The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.
References to third-party services or organizations in this Guidance do not imply an endorsement, sponsorship, or affiliation between Amazon or AWS and the third party. Guidance from AWS is a technical starting point, and you can customize your integration with third-party services when you deploy the architecture.