This Guidance, with the sample code, can be used to deploy a carbon data lake to the AWS Cloud using the AWS Cloud Development Kit (AWS CDK). It provides customers and partners with foundational infrastructure that can be extended to support use cases including monitoring, tracking, reporting, and impact verification of greenhouse gas emissions. The carbon data lake Guidance sample code deploys a data lake and processing pipeline that assists with data ingestion, aggregation, automated processing, and CO2-equivalent calculation based on ingested greenhouse gas emissions data.
Please note: This Guidance is not, by itself, an end-to-end carbon accounting solution. It provides the foundational infrastructure into which additional complementary solutions can be integrated.
Customer emissions data from various sources is mapped to a standard CSV upload template. The CSV is uploaded, either directly to the Amazon Simple Storage Service (Amazon S3) landing bucket, or through the user interface.
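Before data enters the pipeline, an upload can be checked against the template schema. The sketch below illustrates that kind of validation with Python's standard `csv` module; the column names (`activity_event_id`, `activity`, `raw_data`, `units`, `scope`) are hypothetical placeholders, not the actual template shipped with the Guidance.

```python
import csv
import io

# Hypothetical upload-template columns; the Guidance's actual CSV template
# defines its own schema.
REQUIRED_COLUMNS = {"activity_event_id", "activity", "raw_data", "units", "scope"}


def validate_emissions_csv(text):
    """Return a list of validation errors for an emissions CSV upload."""
    reader = csv.DictReader(io.StringIO(text))
    errors = []
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        errors.append("missing columns: %s" % sorted(missing))
        return errors
    for line_no, row in enumerate(reader, start=2):  # header is line 1
        try:
            float(row["raw_data"])
        except ValueError:
            errors.append("line %d: raw_data is not numeric" % line_no)
    return errors
```

Running a check like this before (or immediately after) upload gives faster feedback than waiting for the pipeline's quality gate to reject a malformed file.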
The Amazon S3 landing bucket provides a single landing zone for all ingested emissions data. Data arriving in the landing bucket triggers the data pipeline.
An AWS Step Functions workflow orchestrates the data pipeline, including data quality checks, data compaction, transformation, standardization, and enrichment with an emissions-calculator AWS Lambda function.
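The stage ordering above can be sketched as an Amazon States Language (ASL) skeleton. The state names and resource ARNs below are illustrative placeholders, not the definition deployed by the Guidance's CDK code.

```python
import json

# Illustrative ASL skeleton mirroring the pipeline stages: quality check,
# then transform/standardize, then emissions calculation. All names are
# placeholders.
PIPELINE_DEFINITION = {
    "StartAt": "DataQualityCheck",
    "States": {
        "DataQualityCheck": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Next": "TransformAndStandardize",
        },
        "TransformAndStandardize": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Next": "CalculateEmissions",
        },
        "CalculateEmissions": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "End": True,
        },
    },
}

definition_json = json.dumps(PIPELINE_DEFINITION, indent=2)
```

Expressing the workflow declaratively is what lets the Step Functions console render the execution graph used for failure diagnosis later in this Guidance.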
AWS Glue DataBrew provides data quality auditing and an alerting workflow, and Lambda functions provide integration with Amazon Simple Notification Service (Amazon SNS) and the AWS Amplify web application.
Lambda functions provide data lineage processing, queued by Amazon Simple Queue Service (Amazon SQS). Amazon DynamoDB provides NoSQL pointer storage for the data ledger, and a Lambda function provides data lineage audit functionality, tracing all data transformations for a given record.
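The pointer-style ledger described above can be illustrated with a minimal in-memory sketch: each transformation appends a record pointing at its parent, and an audit walks the chain back to the original ingest. Field names, the checksum scheme, and the dict-as-ledger stand-in for DynamoDB are all assumptions for illustration.

```python
import hashlib
import json

def record_step(ledger, record_id, parent_id, action, payload):
    """Append one lineage record; parent_id is None for the original ingest."""
    ledger[record_id] = {
        "parent_id": parent_id,
        "action": action,
        # A content hash lets an auditor verify the payload was not altered.
        "checksum": hashlib.sha256(
            json.dumps(payload, sort_keys=True).encode()
        ).hexdigest(),
    }

def trace(ledger, record_id):
    """Return the chain of actions from this record back to ingestion."""
    chain = []
    node = record_id
    while node is not None:
        chain.append(ledger[node]["action"])
        node = ledger[node]["parent_id"]
    return chain
```

In the deployed Guidance, the equivalent writes are queued through Amazon SQS and stored in DynamoDB rather than a local dict, but the parent-pointer traversal idea is the same.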
A Lambda function outputs scope 1, 2, and 3 emissions data using a DynamoDB Greenhouse Gas (GHG) Protocol emissions factor lookup table.
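The core calculation is a factor lookup multiplied by the activity quantity. The sketch below shows the shape of that step; the factor values, key structure, and activity names are illustrative only, not the GHG Protocol table the Guidance loads into DynamoDB.

```python
# Illustrative emissions-factor table: (activity, unit) -> kg CO2e per unit.
# Values here are placeholders, not authoritative GHG Protocol factors.
EMISSIONS_FACTORS = {
    ("diesel", "gal"): 10.21,
    ("grid_electricity", "kWh"): 0.42,
}


def calculate_co2e(activity, unit, quantity):
    """Return kg CO2e for a quantity of activity data via factor lookup."""
    factor = EMISSIONS_FACTORS[(activity, unit)]
    return round(quantity * factor, 4)
```

In the deployed pipeline this lookup is a DynamoDB `GetItem` inside the calculator Lambda function rather than an in-memory dict, but the arithmetic is the same.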
The Amazon S3 enriched bucket provides object storage for analytics workloads, and the DynamoDB calculated-emissions table provides storage for the GraphQL API (a query language for APIs).
Customers can deploy a prebuilt Amazon SageMaker notebook (the artificial intelligence and machine learning stack) and a prebuilt Amazon QuickSight dashboard (the business intelligence stack). Both deployments come with prebuilt Amazon Athena queries for the data stored in Amazon S3, and each reads from the enriched Amazon S3 bucket.
Customers can deploy a web application stack that uses AWS AppSync as a GraphQL API backend to integrate with web applications and other data consumer applications. Amplify provides a serverless, preconfigured management application that includes basic data browsing, data visualization, data upload, and application configuration.
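A data consumer would interact with the AppSync backend through GraphQL queries. The query below is purely illustrative; the operation, field, and type names are placeholders, not the schema actually deployed by the Guidance.

```python
# Hypothetical GraphQL query a data consumer might send to the AppSync API.
# Operation and field names are placeholders, not the Guidance's real schema.
CALCULATED_EMISSIONS_QUERY = """
query ListCalculatedEmissions($limit: Int) {
  listCalculatedEmissions(limit: $limit) {
    items {
      activity_event_id
      co2e_kg
      scope
    }
  }
}
"""
```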
The AWS Well-Architected Framework helps you understand the pros and cons of the decisions you make when building systems in the cloud. The six pillars of the Framework allow you to learn architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable systems. Using the AWS Well-Architected Tool, available at no charge in the AWS Management Console, you can review your workloads against these best practices by answering a set of questions for each pillar.
The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.
Changes to the Guidance can be requested and tracked through GitHub Issues. All new deployments are verified through unit, security, infrastructure, and deployment testing.
Pipeline health is visible in the Step Functions workflow graph: if data fails to process, the graph shows the exact step at which processing stopped. Amazon SNS notifications are also sent whenever a pipeline fails for any reason. Together, the Step Functions graph and Amazon SNS notifications let users isolate the component that caused the problem and evaluate the data that was submitted to the pipeline.
This Guidance applies a zero-trust model for authentication and authorization. All users of the web application are authenticated using Amazon Cognito user pools. All additional resources are granted least-privilege access, and all access patterns are evaluated with the cdk-nag utility, which checks AWS CDK applications against security best practices.
All data is encrypted at rest and in transit using AWS Key Management Service (AWS KMS) keys across Amazon S3, Lambda, AWS Glue DataBrew, and DynamoDB.
The services in this Guidance are highly available by default because they are serverless, AWS-managed services. With the provided sample code, all Amazon S3 bucket access is logged by default. Managed services such as Lambda and Step Functions emit Amazon CloudWatch metrics, and alarms can be configured to notify users when thresholds are breached.
All deployment and configuration changes are managed using AWS CDK, reducing the possibility of human error.
The README file contains specific directions to extend, modify, or add to the Guidance. AWS customers or partners can extend the Guidance by adding ingestion APIs, building custom emissions factor libraries, performing custom calculations, and creating custom visualizations, forecasting, or AI/ML tools.
To decrease latency and improve performance, this Guidance is designed for deployment in any major AWS Region using AWS CDK regional context.
Cost Optimization

The services in this Guidance are managed by AWS and serverless. They were selected to meet demand with only the minimum resources required. We evaluated and tested them with simulated synthetic data sources, selecting services that optimize performance while reducing cost and carbon footprint.
By using an on-demand serverless architecture orchestrated by Step Functions, this Guidance continually scales to match the load with only the minimum resources required. All processed data is compressed, and each layer of the architecture deploys to a single Region by default.
The sample code is a starting point. It is industry validated, prescriptive but not definitive, and a peek under the hood to help you begin.
The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.