This Guidance demonstrates how you can optimize a data architecture for sustainability on AWS that helps to maximize efficiency and reduce waste. Included are curated data services and best practices that help you identify the right solution for your workloads, so you can build a more efficient, end-to-end modern data architecture in the cloud. With a comprehensive set of data and analytics capabilities, this Guidance helps you design a data strategy that grows with your business.
Please note: [Disclaimer]
Architecture Diagram
Overview
These steps provide an overview of this architecture. For diagrams highlighting different aspects of this architecture, see the sections that follow.
Step 1
Organizations ingest data from streaming sources like sensors, devices, social media, or web applications, and in batches from database and file systems.
Step 2
Streaming event data from data streams is stored for longer retention. Data from databases and file systems is stored in a colder storage layer for transformation and consumption.
Step 3
A stream analytics system analyzes, filters, and transforms incoming data streams in real time. The batch data processing layer transforms raw data by cleaning, combining, and aggregating the data for analytical purposes.
Step 4
Streaming data is sent to downstream systems for consumers to query and visualize in real time. Batch data is modeled and served for business intelligence consumption.
Step 5
Data consumers use query and visualization tools to analyze the data.
-
Data Ingestion
This diagram shows a real-time and batch data ingestion pattern, and a database replication pattern with recommended AWS services that serve these capabilities.
-
Steps
-
Follow the steps in this diagram to deploy this Guidance.
Step 1
Use managed services such as Amazon Kinesis Data Streams for streaming data, such as clickstream data from web applications. Managed services distribute the environmental impact of the service across many users because of the multi-tenant control planes.
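As a minimal sketch (the stream name and event fields below are hypothetical, and boto3 credentials are assumed to be configured), a producer can write clickstream events to an existing Kinesis data stream like this:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

def send_click_event(event: dict) -> None:
    """Write a single clickstream event to a pre-created Kinesis data stream."""
    kinesis.put_record(
        StreamName="clickstream-events",         # hypothetical stream name
        Data=json.dumps(event).encode("utf-8"),  # payload must be bytes
        PartitionKey=event["session_id"],        # spreads records across shards
    )

send_click_event({"session_id": "abc-123", "page": "/home", "action": "view"})
```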
Step 2
Use managed services such as AWS IoT Core for consumer Internet of Things (IoT). It ensures that resources are used efficiently and distributes the environmental impact of the services across the user pool.
Step 3
Use managed services such as AWS IoT SiteWise for industrial IoT streaming data. It ensures resources are used optimally and distributes the environmental impact of the services across the user pool.
Step 4
Choose the right-sized AWS Database Migration Service (AWS DMS) instance type.
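For illustration only (the identifier and instance class below are placeholders, not a recommendation for your workload), a modest AWS DMS replication instance can be provisioned first and resized later only if replication lag or memory pressure shows more capacity is needed:

```python
import boto3

dms = boto3.client("dms")

# Start small; scale the instance class up only if the migration task needs it.
response = dms.create_replication_instance(
    ReplicationInstanceIdentifier="sustainable-dms-instance",  # placeholder name
    ReplicationInstanceClass="dms.t3.medium",                  # right-size to the workload
    AllocatedStorage=50,                                       # GB of local task storage
    MultiAZ=False,                                             # single-AZ is enough for dev/test migrations
)
print(response["ReplicationInstance"]["ReplicationInstanceArn"])
```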
Step 5
To reduce wasted resources and maximize utilization through automation, use AWS Transfer Family managed workflows.
It automates various processing steps such as copying, tagging, scanning, filtering, compressing or decompressing, and encrypting or decrypting the data that is transferred using AWS Transfer Family.
Step 6
Use managed services like AWS Glue for better resource utilization and leverage AWS Glue Flex jobs for non-urgent ingestions.
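As a hedged example (the job name, IAM role, and script location are placeholders), a non-urgent ingestion job can be created on the Flex execution class so it runs on spare capacity:

```python
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="nightly-ingest-flex",                          # placeholder job name
    Role="arn:aws:iam::111122223333:role/GlueJobRole",   # placeholder IAM role
    Command={
        "Name": "glueetl",                               # Spark ETL job
        "ScriptLocation": "s3://example-bucket/scripts/ingest.py",
        "PythonVersion": "3",
    },
    GlueVersion="4.0",
    WorkerType="G.1X",
    NumberOfWorkers=2,
    ExecutionClass="FLEX",  # run on spare capacity for non-urgent ingestion
)
```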
-
Additional considerations
-
Consider the following key components when deploying this Guidance.
Consideration A
Evaluate the service-level agreement (SLA) for data delivery and assess whether a continuous stream is necessary.
Consideration B
Evaluate whether a workload is memory-intensive or CPU-intensive, and size resources accordingly.
Consideration C
Ingest valuable data from the source systems by filtering out data items that are of no value to consumers and remove redundant ingestion pipelines. Adopt an event-driven serverless architecture for your data ingestion, so it only provisions resources when work needs to be done.
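A minimal sketch of the event-driven pattern, assuming an S3 event notification is configured to invoke a Lambda function only when new objects arrive (the bucket and key come from the event itself, and the filtering field is hypothetical):

```python
import json
import urllib.parse
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    """Invoked only when a new object lands in the bucket, so no resources
    run while there is nothing to ingest."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        payload = json.loads(s3.get_object(Bucket=bucket, Key=key)["Body"].read())
        # Filter out items with no downstream value before loading them further.
        if payload.get("is_test_event"):
            continue
        # ... transform and load the remaining items ...
    return {"status": "ok"}
```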
Consideration D
Review scheduled ingestions and remove unnecessary operations or reduce the operation’s frequency where possible.
Spread batch ingestion schedules over the day rather than running all at once overnight. This helps with reducing the provisioning of resources for peak usage when most batch jobs are scheduled overnight.
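For example (the rule names and cron expressions are illustrative), staggered EventBridge schedules spread batch ingestion across the day instead of placing every job in the same overnight window:

```python
import boto3

events = boto3.client("events")

# Stagger start times so batch jobs do not all demand capacity at once.
schedules = {
    "ingest-sales": "cron(0 2 * * ? *)",       # 02:00 UTC
    "ingest-inventory": "cron(0 8 * * ? *)",   # 08:00 UTC
    "ingest-marketing": "cron(0 14 * * ? *)",  # 14:00 UTC
}

for name, expression in schedules.items():
    events.put_rule(Name=name, ScheduleExpression=expression, State="ENABLED")
```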
-
Data Storage
This diagram shows the storage layer with frequently accessed data stores for operational use, and two popular storage patterns for analytics use – the data lake and the data warehouse.
-
Steps
-
Follow the steps in this diagram to deploy this Guidance.
Step 1
Use the managed database of AWS IoT SiteWise for storing data from industrial equipment at scale. Set a retention period for how long your data is stored in the hot tier before it's deleted. Move historical data to colder tiers in Amazon Simple Storage Service (Amazon S3).
Step 2
Use a fully managed, purpose-built time-series database to store time-series data. Amazon Timestream saves time and cost in managing the lifecycle of time-series data by keeping recent data in memory and moving historical data to a cost optimized storage tier based upon user-defined policies.
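A minimal sketch (the database and table names are placeholders) of defining those lifecycle policies when creating a Timestream table, so recent data stays in the memory store and older data moves to the cost-optimized magnetic store automatically:

```python
import boto3

timestream = boto3.client("timestream-write")

timestream.create_database(DatabaseName="iot_metrics")  # placeholder name

timestream.create_table(
    DatabaseName="iot_metrics",
    TableName="device_readings",
    RetentionProperties={
        "MemoryStoreRetentionPeriodInHours": 24,    # hot tier for recent queries
        "MagneticStoreRetentionPeriodInDays": 365,  # cost-optimized tier for history
    },
)
```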
Step 3
Consider using serverless databases for unpredictable or irregular workloads. For example, Amazon DynamoDB for non-relational data.
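For example (the table and attribute names are illustrative), creating a DynamoDB table in on-demand capacity mode means you pay per request and avoid provisioning throughput for an unpredictable workload:

```python
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.create_table(
    TableName="user_events",  # placeholder table name
    AttributeDefinitions=[{"AttributeName": "user_id", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "user_id", "KeyType": "HASH"}],
    BillingMode="PAY_PER_REQUEST",  # on-demand capacity for irregular traffic
)
```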
Step 4
Consider using serverless databases for unpredictable or irregular workloads. For example, Amazon Aurora for relational data.
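A hedged sketch (identifiers are placeholders, and the master password is delegated to the service-managed secret) of an Aurora Serverless v2 cluster whose capacity scales with demand:

```python
import boto3

rds = boto3.client("rds")

rds.create_db_cluster(
    DBClusterIdentifier="analytics-aurora",  # placeholder identifier
    Engine="aurora-postgresql",
    MasterUsername="dbadmin",
    ManageMasterUserPassword=True,           # let RDS manage the credential in Secrets Manager
    ServerlessV2ScalingConfiguration={
        "MinCapacity": 0.5,  # Aurora capacity units (ACUs) at idle
        "MaxCapacity": 8,    # cap scaling for irregular peaks
    },
)

# Serverless v2 instances use the special "db.serverless" instance class.
rds.create_db_instance(
    DBInstanceIdentifier="analytics-aurora-writer",
    DBClusterIdentifier="analytics-aurora",
    DBInstanceClass="db.serverless",
    Engine="aurora-postgresql",
)
```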
Step 5
Consider using object storage such as Amazon S3 to store large volumes of data. Move infrequently accessed data to colder tiers for energy efficiency.
Use the Amazon S3 Intelligent-Tiering storage class to optimize storage by automatically moving data to the most cost-effective access tier when access patterns change.
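As an illustration (the bucket name and prefix are placeholders), a lifecycle rule can move objects into the Intelligent-Tiering storage class so infrequently accessed data migrates to colder access tiers automatically:

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-data-lake-bucket",  # placeholder bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-raw-zone",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                # Move objects into Intelligent-Tiering shortly after ingestion.
                "Transitions": [{"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}],
            }
        ]
    },
)
```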
Step 6
Use Amazon S3 Glacier storage classes for archiving data that is queried infrequently.
Step 7
Use serverless services like Amazon Redshift Serverless for efficient resource utilization and reduced engineering overhead on unpredictable or irregular data warehouse workloads. Use features in Amazon Redshift that provide automation, like Automatic Table Optimization (ATO).
-
Additional considerations
-
Consider the following key components when deploying this Guidance.
Consideration A
For hot data stores where data is queried frequently, choose the right database service for the right purpose to improve the resource efficiency of your workloads.
Right-size your database infrastructure to avoid wasting resources, and use energy-efficient processors like Graviton where possible. Automate database backups so you don’t have to take manual snapshots of your databases.
Consideration B
When data is stored in large volumes in a data lake, reduce the storage footprint by compressing data and deleting unused data. Classify data to understand its significance to business outcomes and to determine when you can move data to colder storage layers or safely delete it.
Use efficient file formats suited to consumption patterns (like Parquet for certain analytics use cases) to improve utilization of downstream compute. Practice partitioning and bucketing design standards to reduce the amount of data scanned per query by downstream compute.
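A minimal PySpark sketch (paths and column names are hypothetical) that writes compressed Parquet partitioned by date, so downstream queries scan only the partitions they need:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("curate-events").getOrCreate()

events = spark.read.json("s3://example-bucket/raw/events/")  # placeholder input path

(
    events
    .repartition("event_date")                       # keep partition files reasonably sized
    .write
    .mode("overwrite")
    .partitionBy("event_date")                       # enables partition pruning at query time
    .option("compression", "snappy")                 # reduce storage footprint
    .parquet("s3://example-bucket/curated/events/")  # placeholder output path
)
```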
Consideration C
Follow data warehouse table design best practices like choosing suitable sort keys and distribution styles based on workloads. Use appropriate data types for the columns; for example, use date/time data types for date columns.
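For illustration (the workgroup, database, and table design below are placeholders), the Amazon Redshift Data API can apply a DDL statement that sets a distribution key and sort key suited to how the table is joined and filtered:

```python
import boto3

redshift_data = boto3.client("redshift-data")

ddl = """
CREATE TABLE IF NOT EXISTS sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    sale_date   DATE,          -- use date/time types for date columns
    amount      DECIMAL(12, 2)
)
DISTSTYLE KEY
DISTKEY (customer_id)          -- co-locate rows joined on customer_id
SORTKEY (sale_date);           -- prune blocks for date-range filters
"""

redshift_data.execute_statement(
    WorkgroupName="analytics-wg",  # placeholder Redshift Serverless workgroup
    Database="dev",
    Sql=ddl,
)
```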
-
Data Processing
This diagram shows the data processing layers with different AWS services that can be used to process data in real-time or batch processing mode. Use either managed services (option 1) or self-managed services (option 2), as shown in the following sections.
-
Managed Services
-
Follow the steps in this diagram to deploy this Guidance.
Step 1
Use managed services like Amazon Managed Service for Apache Flink to reduce overhead of managing infrastructure and risk of overprovisioning resources. Select the appropriate length for time-based windowing operations for streaming applications to reduce wastage of resources.
Step 2
Use serverless services like AWS Lambda for appropriate use cases. Design workloads to limit Lambda invocations to reduce resource usage. Run new and existing functions on Arm-based AWS Graviton2 processors.
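A hedged example (the function name, role, and deployment package are placeholders) of creating a function on the arm64 (Graviton2) architecture with right-sized memory:

```python
import boto3

lambda_client = boto3.client("lambda")

with open("function.zip", "rb") as package:  # placeholder deployment package
    lambda_client.create_function(
        FunctionName="stream-enrichment",    # placeholder function name
        Runtime="python3.12",
        Role="arn:aws:iam::111122223333:role/LambdaExecRole",  # placeholder role
        Handler="app.handler",
        Code={"ZipFile": package.read()},
        Architectures=["arm64"],             # Graviton2-based execution
        MemorySize=256,                      # right-size memory to the workload
        Timeout=30,
    )
```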
Step 3
Use AWS Glue for unpredictable batch data processing workloads on Apache Spark, Python shell, or Ray engines. Choose appropriate worker nodes for the job. Run non-critical jobs on AWS Glue Flex.
Step 4
Run petabyte-scale data processing jobs on open-source frameworks such as Apache Spark, Apache Hive, and Presto using Amazon EMR. For unpredictable data processing workloads, use Amazon EMR Serverless.
For Amazon EMR running on Amazon Elastic Compute Cloud (Amazon EC2) instances, use the right types of instances and design clusters to terminate on job completion for batch workloads.
Consider using EC2 Spot Instances for non-critical workloads. Leverage Amazon EMR managed scaling to automatically size cluster resources based on the workload for best resource utilization.
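A sketch under stated assumptions (the release label, instance types, roles, and script location are placeholders) of a transient EMR cluster that uses Spot capacity for task nodes, enables managed scaling, and terminates itself when the batch job finishes:

```python
import boto3

emr = boto3.client("emr")

emr.run_job_flow(
    Name="nightly-spark-batch",   # placeholder cluster name
    ReleaseLabel="emr-7.1.0",     # placeholder release
    Applications=[{"Name": "Spark"}],
    ServiceRole="EMR_DefaultRole",
    JobFlowRole="EMR_EC2_DefaultRole",
    Instances={
        "InstanceGroups": [
            {"Name": "primary", "InstanceRole": "MASTER", "InstanceType": "m6g.xlarge", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE", "InstanceType": "m6g.xlarge", "InstanceCount": 2},
            {"Name": "tasks", "InstanceRole": "TASK", "InstanceType": "m6g.xlarge",
             "InstanceCount": 2, "Market": "SPOT"},  # Spot for non-critical capacity
        ],
        "KeepJobFlowAliveWhenNoSteps": False,        # terminate when the job completes
    },
    ManagedScalingPolicy={
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": 3,
            "MaximumCapacityUnits": 10,
        }
    },
    Steps=[{
        "Name": "spark-etl",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://example-bucket/jobs/etl.py"],  # placeholder script
        },
    }],
)
```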
-
Managed Services - Additional considerations
-
Consider the following key components when deploying this Guidance.
Consideration A
Use predicate pushdown to reduce the amount of data moved between different layers during data processing. Implement an event-driven architecture to maximize overall resource utilization for asynchronous workloads.
-
Self-Managed
-
Follow the steps in this diagram to deploy this Guidance.
Step 1
Use Amazon EC2 instances to build your own analytics application that is best suited for your business requirements. Run petabyte-scale data processing jobs on open-source analytics frameworks such as Apache Airflow, Apache Hive, Apache Kafka, Apache Flink, Apache Spark, Presto, and Trino, among others.
Choose Graviton-based EC2 instances to reduce energy consumption: AWS Graviton-based EC2 instances use up to 60% less energy than comparable EC2 instances for the same performance.
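As an illustrative sketch (the AMI ID, subnet, and tags are placeholders), launching Graviton-based capacity simply means choosing an arm64 AMI and a Graviton instance family such as m7g:

```python
import boto3

ec2 = boto3.client("ec2")

ec2.run_instances(
    ImageId="ami-0123456789abcdef0",      # placeholder arm64 AMI
    InstanceType="m7g.xlarge",            # Graviton-based instance family
    MinCount=1,
    MaxCount=1,
    SubnetId="subnet-0123456789abcdef0",  # placeholder subnet
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "workload", "Value": "self-managed-analytics"}],
    }],
)
```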
Step 2
Consider using Amazon EC2 Auto Scaling to launch new EC2 instances seamlessly and automatically when demand increases, and terminate unneeded instances automatically to reduce wastage and save money when demand subsides.
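A hedged sketch (the group name, launch template, and subnets are placeholders) of an Auto Scaling group with a target tracking policy, so capacity follows average CPU utilization up and down:

```python
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="analytics-workers",  # placeholder name
    LaunchTemplate={"LaunchTemplateName": "analytics-node", "Version": "$Latest"},  # placeholder template
    MinSize=1,
    MaxSize=10,
    DesiredCapacity=2,
    VPCZoneIdentifier="subnet-0123456789abcdef0,subnet-0fedcba9876543210",  # placeholder subnets
)

# Track 60% average CPU: scale out under load, scale in (and stop paying for
# idle instances) when demand subsides.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="analytics-workers",
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {"PredefinedMetricType": "ASGAverageCPUUtilization"},
        "TargetValue": 60.0,
    },
)
```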
-
Data Consumers
This diagram shows the data query and visualization layer with different AWS services that help users query and visualize data.
-
Steps
-
Follow the steps in this diagram to deploy this Guidance.
Step 1
Use Amazon OpenSearch Service for real-time visualization. Amazon OpenSearch Serverless removes much of the complexity of managing OpenSearch clusters and capacity. It automatically sizes and tunes your clusters and takes care of shard and index lifecycle management, ensuring optimal utilization of resources. If you are using provisioned OpenSearch clusters, consider using instances with Graviton processors.
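A minimal sketch (the collection name is a placeholder, and the account still needs encryption, network, and data access policies configured) of creating a time-series collection in OpenSearch Serverless:

```python
import boto3

aoss = boto3.client("opensearchserverless")

# A TIMESERIES collection is optimized for log and metric analytics; capacity
# is managed by the service rather than provisioned as a cluster.
aoss.create_collection(
    name="realtime-metrics",  # placeholder collection name
    type="TIMESERIES",
    description="Serverless collection for real-time dashboards",
)
```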
Step 2
Leverage the rich, interactive data visualizations in Amazon Managed Grafana. It is a visualization and operational dashboarding service used to analyze, monitor, and alarm on metrics, logs, and traces across multiple data sources. The infrastructure is fully managed by AWS, which helps reduce the environmental impact of the service.
Step 3
Deliver insights to all your users through visualizations using the serverless, scalable business intelligence service, Amazon QuickSight. Use in-memory caching mechanisms like Amazon QuickSight SPICE to reduce contention with source data stores and improve query efficiency.
Use the built-in machine learning insights in QuickSight to eliminate the need to set up separate machine learning pipelines, and use natural language processing (NLP) features like Amazon Q in QuickSight to reduce the need to develop visuals manually.
Step 4
Implement data virtualization techniques to query data where it resides and avoid data movement. For example, use the federated querying feature in Amazon Athena when querying external data stores.
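For example (the data source, schema, table, workgroup, and output location below are placeholders), a federated query in Athena references the registered data source as the catalog, so the data is queried in place rather than copied:

```python
import boto3

athena = boto3.client("athena")

# "mysql_orders" stands in for a data source registered with an Athena
# federated query connector; the data stays in the source system.
query = """
SELECT order_id, status, updated_at
FROM "mysql_orders"."sales"."orders"
WHERE updated_at > current_date - interval '1' day
"""

athena.start_query_execution(
    QueryString=query,
    WorkGroup="primary",
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},  # placeholder bucket
)
```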
-
Additional considerations
-
Consider the following key components when deploying this Guidance.
Consideration A
Consider reviewing reports and dashboard usage at regular intervals. Remove unused, redundant reports from dashboards.
Consideration B
Optimize data consumption by reducing physical data movement to consuming systems. Use pre-compiled views to improve resource utilization while querying data. Identify and optimize long-running, resource-intensive queries. Use result-caching techniques to reduce I/O operations.
-
Well-Architected Pillars
The AWS Well-Architected Framework helps you understand the pros and cons of the decisions you make when building systems in the cloud. The six pillars of the Framework allow you to learn architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable systems. Using the AWS Well-Architected Tool, available at no charge in the AWS Management Console, you can review your workloads against these best practices by answering a set of questions for each pillar.
The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.
-
Operational Excellence
To swiftly respond to incidents and events, customize Amazon CloudWatch metrics, alarms, and dashboards. This service allows you to monitor the operational health of the Guidance and notify operators of faults.
-
Security
Resources deployed by this Guidance are protected by AWS Identity and Access Management (IAM) policies and principles. For example, authentication to services like Aurora, Timestream, AWS IoT SiteWise, Amazon S3, and Amazon Redshift is managed by IAM. With IAM identity-based policies, administrators can set what actions users can perform, on which resources, and under what conditions.
-
Reliability
Amazon S3, Aurora, DynamoDB, and Amazon Redshift are built for data storage, backup, and recovery. We recommend using AWS Backup to back up Timestream tables. AWS IoT SiteWise uses the highly available and durable Amazon S3 for backups.
-
Performance Efficiency
This Guidance uses purpose-built services for each layer of its data architecture. For storage, it selects services based on access patterns (transactional, analytical), and frequency of access (hot, cold, archival). For data ingestion, it selects services based on data velocity (data streaming services, batch data ingestions). And for data processing, it selects services based on consumption patterns (real-time, batch). For query and visualization, it selects services based on personas (business insights consumers, data analysts, data engineers, and data scientists).
You can use proxy metrics, which are the metrics that best quantify the effect of the changes you make to the associated resources. Examples of proxy metrics include CPU utilization, memory utilization, and storage utilization, which you can use to measure and optimize this Guidance based on the changes you make.
-
Cost Optimization
This Guidance uses serverless services that reduce compute costs on data ingestion and data processing by provisioning the appropriate resources and releasing them when processes are not running. For storage, this Guidance recommends using serverless services such as Aurora for hot data storage, as well as cost-effective and scalable services, like Amazon S3, for colder layers.
-
Sustainability
This Guidance uses technologies based on data access and storage patterns. For frequently accessed data, it guides you to use hot storage layers supported by Aurora, Timestream, DynamoDB, and AWS IoT SiteWise. For lower-frequency or batch consumption, it guides you to use services for colder storage layers, like Amazon S3. For specialized access patterns, like aggregations on normalized tables, it uses Amazon Redshift.
This Guidance recommends that you select serverless services to reduce the chances of overprovisioning your resources. In addition, Lambda functions powered by Graviton2 are designed to deliver up to 19 percent better performance at 20 percent lower cost, with the additional benefit of improved environmental sustainability from that increased performance. We also recommend that you review the delivery SLA to choose patterns that reduce the consumption of resources when they are not needed; for example, move from a real-time streaming pattern to a batch ingestion pattern when real-time consumption is not required. Finally, this Guidance helps you implement automation to terminate resources when they are not in use.
Implementation Resources
A detailed guide is provided for you to experiment with and use within your AWS account. It covers each stage of the Guidance, including deployment, usage, and cleanup, to prepare it for deployment.
The sample code is a starting point. It is industry validated, prescriptive but not definitive, and a peek under the hood to help you begin.
Related Content
Optimize Data Pattern using Amazon Redshift Data Sharing
Disclaimer
The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.
References to third-party services or organizations in this Guidance do not imply an endorsement, sponsorship, or affiliation between Amazon or AWS and the third party. Guidance from AWS is a technical starting point, and you can customize your integration with third-party services when you deploy the architecture.