This Guidance demonstrates observability in applications to get deeper insights from application stacks and infrastructure metrics. To improve resiliency across two AWS Regions, it is essential to monitor application and infrastructure components across the entire stack.
CloudWatch metrics, logs, and alarms are displayed on Amazon CloudWatch dashboards.
CloudWatch metrics, logs, and alarms are displayed on CloudWatch dashboards across Regions. CloudWatch instances are replicated across Regions.
AWS Systems Manager Automation runbooks are initiated on service degradation, grey failures, and absolute failures. They can be used to run tasks and notify the site reliability engineering (SRE) team.
Upon receiving the notification, the SRE team can signal the Route 53 Application Recovery Controller to point to the secondary cluster and follow relevant failover procedures.
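The flow described above, from observed metrics to a runbook and an SRE notification, can be sketched as a small decision function. This is a minimal illustration only: the `Snapshot` fields, the thresholds, and the action names are hypothetical and are not part of this Guidance. The failover itself stays with the SRE team, as described above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Health(Enum):
    HEALTHY = "healthy"
    DEGRADED = "service degradation"
    GREY_FAILURE = "grey failure"
    ABSOLUTE_FAILURE = "absolute failure"


@dataclass
class Snapshot:
    """Hypothetical per-Region summary built from dashboard metrics."""
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float   # server-side p99 latency in milliseconds
    canary_success: float   # fraction of synthetic checks that passed


def classify(s: Snapshot) -> Health:
    # All thresholds below are assumptions chosen for illustration.
    if s.error_rate >= 0.5 or s.canary_success <= 0.1:
        return Health.ABSOLUTE_FAILURE
    # Grey failure: server-side metrics look fine, but the
    # customer-facing view (synthetic checks) disagrees.
    if s.error_rate < 0.01 and s.canary_success < 0.9:
        return Health.GREY_FAILURE
    if s.error_rate >= 0.01 or s.p99_latency_ms > 1000:
        return Health.DEGRADED
    return Health.HEALTHY


def plan_actions(health: Health) -> List[str]:
    """Every unhealthy state starts a runbook and notifies the SRE team."""
    if health is Health.HEALTHY:
        return []
    return ["start_automation_runbook", "notify_sre_team"]


if __name__ == "__main__":
    state = classify(Snapshot(error_rate=0.002, p99_latency_ms=150, canary_success=0.6))
    print(state.value, plan_actions(state))
```

Keeping the classification separate from the actions makes it easier to review threshold changes on their own.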
The AWS Well-Architected Framework helps you understand the pros and cons of the decisions you make when building systems in the cloud. The six pillars of the Framework allow you to learn architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable systems. Using the AWS Well-Architected Tool, available at no charge in the AWS Management Console, you can review your workloads against these best practices by answering a set of questions for each pillar.
The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.
Deep application observability (DAO) ensures that observability is carried out at every layer of your workload: infrastructure, application, and business metrics. What you monitor depends on your organizational KPIs and SLAs. DAO helps customers prepare for potential service degradations and Region-level failures, and operate with efficiency and automation where applicable. As customers become more familiar with the key metrics for their application, they can evolve further by incorporating automated systems with existing SOPs to handle a full Regional failure as needed (often an audit or compliance requirement).
All logs are encrypted at rest using AWS Key Management Service (AWS KMS). Access to the dashboards and any automated tasks that run as a result of alarms should follow the principle of least privilege, with only the appropriate policies attached to their roles. Moreover, alarm thresholds, automated tasks, and other actions should be changed only by the appropriate personnel. Changes should go through a change review process to ensure that business SLAs are always respected and that infrastructure metrics are leveraged to meet business goals.
DAO guidance aligns with the Reliability pillar by advocating for automatic recovery from failure using proactive observability. If a Regional failover is required, it can be initiated manually or automatically. DAO also emphasizes the need to monitor business SLAs so that infrastructure capacity is optimized and, if those SLAs are not met, the appropriate alarms are tripped. The guidance further encourages testing Regional failover regularly so that all failure pathways are discovered, thereby reducing business risk.
DAO encourages mechanical sympathy by recommending that customers monitor application workloads using the right tool, such as AWS X-Ray for AWS Lambda. DAO also provides guidance on leveraging advanced technologies, such as CloudWatch Synthetics and canary testing, so that workload performance is measured across multiple dimensions.
DAO guidance leverages CloudWatch metrics, alarms, and logs coupled with application-level tracing such as X-Ray. Most of the guidance implementation will remain within the AWS Free Tier boundaries of CloudWatch and X-Ray, although, because customer requirements vary, cost will need to be considered. For example, older CloudWatch logs can be pushed to Amazon Simple Storage Service (Amazon S3) to reduce costs further.
The DAO guidance recommends that you monitor all layers of your workload to ensure that business SLAs are continuously met, and that you conduct a Regional failover when degradation or failure occurs. DAO can also be used to ensure efficient use of resources and to reduce overprovisioning of infrastructure, supporting a sustainable long-term working environment. Moreover, because the secondary environment is in a passive state, we recommend scaling down its resources until they are needed for a Regional failover.
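To illustrate how an alarm on a business SLA might trip, the following sketch evaluates an availability metric using an "M out of N datapoints" rule. The parameter names mirror the `EvaluationPeriods` and `DatapointsToAlarm` settings of CloudWatch alarms, but the function is a simplified stand-in and not the CloudWatch evaluation logic (in particular, it ignores missing-data handling). The 99.9 percent target and the sample values are assumptions.

```python
from typing import Sequence


def alarm_state(
    datapoints: Sequence[float],
    threshold: float,
    evaluation_periods: int,
    datapoints_to_alarm: int,
) -> str:
    """Return "ALARM", "OK", or "INSUFFICIENT_DATA".

    A datapoint breaches when it is strictly below the threshold,
    which suits "higher is better" metrics such as availability.
    """
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    # Only the most recent evaluation_periods datapoints are considered.
    window = datapoints[-evaluation_periods:]
    breaching = sum(1 for value in window if value < threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


if __name__ == "__main__":
    # Hypothetical availability percentages, one per period, oldest first.
    availability = [99.95, 99.97, 99.80, 99.70, 99.99]
    print(alarm_state(availability, threshold=99.9,
                      evaluation_periods=5, datapoints_to_alarm=2))
```

Requiring more than one breaching datapoint before alarming is a common way to avoid failing over on a single noisy period; the right values depend on your SLAs.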
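As one possible way to move older logs to Amazon S3, the sketch below builds the request parameters for a CloudWatch Logs export task covering a single day that is at least a given number of days old. The keys match the parameters of the CloudWatch Logs `CreateExportTask` operation (for example, boto3's `logs.create_export_task`), with times expressed in milliseconds since the Unix epoch. The log group name, bucket name, 30-day cutoff, and prefix layout are hypothetical. The function only builds the dictionary and makes no AWS calls.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict


def build_export_task_params(
    log_group: str,
    bucket: str,
    now: datetime,
    older_than_days: int = 30,
) -> Dict[str, Any]:
    """Build parameters to export the one-day window ending older_than_days ago."""
    end = now - timedelta(days=older_than_days)
    start = end - timedelta(days=1)
    return {
        "taskName": f"export-{start:%Y-%m-%d}",
        "logGroupName": log_group,
        # CloudWatch Logs expects epoch milliseconds for both bounds.
        "fromTime": int(start.timestamp() * 1000),
        "to": int(end.timestamp() * 1000),
        "destination": bucket,
        # Assumed layout: <log group path>/<YYYY>/<MM>/<DD>
        "destinationPrefix": f"{log_group.strip('/')}/{start:%Y/%m/%d}",
    }


if __name__ == "__main__":
    params = build_export_task_params(
        log_group="/app/orders",           # hypothetical log group
        bucket="example-log-archive",      # hypothetical bucket
        now=datetime.now(timezone.utc),
    )
    print(params)
    # With boto3 installed and credentials configured, the dictionary
    # could be passed as: boto3.client("logs").create_export_task(**params)
```

The destination bucket needs a policy that allows CloudWatch Logs to write to it, and a retention policy on the log group would still be needed to expire the exported data from CloudWatch.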
The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.