Guidance for Deep Application Observability on AWS

This Guidance demonstrates observability in applications to get deeper insights from application stacks and infrastructure metrics. To improve resiliency across two AWS Regions, it is essential to monitor application and infrastructure components across the entire stack.

Architecture Diagram

Download the architecture diagram PDF

Guidance for Deep Application Observability on AWS

Step 1
All appropriate business and infrastructure metrics are collected using Amazon CloudWatch.

Step 2
Application-level logic, like in AWS Lambda, can be collected using specific monitoring services, such as AWS X-Ray.

Step 3
Amazon Route 53 health checks are used to monitor appropriate endpoints.

Step 4
CloudWatch metrics, logs, and alarms are displayed on Amazon CloudWatch dashboards.

Step 5
CloudWatch metrics, logs, and alarms are displayed on CloudWatch dashboards across Regions. CloudWatch instances are replicated across regions.

Step 6
AWS Systems Manager Automation runbooks are initiated on service degradation, grey failures, and absolute failures. They can be used to run tasks and notify the site reliability engineering (SRE) team.

Step 7
Upon receiving the notification, the SRE team can signal the Route 53 Application Recovery Controller to point to the secondary cluster and follow relevant failover procedures.

Well-Architected Pillars

The AWS Well-Architected Framework helps you understand the pros and cons of the decisions you make when building systems in the cloud. The six pillars of the Framework allow you to learn architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable systems. Using the AWS Well-Architected Tool, available at no charge in the AWS Management Console, you can review your workloads against these best practices by answering a set of questions for each pillar.

The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.

Operational Excellence

Deep application observability (DAO) ensures that application observability is carried at every layer of your workload: infrastructure, application, and business metrics. What you monitor depends on your organizational KPIs and SLAs. It helps customers prepare for potential service degradations and/or region-level failures and operate with efficiency and automation where applicable. As customers get more familiar with key metrics related to their application, they can evolve further by potentially incorporating automated systems with other existing SOPs to handle a full regional failure as needed (often as an audit/compliance requirement).

Read the Operational Excellence whitepaper
Security

All logs are encrypted at rest using AWS Key Management Service (AWS KMS). Access to the dashboards and any automated tasks running as a result of alarms will practice the principle of least privilege and only have the appropriate policies attached to their roles. Moreover, changing alarm thresholds, automated tasks, and other actions should be done by the appropriate personnel only. Changes should go through a change review process to ensure that business SLAs are always respected, and infrastructure metrics are leveraged to ensure business goals are met.

Read the Security whitepaper
Reliability

DAO guidance aligns with the Reliability pillar by advocating for automatic recovery from failure using proactive observability. If a regional failover is required, it can be initiated manually or automatically. DAO also emphasizes the need to monitor business SLAs to ensure infrastructure capacity is optimized and if those SLAs are not met, appropriate alarms are tripped. The guidance further encourages regional failover to be tested regularly to ensure all failure pathways are discovered and thus reducing business risk.

Read the Reliability whitepaper
Performance Efficiency

DAO encourages mechanical sympathy by recommending customers to monitor application workloads using the right tool, such as X-Ray for Lambda. DAO provides guidance on leveraging advanced technologies, such as CloudWatch Synthetics and canary testing, to ensure workload performance is measured through multiple dimensions.

Read the Performance Efficiency whitepaper
Cost Optimization

DAO guidance leverages CloudWatch metrics, alarms, and logs coupled with application-level tracing like X-Ray. Most of the guidance implementation will remain with the AWS Free Tier boundaries of CloudWatch and X-Ray, although as customer requirements vary, the cost aspect will need to be considered. For example, older CloudWatch logs can be pushed to Amazon Simple Storage Service (Amazon S3) to reduce costs further.

Read the Cost Optimization whitepaper
Sustainability

The DAO guidance recommends that you monitor all layers of your workload to ensure that business SLAs are continuously met, and that you conduct a regional failover when degradation or failure occurs. DAO can also be used to ensure efficient use of resources and reduce over provisioning of infrastructure to ensure a sustainable long-term working environment. Moreover, because the secondary environment is in a passive state, we recommend the resources to be scaled down until they are needed in case of a regional failover.

Read the Sustainability whitepaper

Implementation Resources

The sample code is a starting point. It is industry validated, prescriptive but not definitive, and a peek under the hood to help you begin.

Open sample code on GitHub

Architecture Diagram

Well-Architected Pillars

Implementation Resources

Related Content

[Title]

Disclaimer

Was this page helpful?

Guidance for Deep Application Observability on AWS

Architecture Diagram

Well-Architected Pillars

Implementation Resources

Related Content

[Title]

Disclaimer

Was this page helpful?

Ending Support for Internet Explorer