Guidance for Deep Application Observability on AWS

Overview

This Guidance demonstrates how to build observability into applications to gain deeper insight from application-stack and infrastructure metrics. Improving resiliency across two AWS Regions requires monitoring application and infrastructure components across the entire stack.

How it works

Improving resiliency across two AWS Regions requires deep monitoring of application and infrastructure components across the entire stack in Amazon Web Services (AWS). Absolute failures, grey failures, and service degradation need to be observed in both Regions and coupled with automated alerting and response.
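To make the distinction between absolute failures, grey failures, and healthy operation concrete, the sketch below classifies one observation window per Region. It is an illustration only, not part of the Guidance's sample code: the RegionSample type, the threshold constants, and the rule that zero traffic counts as a failure are all assumptions to be replaced with values derived from your own SLAs.

```python
from dataclasses import dataclass
from enum import Enum


class Health(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"  # grey failure or partial service degradation
    FAILED = "failed"      # absolute failure


@dataclass
class RegionSample:
    """One observation window for a Region (hypothetical shape)."""
    region: str
    requests: int          # requests observed in the window
    errors: int            # failed requests in the window
    p99_latency_ms: float  # 99th percentile latency in the window


# Hypothetical thresholds; tune them to your own KPIs and SLAs.
ERROR_RATE_DEGRADED = 0.01
ERROR_RATE_FAILED = 0.50
LATENCY_DEGRADED_MS = 1000.0


def classify(sample: RegionSample) -> Health:
    """Map a single sample to a coarse health state."""
    if sample.requests == 0:
        # Assumption: no traffic at all is treated as an absolute failure.
        return Health.FAILED
    error_rate = sample.errors / sample.requests
    if error_rate >= ERROR_RATE_FAILED:
        return Health.FAILED
    if error_rate >= ERROR_RATE_DEGRADED or sample.p99_latency_ms >= LATENCY_DEGRADED_MS:
        return Health.DEGRADED
    return Health.HEALTHY


def classify_regions(samples):
    """Return a {region: Health} mapping for both Regions."""
    return {s.region: classify(s) for s in samples}
```

A Region that still answers requests but breaches the latency or error-rate threshold is reported as DEGRADED rather than FAILED, which is the case that simple up/down health checks tend to miss.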

Well-Architected Pillars

The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.

Deep application observability (DAO) ensures that observability is carried out at every layer of your workload: infrastructure, application, and business metrics. What you monitor depends on your organizational KPIs and SLAs. DAO helps customers prepare for potential service degradations and Region-level failures, and operate with efficiency and automation where applicable. As customers become more familiar with the key metrics for their application, they can evolve further, for example by integrating automated systems with existing SOPs to handle a full Regional failure (often an audit or compliance requirement).

Read the Operational Excellence whitepaper

All logs are encrypted at rest using AWS Key Management Service (AWS KMS). Access to the dashboards, and any automated tasks that run in response to alarms, follow the principle of least privilege, with only the necessary policies attached to their roles. Changes to alarm thresholds, automated tasks, and other actions should be made only by the appropriate personnel and should go through a change review process, so that business SLAs are always respected and infrastructure metrics are used to confirm that business goals are met.
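As an illustration of least privilege for dashboard access, the sketch below builds a read-only IAM policy document as a Python dictionary and checks it for wildcard actions. The function names and the Sid are hypothetical, and the action list is an assumption about what a dashboard viewer needs; adjust it to your own roles. Write actions such as cloudwatch:PutMetricAlarm are deliberately left out, so changing alarm thresholds stays with a separate, reviewed role.

```python
import json


def dashboard_read_only_policy():
    """Return a read-only policy document for dashboard viewers (sketch)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadDashboardsAndMetrics",  # hypothetical statement ID
                "Effect": "Allow",
                "Action": [
                    "cloudwatch:GetDashboard",
                    "cloudwatch:ListDashboards",
                    "cloudwatch:GetMetricData",
                    "cloudwatch:DescribeAlarms",
                ],
                # Some of these read actions do not support resource-level
                # permissions, so the resource is left broad in this sketch.
                "Resource": "*",
            }
        ],
    }


def has_wildcard_action(policy):
    """Return True if any statement grants an action containing '*'."""
    for statement in policy["Statement"]:
        actions = statement["Action"]
        if isinstance(actions, str):
            actions = [actions]
        if any("*" in action for action in actions):
            return True
    return False


if __name__ == "__main__":
    print(json.dumps(dashboard_read_only_policy(), indent=2))
```

A check like has_wildcard_action can run as part of the change review process to flag policies that grant more than intended.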

Read the Security whitepaper

The DAO guidance aligns with the Reliability pillar by advocating automatic recovery from failure through proactive observability. If a Regional failover is required, it can be initiated manually or automatically. DAO also emphasizes monitoring business SLAs so that infrastructure capacity is optimized and, if those SLAs are not met, the appropriate alarms are triggered. The guidance further encourages testing Regional failover regularly so that all failure pathways are discovered, which reduces business risk.
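One common way to turn an SLA into an alarm is to require that M of the last N datapoints breach a threshold before the alarm fires, which avoids failing over on a single noisy sample. The sketch below is a simplified, stand-alone model of that idea; it is not the CloudWatch implementation (which also has configurable handling of missing data), and the function and parameter names are illustrative.

```python
def sla_alarm_state(latencies_ms, threshold_ms, evaluation_periods, datapoints_to_alarm):
    """Evaluate an M-of-N alarm over the most recent datapoints.

    Returns 'ALARM', 'OK', or 'INSUFFICIENT_DATA'.
    """
    if len(latencies_ms) < evaluation_periods:
        # Not enough history yet to evaluate the full window.
        return "INSUFFICIENT_DATA"
    window = latencies_ms[-evaluation_periods:]
    breaches = sum(1 for value in window if value > threshold_ms)
    return "ALARM" if breaches >= datapoints_to_alarm else "OK"


if __name__ == "__main__":
    # Hypothetical p99 latencies (ms), one per evaluation period.
    recent = [100, 900, 950, 120, 980]
    print(sla_alarm_state(recent, threshold_ms=800,
                          evaluation_periods=5, datapoints_to_alarm=3))
```

An ALARM result could notify an operator for a manual failover or trigger an automated one; which of the two is appropriate depends on how well the failover path has been tested.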

Read the Reliability whitepaper

DAO encourages mechanical sympathy by recommending that customers monitor application workloads with the right tool, such as AWS X-Ray for AWS Lambda. DAO also provides guidance on using advanced capabilities, such as Amazon CloudWatch Synthetics and canary testing, so that workload performance is measured across multiple dimensions.
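The idea behind a synthetic canary can be sketched without any AWS dependency: run a scripted probe on a schedule and record both whether it succeeded and how long it took. The code below is a plain Python illustration of that pattern, not the CloudWatch Synthetics API; run_canary, the probe callable, and the latency budget are all hypothetical names.

```python
import time
from typing import Callable


def run_canary(probe: Callable[[], bool], max_latency_ms: float) -> dict:
    """Run one probe and report success, latency, and any error.

    `probe` is any zero-argument callable that returns True when the
    user journey it exercises worked (for example, an HTTP check).
    """
    start = time.perf_counter()
    try:
        ok = bool(probe())
        error = None
    except Exception as exc:  # a crashing probe counts as a failed run
        ok = False
        error = repr(exc)
    latency_ms = (time.perf_counter() - start) * 1000.0
    return {
        # Success requires both a passing probe and an in-budget latency.
        "success": ok and latency_ms <= max_latency_ms,
        "probe_ok": ok,
        "latency_ms": latency_ms,
        "error": error,
    }
```

Reporting probe_ok and latency_ms separately keeps the two dimensions visible: a probe that passes but exceeds its latency budget is a degradation signal rather than an outage.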

Read the Performance Efficiency whitepaper

The DAO guidance uses Amazon CloudWatch metrics, alarms, and logs together with application-level tracing such as AWS X-Ray. Most of the guidance implementation should remain within the AWS Free Tier limits of CloudWatch and X-Ray, although costs need to be considered as customer requirements vary. For example, older CloudWatch logs can be exported to Amazon Simple Storage Service (Amazon S3) to reduce costs further.
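Archiving older logs to Amazon S3 comes down to choosing a time window and a destination. The sketch below only computes the arguments for such an export; it makes no AWS calls. The key names follow the parameters of the CloudWatch Logs CreateExportTask operation (timestamps in milliseconds since the epoch), while the helper name, the log group, the bucket, and the 90-day and 30-day values are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def export_window_kwargs(log_group, bucket, now, older_than_days, window_days):
    """Build arguments for exporting one window of older logs to S3.

    The window ends `older_than_days` before `now` and spans `window_days`.
    """
    end = now - timedelta(days=older_than_days)
    start = end - timedelta(days=window_days)
    return {
        "taskName": f"archive-{start:%Y%m%d}-{end:%Y%m%d}",
        "logGroupName": log_group,
        "fromTime": int(start.timestamp() * 1000),  # epoch milliseconds
        "to": int(end.timestamp() * 1000),          # epoch milliseconds
        "destination": bucket,                      # S3 bucket name
        "destinationPrefix": log_group.strip("/"),
    }


if __name__ == "__main__":
    kwargs = export_window_kwargs(
        log_group="/app/orders",       # hypothetical log group
        bucket="example-log-archive",  # hypothetical bucket
        now=datetime.now(timezone.utc),
        older_than_days=90,
        window_days=30,
    )
    print(kwargs)
```

With boto3 installed and credentials configured, a dictionary like this could presumably be passed as keyword arguments to the logs client's create_export_task call; verify the bucket policy and parameters against the current API reference before relying on it.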

Read the Cost Optimization whitepaper

The DAO guidance recommends that you monitor all layers of your workload to ensure that business SLAs are continuously met, and that you conduct a Regional failover when degradation or failure occurs. DAO can also be used to ensure efficient use of resources and reduce overprovisioning of infrastructure, supporting a sustainable long-term operating environment. Moreover, because the secondary environment is in a passive state, we recommend scaling its resources down until they are needed for a Regional failover.

Read the Sustainability whitepaper

Implementation Resources

The sample code is a starting point. It is industry validated, prescriptive but not definitive, and a peek under the hood to help you begin.

Open sample code on GitHub

Disclaimer

The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.
