AWS Cloud Operations Blog
What is observability and why does it matter? – Part 1
Before defining observability, consider the following example: You run an e-commerce site, and you’re interested in understanding the customer experience of the site, as well as how that translates into sales. You have identified that long page-loading times lead to poor customer experience, which in turn leads customers to abandon their carts and buy competing products. Therefore, if a page directly impacting sales conversion is loading slowly, you need to identify why in order to troubleshoot the issue as quickly as possible and ensure that it loads faster. Fewer abandoned carts mean increased sales. In the longer term, you also want to determine if you can implement architectural changes in order to proactively prevent page-loading issues.
Observability lets you do all of these things. It lets you understand what is happening in the system, and why, so that you can focus on the key issues faster. We’ll expand on our example, and explore how observability helps deal with the e-commerce site incident.
An Amazon CloudWatch alarm is raised when the render time for a product’s details page breaches the threshold you set up. Amazon CloudWatch ServiceLens provides a holistic application view. During review, traces show high latency in the Product Details page, so you review its logs. Amazon CloudWatch Contributor Insights shows that most requests exceeding the threshold are for a single product ID in the catalog. An AWS X-Ray trace for one request provides you with the product ID. The trace also provides the context associated with this interaction, such as the request parameters passed by the client, the calls to services involved in rendering the detail page, and how long each component took to process the request. The trace examination reveals that the page spends most of its time waiting on several large images embedded in the product page. This information lets you take immediate action to cache these images via Amazon CloudFront, or reduce their size.
Having discovered that the alarm was triggered by a suddenly popular product, you investigate further to understand that popularity. Looking for top contributors by referrer URL, you determine that traffic to this detail page comes from a social media site where an influencer has linked to this product. Understanding the impact of social media on sales might lead the development team to prioritize the implementation of backlogged features or improvements. They might prioritize implementing a content-delivery network (Amazon CloudFront) in order to deliver static content like pictures and video at lower latency. Or they might implement the product page content rendering as microservices that allow basic content and placeholders to load independently of image and video files. This may also change the marketing strategy for the company.
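The top-contributor analysis in this walkthrough can be sketched in a few lines of Python. This is only an illustration of the idea, not the Contributor Insights API: the `Request` record, its field names, the sample data, and the 2000 ms threshold are all assumptions made up for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    # Hypothetical fields; real log records will look different.
    product_id: str
    referrer: str
    render_ms: float


def top_contributors(requests, threshold_ms, key, n=3):
    """Rank the values of `key` among requests that breach the threshold."""
    slow = [r for r in requests if r.render_ms > threshold_ms]
    return Counter(key(r) for r in slow).most_common(n)


# Invented sample data: one product dominates the slow requests.
requests = [
    Request("P-100", "social.example", 2400.0),
    Request("P-100", "social.example", 2750.0),
    Request("P-100", "search.example", 2100.0),
    Request("P-200", "search.example", 300.0),
    Request("P-300", "direct", 2050.0),
    Request("P-200", "direct", 280.0),
]

by_product = top_contributors(requests, 2000.0, key=lambda r: r.product_id)
by_referrer = top_contributors(requests, 2000.0, key=lambda r: r.referrer)
print(by_product)   # the first entry is the product behind most slow requests
print(by_referrer)  # the first entry is the referrer sending that traffic
```

The same function answers both questions from the example (which product, and which referrer) because the grouping key is passed in rather than hard-coded.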
What is observability?
Observability lets you gain insights into systems and ask new questions using metrics, logs, or traces. It provides a holistic workload view, with rich contextual information that lets you understand why a system is, or is not, meeting your Service Level Objectives (SLOs) as measured by Key Performance Indicators (KPIs). Observability is informed by key business drivers, rather than focusing only on component-level insights related to faults, configuration, accounting, performance, and security.
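To make the relationship between a KPI and an SLO concrete, here is a minimal sketch. It assumes a latency-based KPI (the fraction of page loads served within a threshold) and a hypothetical objective of 95%; the function names, sample latencies, and numbers are illustrative and do not come from any AWS service.

```python
def slo_compliance(latencies_ms, threshold_ms):
    """KPI: fraction of requests served within the latency threshold."""
    if not latencies_ms:
        raise ValueError("no requests observed")
    good = sum(1 for ms in latencies_ms if ms <= threshold_ms)
    return good / len(latencies_ms)


def meets_slo(latencies_ms, threshold_ms, objective):
    """SLO check: is the measured KPI at or above the objective?"""
    return slo_compliance(latencies_ms, threshold_ms) >= objective


# Invented page-load times in milliseconds; two of ten exceed 2000 ms.
latencies = [120, 340, 95, 2100, 180, 410, 150, 90, 3000, 220]

kpi = slo_compliance(latencies, threshold_ms=2000)
print(f"{kpi:.0%} of page loads within 2000 ms")
print("SLO met:", meets_slo(latencies, threshold_ms=2000, objective=0.95))
```

The KPI alone tells you that the objective is being missed. The contextual telemetry (which requests were slow, and why) is what observability adds on top.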
Monitoring and observability
Monitoring enables an observable system. Systems utilize monitoring to measure the system state via KPIs that provide insights into its observable properties, such as reliability, availability, and performance. An effective observability strategy cannot exist without effective monitoring.
How to build an observability strategy
Your observability strategy must work backwards from your business needs. A purely technical strategy that doesn’t account for your business requirements will result in an incomplete solution.
Observability requires two foundational capabilities. The first is clear alignment between business and technology teams in order to understand key business needs and goals. Your application architecture should optimize for business needs. This lets you identify the KPIs that you need to measure, and build the capabilities for monitoring these KPIs. For more information about building a shared understanding between teams, check out the Operational Excellence pillar of the AWS Well-Architected Framework, and learn about building a Cloud Center of Excellence.
The second requirement is instrumenting your system to capture the telemetry needed in order to monitor your system, and then determining the context for the requests contributing to the KPIs. This captured data consists of logs, metrics, and traces. Utilize open source standards, like OpenTelemetry through AWS Distro for OpenTelemetry, or utilize AWS services like Amazon CloudWatch and AWS X-Ray, to collect telemetry from your application. Furthermore, utilize Amazon Managed Service for Grafana to get an aggregate view of, as well as to drill down into, application metrics captured through Amazon CloudWatch. Amazon Managed Service for Prometheus lets you deep dive into container metrics. AWS X-Ray provides the traces that describe how requests pass through the system, so that you can investigate transactions spanning many components. Amazon CloudWatch also lets you explore, analyze, and visualize your logs so that you can easily troubleshoot operational problems.
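As a rough illustration of what trace instrumentation captures, the sketch below records nested, timed spans for a single request using only the Python standard library. It is a hand-rolled stand-in, not the AWS X-Ray SDK or the OpenTelemetry API: the `Tracer` class, the span names, and the `time.sleep` calls that simulate work are all hypothetical.

```python
import time
import uuid
from contextlib import contextmanager


class Tracer:
    """Toy tracer: records a name, parent, and duration for each span."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex  # one ID shared by all spans
        self.spans = []
        self._stack = []                  # names of currently open spans

    @contextmanager
    def span(self, name):
        parent = self._stack[-1] if self._stack else None
        self._stack.append(name)
        start = time.perf_counter()
        try:
            yield
        finally:
            duration_ms = (time.perf_counter() - start) * 1000
            self._stack.pop()
            self.spans.append({
                "trace_id": self.trace_id,
                "name": name,
                "parent": parent,
                "duration_ms": duration_ms,
            })

    def slowest_child(self, parent):
        """Name of the direct child span that took the longest."""
        children = [s for s in self.spans if s["parent"] == parent]
        return max(children, key=lambda s: s["duration_ms"])["name"]


tracer = Tracer()
with tracer.span("render_product_page"):
    with tracer.span("load_product_record"):
        time.sleep(0.01)   # simulated database lookup
    with tracer.span("fetch_images"):
        time.sleep(0.05)   # simulated slow image download

print(tracer.slowest_child("render_product_page"))
```

In a real system, the tracing library propagates the trace ID across service boundaries. That propagation is what lets you follow one request through many components.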
Together, your KPIs, along with the telemetry data, will provide the insights that enable you to quickly understand the cause of alarms and events that can put business outcomes at risk. In a continuously changing environment, observability is essential to improving the signal-to-noise ratio in monitoring, and it helps your teams focus on what really matters for your business.
The Observability journey
Observability is a journey. Your observability strategy should evolve over time based on what you learn from operating your systems. You must tune your telemetry sources, data, and analyses to quickly and efficiently identify novel trends or insights. Use observability insights to target improvements both to your systems and to how those systems are instrumented to support your business needs. Regardless of where your teams are in the journey, whether they’re just starting to develop a monitoring strategy or are already continuously improving their observability strategy, your monitoring and observability should be based on supporting business needs, and must work backwards from the business.
Developing an observable system requires up-front investment in time, resources, skills, and tooling. It may need to be sustained by continued investment. An opportunity analysis will determine which observability benefits should be prioritized for implementation.
In future posts, we will dive deeper into the tools and resources available in AWS that you can utilize to implement your observability strategy. We will also talk through several business use cases that demonstrate the value of observability. Stay tuned.
Learn more about observability at AWS.
Read the next post in this series: Part 2.