AWS Cloud Operations Blog
How to develop an Observability strategy – Part 2
Your observability strategy starts with your business. “Observability” describes how well you can understand what’s happening in a system. Developing an observability strategy isn’t a one-time effort. It’s a continuous improvement process that runs throughout the lifecycle of your workloads. It enables your teams to determine whether the workloads they design and run are achieving your desired business outcomes. A strong observability foundation starts by working backward from those outcomes.
Identify what you will measure
The first step in defining your observability strategy is to identify what you need to measure. Each business outcome has a set of Key Performance Indicators (KPIs) associated with it. KPIs aren’t measurements themselves. Instead, they’re definitions of what can be measured to evaluate whether a business outcome has been achieved. The number of KPIs will vary by business, but the number itself isn’t important. KPIs are defined by business and technology owners, and agreement between these stakeholders is the key to success. We covered the example of an e-commerce site in our first post.
For an e-commerce site, the desired business outcome may be growing sales volume by a certain percentage in each region. KPIs associated with this could include the rate of completed or abandoned orders, as these directly affect sales growth.
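To make the KPI concrete, the completion rate can be derived from raw order counts. This is a minimal, hypothetical sketch; the function name and its inputs are illustrative only, not part of any AWS API.

```python
# Hypothetical sketch: computing the order-completion-rate KPI
# from raw counts emitted by the order-related components.

def order_completion_rate(completed: int, abandoned: int) -> float:
    """Fraction of started orders that completed, in [0.0, 1.0]."""
    started = completed + abandoned
    if started == 0:
        return 0.0  # no orders started in this window
    return completed / started

# Example: 940 completed, 60 abandoned -> 0.94 (94% completion rate)
print(order_completion_rate(completed=940, abandoned=60))
```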
Determine workloads and components to measure, and the sources of telemetry
Once the KPIs are defined, it is important to identify which workloads and components need to be monitored to measure them. This is also a collaborative exercise: the key business outcomes, and the associated KPIs and measurements, should be agreed upon by the various stakeholders. To measure completed orders, the e-commerce site would need information from components like the shopping cart, order processing, and payment services.
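One common way to surface these measurements is for each component to publish custom Amazon CloudWatch metrics. The boto3 sketch below illustrates the idea; the namespace, metric names, and dimension are hypothetical choices for illustration, not values prescribed by this post.

```python
# Minimal sketch: a component (e.g., order processing) publishing
# KPI-related counts as custom CloudWatch metrics via boto3.
# Namespace, metric names, and dimensions are illustrative assumptions.
import boto3

cloudwatch = boto3.client("cloudwatch")

def publish_order_metrics(completed: int, abandoned: int, region: str) -> None:
    cloudwatch.put_metric_data(
        Namespace="ECommerce/Orders",  # hypothetical namespace
        MetricData=[
            {
                "MetricName": "OrdersCompleted",
                "Dimensions": [{"Name": "Region", "Value": region}],
                "Value": completed,
                "Unit": "Count",
            },
            {
                "MetricName": "OrdersAbandoned",
                "Dimensions": [{"Name": "Region", "Value": region}],
                "Value": abandoned,
                "Unit": "Count",
            },
        ],
    )

# Each component (shopping cart, payment, ...) would publish similar counts.
publish_order_metrics(completed=940, abandoned=60, region="us-east-1")
```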
Define the sources of telemetry (metrics, logs, and traces)
Observability gives you the ability to understand your system’s state: how well the system is functioning, where the issues are, and what errors are being generated, all based on data emitted by your systems. Your observability setup should let you follow a request through the system, the microservices it interacts with, the state of the infrastructure those services run on, and the impact each of these has on the user experience.
You get this information through metrics, logs, and traces. Metrics are runtime measurements about a service, including system metrics like CPU utilization or network bandwidth, as well as workload-specific metrics like the number of times a given function is executed. Logs are text records with metadata that are often used to determine the root cause of an issue. Traces track the flow of a request across applications. Together, metrics, logs, and traces provide a comprehensive view of the system.
For the e-commerce site, you may trace a user’s order through the various sub-systems like ordering, finance, and order processing; collect logs from each of these services; and collect metrics from the underlying infrastructure. This will enable you to answer questions like where the order is slowing down, whether the latency is caused by the hardware, or whether there is an underlying dependency between microservices that wasn’t identified during development. It may also give you the insights you need to improve your system, your architecture, or your development process.
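As an illustration of what emitting these three signals can look like in application code, here is a minimal sketch using the OpenTelemetry Python API. The service name, span names, and attributes are hypothetical, and exporter configuration (for example, to AWS X-Ray through the AWS Distro for OpenTelemetry collector) is omitted, so the calls below are no-ops until an SDK is configured.

```python
# Minimal sketch: emitting a trace, a metric, and a log for one order.
# Requires only the API package (pip install opentelemetry-api); without
# a configured SDK/exporter, the trace and metric calls are no-ops.
import logging
from opentelemetry import trace, metrics

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("order-service")   # logs
tracer = trace.get_tracer("order-service")    # traces
meter = metrics.get_meter("order-service")    # metrics

orders_completed = meter.create_counter(
    "orders.completed", unit="1", description="Completed orders"
)

def process_order(order_id: str) -> None:
    # Parent span for the whole order; child spans mark each subsystem
    # the request flows through (ordering, payment, ...).
    with tracer.start_as_current_span("process-order") as span:
        span.set_attribute("order.id", order_id)  # hypothetical attribute
        with tracer.start_as_current_span("payment"):
            logger.info("payment authorized for order %s", order_id)
        orders_completed.add(1, {"region": "us-east-1"})

process_order("order-12345")
```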
Define what good looks like
It is important to define a baseline for your KPIs: the normal operating range from which you’ll measure. You can develop your baseline by observing your business over a period of time; you will find that it normally operates within a certain range, which you can characterize with percentiles. You can set these percentiles as thresholds and raise alerts when measurements fall outside them. Percentiles are better than averages because they aren’t skewed by outliers. You should also include a grace period before taking action, to help prevent alert fatigue from the temporary spikes that occur naturally. These thresholds should evolve over the lifecycle of your workloads; as your observability capability matures, they will adapt. You may also use machine learning (ML) that automatically adjusts thresholds based on trends over time, as well as anomaly detection that can identify anomalous data based on patterns that aren’t recognizable through percentiles alone.

For the e-commerce site, you may determine that if your order rate crosses and remains outside the 90th percentile for a predefined time (say, 10 minutes), it is unexpected behavior, and you may want to take a specific action.
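One way to express both the baseline-derived threshold and the grace period is a CloudWatch alarm. Below is a minimal boto3 sketch under assumptions: the namespace and metric name are the hypothetical ones used earlier, and the threshold value stands in for a figure you would derive from your own baseline analysis.

```python
# Hypothetical sketch: a CloudWatch alarm that fires only when the
# order rate stays below the baseline-derived threshold for a full
# 10-minute grace period (ten 1-minute datapoints in a row).
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="order-rate-below-baseline",  # hypothetical name
    Namespace="ECommerce/Orders",           # hypothetical namespace
    MetricName="OrdersCompleted",
    Statistic="Sum",
    Period=60,                     # evaluate one-minute datapoints
    EvaluationPeriods=10,          # look at the last 10 minutes...
    DatapointsToAlarm=10,          # ...and require all 10 to breach (grace period)
    Threshold=50.0,                # derived offline from your baseline analysis
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",  # no orders at all is also a problem
)
```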
Define actions to be taken when a threshold is crossed
The next step in an observability strategy turns sources of information into actions and insights. Knowing when to take action, and what actions to take, are the roots of a functional observability strategy. Too many alarms lead to alert fatigue. Adding grace periods to alarms and ensuring that each alarm has a defined action are other ways to limit it. If no action is required, you shouldn’t trigger an alert.
Another approach is automation. Rather than alerting IT teams, some alarms can trigger automatic actions; alarms should only alert IT teams when manual intervention is required. For those cases, you should define the actions in the form of runbooks (codified, scripted actions that you take in response to well-understood issues) or playbooks (defined steps for investigating issues that don’t have a well-understood cause). This is where monitoring intersects with observability.
Observability is tied to business outcomes. In the case of the e-commerce site, you can set alerts for latency and trigger an autoscaling event to increase capacity when your system crosses the latency threshold. You should also have a defined playbook that includes which systems to investigate, how to determine the order of investigation, which metrics, logs, and traces to use, and how to correlate between them.
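As a hedged sketch of that automation pattern, the boto3 example below attaches an existing Auto Scaling policy to a p90 latency alarm. The namespace, metric name, threshold, and policy ARN are all placeholders, not values from this post.

```python
# Hypothetical sketch: a latency alarm whose action is automation
# (an existing Auto Scaling policy) rather than paging a human.
# Metric names, threshold, and the policy ARN are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch")

# ARN of a pre-created Auto Scaling scaling policy (placeholder).
SCALE_OUT_POLICY_ARN = "arn:aws:autoscaling:us-east-1:123456789012:scalingPolicy:..."

cloudwatch.put_metric_alarm(
    AlarmName="checkout-latency-high",
    Namespace="ECommerce/Checkout",       # hypothetical namespace
    MetricName="Latency",
    ExtendedStatistic="p90",              # percentile, not average
    Period=60,
    EvaluationPeriods=5,
    DatapointsToAlarm=5,                  # 5-minute grace period
    Threshold=2.0,                        # seconds; from your baseline
    Unit="Seconds",
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[SCALE_OUT_POLICY_ARN],  # scale out automatically
)
```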
Evolving your observability strategy over time
As you evolve your observability strategy through regular inspection and analysis, several inputs may contribute to a mature observability capability. For example, trend analysis and anomaly detection with automated responses may be used. You should also develop methods for correlating events to shorten the time required to troubleshoot. These methods help you detect events that could affect the business, and enable you to engage and investigate quickly to prevent or mitigate impacts. The goal of evolving and improving your observability strategy should be to reduce the mean time to detect (MTTD) and the mean time to resolve (MTTR). Reduced MTTD and MTTR will further reduce or prevent downtime and improve your ability to achieve the desired business outcomes that your technology is designed to deliver.
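For example, CloudWatch’s anomaly detection can learn an expected band from a metric’s history and alarm when the metric leaves it. This is a minimal boto3 sketch, reusing the hypothetical order metric from earlier; the band width and evaluation periods are illustrative choices.

```python
# Hypothetical sketch: an anomaly detection alarm that fires when the
# order rate leaves the model's expected band in either direction.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="order-rate-anomaly",  # hypothetical name
    ComparisonOperator="LessThanLowerOrGreaterThanUpperThreshold",
    EvaluationPeriods=3,
    ThresholdMetricId="ad1",         # the band below is the threshold
    Metrics=[
        {
            "Id": "m1",
            "MetricStat": {
                "Metric": {
                    "Namespace": "ECommerce/Orders",   # hypothetical
                    "MetricName": "OrdersCompleted",
                },
                "Period": 300,
                "Stat": "Sum",
            },
            "ReturnData": True,
        },
        {
            "Id": "ad1",
            # Band width of 2 standard deviations (illustrative choice).
            "Expression": "ANOMALY_DETECTION_BAND(m1, 2)",
            "Label": "Expected range",
            "ReturnData": True,
        },
    ],
)
```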
Each business is unique, and your observability solution should meet the needs of your business. As you start or evolve your observability journey, consider whether you’re partnering with the right business and technical stakeholders to define the correct set of KPIs for your business. How would your business benefit from applying the approach above to develop your observability strategy? The answers to these questions will enable you to create your optimal observability strategy.
In the next post, we’ll cover how you can use AWS solutions to collect metrics, logs, and traces to implement your observability strategy. In the meantime, we encourage you to learn more about observability at AWS.
Read part 1 in this series here