AWS Cloud Operations Blog
Monitoring Service Level Objectives (“SLOs”) Made Easier with Nobl9 and Amazon CloudWatch Metrics Insights
This updated version (June 2022) works backward from a customer need to understand Service Level Objectives (“SLOs”) and the benefits of monitoring them.
This post was originally written in Nov 2021 by Natalia Sikora-Zimna, Product Owner at Nobl9.
A service can be provided by infrastructure, a platform, software, or people. When users find a service’s functionality valuable, they expect it to be available. Reliability measures the ability of a service to work correctly, so users also expect a service to be reliable. Site reliability engineering (SRE) practitioners have observed that 100% is the wrong reliability target for nearly everything. Google documented this observation in detail in the book Site Reliability Engineering. If 100% reliability is the wrong target, then what is the right target that will continue to meet user expectations? Defining a viable target requires collaboration between product management, business stakeholders, engineering, and user experience (UX). This reliability target is essential to maintaining a service level and meeting user expectations.
Overview
Product management and business stakeholders know user satisfaction with a service is key to sustaining business revenue. They depend on UX and operations teams to understand and monitor user satisfaction. UX teams use surveys to analyze user satisfaction. Operations teams use tools, like real user monitoring (RUM), to quantify UX. The key performance indicators (KPIs) or metrics collected with these processes are point-in-time values that provide a snapshot for a set of users at a specific time.
Organizations frequently call these metrics product KPIs and use them to monitor user satisfaction. There are two challenges with this approach in the real world. The first is that operations teams don’t know which metrics matter for user satisfaction. The second is that monitoring point-in-time values doesn’t show whether the service is reliable and satisfactory over a given period of time.
At its core, a Service Level Agreement (SLA) answers a yes or no question: did a user-relevant metric meet a certain threshold value over a given period of time? An SLA includes a provision to compensate users if the measured value doesn’t satisfy the agreed-upon value. For example, Amazon Simple Storage Service (Amazon S3) is provided with a “Monthly Uptime Percentage” SLA measured over the last billing cycle.
Operators can rely upon an SLA to know which metrics matter, the time period, and the threshold value. However, SLAs don’t specify every user satisfaction metric in the agreement, for various reasons. As a result, an SLA can’t be relied upon to monitor all of the user satisfaction metrics.
There’s an alternative approach that organizations can adopt to address this challenge. Site reliability engineers monitor Service Level Objectives (SLOs) to address the challenges outlined above.
Service Level Objectives
A Service Level Indicator (SLI) is a metric used to reflect user satisfaction. Availability and latency are commonly used indicators.
An SLO is a target value applied to an SLI over a period of time. The value is reported as a percentage over a specified period of time. Examples of SLOs include the aggregated availability value needing to be more than 99% in the last 30 days, and the aggregated latency value needing to be less than 1 second in the last 30 days.
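The availability example can be sketched in a few lines of Python. This is a minimal illustration that assumes a request-based SLI (good requests divided by total requests over the window); the function name, the `SloResult` type, and the request counts are hypothetical and are not part of Nobl9 or CloudWatch.

```python
from dataclasses import dataclass


@dataclass
class SloResult:
    sli_percent: float     # measured SLI over the window, as a percentage
    target_percent: float  # SLO target, as a percentage
    met: bool              # True when the SLI is above the target


def evaluate_availability_slo(good_requests: int, total_requests: int,
                              target_percent: float = 99.0) -> SloResult:
    """Compare a request-based availability SLI against an SLO target."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    sli_percent = 100.0 * good_requests / total_requests
    # The example SLO requires availability to be "more than" the target.
    return SloResult(sli_percent, target_percent, sli_percent > target_percent)


# Hypothetical counts aggregated over the last 30 days.
result = evaluate_availability_slo(good_requests=995_000, total_requests=1_000_000)
print(f"SLI {result.sli_percent:.2f}% vs target {result.target_percent}% -> met: {result.met}")
```

The counts are aggregated over the whole window before the comparison, which is what distinguishes an SLO check from a point-in-time threshold.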
An SLA uses a published SLO and defines a penalty for the service provider when the SLO value falls below the agreed target value.
This service level framework also introduces a term called error budget. The error budget is the difference between 100% and the SLO target value. This budget allows a service reliability level to be less than 100% while still meeting user expectations. As long as the actual SLO value is over the target value, users will be satisfied with the service.
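As a worked example of the error budget arithmetic, the following sketch converts an SLO target into a budget and into the equivalent number of minutes in the window. It assumes a time-based view of reliability and a 30-day window; the helper names are hypothetical.

```python
def error_budget_percent(slo_target_percent: float) -> float:
    """Error budget: the difference between 100% and the SLO target."""
    return 100.0 - slo_target_percent


def allowed_bad_minutes(slo_target_percent: float, window_days: int = 30) -> float:
    """Minutes in the window that may fall short while still meeting the target."""
    window_minutes = window_days * 24 * 60
    return window_minutes * error_budget_percent(slo_target_percent) / 100.0


def budget_remaining_percent(slo_target_percent: float, actual_percent: float) -> float:
    """Share of the error budget still unspent, given the measured value."""
    budget = error_budget_percent(slo_target_percent)
    if budget <= 0:
        raise ValueError("a 100% target leaves no error budget")
    spent = 100.0 - actual_percent
    return 100.0 * (budget - spent) / budget


print(error_budget_percent(99.0))            # budget for a 99% target
print(allowed_bad_minutes(99.0))             # minutes of budget in 30 days
print(budget_remaining_percent(99.0, 99.5))  # unspent share at 99.5% measured
```

For a 99% target over 30 days, the budget is 1% of 43,200 minutes, which is 432 minutes (7.2 hours). A measured value of 99.5% has spent half of that budget.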
An SLI metric and an SLO target are defined by a cross-functional team that includes product, UX, and engineering, so that the metric is based on end user expectations. An SLO is also defined over a time period, such as the last 30 days. These two properties of an SLO help address the challenges with metric selection and point-in-time monitoring.
Benefits of monitoring SLOs
Monitoring SLOs provides the following benefits:
- Sets expectations on system behavior, which helps determine whether the service is meeting user expectations
- Informs decisions on when to invest development time in reliability rather than new features
- Supports SLAs
- Enables alerting based on user expectations, which reduces false positive alerts
- Supports shift-left DevOps practices, which helps identify defects earlier in the software development lifecycle
Amazon CloudWatch has recently launched Metrics Insights, a fast, flexible, and SQL-based query engine that lets customers identify trends and patterns across millions of operational metrics in real time. Metrics Insights lets customers easily query and analyze metrics to gain better visibility into the health and performance of their infrastructure and large-scale applications.
Nobl9 and AWS have collaborated to extend the existing Nobl9 CloudWatch integration with CloudWatch Metrics Insights. This will help users retrieve metrics even faster and gain added flexibility when querying raw SLI data to use for SLOs.
Nobl9 launched the first version of its Amazon CloudWatch integration in September 2021, giving customers a versatile tool to monitor their products. CloudWatch collects data from over 70 AWS services, thereby providing AWS users with access to valuable infrastructure metrics. In addition, users can create their own custom metrics. Moreover, Nobl9’s CloudWatch integration provides customers with the power to translate these metrics into actionable SLOs. This means that companies have all of the information that they need to maintain a balance between cost and reliability, and to keep their customers happy.
CloudWatch Metrics Insights takes the SLO game to the next level. It’s an innovative analytics tool that works for both types of CloudWatch metrics: infrastructure and custom. The introduction of this feature lets Nobl9 customers benefit from using a powerful, SQL-based query engine for grouping, aggregating, and filtering metrics by labels in real time. This also helps them better organize their business insights. Furthermore, it gives users broad possibilities when defining metrics and choosing the granularity of insights that best fits their needs.
Metrics Insights comes with a query builder that lets customers select their metrics of interest, namespaces, and dimensions visually. Then the console automatically constructs Metrics Insights SQL queries based on their selections. Metrics Insights also provides an SQL query editor, where customers can type in raw SQL queries or edit the ones that they’ve created earlier and get down to the finest level of granular detail. Note that CloudWatch Metrics Insights comes with auto-completion support, which provides smart suggestions throughout the query composition process.
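To give a sense of what such a query looks like, here is a minimal Python sketch that assembles Metrics Insights SQL strings of the form `SELECT FUNCTION(metric) FROM SCHEMA("namespace", dimension) WHERE ...`. The helper `build_query` and the load balancer name are hypothetical and used only for illustration. The final dictionary shows how a query could be placed in the `Expression` field of a CloudWatch `GetMetricData` request entry; it is built here but not sent.

```python
from typing import Dict, Optional, Sequence


def build_query(function: str, metric: str, namespace: str,
                dimensions: Sequence[str],
                filters: Optional[Dict[str, str]] = None) -> str:
    """Assemble a simple Metrics Insights query string (hypothetical helper).

    Values are interpolated without escaping, so only use trusted input.
    """
    schema_args = ", ".join([f'"{namespace}"', *dimensions])
    query = f"SELECT {function}({metric}) FROM SCHEMA({schema_args})"
    if filters:
        conditions = " AND ".join(f"{key} = '{value}'" for key, value in filters.items())
        query += f" WHERE {conditions}"
    return query


# Average CPU utilization across EC2 metrics that carry the InstanceId dimension.
cpu_query = build_query("AVG", "CPUUtilization", "AWS/EC2", ["InstanceId"])

# Request count for one load balancer; the name below is a made-up placeholder.
elb_query = build_query("SUM", "RequestCount", "AWS/ApplicationELB",
                        ["LoadBalancer"],
                        filters={"LoadBalancer": "app/example-lb/0123456789abcdef"})

# Sketch of one MetricDataQueries entry for GetMetricData (not sent anywhere here).
metric_data_query = {"Id": "sli", "Expression": cpu_query, "Period": 60}

print(cpu_query)
print(elb_query)
```

A query that returns a single time series, such as the filtered `elb_query`, is the kind of raw SLI data that can then feed an SLO.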
Once customers create their SQL queries, they can use them in the Nobl9 platform to set up SLOs that provide actionable data regarding multiple aspects of their business. Nobl9 keeps the integration as simple as possible: choose the AWS Region, and paste in the metric SQL query exactly as it was created in CloudWatch.
CloudWatch Metrics Insights is available in all AWS Regions, except China.
If you’d like to learn more about Nobl9 and SLOs, then visit nobl9.com. If you’d like to try out the Nobl9 console and see how it can help your business, then sign up for a free 30-day trial.
To learn more about Metrics Insights, refer to the CloudWatch Metrics Insights documentation.
The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.