AWS Cloud Operations Blog

Improve application reliability with effective SLOs

At AWS, we define reliability as the capability of a service to withstand major disruptions within acceptable degradation parameters and to recover within an acceptable timeframe. Service reliability goes beyond traditional disciplines, such as availability and performance, to achieve its goal. Components of a system or application will eventually fail. As our CTO Werner Vogels says, “everything fails, all the time”. The question is whether your system or application can sustain failures without impacting end users, and how resilient it is when failures occur. Our customers regularly ask us to help them reduce the blast radius of incidents and meet the reliability, performance, and scalability expectations of their businesses.

In this post, you will learn about reliability best practices that will set your teams up for success by measuring performance objectively and reporting reliability with accuracy for a quick turn-around when incidents happen. You will also learn how to create, monitor, and alert on Service Level Objectives (SLOs) natively in AWS using any Amazon CloudWatch Metric with Amazon CloudWatch Application Signals.

Service Level Management (SLM)

Service Level Management (SLM) provides a framework or process to define, negotiate, and manage delivered IT services and service levels for customers. This framework includes several critical elements such as service availability, quality, data security, and throughput. It aims to protect the objectives of both the customer and you, the service provider. Now, let’s look at the terminology for the assurances you make to your customers and the trackable measurements that tell you how healthy your services are.

  • SLI (Service Level Indicator) is a carefully defined quantitative measure of some aspect of the level of service provided.
  • SLO (Service Level Objective) is a target value or range of values for a service level measured by an SLI over a period of time.
  • SLA (Service Level Agreement) is an agreement with your customer that outlines the level of service you promise to deliver. An SLA also details the course of action when requirements are not met, such as additional support or pricing discounts.
  • Error Budget is the rate at which SLOs can be missed. It is the difference between 100% reliability and the SLO target value. Simply put, an error budget is an SLO for meeting other SLOs!
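To make the error budget definition concrete, here is a minimal sketch in Python. It assumes a hypothetical 99.9% SLO target measured over a 30-day window; the function names are illustrative and not part of any AWS API.

```python
def error_budget(slo_target_percent: float) -> float:
    """Return the error budget in percent: 100% reliability minus the SLO target."""
    if not 0 < slo_target_percent <= 100:
        raise ValueError("SLO target must be greater than 0 and at most 100")
    return 100.0 - slo_target_percent


def allowed_bad_minutes(slo_target_percent: float, window_days: int) -> float:
    """Translate the error budget into minutes of unreliability over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * error_budget(slo_target_percent) / 100.0


# Hypothetical example: a 99.9% target over a 30-day window.
print(round(error_budget(99.9), 3))             # 0.1 (percent)
print(round(allowed_bad_minutes(99.9, 30), 1))  # 43.2 (minutes)
```

In other words, under these assumptions the service can be unreliable for roughly 43 minutes in 30 days before the SLO is missed.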

The following diagram illustrates how SLAs, SLOs, and SLIs interact. The customer (service consumer) is external to the team that owns the service. Within the service team you have sales functions such as business owners and customer success engineers, the product owner (who owns the roadmap), and the engineering team, which creates and operates the service. The engineering team owns the SLIs that measure the service and drive the SLOs. Product and engineering typically jointly own the SLOs, which inform the SLAs. To close the loop: as a customer, you have visibility into the SLAs and you can see how the service is performing. However, SLOs and SLIs are usually not shared outside of the service team boundary.

Figure: How SLAs, SLOs, and SLIs interact

Effective SLOs

SLOs help ensure performance standards are met and act as data points for meeting Key Performance Indicators (KPIs). For that reason, SLOs should be SMART (Specific, Measurable, Achievable, Relevant, and Time-bound). SLOs should clearly define what is to be achieved, provide a way to measure progress, ensure that goals can realistically be achieved given current resources and capabilities, align with business objectives, and set a time frame for achieving these goals. Two key benefits of effective SLOs are improved visibility for decision making and improved service quality that helps prevent business disruptions. The following are a few examples of effective SLOs that allow you to gauge your customer experience.

Figure: Effective SLOs

Common Challenges

SLOs can be a powerful tool for helping you effectively prioritize what matters most for your end users and the business. However, getting started can come with its own challenges. Following are some of the common challenges that we’ve observed among our customers.

  • Capturing the right metrics for your SLIs: Effective SLOs start with the right metrics. However, identifying the right metrics to use, and ensuring your services are instrumented properly to capture the right metrics that impact your business can be a challenge.
  • Knowing how and when to respond to violations: Once you’ve identified the right services, metrics, and goals, the next challenge is knowing how to calculate an error budget, and how to craft the right level of alerting based on your burn rate.
  • Connecting SLOs to your diagnostics tools: If you plan to set operational alarms to respond to SLO breaches, a disconnected experience between the tool you use for monitoring SLOs and the tool you use for debugging application performance can make it difficult to identify why you’re not meeting your SLO. The more connected the experience, the more insights you can gain, and the faster you can identify what to focus on to improve your SLO performance.

Best practices

Cross-team collaboration is the most critical factor for the successful implementation of Service Level Objectives in an organization. We recommend that you consider the following best practices when creating SLOs to meet SLAs.

  • Align on goals across all stakeholders: When setting effective SLOs, it is critical to have alignment across product, engineering, and operations. With this alignment you can enforce SLO practices to inspect and improve reliability.
  • Attaining 100% is not realistic: As much as you may not like it, everything fails. Given this, setting a goal for 100% reliability is a recipe for failure. Instead, think about what a realistic goal to achieve is and what your end users might expect from your service. And bear in mind that services should be designed to retry a failed request!
  • Plan your response (automation, diagnostics): It’s important to carefully consider when and how to get alerted to SLO violations. Some operational events may lead to faster error budget burn rates than others and may require a higher severity alert. Where possible, use automation to detect and remediate application issues that impact your SLOs.
  • Document, share, and leverage open standards: As you adopt more SLOs across your organization, it’s good to have a common framework for documenting and sharing SLOs so that teams use consistent patterns. Consider how to build SLO reviews into your daily or weekly operational meetings.
  • Iterate (feedback loop, re-assess goals): Unlike SLAs, which are more rigid, SLOs are more flexible, and intended to help you improve the reliability of your service. It’s important to have a mechanism to continuously inspect if you are attaining your SLO target, and iterate to help you achieve the right balance of reliability for your business and your customers.
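The burn rate mentioned above can be sketched as the ratio between the observed rate of bad events and the rate the error budget allows. The following Python sketch is illustrative only: the function names and the severity thresholds (`page_at`, `ticket_at`) are assumptions chosen for the example, not values recommended by AWS or used by CloudWatch.

```python
def burn_rate(bad_fraction: float, slo_target_percent: float) -> float:
    """Observed bad fraction divided by the fraction the error budget allows.

    A burn rate of 1.0 spends the budget exactly over the SLO window;
    higher values exhaust it sooner.
    """
    budget_fraction = (100.0 - slo_target_percent) / 100.0
    if budget_fraction <= 0:
        raise ValueError("a 100% target leaves no error budget to burn")
    return bad_fraction / budget_fraction


def days_until_exhausted(rate: float, window_days: int) -> float:
    """Days until the whole budget is spent if the current burn rate continues."""
    return float("inf") if rate <= 0 else window_days / rate


def severity(rate: float, page_at: float = 10.0, ticket_at: float = 2.0) -> str:
    """Map a burn rate to an alert severity (thresholds are hypothetical)."""
    if rate >= page_at:
        return "page"
    if rate >= ticket_at:
        return "ticket"
    return "none"


# Hypothetical example: 5% of recent periods were bad against a 99% goal.
rate = burn_rate(0.05, 99.0)
print(rate, days_until_exhausted(rate, 28), severity(rate))
```

With these assumed numbers the budget burns about five times faster than planned, so a 28-day budget would last under six days. That is the kind of signal that may justify a higher severity alert than a slow, steady burn.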

How to create SLOs natively in Amazon CloudWatch

With the introduction of Amazon CloudWatch Application Signals, you can now create and monitor SLOs natively in AWS. With these SLOs and Application Signals, you track application performance against your most important business objectives without the undifferentiated heavy lifting of manual instrumentation, metrics computations, and correlating observed problems to root causes. Application Signals provides a comprehensive application performance monitoring (APM) solution in CloudWatch that enables you to connect SLOs to your APM experience. You can create SLOs using any metric available to you in CloudWatch, which makes it easy to get started with the metrics you already have today.

Let’s say you work for a fitness company with an application that users can log in to in order to see workouts and monitor their key fitness activities. You run this application on a fleet of EC2 instances behind an Application Load Balancer (ALB). One day, you receive an urgent notification from your support team that users are complaining because they don’t see any of their workouts after they log in to the app. After resolving the issue, you want to set an SLO to monitor availability so that you can better understand when large-scale events occur that degrade end-user experiences.

Let’s walk through how you can create an SLO to monitor availability using ALB metrics that you already have in CloudWatch. In this example, you will set a goal that, over a rolling 28-day interval, 99% of 1-minute metric periods have more than 95% of requests processed successfully. With this, you can be notified quickly when a minute falls short of the expected >95% successful requests, and you can also identify smaller issues that might not require immediate attention but will ultimately result in breaching your SLO.
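To get a feel for what this goal allows, the following sketch works out how many 1-minute periods fit in the 28-day window and how many of them can be bad before the 99% goal is missed. The function name is illustrative, and the arithmetic assumes every minute in the window is evaluated.

```python
import math


def bad_period_budget(window_days: int, period_seconds: int, goal_percent: float):
    """Return (total periods in the window, whole periods that may be bad)."""
    total_periods = window_days * 24 * 3600 // period_seconds
    allowed_bad = total_periods * (100.0 - goal_percent) / 100.0
    return total_periods, math.floor(allowed_bad)


# The example from this post: 28 rolling days, 1-minute periods, 99% goal.
total, allowed = bad_period_budget(28, 60, 99.0)
print(total)    # 40320 one-minute periods
print(allowed)  # 403 of them can be bad
```

So under this assumption, roughly 403 bad minutes (a little under 7 hours) over 28 days would use up the entire budget.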

  1. First, navigate to the SLO dashboard in the CloudWatch console, by selecting the Service Level Objectives (SLO) option found under the Application Signals tab in the left-hand navigation.
  2. Next, click Create SLO to define your SLI and SLO.
  3. From the SLO form, enter a name (for example, “My Availability SLO”) within the Set Service Level Objective (SLO) name field and select to use a CloudWatch Metric for SLI.
  4. Next, select the metric you want to evaluate. In this case, you want to evaluate whether the application is available. You will use a CloudWatch metric math expression to calculate the percentage of requests that did not result in a 5xx error in a 1-minute period. In the Select metrics widget, first select the ApplicationELB metric namespace, and then select Per AppELB metrics to find the metrics for ALB. Next, select two metrics, RequestCount and HTTPCode_ELB_5XX_Count. With these two metrics, you can measure the availability of the application as the percentage of requests that did not return a 5xx error.
  5. After selecting the RequestCount and HTTPCode_ELB_5XX_Count metrics, select the Graphed metrics tab to create a math expression that calculates the rate of requests that did not return a 5xx error. First, for both metrics, update the statistic to Sample count and update the period to 1 minute. Next, add the following expression to calculate the percentage of successful requests: ((totalrequests-failedrequests)/totalrequests)*100 (here, totalrequests and failedrequests are the IDs given to the RequestCount and HTTPCode_ELB_5XX_Count metrics). Last, deselect the original metrics so that the only metric selected in the CloudWatch metric window is the math expression, and click Select metric.
  6. Now that you’ve defined the metric you want to use for the SLI, set a condition that states whether your service is achieving its goal. For this, set a threshold of less than or equal to 95, which means that any minute where 95% or fewer of requests are successful will be considered a bad minute.
  7. Next, set your SLO goal and the interval over which you want to measure it. CloudWatch Application Signals gives you two options for the SLO time interval, rolling days or calendar months, with a maximum interval of 12 months. In this example, select 28 rolling days and set the goal to 99%. You can also choose when the SLO is designated as being in a warning state by setting the warning level threshold field.
  8. Optionally, choose to automatically create three alarms that notify you when the SLI doesn’t meet its threshold (SLI health), when the SLO goal is breached, and when you pass the warning threshold. Then select Create SLO.
  9. After creating the SLO, within minutes you will begin to see SLO metrics such as attainment and error budget populated on the SLO page. Application Signals also publishes attainment and SLI breach count metrics that can be used for more advanced alarm and dashboard use cases.
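The logic behind steps 4 through 7 can be sketched offline in plain Python. This is a simplified illustration and not how Application Signals is implemented: it takes hypothetical per-minute (total requests, failed requests) pairs, applies the same availability expression, flags minutes at or below the 95 threshold as bad, and reports the share of good minutes. Skipping minutes with no traffic is an assumption made for this example.

```python
from typing import Iterable, Optional, Tuple


def availability_percent(total_requests: float, failed_requests: float) -> Optional[float]:
    """Python equivalent of ((totalrequests-failedrequests)/totalrequests)*100."""
    if total_requests <= 0:
        # Assumption: a minute without traffic is skipped, not counted as bad.
        return None
    return (total_requests - failed_requests) / total_requests * 100.0


def is_bad_minute(availability: float, threshold: float = 95.0) -> bool:
    """A minute is bad when its availability is less than or equal to the threshold."""
    return availability <= threshold


def attainment_percent(minutes: Iterable[Tuple[float, float]],
                       threshold: float = 95.0) -> Optional[float]:
    """Percentage of evaluated minutes that were good."""
    values = [availability_percent(total, failed) for total, failed in minutes]
    evaluated = [v for v in values if v is not None]
    if not evaluated:
        return None
    good = sum(1 for v in evaluated if not is_bad_minute(v, threshold))
    return good / len(evaluated) * 100.0


# Hypothetical (total, failed) request counts for five consecutive minutes.
sample = [(1000, 2), (1000, 80), (0, 0), (500, 25), (800, 0)]
print(attainment_percent(sample))  # 50.0: two good minutes out of four evaluated
```

In this made-up sample, the minute with 80 failures and the minute sitting at exactly 95% both count as bad, so attainment is far below the 99% goal. Running the same idea over real data exported from CloudWatch can help you sanity-check a threshold before you commit to it.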

Next steps

Using Application Signals, you can create Service Level Objectives (SLOs) to focus on metrics with high business impact, helping you prioritize critical issues and continuously fine-tune your SLOs to better correlate with business KPIs. To speed up root cause identification, Application Signals provides a comprehensive view of application performance, integrating additional performance signals from CloudWatch Synthetics, which monitors critical APIs and user interactions, and CloudWatch RUM, which monitors real user performance.

  • See this blog post to learn how you can use CloudWatch Application Signals to easily see the performance of applications on AWS without needing to manually instrument the applications.
  • Watch the re:Invent 2023 video to learn how JPMorgan Chase used Amazon CloudWatch Application Signals to track performance against their business objectives.
  • To learn more about application monitoring using Amazon CloudWatch Application Signals, check this YouTube video.
  • Use our hands-on workshop to gain practical experience with this new capability. In it, you will learn how to use Application Signals to monitor Amazon EKS workloads running on Amazon EC2.

Conclusion

SLOs defined solely by business teams without input from technical teams can result in unachievable targets, leading to frequent breaches of Service Level Agreements and missed KPIs. Defining SLOs should be a collaborative process involving both business and technical teams to align technical realities with business objectives and customer expectations.

In this blog post, you learned about best practices for effective SLOs that will set your teams up for success by measuring performance objectively, reporting reliability accurately, and making alerts less disruptive and more actionable for a quick turnaround when incidents happen. Continuous improvement and periodic review of SLOs are essential to ensure they remain realistic and aligned with both the system’s capabilities and the business’s objectives. Changes to a system that could affect its performance should trigger a review of the associated SLOs. We are here to help, and if you need further assistance, reach out to AWS Support and your AWS account team.

About the Authors

Andreas Bloomquist

Andreas Bloomquist is a Sr. Product Manager with Amazon CloudWatch. He focuses on Application Observability and on helping customers monitor and assess the health of their applications and quickly arrive at the root cause of issues when they occur.

Michael Hausenblas

Michael works in the AWS open source observability service team where he is a Technical Product Manager and owns the AWS Distro for OpenTelemetry (ADOT) from the product side.

Arun Chandapillai

Arun Chandapillai is a Senior Infrastructure Architect who is a diversity and inclusion champion. He is passionate about helping his customers accelerate IT modernization through business-first Cloud adoption strategies and successfully build, deploy, and manage applications and infrastructure in the Cloud. Arun is an automotive enthusiast, an avid speaker, and a philanthropist who believes in ‘you get (back) what you give’. LinkedIn: /arunchandapillai