AWS Cloud Operations & Migrations Blog

Four APM features to elevate your observability experience

Application performance monitoring (APM) is the practice of tracking key application performance indicators to ensure system availability, improve system performance, and improve the end-user experience. This week we announced Amazon CloudWatch Application Signals, a new set of features built into Amazon CloudWatch to help you speed up troubleshooting and reduce application disruptions and operational costs, so that you can deliver the best experience for your end users. This blog post discusses the top new features that you can use to improve observability in your application workloads.

Application Signals automatically instruments your application to collect new standard application metrics (such as request volume, latency, faults, and errors) in CloudWatch metrics and AWS X-Ray traces, with no manual effort or custom code. It discovers services, API endpoints, and dependencies, and makes it easy to define service level objectives (SLOs) so that you can better monitor the experience of your users. Once you’ve created an SLO, you can alarm when the SLO breaches the threshold you define. A dashboard in the CloudWatch console provides visibility into the status of all your defined SLOs and is connected to the rest of your new monitoring experiences in Application Signals.

Feature #1: Discovered services and top operations

The Services link under Application Signals in the left navigation of the CloudWatch console shows all services that Application Signals has discovered across your instrumented workloads. In this example, Application Signals has discovered four services, summarized a number of top operations across these services, and generated graphs so I can view relevant metrics in this pre-built dashboard.

The Application Signals services dashboard shows the top services by fault rate, the health status of each service, and the top operations and dependencies across all services. The visits-service service shown here has the highest fault rate at 3.81%, and the pet-clinic-frontend service has an unhealthy status with 1/4 SLOs showing as unhealthy.

Figure 1: Application Signals services dashboard

By clicking into one of the services, you can dive deeper into assessing the health of the service in a new, unified view. You can see which service operations have SLOs, and each service operation that doesn’t yet have an SLO displays a Create SLO button that you can use to create a new SLO.

The Application Signals Services detail page shows a “Create SLO” button for each service operation that does not yet have an SLO defined.

Figure 2: Application Signals Services detail page

Clicking the Create SLO button allows you to create a new SLO. You must select either a reliability target (like availability or latency) or a CloudWatch metric to determine if your service is operating within your system’s tolerances. You can specify the interval over which the SLO is measured and the attainment target, and you can set up CloudWatch alarms on the service level indicator, the service level objective attainment, or both.

In the Set Service Level Objective (SLO) section, CloudWatch tells you what must be done to achieve that objective. For example, if you set the Service Level Indicator to Availability with a goal of greater than or equal to 99.9%, tracked over a one-day rolling interval, with an attainment goal of 99%, CloudWatch reports that to achieve that goal, you must have no more than 14 minutes and 24 seconds of availability below 99.9% in any one-day rolling interval.

Not only does this feature make it easy to set up SLOs on your various service operations, but it also provides clear guidance on how your service must operate to achieve that objective.
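The 14 minutes and 24 seconds in the example above is simply the 1% of a day that a 99% attainment goal leaves as error budget. The following is a minimal sketch of that arithmetic, not CloudWatch's implementation; the function name `allowed_bad_time` is hypothetical and introduced only for illustration.

```python
from datetime import timedelta


def allowed_bad_time(interval: timedelta, attainment_goal_pct: float) -> timedelta:
    """Return how much of the interval may fall below the SLI goal.

    Hypothetical helper: assumes the error budget is simply
    (100% - attainment goal) of the SLO interval.
    """
    budget_fraction = (100.0 - attainment_goal_pct) / 100.0
    return timedelta(seconds=round(interval.total_seconds() * budget_fraction))


# One rolling day with a 99% attainment goal, as in the example above.
budget = allowed_bad_time(timedelta(days=1), 99.0)
print(budget)  # 0:14:24
```

Tightening the attainment goal shrinks the budget proportionally: under the same assumption, a 99.9% attainment goal over one day would leave roughly a minute and a half.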

The Create a service level objective screen, where you set the service level indicator type, the service and operation, the condition that determines if the SLO is met or not, and the service level objective.

Figure 3: Creating a new Service Level Objective

Feature #2: Viewing correlated traces to diagnose anomalies

Application Signals makes it easy to view correlated traces for each of your SLOs, which you can use to identify and troubleshoot performance anomalies in your workload. From the Services left navigation menu, I select the service I am interested in viewing. In the service details page, I find a service operation I want to troubleshoot and view its metrics.

For each operation, I can view critical application metric graphs, and when I notice an anomaly, I can select a point on the graph that I want to investigate further (see figure 4 below). Once I click on a point on the graph, correlated traces for that observation appear. This feature correlates real-world impact with X-Ray traces, so you can understand how your workload is behaving and quickly learn the root cause of incidents or anomalies.

After selecting the POST /api/customer/owners operation, I click on a data point on the latency graph, which then displays correlated traces.

Figure 4: Selecting a point on the graph in the service operations view displays correlated traces

Once you click into the correlated trace, you have the ability to jump directly to container monitoring via CloudWatch Container Insights. Through this integration, you can go from service metrics to the exact pods that processed an individual request in just three clicks, without having to know how to query traces or how to connect individual transactions to the right container dashboard.

Feature #3: Service Level Objectives (SLO) dashboard

The Service Level Objectives dashboard allows you to view all your SLOs on a single screen and monitor their long-term budget health. From this screen, you can easily determine the status of each of your SLOs, which helps you to prioritize your operational activities by understanding how close you are to breaching your reliability objectives.

From the Services screen, I can click on the Service Level Indicator (SLI) status of one of my services to see which SLIs are unhealthy, as shown below.

The services page lists a service called pet-clinic-frontend with a status of 1/4 Unhealthy. By clicking on the 1/4 Unhealthy link, I can see a list of the SLIs that are unhealthy. By clicking on the unhealthy SLIs, I am taken to a detail screen.

Figure 5: Viewing the service health of a service’s SLIs

By clicking on the unhealthy SLI, I am taken to a detail screen for the SLI.

The Application Signals Service Level Objectives dashboard shows SLO attainment across all service level objectives that have been set up. The availability of the Scheduling a Visit service is in danger of breaching its SLO due to a DDoS attack that affected availability.

Figure 6: The Application Signals Service Level Objectives (SLOs) dashboard

One of my applications suffered a DDoS attack. While my site did not go down, the attack slowed the workload’s ability to respond to traffic, which impacted my availability SLO. When I select that SLO, CloudWatch displays a graph of SLO attainment, how the attack impacted my error budget, and the measured availability on that specific API endpoint. I can clearly see the breach in the availability SLI in the graph on the right, and I can see in the graph on the left that SLO attainment is hovering around 95%, well below my goal of 99%. In addition, just above the graphs, CloudWatch reports that I will not achieve the SLO if I have more than 14 one-minute periods of reduced availability in the next rolling day.
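To make the relationship between attainment and the 14-period budget concrete, here is a small sketch. It assumes attainment is the percentage of one-minute periods that met the SLI goal; that is an illustrative assumption rather than a statement of how CloudWatch computes it. The helper names and the sample data are hypothetical.

```python
import math


def period_budget(total_periods: int, attainment_goal_pct: float) -> int:
    """Whole periods that may miss the SLI goal while still meeting the attainment goal."""
    return math.floor(total_periods * (100.0 - attainment_goal_pct) / 100.0)


def attainment_pct(sli_values: list[float], sli_goal: float) -> float:
    """Percentage of periods whose SLI value met or exceeded the goal."""
    good = sum(1 for value in sli_values if value >= sli_goal)
    return 100.0 * good / len(sli_values)


# Hypothetical data: 1,440 one-minute availability readings for one rolling day,
# 72 of which fell below the 99.9% availability goal during an incident.
samples = [99.95] * 1368 + [97.0] * 72

budget = period_budget(len(samples), 99.0)
bad_periods = sum(1 for value in samples if value < 99.9)

print(budget)                          # 14
print(attainment_pct(samples, 99.9))   # 95.0
print(budget - bad_periods)            # -58, the budget is exhausted
```

With these made-up numbers, 72 bad minutes against a budget of 14 yields 95% attainment, which matches the shape of the scenario described above.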

The SLO dashboard makes it easy to understand attainment and error budget across all SLOs in a single view, providing clear guidance to the operator on what scenario may impact that attainment.

Feature #4: Integration with CloudWatch Synthetics and CloudWatch RUM

Monitoring your workload via back-end metrics provides only half of the story. It’s also important to understand the experience your users and consumer applications are having when using your web applications and API endpoints.

After drilling into the service operations detail of my API, I can click the Synthetics Canaries or the Client Pages tab of the Services view. From these screens, I can drill into the details of any canary runs, or I can see page view events from CloudWatch RUM. This integrated view helps me see all the critical application telemetry for my service in a single place so that I don’t need to switch between multiple tools when troubleshooting, saving me time. Even better, I didn’t have to do anything other than enable the X-Ray active tracing integration on my canaries and RUM app monitor for these to appear in Application Signals.

The Client Pages tab of the Services detail screen. CloudWatch RUM data is displayed, including the number of page loads over time, the number of seconds to produce Largest Contentful Paint, and the number of errors over time. The pet-clinic-frontend service has an average page load time of 40ms, a Largest Contentful Paint of 1.5s, and an average of 5 AJAX errors.

Figure 7: The Services page provides integration with Synthetics Canaries and CloudWatch RUM to better understand page view events.

Conclusion

In this blog post, I walked through some of the top new APM features that are available as part of CloudWatch Application Signals. These features are designed to make it easy to instrument and track your workload’s service level objectives. Application Signals doesn’t require any code changes to your EKS workloads, and it can automatically discover your services. By tracking your SLOs, you can ensure that you are providing a good experience for the end users of your applications.

With CloudWatch Application Signals, you pay based on the volume of requests being monitored. For more details on Application Signals pricing, visit the pricing page.

As a next step, get started today by enabling Application Signals on one of your EKS workloads.

About the author:

Mike George

Mike George is a Principal Solutions Architect based out of Salt Lake City, Utah. He enjoys helping customers solve their technology problems. His interests include software engineering, security, artificial intelligence (AI), and machine learning (ML).