AWS Cloud Operations Blog
Improve your application availability with AWS observability solutions
Distributed systems are complex due to their high number of interconnected components and their susceptibility to failures caused by constant updates. Whether it is a legacy monolithic application distributed across instances and geographic locations, or a set of microservices, these systems rely on thousands of resources to operate and can be updated frequently, scaled elastically, or invoked on demand. In turn, these components can generate billions of metrics, logs, and traces.
Other challenges arise when it comes to the observability of a complex distributed system. Different teams within an organization tend to adopt the tools they are most familiar with. While this tooling flexibility is great for individual teams, the organization ends up with data that is collected, stored, and visualized in different silos. So, how do you correlate across metrics, logs, traces, and events stored in different tools? How do you determine how one data signal relates to another in order to isolate the issue faster?
Monitoring is about collecting and analyzing the system state to observe performance over time, whereas observability is about how well you can infer the internal state of a system's components from the external signals it produces. If you set up good observability in your environment by using the right tools and processes, it can also enable you to answer questions you did not know you needed to ask.
This post will take an example use case and look at how AWS observability services help you solve these challenges, while giving you the ability to get a holistic view of your distributed systems.
Issue Timeline
Let us look at a typical issue timeline.
Figure 1: Typical Issue Timeline
When an incident occurs, Operations teams are notified via an email notification or a pager going off, and critical time is lost just detecting the issue. The mean time it takes to detect issues in your environment is called Mean Time to Detect (MTTD). The phase where teams spend the most time is the identification phase, because this is where you try to determine the root cause of the issue. This can be short for a repeated, known issue, but in most cases it is much more complex, especially in a distributed system environment. The mean time it takes to identify issues in your environment is called Mean Time to Identify (MTTI).
Our primary focus is reducing MTTD and MTTI by setting up systems and processes that quickly detect issues and identify their root cause.
Figure 2: AWS Observability Tools
AWS observability tools offer various services for gaining visibility and insights into your workload’s health and performance. These tools are built on the strong foundations of logs, metrics, and traces in Amazon CloudWatch and AWS X-Ray. They enable you to collect, access, and correlate data on a single platform across all of your AWS resources, applications, and services running on AWS and on-premises servers. This helps you break down data silos so you can easily gain system-wide visibility and quickly resolve issues. You can also derive actionable insights to improve operational performance and resource optimization.
Reference Application
We launched the One Observability Demo Workshop in August 2020 that utilizes an application called PetAdoptions that is available on GitHub. We will use that as the reference application for this demonstration. It is built using a microservices architecture, and different components of the application are deployed on various services, such as Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Elastic Container Service (Amazon ECS), AWS Lambda, Amazon API Gateway, Amazon DynamoDB, Amazon Simple Queue Service (Amazon SQS), Amazon Simple Notification Service (Amazon SNS), and AWS Step Functions. The application architecture is shown in the following diagram.
Figure 3: Pet Adoption Application Architecture
As illustrated in the diagram, the application is deployed on various services and written using different programming languages, such as Java, C#, Go, Python, and Node.js. The service components collect traces, metrics, and logs, which are then sent to CloudWatch and X-Ray.
Use Case
Let’s consider a scenario: imagine your pager went off this morning because the most visited page of your PetAdoptions application, the search page, has gone down. As a member of the Operations team, you must find the root cause of the problem, along with a strategy for being proactively notified of such issues and an associated remediation strategy.
Figure 4: Broken Search Page
Let us navigate to the CloudWatch ServiceLens console to begin the investigation. The CloudWatch ServiceLens feature lets you easily correlate logs, metrics, and traces to gain a deeper perspective on the issues you are investigating. It also offers a visual understanding of performance and service health by automatically resizing nodes, adjusting edge sizes based on requests or latency, and highlighting problematic services for easy problem identification. ServiceLens also lets you dive deep into other specific features to conduct extensive analysis and garner insights for a particular resource type. As a prerequisite, you must have X-Ray enabled within your application. Note that the PetAdoptions application is already instrumented using the AWS X-Ray Go SDK for trace collection. Refer to the workshop source code for an implementation reference.
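For illustration, here is a minimal sketch of what similar instrumentation looks like with the X-Ray SDK for Python. The workshop's search service actually uses the Go SDK, and the search_pets function and the "PetSearch" service name are assumptions made only for this example:

```python
# A sketch of X-Ray instrumentation using the Python SDK (aws-xray-sdk).
# The actual PetSearch service uses the AWS X-Ray Go SDK; the function name
# search_pets and the "PetSearch" service name are illustrative assumptions.
from aws_xray_sdk.core import xray_recorder, patch_all

# Patch supported libraries (boto3, requests, and others) so downstream
# calls show up as subsegments in the trace.
patch_all()
xray_recorder.configure(service="PetSearch")


@xray_recorder.capture("search_pets")
def search_pets(pet_type, pet_color):
    # Application logic would go here; the decorator records a subsegment
    # with timing information and any exception that is raised.
    return {"pettype": pet_type, "petcolor": pet_color}


if __name__ == "__main__":
    # Outside a web framework or Lambda, open a segment explicitly.
    with xray_recorder.in_segment("PetSearch"):
        search_pets("puppy", "brown")
```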
You can see that the PetSearch component, which is an Amazon ECS container, has 2% faults in the same timeframe in which you received the alert.
Figure 5: CloudWatch ServiceLens service map
When you click a particular node, the Service Map displays the metrics and any associated alerts. The PetSearch container metrics depict the decline in requests and the increase in fault percentage during the selected timeframe. Since the selected node is an ECS container, you can navigate directly to CloudWatch Container Insights for a deeper analysis. Moreover, you can view the associated traces or go to a more detailed dashboard within the same panel.
Figure 6: PetSearch node
Click “2% Faults (5xx)” to navigate to the traces section, with a filter applied for node status “Fault” for the selected service in the given timeframe. You can also filter the traces based on several available filter options, including custom expressions and filters. The list of traces displayed at the bottom of the page is narrowed down to the traces matching your filter. In this case, all faulty traces for the PetSearch container are displayed.
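The same fault filter can also be applied programmatically when you want to pull matching traces outside of the console. Here is a minimal sketch using boto3; the one-hour window and the filter expression scoped to the PetSearch container are assumptions for this example:

```python
# Sketch: fetch recent faulty traces for the PetSearch container with boto3.
# The filter expression mirrors the console filter; adjust names as needed.
import boto3
from datetime import datetime, timedelta, timezone

xray = boto3.client("xray")
end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

filter_expression = 'service(id(name: "PetSearch", type: "AWS::ECS::Container")) { fault }'
paginator = xray.get_paginator("get_trace_summaries")

for page in paginator.paginate(
    StartTime=start,
    EndTime=end,
    FilterExpression=filter_expression,
):
    for summary in page["TraceSummaries"]:
        print(summary["Id"], summary.get("ResponseTime"))
```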
Select any of the traces to get into the trace details. This shows how the request travels through the application, along with the trace summary and segments timeline that shows all of the actions that took place in the application. You can see there is a 500 internal server error within the search function of the SearchController class.
Figure 7: Service map and segment timeline for a particular trace
If you scroll further down on the page, you can see the segment details, such as internal IP, container ID, SDK version, etc., for the selected segment of this particular trace. Note the task ID listed in the segment detail for later reference.
Figure 8: Segment detail of the selected node in the segment timeline
Furthermore, you can navigate to a specific feature to conduct deeper analysis for the selected resource type. In this case, since the resource type is an ECS container, you can navigate to CloudWatch Container Insights from within the same trace detail console. Select the relevant node from the map view and navigate to the dashboard to analyze key metrics such as CPU and memory utilization and network traffic for that particular resource.
Figure 9: CloudWatch Container Insights
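The key metrics shown on the Container Insights dashboard can also be retrieved programmatically. Here is a minimal sketch using boto3 and the ECS/ContainerInsights metric namespace; the cluster and service names are placeholders, and you should verify the metric names against the Container Insights metrics available in your account:

```python
# Sketch: query Container Insights CPU and memory metrics for an ECS service.
# Cluster and service names below are placeholders for the PetSearch service.
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

dimensions = [
    {"Name": "ClusterName", "Value": "PetSearch-cluster"},   # placeholder
    {"Name": "ServiceName", "Value": "PetSearch-service"},   # placeholder
]

response = cloudwatch.get_metric_data(
    MetricDataQueries=[
        {
            "Id": "cpu",
            "MetricStat": {
                "Metric": {
                    "Namespace": "ECS/ContainerInsights",
                    "MetricName": "CpuUtilized",
                    "Dimensions": dimensions,
                },
                "Period": 300,
                "Stat": "Average",
            },
        },
        {
            "Id": "memory",
            "MetricStat": {
                "Metric": {
                    "Namespace": "ECS/ContainerInsights",
                    "MetricName": "MemoryUtilized",
                    "Dimensions": dimensions,
                },
                "Period": 300,
                "Stat": "Average",
            },
        },
    ],
    StartTime=start,
    EndTime=end,
)

for result in response["MetricDataResults"]:
    print(result["Id"], list(zip(result["Timestamps"], result["Values"])))
```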
You can see the tasks that are part of the selected service at the bottom in the task performance table. Select the task and navigate to the application and performance logs from the Actions menu. Note this is the same task highlighted in Figure 8.
Figure 10: Tasks Performance table for the selected ECS service
Click “View performance logs” from the Actions menu, which takes you to CloudWatch Logs Insights with the relevant log group automatically selected and a sample query that fetches performance logs from the selected task for the given timeframe. Click “Run query” to examine the task logs.
Figure 11: CloudWatch Logs Insights query results
As you can see in Figure 11, a Java NullPointerException led to this failure. On further code analysis, it was identified that key attributes were missing from one of the pet entries in the DynamoDB table. This table stores the pet types and related metadata, such as pet color and availability, and the search function failed because no null checks were applied when using these attributes.
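The console pre-populates the query for you, but the same logs can also be queried programmatically. Here is a minimal sketch using boto3 and CloudWatch Logs Insights; the log group name and the exception filter are assumptions for this example:

```python
# Sketch: run a CloudWatch Logs Insights query for recent exceptions.
# The log group name is a placeholder for the PetSearch task's log group.
import time
import boto3
from datetime import datetime, timedelta, timezone

logs = boto3.client("logs")
end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

query_id = logs.start_query(
    logGroupName="/ecs/PetSearch",  # placeholder log group
    startTime=int(start.timestamp()),
    endTime=int(end.timestamp()),
    queryString=(
        "fields @timestamp, @message "
        "| filter @message like /NullPointerException/ "
        "| sort @timestamp desc "
        "| limit 20"
    ),
)["queryId"]

# Poll until the query completes, then print matching log lines.
while True:
    result = logs.get_query_results(queryId=query_id)
    if result["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in result.get("results", []):
    print({field["field"]: field["value"] for field in row})
```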
From the traces section’s filters, you can jump to X-Ray Analytics via the X-Ray analytics button. This takes you to the Analytics screen with a filter expression already applied, showing the traces that satisfy this condition. In this case, it automatically applied a filter for all faults of the PetSearch container within the given timeframe.
Figure 12: AWS X-Ray Analytics
The AWS X-Ray Analytics console is an interactive tool for interpreting trace data in order to understand how your application and its underlying services are performing. The console lets you explore, analyze, and visualize traces through interactive response time and time-series graphs. By utilizing X-Ray analytics, you can filter traces using filter expressions, and create X-Ray groups based on specific filter expressions.
In addition to manually analyzing traces, applying filter expressions, and visualizing traces through interactive response time and time-series graphs, AWS X-Ray continuously analyzes trace data and automatically identifies anomalous services and potential issues for your applications’ service faults. When fault rates exceed the expected range, it creates an insight that records the issue and tracks its impact until it is resolved.
Figure 13: X-Ray Insight generated for PetSearch Container faults
As you can see in Figure 13, X-Ray has generated an insight for our PetSearch container issue and identified that 31% of the overall client requests failed for this reason. X-Ray Insights summarizes every identified insight, along with its description. When you click an insight in the list, you see the insight status, the root cause service that caused the anomaly, and the overall impact during that period, among other details.
Figure 14: AWS X-Ray Insights detailed view
You can see the timeline that displays insight duration under the Anomalous services section. Check more details related to the actual and predicted number of faults by hovering over the insight identification time. The service map in Figure 14 also clearly identifies the anomalous service, and you can see the total impact due to this event under the impact section. With X-Ray insights, you can also receive notifications as the issue changes over time. Insight notifications can be integrated with your monitoring and alerting solution using Amazon EventBridge. This integration enables you to send automated emails or alerts based on the severity of the issue. Look for more details in the next section.
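Insights can also be retrieved programmatically, for example to feed an internal dashboard. Here is a minimal sketch using boto3; the group name and the one-day window are assumptions for this example:

```python
# Sketch: list recent X-Ray insights for a group and print their summaries.
import boto3
from datetime import datetime, timedelta, timezone

xray = boto3.client("xray")
end = datetime.now(timezone.utc)
start = end - timedelta(days=1)

response = xray.get_insight_summaries(
    GroupName="Default",   # or a custom group such as "PetSearchFault"
    StartTime=start,
    EndTime=end,
)

for insight in response["InsightSummaries"]:
    print(insight["InsightId"], insight["State"], insight.get("Summary"))
```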
New Troubleshooting Workflow
At the start of this post, we discussed a typical troubleshooting workflow and the associated challenges (as depicted in Figure 1). With the AWS observability service features explained above, the new troubleshooting workflow looks drastically different. With the help of the PetSearch use case, we saw the difference between the original approach and the new approach that uses AWS observability tools.
If you look at the identification phase for our use case, you used ServiceLens to correlate logs, metrics, and trace data, which also guided you to the right tool for deeper analysis without switching context or scope. This is critical to shortening the path to root cause identification. Moreover, it reduces the overall effort required to identify the issue, and therefore the overall MTTI. Our original goal was to reduce the MTTI, which contributes to better application availability, improved customer experience, and business growth. The process might vary depending on factors such as the type of environment, so use this as an approximation of an ideal scenario.
Figure 15: New Troubleshooting workflow
We have seen how AWS observability services can reduce the time required to identify the root cause of an issue reported by end users. However, instead of waiting for your end users to report errors, or applying custom error detection mechanisms, you can receive proactive alerts by using CloudWatch Synthetics. You can also be notified by an insight that X-Ray generates from anomalies it identifies in collected traces using machine learning (ML). These mechanisms can shorten the MTTD.
CloudWatch Synthetics lets you conduct canary monitoring. Canaries are configurable scripts that run on a schedule for proactive monitoring. They can monitor availability, test API endpoints, and even automate UI interactions to test user flows. Canary runs traverse the same paths through your application as real user requests would.
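As a rough illustration, here is a minimal sketch of a Python API canary handler; the endpoint URL and pet types are placeholders, and a real canary for this workshop would follow the blueprints in the workshop repository. A canary run is marked as failed when the handler raises an exception:

```python
# Sketch of a Python Synthetics canary that checks the pet search API.
# The endpoint URL is a placeholder; a real canary for this workshop would
# target the PetSearch API behind the application's load balancer.
import json
import urllib.request


def check_search_endpoint(url):
    # The canary run is marked as failed if this raises an exception.
    with urllib.request.urlopen(url, timeout=10) as response:
        if response.status != 200:
            raise RuntimeError(f"Unexpected status code: {response.status}")
        body = json.loads(response.read())
        if not body:
            raise RuntimeError("Search returned an empty result set")


def handler(event, context):
    # Check a few representative pet types, mirroring real user searches.
    base_url = "https://example.com/api/search"  # placeholder endpoint
    for pet_type in ("puppy", "kitten", "bunny"):
        check_search_endpoint(f"{base_url}?pettype={pet_type}")
    return "Succeeded"
```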
When you navigate to CloudWatch Synthetics, you see a list of canaries along with their execution history for the current account in the selected Region (note that the code for these canaries is available in the AWS Samples GitHub repository). The API canary monitors the Search API for different pet types. As you can see, the api-canary success percentage is only 56%, so you could be proactively notified before end users encounter the same issue.
Figure 16: CloudWatch Synthetics – Canaries
You can create a CloudWatch alarm based on various metrics available under the “CloudWatchSynthetics” namespace, such as success percentage. Configure the alarm to fire either when the success percentage falls below a specified threshold within a stipulated time, or when the value goes outside of the anomaly detection band for the selected metric.
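Here is a minimal sketch of such an alarm using boto3; the canary name, threshold, and SNS topic ARN are assumptions for this example:

```python
# Sketch: alarm when the api-canary success percentage drops below 90%.
# The SNS topic ARN is a placeholder for your notification topic.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="api-canary-success-percentage",
    Namespace="CloudWatchSynthetics",
    MetricName="SuccessPercent",
    Dimensions=[{"Name": "CanaryName", "Value": "api-canary"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=1,
    Threshold=90,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder
)
```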
X-Ray Insights anomaly detection
AWS X-Ray helps you analyze and debug distributed applications. AWS X-Ray Insights uses anomaly detection to create actionable insights from anomalies identified in your applications’ traces. Typical use cases for this feature include getting an alert when the fault rate for a specific service is over X%, when anomalies are detected for a specified group, when the root cause of the issue is a specified service, or when an insight state becomes active or closed.
You can either use the existing default group or create a new X-Ray group with a particular filter expression, and enable insights and notifications to allow X-Ray to send notifications to Amazon EventBridge. In this case, we have created a "PetSearchFault" group with the filter expression service(id(name: "PetSearch", type: "AWS::ECS::Container")) { fault }, which logically segregates all of the faults for the PetSearch container.
Figure 17: Custom X-Ray group for PetSearch Container faults
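For reference, here is a minimal sketch of creating the same group using boto3 instead of the console; the group name and filter expression mirror the ones shown above:

```python
# Sketch: create the PetSearchFault X-Ray group with insights and
# notifications enabled, using the same filter expression as the console.
import boto3

xray = boto3.client("xray")

xray.create_group(
    GroupName="PetSearchFault",
    FilterExpression='service(id(name: "PetSearch", type: "AWS::ECS::Container")) { fault }',
    InsightsConfiguration={
        "InsightsEnabled": True,
        "NotificationsEnabled": True,
    },
)
```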
Once notifications are enabled on the X-Ray group, you can create an Amazon EventBridge rule for incoming events from X-Ray Insights when an anomaly is detected for the “PetSearchFault” group. Configure the target as an SNS topic to receive the corresponding notifications. Pass the matched event either as is, or customize the input using an input transformer before passing it to the target. Note that you can configure any other available target based on the monitoring strategy you would like to implement. (Refer to Amazon EventBridge integrations for all available partner integrations.)
Figure 18: Create rule page in the Amazon EventBridge console
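Here is a minimal sketch of the same rule using boto3; the rule name, the detail field used in the event pattern, and the SNS topic ARN are assumptions for this example, so verify the pattern against the pre-defined X-Ray Insights pattern shown in the EventBridge console:

```python
# Sketch: route X-Ray Insights events for the PetSearchFault group to SNS.
# The SNS topic ARN is a placeholder; verify the event pattern against the
# pre-defined X-Ray Insights pattern in the EventBridge console.
import json
import boto3

events = boto3.client("events")

events.put_rule(
    Name="xray-insights-petsearchfault",
    EventPattern=json.dumps({
        "source": ["aws.xray"],
        "detail": {"GroupName": ["PetSearchFault"]},  # assumed detail field
    }),
    State="ENABLED",
)

events.put_targets(
    Rule="xray-insights-petsearchfault",
    Targets=[{
        "Id": "ops-sns-topic",
        "Arn": "arn:aws:sns:us-east-1:123456789012:ops-alerts",  # placeholder
    }],
)
```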
Conclusion
This post walked through how AWS observability tools can help you easily correlate logs, metrics, and traces to quickly identify the root cause of an issue. CloudWatch also has purpose-built features for specific resource types, such as containers and Lambda functions, to garner deeper insights into those environments. We have also shown how to proactively receive alerts by using canaries and the ML capabilities of X-Ray Insights, as well as Amazon EventBridge, to take remediation actions. All of these features collectively help reduce MTTD and MTTI and improve the overall availability of your application.