Enhancing DevOps Practices with Amazon CloudWatch Application Performance Monitoring

Organizations seeking to deliver meaningful technology services at a higher velocity to their customers have incorporated application performance monitoring (APM) into their DevOps operating models. Software development and IT operations teams that have traditionally worked in their own silos now strive to work in concert to increase organizational agility. The transformation path is unique for each organization, with a combination of cultural changes and technology required to create and maintain long-term success.

As the pace of software development increases, it is important to maintain a strong focus on quality at scale. Any friction within the customer digital experience can lead to lost revenue and brand erosion. According to research by Unbounce, nearly 70% of consumers said that page speed influences their likelihood of making an online purchase. At AWS, we frequently hear from customers that they want to ensure their online users are satisfied with their digital experiences and that they need observability tools that can operate at scale. Based on this feedback, Amazon CloudWatch has grown to include real-time monitoring capabilities for both your Amazon Web Services (AWS) resources and the applications you run on AWS.

CloudWatch Synthetics allows teams to monitor applications through the lens of acceptable customer experience. With CloudWatch Synthetics, you create configurable scripts, called canaries, that run on a schedule to continually test your APIs and website experiences 24×7. Canaries follow the same code paths and network routes as real users, and can notify you of unexpected behavior including latency, page load errors, and broken user workflows.

CloudWatch Evidently is used to safely launch new application features by serving them to a specified percentage of your audience. Results can be monitored and validated before proceeding with a wider release. You can also create experiments to compare the performance of multiple versions of a feature.

CloudWatch RUM collects client-side metrics from your web application users and gives you insights into their hands-on experience. The data can be used to quickly identify and debug client-side performance bottlenecks and is aggregated based on location, device type, and browser to visualize themes and trends.

CloudWatch ServiceLens enhances the observability of your distributed applications by building a service map that depicts the relationships and dependencies between application endpoints and resources. Faults and latency are highlighted on the service map, which acts as a single point of access to correlated metrics, logs, and application traces.

In this post, we’ll explore how to strengthen your DevOps operating model by integrating CloudWatch application monitoring features into your technology stack.

Figure 1: Collection and analysis of data from CloudWatch Synthetics, RUM, and Evidently

We will focus on several foundational areas that you should consider when developing a holistic approach for your organization.

Gain alignment by focusing on the customer.
Leverage automation to realize results faster.
Increase transparency and improve communication.
Build a culture of accountability.
Foster a continuous improvement mindset.

Gain alignment by focusing on the customer

As development and operations teams join forces and task lists are combined, it is natural for competing priorities within your product backlog to arise. Create a shared focus by measuring the impact of each item as it relates to the customer, rather than basing it on the goals or preferences of the teams.

As your first joint effort, focus on measuring the customer digital experience to identify any friction or bottlenecks. Even without live user traffic, you can use CloudWatch Synthetics to test your applications 24×7 and detect anomalies before they affect your customers. Canaries can measure the availability and latency of both application endpoints and REST APIs, and confirm that those metrics still meet the standards defined in your service-level objectives (SLOs) and service-level agreements (SLAs).
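As a rough illustration of the SLO comparison described above, the following Python sketch evaluates a batch of canary run results against an availability target and a latency target. The CanaryRun record, its field names, and the threshold values are hypothetical and are not part of the CloudWatch Synthetics API; in practice you would read equivalent figures from the metrics that your canaries publish.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class CanaryRun:
    """Hypothetical record of one canary run (not a Synthetics API type)."""
    passed: bool
    duration_ms: float


def meets_slo(runs: Sequence[CanaryRun],
              min_availability_pct: float,
              max_avg_latency_ms: float) -> bool:
    """Return True when the runs satisfy both the availability and latency targets."""
    if not runs:
        # No data is treated as a breach so that a silent canary is noticed.
        return False
    availability = 100.0 * sum(r.passed for r in runs) / len(runs)
    avg_latency = sum(r.duration_ms for r in runs) / len(runs)
    return availability >= min_availability_pct and avg_latency <= max_avg_latency_ms
```

Treating an empty result set as a failure is a design choice in this sketch: a canary that has stopped reporting is usually worth investigating in its own right.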

Besides measuring performance, CloudWatch Synthetics performs visual monitoring to alert you of defects in the end-user browsing experience. As canaries run, screenshots are taken and compared to the baseline screenshot you designate. If the visual variation in any new screenshot exceeds the threshold you’ve set, the visual monitoring canary will fail. The differences are highlighted within the screenshots and can be reviewed with the canary results.

Take your focus on the customer one step further by using CloudWatch Evidently to implement feature flags within your application. Evidently gives you the ability to progressively launch new application features safely and incorporate experimentation to get feedback on customer preferences. By serving new features to only a portion of your users and gathering real-time feedback, you can more quickly make adjustments as needed during rollout. Define audience segments to customize launch traffic splits for a subset of users. Using a segment limits the launch to users that match the required criteria. For example, you can segment users by type of browser, geographic location, or data that your application tracks such as a membership level or user type.
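To make the idea of a traffic split concrete, here is a minimal sketch of a percentage-based rollout with an optional segment check. It illustrates the general technique only and does not reproduce how Evidently assigns users internally; the function names, the user attributes, and the hashing scheme are all assumptions made for this example.

```python
import hashlib
from typing import Mapping, Optional


def bucket(user_id: str, feature: str) -> int:
    """Map a user to a stable bucket in [0, 100) for the given feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def serves_new_feature(user_id: str,
                       feature: str,
                       rollout_pct: int,
                       attributes: Optional[Mapping[str, str]] = None,
                       segment: Optional[Mapping[str, str]] = None) -> bool:
    """Decide whether a user sees the new variation.

    A user qualifies only if every key in `segment` matches the user's
    attributes and the user's bucket falls below `rollout_pct`.
    """
    if segment:
        attrs = attributes or {}
        if any(attrs.get(key) != value for key, value in segment.items()):
            return False
    return bucket(user_id, feature) < rollout_pct
```

Because the bucket is derived from a hash of the feature name and user ID, a given user receives the same decision on every request, and raising rollout_pct only adds users to the launch rather than reshuffling them.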

With A/B experiments you can test multiple variations of your features and use the data to inform future design decisions. Multiple metrics can be tracked for the duration of each experiment and will be aggregated by Evidently to show the total and average values. Data collected by Evidently is analyzed and recommendations are made based on a statistical analysis of the results. The analysis performed determines how likely it is that an experiment variation had a direct impact on the metrics you’re tracking.

Figure 2: CloudWatch Evidently experiment results showing variation performance
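Evidently performs this statistical analysis for you, and this post does not describe the method it uses. Purely to show what "how likely it is that a variation had an impact" can mean in practice, the sketch below applies a textbook two-proportion z-test to conversion counts from two variations. It is a stand-in technique, not Evidently's algorithm, and the function name and inputs are hypothetical. Both sample sizes are assumed to be greater than zero.

```python
import math
from typing import Tuple


def two_proportion_z_test(conv_a: int, n_a: int,
                          conv_b: int, n_b: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        # Both variations converted all or none of their users: no evidence of a difference.
        return 0.0, 1.0
    z = (p_b - p_a) / std_err
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

A small p-value suggests that the observed difference between variations is unlikely to be due to chance alone; a large one suggests the experiment needs more data or that the variations perform similarly.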

Leverage automation to realize results faster

Manual operations and incident response procedures can detract from the customer experience when system issues inevitably occur. Any important tasks should be automated and tested regularly to ensure effectiveness. By implementing a proactive detection strategy using CloudWatch Synthetics canaries, you’ll be able to discover issues faster and reduce your mean time to identify (MTTI) and mean time to resolve (MTTR) issues.

Because your canaries run 24×7, they generate a wealth of data that developers can use to quickly pinpoint the root cause of issues, minimizing the impact on your customers' experience. You can collect fine-grained metrics and logs from your servers using the CloudWatch agent, automatically link them to AWS X-Ray tracing data, and visualize results using CloudWatch ServiceLens.

Because canaries are defined using code, they can be stored along with application source code and maintained using the same process. Besides continuously monitoring your production applications, the canary tests can be integrated with your build pipeline so they’re run against your test environments automatically after a new application deployment to validate that the user experience has not been negatively impacted. You’ll be able to deliver software more frequently and with greater confidence by performing automated testing before your code changes are promoted to production.
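One way to wire a canary into a build pipeline is a small gate script that waits for the canary's latest run and fails the stage if the run did not pass. The sketch below assumes that `client` behaves like the boto3 Synthetics client, and that its get_canary_runs call returns the most recent run first with the state under CanaryRuns[0].Status.State; treat that response shape as an assumption to verify against the SDK documentation. The function name, polling defaults, and injectable sleep are illustrative choices.

```python
import time
from typing import Any, Callable


def wait_for_canary(client: Any,
                    canary_name: str,
                    max_polls: int = 30,
                    poll_seconds: float = 10.0,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll the latest run of a canary and return True only if it passed.

    `client` is assumed to behave like boto3.client("synthetics").
    """
    for _ in range(max_polls):
        response = client.get_canary_runs(Name=canary_name, MaxResults=1)
        runs = response.get("CanaryRuns", [])
        state = runs[0]["Status"]["State"] if runs else "RUNNING"
        if state == "PASSED":
            return True
        if state == "FAILED":
            return False
        sleep(poll_seconds)  # still running, or no run recorded yet
    return False  # timed out: treat as a failure so the pipeline stops
```

A production gate would also need to confirm that the run it inspects started after the deployment finished, for example by comparing the run's start time with the deployment time, so that a stale result from before the deployment is not mistaken for validation of the new build.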

Increase transparency and improve communication

Once you’ve determined your performance goals and implemented associated canary tests, it’s important to bring visibility to that information so that it can be acted upon as quickly as possible. All of your key metrics and indicators should have associated alarms that alert team members when thresholds are breached. For a more comprehensive approach, configure multiple threshold-based alerts that escalate communication as problems persist.
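The escalation idea can be sketched as two alarm definitions on the same canary metric: a sensitive warning tier and a stricter critical tier that fires only when the problem persists. The example below builds the definitions as plain dictionaries using CloudWatch PutMetricAlarm parameter names and the SuccessPercent metric that Synthetics publishes in the CloudWatchSynthetics namespace. The threshold values, alarm names, and notification topic ARNs are placeholders chosen for illustration.

```python
from typing import Any, Dict, List


def tiered_alarm_definitions(canary_name: str,
                             warning_topic_arn: str,
                             critical_topic_arn: str) -> List[Dict[str, Any]]:
    """Build warning and critical alarm definitions for one canary."""
    base = {
        "Namespace": "CloudWatchSynthetics",
        "MetricName": "SuccessPercent",
        "Dimensions": [{"Name": "CanaryName", "Value": canary_name}],
        "Statistic": "Average",
        "Period": 300,  # seconds per evaluation period
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "breaching",  # a silent canary counts as a problem
    }
    # Warning: any single period below 99% success notifies the team channel.
    warning = dict(base, AlarmName=f"{canary_name}-success-warning",
                   Threshold=99.0, EvaluationPeriods=1,
                   AlarmActions=[warning_topic_arn])
    # Critical: three consecutive periods below 95% success escalates further.
    critical = dict(base, AlarmName=f"{canary_name}-success-critical",
                    Threshold=95.0, EvaluationPeriods=3,
                    AlarmActions=[critical_topic_arn])
    return [warning, critical]
```

Each dictionary could then be passed to a CloudWatch client, for example `boto3.client("cloudwatch").put_metric_alarm(**definition)`, or translated into the equivalent infrastructure-as-code resource.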

Once an alert reaches its intended audience, CloudWatch dashboards can be used as a central source of operational data, depicting the current state of your environment and creating a shared organization-wide view. For support engineers who must investigate further, the CloudWatch alarms can also trigger the creation of an OpsItem in AWS Systems Manager OpsCenter. The OpsItem can be used to track the issue and the actions taken going forward. To add an extra level of visibility, you may also choose to route your alarm events to one of the Amazon EventBridge API destination partners.

Build a culture of accountability

With the ability to write and execute test cases in your development environment, you can better track areas of responsibility across team members. Monitor your tests and tie results back to individuals on your team to drive ownership as applications change over time. This can accelerate the pace of development, both for front-end developers with user interface testing requirements and for API developers who need to confirm the functionality of their REST API endpoints.

As your suite of canary tests grows, you may see the need to organize and track them in groups to more easily isolate and drill into canary failures. To facilitate this, CloudWatch Synthetics recently added support for custom canary groups, which give you the ability to view the results of your canary tests at an application or group level. In addition to drilling into individual test results, you can view aggregated run results and statistics for all canaries within a group, making it easier to pinpoint issues across your set of tests. Canary groups are commonly based on application component, business function, business criticality, or customer workflow. When you create a group, it is replicated across all AWS Regions that support groups. You can add canaries from any of these Regions to the group. A canary can be a member of multiple groups. By tailoring these groups to your DevOps teams, you can more clearly delineate responsibilities and define what actions should be taken in case of a canary failure.

Figure 3: CloudWatch Synthetics canary list dashboard showing canary run results by group
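The group-level view shown in the console can be approximated in a few lines if you want the same rollup in your own reporting. The sketch below is a hypothetical aggregation over run outcomes you have already collected; the data shapes and function name are assumptions and do not call the Synthetics API. It reflects the point above that a canary can belong to several groups, so a single run can count toward each of them.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Tuple


def pass_rate_by_group(memberships: Mapping[str, Iterable[str]],
                       results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Aggregate canary run outcomes into a pass rate (0 to 100) per group.

    `memberships` maps a canary name to the groups it belongs to.
    `results` is a sequence of (canary name, passed) pairs.
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for canary, ok in results:
        # Runs from canaries with no group membership are ignored.
        for group in memberships.get(canary, ()):
            total[group] += 1
            passed[group] += int(ok)
    return {group: 100.0 * passed[group] / total[group] for group in total}
```

Mapping each group to an owning team then gives a simple way to route a failing group's results to the people responsible for it.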

Foster a continuous improvement mindset

The data collected as you monitor your applications is commonly used for point-in-time investigation and troubleshooting. The data can also be valuable when aggregated over time. Use this historical data to identify performance degradation and bottlenecks so you can address them before they affect your end users.

To complement your synthetic monitoring, incorporate additional data from your end users by performing real-user monitoring with CloudWatch RUM to understand how your application is being used by actual users. This will help to inform your product roadmap and allow you to make data-driven improvements to the customer experience. Use the same tools to monitor the effectiveness of changes made to your applications over time.

CloudWatch Evidently can be used for greater control over the feature launch process and to incorporate A/B testing and experimentation. Safely validate changes in functionality by exposing them only to a small percentage of customers that you choose. The ability to experiment and make adjustments based on results allows DevOps teams to work at an accelerated pace.

With extended metrics you can send relevant metric data from CloudWatch RUM to CloudWatch and CloudWatch Evidently. You can then trigger an alarm to be notified of problems identified once the metric data arrives in CloudWatch, for example an increase in JavaScript errors for a particular browser. Any of the CloudWatch RUM metrics can be sent to CloudWatch. Relevant performance and user experience metrics can be sent to Evidently to be used in experiments. This ability to monitor data, alarm, and make adjustments in near real-time will allow you to deliver value to your customers faster.

Conclusion

Amazon CloudWatch application monitoring features help AWS customers gain a unified view of operational health, giving them the ability to proactively monitor the customer experience at scale. With visibility into the experience of real users, 24×7 synthetic user monitoring, and the ability to safely launch features, you can reduce the time it takes to identify and resolve issues that impact your customers' experience.

To learn more about CloudWatch application monitoring, see the CloudWatch Application monitoring documentation.

To learn more about AWS monitoring and observability best practices, see the AWS Well-Architected Framework Management and Governance Cloud Environment Guide.

AWS Cloud Operations & Migrations Blog