AWS Cloud Operations & Migrations Blog

Best practices: Implementing observability with AWS

As customers deploy cloud-based solutions, they need to ensure that systems run smoothly and that they can quickly remediate issues when they arise. Deploying observability at scale can be challenging, especially when it involves tens or hundreds of services across an enterprise. Customers want best practice recommendations, guidance on tool selection, and, most importantly, a step-by-step process for getting started. To simplify the process of implementing a robust observability strategy with AWS, we have put together a best practices guide. In this post, we explore the topics covered in the guide, how you can benefit from it, and how you can contribute to it. The best practices guide provides a roadmap for customers to get started and to evolve their observability strategy to address more complex scenarios.

Topics Covered in the Best Practices Guide
The best practices in the guide are organized by AWS services, by data types, and by specific observability tools. The guide also contains curated recipes derived from actual customer engagements and customer feedback. Curated recipes are templated solutions that help users get started with observability based on their needs and desired outcomes. If you are just getting started with monitoring and observability, you can begin with the general best practices and branch out to other sections based on the tools and data types that you choose. If you are looking to mature an existing observability strategy, you can go directly to the sections that interest you. Whichever approach you take, the best practices guide recommends that you plan for observability proactively instead of adding it as an afterthought later in development.

The best practices guide covers a wide range of scenarios, such as choosing the right tool when you are getting started, additional considerations for a hybrid or multi-cloud environment, and using machine learning to manage baselines and identify anomalies.

The guide also states that while it may be tempting to gather as much data as possible, doing so can lead to system degradation, tedious analysis, and cost inflation. It therefore provides guidance on focusing only on the metrics that matter. These metrics vary from business to business. For example, a payment processor may track transaction processing time, while a university may want to track student attendance. You should then decide which telemetry data to capture based on its impact on those metrics. The guide also advises collecting telemetry data across all the tiers of the workload. In many cases, you need to troubleshoot in the context of an end user, so a single unique identifier that ties together the insights from data across tiers is critical. In addition, the guide provides useful information on how to choose the right tracing agent.
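To make the idea of a single unique identifier concrete, here is a minimal Python sketch. It is not taken from the guide: the function names, the tier names, and the `request_id` field are all hypothetical, and a real workload would typically rely on a tracing library to generate and propagate the identifier. The sketch only shows why one shared value makes logs from different tiers easy to join.

```python
import json
import uuid


def new_request_id() -> str:
    """Create one identifier per end-user request (hypothetical helper)."""
    return uuid.uuid4().hex


def log_event(tier: str, message: str, request_id: str) -> str:
    """Build a JSON log line that carries the shared request identifier."""
    return json.dumps({"tier": tier, "request_id": request_id, "message": message})


def handle_request() -> list[str]:
    """Simulate one request passing through three tiers of a workload."""
    request_id = new_request_id()  # generated once, at the edge
    return [
        log_event("web", "received checkout request", request_id),
        log_event("app", "validated cart", request_id),
        log_event("database", "wrote order row", request_id),
    ]


def lines_for_request(lines: list[str], request_id: str) -> list[dict]:
    """Filter log lines from any tier down to a single end-user request."""
    parsed = [json.loads(line) for line in lines]
    return [entry for entry in parsed if entry["request_id"] == request_id]
```

Because every tier logs the same `request_id`, a single filter is enough to reconstruct what happened to one user's request, even when the log lines come from different systems.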

The guide has individual sections outlining best practices for monitoring Amazon Elastic Compute Cloud (Amazon EC2) and databases. It provides special attention to implementing observability for containers with subsections dedicated to collecting system and service metrics for Amazon Elastic Container Service (Amazon ECS) and Amazon Elastic Kubernetes Service (Amazon EKS) using AWS and managed open-source solutions.

The guide also provides best practices for monitoring the cost of observability tools and recommends options for visualizing those costs. It outlines best practices for calculating and monitoring Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs), with clear and concise examples. In some cases, customers deploy their workloads on partner solutions, such as Databricks on AWS, to address specific use cases more effectively. The guide recommends best practices for monitoring those workloads as well, using AWS native services or AWS managed open-source services. We expect this section to expand over time with the addition of other partner solutions.
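As an illustration of the Service Level Indicator and Service Level Objective calculations mentioned above, the following Python sketch computes a request-based availability indicator and the share of the error budget that remains. The guide's own examples are not reproduced here. The function names and the sample numbers are assumptions chosen for illustration, and a request-based availability ratio is only one of several ways to define an indicator.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests that succeeded; treat "no traffic" as fully available."""
    if total_requests == 0:
        return 1.0
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Fraction of the error budget still unspent.

    A negative result means the objective has been missed.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failure = 1.0 - slo_target  # the budget implied by the objective
    actual_failure = 1.0 - sli          # what was actually consumed
    return 1.0 - actual_failure / allowed_failure


# Hypothetical month: 1,000,000 requests, 500 of which failed, against a 99.9% objective.
sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
remaining = error_budget_remaining(sli, slo_target=0.999)
```

In this hypothetical month, the workload has used about half of its error budget. Tracking the remaining budget, rather than the raw indicator alone, gives teams a simple signal for when to slow down changes and focus on reliability.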

Observability is built on three pillars of logs, metrics, and traces, and each requires specific focus. Therefore, the best practices guide addresses them in separate subsections under the Data types section.

Most modern architectures are event-driven and therefore need special consideration in the context of observability. The same section covers best practices for integrating events with observability and deriving actionable insights from them. The last topic discussed in this section is alarms, along with best practices for avoiding common challenges such as alarm fatigue and the “everything is OK” alarm.
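One common way to reduce alarm fatigue is to alarm only when several datapoints in a recent window breach the threshold, rather than on every single spike. Amazon CloudWatch alarms support this style of "M out of N" evaluation. The Python sketch below is a simplified stand-in for that idea and is not the guide's recommendation or the CloudWatch implementation. The function name, parameters, and sample values are hypothetical.

```python
def should_alarm(
    datapoints: list[float],
    threshold: float,
    datapoints_to_alarm: int,
    evaluation_periods: int,
) -> bool:
    """Return True when at least M of the last N datapoints exceed the threshold."""
    recent = datapoints[-evaluation_periods:]  # the last N datapoints
    breaching = sum(1 for value in recent if value > threshold)
    return breaching >= datapoints_to_alarm


# Hypothetical CPU utilization samples, one per minute.
single_spike = [10.0, 10.0, 95.0, 10.0, 10.0]
sustained_load = [10.0, 90.0, 95.0, 85.0, 10.0]

# Require 3 breaching datapoints out of the last 5 before alarming.
spike_alarms = should_alarm(single_spike, 80.0, 3, 5)
sustained_alarms = should_alarm(sustained_load, 80.0, 3, 5)
```

With these settings, the isolated spike does not raise an alarm, while the sustained load does. The trade-off is a slower reaction to real incidents, so the window size should reflect how quickly the workload needs a response.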

You can also find best practices for each observability tool in the Tools section. The section includes best practices for Amazon CloudWatch Agent, Alarms, Dashboards, Amazon CloudWatch Internet Monitor, Amazon CloudWatch Logs, Metrics, Real User Monitoring, Synthetic Testing, and Tracing with AWS X-Ray. Finally, you should look into the curated recipes to learn from the experiences of other AWS customers. The curated recipes are organized by six dimensions of observability, by telemetry (signals by source and destination), and by tasks. You can find a curated recipe based on the dimensions that fit your workload. For example, if you have an AWS Lambda application backed by Amazon RDS, you can find a curated recipe for it under Dimensions. You can also find curated recipes by the tasks that you want to accomplish for your workload. For example, if you want to proactively monitor an Amazon RDS application, you can find the recipes under the Alerting subsection of the Tasks section.

Contributing to the Best Practices Guide
In addition to providing best practice recommendations, the guide aims to give the community a forum for sharing experiences, suggestions, and enhancements. If you would like to contribute to the guide’s content or seek suggestions from the community, you can do so using the discussions section of the guide.

Conclusion
The best practices guide is a valuable resource for users seeking to optimize their monitoring and observability practices. By providing comprehensive guidance, this guide empowers you to make informed decisions, avoid common pitfalls, and unlock the full potential of observability in your workloads.

AWS’ intention behind this guide is to foster a culture of excellence in monitoring and observability, ensuring that AWS users can derive maximum value from their investments. By contributing to the guide, you can actively participate in the collective knowledge sharing and continuous improvement process. Together, let’s build robust, scalable, and efficient AWS deployments that deliver exceptional performance and reliability.

If you want additional resources for observability on AWS, try the One Observability Workshop to get hands-on experience with observability on AWS. You can also look at Terraform AWS Observability Accelerator and CDK AWS Observability Accelerator to learn how you can set up observability for your AWS environments.

About the authors:

Deepak Jha

Deepak Jha is a Customer Solutions Manager at AWS, currently focused on accelerating the cloud journey of Games customers, and an aspiring Cloud Operations Technical Field Community member at AWS. He is passionate about using technology to solve his customers' business problems and has more than 23 years of experience doing so.