AWS Cloud Operations Blog
How Audible used Amazon CloudWatch cross-account observability to resolve severity tickets faster
This blog was co-written with Audible’s Apurva Jatakia, Kaushik S., and David Etler.
Audible’s consumption services platform serves thousands of requests every second, and each incoming request is served by a distributed set of microservices owned by different teams. An Audible team, in charge of a platform called Stagg, is responsible for five separate microservices. The Audible Stagg team supports the player and library projects and powers experiences on Audible’s mobile app and website. Each AWS service generates its own logs and metrics, and until last year, the Audible Stagg team did not have a holistic view of how requests were flowing through the service chain.
The absence of a unified observability platform meant on-call engineers did not have a tool to trace the request, response, and exceptions for a request across services, which increased the time spent identifying and resolving customer-facing tickets. Customers who have hundreds of AWS accounts can benefit from learning how Audible was able to correlate metrics, logs, and traces and create a unified observability solution providing a single-pane-of-glass view across a set of microservices.
In this blog post, we show how a team at Audible implemented a unified observability solution using Amazon CloudWatch cross-account observability. The solution made the team more efficient, cutting debugging time by 60%, and increased developer satisfaction.
Challenges Audible Stagg was facing
- Inability for on-call engineers to link metrics/alarms to any associated request logs
- Inability to trace a request across services
- Inability to correlate interactions in the app with service traces
- Lack of holistic view of AWS services across Audible Stagg platform
When triaging customer-facing issues, Audible had to transfer tickets from one team to another until the root cause was found. Even within a set of microservices owned by a single team, the absence of a tracing solution meant engineers had to spend a significant amount of time analyzing the logs of each service in the chain to build a holistic view.
Additionally, it was difficult to correlate issues reported by customers with the associated service requests. This resulted in Customer Service personnel having to request additional information from the customer to gather enough details to be able to reproduce the issue, as well as a significant amount of developer time spent on reproducing the issue.
Solution requirements
- Ability to trace a request across all Audible Stagg services, as well as the Audible app and other external clients
- Ability to query and drill into a trace to see AWS services associated with an application that failed and the root cause exception
- Provide an easy way for on-call engineers to link traces to logs, when further analysis is required
- Provide a single-pane-of-glass view for AWS services associated with an application spanning multiple AWS accounts
- Support a variety of service frameworks (Java, Node.JS) and compute platforms (EC2, ECS, and Lambda)
Key Decisions
Amazon CloudWatch cross-account observability
Audible Stagg started using Amazon CloudWatch cross-account observability for cross-account tracing, logging, and metrics. In this model, multiple source accounts feed into one monitoring account. A monitoring account is a central AWS account that can view and interact with observability data generated across other accounts. A source account is an individual AWS account that generates observability data for the resources that reside in it. The number of source accounts can scale up to 100,000; the current service quotas are listed in the CloudWatch documentation.
Source accounts share their data with the monitoring account. Cross-account tracing aggregates traces from multiple source accounts into a single monitoring account. This enables a complete view of requests that travel across multiple accounts. You can view cross-account traces in the AWS X-Ray service map and traces pages within the CloudWatch console.
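The monitoring/source relationship described above is configured through a "sink" in the monitoring account that source accounts link to. The following is a minimal sketch of the sink policy that authorizes source accounts, not Audible's actual configuration. The account IDs are placeholders, and the function name `build_sink_policy` is hypothetical.

```python
import json

# Telemetry types that source accounts are allowed to share with the sink.
SHARED_RESOURCE_TYPES = [
    "AWS::CloudWatch::Metric",
    "AWS::Logs::LogGroup",
    "AWS::XRay::Trace",
]


def build_sink_policy(source_account_ids):
    """Return a sink policy document that lets the given accounts link to the sink.

    Hypothetical helper for illustration; the account IDs are placeholders.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": sorted(source_account_ids)},
                "Action": ["oam:CreateLink", "oam:UpdateLink"],
                "Resource": "*",
                # Restrict links to the three telemetry types listed above.
                "Condition": {
                    "ForAllValues:StringEquals": {
                        "oam:ResourceTypes": SHARED_RESOURCE_TYPES
                    }
                },
            }
        ],
    }


# Placeholder source accounts; replace with real 12-digit account IDs.
policy_json = json.dumps(
    build_sink_policy(["111111111111", "222222222222"]), indent=2
)
print(policy_json)
```

The resulting JSON would be attached to the sink in the monitoring account, and each source account would then create a link to that sink. Those API calls are omitted here so the sketch runs without AWS credentials.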
AWS X-Ray
AWS X-Ray is a distributed tracing solution that provides an easy way to trace requests across the service chain. Audible Stagg onboarded their services to X-Ray, which allowed them to trace a request across services. Because X-Ray requires little infrastructure to maintain, services simply needed to onboard to X-Ray’s agent to get started.
High level summary of how X-Ray works:
- X-Ray links (or “correlates”) requests together by using a “trace ID”. The trace ID is generated by the first service to handle a request and is propagated to downstream services.
- Once the X-Ray agent is set up and the necessary permissions are granted to publish X-Ray traces, you can use the AWS X-Ray UI to query and select specific traces to debug an issue.
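The trace ID propagation described above can be sketched as follows. X-Ray trace IDs have the form `1-<8 hex digits of epoch seconds>-<24 random hex digits>` and travel between services in the `X-Amzn-Trace-Id` HTTP header. The helper names below are hypothetical; in practice the X-Ray SDK or agent does this work for you.

```python
import os
import time


def new_trace_id(now=None):
    """Create a trace ID: version 1, epoch seconds in hex, 96 random bits in hex."""
    epoch = int(time.time() if now is None else now)
    return f"1-{epoch:08x}-{os.urandom(12).hex()}"


def new_segment_id():
    """Create a 64-bit segment ID as 16 hex digits."""
    return os.urandom(8).hex()


def build_trace_header(trace_id, parent_id, sampled=True):
    """Format the value of the X-Amzn-Trace-Id header for a downstream call."""
    return f"Root={trace_id};Parent={parent_id};Sampled={1 if sampled else 0}"


def parse_trace_header(header):
    """Split a tracing header value into a dict of its key=value fields."""
    fields = {}
    for part in header.split(";"):
        key, sep, value = part.strip().partition("=")
        if sep:
            fields[key] = value
    return fields


# The first service creates the trace ID and passes it downstream.
upstream_header = build_trace_header(new_trace_id(), new_segment_id())

# A downstream service reuses the same Root and records the caller as Parent.
incoming = parse_trace_header(upstream_header)
print(incoming["Root"], incoming["Parent"], incoming["Sampled"])
```

Because every service in the chain records the same `Root` value, X-Ray can stitch the individual segments into one end-to-end trace.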
Roadmap to solution
High-level solution architecture
Features of the solution
With the solution, you can create a centralized AWS observability account for collecting the logs, traces, and metrics, and get a global view of the data. Here are the main features of the solution.
- Using AWS X-Ray for tracing: The solution deploys AWS X-Ray agents in the services across AWS accounts. AWS X-Ray supports applications running on EC2, ECS, Lambda, Amazon SQS, Amazon SNS, and Elastic Beanstalk. In addition, the X-Ray SDK automatically captures metadata for API calls made to AWS services using the AWS SDK. X-Ray tracks requests flowing through applications or services across multiple Regions. X-Ray data is stored in the Region where it is processed, but with enough information to enable client applications to combine the data and provide a global view of traces. On EC2 and ECS, the X-Ray agent can assume a role to publish data into an account different from the one in which it is running. This enables publishing data from various components of the application into a central account.
- Using CloudWatch for log collection: The CloudWatch Logs Agent will send log data to Amazon CloudWatch in each service’s AWS account. Cross-account logging enables us to view all these logs in our centralized monitoring account.
- Using CloudWatch for metrics monitoring: Amazon CloudWatch allows you to monitor AWS cloud resources and the applications you run on AWS. Metrics are provided automatically for a number of AWS products and services, including Amazon EC2 instances, EBS volumes, Elastic Load Balancers, Auto Scaling groups, EMR job flows, RDS DB instances, DynamoDB tables, ElastiCache clusters, Redshift clusters, OpsWorks stacks, Route 53 health checks, SNS topics, SQS queues, SWF workflows, and Storage Gateways.
- Amazon CloudWatch ServiceLens: You can get a unified view of X-Ray traces and CloudWatch metrics and logs using Amazon CloudWatch ServiceLens, which helps you visualize and analyze the health, performance, and availability of your applications in a single place. CloudWatch ServiceLens ties together CloudWatch metrics and logs as well as traces from AWS X-Ray to give you a complete view of your applications and their dependencies. This enables you to quickly pinpoint performance bottlenecks, isolate root causes of application issues, and determine which users are impacted. CloudWatch ServiceLens gives you visibility into your applications in three main areas: infrastructure monitoring (using metrics and logs to understand the resources supporting your applications), transaction monitoring (using traces to understand dependencies between your resources), and end-user monitoring (using canaries to monitor your endpoints and notify you when your end-user experience has degraded).
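To make the tracing feature above more concrete, the sketch below builds the kind of segment document that a service hands to the local X-Ray agent, which then publishes it to X-Ray. It is an illustration under stated assumptions, not Audible's implementation: the service name, the annotation key, and the helper names are hypothetical, and the X-Ray SDK normally builds and sends these documents for you.

```python
import json
import os
import time

# Header line that precedes each segment document sent to the local agent.
DAEMON_HEADER = '{"format": "json", "version": 1}'


def build_segment(name, trace_id, start, end, parent_id=None, annotations=None):
    """Build a minimal X-Ray segment document as a dict."""
    segment = {
        "name": name,                # logical service name shown on the service map
        "id": os.urandom(8).hex(),   # 64-bit segment ID as 16 hex digits
        "trace_id": trace_id,        # shared by every segment in the same trace
        "start_time": start,         # epoch seconds, fractional
        "end_time": end,
    }
    if parent_id:
        segment["parent_id"] = parent_id
    if annotations:
        # Annotations are indexed key-value pairs usable in trace filters.
        segment["annotations"] = annotations
    return segment


def to_daemon_message(segment):
    """Serialize a segment into the header-plus-document form the agent reads."""
    return (DAEMON_HEADER + "\n" + json.dumps(segment)).encode("utf-8")


start = time.time()
segment = build_segment(
    name="library-service",                       # hypothetical service name
    trace_id=f"1-{int(start):08x}-{os.urandom(12).hex()}",
    start=start,
    end=start + 0.25,
    annotations={"client": "mobile-app"},          # hypothetical annotation
)
message = to_daemon_message(segment)
# The agent listens on UDP 127.0.0.1:2000 by default; sending would look like
# socket.socket(socket.AF_INET, socket.SOCK_DGRAM).sendto(message, ("127.0.0.1", 2000)).
print(message.decode("utf-8"))
```

Annotations such as the `client` key above are what make it possible to search traces by business-level attributes rather than only by service or status code.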
Outcome
Audible Stagg was able to leverage all X-Ray features including the service map and traces. They were able to access all these features from a single monitoring account, even though the underlying services are spread across many AWS accounts.
Service map showing holistic view of Audible Stagg’s services in the monitoring account
Drilling into a node on the service map
Metrics and trace correlation
Log insights
For any trace, Audible Stagg was able to easily pull up the corresponding logs for that trace, from any of the services. They were able to access these logs from the shared monitoring account.
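One way to pull up the logs for a trace is a CloudWatch Logs Insights query that filters on the trace ID. The sketch below only builds the query string. It assumes each service writes the X-Ray trace ID into its log lines, which the source does not confirm, and `build_trace_log_query` is a hypothetical helper name.

```python
def build_trace_log_query(trace_id, limit=100):
    """Return a Logs Insights query that finds log events mentioning a trace ID.

    Assumes services include the trace ID in their log messages.
    """
    if '"' in trace_id:
        raise ValueError("trace_id must not contain double quotes")
    return (
        # @log identifies the account and log group each event came from.
        "fields @timestamp, @log, @message\n"
        f'| filter @message like "{trace_id}"\n'
        "| sort @timestamp asc\n"
        f"| limit {limit}"
    )


# Placeholder trace ID for illustration.
query = build_trace_log_query("1-5759e988-bd862e3fe1be46a994272793", limit=50)
print(query)
```

Run from the monitoring account against the shared log groups, a query like this returns the matching events from every service in time order, with the `@log` field showing which account and log group each line came from.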
Where Audible Stagg is today
With the implementation of cross-account observability, Audible’s Stagg team can now log into one centralized account to identify an issue. This has cut the time previously spent triaging high-severity issues by 60%. The team can now access logs, metrics, and traces for all their services in a centralized account.
Another benefit the team has seen is increased developer satisfaction. Leveraging X-Ray in conjunction with CloudWatch has allowed developers to tackle issues more quickly and with higher confidence. X-Ray allows the Stagg developers to query their services under a single AWS account, which has cut down the time and effort previously spent keeping multiple log windows open or constantly signing in and out of different services and AWS accounts.
The observability solution met the Audible Stagg team’s needs, and they plan to help onboard other Audible teams to the solution. They also plan to use further features of X-Ray, such as custom annotations, and to expand their use of X-Ray to other use cases, such as QA and customer care bug reporting.
Conclusion
In this post, we saw how Audible implemented a unified observability solution that provides a rich cross-account observability and discovery experience for their metrics, logs, and traces. Cross-account functionality is integrated with AWS Organizations to help you efficiently build your cross-account dashboards. The Audible team continues to work on bringing more services and accounts into the cross-account observability solution.