How Amazon CloudWatch Metrics Insights enables Ring to monitor large cloud fleets easily

Amazon CloudWatch has recently launched Metrics Insights – a fast, flexible, SQL-based query engine that enables you to identify trends and patterns across millions of operational metrics in real-time. With Metrics Insights, you can easily query and analyze your metrics to gain better visibility into the health and performance of your infrastructure and large-scale applications. Metrics Insights is also available in Open Source Grafana and in Amazon Managed Grafana.

Amazon Ring is Amazon’s smart home security brand, with devices ranging from Video Doorbells and Indoor and Outdoor Cameras to Home Security Systems.

Ring has complex and extensive infrastructure monitoring needs. Ring’s core function, recording and delivering live video to customers, requires tens of thousands of Amazon Elastic Compute Cloud (Amazon EC2) instances across hundreds of services and microservices working in perfect harmony. Monitoring a fleet of that size, operating across hundreds of AWS accounts, can be a challenge. Originally, Ring used a third-party monitoring tool to perform complex monitoring and alerting. Ring then sought to move to first-party tooling, in line with an initiative for better security and improved flexibility and functionality.

Ring DevOps took on the challenge to develop and deploy a functional monitoring solution before Ring’s biggest event of the year – Halloween – where streaming traffic often more than doubles. Furthermore, the team needed to make that happen in a matter of three months to replace its earlier monitoring solution without losing fidelity or scale.

The key requirements for Ring’s solution included:

Flexible, real-time aggregation, which means higher uptime and better service for Ring’s customers.
Analysis at scale.
Visibility and accessibility to all relevant monitoring and alerting data.
Ease of getting started.
Security.
The ability to group by metric labels to add more context to monitoring.
Uniformity where all of Ring’s services must produce similar metrics.

On top of the requirements listed above, Ring had found that aggregations defined in advance often lead to observability gaps. This is why the Ring DevOps team had been looking for flexible server-side aggregation for real-time root cause analysis. Moreover, proprietary analysis languages had previously made it hard for Ring teams to get started. Ring also needed to create metrics that reflect the complex deployments of its applications, and it needed to provide an easy, common, single pane of glass monitoring environment to all stakeholders.

The main hurdle that the Ring DevOps team ran into was server-side aggregation. Before Metrics Insights launched, Ring had to plan ahead, deciding how metrics would be aggregated and configuring the identifiers on the client side before deployment. This overhead did not pay off when an issue eventually occurred and the DevOps team found that the aggregations it needed had never been defined. CloudWatch Metrics Insights enabled Ring to define server-side aggregations based on use cases, and to use these real-time definitions to reduce mean time to resolution.

Furthermore, Metrics Insights comes with the ability to slice and dice operational metrics at scale with dimensions, dive deep, and identify issues to the finest granularity. With Metrics Insights, customers can scan through millions of metrics, group them using dimensions, and quickly narrow down the analysis to pinpoint issues. Moreover, Metrics Insights queries can be used to create powerful visualizations that will stay up-to-date as resources are deployed or removed. This helps to proactively monitor and identify issues rapidly.

Therefore, CloudWatch Metrics Insights allowed Ring to do the complex real-time aggregation of metrics easily. This removed the overhead of having to define pre-aggregations, and it also provided full visibility across the whole fleet at scale, with dynamic, to-the-point, and real-time aggregations.

Ring has been using Grafana as the default graphing and dashboarding tool in observability, as Grafana was commonly used by the stakeholders and has integrations with various backends. For example, it can use CloudWatch, AWS X-Ray, and Amazon Managed Service for Prometheus, as well as accommodate any other foreseeable backend switch. It offers a single pane of glass which can be securely shared with non-Ring users, such as SREs or key third-party collaborators. This allows Ring to remain in compliance with Security guidelines while maintaining the competitive advantage. This is why Grafana was selected as the default dashboarding tool for Ring.

Using Metrics Insights integrated with Grafana also offered another feature that unlocked the entire project for Ring: Grafana uses CloudWatch Metrics Insights under the hood and adds the ability to alert on Metrics Insights queries. Because Metrics Insights comes already integrated with Grafana, Ring could power its dashboards with Metrics Insights queries without any extra onboarding. It could also create alerts to monitor the health and performance of Ring services.

Last but not least, Metrics Insights uses a standard SQL query language, which lowers the barrier to entry. Metrics Insights also offers a visual query builder for selecting the metrics, namespaces, and dimensions of interest. The console creates SQL queries for the user based on these visual selections, removing the requirement for query language know-how and providing a friction-free getting-started experience.
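The way a handful of selections maps onto a query can be sketched in a few lines of Python. This is only an illustration of the idea, not how the console is implemented: the function name `build_query` and its parameters are hypothetical, and the sketch covers only the SELECT, FROM, GROUP BY, ORDER BY, and LIMIT clauses.

```python
from typing import Optional, Sequence

# Aggregate functions accepted in a Metrics Insights SELECT clause.
ALLOWED_FUNCTIONS = {"AVG", "COUNT", "MAX", "MIN", "SUM"}


def build_query(
    function: str,
    metric: str,
    namespace: str,
    group_by: Sequence[str] = (),
    order: Optional[str] = None,
    limit: Optional[int] = None,
) -> str:
    """Compose a Metrics Insights query string from simple selections.

    Hypothetical helper for illustration only; it does not cover WHERE
    filters or SCHEMA(), and it does not escape its inputs.
    """
    function = function.upper()
    if function not in ALLOWED_FUNCTIONS:
        raise ValueError(f"unsupported function: {function}")

    # The namespace is double-quoted because names such as AWS/ELB
    # contain a slash.
    lines = [f"SELECT {function}({metric})", f'FROM "{namespace}"']

    if group_by:
        lines.append("GROUP BY " + ", ".join(group_by))

    if order is not None:
        order = order.upper()
        if order not in {"ASC", "DESC"}:
            raise ValueError(f"unsupported sort order: {order}")
        # ORDER BY takes the aggregate function with empty parentheses.
        lines.append(f"ORDER BY {function}() {order}")

    if limit is not None:
        lines.append(f"LIMIT {limit}")

    return "\n".join(lines)
```

For example, `build_query("SUM", "HTTPCode_Backend_5XX", "AWS/ELB", group_by=["LoadBalancerName"], order="DESC", limit=10)` yields a five-line query that sums backend 5XX errors per load balancer and keeps the ten highest.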

The project concluded with a successful transition in time for Halloween, Ring’s biggest night of the year. The initiative allowed Ring DevOps’ internal customers to monitor critical server resources in a single pane. This let them respond quickly to capacity needs and continue offering the extremely high availability to which Ring’s customers are accustomed. The total Metrics Insights usage numbers are impressive: Ring processes approximately 10,000,000,000 metrics per month across all accounts.

CloudWatch Metrics Insights helps Ring measure latency across all load balancers in a single view. These metrics are compiled using SQL queries, and the graphs created from these queries stay up to date, reflecting the dynamic and short lifespan of modern ephemeral resources.

In the following example, Ring uses this query:

SELECT SUM(HTTPCode_Backend_5XX)
FROM "AWS/ELB"
GROUP BY LoadBalancerName
ORDER BY SUM() DESC
LIMIT 10

The query sums the HTTPCode_Backend_5XX metric for each load balancer (LB) and returns the 10 LBs with the most server error responses sent from their registered instances. After identifying the affected LBs, Ring checks the access logs or the error logs on the instances to determine the root cause.
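Outside the console and Grafana, a query like this can also be submitted programmatically. The following is a minimal sketch, not Ring's actual tooling, that assumes the boto3 SDK with credentials and a Region already configured. It passes the query as the `Expression` of a CloudWatch `GetMetricData` request. The helper names, the `top5xx` identifier, the 60-minute window, and the 60-second period are all illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

TOP_5XX_QUERY = (
    "SELECT SUM(HTTPCode_Backend_5XX) "
    'FROM "AWS/ELB" '
    "GROUP BY LoadBalancerName "
    "ORDER BY SUM() DESC "
    "LIMIT 10"
)


def build_request(
    query: str,
    minutes: int = 60,
    period: int = 60,
    now: Optional[datetime] = None,
) -> dict:
    """Build keyword arguments for CloudWatch GetMetricData.

    Hypothetical helper: the window and period defaults are assumptions
    chosen for illustration, not recommended values.
    """
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(minutes=minutes)
    return {
        "MetricDataQueries": [
            {
                "Id": "top5xx",        # must start with a lowercase letter
                "Expression": query,   # the Metrics Insights query text
                "Period": period,      # datapoint granularity in seconds
            }
        ],
        "StartTime": start,
        "EndTime": end,
    }


def run_query(query: str, **kwargs) -> dict:
    """Run the query and return the summed values keyed by series label.

    Needs AWS credentials; pagination (NextToken) is ignored for brevity.
    """
    import boto3  # imported lazily so build_request works without the SDK

    client = boto3.client("cloudwatch")
    response = client.get_metric_data(**build_request(query, **kwargs))
    return {
        result["Label"]: sum(result["Values"])
        for result in response["MetricDataResults"]
    }
```

Calling `run_query(TOP_5XX_QUERY)` would return one entry per load balancer in the result. Keeping `build_request` separate from the network call lets the request shape be inspected or tested without touching AWS.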

Graph on a sample Metrics Insights query displaying the top ten Load Balancers with problematic error codes.

According to Alan LaCombe, Lead DevOps Engineer at Ring, “Thanks to the real-time server-side aggregation and easy getting started experience in Metrics Insights, every Ring team using CloudWatch and Grafana was able to quickly adopt Metrics Insights querying and add it to their dashboards easily. Metrics Insights is a powerful instrument for day-to-day operations and has already become the most widely adopted CloudWatch feature across Ring in a very short time.”

Conclusion

In this post, we described how Ring uses CloudWatch Metrics Insights to monitor its large-scale cloud infrastructure, related services, and applications with real-time server-side aggregations. This helped Ring pinpoint issues faster and reduce the mean time to resolution (MTTR).

We also went through the frictionless getting-started experience for Metrics Insights, which eliminated the toil and training otherwise needed. This helped Ring deploy it as a functional monitoring solution before its biggest event of the year, Halloween. We also looked at how using Metrics Insights through its Grafana integration helped Ring create alerts based on the results of CloudWatch Metrics Insights queries.

Metrics Insights is a new feature that lets you identify trends and patterns across millions of operational metrics in real time, and it helps you use these insights to reduce the time to resolution. With Metrics Insights, you can gain better visibility into your infrastructure and large-scale application performance with flexible querying and on-the-fly metric aggregations. Metrics Insights is available in all commercial AWS Regions, and you can start using it immediately. To get started, select the All metrics link under Metrics on the left navigation panel of the CloudWatch console, and browse to the Query tab. Metrics Insights is also available on the Amazon Managed Grafana console. To learn more about Metrics Insights, refer to the Metrics Insights documentation.

For more information about Ring, visit the website.
