AWS Cloud Operations Blog
How Unitary achieved automatic metric collection with Amazon Managed Service for Prometheus collector
This post was co-authored with Nicolas Fournier, Platform Engineer at Unitary.
Every day, over 80 years’ worth of video content is uploaded online, and some of this content can be harmful. Unitary knows that human moderators are the current gold standard for moderation, but this manual approach does not scale. While automated systems can scale, today’s solutions often fail to understand context.
Unitary was founded to help brands and platforms understand every piece of content in detail. Unitary does this by building context-aware AI and multimodal machine learning methods that interpret content in context. The company’s mission is to make the internet safer by understanding content accurately, swiftly, and at scale.
When Unitary is sent a video to analyze, a number of microservices run on Amazon Elastic Kubernetes Service (Amazon EKS) to split the video into frames and infer the characteristics of that content using machine learning. The company chose Amazon EKS along with Karpenter to allow rapid scaling in response to the volume of videos to be processed. Observability, the practice of understanding a system from its measured outputs, is key to enabling Unitary to do this at scale, and helps ensure that systems are working reliably and as intended.
This post discusses Unitary’s journey with Prometheus and how its usage has changed as the company has grown: from initially self-managing Prometheus, to adopting a fully managed Prometheus metrics system in Amazon Managed Service for Prometheus, to recently adopting the service’s fully managed, agentless collection capabilities for Prometheus metrics.
Why Unitary chose Prometheus
Prometheus is one of the most popular tools for monitoring a Kubernetes environment, but it can also be very challenging to scale and operate. At the beginning of its Prometheus journey, Unitary examined different Prometheus providers, with the key considerations being Prometheus compatibility and the number of monitoring use cases supported. Given how widely Prometheus is supported across the Kubernetes ecosystem, Prometheus compatibility was very important to Unitary. At Unitary, the number of nodes per Kubernetes cluster can vary from 30 to 1,000 in a matter of minutes. From a Prometheus perspective, more nodes mean more time series to handle and more samples to scrape. The company wanted to make sure the solution could scale with the unpredictable nature of its workloads.
Unitary started with this architecture.
Figure 1: Self-managed Prometheus architecture backed by Amazon EBS volumes
To achieve redundancy and high availability, both Prometheus replicas scraped the same metrics. However, Unitary could only configure its Grafana instance to read from one replica at a time. Unitary also used the Vertical Pod Autoscaler (VPA) to scale the resources allocated to the Prometheus replicas up or down. Because VPA can take a non-trivial amount of time to adjust CPU and memory requests and limits, Unitary often ended up with Prometheus replicas running out of memory when the cluster scaled quickly. In the past, Unitary had situations where customers started sending increased volumes of videos to process, which created more metrics, which eventually overwhelmed the local Prometheus instance. However, Unitary didn’t notice the issue quickly because nothing erroneous was appearing in Grafana. To make sure it always had workload visibility, Unitary needed a better way to manage the reliability of its Prometheus solution. Furthermore, setting up Prometheus to be highly available, multi-AZ, and resilient to spikes in traffic or infrastructure issues required significant investment. To minimize the operational risks, Unitary decided to look for managed solutions.
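For reference, a vertically autoscaled Prometheus setup of this kind can be described with a VerticalPodAutoscaler object like the minimal sketch below, which uses the Kubernetes Python client. The namespace, workload name, container name, and resource bounds are illustrative assumptions, not Unitary’s actual configuration, and the sketch assumes the VPA custom resource definitions are already installed in the cluster.

```python
# Illustrative sketch only: creates a VerticalPodAutoscaler for a Prometheus
# server StatefulSet. Names, namespace, and bounds are assumptions, not
# Unitary's configuration. Requires the VPA CRDs and the `kubernetes` package.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod

vpa = {
    "apiVersion": "autoscaling.k8s.io/v1",
    "kind": "VerticalPodAutoscaler",
    "metadata": {"name": "prometheus-server-vpa", "namespace": "monitoring"},
    "spec": {
        # Hypothetical target: a Prometheus StatefulSet named "prometheus-server".
        "targetRef": {
            "apiVersion": "apps/v1",
            "kind": "StatefulSet",
            "name": "prometheus-server",
        },
        # "Auto" lets VPA evict and recreate pods with updated requests --
        # exactly the step that can lag behind a fast cluster scale-out.
        "updatePolicy": {"updateMode": "Auto"},
        "resourcePolicy": {
            "containerPolicies": [
                {
                    "containerName": "prometheus",
                    "minAllowed": {"cpu": "500m", "memory": "2Gi"},
                    "maxAllowed": {"cpu": "8", "memory": "64Gi"},
                }
            ]
        },
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="autoscaling.k8s.io",
    version="v1",
    namespace="monitoring",
    plural="verticalpodautoscalers",
    body=vpa,
)
```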
Reducing maintenance overheads with Amazon Managed Service for Prometheus
When Amazon Managed Service for Prometheus was released, Unitary configured its Prometheus server to remote-write into an Amazon Managed Service for Prometheus workspace. This provided the peace of mind that all of its time series were stored and available, and that Unitary only paid for what it used, based on metrics ingested, queried, and stored. It removed the pain Unitary was experiencing scaling storage, and provided a single query endpoint for its Grafana instance. After Unitary configured “replicaExternalLabelName” and “externalLabels” according to the AWS documentation, Amazon Managed Service for Prometheus automatically de-duplicated data from the multiple Prometheus replicas, making sure Unitary didn’t see the same metrics multiple times. Now, with Amazon Managed Service for Prometheus, the company doesn’t have to worry about the durability, reliability, and availability of its Prometheus datastore. By using an isolated metric store, Unitary can ensure that observability still works even if the EKS cluster is having issues, increasing reliability and durability.
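As a concrete illustration of those two settings, the sketch below generates a Helm values override that remote-writes to a workspace with SigV4 signing and sets the labels used for high-availability de-duplication. It assumes the kube-prometheus-stack chart’s value layout; the workspace ID, Region, and cluster label are placeholders rather than Unitary’s actual values.

```python
# Hedged sketch: builds a Helm values override (kube-prometheus-stack layout
# assumed) that remote-writes to an Amazon Managed Service for Prometheus
# workspace and sets the labels used for HA de-duplication.
# The Region, workspace ID, and cluster label below are placeholders.
import yaml  # pip install pyyaml

region = "us-east-1"            # placeholder
workspace_id = "ws-EXAMPLE123"  # placeholder
remote_write_url = (
    f"https://aps-workspaces.{region}.amazonaws.com"
    f"/workspaces/{workspace_id}/api/v1/remote_write"
)

values = {
    "prometheus": {
        "prometheusSpec": {
            # Every replica carries the same cluster label...
            "externalLabels": {"cluster": "example-eks-cluster"},
            # ...and a distinct __replica__ label, so the workspace can
            # de-duplicate the identical series each replica sends.
            "replicaExternalLabelName": "__replica__",
            "remoteWrite": [
                {
                    "url": remote_write_url,
                    # SigV4-sign requests using the pod's AWS credentials
                    # (for example, an IAM role for the service account).
                    "sigv4": {"region": region},
                }
            ],
        }
    }
}

with open("amp-values.yaml", "w") as f:
    yaml.safe_dump(values, f, sort_keys=False)
```

The generated file would then be passed to the chart with something like `helm upgrade ... -f amp-values.yaml`.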
Figure 2: Amazon Managed Service for Prometheus with Prometheus agent architecture
Adopting Amazon Managed Service for Prometheus solved part of the problem: the “backend” side of Prometheus (storage, read/write, and querying). Unitary decided to run Prometheus with Agent Mode enabled, in order to remote-write to Amazon Managed Service for Prometheus without the need for local storage. However, these agents still needed to be resized using VPA as the cluster scaled, which could mean that metrics endpoints were missed or that agents went down if a cluster scaled too quickly. The Prometheus agents have the same problem as the Prometheus server, in that they don’t scale well as workloads scale. In a rapid scale event, Prometheus agents are just as likely to run out of memory, which can lead to gaps in the metrics reported to Amazon Managed Service for Prometheus. Unitary expected the Prometheus agent to be lightweight, but unfortunately it still requires maintenance, operations, and capacity management. In practice, as the number of nodes in the cluster increased, Unitary had to run those Prometheus pods on increasingly large EC2 instances. An EC2 instance type rarely matches the exact memory required by the pods, so Unitary often had to select instances with more memory than the pods required, resulting in extra cost.
Automatic agentless Prometheus metric collection with Amazon Managed Service for Prometheus collector
Amazon Managed Service for Prometheus collector solved the problem of having to run Prometheus agents in Unitary’s Kubernetes clusters. This allowed the company to remove the Prometheus agents running in-cluster, removed the overhead of right-sizing and managing those containers, and gave Unitary peace of mind that its collectors were multi-AZ, highly available, and autoscaling. The collector was easy to enable for its Amazon EKS clusters and is fully Prometheus-compatible, making it painless to transition from self-managed collectors. Now, the metrics pipeline is more reliable and resilient than it was before, ensuring that observability continues to work even if the cluster experiences issues.
Unitary has been running the Amazon Managed Service for Prometheus collector on a dev cluster. The collector scrapes all of Unitary’s Prometheus metrics, and Unitary was able to remove all local agents. With the collector, it gets a multi-AZ, highly available Prometheus metrics scraper without the need to manage any agents.
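As a rough sketch of how such a rollout can be checked, the AWS SDK for Python (boto3) exposes the scraper APIs; the calls below list the scrapers in an account and read one scraper’s details. The Region and scraper ID are placeholders.

```python
# Sketch: inspect Amazon Managed Service for Prometheus collectors (scrapers)
# with boto3. The Region and scraper ID below are placeholders.
import boto3

amp = boto3.client("amp", region_name="us-east-1")

# List every scraper in this account and Region with its status.
for scraper in amp.list_scrapers()["scrapers"]:
    print(scraper["scraperId"], scraper.get("alias", "-"),
          scraper["status"]["statusCode"])

# Inspect a single scraper in more detail (placeholder ID).
detail = amp.describe_scraper(scraperId="s-0123456789abcdef0")
print(detail["scraper"]["source"])       # the EKS cluster being scraped
print(detail["scraper"]["destination"])  # the destination workspace
```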
Unitary plans to expand its usage of the collector to its remaining EKS clusters. Unitary’s usage is fundamentally unpredictable, which means that scaling Prometheus is fundamentally unpredictable, and there is a limit to how much capacity Unitary can give the agents themselves.
Figure 3: Amazon Managed Service for Prometheus collector architecture
Getting started with Amazon Managed Service for Prometheus collector
To get started with Amazon Managed Service for Prometheus collector, an agentless way to discover and collect Prometheus metrics from Amazon EKS applications and infrastructure, you can visit our user guide, or enable the collector directly from the Amazon EKS console while creating a new cluster. The collector can be connected to existing and new EKS clusters, and configured with any Prometheus-compatible scrape configuration. Visit our docs to learn more.
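Scrapers can also be created programmatically. The following sketch uses the CreateScraper API through the AWS SDK for Python (boto3); the cluster ARN, subnet IDs, workspace ARN, and the minimal scrape configuration shown are placeholders to be replaced with your own values, not a recommended production configuration.

```python
# Sketch: create an Amazon Managed Service for Prometheus collector (scraper)
# for an existing EKS cluster with boto3. All ARNs, subnet IDs, and the scrape
# configuration below are placeholders.
import boto3

# Minimal, illustrative Prometheus-compatible scrape configuration.
scrape_config = b"""
global:
  scrape_interval: 30s
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
"""

amp = boto3.client("amp", region_name="us-east-1")

response = amp.create_scraper(
    alias="example-eks-scraper",
    scrapeConfiguration={"configurationBlob": scrape_config},
    source={
        "eksConfiguration": {
            "clusterArn": "arn:aws:eks:us-east-1:111122223333:cluster/example",
            "subnetIds": ["subnet-0123456789abcdef0", "subnet-0fedcba9876543210"],
        }
    },
    destination={
        "ampConfiguration": {
            "workspaceArn": "arn:aws:aps:us-east-1:111122223333:workspace/ws-EXAMPLE123"
        }
    },
)
print(response["scraperId"], response["status"]["statusCode"])
```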