AWS Cloud Operations & Migrations Blog

How StormForge reduces complexity and ensures scalability with Amazon Managed Service for Prometheus

This blog post was co-written by Brent Eager, Senior Software Engineer, StormForge

StormForge is the creator of Optimize Live, a Kubernetes vertical rightsizing solution that is compatible with the Kubernetes HorizontalPodAutoscaler (HPA). Using cluster-based agents, machine learning, and Amazon Managed Service for Prometheus, Optimize Live is able to continuously calculate and apply optimal resource requests, limits, and HPA target utilization across thousands of workloads. In this blog post, we’ll discuss how Amazon Managed Service for Prometheus allows StormForge to focus on their core business outcomes instead of managing the overhead of running a Prometheus server.

When the team at StormForge began building Optimize Live, they had a number of very specific requirements they had to meet:

  • They needed the ability to ingest, store, and query a large volume of container metrics.
  • They needed a solution that fit well into the cloud-native ecosystem. Their platform runs entirely on Amazon Elastic Kubernetes Service (Amazon EKS).
  • They needed a scalable solution that could support their largest customers with minimal engineering effort.
  • They wanted a cost-effective solution that could support significant query volume.
  • They wanted a managed service. As a small team, they don’t have the time to handle the undifferentiated heavy lifting of scaling the metric time series database.

Architecture

These requirements led the team to Amazon Managed Service for Prometheus. One out-of-the-box feature that proved indispensable was support for the Prometheus remote write API, which their metrics agent uses to ship data and which greatly streamlines the installation process for their customers. If customers can’t get the solution up and running, there’s no metric data, which means there are no rightsizing recommendations, and customers aren’t getting value from the product.
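As an illustration, a customer-side Prometheus agent configuration for remote writing to a workspace might look like the following sketch. The workspace ID, region, and queue settings here are placeholders, not StormForge’s actual configuration:

```yaml
remote_write:
  - url: https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-EXAMPLE/api/v1/remote_write
    # Sign requests with AWS SigV4 using the agent's IAM credentials
    sigv4:
      region: us-east-1
    queue_config:
      max_samples_per_send: 1000  # batch size; tune for throughput
```

Because the workspace speaks the standard remote write protocol, any Prometheus-compatible agent can ship metrics with only a URL and signing configuration.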

Customers remote write to the Optimize Live ingest gateway, which remote writes metrics to Amazon Managed Service for Prometheus. Customers can connect to a UI, which queries the Prometheus workspace for visualization data. ML workers query the Prometheus workspace for recommendations, cache recommendations in Amazon S3, and retrieve tasks from the Amazon MQ work queue.

Figure 1: StormForge Optimize Live architecture.

There are five major components that make up the system architecture and enable StormForge to provide workload recommendations to their customers (see Figure 1).

  1. Ingest Gateway — They built a custom ingest gateway application that handles the authentication, authorization, and metrics routing. The ingest gateway is a stateless, horizontally-scalable microservice that allows them to scale transparently to handle any amount of traffic from their customers.
  2. Amazon Managed Service for Prometheus — By provisioning a new workspace for each customer, StormForge is able to provide customer data isolation as a foundational principle of the solution. Isolation and access rely on AWS Identity and Access Management (IAM), which further simplifies integration. Amazon Managed Service for Prometheus automatically builds isolated backend resources when a new workspace is provisioned, which is fast and allows StormForge to rapidly onboard new customers. Since each provisioned workspace is Prometheus-compatible, they are able to build their UI against an open standard, which greatly reduces the complexity of displaying data points in the frontend application.
  3. Read Cache — Amazon Simple Storage Service (Amazon S3) is used as the caching mechanism of the solution. Amazon S3 provides a cost-efficient way to cache queried metrics. To maximize cache efficiency, they slice query time ranges into small windows, which both helps fill in any missing data in the cache and minimizes direct queries against the workspace. Currently, their cache-hit rate averages greater than 90%.
  4. Machine Learning Workers — The recommender queries either the Prometheus workspace or the Amazon S3 cache for metric data to generate recommendations for workloads. The tasks to be processed are submitted to an Amazon MQ work queue. Depending on a customer’s schedule for recommendations, StormForge is able to distribute when tasks are sent to the queue, which normalizes the volume of recommendation tasks generated per minute.
  5. UI — Through the UI, StormForge customers can view recommendations and apply them directly to workloads. They provide dashboards to show the configured requests over time, plotted against the recommendations that have been generated and applied. For the dashboards to be responsive, these queries need to be as fast as possible, and the Prometheus workspace delivers this with no configuration or tweaking required.
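The cache slicing described in component 3 can be sketched in a few lines. This is a hypothetical illustration, assuming fixed 15-minute slices aligned to Unix-time boundaries; StormForge’s actual slice width and key scheme are not public:

```python
SLICE_SECONDS = 900  # assumed 15-minute slice width, for illustration only

def slice_range(start_ts: int, end_ts: int, width: int = SLICE_SECONDS):
    """Split [start_ts, end_ts) into slices aligned to width-second boundaries.

    Aligned boundaries mean a repeated query maps to the same slices, so each
    slice can be cached in Amazon S3 under a stable key, and only the slices
    missing from the cache require a direct query against the workspace.
    """
    cur = start_ts - (start_ts % width)
    slices = []
    while cur < end_ts:
        slices.append((cur, cur + width))
        cur += width
    return slices

def cache_key(workspace_id: str, query_hash: str, slice_start: int) -> str:
    # Hypothetical S3 key layout: one object per (query, slice) pair.
    return f"{workspace_id}/{query_hash}/{slice_start}.json"
```

With small, aligned slices, a dashboard that repeatedly asks for "the last 6 hours" re-fetches only the newest slice; everything older is a cache hit, which is consistent with the greater-than-90% hit rate mentioned above.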

Amazon Managed Service for Prometheus provides a Prometheus-compatible API, which enables StormForge’s partners to build on top of the architecture without extra effort from StormForge. Amazon Managed Service for Prometheus rapidly provisions new workspaces, so customers don’t have to wait around after they sign up for Optimize Live. This rapid provisioning ensures that metrics can start flowing as soon as possible, with recommendations following closely behind. Amazon Managed Service for Prometheus easily supports StormForge’s scale requirements, not only for ingested samples per second, but also for queries per second. In turn, their horizontally-scalable ML recommendation workers can submit queries both rapidly and in parallel.
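The load-normalizing schedule distribution described in component 4 can be sketched with a deterministic hash: each customer lands on a stable offset within the scheduling window, so recommendation tasks arrive at the work queue at an even rate. This is an illustrative sketch of the idea, not StormForge’s actual scheduler:

```python
import hashlib

def schedule_offset_minutes(customer_id: str, window_minutes: int = 60) -> int:
    """Map a customer to a stable minute offset within the scheduling window.

    Hashing spreads customers roughly uniformly across the window, so the
    per-minute volume of tasks sent to the queue stays level instead of
    spiking when many schedules fire at the same moment.
    """
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % window_minutes
```

Because the offset is derived from the customer ID rather than stored state, every scheduler replica computes the same answer, which suits a horizontally-scaled fleet of workers.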

After StormForge launched Optimize Live, they found that query costs from some of their larger customers far surpassed the cost to ingest the data. They worked with the team at AWS to tune the query logic, which allowed them to continue providing accurate recommendations while pushing costs down. They optimized their queries to return only the metrics that were needed. Through this work, they decreased the query cost of Amazon Managed Service for Prometheus by more than 50%.
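As a sketch of the kind of narrowing involved, the query below selects only the series needed for one workload instead of matching broadly and filtering client-side. The metric and label names here are hypothetical, not StormForge’s actual query logic:

```python
def container_cpu_query(namespace: str, workload: str) -> str:
    """Build a narrowly-scoped PromQL query for a single workload.

    Explicit label matchers limit how many series the workspace must scan,
    which reduces both query latency and query cost.
    """
    selector = (
        'container_cpu_usage_seconds_total'
        f'{{namespace="{namespace}",workload="{workload}"}}'
    )
    return f"sum by (container) (rate({selector}[5m]))"
```

The general principle: push label matchers and aggregations into PromQL so the workspace returns only what the recommender needs, rather than pulling whole metric families and discarding most of the samples.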

The proven scale that AWS provides and the trust StormForge built in their relationship with the teams at AWS are what helped ensure Optimize Live was built with the right architecture to meet their business objectives. The teams at AWS worked with StormForge early on to ensure they would meet their scalability goals.

Conclusion

Thanks to Amazon Managed Service for Prometheus, Optimize Live is able to ingest and query the millions of metric data points sent by their customers. These customers generally operate at the 100,000+ workload scale, spread across 10x as many containers running on thousands of clusters. Continuously generating recommendations that customers can trust is no simple feat, but the Optimize Live platform accomplishes it reliably. Generating each recommendation depends on the performance and scalability of Amazon Managed Service for Prometheus, so recommendations can be provided in a timely manner.

The ease with which the solution can be configured and deployed ensures that StormForge’s small team is able to confidently manage the platform. They are able to rely on Amazon Managed Service for Prometheus to provide the quick data access needed by their machine learning workers, app frontend, and customers. Amazon Managed Service for Prometheus silently and continuously scales in the background to handle the increasing flow of data and queries.

As a next step, get started today with Amazon Managed Service for Prometheus for your workload, or start a free 30-day trial of Optimize Live for help with rightsizing your Kubernetes workloads.

About the authors:

Mike George

Mike George is a Principal Solutions Architect based out of Salt Lake City, Utah. He enjoys helping customers solve their technology problems. His interests include software engineering, security, artificial intelligence (AI), and machine learning (ML).

Brent Eager

Brent Eager is a Senior Software Engineer at StormForge focusing on infrastructure scalability, reliability, and observability. He has spent the last 7 years architecting and building a variety of systems running on AWS, and writing Go in his spare time to tie the pieces together. Outside of work, he enjoys exploring new breweries and traveling with his family.