Containers

Kubernetes right-sizing with metrics-driven GitOps automation

Efficient resource allocation in Kubernetes is essential for optimizing application performance and controlling costs. In Amazon Elastic Kubernetes Service (Amazon EKS), managing resource requests and limits manually can be challenging and error-prone. This post introduces an automated, GitOps-driven approach to resource optimization using Amazon Web Services (AWS) services such as Amazon Managed Service for Prometheus and Amazon Bedrock. This approach is particularly beneficial for users who prefer non-intrusive methods for resource optimization.

Understanding the challenges of resource management in Amazon EKS

Understanding resource management in Kubernetes is crucial for optimal cluster performance. When deploying pods, the Kubernetes scheduler evaluates resource requests to find suitable nodes that can accommodate the specified CPU and memory requirements. These requests act as the minimum guaranteed resources for the pod, while limits serve as upper bounds to prevent any single pod from monopolizing node resources.
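For illustration, the following minimal Deployment sketch (names, image, and values are placeholders) requests 250m CPU and 256Mi of memory as its scheduling minimum, while capping consumption at 500m and 512Mi:

kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
      - name: web-app
        image: public.ecr.aws/nginx/nginx:latest
        resources:
          requests:        # minimum guaranteed resources, used by the scheduler
            cpu: 250m
            memory: 256Mi
          limits:          # upper bound that prevents monopolizing the node
            cpu: 500m
            memory: 512Mi
EOF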

The impact of inefficient resource management

Over-provisioning and under-provisioning of resources in Kubernetes can lead to increased costs and performance issues. Striking the right balance is essential for optimal resource usage. In a shared environment, one pod consuming excessive resources can degrade the performance of others on the same node. Applications with fluctuating resource demands can be challenging to manage. Without adaptive resource allocation strategies, these workloads may experience performance degradation or resource waste. Furthermore, manual adjustment of resource requests and limits for dynamic workloads can be time-consuming and susceptible to human error.

Existing solutions for Kubernetes resource management

Tools such as the Vertical Pod Autoscaler (VPA) and Goldilocks in recommendation mode, while useful, must be installed inside the cluster. That means more tooling in the clusters for you to manage, more cluster resources consumed by these tools, and configuration for each workload. At times, teams don’t have the flexibility or freedom to install custom tooling in clusters due to regulatory and strict security requirements. Other tools, such as Robusta KRR, can be run as a CLI, but rely on strategies such as 95th percentiles to calculate resource recommendations, and need custom implementations to incorporate more advanced machine learning (ML)-based strategies.

VPA and Horizontal Pod Autoscaler (HPA) can be used independently, but careful consideration is needed when using them together to avoid conflicting scaling decisions. Other tools, such as StormForge, use autonomous agents and allow direct patching of requests and limits in the cluster. Although these solutions integrate with GitOps tooling, they break the principle of Git being the source of truth for configurations, which might not be acceptable in highly regulated environments.

Although Helm and Kustomize streamline manifest management, they introduce a significant challenge for automated updates: pinpointing the correct template or values file to modify when resource recommendations are generated. Resource usage metrics and recommendations typically pertain to the final rendered Kubernetes objects (such as Deployments or StatefulSets), but the necessary changes must be applied to the source templates or their associated values files within the Git repository. This disconnect between the target of the recommendation (a rendered object) and the location of the necessary change (a source template file in the Git repository) creates complexity: knowing the optimal CPU/memory for a specific Deployment isn’t enough, because the automation must also know where in the repository to apply the change.
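As a hedged sketch of that gap, consider a Helm-based workload: the usage metrics refer to the rendered Deployment, but the automation must edit the source values file (the chart path, values file, and the yq tool are assumptions for illustration):

# The recommendation targets the rendered object...
helm template my-app ./charts/my-app -f ./charts/my-app/values-production.yaml \
  | grep -A4 'resources:'

# ...but the change must be written back to the source values file (yq v4 syntax)
yq -i '.resources.requests.cpu = "250m"' ./charts/my-app/values-production.yaml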

How the proposed solution addresses the challenges

This post offers a non-intrusive solution that optimizes resources without modifying the cluster or disrupting existing autoscaling mechanisms, and that uses pull requests as the mechanism for change management. The solution runs outside of production EKS clusters and incorporates various out-of-the-box options for advanced configurable strategies, such as forecasting algorithms. Our proposed pattern uses the rendering logic inherent in GitOps tools such as Argo CD. We incorporate a manifest rendering step so that the workflow can accurately map the recommended resource changes back to the specific source files in the Git repository. Therefore, the automation can precisely target the correct files and generate pull requests that modify the templates themselves. This makes sure that the changes integrate with the existing GitOps process and templating structure, bridging the gap between resource recommendations and template updates. We have seen this pattern used in production by teams that run an automatic recommender integrated with their existing GitOps tooling and approach. After an initial setup by the platform team, each development team across the organization owns the decision to accept or reject the suggestions on their specific application repositories, based on automatic pull request creation.

Solution overview

This post introduces an architectural pattern for automated optimization of Kubernetes resource requests and limits, shown in the following figure, that combines three key components.

A diagram of a process flow to optimize Kubernetes resources usage with a GitOps driven solution scheduled via CI/CD with GitHub Actions.

Figure 1: Overall solution flow triggered on a schedule to optimize Kubernetes resources usage

  1. Metrics-driven analysis: Uses Amazon Managed Service for Prometheus to collect and analyze historical resource usage patterns across your Amazon EKS workloads.
  2. GitOps-based implementation: Uses Argo CD to maintain a declarative approach to resource management, making sure that all changes are version-controlled and auditable.
  3. Pattern-aware optimization: Supports diverse workload patterns through specialized resource optimization strategies:
    • Time-aware analysis for applications with business-hour patterns
    • Trend-aware analysis for handling sudden spikes in resource demand
    • Workload-aware optimization for nuances of specific deployments
    • Statistical ensemble combining Quantile Regression, Moving Average, and Prophet strategies for comprehensive pattern analysis

This solution runs on a scheduled basis, automatically creating pull requests with recommended values produced by a configurable recommendation generator engine. This is combined with generative AI through Amazon Bedrock to explain the changes and generate the pull request descriptions. Organizations can adopt this pattern to continuously improve resource usage while maintaining the GitOps principles of collaboration, version control, and auditability.
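To give a concrete flavor of the metrics-driven side, the following sketch shows the kind of quantile query such a strategy can start from (the awscurl tool, the label selectors, and the 7-day window are assumptions; the sample repository implements its strategies internally):

# PromQL evaluated by a P95-style strategy (namespace and container are placeholders)
QUERY='quantile_over_time(0.95, rate(container_cpu_usage_seconds_total{namespace="demo", container="web-app"}[5m])[7d:5m])'

# URL-encode the expression and query the Amazon Managed Service for Prometheus HTTP API
ENCODED=$(jq -rn --arg q "$QUERY" '$q|@uri')
awscurl --service aps --region "$AWS_REGION" \
  "https://aps-workspaces.$AWS_REGION.amazonaws.com/workspaces/$AMP_WORKSPACE_ID/api/v1/query?query=$ENCODED"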

Workflow overview

  1. The workflow is triggered through GitHub Actions at a scheduled interval based on your requirements. As a first step, it clones the repository containing the Kubernetes configuration.
  2. The workflow fetches historical resource usage from Amazon Managed Service for Prometheus, connected to your production cluster, and calculates the recommended values for the limits and requests of each Deployment based on configurable strategies.
  3. The workflow creates a temporary local Kubernetes cluster and deploys Argo CD.
  4. Using the Argo CD CLI, the workflow renders the full Kubernetes manifests based on the relevant configuration values, overlays, and environment overrides. This allows the workflow to accurately map the recommended resource changes back to the specific source files in the Git repository and precisely target the correct files (see the condensed sketch after this list).
  5. With the updated values in hand, the workflow calls Amazon Bedrock with the new values and the existing manifests and prompts it to generate a description and an explanation of the changes for the pull request. To make sure that the model generates the pull request content properly, we provide in-context learning examples with the prompt.
  6. The workflow creates a new branch and a pull request on the GitHub repository with the changes containing the new Kubernetes requests and limits values and a relevant description.
  7. Finally, a developer reviews and merges the pull request to initiate a deployment, in keeping with GitOps best practices.
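The following condensed sketch illustrates the render-and-propose portion of steps 4 and 6 (the application name, branch, file paths, and PR details are hypothetical; the sample repository automates this logic):

# Step 4: render the full manifests with the Argo CD CLI to map recommendations
# back to their source files
argocd app manifests my-app > rendered-manifests.yaml

# Step 6: commit the updated templates and open a pull request with the gh CLI
git checkout -b resource-optimizer/2025-01-15
git add charts/my-app/values-production.yaml
git commit -m "chore: right-size CPU/memory for my-app"
gh pr create \
  --title "Right-size resources for my-app" \
  --body-file pr-description.md  # description generated with Amazon Bedrock in step 5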

Key architectural considerations

The solution incorporates multiple safeguards to make sure of production stability. Resource recommendations are bounded by configurable minimum and maximum thresholds to prevent extreme adjustments. Furthermore, the recommendation generator tooling doesn’t interact directly with the production Kubernetes cluster, which avoids installing extra tooling and increasing the possible attack surface of clusters.

The architecture seamlessly integrates with established GitOps practices and existing continuous integration/continuous delivery (CI/CD) pipelines. Resource updates are processed through the same approval workflows as other infrastructure changes, thus maintaining consistency and control. The solution supports various infrastructure as code (IaC) formats such as raw Kubernetes manifests, Helm charts, and Kustomize overlays. It includes the ability to parse and update resource definitions regardless of the templating mechanism used.

The solution uses the Amazon Managed Service for Prometheus long-term metrics storage capabilities to maintain an extensive history of resource usage patterns. This historical data informs the recommendation algorithms and supports compliance requirements for change documentation. The solution includes configurable retention policies for recommendations and their associated metrics data, making sure of alignment with organizational governance standards.

The architecture promotes collaboration through clear delineation of responsibilities between the platform and application teams. Platform teams can set global policies and constraints, while application teams maintain control over their specific workload configurations. The automated pull request process includes detailed context about recommendations, enhanced with explanations and descriptions powered by generative AI and Amazon Bedrock, enabling informed decisions during review. Integration with notification systems makes sure that relevant stakeholders are aware of proposed changes and can participate in the review process.

Walkthrough

In this walkthrough, we give detailed instructions on how to do the following:

  1. Set up an environment capable of querying Amazon Managed Service for Prometheus to compute resource recommendations.
  2. Integrate the resulting resource configuration updates into your GitOps workflow for automated pull request creation, review, and deployment.

The walkthrough uses a sample application scenario to demonstrate the process. It compares the before-and-after resource configurations and shows screenshots of pull requests and deployment statuses. Follow these steps to learn how to replicate this solution in your own environments.

Prerequisites

Before you begin, make sure that you have the following:

  • An AWS account.
  • An EKS cluster configured with a GitOps tool such as Argo CD.
  • Access to Amazon Managed Service for Prometheus.
  • A functioning CI/CD pipeline.
  • Basic familiarity with containers and Kubernetes concepts.

Implementing a GitOps-driven automation for resource optimization

The following sections walk you through implementing a GitOps-driven automation for resource optimization.

GitOps principle

The GitOps paradigm treats infrastructure and application definitions as version-controlled artifacts. Any changes to Kubernetes manifests go through pull requests in a central Git repository, enabling clear audit trails, review processes, and rollbacks when needed.

Setting up the recommendation generator

You can integrate a recommendation generator with GitOps to automate resource optimization within a controlled, review-based process. The generator runs outside the cluster, consults metrics to produce recommended CPU and memory requests/limits, and generates pull requests updating the manifests in your Git repository. Because this setup operates outside the cluster and relies on existing metrics, it avoids any impact on cluster operations or performance. A GitOps tool, such as Argo CD, then detects and deploys these updates after they’re approved and merged.

Environment and metrics source

To implement this solution, you must begin with an EKS cluster. Next, verify that Amazon Managed Service for Prometheus is properly configured to gather essential CPU and memory usage data from your cluster. You must establish a Git repository to store and version control your Kubernetes manifests, including all necessary Deployments and other configuration files. Finally, implement a GitOps tool such as Argo CD to continuously monitor your repository for changes, making sure that your cluster state always matches your desired configuration as defined in your Git repository.
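Before wiring up the automation, you can sanity-check that metrics are flowing into the workspace (again assuming the awscurl tool; the up metric should return one series per scrape target):

# Verify that the workspace is receiving scraped metrics from the cluster
awscurl --service aps --region "$AWS_REGION" \
  "https://aps-workspaces.$AWS_REGION.amazonaws.com/workspaces/$AMP_WORKSPACE_ID/api/v1/query?query=up"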

This GitHub link provides an example of creating an EKS cluster and an Amazon Managed Service for Prometheus workspace with Terraform to get started.

Local or external installation

As a first step, clone the resource optimizer repository:

git clone https://github.com/aws-samples/K8sResourceResizer.git

Set up your recommendation generator in an environment external to your cluster, whether that’s a dedicated instance, a containerized solution, or an integration within your existing CI/CD pipeline runner. When it’s established, make sure that the generator is properly configured with the necessary credentials and endpoints to access and query historical metrics from Amazon Managed Service for Prometheus. With this setup in place, execute the tool to analyze current and historical usage data, which generates optimized resource recommendations tailored to your specific workload requirements. In this walkthrough, we set up the recommendation generator as part of a GitHub Actions CI/CD pipeline.
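As an illustration, invoking the containerized generator from a pipeline step might look like the following (the image URI and entrypoint behavior are assumptions, and the environment variables mirror the configuration described later in this post; check the repository README for the exact options):

# Run the recommendation generator container against your metrics workspace
docker run --rm \
  -e AMP_WORKSPACE_ID="ws-12345678-abcd-1234-abcd-123456789012" \
  -e AWS_REGION="eu-west-1" \
  -e GITHUB_TOKEN="$GITHUB_TOKEN" \
  123456789012.dkr.ecr.eu-west-1.amazonaws.com/k8s-resource-resizer:latest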

Automating the GitOps workflow

To set up the workflow on GitHub Actions:

  1. Build and push the recommendation generator Docker container image from the Dockerfile and the source code of the repository you cloned to your own container image registry. You can find instructions in the GitHub repository readme file, or you can use our example GitHub Actions workflow to build and push to Amazon Elastic Container Registry (Amazon ECR). To set up the connection with AWS, follow this guide to configure an AWS Identity and Access Management (IAM) role according to best practices.
  2. After you have the recommendation Docker image in your own container image registry, set up a GitHub Actions workflow on the repository where you store your Kubernetes manifests or templates. This is an example GitHub Actions workflow that triggers this flow based on a configurable cron schedule, parameterized with different configuration options for the recommendations, and using IAM roles for authentication. Go to the Strategy Selection section of the code repository to get an idea of the various available strategies and how to choose one. To set this up, configure IAM roles with GitHub Actions and configure these values as GitHub secrets (see the sketch after this list):
    • AMP_WORKSPACE_ID: Amazon Managed Service for Prometheus workspace ID
    • AWS_REGION: AWS Region
    • GITHUB_TOKEN: GitHub token for authentication
  3. Trigger the GitHub Actions workflow manually and verify that you get a pull request created on your Kubernetes manifests repository.
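For steps 2 and 3, the secrets and the manual trigger can also be handled with the gh CLI (the workflow file name is hypothetical):

# Step 2: configure the repository secrets used by the workflow
gh secret set AMP_WORKSPACE_ID --body "ws-12345678-abcd-1234-abcd-123456789012"
gh secret set AWS_REGION --body "eu-west-1"
# The GitHub token is supplied by the Actions runtime or a PAT, depending on your setup

# Step 3: trigger the workflow manually and watch for the resulting pull request
gh workflow run resource-optimizer.yml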

The following figures demonstrate how the produced pull request looks.

Example pull request description generated with the help of Amazon Bedrock on GitHub

Figure 2: Pull request description generated with Amazon Bedrock

View of the pull request changes according to the recommendations from the tool.

Figure 3: Pull request changes according to the recommendations

Cleaning up

To avoid ongoing charges, remove all of the resources that you have created as part of testing this solution:

  1. Go to the AWS Management Console, find the Amazon ECR repository, and remove the recommendation container image of the solution (see the command sketch after this list).
  2. Follow the cleanup section of the GitHub repository and, from the Amazon EKS directory, run the following command:
terraform destroy
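For step 1, the image can also be removed from the command line (the repository name and tag are illustrative):

aws ecr batch-delete-image \
  --repository-name k8s-resource-resizer \
  --image-ids imageTag=latest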

Conclusion

This solution demonstrates a practical, non-intrusive approach to Kubernetes resource optimization that combines the power of metrics-driven decision making with GitOps principles. Organizations can use Amazon Managed Service for Prometheus for historical data analysis, Amazon Bedrock for intelligent manifest updates, and GitOps workflows through Argo CD to automate their resource optimization process while maintaining control and visibility over changes.

Although this solution provides immediate value for resource optimization, it also serves as a foundation for more advanced automation scenarios. Organizations can build upon this framework to incorporate more metrics, custom optimization algorithms, or integration with cost management tools. We encourage users to start with a subset of non-critical workloads to gain familiarity with the process, then gradually expand to cover more of their Kubernetes estate. Teams can follow the implementation guide and best practices outlined in this post to achieve better resource usage while maintaining the stability and reliability of their applications.

To get started, visit our GitHub repository and follow the setup instructions. We welcome community feedback and contributions to help evolve this solution further.


About the authors

Hari Charan Ayada is a Senior Solutions Architect at Amazon Web Services in Copenhagen.

Ioannis Moustakis is a Senior Solutions Architect at Amazon Web Services in Brussels.