AWS Cloud Operations Blog

Automating metrics collection on Amazon EKS with Amazon Managed Service for Prometheus managed scrapers

Managing and operating monitoring systems for containerized applications, including metrics collection, can be a significant operational burden for customers. As container environments scale, customers have to split metrics collection across multiple collectors, right-size the collectors to handle peak loads, and continuously manage, patch, secure, and operationalize these collectors. This overhead can detract from an organization’s ability to focus on building and running their applications. To address these challenges, Amazon Managed Service for Prometheus announced a fully managed, agentless scraper for Prometheus metrics coming from Amazon Elastic Kubernetes Service (Amazon EKS) applications and infrastructure. The fully managed scraper enables customers to collect Prometheus metrics from Amazon EKS environments without installing, patching, updating, managing, or right-sizing any agents in-cluster. This allows customers to offload the “undifferentiated heavy lifting” of self-managing agents for Prometheus metrics collection.

Walkthrough

In this blog, we’ll walk through how you can use infrastructure-as-code with Terraform or AWS CloudFormation to define an Amazon Managed Service for Prometheus scraper for an existing Amazon EKS cluster. With the support of Amazon EKS access management controls, we can fully automate Prometheus metrics collection for the Amazon EKS clusters. A high-level architecture with the fully managed scraper looks like the following diagram.

Figure 1: High-level architecture for metrics collection with Amazon Managed Service for Prometheus Scraper

Later, we’ll look into how to update the scraper configuration without disrupting metrics collection, and show how to validate the setup by leveraging usage metrics to follow metrics ingestion into the Amazon Managed Service for Prometheus workspace.

Prerequisites

  1. An existing Amazon EKS cluster with cluster endpoint access control configured to include private access. It can include private and public access, but must include private access.
  2. The Amazon EKS authentication mode of the cluster set to either API_AND_CONFIG_MAP or API.
  3. kubectl version 1.30.2 or later.
  4. AWS Command Line Interface (AWS CLI) version 2.
  5. Terraform version 1.8.0 or later.

Automate with AWS CloudFormation

AWS CloudFormation has released a new resource type, AWS::APS::Scraper, to manage the lifecycle of a managed scraper. This resource accepts the following mandatory parameters:

  • Source: The source of collected metrics for a scraper. Currently, the only supported child block is EksConfiguration, which references an Amazon EKS cluster whose endpoint access control includes at least private access (private only, or private and public).
  • Destination: A location for collected metrics. This block can have a child block AmpConfiguration representing the Amazon Resource Name (ARN) of an Amazon Managed Service for Prometheus workspace. Note that the AmpConfiguration block is optional, and if omitted, will trigger the creation of a new workspace by the underlying CreateScraper API.
  • ScrapeConfiguration: A base64-encoded Prometheus configuration that specifies which endpoints to collect metrics from, the scrape interval (interval of collection), and service discovery settings (using Kubernetes endpoints for additional metadata).
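The base64 encoding can be produced with any tool. As a sketch in Python (the scrape configuration below is a minimal placeholder for illustration, not a complete production config):

```python
import base64

# Minimal Prometheus scrape configuration (placeholder job and interval).
scrape_config = """\
global:
  scrape_interval: 30s
scrape_configs:
  - job_name: pod_exporter
    kubernetes_sd_configs:
      - role: pod
"""

# The ScrapeConfiguration parameter expects this YAML as a base64 string.
encoded = base64.b64encode(scrape_config.encode("utf-8")).decode("ascii")
```

The resulting string is what you pass to the scraper resource in CloudFormation or Terraform.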

In the AWS CloudFormation template, we have a parameter called AMPWorkSpaceAlias. If its value is “NONE”, the stack creates a new Amazon Managed Service for Prometheus workspace to store the Amazon EKS cluster metrics. Alternatively, you can provide the ARN of an existing Amazon Managed Service for Prometheus workspace (in the same Region) using the parameter AMPWorkSpaceArn.

The first step is to run a few AWS CLI commands to retrieve values required for creating the scraper, such as the security group ID and subnet IDs of the Amazon EKS cluster (note: follow the Amazon EKS security best practices when configuring security groups). The output of the commands below will be used as values for the input parameters of the AWS CloudFormation stack in the next step. Replace <EKS_CLUSTER> with your Amazon EKS cluster name.

# selecting one security group associated with the cluster's VPC
aws eks describe-cluster --name <EKS_CLUSTER> | jq -r '.cluster.resourcesVpcConfig.securityGroupIds[0]'

# selecting the cluster's subnets
aws eks describe-cluster --name <EKS_CLUSTER> | jq -r '.cluster.resourcesVpcConfig.subnetIds[0]'
aws eks describe-cluster --name <EKS_CLUSTER> | jq -r '.cluster.resourcesVpcConfig.subnetIds[1]'
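If you prefer the AWS SDK, the same lookup can be sketched in Python. The extraction helper below is our own, written against the shape of the DescribeCluster response; the boto3 call itself is commented out because it requires AWS credentials and a live cluster:

```python
def scraper_network_config(vpc_config: dict) -> dict:
    """Pick one security group and the first two subnets from a cluster's
    resourcesVpcConfig block, as needed for the scraper parameters."""
    return {
        "securityGroupId": vpc_config["securityGroupIds"][0],
        "subnetIds": vpc_config["subnetIds"][:2],
    }

# Hypothetical usage against a live cluster (requires AWS credentials):
# import boto3
# eks = boto3.client("eks")
# cluster = eks.describe_cluster(name="<EKS_CLUSTER>")["cluster"]
# print(scraper_network_config(cluster["resourcesVpcConfig"]))
```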

Now let’s create the scraper from the AWS CloudFormation template by running the commands below. Replace WORKSPACE_ARN, EKS_CLUSTER_ARN, EKS_SECURITY_GROUP_ID, SUBNET_ID_1, and SUBNET_ID_2 with the respective values retrieved earlier.

git clone https://github.com/aws-samples/containers-blog-maelstrom.git
cd containers-blog-maelstrom/amp-scraper-automation-blog/cloudformation
aws cloudformation create-stack --stack-name AMPScraper \
    --template-body file://scraper.yaml \
    --parameters ParameterKey=AMPWorkSpaceArn,ParameterValue=<WORKSPACE_ARN> \
    ParameterKey=ClusterArn,ParameterValue=<EKS_CLUSTER_ARN> \
    ParameterKey=SecurityGroupId,ParameterValue=<EKS_SECURITY_GROUP_ID> \
    ParameterKey=SubnetId1,ParameterValue=<SUBNET_ID_1> \
    ParameterKey=SubnetId2,ParameterValue=<SUBNET_ID_2>

After running these commands, the managed scraper creation leverages Amazon EKS access entries to automatically grant the Amazon Managed Service for Prometheus scraper access to your cluster. The AWS CloudFormation stack takes a few minutes to complete. In the next sections of the blog, we will confirm that the resources are created as expected, either via the CLI or the console.
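The same stack can also be driven from Python with boto3 as a sketch. The cfn_parameters helper is our own convenience for building the ParameterKey/ParameterValue list CloudFormation expects; the create_stack and waiter calls are commented out because they require AWS credentials:

```python
def cfn_parameters(values: dict) -> list:
    """Convert a plain dict into the ParameterKey/ParameterValue list
    that CloudFormation's CreateStack API expects."""
    return [{"ParameterKey": k, "ParameterValue": v} for k, v in values.items()]

# Placeholder values, as in the CLI example above.
params = cfn_parameters({
    "AMPWorkSpaceArn": "<WORKSPACE_ARN>",
    "ClusterArn": "<EKS_CLUSTER_ARN>",
    "SecurityGroupId": "<EKS_SECURITY_GROUP_ID>",
    "SubnetId1": "<SUBNET_ID_1>",
    "SubnetId2": "<SUBNET_ID_2>",
})

# import boto3
# cfn = boto3.client("cloudformation")
# cfn.create_stack(StackName="AMPScraper",
#                  TemplateBody=open("scraper.yaml").read(),
#                  Parameters=params)
# cfn.get_waiter("stack_create_complete").wait(StackName="AMPScraper")
```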

Automate with Terraform

Now let’s see how to leverage Terraform for the end-to-end setup.

We will be creating an Amazon Managed Service for Prometheus fully managed scraper using the Terraform resource aws_prometheus_scraper. Make sure to complete the prerequisites so that the managed scraper creation can leverage Amazon EKS access entries to automatically grant the Amazon Managed Service for Prometheus scraper access to your cluster.

Run the commands below, replacing EKS_CLUSTER with your Amazon EKS cluster name.

git clone https://github.com/aws-samples/containers-blog-maelstrom.git
cd containers-blog-maelstrom/amp-scraper-automation-blog/terraform
terraform init
terraform apply -var eks_cluster_name="EKS_CLUSTER"

Validation

In the next sections, we will confirm that our managed scraper has been created, is associated with the cluster, and is effectively collecting metrics.

Using AWS CLI

The list-scrapers CLI action retrieves all the scrapers you have created. You can provide a filter to narrow down your search. In the example below, we filter on the alias amp-scraper-automation used in the Terraform and AWS CloudFormation examples.

aws amp list-scrapers --filters alias=amp-scraper-automation

Figure 2 - View managed scraper using AWS CLI
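boto3 exposes the same ListScrapers action through its amp client. A small sketch (the response-filtering helper is our own, for illustration; the live call is commented out since it needs AWS credentials):

```python
def scrapers_with_alias(scrapers: list, alias: str) -> list:
    """Filter a ListScrapers response body by scraper alias."""
    return [s for s in scrapers if s.get("alias") == alias]

# import boto3
# amp = boto3.client("amp")
# resp = amp.list_scrapers(filters={"alias": ["amp-scraper-automation"]})
# print(scrapers_with_alias(resp["scrapers"], "amp-scraper-automation"))
```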

Using AWS Management Console

Log in to the AWS account and Region where your Amazon EKS cluster is created. Select the cluster, then choose the Observability tab; you should see the scraper that was created (as shown in the screenshot below).

Figure 3 - View managed scraper in the Amazon EKS console

Amazon CloudWatch Usage metrics

Amazon Managed Service for Prometheus publishes its usage metrics to Amazon CloudWatch. This gives you immediate insight into the utilization of your workspace. You can set up Amazon CloudWatch alarms to track some of those metrics, depending on your use case. If you followed the steps above, you should be able to view these metrics in the Amazon CloudWatch console.

In the Amazon CloudWatch Usage metric namespace, we select the IngestionRate and ActiveSeries metrics to validate and monitor usage against service quotas, as shown in the following figure.

Figure 4 - Viewing Usage metrics in Amazon CloudWatch

Let’s see some examples of setting up Amazon CloudWatch alarms for these ingested Prometheus metrics:

  • ActiveSeries – The quota on active series per workspace is automatically adjusted up to a certain point (as mentioned on the Service Quotas page). To grow beyond that, we can set up an Amazon CloudWatch alarm to monitor usage. For example, when ActiveSeries goes above 10 million, we receive an alarm so that we can request a quota increase.
  • IngestionRate – We can use the DIFF and/or RATE metric math functions to detect spikes in ingestion that could come from a misconfiguration, or from teams suddenly ingesting too many metrics.
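The ActiveSeries alarm from the first bullet can be sketched with boto3 as below. The AWS/Usage namespace, metric name, and dimension values are assumptions based on how CloudWatch usage metrics are typically reported; confirm the exact metric and dimensions shown in your CloudWatch console before creating the alarm:

```python
# Alarm definition (assumed namespace/dimensions; verify in your console).
alarm = {
    "AlarmName": "amp-active-series-above-10M",
    "Namespace": "AWS/Usage",
    "MetricName": "ResourceCount",
    "Dimensions": [
        {"Name": "Service", "Value": "Prometheus"},
        {"Name": "Resource", "Value": "ActiveSeries"},
    ],
    "Statistic": "Maximum",
    "Period": 300,
    "EvaluationPeriods": 1,
    "Threshold": 10_000_000,
    "ComparisonOperator": "GreaterThanThreshold",
}

# Live call, commented out since it requires AWS credentials:
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**alarm)
```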

Amazon CloudWatch also creates an automatic dashboard for these metrics ingested from Amazon Managed Service for Prometheus.

Something important to note is that the managed scraper does not consume any resources in the source Amazon EKS cluster. If you list all the pods running in the cluster with the following command, try and see if you can spot any scraper!

kubectl get pods --all-namespaces

High availability metrics data to avoid duplicate metrics

Now let’s see how to update the scraper configuration without disrupting metrics collection. To do this, we will configure high-availability data with Amazon Managed Service for Prometheus. We need to add external_labels under the global section; this can be any key:value pair. Here we add a label called source with the value reduce_metrics. We have also reduced the metrics collected, keeping only the pod_exporter job for this example.

Figure 5 - High-level architecture for HA metrics collection using scrapers

Using Terraform as an example, we can add a new aws_prometheus_scraper resource block in the same file. In the snippet below, showing only the difference with the first scraper resource, we are using a smaller scrape configuration to collect fewer metrics. Note that we are adding external_labels under the global section of the scrape configuration.

resource "aws_prometheus_scraper" "reduce_samples" {
  # ... omitted for brevity 
  scrape_configuration = <<EOT
global:
  scrape_interval: 30s
  external_labels: 
    source: reduce_metrics
scrape_configs:
  # cluster config
  # pod metrics
  - job_name: pod_exporter
    kubernetes_sd_configs:
      - role: pod
EOT

  # ... omitted for brevity 
}

Append the snippet above to your Terraform file, then run terraform plan followed by terraform apply; you should see a new scraper being created.

Figure 6 - Viewing creation of second scraper from the AWS console

Once the new scraper is Active, you can delete the old scraper by removing its aws_prometheus_scraper resource block, shown below.

- resource "aws_prometheus_scraper" "this" {
- ...
- }

Again, run terraform plan and then terraform apply to apply the changes.

Visualize in Grafana

Using the Explore feature in Grafana, we can create our own queries by selecting the desired metrics and filters, and add them to an existing dashboard or create a new one. We will use Explore to query our Amazon Managed Service for Prometheus workspace. Follow the AWS documentation to set up Amazon Managed Grafana.

We can see that the source="reduce_metrics" external label we added to our scraper configuration is now available under Explore → Label filters and can be used to create visualizations.

Figure 7 - Validating the external label added with the second HA scraper

We can also confirm that metrics were not duplicated while the two managed scrapers were running simultaneously.

Figure 8 - Validating HA and de-duplication of metrics

Cleaning up

To avoid incurring further charges, delete the resources created in this post by running the commands below, depending on the path you chose.

CloudFormation:

aws cloudformation delete-stack --stack-name AMPScraper

Terraform:

terraform destroy

Conclusion

In this blog, we’ve walked through how you can create an Amazon Managed Service for Prometheus scraper through Infrastructure as code tools such as Terraform and AWS CloudFormation. With the integration between the managed scraper and Amazon EKS access management controls, you can now programmatically create scrapers, and associate them with your Amazon EKS clusters with simple, repeatable and predictable deployments. By using the managed scraper, you can reduce your operational load and have AWS scale your ingestion to match your traffic. We’ve also shown how to update the managed scraper without disrupting your metrics collection. Finally, we have seen how to leverage CloudWatch metrics to follow the ingestion of an Amazon Managed Service for Prometheus workspace.

  • To go further with monitoring Amazon EKS clusters, check out our end-to-end solution with opinionated metrics collection, dashboards and alarms, with infrastructure-as-code.
  • Check out One Observability Workshop aimed at providing a hands-on experience for you on the wide variety of toolsets AWS offers to setup monitoring and observability on your applications.
  • Refer to AWS Observability best practices to learn more about prescriptive guidance and recommendations with implementation examples.

Rodrigue Koffi

Rodrigue is a Specialist Solutions Architect at Amazon Web Services for Observability. He is passionate about observability, distributed systems, and machine learning. He has a strong DevOps and software development background and loves programming with Go. Outside work, Rodrigue enjoys swimming and spending quality time with his family. Find him on LinkedIn at /grkoffi

Ruchika Modi

Ruchika Modi is a Lead DevOps Consultant at Amazon Web Services (AWS). She specializes in cloud infrastructure, automation, containerization and CI/CD. She has extensive experience in building secure, scalable and highly available cloud-native architectures and has a deep understanding and expertise in designing and implementing cutting-edge cloud solutions using DevOps methodologies. Outside of work, she enjoys travelling to new unexplored locations, spending time with her pet and reading on Kindle.

Abhi Khanna

Abhi Khanna is a Senior Product Manager at AWS specializing in Amazon Managed Service for Prometheus. He has been involved with Observability products for the last 3 years, helping customers build towards more perfect visibility. He enjoys helping customers simplify their monitoring experience. His interests include software engineering, product management, and building things.