Using Prometheus Metrics in Amazon CloudWatch

Imaya Kumar Jagannathan, Justin Gu, Marc Chéné, and Michael Hausenblas

Update 2020-09-08: The feature described in this post is now generally available (GA); for details, see the “Amazon CloudWatch now monitors Prometheus metrics from Container environments” What’s New item.

Earlier this week we announced public beta support for monitoring Prometheus metrics in CloudWatch Container Insights. In this post we show you how to use this new Amazon CloudWatch feature for containerized workloads in Amazon Elastic Kubernetes Service (EKS) and in Kubernetes clusters on AWS that you provision yourself.

Prometheus is a popular open source monitoring tool and a graduated Cloud Native Computing Foundation (CNCF) project, with a large and active community of practitioners. Amazon CloudWatch Container Insights automates the discovery and collection of Prometheus metrics from containerized applications. It automatically collects, filters, and creates aggregated custom CloudWatch metrics visualized in dashboards for workloads such as AWS App Mesh, NGINX, Java/JMX, Memcached, and HAProxy. By default, preselected services are scraped and pre-aggregated every 60 seconds and automatically enriched with metadata such as cluster and pod names.

We aim to support any Prometheus exporter compatible with OpenMetrics, allowing you to scrape any containerized workload using one of the 150+ open source third-party exporters.

How does it work? You need to run the CloudWatch agent in your Kubernetes cluster. The agent now supports Prometheus configuration, discovery, and metric pull features, enriching and publishing all high fidelity Prometheus metrics and metadata as Embedded Metric Format (EMF) to CloudWatch Logs. Each event creates metric data points as CloudWatch custom metrics for a curated set of metric dimensions that is fully configurable. Publishing aggregated Prometheus metrics as CloudWatch custom metrics statistics reduces the number of metrics needed to monitor, alarm, and troubleshoot performance problems and failures. You can also analyze the high-fidelity Prometheus metrics using CloudWatch Logs Insights query language to isolate specific pods and labels impacting the health and performance of your containerized environments.
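
To make the aggregation idea more concrete, here is a minimal Python sketch of rolling per-pod samples up into one statistic set per dimension combination. It is an illustration only, not the agent's actual implementation; the function name `aggregate` and the sample data are made up for this example.

```python
def aggregate(samples, dimensions):
    """Roll samples up into one statistic set per unique dimension-value combination."""
    stats = {}
    for sample in samples:
        # Only the configured dimensions form the key; other labels (e.g. pod_name) are dropped.
        key = tuple(sample["labels"].get(name, "") for name in dimensions)
        value = sample["value"]
        entry = stats.setdefault(key, {"count": 0, "sum": 0.0, "min": value, "max": value})
        entry["count"] += 1
        entry["sum"] += value
        entry["min"] = min(entry["min"], value)
        entry["max"] = max(entry["max"], value)
    return stats


# Hypothetical samples of a single metric, one per pod.
samples = [
    {"labels": {"ClusterName": "cw-prom", "Namespace": "web", "Service": "ingress", "pod_name": "pod-a"}, "value": 3.0},
    {"labels": {"ClusterName": "cw-prom", "Namespace": "web", "Service": "ingress", "pod_name": "pod-b"}, "value": 2.0},
    {"labels": {"ClusterName": "cw-prom", "Namespace": "batch", "Service": "worker", "pod_name": "pod-c"}, "value": 1.0},
]

rolled_up = aggregate(samples, ["ClusterName", "Namespace", "Service"])
print(f"{len(samples)} samples -> {len(rolled_up)} aggregated series")
```

The per-pod detail is not lost in the real setup: it stays in the log events, which is what makes the CloudWatch Logs Insights queries mentioned above possible.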

With that said, let us now move on to the practical part, where we show you how to use the CloudWatch Container Insights Prometheus metrics in two setups: we start with a simple example of scraping NGINX and then look at how to use custom metrics by instrumenting an ASP.NET Core app.

Out-of-the-box metrics from NGINX

In this first example we use an EKS cluster as the runtime environment and deploy the CW Prometheus agent, which ingests the scraped metrics as EMF events into CloudWatch. We use the NGINX Ingress controller as the scrape target and a dedicated app that generates traffic for it. Overall the setup looks as follows:

We have three namespaces in the EKS cluster: amazon-cloudwatch, which hosts the CW Prometheus agent; nginx-ingress-sample, where the NGINX Ingress controller runs; and nginx-sample-traffic, which hosts our sample app, including the traffic generator.

If you want to follow along, you will need eksctl installed to provision the EKS cluster as well as Helm 3 for the application installation.

For the EKS cluster we’re using the following cluster configuration (save it as clusterconfig.yaml; you may want to change the region to one geographically closer to you):

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: cw-prom
  region: eu-west-1

iam:
  withOIDC: true

managedNodeGroups:
  - name: defaultng
    minSize: 1
    maxSize: 4
    desiredCapacity: 2
    labels: {role: mngworker}
    iam:
      withAddonPolicies:
        externalDNS: true
        certManager: true
        ebs: true
        albIngress: true
        cloudWatch: true
        appMesh: true

cloudWatch:
  clusterLogging:
    enableTypes: ["*"]

You can then provision the EKS cluster with the following command:

eksctl create cluster -f clusterconfig.yaml

Under the hood, eksctl uses CloudFormation, so you can follow the progress in the CloudFormation console. Expect provisioning to take about 15 minutes end to end.

Next, we install the NGINX Ingress controller in the dedicated Kubernetes namespace nginx-ingress-sample, using Helm:

kubectl create namespace nginx-ingress-sample

helm repo add stable https://charts.helm.sh/stable

helm install stable/nginx-ingress --generate-name --version 1.33.5 \
--namespace nginx-ingress-sample \
--set controller.metrics.enabled=true \
--set controller.metrics.service.annotations."prometheus\.io/port"="10254" \
--set controller.metrics.service.annotations."prometheus\.io/scrape"="true"
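
The two annotations set above follow a common Prometheus convention for marking scrape targets. As a rough sketch of how annotation-based discovery can turn them into a scrape URL, consider the following Python function. The function name `scrape_target` is hypothetical, the `prometheus.io/path` annotation is a widely used convention that is not set in this post, and the scrape configuration bundled with the agent may apply different or additional rules.

```python
def scrape_target(address, annotations, default_path="/metrics"):
    """Return the URL to scrape, or None if the annotations do not opt in."""
    # Opt-in: only targets explicitly annotated with scrape=true are considered.
    if annotations.get("prometheus.io/scrape") != "true":
        return None
    port = annotations.get("prometheus.io/port")
    if port is None:
        return None
    path = annotations.get("prometheus.io/path", default_path)
    return f"http://{address}:{port}{path}"


# Annotations as set by the Helm values above.
annotations = {"prometheus.io/scrape": "true", "prometheus.io/port": "10254"}
print(scrape_target("192.168.89.24", annotations))
```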

To point the traffic generator at the load balancer managed by the NGINX Ingress controller, we have to look up its external address (the EXTERNAL-IP column, which for an AWS load balancer holds a DNS name) like so:

$ kubectl -n nginx-ingress-sample get svc 
NAME                                          TYPE           CLUSTER-IP      EXTERNAL-IP                                                               PORT(S)                      AGE
nginx-ingress-1588245517-controller           LoadBalancer   10.100.245.88   ac8cebb58959a4627a573fa5e5bd0937-2083146415.eu-west-1.elb.amazonaws.com   80:31881/TCP,443:32010/TCP   72s
nginx-ingress-1588245517-controller-metrics   ClusterIP      10.100.32.79    <none>                                                                    9913/TCP                     72s
nginx-ingress-1588245517-default-backend      ClusterIP      10.100.75.112   <none>                                                                    80/TCP                       72s

With that we now have everything to set up the sample app and the traffic generator in the nginx-sample-traffic namespace (note that for EXTERNAL_IP you have to supply the EXTERNAL-IP value of your own load balancer from the previous step):

SAMPLE_TRAFFIC_NAMESPACE=nginx-sample-traffic

EXTERNAL_IP=ac8cebb58959a4627a573fa5e5bd0937-2083146415.eu-west-1.elb.amazonaws.com

curl https://cloudwatch-agent-k8s-yamls.s3.amazonaws.com/quick-start/nginx-traffic-sample.yaml | \
sed "s/{{external_ip}}/$EXTERNAL_IP/g" | \
sed "s/{{namespace}}/$SAMPLE_TRAFFIC_NAMESPACE/g" | \
kubectl apply -f -
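
The sed pipeline above is plain placeholder substitution. If you prefer to do the same step in a script, the following Python sketch shows one way; the function name `render` and the two-line template fragment are made up for illustration (the real manifest is the one downloaded by curl above). Unlike the sed version, it does not break when a value contains a slash, and it fails loudly on a placeholder without a value.

```python
import re


def render(template, values):
    """Replace every {{name}} placeholder with values[name]."""
    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for placeholder {name!r}")
        return values[name]

    return re.sub(r"\{\{(\w+)\}\}", substitute, template)


# Hypothetical template fragment using the same placeholders as the manifest.
template = "host: {{external_ip}}\nnamespace: {{namespace}}\n"
rendered = render(template, {
    "external_ip": "example-1234.eu-west-1.elb.amazonaws.com",
    "namespace": "nginx-sample-traffic",
})
print(rendered)
```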

Last but not least we install the CW agent in the amazon-cloudwatch namespace, using:

kubectl create namespace amazon-cloudwatch
kubectl apply -f https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/prometheus-beta/k8s-deployment-manifest-templates/deployment-mode/service/cwagent-prometheus/prometheus-eks.yaml

We’re almost there, but we need one more thing: we have to give the CW agent permission to write metrics to CloudWatch. For this we’re using IAM Roles for Service Accounts (IRSA), an EKS feature that allows for least-privilege access control, effectively restricting the access to CW granted by the CloudWatchAgentServerPolicy to the pod running the CW agent:

eksctl create iamserviceaccount \
           --name cwagent-prometheus \
           --namespace amazon-cloudwatch \
           --cluster cw-prom \
           --attach-policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy \
           --override-existing-serviceaccounts \
           --approve

Now we are in a position to verify the setup. First, we check if the service account that the CW agent deployment uses has been annotated properly (you should see an annotation with the key eks.amazonaws.com/role-arn here):

$ kubectl -n amazon-cloudwatch get sa cwagent-prometheus -o yaml |\
  grep eks.amazon
eks.amazonaws.com/role-arn: arn:aws:iam::148658015984:role/eksctl-cw-prom-addon-iamserviceaccount-amazo-Role1-69WKQE6D9CG3
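
If you want to automate this check, a small Python sketch along the following lines could inspect the JSON form of the service account (as produced by kubectl with -o json). The function name `irsa_role_arn`, the account ID, and the role name in the sample are placeholders.

```python
import json

ROLE_ANNOTATION = "eks.amazonaws.com/role-arn"


def irsa_role_arn(service_account_json):
    """Return the IAM role ARN annotated on the service account, or None."""
    service_account = json.loads(service_account_json)
    annotations = service_account.get("metadata", {}).get("annotations") or {}
    return annotations.get(ROLE_ANNOTATION)


# Hypothetical, abbreviated service account document.
sample = json.dumps({
    "kind": "ServiceAccount",
    "metadata": {
        "name": "cwagent-prometheus",
        "namespace": "amazon-cloudwatch",
        "annotations": {ROLE_ANNOTATION: "arn:aws:iam::111122223333:role/example-role"},
    },
})
print(irsa_role_arn(sample))
```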

You should also verify that the CWAgent is running properly with kubectl -n amazon-cloudwatch get pod, which should show it in Running state.

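With everything deployed and running, we can now query the metrics from the CLI. The start-query call returns a query ID, which you then pass to get-query-results (replace the ID shown below with your own):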

# Note: `date -v-1H` is BSD/macOS syntax; with GNU date (Linux) use `date -d '1 hour ago' +%s` instead.
aws logs start-query \
       --log-group-name /aws/containerinsights/cw-prom/prometheus \
       --start-time `date -v-1H +%s` \
       --end-time `date +%s` \
       --query-string "fields @timestamp, Service, CloudWatchMetrics.0.Metrics.0.Name as PrometheusMetricName, @message | sort @timestamp desc | limit 50 | filter CloudWatchMetrics.0.Namespace='ContainerInsights/Prometheus'"

aws logs get-query-results \
        --query-id e69f2544-add0-4d14-98ff-0fadb54f27f1
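
The two CLI calls map directly onto the CloudWatch Logs API, so the same flow can be scripted. The Python sketch below starts the query, then polls until it reaches a final state. The function name `run_insights_query` and its polling parameters are our own choices; the client is expected to behave like a boto3 CloudWatch Logs client (boto3 itself is not imported here, see the usage comment at the end).

```python
import time

QUERY = (
    "fields @timestamp, Service, CloudWatchMetrics.0.Metrics.0.Name as PrometheusMetricName, @message "
    "| sort @timestamp desc | limit 50 "
    "| filter CloudWatchMetrics.0.Namespace='ContainerInsights/Prometheus'"
)

FINAL_STATES = ("Complete", "Failed", "Cancelled", "Timeout")


def run_insights_query(logs_client, log_group, query, lookback_seconds=3600,
                       poll_interval=1.0, max_polls=60):
    """Start a Logs Insights query over the last `lookback_seconds` and wait for it to finish."""
    end_time = int(time.time())
    start_time = end_time - lookback_seconds
    query_id = logs_client.start_query(
        logGroupName=log_group,
        startTime=start_time,
        endTime=end_time,
        queryString=query,
    )["queryId"]
    for _ in range(max_polls):
        response = logs_client.get_query_results(queryId=query_id)
        if response["status"] in FINAL_STATES:
            return response
        time.sleep(poll_interval)
    raise TimeoutError(f"query {query_id} did not finish after {max_polls} polls")


# Usage, assuming boto3 is installed and AWS credentials are configured:
#   import boto3
#   result = run_insights_query(boto3.client("logs"),
#                               "/aws/containerinsights/cw-prom/prometheus", QUERY)
```

Computing the time window in code also sidesteps the differences between date implementations on macOS and Linux.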

The output of the above aws logs command looks something like the following (note the Prometheus metrics encoded in the last value field shown here):

{
    "results": [
        [
            {
                "field": "@timestamp",
                "value": "2020-04-30 11:40:38.230"
            },
            {
                "field": "Service",
                "value": "nginx-ingress-1588245517-controller-metrics"
            },
            {
                "field": "PrometheusMetricName",
                "value": "nginx_ingress_controller_nginx_process_connections"
            },
            {
                "field": "@message",
                "value": "{\"CloudWatchMetrics\":[{\"Metrics\":[{\"Name\":\"nginx_ingress_controller_nginx_process_connections\"}],\"Dimensions\":[[\"ClusterName\",\"Namespace\",\"Service\"]],\"Namespace\":\"ContainerInsights/Prometheus\"}],\"ClusterName\":\"cw-prom\",\"Namespace\":\"nginx-ingress-sample\",\"Service\":\"nginx-ingress-1588245517-controller-metrics\",\"Timestamp\":\"1588246838202\",\"Version\":\"0\",\"app\":\"nginx-ingress\",\"chart\":\"nginx-ingress-1.33.5\",\"component\":\"controller\",\"container_name\":\"nginx-ingress-controller\",\"controller_class\":\"nginx\",\"controller_namespace\":\"nginx-ingress-sample\",\"controller_pod\":\"nginx-ingress-1588245517-controller-56d5d786cd-xqwc2\",\"heritage\":\"Helm\",\"instance\":\"192.168.89.24:10254\",\"job\":\"kubernetes-service-endpoints\",\"kubernetes_node\":\"ip-192-168-73-163.eu-west-1.compute.internal\",\"nginx_ingress_controller_nginx_process_connections\":1,\"pod_name\":\"nginx-ingress-1588245517-controller-56d5d786cd-xqwc2\",\"prom_metric_type\":\"gauge\",\"release\":\"nginx-ingress-1588245517\",\"state\":\"active\"}"
            },
 ...
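
The @message value is itself a JSON document: the CloudWatchMetrics entry names the metric, its namespace, and the dimension sets, while the metric value and the dimension values sit next to it as top-level fields. The following Python sketch pulls these pieces apart. The function name `extract_metrics` is ours, and the message is an abbreviated copy of the one shown above.

```python
import json


def extract_metrics(message):
    """Return one record per (metric, dimension set) declared in an EMF-style log event."""
    event = json.loads(message)
    records = []
    for directive in event.get("CloudWatchMetrics", []):
        for metric in directive.get("Metrics", []):
            name = metric["Name"]
            for dimension_set in directive.get("Dimensions", []):
                records.append({
                    "namespace": directive["Namespace"],
                    "name": name,
                    # The value lives in a top-level field named after the metric.
                    "value": event.get(name),
                    "dimensions": {dim: event.get(dim) for dim in dimension_set},
                })
    return records


# Abbreviated copy of the @message value from the query result above.
message = json.dumps({
    "CloudWatchMetrics": [{
        "Metrics": [{"Name": "nginx_ingress_controller_nginx_process_connections"}],
        "Dimensions": [["ClusterName", "Namespace", "Service"]],
        "Namespace": "ContainerInsights/Prometheus",
    }],
    "ClusterName": "cw-prom",
    "Namespace": "nginx-ingress-sample",
    "Service": "nginx-ingress-1588245517-controller-metrics",
    "nginx_ingress_controller_nginx_process_connections": 1,
    "prom_metric_type": "gauge",
    "state": "active",
})

for record in extract_metrics(message):
    print(record)
```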

Having seen how to scrape and use Prometheus metrics out of the box, in our example from NGINX, let’s now move on to the topic of how to use custom metrics.

Custom metrics from ASP.NET Core app

In the following setup we will instrument an ASP.NET Core application using a Prometheus client library, with the goal of exposing custom metrics and ingesting them into CloudWatch. We will do this using the CloudWatch Prometheus agent with a custom configuration.

Instrumenting app to expose custom metrics

First, clone the sample application from aws-samples/amazon-cloudwatch-prometheus-metrics-sample and have a look at the HomeController.cs file:

// Prometheus metrics:
private static readonly Counter HomePageHitCounter = Metrics.CreateCounter("PrometheusDemo_HomePage_Hit_Count", "Count the number of hits to Home Page");

private static readonly Gauge SiteVisitorsCounter = Metrics.CreateGauge("PrometheusDemo_SiteVisitors_Gauge", "Site Visitors Gauge");

public IActionResult Index() {
    HomePageHitCounter.Inc();
    // rn is defined elsewhere in the controller (presumably a System.Random instance)
    SiteVisitorsCounter.Set(rn.Next(1, 15));
    return View();
}

As well as the ProductsController.cs file:

// Prometheus metric:
private static readonly Counter ProductsPageHitCounter = Metrics.CreateCounter("PrometheusDemo_ProductsPage_Hit_Count", "Count the number of hits to Products Page");

public IActionResult Index() {
    ProductsPageHitCounter.Inc();
    return View();
}

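The code snippets shown above define three metrics using an open source Prometheus client library: two counters that track hits on the Home and Products pages, and a gauge that stands in for the overall number of site visitors (in this demo it is set from rn.Next(1, 15) on every Home page hit).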

Next, for local testing and preview, navigate to the directory where the Dockerfile is located. Build the container image and run it using the following commands:

docker build . -t prometheusdemo
docker run -p 80:80 prometheusdemo

Now navigate to localhost where you should be able to see a screen like the one below. Click on the Home and Products links a few times to generate some traffic:

Next, navigate to http://localhost/metrics, where you should see all the Prometheus metrics the app is exposing via the /metrics endpoint:
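
The endpoint serves the Prometheus text exposition format: comment lines starting with # HELP and # TYPE, followed by sample lines of the form name value. To illustrate the format, here is a small Python sketch that parses such a response. The function name `parse_exposition` is ours, the sample text is hand-written with made-up values (it is not captured output from the app), and the parser ignores optional timestamps and assumes no "}" inside label values.

```python
def parse_exposition(text):
    """Parse Prometheus text exposition into {series: value}; HELP/TYPE comments are skipped."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Series with labels: keep everything up to the closing brace as the key.
            end = line.rindex("}") + 1
            series, rest = line[:end], line[end:].split()
        else:
            parts = line.split()
            series, rest = parts[0], parts[1:]
        values[series] = float(rest[0])
    return values


# Hand-written sample; the values are invented for illustration.
sample = """\
# HELP PrometheusDemo_HomePage_Hit_Count Count the number of hits to Home Page
# TYPE PrometheusDemo_HomePage_Hit_Count counter
PrometheusDemo_HomePage_Hit_Count 4
# HELP PrometheusDemo_SiteVisitors_Gauge Site Visitors Gauge
# TYPE PrometheusDemo_SiteVisitors_Gauge gauge
PrometheusDemo_SiteVisitors_Gauge 9
"""

print(parse_exposition(sample))
```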

Setting up CloudWatch agent for discovering Prometheus metrics

Open the prometheus-eks.yaml file in the kubernetes folder of the repo. The following configuration in the cwagentconfig.json section of the YAML file defines the metrics that the CloudWatch agent will collect from the application:

{
  "source_labels": ["job"],
  "label_matcher": "prometheusdemo-dotnet",
  "dimensions": [["ClusterName","Namespace"]],
  "metric_selectors": [
    "^process_cpu_seconds_total$",
    "^process_open_handles$",
    "^process_virtual_memory_bytes$",
    "^process_start_time_seconds$",
    "^process_private_memory_bytes$",
    "^process_working_set_bytes$",
    "^process_num_threads$"
  ]
},
{
  "source_labels": ["job"],
  "label_matcher": "^prometheusdemo-dotnet$",
  "dimensions": [["ClusterName","Namespace"]],
  "metric_selectors": [
    "^dotnet_total_memory_bytes$",
    "^dotnet_collection_count_total$",
    "^dotnet_gc_finalization_queue_length$",
    "^dotnet_jit_method_seconds_total$",
    "^dotnet_jit_method_total$",
    "^dotnet_threadpool_adjustments_total$",
    "^dotnet_threadpool_io_num_threads$",
    "^dotnet_threadpool_num_threads$",
    "^dotnet_gc_pinned_objects$"
  ]
},
{
  "source_labels": ["job"],
  "label_matcher": "^prometheusdemo-dotnet$",
  "dimensions": [["ClusterName","Namespace","gc_heap"]],
  "metric_selectors": [
    "^dotnet_gc_allocated_bytes_total$"
  ]
},
{
  "source_labels": ["job"],
  "label_matcher": "prometheusdemo-dotnet",
  "dimensions": [["ClusterName","Namespace","app"]],
  "metric_selectors": [
    "^PrometheusDemo_HomePage_Hit_Count$",
    "^PrometheusDemo_SiteVisitors_Gauge$",
    "^PrometheusDemo_ProductsPage_Hit_Count$"
  ]
}
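
Each of these declarations works as a filter: the label_matcher is checked against the values of the source_labels, and the metric_selectors pick the metric names that are published with the listed dimensions. The Python sketch below mimics that selection logic as we understand it. It assumes unanchored regular expression search and a ";" separator for joining multiple source labels; the function name `select_metrics` is hypothetical, and this is not the agent's code.

```python
import re


def select_metrics(declarations, labels, metric_names, separator=";"):
    """Map each selected metric name to the dimension sets it would be published with."""
    selected = {}
    for declaration in declarations:
        # Join the values of the source labels and test them against label_matcher.
        label_value = separator.join(labels.get(name, "") for name in declaration["source_labels"])
        if not re.search(declaration["label_matcher"], label_value):
            continue
        for metric_name in metric_names:
            if any(re.search(selector, metric_name) for selector in declaration["metric_selectors"]):
                selected.setdefault(metric_name, []).extend(declaration["dimensions"])
    return selected


# Two of the declarations from the configuration above, abbreviated.
declarations = [
    {
        "source_labels": ["job"],
        "label_matcher": "^prometheusdemo-dotnet$",
        "dimensions": [["ClusterName", "Namespace", "gc_heap"]],
        "metric_selectors": ["^dotnet_gc_allocated_bytes_total$"],
    },
    {
        "source_labels": ["job"],
        "label_matcher": "prometheusdemo-dotnet",
        "dimensions": [["ClusterName", "Namespace", "app"]],
        "metric_selectors": ["^PrometheusDemo_HomePage_Hit_Count$"],
    },
]

scraped = ["PrometheusDemo_HomePage_Hit_Count", "dotnet_gc_allocated_bytes_total", "http_requests_total"]
print(select_metrics(declarations, {"job": "prometheusdemo-dotnet"}, scraped))
```

Metrics that no declaration selects, such as the hypothetical http_requests_total here, do not become CloudWatch metrics under this model.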

In prometheus.yaml you will find the following section, instructing the CloudWatch agent about the Prometheus metric endpoint details, using the standard Prometheus configuration. Note that we have to make the regex for the address source label to match the endpoint and port number from which our sample application is exposing the metrics:

    - job_name: 'prometheusdemo-dotnet'
      sample_limit: 10000
      metrics_path: /metrics
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__address__]
        action: keep
        regex: '.*:80$'
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: Namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod_name
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_container_name
        target_label: container_name
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_controller_name
        target_label: pod_controller_name
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_controller_kind
        target_label: pod_controller_kind
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_phase
        target_label: pod_phase
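
To see what these relabel rules do to a discovered pod, here is a simplified Python model of the three actions used above (keep, labelmap, and replace). It is a sketch, not Prometheus code: it only handles the default replacement ($1), relies on the fact that Prometheus anchors relabel regexes at both ends, and drops labels starting with a double underscore at the end, as Prometheus does after relabeling. The function name `relabel` and the sample target are made up.

```python
import re


def relabel(labels, rules):
    """Apply keep/labelmap/replace rules; return the final labels, or None if the target is dropped."""
    labels = dict(labels)
    for rule in rules:
        action = rule.get("action", "replace")
        regex = re.compile(rule.get("regex", "(.*)"))
        source = ";".join(labels.get(name, "") for name in rule.get("source_labels", []))
        if action == "keep":
            # Relabel regexes are fully anchored, hence fullmatch.
            if not regex.fullmatch(source):
                return None
        elif action == "replace":
            match = regex.fullmatch(source)
            if match and match.group(1):
                labels[rule["target_label"]] = match.group(1)
        elif action == "labelmap":
            for name, value in list(labels.items()):
                match = regex.fullmatch(name)
                if match:
                    labels[match.group(1)] = value
    return {name: value for name, value in labels.items() if not name.startswith("__")}


# A subset of the rules from the configuration above.
rules = [
    {"source_labels": ["__address__"], "action": "keep", "regex": ".*:80$"},
    {"action": "labelmap", "regex": "__meta_kubernetes_pod_label_(.+)"},
    {"action": "replace", "source_labels": ["__meta_kubernetes_namespace"], "target_label": "Namespace"},
    {"action": "replace", "source_labels": ["__meta_kubernetes_pod_name"], "target_label": "pod_name"},
]

# Hypothetical discovered pod.
target = {
    "__address__": "192.168.1.10:80",
    "__meta_kubernetes_namespace": "default",
    "__meta_kubernetes_pod_name": "prometheusdemo-1",
    "__meta_kubernetes_pod_label_app": "prometheusdemo",
}
print(relabel(target, rules))
```

A pod exposing a different port (say 8080) is dropped by the keep rule, which is why the regex has to match the port your application actually serves metrics on.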

Deploying app and CloudWatch Prometheus agent

First, push the Docker container image of the ASP.NET Core app you built earlier to a container registry of your choice. Then, replace <YOUR_CONTAINER_REPO> in the deployment.yaml file with the value of the container repo you published the image to.

Now that we have everything configured, we deploy the sample application and the CloudWatch Prometheus agent into the Kubernetes cluster, using the following command:

kubectl apply -f kubernetes/

Note that now you can optionally enable CloudWatch Container Insights in the Kubernetes cluster, allowing you to see the infrastructure map and automatic dashboard in the AWS Management Console.

You can verify your setup as follows (you should see all the pods here in the running state):

$ kubectl -n amazon-cloudwatch get pods 
NAME                                  READY   STATUS    RESTARTS   AGE
cloudwatch-agent-785zq                1/1     Running   0          26h
cloudwatch-agent-tjxcj                1/1     Running   0          26h
cwagent-prometheus-75dfcd47d7-gtx58   1/1     Running   0          120m
fluentd-cloudwatch-7ttck              1/1     Running   0          26h
fluentd-cloudwatch-n2jvm              1/1     Running   0          26h
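
If you run this check from a script, for example in a CI job, a few lines of Python can flag pods that are not in the Running state. The sketch below parses the default tabular kubectl output; the function name `pods_not_running` is ours, and the Pending pod in the sample is invented to show a failing case.

```python
def pods_not_running(table_text):
    """Return the names of pods whose STATUS column is not 'Running'."""
    lines = [line for line in table_text.strip().splitlines() if line.strip()]
    header = lines[0].split()
    name_index = header.index("NAME")
    status_index = header.index("STATUS")
    failing = []
    for line in lines[1:]:
        columns = line.split()
        if columns[status_index] != "Running":
            failing.append(columns[name_index])
    return failing


# Output modeled on the listing above, with one made-up Pending pod added.
table = """\
NAME                                  READY   STATUS    RESTARTS   AGE
cloudwatch-agent-785zq                1/1     Running   0          26h
cwagent-prometheus-75dfcd47d7-gtx58   1/1     Running   0          120m
example-pending-pod                   0/1     Pending   0          5s
"""

print(pods_not_running(table))
```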

Creating custom dashboards

In order to create a custom dashboard called PrometheusDotNetApp in the us-east-2 AWS region, execute:

dashboardjson=$(<cloudwatch_dashboard.json)

aws cloudwatch put-dashboard \
        --dashboard-name PrometheusDotNetApp \
        --dashboard-body "$dashboardjson"

If you want the dashboard to be created in another region, replace us-east-2 in the JSON config file with your desired value.
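
Instead of editing the file by hand, you could rewrite the region programmatically. The Python sketch below assumes the usual CloudWatch dashboard body layout, in which each widget carries a region inside its properties; the function name `set_dashboard_region` and the tiny sample body are made up and much simpler than the cloudwatch_dashboard.json in the repo.

```python
import json


def set_dashboard_region(dashboard_body, region):
    """Return the dashboard body with every widget's region replaced by `region`."""
    dashboard = json.loads(dashboard_body)
    for widget in dashboard.get("widgets", []):
        properties = widget.get("properties", {})
        if "region" in properties:
            properties["region"] = region
    return json.dumps(dashboard)


# Hypothetical minimal dashboard body.
body = json.dumps({
    "widgets": [
        {
            "type": "metric",
            "properties": {
                "region": "us-east-2",
                "title": "Home page hits",
                "metrics": [["ContainerInsights/Prometheus", "PrometheusDemo_HomePage_Hit_Count"]],
            },
        },
        {"type": "text", "properties": {"markdown": "Demo"}},
    ]
})

updated = json.loads(set_dashboard_region(body, "eu-west-1"))
print(updated["widgets"][0]["properties"]["region"])
```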

Now you can navigate to the CloudWatch dashboard you just created and you should be able to see something similar to the following:

CloudWatch Container Insights publishes automatic dashboards created based on performance metrics from the Kubernetes cluster. Navigate to the Performance Monitoring page on CloudWatch Container Insights to see the automatic dashboard created by Container Insights:

With CloudWatch Container Insights enabled, you have access to a map view under Resources, showing you the Kubernetes cluster topology including its components:

With this we wrap up the custom metrics example scenario and have a quick look at what’s up next.

Next steps

We’re excited to open up the CloudWatch Container Insights support for Prometheus metrics as a public beta. We would like to hear from you how you’re using this new feature and what you expect to see going forward, for example PromQL support or native support for Prometheus histogram and summary metrics. Please share your experiences and keep an eye on this space: we keep improving and adding new features based on your feedback.

Imaya Kumar Jagannathan

Imaya is a Senior Solution Architect focused on Amazon CloudWatch and AWS X-Ray. He is passionate about Monitoring and Observability and has a strong application development and architecture background. He likes working on distributed systems and is excited to talk about microservice architecture design. He loves programming in C# and working with containers and serverless technologies.

Justin Gu

Justin Gu is a Senior Software Development Engineer for Amazon CloudWatch based in Vancouver, Canada. He enjoys designing and developing monitoring solutions to support massive metric ingestion, distributed systems/cloud computing, data visualization, log processing and analytics.

Marc Chéné

Marc is a Principal Product Manager focused on monitoring microservices and containers for modern application environments. Marc works with customers to understand, build trust, and deliver the best user experience in an agile way. Currently he is focused on delivering the best observability experience across time series data such as metrics, logs, and distributed tracing using CloudWatch and open source tooling such as Grafana and Prometheus.
