Monitoring your service mesh container environment using Amazon Managed Service for Prometheus

NOTICE: October 04, 2024 – This post no longer reflects the best guidance for configuring a service mesh with Amazon ECS and Amazon EKS, and its examples no longer work as shown. For workloads running on Amazon ECS, please refer to newer content on Amazon ECS Service Connect, and for workloads running on Amazon EKS, please refer to Amazon VPC Lattice.

——–

Observability is critical for any application and to understand system behavior and performance. It takes a lot of time and effort to detect and remediate performance slowdowns or disruptions. It’s even more challenging in a multi-tenant environment where numerous microservices are running and the processing of a request spans a handful of services. Service meshes handle the network communication between microservices.

It’s easy to feel lost when you’re troubleshooting such a complex environment. Not only do application traces play an important role in identifying performance issues, so do the metrics for the service mesh component. Prometheus is a widely used open source systems monitoring and alerting solution. It includes an alert manager to route, group, and silence alerts. It also provides a multi-dimensional data model with time series data identified by metric name and key-value pairs and a flexible query language for aggregating metrics across a high volume of labels. To instantly query, correlate, and visualize operational metrics collected by Prometheus, we use Grafana, a widely deployed visualization tool.

Although it’s easy to deploy Prometheus and Grafana, it can take a significant amount of engineering effort to scale across multiple servers and configure a highly available, scalable, and secure environment. You have to commit more time to optimize resources like memory and storage to control costs and improve query response times. Maintaining the environment on an ongoing basis, including patching and upgrading Prometheus servers, is a chore. With the help of Amazon Managed Service for Prometheus (AMP) and Amazon Managed Service for Grafana (AMG), you can operate a Prometheus service that can scale on demand without impacting the ability to monitor critical resources.

AMP is a Prometheus-compatible monitoring service for container infrastructure and application metrics for containers that makes it easy to securely monitor container environments at scale. With AMP, you can automatically scale the ingestion, storage, and querying of operational metrics as workloads scale up and down. AMG is a fully managed service that makes it easy for you to visualize and analyze operational data at scale.

Envoy metrics

Organizations are expanding their microservices footprint with the help of orchestration engines like Amazon Elastic Kubernetes Service (Amazon EKS) and Amazon Elastic Container Service (Amazon ECS). They often have applications owned by different teams running in different AWS accounts or VPCs and rely on a service mesh like AWS App Mesh to get consistent connectivity, observability, and security between applications. App Mesh is a managed service mesh that can be used with Amazon ECS, Amazon EKS, Amazon Elastic Compute Cloud (Amazon EC2) instances, and with self-managed Kubernetes on EC2. App Mesh makes it easy to connect applications running on different compute platforms into a common mesh. App Mesh uses the envoy proxy to standardize communication between applications, which provides end-to-end visibility and helps ensure high availability. Although AMP and AMG act as single source of truth by ingesting metrics from containers across the organization running on different compute platforms, it’s important to gather envoy metrics, too. Envoy is an integral part of App Mesh, which helps route traffic between different applications.

In this blog post, we will walk through the steps to ingest App Mesh envoy metrics in an Amazon EKS cluster to AMP and create a custom dashboard on AMG to monitor the health and performance of microservices.

As a part of implementing the solution, we:

Create an AMP workspace.
Install the App Mesh Controller for Kubernetes and inject the envoy container into the pods.
Collecting Envoy metrics using Grafana Agent configured in the Amazon EKS cluster and write them to AMP.
Create an AMG workspace and configure AMP as a data source.
Create and configure the Grafana dashboard on AMG.

Prerequisites

You will need the following to complete the steps in this blog post:

AWS CLI version 2
eksctl
kubectl
jq
helm
An AMP workspace configured in your AWS account. For instructions, see Create a workspace in the AMP User Guide.
AWS Single Sign-On (AWS SSO). For instructions, see Enable AWS SSO in the AWS Single Sign-On User Guide.

Create an EKS cluster

First, create an EKS cluster that will be enabled with App Mesh for running the sample application. The eksctl CLI tool will be used to deploy the cluster using the eks-cluster-config.yaml file:

export AMP_EKS_CLUSTER=AMP-EKS-CLUSTER
export AMP_ACCOUNT_ID=<Your Account id>
export AWS_REGION=<Your Region>


cat << EOF > eks-cluster-config.yaml
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: $AMP_EKS_CLUSTER
  region: $AWS_REGION
  version: '1.18'
iam:
  withOIDC: true
  serviceAccounts:
  - metadata:
      name: appmesh-controller
      namespace: appmesh-system
      labels: {aws-usage: "application"}
    attachPolicyARNs:
    - "arn:aws:iam::aws:policy/AWSAppMeshFullAccess"
managedNodeGroups:
- name: default-ng
  minSize: 1
  maxSize: 3
  desiredCapacity: 2
  labels: {role: mngworker}
  iam:
    withAddonPolicies:
      certManager: true
      cloudWatch: true
      appMesh: true
cloudWatch:
  clusterLogging:
    enableTypes: ["*"]
EOF

Execute the following command to create the EKS cluster:

eksctl create cluster -f eks-cluster-config.yaml

This creates an EKS cluster named AMP-EKS-CLUSTER and a service account named appmesh-controller that will be used by the App Mesh controller for EKS.

Next, use the following commands to install the App Mesh controller.

First, get the Custom Resource Definitions (CRDs) in place:

helm repo add eks https://aws.github.io/eks-charts
helm upgrade -i appmesh-controller eks/appmesh-controller \
--namespace appmesh-system \
--set region=${AWS_REGION} \
--set serviceAccount.create=false \
--set serviceAccount.name=appmesh-controller

Create an AMP workspace

The AMP workspace is used to ingest the Prometheus metrics collected from envoy. A workspace is a logical and isolated Prometheus server dedicated to Prometheus resources such as metrics. A workspace supports fine-grained access control for authorizing its management such as update, list, describe, and delete, and the ingestion and querying of metrics.

aws amp create-workspace --alias AMP-APPMESH --region $AWS_REGION

Next, as an optional step, create an interface VPC endpoint to securely access the managed service from resources deployed in your VPC. A public endpoint for AMP is also available. This ensures that data ingested by the managed service does not leave the VPC in your AWS account. You can use the AWS CLI as shown here. Replace the placeholder strings such as VPC_ID, AWS_REGION, and others with your values.

export VPC_ID=<Your EKS Cluster VPC Id>

aws ec2 create-vpc-endpoint \
--vpc-id $VPC_ID \
--service-name com.amazonaws.<$AWS_REGION>.aps-workspaces \
--security-group-ids <SECURITY_GROUP_IDS> \
--vpc-endpoint-type Interface \
--subnet-ids <SUBNET_IDS>

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add kube-state-metrics https://kubernetes.github.io/kube-state-metrics

Scrape the metrics

AMP does not directly scrape operational metrics from containerized workloads in a Kubernetes cluster. You must deploy and manage a Prometheus server or an OpenTelemetry agent such as the AWS Distro for OpenTelemetry Collector or the Grafana Agent to perform this task. In this blog post, we walk you through the process of configuring the Grafana Agent to scrape the envoy metrics and analyze them using AMP and AMG.

Configure the Grafana Agent for AMP

The Grafana Agent is a lightweight alternative to running a full Prometheus server. It keeps the necessary parts for discovering and scraping Prometheus exporters and sending metrics to a Prometheus-compatible backend, which in this case is AMP, and removes subsystems such as the storage, query, and alerting engines. The Grafana Agent is 100 percent compatible with Prometheus metrics and uses the Prometheus service discovery, scraping, write-ahead log, and remote write mechanisms from the Prometheus project. The Grafana Agent also supports basic sharding across every node in your Amazon EKS cluster by only collecting metrics running on the same node as the Grafana Agent pod, removing the need to decide between one giant machine to collect all of your Prometheus metrics and sharding through multiple manually managed Prometheus configurations. The Grafana Agent also includes native support for AWS Signature Version 4 for AWS Identity and Access Management (IAM) authentication, which means you don’t need to run a sidecar Signature Version 4 proxy, which reduces complexity, memory, and CPU demand.

In this blog post, we walk through the steps to configure an IAM role to send Prometheus metrics to AMP. We install the Grafana Agent on the EKS cluster and forward metrics to AMP.

Configure permissions

The Grafana Agent scrapes operational metrics from containerized workloads running in the Amazon EKS cluster and sends them to AMP for long-term storage and subsequent querying by monitoring tools such as Grafana. Data sent to AMP must be signed with valid AWS credentials using the AWS Signature Version 4 algorithm to authenticate and authorize each client request for the managed service.

The Grafana Agent can be deployed to an Amazon EKS cluster to run under the identity of a Kubernetes service account. With IAM roles for service accounts (IRSA), you can associate an IAM role with a Kubernetes service account and thus provide IAM permissions to any pod that uses the service account. This follows the principle of least privilege by using IRSA to securely configure the Grafana Agent, which includes the AWS Signature Version 4 that helps ingest Prometheus metrics to AMP.

The agent-permissions-aks shell script can be used to execute the following actions. Replace the placeholder variable YOUR_EKS_CLUSTER_NAME with the name of your Amazon EKS cluster.

Creates an IAM role named EKS-GrafanaAgent-AMP-ServiceAccount-Role with an IAM policy that has permissions to remote-write into an AMP workspace.
Creates a Kubernetes service account named grafana-agent under the grafana-agent namespace that is associated with the IAM role.
Creates a trust relationship between the IAM role and the OIDC provider hosted in your Amazon EKS cluster.

You need kubectl and eksctl CLI tools to run the script. They must be configured to access your Amazon EKS cluster.

kubectl create namespace grafana-agent
export WORKSPACE=$(aws amp list-workspaces | jq -r '.workspaces[] | select(.alias=="AMP-APPMESH").workspaceId')
export ROLE_ARN=$(aws iam get-role --role-name EKS-GrafanaAgent-AMP-ServiceAccount-Role --query Role.Arn --output text)
export REGION=$AWS_REGION
export NAMESPACE="grafana-agent"
export REMOTE_WRITE_URL="https://aps-workspaces.$REGION.amazonaws.com/workspaces/$WORKSPACE/api/v1/remote_write"

Now create a manifest file, grafana-agent.yaml, with the scrape configuration to extract envoy metrics and deploy the Grafana Agent. This example deploys a DaemonSet named grafana-agent and a deployment (with one replica) named grafana-agent-deployment. The grafana-agent DaemonSet collects metrics from pods on the cluster. The grafana-agent-deployment collects metrics from services that do not live on the cluster, such as the Amazon EKS control plane. At the time of this writing, this solution will not work for the AWS Fargate data plane due to the lack of support for DaemonSet.

cat > grafana-agent.yaml <<EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: ${ROLE_ARN}
  name: grafana-agent
  namespace: ${NAMESPACE}
---
apiVersion: v1
data:
  agent.yml: |
    prometheus:
        configs:
          - host_filter: true
            name: agent
            remote_write:
              - sigv4:
                    enabled: true
                    region: ${REGION}
                url: ${REMOTE_WRITE_URL}
            scrape_configs:
              - job_name: 'appmesh-envoy'
                metrics_path: /stats/prometheus
                kubernetes_sd_configs:
                - role: pod
                relabel_configs:
                - source_labels: [__meta_kubernetes_pod_container_name]
                  action: keep
                  regex: '^envoy$'
                - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
                  action: replace
                  regex: ([^:]+)(?::\d+)?;(\d+)
                  replacement: \${1}:9901
                  target_label: __address__
                - action: labelmap
                  regex: __meta_kubernetes_pod_label_(.+)
                - source_labels: [__meta_kubernetes_namespace]
                  action: replace
                  target_label: namespace
                - source_labels: ['app']
                  action: replace
                  target_label: service      
                - source_labels: [__meta_kubernetes_pod_name]
                  action: replace
                  target_label: kubernetes_pod_name          
              - bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
                job_name: kubernetes-pods
                kubernetes_sd_configs:
                  - role: pod
                relabel_configs:
                  - action: drop
                    regex: "false"
                    source_labels:
                      - __meta_kubernetes_pod_annotation_prometheus_io_scrape
                  - action: keep
                    regex: .*-metrics
                    source_labels:
                      - __meta_kubernetes_pod_container_port_name
                  - action: replace
                    regex: (https?)
                    replacement: \$1
                    source_labels:
                      - __meta_kubernetes_pod_annotation_prometheus_io_scheme
                    target_label: __scheme__
                  - action: replace
                    regex: (.+)
                    replacement: \$1
                    source_labels:
                      - __meta_kubernetes_pod_annotation_prometheus_io_path
                    target_label: __metrics_path__
                  - action: replace
                    regex: (.+?)(\:\d+)?;(\d+)
                    replacement: \$1:\$3
                    source_labels:
                      - __address__
                      - __meta_kubernetes_pod_annotation_prometheus_io_port
                    target_label: __address__
                  - action: drop
                    regex: ""
                    source_labels:
                      - __meta_kubernetes_pod_label_name
                  - action: replace
                    replacement: \$1
                    separator: /
                    source_labels:
                      - __meta_kubernetes_namespace
                      - __meta_kubernetes_pod_label_name
                    target_label: job
                  - action: replace
                    source_labels:
                      - __meta_kubernetes_namespace
                    target_label: namespace
                  - action: replace
                    source_labels:
                      - __meta_kubernetes_pod_name
                    target_label: pod
                  - action: replace
                    source_labels:
                      - __meta_kubernetes_pod_container_name
                    target_label: container
                  - action: replace
                    separator: ':'
                    source_labels:
                      - __meta_kubernetes_pod_name
                      - __meta_kubernetes_pod_container_name
                      - __meta_kubernetes_pod_container_port_name
                    target_label: instance
                  - action: labelmap
                    regex: __meta_kubernetes_pod_annotation_prometheus_io_param_(.+)
                    replacement: __param_\$1
                  - action: drop
                    regex: Succeeded|Failed
                    source_labels:
                      - __meta_kubernetes_pod_phase
                tls_config:
                    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
                    insecure_skip_verify: false
              - bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
                job_name: ${NAMESPACE}/kube-state-metrics
                kubernetes_sd_configs:
                  - namespaces:
                        names:
                          - ${NAMESPACE}
                    role: pod
                relabel_configs:
                  - action: keep
                    regex: kube-state-metrics
                    source_labels:
                      - __meta_kubernetes_pod_label_name
                  - action: replace
                    separator: ':'
                    source_labels:
                      - __meta_kubernetes_pod_name
                      - __meta_kubernetes_pod_container_name
                      - __meta_kubernetes_pod_container_port_name
                    target_label: instance
                tls_config:
                    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
                    insecure_skip_verify: false
              - bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
                job_name: ${NAMESPACE}/node-exporter
                kubernetes_sd_configs:
                  - namespaces:
                        names:
                          - ${NAMESPACE}
                    role: pod
                relabel_configs:
                  - action: keep
                    regex: node-exporter
                    source_labels:
                      - __meta_kubernetes_pod_label_name
                  - action: replace
                    source_labels:
                      - __meta_kubernetes_pod_node_name
                    target_label: instance
                  - action: replace
                    source_labels:
                      - __meta_kubernetes_namespace
                    target_label: namespace
                tls_config:
                    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
                    insecure_skip_verify: false
              - bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
                job_name: kube-system/kubelet
                kubernetes_sd_configs:
                  - role: node
                relabel_configs:
                  - replacement: kubernetes.default.svc.cluster.local:443
                    target_label: __address__
                  - replacement: https
                    target_label: __scheme__
                  - regex: (.+)
                    replacement: /api/v1/nodes/\${1}/proxy/metrics
                    source_labels:
                      - __meta_kubernetes_node_name
                    target_label: __metrics_path__
                tls_config:
                    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
                    insecure_skip_verify: false
              - bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
                job_name: kube-system/cadvisor
                kubernetes_sd_configs:
                  - role: node
                metric_relabel_configs:
                  - action: drop
                    regex: container_([a-z_]+);
                    source_labels:
                      - __name__
                      - image
                  - action: drop
                    regex: container_(network_tcp_usage_total|network_udp_usage_total|tasks_state|cpu_load_average_10s)
                    source_labels:
                      - __name__
                relabel_configs:
                  - replacement: kubernetes.default.svc.cluster.local:443
                    target_label: __address__
                  - regex: (.+)
                    replacement: /api/v1/nodes/\${1}/proxy/metrics/cadvisor
                    source_labels:
                      - __meta_kubernetes_node_name
                    target_label: __metrics_path__
                scheme: https
                tls_config:
                    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
                    insecure_skip_verify: false
        global:
            scrape_interval: 15s
        wal_directory: /var/lib/agent/data
    server:
        log_level: info
kind: ConfigMap
metadata:
  name: grafana-agent
  namespace: ${NAMESPACE}
---
apiVersion: v1
data:
  agent.yml: |
    prometheus:
        configs:
          - host_filter: false
            name: agent
            remote_write:
              - sigv4:
                    enabled: true
                    region: ${REGION}
                url: ${REMOTE_WRITE_URL}
            scrape_configs:
              - bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
                job_name: default/kubernetes
                kubernetes_sd_configs:
                  - role: endpoints
                metric_relabel_configs:
                  - action: drop
                    regex: apiserver_admission_controller_admission_latencies_seconds_.*
                    source_labels:
                      - __name__
                  - action: drop
                    regex: apiserver_admission_step_admission_latencies_seconds_.*
                    source_labels:
                      - __name__
                relabel_configs:
                  - action: keep
                    regex: apiserver
                    source_labels:
                      - __meta_kubernetes_service_label_component
                scheme: https
                tls_config:
                    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
                    insecure_skip_verify: false
                    server_name: kubernetes
        global:
            scrape_interval: 15s
        wal_directory: /var/lib/agent/data
    server:
        log_level: info
kind: ConfigMap
metadata:
  name: grafana-agent-deployment
  namespace: ${NAMESPACE}
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: grafana-agent
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs:
  - get
  - list
  - watch
- nonResourceURLs:
  - /metrics
  verbs:
  - get
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: grafana-agent
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: grafana-agent
subjects:
- kind: ServiceAccount
  name: grafana-agent
  namespace: ${NAMESPACE}
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: grafana-agent
  namespace: ${NAMESPACE}
spec:
  minReadySeconds: 10
  selector:
    matchLabels:
      name: grafana-agent
  template:
    metadata:
      labels:
        name: grafana-agent
    spec:
      containers:
      - args:
        - -config.file=/etc/agent/agent.yml
        - -prometheus.wal-directory=/tmp/agent/data
        command:
        - /bin/agent
        env:
        - name: HOSTNAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        image: grafana/agent:v0.11.0
        imagePullPolicy: IfNotPresent
        name: agent
        ports:
        - containerPort: 80
          name: http-metrics
        securityContext:
          privileged: true
          runAsUser: 0
        volumeMounts:
        - mountPath: /etc/agent
          name: grafana-agent
      serviceAccount: grafana-agent
      tolerations:
      - effect: NoSchedule
        operator: Exists
      volumes:
      - configMap:
          name: grafana-agent
        name: grafana-agent
  updateStrategy:
    type: RollingUpdate
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana-agent-deployment
  namespace: ${NAMESPACE}
spec:
  minReadySeconds: 10
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      name: grafana-agent-deployment
  template:
    metadata:
      labels:
        name: grafana-agent-deployment
    spec:
      containers:
      - args:
        - -config.file=/etc/agent/agent.yml
        - -prometheus.wal-directory=/tmp/agent/data
        command:
        - /bin/agent
        env:
        - name: HOSTNAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        image: grafana/agent:v0.11.0
        imagePullPolicy: IfNotPresent
        name: agent
        ports:
        - containerPort: 80
          name: http-metrics
        securityContext:
          privileged: true
          runAsUser: 0
        volumeMounts:
        - mountPath: /etc/agent
          name: grafana-agent-deployment
      serviceAccount: grafana-agent
      volumes:
      - configMap:
          name: grafana-agent-deployment
        name: grafana-agent-deployment
EOF

kubectl apply -f grafana-agent.yaml

After the grafana-agent is deployed, it will collect the metrics and ingest them into the specified AMP workspace. Now deploy a sample application on the Amazon EKS cluster and start analyzing the metrics.

Deploy a sample application on the EKS cluster

To install an application and inject an envoy container, use the AppMesh controller for Kubernetes you created earlier. AWS App Mesh Controller for K8s manages App Mesh resources in your Kubernetes clusters. The controller is accompanied by CRDs that allow you to define App Mesh components, such as meshes and virtual nodes, using the Kubernetes API just as you define native Kubernetes objects, such as deployments and services. These custom resources map to App Mesh API objects that the controller manages for you. The controller watches these custom resources for changes and reflects those changes into the App Mesh API.

## Install the base application
git clone https://github.com/aws/aws-app-mesh-examples.git
kubectl apply -f aws-app-mesh-examples/examples/apps/djapp/1_base_application
kubectl get all -n prod      ## check the pod status and make sure it is running

NAME                            READY   STATUS    RESTARTS   AGE
pod/dj-cb77484d7-gx9vk          1/1     Running   0          6m8s
pod/jazz-v1-6b6b6dd4fc-xxj9s    1/1     Running   0          6m8s
pod/metal-v1-584b9ccd88-kj7kf   1/1     Running   0          6m8s




## Now install the App Mesh controller and meshify the deployment

kubectl apply -f aws-app-mesh-examples/examples/apps/djapp/2_meshed_application/
kubectl rollout restart deployment -n prod dj jazz-v1 metal-v1
kubectl get all -n prod     ## Now we should see two containers running in each pod
 
NAME                        READY   STATUS    RESTARTS   AGE
dj-7948b69dff-z6djf         2/2     Running   0          57s
jazz-v1-7cdc4fc4fc-wzc5d    2/2     Running   0          57s
metal-v1-7f499bb988-qtx7k   2/2     Running   0          57s


## generate the traffic for 5 mins and we will visualize it in grafana
dj_pod=`kubectl get pod -n prod --no-headers -l app=dj -o jsonpath='{.items[*].metadata.name}'`

loop_counter=0
while [ $loop_counter -le 300 ] ; do kubectl exec -n prod -it $dj_pod  -c dj  -- curl jazz.prod.svc.cluster.local:9080 ; echo ; loop_counter=$[$loop_counter+1] ; done

Create an AMG workspace

Creating an AMG workspace is straightforward. Follow the steps in the Getting Started with Amazon Managed Service for Grafana blog post. To grant users access to the dashboard, you must enable AWS SSO. After you create the workspace, you can assign access to the Grafana workspace to an individual user or a user group. By default, the user has a user type of viewer. Change the user type based on the user role. Add the AMP workspace as the data source and then start creating the dashboard.

In this example, the user name is grafana-admin and the user type is Admin. Select the required data source. Review the configuration, and then choose Create workspace.

Configure the data source and custom dashboard

To configure AMP as a data source, in the Data sources section, choose Configure in Grafana, which will launch a Grafana workspace in the browser. You can also manually launch the Grafana workspace URL in the browser. Use the instructions in the Getting Started with Amazon Managed Service for Grafana blog post and specify the AMP workspace you created earlier as a data source.

After the data source is configured, import a custom dashboard to analyze the envoy metrics. Choose Import (shown below), and then enter the ID 11022. This will import the Envoy Global dashboard so you can start analyzing the envoy metrics.

As you can see from the screenshots, you can view envoy metrics like downstream latency, connections, response code, and more. You can use the filters shown to drill down to the envoy metrics of a particular application.

Configure alerts on AMG

You can configure Grafana alerts when the metric increases beyond the intended threshold. With AMG, you can configure how often the alert must be evaluated in the dashboard and send the notification. Currently, AMG supports Amazon SNS, Opsgenie, Slack, PagerDuty, VictorOp notifier types. Before you create alert rules, you must create a notification channel.

In this example, configure Amazon SNS as a notification channel. The SNS topic must be prefixed with grafana for notifications to be successfully published to the topic.

Use the following command to create an SNS topic named grafana-notification and subscribe an email address.

aws sns create-topic --name grafana-notification

aws sns subscribe --topic-arn arn:aws:sns:<region>:<account-id>:grafana-notification --protocol email --notification-endpoint <email-id>

Add a new notification channel from the Grafana dashboard.

Configure the new notification channel named grafana-notification. For Type, use AWS SNS from the drop down. For Topic, use the ARN of the SNS topic you just created. For Auth provider, choose AWS SDK Default.

Now configure an alert if downstream latency exceeds five milliseconds in a one-minute period. In the dashboard, choose Downstream latency from the dropdown, and then choose Edit. On the Alert tab of the graph panel, configure how often the alert rule should be evaluated and the conditions that must be met for the alert to change state and initiate its notifications.

In the following configuration, an alert is created if the downstream latency exceeds the threshold and notification will be sent through the configured grafana-alert-notification channel to the SNS topic.

Conclusion

In summary, we’ve deployed an application running on EKS, integrated the app with AppMesh, scraped the Prometheus metrics using Grafana Agent to ingest the metrics to AMP and then visualize them in AMG. Because AMP and AMG are fully managed, serverless monitoring services, you can spend your time on the applications that transform your business and leave the job of managing the monitoring environment to AWS. AMP can be used with any Prometheus-compatible collectors such as a Prometheus server, ADOT, and Grafana Agent to collect metrics from other container environments such as Amazon ECS, self-managed Kubernetes on AWS, or on-premises infrastructure.

Containers