Containers
Monitoring your service mesh container environment using Amazon Managed Service for Prometheus
NOTICE: October 04, 2024 – This post no longer reflects the best guidance for configuring a service mesh with Amazon ECS and Amazon EKS, and its examples no longer work as shown. For workloads running on Amazon ECS, please refer to newer content on Amazon ECS Service Connect, and for workloads running on Amazon EKS, please refer to Amazon VPC Lattice.
——–
Observability is critical for any application and to understand system behavior and performance. It takes a lot of time and effort to detect and remediate performance slowdowns or disruptions. It’s even more challenging in a multi-tenant environment where numerous microservices are running and the processing of a request spans a handful of services. Service meshes handle the network communication between microservices.
It’s easy to feel lost when you’re troubleshooting such a complex environment. Not only do application traces play an important role in identifying performance issues, so do the metrics for the service mesh component. Prometheus is a widely used open source systems monitoring and alerting solution. It includes an alert manager to route, group, and silence alerts. It also provides a multi-dimensional data model with time series data identified by metric name and key-value pairs and a flexible query language for aggregating metrics across a high volume of labels. To instantly query, correlate, and visualize operational metrics collected by Prometheus, we use Grafana, a widely deployed visualization tool.
Although it’s easy to deploy Prometheus and Grafana, it can take a significant amount of engineering effort to scale across multiple servers and configure a highly available, scalable, and secure environment. You have to commit more time to optimize resources like memory and storage to control costs and improve query response times. Maintaining the environment on an ongoing basis, including patching and upgrading Prometheus servers, is a chore. With the help of Amazon Managed Service for Prometheus (AMP) and Amazon Managed Service for Grafana (AMG), you can operate a Prometheus service that can scale on demand without impacting the ability to monitor critical resources.
AMP is a Prometheus-compatible monitoring service for container infrastructure and application metrics for containers that makes it easy to securely monitor container environments at scale. With AMP, you can automatically scale the ingestion, storage, and querying of operational metrics as workloads scale up and down. AMG is a fully managed service that makes it easy for you to visualize and analyze operational data at scale.
Envoy metrics
Organizations are expanding their microservices footprint with the help of orchestration engines like Amazon Elastic Kubernetes Service (Amazon EKS) and Amazon Elastic Container Service (Amazon ECS). They often have applications owned by different teams running in different AWS accounts or VPCs and rely on a service mesh like AWS App Mesh to get consistent connectivity, observability, and security between applications. App Mesh is a managed service mesh that can be used with Amazon ECS, Amazon EKS, Amazon Elastic Compute Cloud (Amazon EC2) instances, and with self-managed Kubernetes on EC2. App Mesh makes it easy to connect applications running on different compute platforms into a common mesh. App Mesh uses the envoy proxy to standardize communication between applications, which provides end-to-end visibility and helps ensure high availability. Although AMP and AMG act as single source of truth by ingesting metrics from containers across the organization running on different compute platforms, it’s important to gather envoy metrics, too. Envoy is an integral part of App Mesh, which helps route traffic between different applications.
In this blog post, we will walk through the steps to ingest App Mesh envoy metrics in an Amazon EKS cluster to AMP and create a custom dashboard on AMG to monitor the health and performance of microservices.
As a part of implementing the solution, we:
- Create an AMP workspace.
- Install the App Mesh Controller for Kubernetes and inject the envoy container into the pods.
- Collecting Envoy metrics using Grafana Agent configured in the Amazon EKS cluster and write them to AMP.
- Create an AMG workspace and configure AMP as a data source.
- Create and configure the Grafana dashboard on AMG.
Prerequisites
You will need the following to complete the steps in this blog post:
- AWS CLI version 2
- eksctl
- kubectl
- jq
- helm
- An AMP workspace configured in your AWS account. For instructions, see Create a workspace in the AMP User Guide.
- AWS Single Sign-On (AWS SSO). For instructions, see Enable AWS SSO in the AWS Single Sign-On User Guide.
Create an EKS cluster
First, create an EKS cluster that will be enabled with App Mesh for running the sample application. The eksctl CLI tool will be used to deploy the cluster using the eks-cluster-config.yaml
file:
Execute the following command to create the EKS cluster:
This creates an EKS cluster named AMP-EKS-CLUSTER
and a service account named appmesh-controller
that will be used by the App Mesh controller for EKS.
Next, use the following commands to install the App Mesh controller.
First, get the Custom Resource Definitions (CRDs) in place:
Create an AMP workspace
The AMP workspace is used to ingest the Prometheus metrics collected from envoy. A workspace is a logical and isolated Prometheus server dedicated to Prometheus resources such as metrics. A workspace supports fine-grained access control for authorizing its management such as update, list, describe, and delete, and the ingestion and querying of metrics.
Next, as an optional step, create an interface VPC endpoint to securely access the managed service from resources deployed in your VPC. A public endpoint for AMP is also available. This ensures that data ingested by the managed service does not leave the VPC in your AWS account. You can use the AWS CLI as shown here. Replace the placeholder strings such as VPC_ID
, AWS_REGION
, and others with your values.
Scrape the metrics
AMP does not directly scrape operational metrics from containerized workloads in a Kubernetes cluster. You must deploy and manage a Prometheus server or an OpenTelemetry agent such as the AWS Distro for OpenTelemetry Collector or the Grafana Agent to perform this task. In this blog post, we walk you through the process of configuring the Grafana Agent to scrape the envoy metrics and analyze them using AMP and AMG.
Configure the Grafana Agent for AMP
The Grafana Agent is a lightweight alternative to running a full Prometheus server. It keeps the necessary parts for discovering and scraping Prometheus exporters and sending metrics to a Prometheus-compatible backend, which in this case is AMP, and removes subsystems such as the storage, query, and alerting engines. The Grafana Agent is 100 percent compatible with Prometheus metrics and uses the Prometheus service discovery, scraping, write-ahead log, and remote write mechanisms from the Prometheus project. The Grafana Agent also supports basic sharding across every node in your Amazon EKS cluster by only collecting metrics running on the same node as the Grafana Agent pod, removing the need to decide between one giant machine to collect all of your Prometheus metrics and sharding through multiple manually managed Prometheus configurations. The Grafana Agent also includes native support for AWS Signature Version 4 for AWS Identity and Access Management (IAM) authentication, which means you don’t need to run a sidecar Signature Version 4 proxy, which reduces complexity, memory, and CPU demand.
In this blog post, we walk through the steps to configure an IAM role to send Prometheus metrics to AMP. We install the Grafana Agent on the EKS cluster and forward metrics to AMP.
Configure permissions
The Grafana Agent scrapes operational metrics from containerized workloads running in the Amazon EKS cluster and sends them to AMP for long-term storage and subsequent querying by monitoring tools such as Grafana. Data sent to AMP must be signed with valid AWS credentials using the AWS Signature Version 4 algorithm to authenticate and authorize each client request for the managed service.
The Grafana Agent can be deployed to an Amazon EKS cluster to run under the identity of a Kubernetes service account. With IAM roles for service accounts (IRSA), you can associate an IAM role with a Kubernetes service account and thus provide IAM permissions to any pod that uses the service account. This follows the principle of least privilege by using IRSA to securely configure the Grafana Agent, which includes the AWS Signature Version 4 that helps ingest Prometheus metrics to AMP.
The agent-permissions-aks shell script can be used to execute the following actions. Replace the placeholder variable YOUR_EKS_CLUSTER_NAME with the name of your Amazon EKS cluster.
- Creates an IAM role named EKS-GrafanaAgent-AMP-ServiceAccount-Role with an IAM policy that has permissions to remote-write into an AMP workspace.
- Creates a Kubernetes service account named grafana-agent under the grafana-agent namespace that is associated with the IAM role.
- Creates a trust relationship between the IAM role and the OIDC provider hosted in your Amazon EKS cluster.
You need kubectl and eksctl CLI tools to run the script. They must be configured to access your Amazon EKS cluster.
Now create a manifest file, grafana-agent.yaml
, with the scrape configuration to extract envoy metrics and deploy the Grafana Agent. This example deploys a DaemonSet named grafana-agent
and a deployment (with one replica) named grafana-agent-deployment
. The grafana-agent
DaemonSet collects metrics from pods on the cluster. The grafana-agent-deployment
collects metrics from services that do not live on the cluster, such as the Amazon EKS control plane. At the time of this writing, this solution will not work for the AWS Fargate data plane due to the lack of support for DaemonSet.
After the grafana-agent
is deployed, it will collect the metrics and ingest them into the specified AMP workspace. Now deploy a sample application on the Amazon EKS cluster and start analyzing the metrics.
Deploy a sample application on the EKS cluster
To install an application and inject an envoy container, use the AppMesh controller for Kubernetes you created earlier. AWS App Mesh Controller for K8s manages App Mesh resources in your Kubernetes clusters. The controller is accompanied by CRDs that allow you to define App Mesh components, such as meshes and virtual nodes, using the Kubernetes API just as you define native Kubernetes objects, such as deployments and services. These custom resources map to App Mesh API objects that the controller manages for you. The controller watches these custom resources for changes and reflects those changes into the App Mesh API.
Create an AMG workspace
Creating an AMG workspace is straightforward. Follow the steps in the Getting Started with Amazon Managed Service for Grafana blog post. To grant users access to the dashboard, you must enable AWS SSO. After you create the workspace, you can assign access to the Grafana workspace to an individual user or a user group. By default, the user has a user type of viewer. Change the user type based on the user role. Add the AMP workspace as the data source and then start creating the dashboard.
In this example, the user name is grafana-admin
and the user type is Admin. Select the required data source. Review the configuration, and then choose Create workspace.
Configure the data source and custom dashboard
To configure AMP as a data source, in the Data sources section, choose Configure in Grafana, which will launch a Grafana workspace in the browser. You can also manually launch the Grafana workspace URL in the browser. Use the instructions in the Getting Started with Amazon Managed Service for Grafana blog post and specify the AMP workspace you created earlier as a data source.
After the data source is configured, import a custom dashboard to analyze the envoy metrics. Choose Import (shown below), and then enter the ID 11022
. This will import the Envoy Global dashboard so you can start analyzing the envoy metrics.
As you can see from the screenshots, you can view envoy metrics like downstream latency, connections, response code, and more. You can use the filters shown to drill down to the envoy metrics of a particular application.
Configure alerts on AMG
You can configure Grafana alerts when the metric increases beyond the intended threshold. With AMG, you can configure how often the alert must be evaluated in the dashboard and send the notification. Currently, AMG supports Amazon SNS, Opsgenie, Slack, PagerDuty, VictorOp notifier types. Before you create alert rules, you must create a notification channel.
In this example, configure Amazon SNS as a notification channel. The SNS topic must be prefixed with grafana
for notifications to be successfully published to the topic.
Use the following command to create an SNS topic named grafana-notification
and subscribe an email address.
Add a new notification channel from the Grafana dashboard.
Configure the new notification channel named grafana-notification
. For Type, use AWS SNS from the drop down. For Topic, use the ARN of the SNS topic you just created. For Auth provider, choose AWS SDK Default.
Now configure an alert if downstream latency exceeds five milliseconds in a one-minute period. In the dashboard, choose Downstream latency from the dropdown, and then choose Edit. On the Alert tab of the graph panel, configure how often the alert rule should be evaluated and the conditions that must be met for the alert to change state and initiate its notifications.
In the following configuration, an alert is created if the downstream latency exceeds the threshold and notification will be sent through the configured grafana-alert-notification
channel to the SNS topic.
Conclusion
In summary, we’ve deployed an application running on EKS, integrated the app with AppMesh, scraped the Prometheus metrics using Grafana Agent to ingest the metrics to AMP and then visualize them in AMG. Because AMP and AMG are fully managed, serverless monitoring services, you can spend your time on the applications that transform your business and leave the job of managing the monitoring environment to AWS. AMP can be used with any Prometheus-compatible collectors such as a Prometheus server, ADOT, and Grafana Agent to collect metrics from other container environments such as Amazon ECS, self-managed Kubernetes on AWS, or on-premises infrastructure.
Further reading
- Using Amazon Managed Service for Prometheus to monitor EC2 environments
- Setting up cross-region metrics collection for Amazon Managed Prometheus workspaces
- Configuring Grafana Agent for Amazon Managed Service for Prometheus
- Setting up Grafana on EC2 to query metrics from Amazon Managed Service for Prometheus
- Using Service Meshes in AWS