AWS Cloud Operations & Migrations Blog

Automating Amazon EC2 Instances Monitoring with Prometheus EC2 Service Discovery and AWS Distro for OpenTelemetry

Traditionally, scraping application Prometheus metrics required manual updates to a configuration file, posing challenges in dynamic AWS environments where Amazon EC2 instances are frequently created or terminated. This is not only time consuming but also introduces the risk of configuration errors and lacks the agility needed in dynamic environments.

In this blog post, we will demonstrate how Prometheus service discovery, particularly EC2 service discovery, can help overcome these challenges by providing the following benefits:

  • Automatic target discovery
  • Reduced manual effort and enhanced agility
  • Minimized configuration errors

We will showcase how to configure the AWS Distro for OpenTelemetry (ADOT) collector to perform EC2 service discovery in order to dynamically identify the EC2 targets for scraping Prometheus metrics. Subsequently, we will simulate a dynamic environment to showcase how EC2 service discovery automatically updates the list of targets to be scraped. We will collect the Prometheus metrics in an Amazon Managed Service for Prometheus workspace and visualize them using Amazon Managed Grafana.

Solution Overview

To showcase the dynamic discovery of EC2 instance targets using EC2 service discovery, we are going to provision the following resources through AWS CloudFormation:

  • An AWS Distro for OpenTelemetry (ADOT) collector running on an EC2 instance named ADOT_COLLECTOR to scrape Prometheus metrics.
  • Two Amazon EC2 instances named APP_SERVER, launched by an Amazon EC2 Auto Scaling group (ASG) named ApplicationASG. They are configured to run node_exporter to expose OS-level Prometheus metrics.
  • The ADOT collector is configured to dynamically identify these targets using EC2 service discovery and filter them based on tag-key=service_name and tag-value=node_exporter (see the example configuration after this list).
  • An Amazon Managed Service for Prometheus workspace and an Amazon Managed Grafana workspace.
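For reference, a minimal sketch of the scrape-level configuration described above might look like the following. The job name and region placeholder are illustrative, port 9100 and the instance_id relabeling follow the conventions used later in this post, and the actual sample template may differ:

# Minimal sketch: discover EC2 targets tagged service_name=node_exporter via EC2 service discovery
scrape_configs:
  - job_name: 'node_exporter'
    ec2_sd_configs:
      - region: <aws-region>
        port: 9100
        filters:
          - name: tag:service_name
            values:
              - node_exporter
    relabel_configs:
      - source_labels: [__meta_ec2_instance_id]
        target_label: instance_id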

Figure 1: Solution Architecture

Prerequisites

  1. Before starting, make sure you have AWS CloudShell, a browser-based shell, set up in your AWS account and Region to run the commands described in this blog post.
  2. (Optional) We will be configuring user access through AWS IAM Identity Center for Amazon Managed Grafana workspace. Make sure you have enabled IAM Identity Center in your AWS account.

Solution Walkthrough

To deploy the architecture shown in Figure 1, follow the steps below:

  1. From the AWS CloudShell command line interface, enter the below commands to clone the sample project from the aws-samples GitHub repository.
    git clone https://github.com/aws-samples/amazon-ec2-dynamic-monitoring-with-prometheus-service-discovery.git 
    cd amazon-ec2-dynamic-monitoring-with-prometheus-service-discovery/templates
  2. Next, to provision the resources, enter the following command. Replace the <aws-region> with your AWS Region name.
    AWS_REGION=<aws-region>
    aws cloudformation create-stack --stack-name adot-ec2-service-discovery-demo --template-body file://adot_ec2_service_discovery_cfn.yml --capabilities CAPABILITY_IAM --region $AWS_REGION
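  3. (Optional) Stack creation takes a few minutes. If you prefer to wait for it to finish from the command line before proceeding, you can run the following command; it returns once the stack reaches CREATE_COMPLETE.
    aws cloudformation wait stack-create-complete --stack-name adot-ec2-service-discovery-demo --region $AWS_REGION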

Setting up Amazon Managed Grafana Workspace

An Amazon Managed Grafana workspace has already been created by the AWS CloudFormation stack. Next, you need to set up the following two configurations on this workspace:

  • First, configure user access to the workspace through AWS IAM Identity Center (AWS SSO), which you enabled as a prerequisite. Figure 2 shows an example.

Figure 2: Example of Amazon Managed Grafana user access using AWS SSO

  • Next, follow these steps to configure Amazon Managed Service for Prometheus as a data source for this Amazon Managed Grafana workspace.

Figure 3: Configuring Amazon Managed Service for Prometheus as a data source for Amazon Managed Grafana

Visualizing Prometheus Metrics with Amazon Managed Grafana

Now, let’s visualize the Prometheus metrics that have been pushed by the ADOT collector to the Amazon Managed Service for Prometheus workspace.

Navigate to the Amazon Managed Grafana workspace in your AWS Management Console and choose the Workspace URL to sign in to your Grafana dashboard. As shown in Figure 4, we are visualizing the Prometheus metric node_cpu_seconds_total for all the EC2 target instances that were dynamically discovered by the ADOT collector using EC2 service discovery.

Figure 4: Visualizing Prometheus metrics of dynamically scraped targets

Additionally, you can visualize Prometheus metrics for individual EC2 instance targets by utilizing the instance_id label, as shown in Figure 5.

Figure 5: Visualizing Prometheus metrics of a specific scraped target

Simulating Dynamic EC2 Environment

To simulate a dynamic environment, we will increase the “Desired capacity” of the ApplicationASG Auto Scaling group. Currently, this ASG is configured with a minimum size of 2, a maximum size of 4, and a desired capacity of 2. We will adjust the Desired capacity value from 2 to 4 by following these steps:

  1. Navigate to AWS CloudShell console.
  2. Run the following AWS CLI command in the terminal:
    ASG_NAME=$(aws cloudformation describe-stacks --stack-name adot-ec2-service-discovery-demo --region $AWS_REGION --query 'Stacks[0].Outputs[?OutputKey==`ASG`].OutputValue' --output text)
    echo $ASG_NAME 
    aws autoscaling set-desired-capacity --auto-scaling-group-name $ASG_NAME --desired-capacity 4 --honor-cooldown --region $AWS_REGION
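  3. (Optional) To confirm that the scale-out launched new instances carrying the service_name=node_exporter tag that EC2 service discovery filters on, you can list the running instances with the following command. This is an illustrative check and is not required for the walkthrough.
    aws ec2 describe-instances --filters "Name=tag:service_name,Values=node_exporter" "Name=instance-state-name,Values=running" --query 'Reservations[].Instances[].InstanceId' --output text --region $AWS_REGION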

Wait 2-5 minutes for the ADOT collector to identify the new EC2 targets launched by the ASG service. Then, navigate to your Amazon Managed Grafana console to visualize the associated Prometheus metrics for these targets (see Figure 6).

Figure 6: Visualizing Prometheus metrics of newly launched targets

This showcases how the ADOT collector leverages EC2 service discovery to identify EC2 instances newly added during scale-out activities in the Auto Scaling group and seamlessly collects Prometheus metrics from them, facilitating real-time monitoring and scalability within dynamic environments.

Let’s delve into how the ADOT collector manages to automatically identify these newly launched targets:

  • The ADOT collector initiates a DescribeInstances API call, specifying filter parameters to search for instances tagged with service_name as the key and node_exporter as the value (see the illustrative CLI equivalent after this list).
  • The EC2 API responds with a filtered list of instances that meet the specified criteria. This updated list now includes the two recently launched instances from the ASG. The list is automatically refreshed based on the refresh_interval parameter.
  • The ADOT collector then scrapes the filtered targets to collect Prometheus metrics.
  • The retrieved Prometheus metrics are subsequently pushed to the desired destination, in this case Amazon Managed Service for Prometheus.
  • Amazon Managed Grafana then queries Prometheus metrics from Amazon Managed Service for Prometheus.

Figure 7: Flow diagram of how the ADOT collector performs EC2 service discovery

Design Considerations

Here are some key design aspects you should consider while configuring EC2 service discovery with the ADOT collector on Amazon EC2.

1. IAM Role Permissions

When deploying the ADOT collector in conjunction with EC2 service discovery, make sure the EC2 instance's IAM role is equipped with the ec2:DescribeInstances and ec2:DescribeAvailabilityZones permissions.
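For reference, a minimal policy granting these two permissions could look like the following CloudFormation snippet. The logical resource name ADOTCollectorEC2DiscoveryPolicy is illustrative and not part of the sample template.

# Illustrative IAM managed policy for the EC2 service discovery API calls
ADOTCollectorEC2DiscoveryPolicy:
  Type: AWS::IAM::ManagedPolicy
  Properties:
    PolicyDocument:
      Version: '2012-10-17'
      Statement:
        - Effect: Allow
          Action:
            - ec2:DescribeInstances
            - ec2:DescribeAvailabilityZones
          Resource: '*'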

2. DescribeInstances API Requests Limit

By default, the ADOT collector refreshes the list of EC2 instances every 60 seconds by calling the DescribeInstances API. You can configure the refresh_interval option to control how frequently the ADOT collector makes these API requests to update the list. An example of such a configuration is shown in the snippet below:

# EC2 service discovery with a refresh interval of 5 minutes
  - job_name: 'node_exporter'
    ec2_sd_configs:
      - region: eu-west-1
        refresh_interval: 5m

Refer to Request throttling for the Amazon EC2 API for more information.

3. Configuring EC2 Security Groups

By default, EC2 service discovery uses the EC2 instance's private IP address to scrape Prometheus metrics. For the ADOT collector to successfully scrape EC2 instances in a VPC, make sure the security group associated with your instances allows ingress traffic on the port used to scrape metrics. For instance, if your application exposes Prometheus metrics via TCP port 9100, make sure to allow ingress traffic specifically on this port within the security group settings, as sketched below.
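As a sketch, assuming your target instances use a security group referenced here as <target-sg-id> and the ADOT collector instance uses <collector-sg-id> (both placeholders), such a rule could be added with the AWS CLI:

# Illustrative: allow the collector's security group to reach node_exporter on TCP 9100
aws ec2 authorize-security-group-ingress \
  --group-id <target-sg-id> \
  --protocol tcp --port 9100 \
  --source-group <collector-sg-id> \
  --region $AWS_REGION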

4. Tagging Strategies to Discover EC2 Instances

Tagging is a crucial aspect of effectively utilizing Prometheus EC2 service discovery. Employ essential metadata tags like Application or Service Name, Environment Name, and Role or Function to streamline grouping and identification of instances. Additionally, implement hierarchical tags, such as tier or cluster to represent relationships and dependencies, facilitating organized monitoring.

These best practices enable selective and targeted discovery, ensuring efficient monitoring of EC2 instances in dynamic AWS environments; an illustrative filter combination is sketched below. Further insights can be found in the Tagging Best Practices whitepaper.
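The sketch below combines the service tag used in this walkthrough with a hypothetical Environment tag (tag key and value are illustrative). Filters in ec2_sd_configs are passed to DescribeInstances and combined with AND, so only instances matching both tags are discovered:

# Illustrative: discover only instances tagged service_name=node_exporter AND Environment=production
ec2_sd_configs:
  - region: <aws-region>
    port: 9100
    filters:
      - name: tag:service_name
        values:
          - node_exporter
      - name: tag:Environment
        values:
          - production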

5. Scaling ADOT Collector

Below are some strategies for scaling the ADOT collector running on EC2 instances when scraping a large number of targets:

  • Vertical Scaling: Initiate the scaling process by vertically expanding your ADOT Collector instance. This involves allocating more CPU and memory resources. You can accomplish this by modifying the EC2 instance type on which the ADOT collector runs.
  • Sharding by Availability Zone (AZ): In cases where you are scraping metrics from a vast array of EC2 instances spread across multiple Availability Zones (AZs) within a VPC, consider sharding the ADOT collector instances per AZ. This approach evenly distributes the workload across multiple ADOT collector instances. The snippet below is an example ADOT configuration to achieve this:
    # ADOT Collector configuration to scrape targets from a specific Availability Zone "ap-south-1a"
    ---
    ec2_sd_configs:
      - region: ap-south-1
        port: 9100
        filters:
          # DescribeInstances filter name; meta labels such as __meta_ec2_availability_zone are only usable in relabel_configs
          - name: availability-zone
            values:
              - ap-south-1a
    relabel_configs:
      - source_labels:
          - __meta_ec2_instance_id
        target_label: instance_id
  • Sharding by Metrics Type: Another sharding approach is based on the type of metrics you want to collect. For example, if you are running node_exporter to gather infrastructure-level metrics and jmx_exporter to collect application-level metrics, you can distribute the collection of these metrics using two ADOT collector instances. Likewise, you can shard them based on the environment or application. Here’s a snippet of ADOT configuration to achieve this:
    # Scraping targets running jmx exporter by filtering using tag key "application" and value "JMX"
    ---
    ec2_sd_configs:
      - region: ap-south-1
        port: 9999
        filters:
          - name: tag:application
            values:
              - JMX
    relabel_configs:
      - source_labels:
          - __meta_ec2_instance_id
        target_label: instance_id

Cleaning up

To decommission all the resources deployed during this walkthrough, navigate to the AWS CloudShell command line interface and run the command below.

aws cloudformation delete-stack --stack-name adot-ec2-service-discovery-demo --region $AWS_REGION 

Conclusion

In this blog post, we demonstrated how you can use EC2 service discovery with the AWS Distro for OpenTelemetry (ADOT) collector to automatically identify targets for scraping Prometheus metrics in dynamic EC2 environments. This significantly reduces the time spent manually maintaining the list of targets and also mitigates the risk of configuration errors.

We also highlighted key design considerations aimed at enhancing operational efficiency and ensuring a more reliable monitoring process while using EC2 service discovery with the ADOT collector. As a next step, we encourage you to try this solution and customize it for your specific use cases in managing Prometheus metric scraping with the ADOT collector in dynamic EC2 environments.

To learn more about AWS Observability services, please check the below resources:

About the author

Jay Joshi

Jay is a Sr. Cloud Support Engineer at AWS, specializing in Amazon CloudWatch and Route 53. He is passionate about assisting customers in enhancing their systems with Monitoring and Observability. During his free time, he loves to watch anime and spend time with his family. LinkedIn: /jayjoshi31