Automating the installation and configuration of Prometheus using Systems Manager documents

As organizations migrate workloads to the cloud, they want to ensure their teams spend more time on tasks that move the organization forward and less time managing infrastructure. Installing patches and configuring software is what AWS calls undifferentiated heavy lifting, or the hard IT work that doesn’t add value to the mission of the organization. Using manual processes to install or configure software can also introduce subtle problems that can be hard to debug and fix. In this blog post, I’m going to show you how you can use AWS Systems Manager documents to automatically install and configure Prometheus on your Amazon Elastic Compute Cloud (Amazon EC2) instances or on instances in your on-premises environment that you’ve configured for Systems Manager.

Prometheus is a popular open-source monitoring and alerting system that many customers run in their on-premises environments. Prometheus works by scraping metrics from instrumented applications and infrastructure. Engineers can then query the collected metrics and define alerting or recording rules on top of them. Many customers who are starting their cloud journey want to move their Prometheus workloads to the cloud.

In December 2020, AWS announced Amazon Managed Service for Prometheus (AMP) in preview. AMP is a fully managed environment for Prometheus that is tightly integrated with AWS Identity and Access Management (IAM) to control authentication and authorization. To configure an AMP workspace, see the Getting Started with Amazon Managed Service for Prometheus blog post.

By using Systems Manager Command documents, organizations can ensure that Prometheus (or any other software) is installed in exactly the same way every time. A Command document defines actions that Systems Manager runs on your behalf. Each action in a Command document is called a plugin. These plugins can run shell or PowerShell scripts on an EC2 instance, join an EC2 instance to a domain, and configure or run Docker on an EC2 instance, to name a few examples. When you run a Systems Manager Command document, you can define which instances it should run on, making it easy to manage fleets of servers at scale.
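To make the structure of a Command document more concrete, here is a minimal Python sketch that builds a document with a single aws:runShellScript plugin and, optionally, registers it with Systems Manager. The document name, the step name, and the echo command are placeholders I chose for illustration; this is not the document used later in this post.

```python
import json


def build_command_document(commands):
    """Return a schema 2.2 Command document with one shell-script plugin."""
    return {
        "schemaVersion": "2.2",
        "description": "Example Command document (illustrative only)",
        "mainSteps": [
            {
                "action": "aws:runShellScript",  # the plugin that runs the step
                "name": "runCommands",           # hypothetical step name
                "inputs": {"runCommand": list(commands)},
            }
        ],
    }


def register_document(name, document):
    """Create the document in Systems Manager (requires AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    ssm = boto3.client("ssm")
    return ssm.create_document(
        Name=name,
        Content=json.dumps(document),
        DocumentType="Command",
        DocumentFormat="JSON",
    )


if __name__ == "__main__":
    doc = build_command_document(["echo 'hello from Systems Manager'"])
    print(json.dumps(doc, indent=2))
    # register_document("example-command-document", doc)  # hypothetical name
```

Keeping the document builder separate from the API call makes it easy to review the generated JSON before anything is created in your account.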

AWS Systems Manager Parameter Store provides a hierarchical storage mechanism for application and system parameters. It removes data from code, making it easier to manage and maintain automation. In this blog post, I’ll be using AWS Systems Manager Parameter Store to store the Prometheus configuration I want to apply to all of my installations. By storing Prometheus configuration information in AWS Systems Manager Parameter Store, I can easily update the configuration of my Prometheus servers in the future, without needing to modify any code that is executed on the EC2 instances.
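As a rough sketch of this idea, the following Python code writes a configuration value to Parameter Store and reads it back. The hierarchical naming scheme (/stack-name/key) is an assumption I made for this example; the parameters created by the CloudFormation template in this post may be named differently.

```python
def parameter_name(stack_name, key):
    """Build a hierarchical name such as /prometheus-config/ScrapeConfig."""
    return "/" + "/".join(part.strip("/") for part in (stack_name, key))


def store_config(name, value):
    """Create or update a String parameter (requires AWS credentials)."""
    import boto3  # imported lazily so parameter_name works without boto3

    ssm = boto3.client("ssm")
    ssm.put_parameter(Name=name, Value=value, Type="String", Overwrite=True)


def load_config(name):
    """Return the current value of the parameter."""
    import boto3

    ssm = boto3.client("ssm")
    return ssm.get_parameter(Name=name)["Parameter"]["Value"]


if __name__ == "__main__":
    # Both arguments are illustrative placeholders.
    print(parameter_name("prometheus-config", "ScrapeConfig"))
```

Because the configuration lives outside the installation script, updating it is a single put_parameter call followed by a rerun of the Command document.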

Prerequisites

Before you begin, do the following:

Create an AMP workspace.
Create an IAM role for Amazon EC2 that includes the AmazonEC2RoleForSSM and AmazonPrometheusRemoteWriteAccess managed policies. AmazonEC2RoleForSSM allows AWS Systems Manager to run commands against the EC2 instance. AmazonPrometheusRemoteWriteAccess allows the EC2 instance to write Prometheus data to the AMP workspace.
Attach the IAM role to the EC2 instances where you want to install Prometheus.
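
If you prefer to script the IAM prerequisites, the following Python sketch shows one possible approach. The role name is a placeholder, and the managed policy ARNs are my best understanding of the ARNs for the two policies named above, so verify them in the IAM console before you rely on them.

```python
import json

# Assumed ARNs for the two managed policies named in the prerequisites.
MANAGED_POLICY_ARNS = [
    "arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM",
    "arn:aws:iam::aws:policy/AmazonPrometheusRemoteWriteAccess",
]


def ec2_trust_policy():
    """Trust policy that lets EC2 instances assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "ec2.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def create_instance_role(role_name):
    """Create the role and a matching instance profile (requires AWS credentials)."""
    import boto3  # imported lazily so the policy builder works without boto3

    iam = boto3.client("iam")
    iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(ec2_trust_policy()),
    )
    for arn in MANAGED_POLICY_ARNS:
        iam.attach_role_policy(RoleName=role_name, PolicyArn=arn)
    # EC2 attaches roles through an instance profile of the same name here.
    iam.create_instance_profile(InstanceProfileName=role_name)
    iam.add_role_to_instance_profile(
        InstanceProfileName=role_name, RoleName=role_name
    )


if __name__ == "__main__":
    print(json.dumps(ec2_trust_policy(), indent=2))
    # create_instance_role("prometheus-instance-role")  # hypothetical name
```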

Create the automation

I built an AWS CloudFormation template that creates three AWS Systems Manager Parameter Store values, an AWS Systems Manager Command document, an Amazon EventBridge rule, an AWS Lambda function, and the appropriate IAM permissions. This configuration allows you to install or upgrade Prometheus on a server manually or through automation (for example, when an instance is created as part of an EC2 Auto Scaling group).

Choose this link to set up your environment.

In Specify stack details, enter the AMP workspace ID to use as the location for remote writing Prometheus data. You can also modify the URLs used to install Prometheus and the Prometheus Node Exporter on EC2. See Figure 1.

In Specify stack details, prometheus-config is entered for the stack name. Under Parameters, the AMP workspace ID and installation URLs for Node Exporter and Linux installer for Prometheus are displayed.

Figure 1: Specify stack details page

After the stack is created, on the Resources tab, go to the three AWS Systems Manager Parameter Store values that were created. When I choose one of these values, I can view the configuration I want to apply across my Prometheus instances. In Figure 2, you can see the Prometheus scrape configuration, which defines how the Prometheus server will scrape an instrumented application.

The Prometheus scrape configuration stored in Parameter Store specifies a job_name of node and a remote write configuration to push data to the AMP workspace.

Figure 2: Details for Prometheus-config-ScrapeConfig

In this example, I have created a very simple scrape configuration for Prometheus, but you can modify this configuration as appropriate for your organization’s needs. This configuration sends metrics from the Prometheus server running on EC2 to my AMP workspace.
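
The exact contents of my scrape configuration appear only in the screenshot, so the following Python sketch renders a comparable configuration rather than reproducing it. The scrape interval, the Node Exporter target (localhost:9100, its default port), and the sigv4 block are assumptions; the workspace ID and Region are placeholders.

```python
def render_prometheus_config(workspace_id, region):
    """Render a minimal prometheus.yml that remote writes to an AMP workspace."""
    remote_write_url = (
        f"https://aps-workspaces.{region}.amazonaws.com"
        f"/workspaces/{workspace_id}/api/v1/remote_write"
    )
    return f"""\
global:
  scrape_interval: 15s  # assumed value

scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['localhost:9100']  # Node Exporter on the same instance

remote_write:
  - url: {remote_write_url}
    sigv4:
      region: {region}  # sign requests with the instance role credentials
"""


if __name__ == "__main__":
    # "ws-example" is a placeholder, not a real workspace ID.
    print(render_prometheus_config("ws-example", "us-east-1"))
```

The rendered text is what you would store in the Parameter Store value for the scrape configuration.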

The Systems Manager Command document created by the CloudFormation stack appears on the Documents page of the AWS Systems Manager console. Choose the Owned by me tab to view documents that are owned by your account. In Figure 3, you’ll see my document is named prometheus-config-prometheus-installer.

The Owned by me tab displays the Command document that was created by the CloudFormation stack.

Figure 3: Viewing the AWS Systems Manager Command document

This document is configured to download the Linux files for Prometheus and extract the tar.gz file. It then sets up a Prometheus user and copies files to the right file system location with the appropriate permissions. After Prometheus has been installed and configured, it configures the Prometheus service to start.
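
The steps above can be sketched as a list of shell commands generated from the package URL. This is my approximation of what such a document might run, not the contents of the actual document. The file system paths are assumptions, and the sketch assumes a systemd unit file for Prometheus has already been written (in this post, that content comes from the Service Config parameter).

```python
import posixpath
from urllib.parse import urlparse


def build_install_commands(package_url):
    """Return shell commands that install Prometheus from a .tar.gz URL."""
    archive = posixpath.basename(urlparse(package_url).path)
    if not archive.endswith(".tar.gz"):
        raise ValueError(f"expected a .tar.gz package, got {archive!r}")
    extracted = archive[: -len(".tar.gz")]  # directory created by tar
    return [
        # Download and extract the release archive.
        f"curl -sSL -o /tmp/{archive} {package_url}",
        f"tar -xzf /tmp/{archive} -C /tmp",
        # Create a dedicated service user if it does not exist yet.
        "id prometheus >/dev/null 2>&1 || "
        "useradd --no-create-home --shell /bin/false prometheus",
        # Copy binaries into place and fix ownership (assumed paths).
        "mkdir -p /etc/prometheus /var/lib/prometheus",
        f"cp /tmp/{extracted}/prometheus /tmp/{extracted}/promtool /usr/local/bin/",
        "chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus",
        # Assumes /etc/systemd/system/prometheus.service already exists.
        "systemctl daemon-reload",
        "systemctl enable --now prometheus",
    ]


if __name__ == "__main__":
    # example.com stands in for the real download location.
    url = "https://example.com/prometheus-2.26.0.linux-amd64.tar.gz"
    print("\n".join(build_install_commands(url)))
```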

Launch the automation manually

To apply this automation manually to one or more EC2 instances, choose the Command document (in my case, prometheus-config-prometheus-installer). Figure 4 shows the details page, where I can see the commands that will be executed.

The details of the AWS Systems Manager Command document include the description (Installs and configures Prometheus on Linux instances), platform, created date, owner, status, and more.

Figure 4: Details page for prometheus-config-prometheus-installer

Choose Run command to configure how to execute this automation against EC2.

Figure 5 shows the Run a command page, where the Command document is selected. In Command parameters, I see the values that are passed to the Command document. The Service Config, Node Exporter Service Config, and the Scrape Config parameter values refer to the Parameter Store values that were created by the CloudFormation stack. The Node Exporter Package Url and Package Url parameters refer to the installation URL for the Prometheus Node Exporter and the URL for Prometheus, respectively. You can use these parameter values as-is, or you can change them to point to different tar.gz files.

The Run a command page displays parameters for Node Exporter Package Url, Service Config, Package Url, Node Exporter Service Config, and Scrape Config.

Figure 5: Command parameters

Figure 6 shows the Targets page, where I can specify which EC2 instances to execute this automation against. I can specify a set of EC2 instances using tags, choose instances manually, or choose a resource group. For this example, I select Choose instances manually. I have a single EC2 instance that I want to run as my Prometheus server (which has the tag Name:prometheus). The IAM role I mentioned in the Prerequisites section is attached to this instance.

Under Targets, Choose instances manually is selected. Under Instances, an instance named prometheus is selected. The Command document will be run on this instance.

Figure 6: Selecting the instances on which to run the Command document

To execute the automation, I choose Run.
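
The same run can be started from code. The following Python sketch builds a send_command request that targets instances by tag instead of selecting them manually. The parameter key in the usage example is hypothetical, because the console shows only display names; check the document's Parameters section for the real keys.

```python
def build_send_command_request(document_name, tag_key, tag_value, parameters=None):
    """Build keyword arguments for the Systems Manager send_command call."""
    request = {
        "DocumentName": document_name,
        # Target every managed instance that carries this tag.
        "Targets": [{"Key": f"tag:{tag_key}", "Values": [tag_value]}],
    }
    if parameters:
        # Run Command expects each parameter value as a list of strings.
        request["Parameters"] = {key: [value] for key, value in parameters.items()}
    return request


def run_document(request):
    """Send the command and return its ID (requires AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    ssm = boto3.client("ssm")
    return ssm.send_command(**request)["Command"]["CommandId"]


if __name__ == "__main__":
    request = build_send_command_request(
        "prometheus-config-prometheus-installer",
        "Name",
        "prometheus",
        # "PackageUrl" is a hypothetical parameter key.
        parameters={"PackageUrl": "https://example.com/prometheus.tar.gz"},
    )
    print(request)
    # print(run_document(request))
```

If you omit the parameters, the document's default values are used, which matches accepting the prefilled values in the console.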

The command starts to run on the instance I specified. After a few moments, the status changes from In Progress to Success, as shown in Figure 7.

The Run Command details show that the command was successfully executed against the selected EC2 instance. The page also displays Command description and Command parameters sections.

Figure 7: Running the command on the selected EC2 instance

To verify that the automation to install and configure Prometheus has run successfully, I can view the Run Command output or navigate to my configured instance. After browsing to the DNS name of the instance on port 9090, I can view the Prometheus console. See Figure 8. Prometheus was successfully configured through repeatable automation!

The Prometheus query page shows statistics for the node_cpu_seconds_total metric.

Figure 8: Prometheus is successfully running on the new EC2 instance
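
You can also run this check from a script by calling the Prometheus HTTP API. The following Python sketch runs an instant query against the server. The host name is a placeholder, and the sketch assumes port 9090 is reachable from wherever you run it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def query_url(host, promql, port=9090):
    """Build the instant-query URL for a Prometheus server."""
    return f"http://{host}:{port}/api/v1/query?{urlencode({'query': promql})}"


def instant_query(host, promql, timeout=5):
    """Run the query and return the decoded JSON response."""
    with urlopen(query_url(host, promql), timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Replace the placeholder host with the DNS name of your instance.
    print(query_url("prometheus.example.internal", "node_cpu_seconds_total"))
    # result = instant_query("prometheus.example.internal", "up")
    # print(result["status"])
```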

Under Status, choose Configuration to view the Prometheus configuration. You’ll see that this server is remote writing Prometheus metrics to the AMP workspace. To view metrics from the AMP workspace using Grafana, see the Getting Started with Amazon Managed Service for Grafana blog post.

Launch the automation automatically

It’s common to run Prometheus in an Amazon EC2 Auto Scaling group to ensure high availability. To create an Auto Scaling group, you must first define the launch template, which defines the parameters required to launch an EC2 instance. I created a launch template, Prometheus-Launch-Template, in which I specified the AMI, instance type, security group, and IAM instance profile to use when launching a new EC2 instance. The IAM instance profile I specified uses the same IAM role I mentioned in the Prerequisites section. This IAM instance profile ensures the EC2 instance has the right permissions to support automation and can write metrics to the AMP workspace.

After you create the launch template, you can create an Auto Scaling group. The end goal is to create an Amazon EC2 Auto Scaling group with a lifecycle hook. Lifecycle hooks allow you to perform configuration on newly launched instances so that they are fully configured when the instance exits the Pending state. Lifecycle hooks generate events. Amazon EventBridge can receive these events and direct them to a new target, like an AWS Lambda function. As part of the CloudFormation template executed earlier, I created an Amazon EventBridge rule that receives instance launch notifications from an Auto Scaling group named AutoScale-prometheus-config. These events are sent to an AWS Lambda function (in my example, prometheus-config-AutoInstall-Prometheus) that simply automates the same calls to the AWS Systems Manager Run Command that were used to run the Prometheus installation manually in the previous section. See Figure 9.

The Prometheus event rule in the EventBridge console shows that aws.autoscaling events that have a lifecycle transition value of autoscaling:EC2_INSTANCE_LAUNCHING and belong to an Auto Scaling group named AutoScale-prometheus-config will be sent to a Lambda function for processing.

Figure 9: EventBridge rule to enable automatic installation of Prometheus on new EC2 instances
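
An event pattern along these lines could be expressed in Python as follows. The source, lifecycle transition, and group name come from the rule described above; the detail-type value is my assumption about how the rule is written. The matches function is a deliberately simplified stand-in for EventBridge matching, included only so the pattern can be tried locally.

```python
import json


def build_event_pattern(group_name):
    """Event pattern for instance-launch lifecycle actions of one group."""
    return {
        "source": ["aws.autoscaling"],
        "detail-type": ["EC2 Instance-launch Lifecycle Action"],  # assumed
        "detail": {
            "AutoScalingGroupName": [group_name],
            "LifecycleTransition": ["autoscaling:EC2_INSTANCE_LAUNCHING"],
        },
    }


def matches(pattern, event):
    """Simplified matcher: each listed field must equal an allowed value."""
    for key, expected in pattern.items():
        actual = event.get(key)
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


def create_rule(rule_name, pattern):
    """Create the EventBridge rule (requires AWS credentials)."""
    import boto3  # imported lazily so the helpers above work without boto3

    events = boto3.client("events")
    return events.put_rule(
        Name=rule_name, EventPattern=json.dumps(pattern), State="ENABLED"
    )


if __name__ == "__main__":
    print(json.dumps(build_event_pattern("AutoScale-prometheus-config"), indent=2))
```

A rule also needs a target (the Lambda function) and permission for EventBridge to invoke it; the CloudFormation template takes care of both.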

The AWS Lambda function receives the Auto Scaling group lifecycle event, executes the AWS Systems Manager Command document that was created by the CloudFormation template, and finally calls complete-lifecycle-action. This API call tells the Auto Scaling group that the lifecycle hook has been completed and the EC2 instance is ready for service.

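import os
import time

import boto3
from botocore.exceptions import ClientError

autoscaling = boto3.client('autoscaling')
ssm = boto3.client('ssm')


def send_lifecycle_action(event, result):
    """Tell the Auto Scaling group to CONTINUE or ABANDON the launch."""
    detail = event['detail']
    try:
        autoscaling.complete_lifecycle_action(
            LifecycleHookName=detail['LifecycleHookName'],
            AutoScalingGroupName=detail['AutoScalingGroupName'],
            LifecycleActionToken=detail['LifecycleActionToken'],
            LifecycleActionResult=result,
            InstanceId=detail['EC2InstanceId'],
        )
        print(f'AutoScaling lifecycle hook completed with result {result}')
    except ClientError as error:
        print(f'Error completing lifecycle action: {error}')


def run_command(event):
    """Run the Command document on the new instance; return True on success."""
    doc_name = os.environ['DOCUMENT_NAME']
    ec2_instance = event['detail']['EC2InstanceId']

    # The instance may not be registered with Systems Manager yet, so retry
    # send_command with an increasing delay.
    command_id = None
    for attempt in range(1, 11):
        time.sleep(5 * attempt)
        try:
            response = ssm.send_command(
                InstanceIds=[ec2_instance],
                DocumentName=doc_name,
            )
            command_id = response['Command']['CommandId']
            break
        except ClientError as error:
            print(f'Error calling send_command ({error}). Retrying...')

    if command_id is None:
        print('send_command did not succeed after 10 attempts')
        return False

    # Poll until the command reaches a terminal status.
    for attempt in range(1, 21):
        time.sleep(5 * attempt)
        try:
            result = ssm.get_command_invocation(
                CommandId=command_id,
                InstanceId=ec2_instance,
            )
        except ClientError as error:
            # The invocation may not be visible right after send_command.
            print(f'Error calling get_command_invocation ({error}). Retrying...')
            continue
        status = result['Status']
        if status == 'Success':
            print('RunCommand completed successfully!')
            return True
        if status in ('Cancelled', 'TimedOut', 'Failed'):
            print(f'RunCommand ended with status {status}')
            return False

    print('Timed out waiting for RunCommand to finish')
    return False


def lambda_handler(event, context):
    # Only let the instance enter service if the installation succeeded.
    succeeded = run_command(event)
    send_lifecycle_action(event, 'CONTINUE' if succeeded else 'ABANDON')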

To match the Amazon EventBridge rule, I named my Auto Scaling group AutoScale-prometheus-config and selected the launch template (Prometheus-Launch-Template) I created in the previous step. I set the Desired capacity and Minimum capacity values of the Auto Scaling group to 0. This prevents the Auto Scaling group from creating instances until I am ready to proceed.

Figure 10 shows the details page for the Auto Scaling group I created. On the Instance management tab, in the Lifecycle hooks section, I choose Create lifecycle hook. I enter a name for the lifecycle hook and then choose Create. This step sends Auto Scaling events to Amazon EventBridge, which in turn calls the AWS Lambda function. Prometheus is now configured to automatically install on any new EC2 instances created through this Auto Scaling group.

The lifecycle hook section of the AutoScale-prometheus-config Auto Scaling group. The hook has been configured for a lifecycle transition of autoscaling:EC2_INSTANCE_LAUNCHING, a default result of ABANDON, and a heartbeat timeout of 3600 seconds.

Figure 10: Lifecycle hooks of the EC2 Auto Scaling group
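
The same lifecycle hook can be created programmatically. The following Python sketch uses the transition, default result, and heartbeat timeout shown in Figure 10; the hook name in the usage example is a placeholder.

```python
def build_lifecycle_hook_request(group_name, hook_name):
    """Build keyword arguments for the Auto Scaling put_lifecycle_hook call."""
    return {
        "AutoScalingGroupName": group_name,
        "LifecycleHookName": hook_name,
        "LifecycleTransition": "autoscaling:EC2_INSTANCE_LAUNCHING",
        # If nothing completes the hook in time, abandon the launch.
        "DefaultResult": "ABANDON",
        "HeartbeatTimeout": 3600,  # seconds
    }


def create_lifecycle_hook(request):
    """Create or update the hook (requires AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    boto3.client("autoscaling").put_lifecycle_hook(**request)


if __name__ == "__main__":
    request = build_lifecycle_hook_request(
        "AutoScale-prometheus-config",
        "install-prometheus",  # hypothetical hook name
    )
    print(request)
    # create_lifecycle_hook(request)
```

With a default result of ABANDON, an instance whose installation never reports back is terminated instead of being placed in service without Prometheus.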

On the Details tab of the Auto Scaling group, in Group details, choose Edit. For Desired capacity, enter 1, and then choose Update. A new EC2 instance will appear in the EC2 Instances dashboard. After about a minute, the instance will begin to respond. Prometheus will be automatically installed through the lifecycle hook that I set up. If the instance is terminated for some reason, the Auto Scaling group will detect it and spin up a new instance to replace it. Just as before, the Auto Scaling group lifecycle hook will be executed, ensuring that Prometheus is always installed on any new instance created through this Auto Scaling group!

Cost considerations

In this blog post, I used AWS Systems Manager Parameter Store parameters and an AWS Systems Manager Command document that was executed through the AWS Systems Manager Run Command. There are no charges for using any of these AWS Systems Manager features. My main charges are for the EC2 instances I am running and the metrics I ingest and store with Amazon Managed Service for Prometheus. The AWS Lambda function that runs the installation for the Auto Scaling group is billed per invocation and duration, which is typically a small amount. For more information, see the AWS Systems Manager pricing page, the Amazon EC2 pricing page, and the Amazon Managed Service for Prometheus pricing page.

Conclusion

AWS Systems Manager provides a powerful set of tools to manage an organization’s compute instances at scale. In this blog post, I showed you how to run installation and configuration automation to ensure that Prometheus is installed and configured exactly the same way every time.

For more information about creating your own AWS Systems Manager documents, see Creating SSM documents in the AWS Systems Manager User Guide. For more information about Run Command, see Running commands from the console. For more information about Amazon Managed Service for Prometheus, see Getting started in the Amazon Managed Service for Prometheus User Guide.

AWS Cloud Operations & Migrations Blog