AWS Cloud Operations Blog

Centralized monitoring and alerting for AWS Systems Manager Agent status on managed nodes across AWS Organization

Has the AWS Systems Manager Agent (SSM Agent) on your critical on-premises servers or Amazon Elastic Compute Cloud (Amazon EC2) instances ever lost its connection to AWS Systems Manager (SSM), and did you wish you had been notified as soon as it happened? Do you want better observability of your SSM Agent status, with a dashboard to monitor it from? This blog post describes an automated mechanism to achieve both objectives.

This post shows you how to monitor the status of the SSM Agent on the critical managed nodes in your AWS Organization from a centralized Amazon CloudWatch dashboard. It also shows you how to configure Amazon CloudWatch alarms that send messages to an Amazon Simple Notification Service (Amazon SNS) topic that you define whenever the SSM Agent loses its connection to AWS Systems Manager. You can subscribe an email address or mobile phone number to the SNS topic to receive an alert whenever a CloudWatch alarm is activated. The critical managed nodes to monitor, which can be Amazon EC2 instances or on-premises nodes, are selected using specific tags that you have applied to them, e.g. env:prod or SSMMonitoring:true.

Solution Overview

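This solution is enabled by the following services: AWS Lambda, Amazon EventBridge, Amazon CloudWatch (metrics, alarms, and dashboards), Amazon SNS, AWS Systems Manager, AWS Identity and Access Management (IAM), and AWS CloudFormation StackSets.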

Figure 1: Solution Diagram

The solution uses an AWS Lambda function to check the health of the SSM Agent connection on your critical managed nodes across your AWS Organization, using the tags and regions you define, and to report a PingStatus metric for each node to a centralized Amazon CloudWatch dashboard. When a managed node has a healthy connection to Systems Manager, the DescribeInstanceInformation API reports its PingStatus as Online. The Lambda function publishes a PingStatus metric to Amazon CloudWatch whose value is 0 when the PingStatus is Online and 1 otherwise. The Lambda function also creates Amazon CloudWatch alarms for the critical managed nodes and configures them to send messages to your SNS topic when activated. An Amazon EventBridge rule invokes the Lambda function periodically, and you control how often by setting the schedule in that rule.
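The mapping from PingStatus to a metric value can be sketched in a few lines of Python. This is a minimal illustration and not the code shipped in the solution's Lambda function: the namespace Custom/SSMPingStatus, the InstanceId dimension, and both function names are assumptions made for the example. Only the 0/1 convention and the PingStatus metric name come from the description above.

```python
from datetime import datetime, timezone

# Assumption: the post does not state the namespace, so this one is hypothetical.
NAMESPACE = "Custom/SSMPingStatus"


def ping_status_to_value(ping_status: str) -> int:
    """Return 0 when the agent reports Online and 1 for any other status."""
    return 0 if ping_status == "Online" else 1


def build_metric_datum(instance_id: str, ping_status: str, timestamp=None) -> dict:
    """Build one entry for the MetricData list of CloudWatch PutMetricData.

    The InstanceId dimension name is an assumption made for this sketch.
    """
    return {
        "MetricName": "PingStatus",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Timestamp": timestamp or datetime.now(timezone.utc),
        "Value": ping_status_to_value(ping_status),
        "Unit": "Count",
    }
```

With boto3 available, the datum could then be published with boto3.client("cloudwatch").put_metric_data(Namespace=NAMESPACE, MetricData=[datum]).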

The architecture you'll create works as follows:

  1. An IAM role in each target account is assumed by the Lambda function.
  2. An Amazon EventBridge rule invokes the Lambda function on a schedule, e.g. every 15 minutes.
  3. The Lambda function checks the SSM Agent health status of your managed nodes that carry the tags you provide and creates a custom PingStatus metric in Amazon CloudWatch. The Lambda function also calls other CloudWatch APIs to configure a CloudWatch alarm on each PingStatus metric with the Amazon SNS topic as its target (if desired), and to create an Amazon CloudWatch dashboard.
  4. If a running instance with the tag does not appear in Systems Manager or has a PingStatus other than Online, the PingStatus for the managed node is reported as Missing and the metric value is set to 1. If the PingStatus is Online, the metric value is set to 0.
  5. Whenever the PingStatus metric for any of the managed nodes flips to 1, the alarm is activated and a notification is sent to the subscribers of your Amazon SNS topic.
  6. If an instance that was monitored by the solution is terminated or its monitoring tags are removed, the corresponding alarm is deleted and the CloudWatch dashboard is updated the next time the Lambda function is invoked.
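Steps 4 and 6 amount to two small set operations, which the following Python sketch illustrates. It is an assumption-laden outline and not the solution's actual code: the function names and the SSMPingStatus-<instance-id> alarm naming scheme are hypothetical, and the inputs stand in for data that would come from the EC2, Systems Manager, and CloudWatch APIs.

```python
def classify_nodes(tagged_running_ids, ssm_ping_status):
    """Map each tagged, running node to a (reported status, metric value) pair.

    tagged_running_ids: instance IDs that carry the monitoring tag.
    ssm_ping_status: dict of instance ID to the PingStatus string seen in
    Systems Manager; nodes unknown to Systems Manager are simply absent.
    """
    result = {}
    for instance_id in tagged_running_ids:
        if ssm_ping_status.get(instance_id) == "Online":
            result[instance_id] = ("Online", 0)
        else:
            # Not registered, or any status other than Online (step 4).
            result[instance_id] = ("Missing", 1)
    return result


def alarm_name(instance_id):
    """Hypothetical naming scheme for the per-node alarm."""
    return f"SSMPingStatus-{instance_id}"


def stale_alarms(existing_alarm_names, monitored_ids):
    """Return alarms whose node is no longer monitored (step 6)."""
    wanted = {alarm_name(instance_id) for instance_id in monitored_ids}
    return sorted(set(existing_alarm_names) - wanted)
```

The names returned by stale_alarms could be passed to the CloudWatch DeleteAlarms API on each invocation, so that alarms for terminated or untagged nodes disappear without manual cleanup.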

Prerequisites

For this walkthrough, you should have the following:

  • An AWS Account or list of AWS Accounts or AWS Organization
  • AWS SSM Managed nodes – on-premises or on Amazon EC2
  • Tags applied to the managed nodes e.g. SSMMonitoring:true
  • An Amazon SNS topic with subscribers in the central dashboard region. The subscribers can be email addresses, SMS numbers, etc.

Walkthrough

There are two CloudFormation templates you will deploy for this solution:

  1. Create an IAM role in all the accounts in your AWS Organization or in specific AWS accounts. These IAM roles will be assumed by the SSMPingStatus Lambda function that is created in the next step.
  2. Deploy the SSMPingStatus monitoring solution by launching a CloudFormation stack in your desired central dashboard region and account using the provided CloudFormation template. This template creates the required components: the AWS Lambda function, CloudWatch alarms (optional), the Amazon EventBridge rule, and the Amazon CloudWatch dashboard.

Step 1: Deploy IAM Role using CloudFormation template and CloudFormation StackSets.

  1. Download the CloudFormation template.
  2. Navigate to the AWS CloudFormation console in the Organization management account or CloudFormation delegated administrator.
  3. From the navigation pane, choose StackSets.
  4. At the top right of the StackSets page, choose Create StackSet.
  5. Under Prerequisite – Prepare template, choose Template is ready.
  6. Under Specify template, select Upload a template file, choose Choose file, select the file you downloaded in step 1, and choose Next.
  7. On the Specify StackSet details page, perform the following steps:
    1. Set the StackSet name and Description to SSMPingStatus-IAMRole.
    2. Under Parameters, for CentralAccount, enter the account ID of the monitoring account where the solution will be hosted.
    3. For CrossAccountExecutionRoleName, leave the default value AWS-Lambda-SSMPingStatus-Cross-Account-Role or enter a custom name for the IAM role to be assumed by Lambda from the central account.
    4. Choose Next.

    Figure 2: CloudFormation StackSet configuration parameters

  8. On the Configure StackSet options page, optionally add required tags, and then choose Next.
  9. On the Set deployment options page, under Specify regions, select the desired region, e.g. us-east-1. Because IAM resources are global, you only need to specify one region. Then choose Next.
  10. Review all the information. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names. Next, choose Submit to submit your stack configuration.
  11. After refreshing the page, the status of your StackSet should be Running. When the status changes to Succeeded, proceed to the next section. You can view the outcome of the individual Stack instances under the Stack instances tab of the CloudFormation StackSet console as shown in Figure 3.

NB: According to the documentation, CloudFormation StackSets doesn’t deploy stack instances to the organization’s management account, even if the management account is in your organization or in an OU in your organization. Hence, if you want to include the management account or delegated administrator account as a target for this monitoring solution, you need to deploy the IAM role template (SSMAgent_IAM_role.yml) as a standalone stack in that account.

Figure 3: CloudFormation StackSet output resources (IAM Role) in target account(s)
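Once the role exists in each target account, the central Lambda function can assume it to query Systems Manager there. The following Python sketch shows one way this could look. The helper names are hypothetical, the default role name is the template default mentioned above, and the session name is an arbitrary choice. boto3 is imported lazily so that the ARN helper works without the AWS SDK installed.

```python
DEFAULT_ROLE_NAME = "AWS-Lambda-SSMPingStatus-Cross-Account-Role"


def cross_account_role_arn(account_id, role_name=DEFAULT_ROLE_NAME, partition="aws"):
    """Build the ARN of the role deployed by the StackSet in a target account."""
    if not (account_id.isascii() and account_id.isdigit() and len(account_id) == 12):
        raise ValueError(f"AWS account IDs are 12 digits, got {account_id!r}")
    return f"arn:{partition}:iam::{account_id}:role/{role_name}"


def ssm_client_for(account_id, region, role_name=DEFAULT_ROLE_NAME):
    """Assume the cross-account role and return an SSM client for one region."""
    import boto3  # imported here so the helper above needs no AWS SDK

    credentials = boto3.client("sts").assume_role(
        RoleArn=cross_account_role_arn(account_id, role_name),
        RoleSessionName="SSMPingStatus",  # arbitrary session name for this sketch
    )["Credentials"]
    return boto3.client(
        "ssm",
        region_name=region,
        aws_access_key_id=credentials["AccessKeyId"],
        aws_secret_access_key=credentials["SecretAccessKey"],
        aws_session_token=credentials["SessionToken"],
    )
```

The assume_role call only succeeds if the role's trust policy names the central account, which is what the CentralAccount parameter of the StackSet is for.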

Step 2: Deploy SSMPingStatus Solution using CloudFormation template

  1. Download the CloudFormation template.
  2. Navigate to the CloudFormation console in the AWS account where you monitor the SSM agent status.
  3. Choose Create stack, and then choose With new resources (standard).
  4. For Template source, choose Upload a template file. Choose Choose file and select the template you downloaded in step 1.
  5. Choose Next.
  6. Enter the stack name: SSMPingStatus.
  7. Under Parameters, provide the parameters to the AWS CloudFormation stack:
    1. For Target, select AWS Organization if targeting all accounts in your AWS Organization, or Accounts for specific accounts. NB: If your Target parameter is AWS Organization, this stack should be deployed in either the management account or a delegated administrator account for an AWS service. However, if your Target is Accounts, you can launch this solution in any account in your AWS Organization.
    2. (Optional) For TargetAccounts, if Accounts was selected in 7.1 above, provide the list of 12-digit account IDs hosting your managed nodes to be monitored, using comma-separated values, e.g. 111111111111,222222222222,333333333333. Otherwise, leave blank.
    3. For TargetRegionIds, enter a list of the region(s) hosting your managed nodes to be monitored using comma separated values e.g. us-east-1,us-east-2,eu-west-2.
    4. For Tag, enter the key:value pair of the tag for the specific managed nodes to be monitored in CloudWatch e.g. SSMMonitoring:true.
    5. For EventBridgeSchedule, enter the schedule for the monitoring solution in cron format, e.g. cron(0/15 * * * ? *) runs every 15 minutes. This schedule determines how often the Lambda function is invoked to track the status of your managed nodes. The time zone used is UTC. For more information, see Amazon EventBridge schedule.
    6. For CrossAccountExecutionRoleName, enter the name of the Lambda automation role created in all the target AWS account(s) in step 1 e.g. AWS-Lambda-SSMPingStatus-Cross-Account-Role.
    7. For CloudwatchCentralDashboardRegion, enter the name of the region where the Amazon CloudWatch Dashboard is to be created to track your managed nodes across accounts and regions e.g. us-east-1.
    8. For CreateCloudWatchAlarm, enter true if you want an alarm to be created for each monitored managed node, or false otherwise.
    9. (Optional) For SNSTopicArn, enter the Amazon Resource Name (ARN) of the SNS topic to be used as the target of the CloudWatch alarms if the CreateCloudWatchAlarm parameter is set to true.
    10. For RetainCloudwatchResourcesOnDelete, enter true if you want the CloudWatch alarms and dashboard to be retained when the stack is deleted, otherwise leave as false.

      Figure 4: CloudFormation template parameters

  8. On the Configure stack options page, apply tags if needed, and then choose Next.
  9. On the Review and create page, select I acknowledge that AWS CloudFormation might create IAM resources with custom names, and then choose Submit.
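Several of the parameters above are plain strings that the Lambda function has to split and validate before use. The following Python sketch shows one way to do that. The function names are hypothetical and the parsing rules are assumptions rather than the template's actual behavior. The filter shape returned by tag_filter follows the tag:<key> form accepted by the Systems Manager DescribeInstanceInformation API.

```python
def parse_csv(value):
    """Split a comma-separated parameter, tolerating spaces and empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def parse_tag(tag):
    """Split a 'key:value' Tag parameter such as 'SSMMonitoring:true'."""
    key, separator, value = tag.partition(":")
    if not separator or not key.strip() or not value.strip():
        raise ValueError(f"expected key:value, got {tag!r}")
    return key.strip(), value.strip()


def tag_filter(tag):
    """Build the Filters argument for DescribeInstanceInformation."""
    key, value = parse_tag(tag)
    return [{"Key": f"tag:{key}", "Values": [value]}]


def validate_account_ids(account_ids):
    """Reject anything that is not a 12-digit AWS account ID."""
    bad = [a for a in account_ids if not (a.isascii() and a.isdigit() and len(a) == 12)]
    if bad:
        raise ValueError(f"invalid account IDs: {bad}")
    return account_ids
```

For example, parse_csv("us-east-1, us-east-2,eu-west-2") yields three region names, and tag_filter("SSMMonitoring:true") yields a single filter on the tag key SSMMonitoring.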

After the template has deployed, choose Outputs and note the values of the following as shown in Figure 5:

  • AWSLambdaSSMPingStatusRoleName
  • EventBridgeRule
  • SSMPingStatusLambdaFunctionName

Figure 5: CloudFormation output resources (SSMPingStatus Solution)

Viewing Monitoring Dashboard

  1. Navigate to Amazon CloudWatch console.
  2. At the top left menu, select Dashboards.
  3. Under Custom Dashboards, select AWSOrganization-SSMAgentPingStatus.

Figure 6: Amazon CloudWatch Dashboard for managed nodes with specific tags

NB: Each of the four widgets in the screenshot above represents a managed node that is monitored using the tag provided when deploying the CloudFormation stack. If no managed nodes are found, no widgets are graphed.

You can also click one of the instance widgets to zoom in:

Figure 7: CloudWatch Dashboard showing specific managed node with PingStatus Values = 1.0
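A dashboard with one widget per managed node, like the one in Figures 6 and 7, can be described as a JSON document and passed to the CloudWatch PutDashboard API. The following Python sketch builds such a body. The namespace, the InstanceId dimension, the widget titles, and the four-per-row layout are assumptions made for illustration and are not taken from the solution's template.

```python
import json


def build_dashboard_body(nodes, metric_region, namespace="Custom/SSMPingStatus", period=900):
    """Return a dashboard body with one PingStatus widget per node.

    nodes: list of (account_id, node_region, instance_id) tuples.
    metric_region: region where the PingStatus metrics are stored.
    The namespace and dimension name are hypothetical.
    """
    widgets = []
    for index, (account_id, node_region, instance_id) in enumerate(nodes):
        widgets.append({
            "type": "metric",
            "x": (index % 4) * 6,   # four widgets per row on the 24-column grid
            "y": (index // 4) * 6,
            "width": 6,
            "height": 6,
            "properties": {
                "title": f"{instance_id} ({account_id}, {node_region})",
                "metrics": [[namespace, "PingStatus", "InstanceId", instance_id]],
                "region": metric_region,
                "stat": "Maximum",
                "period": period,
                "view": "timeSeries",
            },
        })
    return json.dumps({"widgets": widgets})
```

With boto3 available, the result could be published with boto3.client("cloudwatch", region_name=metric_region).put_dashboard(DashboardName="AWSOrganization-SSMAgentPingStatus", DashboardBody=body). The default period of 900 seconds is chosen here only to match a 15-minute schedule.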

Viewing CloudWatch Alarms Created

When the PingStatus metric for a managed node goes to 1.0, the CloudWatch alarm is activated and a notification is sent to the SNS topic subscribers (if alarm configuration is enabled in the CloudFormation template parameters). To simulate this, you can log on to the instance and stop the SSM Agent service, or alternatively shut down the managed instance, and then wait for the next invocation of the Lambda function by the EventBridge rule. To view the alarms:

  1. Navigate to Amazon CloudWatch console.
  2. At the top left menu, select Alarms.
  3. Choose In Alarm to see any of the instances currently in alarm state as shown in Figure 8.

Figure 8: CloudWatch Alarm dashboard showing specific managed node with activated Alarm

  4. An email notification is sent to the subscribers of the SNS topic provided in the solution, as shown in Figure 9.

Figure 9: Email notification when Alarm is activated
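For readers who want to see what such an alarm looks like in code, the following Python sketch builds the arguments for the CloudWatch PutMetricAlarm API. It is an illustration under stated assumptions, not the definition used by the solution: the alarm name, namespace, dimension, evaluation settings, and missing-data handling are all hypothetical choices. Only the trigger condition (PingStatus reaching 1) and the SNS target come from the post.

```python
def build_alarm_kwargs(instance_id, sns_topic_arn, namespace="Custom/SSMPingStatus", period=900):
    """Return keyword arguments for CloudWatch put_metric_alarm.

    The alarm fires when the PingStatus metric reaches 1 in a single period.
    """
    return {
        "AlarmName": f"SSMPingStatus-{instance_id}",  # hypothetical naming scheme
        "AlarmDescription": f"SSM Agent on {instance_id} is not Online",
        "Namespace": namespace,                        # hypothetical namespace
        "MetricName": "PingStatus",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": period,
        "EvaluationPeriods": 1,
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "missing",
        "AlarmActions": [sns_topic_arn],
    }
```

With boto3 available, the alarm could be created with boto3.client("cloudwatch").put_metric_alarm(**build_alarm_kwargs(instance_id, sns_topic_arn)). Using the Maximum statistic means a single data point of 1 within the period is enough to trigger the alarm.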

To remediate a server that is not reporting as a managed node or is reporting a ConnectionLost status, refer to this guide. Deploying the solution costs approximately $20-$25 per month for 50 managed instances.
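One way to sanity-check a monthly estimate for your own fleet is to add up the per-node CloudWatch charges. The following Python sketch does that arithmetic in whole cents. The unit prices are assumptions for illustration only (roughly $0.30 per custom metric, $0.10 per standard alarm, and $3.00 per dashboard per month) and are not taken from this post. Check the current Amazon CloudWatch pricing page for your region, and note that Lambda, EventBridge, and SNS charges as well as any free tier are ignored here.

```python
def monthly_cost_cents(nodes, metric_cents=30, alarm_cents=10, dashboard_cents=300, with_alarms=True):
    """Estimate the monthly CloudWatch cost in cents.

    All unit prices are illustrative assumptions; pass your own values.
    One custom metric per node, one optional alarm per node, one dashboard.
    """
    total = nodes * metric_cents + dashboard_cents
    if with_alarms:
        total += nodes * alarm_cents
    return total


def format_usd(cents):
    """Format a whole number of cents as a dollar amount."""
    return f"${cents // 100}.{cents % 100:02d}"


estimate = format_usd(monthly_cost_cents(50))  # "$23.00" under the assumed prices
```

Under these assumed prices, 50 nodes with alarms enabled come to $23.00 per month, and $18.00 without alarms.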

Clean up

To avoid incurring future charges, delete the resources by deleting the CloudFormation Stack and StackSets. To clean up the resources created by CloudFormation:

  1. Navigate to the AWS CloudFormation console in the Organization management account or CloudFormation delegated administrator that was used to create SSMPingStatus-IAMRole in child accounts.
    1. Choose StackSets and select the CloudFormation StackSet named SSMPingStatus-IAMRole.
    2. Delete the associated stacks from the StackSet using this guide.
    3. Delete the StackSet using this guide.
  2. Navigate to the monitoring account and delete the CloudFormation stack named SSMPingStatus that was used to create the solution resources in the monitoring account.
    1. Open the AWS CloudFormation console and in the navigation pane, choose Stacks.
    2. Choose the CloudFormation stack named SSMPingStatus, choose Delete, and choose Delete stack.

Conclusion

By deploying this solution and using an Amazon CloudWatch dashboard and CloudWatch alarms to monitor your SSM Agent health, you gain increased observability of the SSM Agent across the managed nodes in your AWS Organization. This enables a faster response to SSM Agent issues on critical servers in your fleet and reduces the overall downtime caused by SSM Agent failures. This solution could be further expanded to create incidents in AWS Systems Manager Incident Manager whenever a CloudWatch alarm created for the managed nodes is activated. Furthermore, an Amazon EventBridge rule could be used to monitor the CloudWatch alarms, with a custom remediation or playbook defined as the target of the rule.

About the authors:

Charles Adebayo

Charles Adebayo is a Cloud Support Engineer at the AWS Cape Town office. Charles works with global customers, helping them migrate, modernize, and streamline their centralized operations. Charles specializes in AWS Systems Manager, EC2 Windows, and migration services. Outside technology, Charles is an advanced pianist and enjoys playing for the orchestra.

Suhail Fouzan

Suhail Fouzan is a Cloud Support Engineer in AWS Premium Support specializing in Systems Manager (SSM), EC2, and migration services. His focus on SSM ensures streamlined and centralized system management for AWS customers. Outside work, Suhail likes to play cricket and spend time with family.