Detecting and remediating process issues on EC2 instances using Amazon CloudWatch and AWS Systems Manager

Customers want to have visibility into processes running inside their Amazon Elastic Compute Cloud (Amazon EC2) instances. Critical processes and services in these instances can crash unexpectedly and when they do, it’s crucial for customers to be notified so they can maintain continued business operations.

There are multiple ways to see if a service is running as expected. One way is to instrument the application code to send heartbeats to an observer process at certain intervals. Applications can also expose a liveness or readiness endpoint that can be polled by external monitoring systems to check if it is running and functioning properly. In cases where you don’t want to introduce more observability logic into your application, it’s a common practice to write operating system scripts to watch for the liveness of specific processes. But as an alternative to writing custom scripts, you can use the Amazon CloudWatch agent procstat plugin, which continuously watches specified processes and reports their metrics to Amazon CloudWatch. After the data is in Amazon CloudWatch, you can associate alarms to trigger actions like notifying teams or remediations like restarting the processes, resizing the instances, and so on.
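For comparison, the custom-script approach mentioned above could look like the following minimal sketch. It assumes a POSIX system and a daemon that writes its process ID to a pid file; the function name is_process_alive is made up for illustration and is not part of any AWS tooling.

```python
import os
from pathlib import Path


def is_process_alive(pid_file: str) -> bool:
    """Return True if pid_file exists and names a process that is running."""
    try:
        pid = int(Path(pid_file).read_text().strip())
    except (OSError, ValueError):
        # Missing or unreadable pid file, or non-numeric content:
        # treat the process as down.
        return False
    if pid <= 0:
        # Non-positive values address process groups, not a single process.
        return False
    try:
        # Signal 0 only performs the existence and permission checks;
        # nothing is delivered to the target process.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but is owned by another user.
        return True
    return True
```

Calling is_process_alive("/var/run/sshd.pid") from a scheduled job would give a basic liveness signal, but you would still have to ship the result somewhere and alert on it, which is the part the procstat plugin and CloudWatch alarms take over.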

In this blog post, I’ll show you how to use AWS Systems Manager Run Command on EC2 instances to restart the processes that the procstat plugin detected were down.

System architecture

By completing the steps in this post, you can create a system that uses the following architecture.

An EC2 instance with the Amazon CloudWatch agent installed pushes metrics data to Amazon CloudWatch.
The procstat plugin collects process metrics. As long as the process is running inside the instance, the plugin will continuously monitor the specified metrics.
You’ll add a CloudWatch alarm on one of these metrics and set its missing data policy to Treat missing data as bad (breaching threshold) to trigger an alarm condition.
When the process stops or crashes on the instance, the alarm goes into the In alarm state. A notification is sent to the specified Amazon Simple Notification Service (Amazon SNS) topic, which is consumed by an AWS Lambda function. The Lambda function extracts the hostname attribute from the payload and looks it up through the Amazon EC2 API to find its instance ID. This ID is required in the Run Command API call.
AWS Lambda then issues a Run Command API call against AWS Systems Manager to initiate the execution of a shell command (service sshd restart) on the instance, which restarts the stopped service.

The architecture for this system is shown in Figure 1:

In step 1, the metric data is pushed to CloudWatch. In step 2, the missing data triggers an alarm. In step 3, a Lambda function fetches the alarm. In step 4, a lookup is performed in EC2. In step 5, a Lambda function uses Run Command to restart the process on the EC2 instance.

Figure 1: Architecture showing EC2 instance pushing metric data to Amazon CloudWatch
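Steps 3 through 5 can be sketched in Python as follows. This is not the code shipped in the CloudFormation template; it is a minimal illustration under several assumptions: the alarm carries a host dimension whose value is the instance's private DNS name, the boto3 SDK is available in the Lambda runtime, and the helper names (extract_host, find_instance_id, restart_service) are made up for this example. The clients are passed in as arguments so the logic can be exercised without AWS access.

```python
import json


def extract_host(sns_event: dict) -> str:
    """Pull the 'host' dimension out of a CloudWatch alarm notification."""
    message = json.loads(sns_event["Records"][0]["Sns"]["Message"])
    for dim in message["Trigger"]["Dimensions"]:
        if dim.get("name") == "host":
            return dim["value"]
    raise KeyError("alarm notification has no 'host' dimension")


def find_instance_id(ec2, host: str) -> str:
    """Resolve a host name to an instance ID with DescribeInstances.

    Assumption: the host value equals the instance's private DNS name.
    """
    resp = ec2.describe_instances(
        Filters=[{"Name": "private-dns-name", "Values": [host]}]
    )
    for reservation in resp["Reservations"]:
        for instance in reservation["Instances"]:
            return instance["InstanceId"]
    raise LookupError(f"no instance found for host {host!r}")


def restart_service(ssm, instance_id: str,
                    command: str = "service sshd restart") -> str:
    """Send the restart command through Run Command; return the command ID."""
    resp = ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": [command]},
    )
    return resp["Command"]["CommandId"]


def lambda_handler(event, context, ec2=None, ssm=None):
    if ec2 is None or ssm is None:
        import boto3  # imported lazily so the helpers work without it
        ec2 = ec2 or boto3.client("ec2")
        ssm = ssm or boto3.client("ssm")
    host = extract_host(event)
    instance_id = find_instance_id(ec2, host)
    command_id = restart_service(ssm, instance_id)
    return {"instance_id": instance_id, "command_id": command_id}
```

The function's execution role would need permission for ec2:DescribeInstances and ssm:SendCommand for a sketch like this to work.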

Deploy the stack

To deploy the sample architecture in your AWS environment, use the AWS CloudFormation template included with this post.

The template creates the following resources in your account:

A VPC and private subnet.
An EC2 instance with the Amazon Linux 2 operating system.
Multiple VPC endpoints.
An Amazon SNS topic.
A Lambda function with a subscription to the SNS topic.
Other supporting resources, such as security groups and IAM roles.

To start your deployment, use the following link.

After deployment, the stack will provide you the EC2 instance ID in its outputs section. The AWS Systems Manager Agent (SSM Agent) is installed on Amazon Linux instances. The template you deployed performed the network and security configuration required for SSM Agent to communicate with AWS Systems Manager. This means you can use SSM Agent to install software on this instance. You can also install the SSM Agent manually using scripts or configuration management tools like Chef, Puppet, or Ansible.

Use Systems Manager to install the CloudWatch agent on the EC2 instance

Open the AWS Systems Manager console and from the left navigation pane, choose Run Command.
On the Commands page, choose Run command to add a command.
In Command document, type AWS-ConfigureAWSPackage to find and select the document.

Command document shows AWS-ConfigureAWSPackage in the list.

Figure 2: Command document page in the Systems Manager console

In Command parameters, for Name, enter AmazonCloudWatchAgent.

Command parameters displays an Action field set to Install, an Installation Type field set to Uninstall and reinstall, and a Name field where AmazonCloudWatchAgent is entered.

Figure 3: Command parameters

In Targets, select the Choose instances manually option, and then select your instance. In the real world, you might want to specify instance tags to select EC2 instances or resource groups.

There are two options for selecting targets: Specify instance tags and Choose instances manually. There is an instance displayed with an Instance state of running in the us-east-1a Availability Zone with a Ping status of Online.

Figure 4: Choose instances manually

If you want to export the command output to an S3 bucket, in Output options, select the Write command output to an S3 bucket box and enter an S3 destination bucket. Leave the other parameters at their defaults.
Choose Run. You can watch the command’s progress from the Commands page.

After the status transitions to Success, you can configure the CloudWatch agent.
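The same installation can also be requested through the Run Command API instead of the console. The following is a rough sketch, assuming boto3 and an instance ID taken from the stack outputs; build_install_request and install_package are illustrative names, and the instance ID in the usage note is a placeholder.

```python
def build_install_request(instance_ids, package="AmazonCloudWatchAgent"):
    """Build SendCommand arguments that mirror the console steps above."""
    return {
        "DocumentName": "AWS-ConfigureAWSPackage",
        "InstanceIds": list(instance_ids),
        "Parameters": {
            "action": ["Install"],  # matches the Action field in the console
            "name": [package],      # matches the Name field in the console
        },
    }


def install_package(ssm, instance_ids):
    """Send the command with a boto3 SSM client and return its command ID."""
    resp = ssm.send_command(**build_install_request(instance_ids))
    return resp["Command"]["CommandId"]
```

With credentials configured, install_package(boto3.client("ssm"), ["i-0123456789abcdef0"]) would start the installation, and the returned command ID can be looked up on the Commands page.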

Configure the CloudWatch agent

To configure the agent, connect to the EC2 instance. Because this instance is deployed in a private subnet, you must use Systems Manager to connect to it.

In the Amazon EC2 console, choose your EC2 instance, and then choose Connect.
On the Session Manager tab, choose Connect to establish the connection.

Connect to instance includes three tabs: EC2 Instance Connect, Session Manager, and SSH client. The Session Manager tab is selected and displays information about how sessions are secured, where session commands can be logged, and more.

Figure 5: Session Manager tab in the EC2 console

After the connection is successful, create a CloudWatch agent configuration file.

sudo vi /opt/aws/amazon-cloudwatch-agent/bin/config.json

Add the following configuration JSON and then save the file.

{
    "agent": {
        "run_as_user": "cwagent"
    },
    "metrics": {
        "metrics_collected": {
            "procstat": [
                {
                    "pid_file": "/var/run/sshd.pid",
                    "measurement": [
                        "cpu_usage",
                        "memory_rss"
                    ]
                }
            ]
        }
    }
}

This configuration enables the procstat plugin and tells it to monitor the sshd process identified by the sshd.pid file. The plugin will monitor the cpu_usage and memory_rss metrics of this process and send information to Amazon CloudWatch.
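If you need to watch more than one process, the procstat section accepts additional entries in the same array. As a hedged illustration, the following Python helper generates a configuration of the same shape for a list of pid files. The helper names are invented for this example, and any pid file path other than /var/run/sshd.pid would be your own assumption about where a daemon writes its pid.

```python
import json


def procstat_config(pid_files, measurements=("cpu_usage", "memory_rss")):
    """Return a CloudWatch agent config dict with one procstat entry per pid file."""
    return {
        "agent": {"run_as_user": "cwagent"},
        "metrics": {
            "metrics_collected": {
                "procstat": [
                    {"pid_file": path, "measurement": list(measurements)}
                    for path in pid_files
                ]
            }
        },
    }


def render_config(pid_files):
    """Serialize the config as indented JSON, ready to save as config.json."""
    return json.dumps(procstat_config(pid_files), indent=4)
```

Calling render_config(["/var/run/sshd.pid"]) produces JSON equivalent to the file shown above.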

Now use the following command to start the CloudWatch agent with its new configuration:

sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -s -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json

This command configures the amazon-cloudwatch-agent service on this machine and starts it with the new configuration. From then on, you can start, stop, and restart the service with systemctl commands.

Create a CloudWatch alarm

You can observe the newly created namespace and its metrics in the CloudWatch console.

From the left navigation pane, choose Metrics.
On the All metrics tab, in Custom namespaces, choose CWAgent to see CWAgent-specific metrics.

There are two metrics displayed: procstat_memory_rss and procstat_cpu_usage.

Figure 6: Metrics emitted by the CloudWatch agent

Select the checkbox for the procstat_memory_rss metric to observe its data on the metrics graph.
Choose the Graphed Metrics tab.

Graphed metrics tab is selected. Procstat_memory_rss is selected in the list.

Figure 7: Graphed metrics tab

Under Actions, choose the bell icon to open the Create alarm page.
In Metric, for Period, choose 1 minute. Leave the other parameters at their defaults.

The alarm metric parameters include metric name (procstat_memory_rss), pidfile (/var/run/sshd.pid), process_name (sshd), host, statistic (Average), and period (1 minute).

Figure 8: Alarm metric parameters

In Conditions, complete the fields as shown in Figure 9:

Threshold type is set to Static. The alarm condition is defined as Whenever procstat_memory_rss is Lower/Equal than 0. Datapoints to alarm is set to 1 out of 1. Missing data treatment is set to Treat missing data as bad (breaching threshold).

Figure 9: Alarm parameters

These settings tell CloudWatch to go into the In alarm state if the procstat_memory_rss metric value is lower than or equal to 0, or if missing data is detected. Note that the memory consumption of a running process is never zero or negative, so in practice this alarm tracks missing data. Whenever the sshd process is stopped or crashed, the operating system will delete its pid file. The CloudWatch agent will detect the deletion and will not emit any metric data as long as the file is missing. This will cause the CloudWatch alarm to go into the In alarm state.
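The alarm settings described here can also be expressed as arguments to the CloudWatch PutMetricAlarm API. The sketch below assumes boto3 and uses the dimensions, statistic, and period shown in the console; the host value and SNS topic ARN are placeholders you would replace with your own, and build_alarm_request is an illustrative name.

```python
def build_alarm_request(host, topic_arn, alarm_name="SshdAlarm"):
    """Build PutMetricAlarm arguments equivalent to the console settings."""
    return {
        "AlarmName": alarm_name,
        "Namespace": "CWAgent",
        "MetricName": "procstat_memory_rss",
        "Dimensions": [
            {"Name": "host", "Value": host},
            {"Name": "pidfile", "Value": "/var/run/sshd.pid"},
            {"Name": "process_name", "Value": "sshd"},
        ],
        "Statistic": "Average",
        "Period": 60,                 # 1 minute
        "EvaluationPeriods": 1,       # together with DatapointsToAlarm: 1 out of 1
        "DatapointsToAlarm": 1,
        "Threshold": 0.0,
        "ComparisonOperator": "LessThanOrEqualToThreshold",
        "TreatMissingData": "breaching",  # treat missing data as bad
        "AlarmActions": [topic_arn],      # notify the SNS topic on In alarm
    }
```

With a boto3 CloudWatch client, cloudwatch.put_metric_alarm(**build_alarm_request(host, topic_arn)) would create the alarm. The dimension values must match what the agent actually reports, so it is worth checking them in the Metrics view first.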

Choose Next to configure actions for this alarm.

Notification page includes a section for alarm state trigger (in this example, in alarm). Under Select an SNS topic, the notification will be sent to blog-topic.

Figure 10: Alarm notification parameters

In Notification, choose the SNS topic that was created by the CloudFormation template, and then choose Next.
Enter a name and optional description for the alarm, and then choose Next.

Under Alarm name, SshdAlarm is entered.

Figure 11: Alarm name

Review your alarm configuration, and then choose Create alarm. The status displayed for the alarm will be Insufficient Data. When the status changes to OK, continue to the next section.

Test your configuration

To test your configuration, use Systems Manager to reconnect to your EC2 instance and then issue the following command to make sure the sshd service is already running on the server:

sudo systemctl status sshd

sshd service is running on the instance.

Figure 12: sshd service is running on the instance

Use the following command to stop the service:

sudo systemctl stop sshd

sshd service is stopped on the instance.

Figure 13: sshd service is stopped on the instance

After the daemon is stopped, go to the CloudWatch console to observe the alarm condition. The alarm will go into the In alarm state in about five minutes.

On the Alarms page, the state of the SshdAlarm is In alarm.

Figure 14: SshdAlarm

When this happens, CloudWatch will notify the SNS topic, which will then trigger a Lambda function. The Lambda function will fetch the host name from the SNS message payload and do a lookup against the EC2 API to find the instance ID. It will use this ID to send the Run Command API call. While the Lambda function is executing this operation, logs will be created. At this point, the sshd service comes back up and the agent will start emitting its metrics again. The arrival of metrics will transition the alarm state back to OK.
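The wait for the alarm to return to OK can also be scripted. The following is a minimal polling sketch, assuming boto3 and the alarm name SshdAlarm used in this post; wait_for_alarm_state is an illustrative helper, and the timeout and interval values are arbitrary defaults.

```python
import time


def wait_for_alarm_state(cloudwatch, alarm_name, target="OK",
                         timeout=600, interval=15, sleep=time.sleep):
    """Poll DescribeAlarms until the alarm reaches target or timeout expires.

    Returns True if the target state was observed, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        resp = cloudwatch.describe_alarms(AlarmNames=[alarm_name])
        alarms = resp["MetricAlarms"]
        # StateValue is one of "OK", "ALARM", or "INSUFFICIENT_DATA".
        state = alarms[0]["StateValue"] if alarms else None
        if state == target:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

For example, wait_for_alarm_state(boto3.client("cloudwatch"), "SshdAlarm") would block until the alarm is back in the OK state or ten minutes have passed.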

In the CloudWatch console, check the alarm until it goes into the OK state, reconnect to the EC2 instance, and then use the following command to check the service status.

sudo systemctl status sshd

Observe the active status of the process in the output of the command.

sshd service is running on the instance.

Figure 15: sshd service is running on the instance

Cleanup

To clean up the resources you created in your account, open the AWS CloudFormation console and delete the stack. Then open the CloudWatch console and delete the alarm and the log group where the Lambda function’s logs are stored.

Conclusion

The Amazon CloudWatch agent can be used on your EC2 instances and on-premises servers. It provides a reliable way to collect process liveness information. This way, customers can be notified when critical processes and services go down unexpectedly, which allows them to take remedial action and maintain business operations.

In this blog post, I showed you how to build a fully automated process that detects and acts upon service or process crashes using AWS Systems Manager Run Command. The architecture uses the procstat plugin to monitor process metrics. An alarm attached to one of these metrics was triggered by missing data, which indicated a crashed process. The alarm then triggered a Lambda function that issued a Run Command call to restart the process.

You can watch critical processes and services in your servers and remediate them quickly by using these AWS automation capabilities.

AWS Cloud Operations & Migrations Blog