AWS Cloud Operations Blog
Detecting and remediating process issues on EC2 instances using Amazon CloudWatch and AWS Systems Manager
Customers want to have visibility into processes running inside their Amazon Elastic Compute Cloud (Amazon EC2) instances. Critical processes and services in these instances can crash unexpectedly and when they do, it’s crucial for customers to be notified so they can maintain continued business operations.
There are multiple ways to see if a service is running as expected. One way is to instrument the application code to send heartbeats to an observer process at certain intervals. Applications can also expose a liveness or readiness endpoint that can be polled by external monitoring systems to check if it is running and functioning properly. In cases where you don’t want to introduce more observability logic into your application, it’s a common practice to write operating system scripts to watch for the liveness of specific processes. But as an alternative to writing custom scripts, you can use the Amazon CloudWatch agent procstat plugin, which continuously watches specified processes and reports their metrics to Amazon CloudWatch. After the data is in Amazon CloudWatch, you can associate alarms to trigger actions like notifying teams or remediations like restarting the processes, resizing the instances, and so on.
In this blog post, I’ll show you how to use AWS Systems Manager Run Command on EC2 instances to restart the processes that the procstat plugin detected were down.
System architecture
By completing the steps in this post, you can create a system that uses the following architecture.
- An EC2 instance with the Amazon CloudWatch agent installed pushes metrics data to Amazon CloudWatch.
- The procstat plugin collects process metrics. As long as the process is running inside the instance, the plugin will continuously monitor the specified metrics.
- You’ll add a CloudWatch alarm on one of these metrics and set its missing data policy to Treat missing data as bad (breaching threshold) to trigger an alarm condition.
- When the process stops or crashes on the instance, the alarm goes into the In alarm state. A notification is sent to the specified Amazon Simple Notification Service (Amazon SNS) topic, which is consumed by an AWS Lambda function. The Lambda function extracts the hostname attribute from the payload and looks it up through the Amazon EC2 API to find its instance ID. This ID is required in the Run Command API call.
- AWS Lambda then issues a Run Command API call against AWS Systems Manager to initiate the execution of a shell command (service sshd restart) on the instance, which restarts the stopped service.
The architecture for this system is shown in Figure 1:
Figure 1: Architecture showing EC2 instance pushing metric data to Amazon CloudWatch
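The remediation path described above (SNS payload, hostname lookup, Run Command) can be sketched in Python. This is a minimal illustration, not the function that the CloudFormation template deploys. It assumes that the alarm payload lists its dimensions as name/value pairs, that the CloudWatch agent reports the instance's private DNS name in a host dimension, and that the helper names (extract_host, find_instance_id, restart_service) are hypothetical.

```python
import json


def extract_host(sns_event):
    """Return the `host` dimension from a CloudWatch alarm delivered through SNS.

    Assumption: the alarm message lists dimensions as {"name": ..., "value": ...}.
    """
    message = json.loads(sns_event["Records"][0]["Sns"]["Message"])
    for dimension in message["Trigger"]["Dimensions"]:
        if dimension["name"] == "host":
            return dimension["value"]
    raise ValueError("alarm payload has no 'host' dimension")


def find_instance_id(ec2, host):
    """Resolve a hostname to an instance ID through the EC2 API.

    Assumption: the hostname equals the instance's private DNS name.
    """
    response = ec2.describe_instances(
        Filters=[{"Name": "private-dns-name", "Values": [host]}]
    )
    for reservation in response["Reservations"]:
        for instance in reservation["Instances"]:
            return instance["InstanceId"]
    raise LookupError(f"no instance found for host {host!r}")


def restart_service(ssm, instance_id, command="service sshd restart"):
    """Send the restart command with Run Command and return the command ID."""
    response = ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": [command]},
    )
    return response["Command"]["CommandId"]


def lambda_handler(event, context):
    # boto3 is imported here so the helpers above can be exercised without AWS access.
    import boto3

    host = extract_host(event)
    instance_id = find_instance_id(boto3.client("ec2"), host)
    command_id = restart_service(boto3.client("ssm"), instance_id)
    return {"instance_id": instance_id, "command_id": command_id}
```

Passing the EC2 and Systems Manager clients in as arguments keeps each step small and lets you try the parsing and lookup logic locally with stand-in clients before deploying.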
Deploy the stack
To deploy the sample architecture in your AWS environment, use the AWS CloudFormation template included with this post.
The template creates the following resources in your account:
- A VPC and private subnet.
- An EC2 instance with the Amazon Linux 2 operating system.
- Multiple VPC endpoints.
- An Amazon SNS topic.
- A Lambda function with a subscription to the SNS topic.
- Other supporting resources, such as security groups and IAM roles.
To start your deployment, use the following link.
After deployment, the stack provides the EC2 instance ID in its outputs section. The AWS Systems Manager Agent (SSM Agent) is installed by default on Amazon Linux 2 instances. The template you deployed performed the network and security configuration required for SSM Agent to communicate with AWS Systems Manager. This means you can use Systems Manager to install software on this instance. On instances where SSM Agent is not already present, you can install it manually using scripts or configuration management tools like Chef, Puppet, or Ansible.
Use Systems Manager to install the CloudWatch agent on the EC2 instance
- Open the AWS Systems Manager console and from the left navigation pane, choose Run Command.
- On the Commands page, choose Run command to add a command.
- In Command document, type AWS-ConfigureAWSPackage to find and select the document.
Figure 2: Command document page in the Systems Manager console
- In Command parameters, for Name, enter AmazonCloudWatchAgent.
Figure 3: Command parameters
- In Targets, select the Choose instances manually option, and then select your EC2 instance. In a production environment, you might want to specify instance tags or resource groups to select EC2 instances.
Figure 4: Choose instances manually
- If you want to export the command output to an S3 bucket, in Output options, select the Write command output to an S3 bucket box and enter an S3 destination bucket. Leave the other parameters at their defaults.
- Choose Run. You can watch the command’s progress from the Commands page.
After the status transitions to Success, you can configure the CloudWatch agent.
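If you prefer to script the console steps above, the same Run Command request can be sent through the Systems Manager API. The following is a sketch that assumes boto3 is available and that you pass in your own instance ID. The helper names are hypothetical.

```python
def build_install_request(instance_ids):
    """Build the Run Command request that installs the CloudWatch agent package."""
    return {
        "DocumentName": "AWS-ConfigureAWSPackage",
        "InstanceIds": list(instance_ids),
        "Parameters": {
            "action": ["Install"],
            "name": ["AmazonCloudWatchAgent"],
        },
    }


def install_cloudwatch_agent(ssm, instance_ids):
    """Send the request with a Systems Manager client and return the command ID."""
    response = ssm.send_command(**build_install_request(instance_ids))
    return response["Command"]["CommandId"]


# Example usage (requires boto3 and AWS credentials; the instance ID is a placeholder):
#   import boto3
#   command_id = install_cloudwatch_agent(boto3.client("ssm"), ["i-0123456789abcdef0"])
```

The returned command ID identifies the same command that you would otherwise follow from the Commands page.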
Configure the CloudWatch agent
To configure the agent, connect to the EC2 instance. Because this instance is deployed in a private subnet, you must use Systems Manager to connect to it.
- In the Amazon EC2 console, choose your EC2 instance, and then choose Connect.
- On the Session Manager tab, choose Connect to establish the connection.
Figure 5: Session Manager tab in the EC2 console
After the connection is successful, create a CloudWatch agent configuration file.
sudo vi /opt/aws/amazon-cloudwatch-agent/bin/config.json
Add the following configuration JSON and then save the file.
{
"agent": {
"run_as_user": "cwagent"
},
"metrics": {
"metrics_collected": {
"procstat": [
{
"pid_file": "/var/run/sshd.pid",
"measurement": [
"cpu_usage",
"memory_rss"
]
}
]
}
}
}
This configuration enables the procstat plugin and tells it to monitor the sshd process identified by the sshd.pid file. The plugin will monitor the cpu_usage and memory_rss metrics of this process and send information to Amazon CloudWatch.
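If you need to watch more than one process, it can help to generate this file instead of editing it by hand. The sketch below builds the same structure in Python. The function names are hypothetical, and any pid file other than /var/run/sshd.pid would be your own addition.

```python
import json


def procstat_config(pid_files, measurements=("cpu_usage", "memory_rss")):
    """Return a CloudWatch agent configuration with one procstat entry per pid file."""
    return {
        "agent": {"run_as_user": "cwagent"},
        "metrics": {
            "metrics_collected": {
                "procstat": [
                    {"pid_file": pid_file, "measurement": list(measurements)}
                    for pid_file in pid_files
                ]
            }
        },
    }


def render_config(pid_files):
    """Serialize the configuration as indented JSON, ready to write to config.json."""
    return json.dumps(procstat_config(pid_files), indent=2)


print(render_config(["/var/run/sshd.pid"]))
```

With a single pid file, the output carries the same content as the configuration shown above.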
Now use the following command to start the CloudWatch agent with its new configuration:
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -s -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json
This command configures the amazon-cloudwatch-agent service on this machine, which you can then start, stop, and restart with systemctl commands.
Create a CloudWatch alarm
You can observe the newly created namespace and its metrics in the CloudWatch console.
- From the left navigation pane, choose Metrics.
- On the All metrics tab, in Custom namespaces, choose CWAgent to see CWAgent-specific metrics.
Figure 6: Metrics emitted by the CloudWatch agent
- Select the checkbox for the procstat_memory_rss metric to observe its data on the metrics graph.
- Choose the Graphed Metrics tab.
Figure 7: Graphed metrics tab
- Under Actions, choose the bell icon to open the Create alarm page.
- In Metric, for Period, choose 1 minute. Leave the other parameters at their defaults.
Figure 8: Alarm metric parameters
- In Conditions, complete the fields as shown in Figure 9:
Figure 9: Alarm parameters
These settings tell CloudWatch to go into the In alarm state if the procstat_memory_rss metric value is lower than or equal to 0, or if missing data is detected. Note that the memory consumption of a running process is never zero or negative, so in practice this alarm configuration tracks a missing data situation. Whenever the sshd process stops or crashes, the operating system will delete its pid file. The CloudWatch agent will detect the deletion and will not emit any metric data as long as the file is missing. This will cause the CloudWatch alarm to go into the In alarm state.
- Choose Next to configure actions for this alarm.
Figure 10: Alarm notification parameters
- In Notification, choose the SNS topic that was created by the CloudFormation template, and then choose Next.
- Enter a name and optional description for the alarm, and then choose Next.
Figure 11: Alarm name
- Review your alarm configuration, and then choose Create alarm. The status displayed for the alarm will initially be Insufficient Data. When the status changes to OK, continue to the next section.
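The alarm created in these steps can also be expressed as a CloudWatch PutMetricAlarm request. The sketch below reflects the settings described in this section (CWAgent namespace, procstat_memory_rss, 1-minute period, lower than or equal to 0, missing data treated as breaching). The Average statistic, the single evaluation period, and the dimensions you pass in are assumptions, because they depend on the values shown in the console for your metric.

```python
def build_alarm_request(alarm_name, topic_arn, dimensions):
    """Build PutMetricAlarm arguments for the process liveness alarm.

    `dimensions` is a list of {"Name": ..., "Value": ...} pairs copied from
    the metric as it appears in the CWAgent namespace.
    """
    return {
        "AlarmName": alarm_name,
        "Namespace": "CWAgent",
        "MetricName": "procstat_memory_rss",
        "Dimensions": list(dimensions),
        "Statistic": "Average",        # assumption: not stated in the text
        "Period": 60,                  # 1 minute
        "EvaluationPeriods": 1,        # assumption: not stated in the text
        "Threshold": 0.0,
        "ComparisonOperator": "LessThanOrEqualToThreshold",
        "TreatMissingData": "breaching",
        "AlarmActions": [topic_arn],
    }


# Example usage (requires boto3 and AWS credentials; all values are placeholders):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(
#       **build_alarm_request("SshdAlarm", "<sns-topic-arn>", [{"Name": "host", "Value": "<hostname>"}])
#   )
```

The TreatMissingData setting is the part that matters most here: because a stopped process produces no data points, the alarm fires on the gap rather than on the threshold.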
Test your configuration
To test your configuration, use Systems Manager to reconnect to your EC2 instance and then issue the following command to make sure the sshd service is already running on the server:
sudo systemctl status sshd
Figure 12: sshd service is running on the instance
Use the following command to stop the service:
sudo systemctl stop sshd
Figure 13: sshd service is stopped on the instance
After the daemon is stopped, go to the CloudWatch console to observe the alarm condition. The alarm will go into the In alarm state in about five minutes.
Figure 14: SshdAlarm
When this happens, CloudWatch will notify the SNS topic, which will then trigger the Lambda function. The Lambda function will fetch the hostname from the SNS message payload and do a lookup against the EC2 API to find the instance ID. It will use this ID to send the Run Command API call. While the Lambda function is executing this operation, it writes logs that you can review in the CloudWatch console. At this point, the sshd service comes back up and the agent will start emitting its metrics again. The arrival of metrics will transition the alarm state back to OK.
In the CloudWatch console, check the alarm until it goes into the OK state, reconnect to the EC2 instance, and then use the following command to check the service status.
sudo systemctl status sshd
Observe the active status of the process in the output of the command.
Figure 15: sshd service is running on the instance
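Instead of refreshing the console while you wait for the alarm to recover, you could poll its state through the CloudWatch API. This is a sketch with a hypothetical helper name. The timeout and polling interval are arbitrary defaults.

```python
import time


def wait_for_alarm_state(cloudwatch, alarm_name, target="OK",
                         timeout=600, interval=15, sleep=time.sleep):
    """Poll DescribeAlarms until the alarm reaches `target` or the timeout expires.

    Returns True if the target state was observed, otherwise False.
    """
    deadline = time.monotonic() + timeout
    while True:
        response = cloudwatch.describe_alarms(AlarmNames=[alarm_name])
        alarms = response["MetricAlarms"]
        if alarms and alarms[0]["StateValue"] == target:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


# Example usage (requires boto3 and AWS credentials):
#   import boto3
#   recovered = wait_for_alarm_state(boto3.client("cloudwatch"), "SshdAlarm")
```

Calling the same helper with target="ALARM" right after you stop the service lets you confirm the detection step as well.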
Cleanup
To clean up the resources you created in your account, open the AWS CloudFormation console and delete the stack. Then open the CloudWatch console and delete the alarm and the log group where the Lambda function’s logs are stored.
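The same cleanup can be scripted. The sketch below takes the three service clients as arguments. The stack name, alarm name, and log group name are placeholders that you replace with your own values (Lambda log groups are normally named /aws/lambda/ followed by the function name).

```python
def cleanup(cloudformation, cloudwatch, logs, stack_name, alarm_name, log_group_name):
    """Delete the alarm, the Lambda function's log group, and the stack."""
    cloudwatch.delete_alarms(AlarmNames=[alarm_name])
    logs.delete_log_group(logGroupName=log_group_name)
    # Stack deletion is asynchronous; this call only starts it.
    cloudformation.delete_stack(StackName=stack_name)


# Example usage (requires boto3 and AWS credentials; names are placeholders):
#   import boto3
#   cleanup(boto3.client("cloudformation"), boto3.client("cloudwatch"),
#           boto3.client("logs"), "<stack-name>", "SshdAlarm",
#           "/aws/lambda/<function-name>")
```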
Conclusion
The Amazon CloudWatch agent can be used on your EC2 instances and on-premises servers. It provides a reliable way to collect process liveness information. This way, customers can be notified when critical processes and services go down unexpectedly, which allows them to take remedial actions for continued business operations.
In this blog post, I showed you how to build a fully automated process that can detect and act upon service or process crashes using AWS Systems Manager Run Command. The architecture uses the procstat plugin to monitor process metrics. An alarm attached to one of these metrics was triggered because of missing data, which indicated a stopped or crashed process. This alarm then triggered a Lambda function that issued the Run Command call to restart the process.
You can watch critical processes and services in your servers and remediate them quickly by using these AWS automation capabilities.