AWS Cloud Operations Blog
Introducing CloudWatch Resource Health to monitor your EC2 hosts
Today, AWS announced Amazon CloudWatch Resource Health, a fully managed solution that customers can use to automatically discover, manage, and visualize the health and performance of Amazon Elastic Compute Cloud (Amazon EC2) hosts across their applications. Resource Health provides a centralized view of your EC2 hosts by performance dimensions such as CPU or memory utilization. You can use Resource Health to slice and dice hosts using filters such as instance type, instance state, or security groups. It enables a side-by-side comparison of a group of EC2 hosts and provides granular insights into an individual host.
Getting started with Resource Health
AWS customers might have hundreds to thousands of hosts driving their applications. Resource Health, available through the Amazon CloudWatch console under ServiceLens, makes it possible for you to visualize your EC2 hosts without any configuration changes. It discovers all the hosts in the account, captures their metrics and associated tags, and provides an easy-to-use experience to visualize health and performance metrics across EC2 hosts in real time. This ease of use makes Resource Health a powerful tool if you are looking for infrastructure-level visibility, with minimal effort.
Figure 1: Resource Health in the CloudWatch console
Customize your view
You can customize Resource Health views to define thresholds and color schemes to easily spot your EC2 hosts. You can choose a metric for your hosts, such as CPU utilization, Memory utilization, and Status checks. To customize your view, choose the settings icon in the menu bar.
Group by capability
You can use Resource Health to group EC2 hosts into smaller chunks, which makes it possible to isolate applications or hosts that are experiencing performance issues. For example, you can group your hosts based on the EC2 host’s CPU architecture, instance type, instance state, instance lifecycle, image ID, the VPC it is launched in, the Availability Zone it resides in, or the Auto Scaling group it is a part of.
Filter by capability
In addition to grouping EC2 hosts, you can also filter them by tags and properties, including Auto Scaling group, Availability Zone, CPU utilization, EBS volume, instance type, instance lifecycle, load balancer, memory utilization, security group, instance state, status check, and VPC.
Sort by capability
You can also sort your hosts from left to right based on the status check, instance state, health, memory, CPU, or alarms. The sort order can be increasing or decreasing.
You can slice and dice your hosts in any way you want.
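The grouping, filtering, and sorting described above can be pictured with a small sketch over in-memory host records. This is only an illustration of the three operations, not how Resource Health is implemented: the `HOSTS` records, their field names, and the helper functions are all hypothetical.

```python
from collections import defaultdict

# Hypothetical host records. The field names are illustrative only and are
# not part of any Resource Health or CloudWatch API.
HOSTS = [
    {"id": "i-0a", "instance_type": "t3.micro", "state": "running",
     "cpu": 91.0, "tags": {"svcName": "SvcManageGameState"}},
    {"id": "i-0b", "instance_type": "t3.micro", "state": "running",
     "cpu": 35.0, "tags": {"svcName": "SvcUserAuth"}},
    {"id": "i-0c", "instance_type": "m5.large", "state": "stopped",
     "cpu": 0.0, "tags": {"svcName": "SvcManageGameState"}},
    {"id": "i-0d", "instance_type": "m5.large", "state": "running",
     "cpu": 62.5, "tags": {"svcName": "SvcDisplayLeaderBoard"}},
]


def filter_hosts(hosts, **props):
    """Keep hosts whose top-level properties equal every given value."""
    return [h for h in hosts if all(h.get(k) == v for k, v in props.items())]


def filter_by_tag(hosts, key, value):
    """Keep hosts that carry the tag key with exactly this value."""
    return [h for h in hosts if h["tags"].get(key) == value]


def group_hosts(hosts, prop):
    """Bucket hosts by the value of one property, such as instance type."""
    groups = defaultdict(list)
    for host in hosts:
        groups[host[prop]].append(host)
    return dict(groups)


def sort_hosts(hosts, metric, descending=True):
    """Order hosts by a numeric metric, highest first by default."""
    return sorted(hosts, key=lambda h: h[metric], reverse=descending)


if __name__ == "__main__":
    running = filter_hosts(HOSTS, state="running")
    for instance_type, members in group_hosts(running, "instance_type").items():
        ordered = sort_hosts(members, "cpu")
        print(instance_type, [(h["id"], h["cpu"]) for h in ordered])
```

Each helper returns a new list or dictionary, so the operations can be chained in any order, which mirrors how the console lets you combine a filter, a grouping, and a sort order in one view.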
Navigating Resource Health
Resource Health shows a summary of the health of all hosts in a single AWS Region, where each host is represented by a square cell. When you choose a cell, you’ll see summary information for that host, including alarms and CPU and memory utilization. You can also dive deeper into host-level dashboards.
There are two ways to navigate in Resource Health and visualize your EC2 host metrics:
- First, you can choose an EC2 host from the aggregated view and then choose View dashboard to navigate to the host overview page. This view provides information about instance metadata, metrics from the EC2 host and the Amazon CloudWatch agent installed on it, and alerts such as alarms set up on the host. It also provides information about the resources attached to the host, such as EBS volumes, VPCs, and load balancers, so you can correlate the health of the host with these resources. You can navigate to the EC2 console from the host overview page and then take actions such as restart or terminate on the selected EC2 hosts.
- Second, you can group your hosts in the Resource Health aggregated view to visualize a subset of your infrastructure. Choose View dashboard to navigate to the group dashboard page. This view provides information for all EC2 hosts in the group, based on the group by and filter by properties you chose. The group dashboard provides easy-to-understand graphs for:
  - CPU utilization.
  - Disk utilization details like DiskReadOps, DiskReadBytes, DiskWriteOps, DiskWriteBytes.
  - Network utilization details like Average NetworkPacketsIn, NetworkPacketsOut, NetworkBytesIn, NetworkBytesOut.
  - Status details.
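If you want to pull comparable numbers outside the console, one option is to query the same kinds of metrics from the AWS/EC2 namespace. The sketch below only builds the query list for the CloudWatch `GetMetricData` API. It is not how the group dashboard itself is built, and the function name and defaults are hypothetical. It also assumes that the bytes metrics listed above correspond to the CloudWatch metric names `NetworkIn` and `NetworkOut`, and that `StatusCheckFailed` is an acceptable stand-in for the status details.

```python
# Metric names in the AWS/EC2 namespace. Assumption: the network bytes
# metrics are the ones CloudWatch publishes as NetworkIn and NetworkOut.
GROUP_DASHBOARD_METRICS = [
    "CPUUtilization",
    "DiskReadOps", "DiskReadBytes", "DiskWriteOps", "DiskWriteBytes",
    "NetworkPacketsIn", "NetworkPacketsOut", "NetworkIn", "NetworkOut",
    "StatusCheckFailed",
]


def build_metric_queries(instance_id, period=300, stat="Average"):
    """Build a MetricDataQueries list for one EC2 instance.

    The function name and its defaults are hypothetical; the dictionary
    layout follows the GetMetricData request shape.
    """
    queries = []
    for index, name in enumerate(GROUP_DASHBOARD_METRICS):
        queries.append({
            # Query IDs must start with a lowercase letter.
            "Id": f"m{index}",
            "Label": name,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/EC2",
                    "MetricName": name,
                    "Dimensions": [
                        {"Name": "InstanceId", "Value": instance_id},
                    ],
                },
                "Period": period,  # seconds
                "Stat": stat,
            },
            "ReturnData": True,
        })
    return queries


if __name__ == "__main__":
    for query in build_metric_queries("i-0123456789abcdef0"):
        print(query["Id"], query["MetricStat"]["Metric"]["MetricName"])
```

With boto3 installed and credentials configured, the returned list could be passed as `MetricDataQueries` to the CloudWatch client's `get_metric_data` call together with `StartTime` and `EndTime`.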
Troubleshooting with Resource Health
I’ll share an example of how I used Resource Health to troubleshoot an issue with my application.
I have a small gaming application that has three main services. Service A manages user authentication. Service B manages game state. Service C displays leaderboards. When the leaderboard was loading more slowly than usual, I wanted to identify which service was causing the slowdown. The EC2 instances in Service A are tagged with svcName: SvcUserAuth. In Service B, they are tagged with svcName: SvcManageGameState, and in Service C, they are tagged with svcName: SvcDisplayLeaderBoard.
From the left navigation pane of the Amazon CloudWatch console, I expand ServiceLens, and then choose Resource Health.
My application is configured to raise an alarm when a host is utilizing more than 75% of memory or 85% of CPU. On the Resource Health page, I immediately notice the hosts triggering the alarms. In this case, the impacted hosts are identified by the In alarm icon, as shown in Figure 2.
Figure 2: Resource Health aggregate view showing EC2 CPU utilization
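The post does not show how these alarms were created. As a rough sketch, the two thresholds could be expressed as parameter sets for the CloudWatch `PutMetricAlarm` API. The alarm names, period, and evaluation settings below are hypothetical, and the memory alarm assumes the CloudWatch agent publishes `mem_used_percent` in the `CWAgent` namespace with only an `InstanceId` dimension, which depends on your agent configuration.

```python
CPU_THRESHOLD = 85.0     # percent, from the thresholds described above
MEMORY_THRESHOLD = 75.0  # percent, from the thresholds described above


def in_alarm(cpu_percent, memory_percent):
    """Mirror the rule in the text: more than 85% CPU or more than 75% memory."""
    return cpu_percent > CPU_THRESHOLD or memory_percent > MEMORY_THRESHOLD


def cpu_alarm_params(instance_id, threshold=CPU_THRESHOLD):
    """Parameters for a CPU alarm on one instance (naming is hypothetical)."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,           # seconds; an assumed value
        "EvaluationPeriods": 1,  # an assumed value
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
    }


def memory_alarm_params(instance_id, threshold=MEMORY_THRESHOLD):
    """Parameters for a memory alarm on one instance.

    Assumes the CloudWatch agent publishes mem_used_percent to the CWAgent
    namespace with only an InstanceId dimension.
    """
    return {
        "AlarmName": f"high-memory-{instance_id}",
        "Namespace": "CWAgent",
        "MetricName": "mem_used_percent",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
    }


if __name__ == "__main__":
    print(in_alarm(cpu_percent=91.0, memory_percent=40.0))
    print(cpu_alarm_params("i-0123456789abcdef0")["AlarmName"])
```

With boto3 installed, each dictionary could be unpacked into the CloudWatch client's `put_metric_alarm` call, for example `cloudwatch.put_metric_alarm(**cpu_alarm_params(instance_id))`.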
I am interested in visualizing the health of my EC2 instances across the three services. When I filter the EC2 instances by svcName: SvcUserAuth, I see that none of the hosts in my user authentication service are raising any alarms, as shown in Figure 3.
Figure 3: Resource Health filter set to svcName: SvcUserAuth
When I filter the EC2 instances by svcName: SvcManageGameState, I see that multiple hosts are raising alarms, as shown in Figure 4.
Figure 4: Resource Health filter set to svcName: SvcManageGameState
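The same tag-based filtering can be approximated outside the console with the EC2 `DescribeInstances` API, which accepts a `tag:<key>` filter. The sketch below is an assumption about how you might script it, not something Resource Health exposes. The function name is hypothetical, and it takes any object that behaves like a boto3 EC2 client, so it can be exercised without AWS credentials.

```python
def instance_ids_for_service(ec2_client, service_name, tag_key="svcName"):
    """Return the IDs of instances whose tag_key tag equals service_name.

    ec2_client is expected to behave like a boto3 EC2 client: it must offer
    get_paginator("describe_instances") whose paginate() yields response pages.
    """
    paginator = ec2_client.get_paginator("describe_instances")
    pages = paginator.paginate(
        Filters=[{"Name": f"tag:{tag_key}", "Values": [service_name]}]
    )
    instance_ids = []
    for page in pages:
        # Instances are nested inside reservations in each response page.
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                instance_ids.append(instance["InstanceId"])
    return instance_ids
```

With boto3 installed and credentials configured, a call such as `instance_ids_for_service(boto3.client("ec2"), "SvcManageGameState")` would list the hosts behind the game state service.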
I can group the hosts in my SvcManageGameState service by instance type, instance state, and instance lifecycle. This allows me to quickly identify the instances that are raising alarms so I can dive deeper into the performance issue.
Figure 5: Resource Health filter set to svcName: SvcManageGameState
To investigate the instance that is in alarm, I choose the EC2 host and then choose View dashboard. On the host overview page, I see that the CPU utilization is over 90%. This is causing a delay in the communication of game state to the SvcDisplayLeaderBoard service. After I scale the number of hosts serving the display of the leaderboard, I see the CPU utilization drop to 75%, which is below the 85% utilization that triggers alarms.
Figure 6: Resource Health dashboard
Resource Health helped me quickly identify the hosts in an alarm state and reduce MTTR for incidents affecting my application. The dashboard quickly surfaced the information I needed.
Conclusion
In this blog post, I shared an example of how I used Resource Health to monitor and troubleshoot the performance of EC2 hosts across an application.
Resource Health is generally available for monitoring the performance of EC2 instances, across all AWS Regions. If you have the Amazon CloudWatch Agent installed, you can get memory utilization insights through Resource Health. For more information, see Using Resource Health in the Amazon CloudWatch User Guide. To learn more about the observability functionalities of Amazon CloudWatch, see the One Observability Demo workshop.