AWS Cloud Operations & Migrations Blog

Introducing CloudWatch Resource Health to monitor your EC2 hosts

Today, AWS announced Amazon CloudWatch Resource Health, a fully managed solution that customers can use to automatically discover, manage, and visualize the health and performance of Amazon Elastic Compute Cloud (Amazon EC2) hosts across their applications. Resource Health provides a centralized view of your EC2 hosts by performance dimensions such as CPU or memory utilization. You can use Resource Health to slice and dice hosts using filters such as instance type, instance state, or security groups. It enables a side-by-side comparison of a group of EC2 hosts and provides granular insights into an individual host.

Getting started with Resource Health

AWS customers might run hundreds to thousands of hosts behind their applications. Resource Health, available in the Amazon CloudWatch console under ServiceLens, lets you visualize your EC2 hosts without any configuration changes. It discovers all the hosts in the account, captures their metrics and associated tags, and presents health and performance metrics across those hosts in real time. This ease of use makes Resource Health a powerful tool if you want infrastructure-level visibility with minimal effort.

Figure 1: Resource Health in the CloudWatch console

Customize your view

You can customize Resource Health views by defining thresholds and color schemes that make hosts needing attention easy to spot. You can choose which metric to display for your hosts, such as CPU utilization, memory utilization, or status checks. To customize your view, choose the settings icon in the menu bar.
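Conceptually, a threshold setting maps each host's metric value to a color band. The following minimal Python sketch illustrates that idea; the band colors and the 60% and 85% cut-offs are hypothetical values chosen for illustration, not Resource Health defaults, and this is not how the console is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Band:
    """A color band: values at or above `lower` (and below the next band up) get `color`."""
    lower: float
    color: str


# Hypothetical CPU utilization (%) bands, for illustration only.
CPU_BANDS = [
    Band(85.0, "red"),
    Band(60.0, "yellow"),
    Band(0.0, "green"),
]


def color_for(value: float, bands: list[Band]) -> str:
    """Return the color of the highest band whose lower bound the value meets."""
    for band in sorted(bands, key=lambda b: b.lower, reverse=True):
        if value >= band.lower:
            return band.color
    return "gray"  # value below every band (for example, a sentinel for missing data)


# Hypothetical instance IDs and CPU readings.
cpu_by_host = {"i-0a1": 92.5, "i-0b2": 70.0, "i-0c3": 12.0}
colors = {host: color_for(cpu, CPU_BANDS) for host, cpu in cpu_by_host.items()}
print(colors)  # {'i-0a1': 'red', 'i-0b2': 'yellow', 'i-0c3': 'green'}
```

Sorting the bands inside `color_for` means the thresholds can be declared in any order.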

Group by capability

You can use Resource Health to group EC2 hosts into smaller chunks, which makes it possible to isolate applications or hosts that are experiencing performance issues. For example, you can group your hosts based on the EC2 host’s CPU architecture, instance type, instance state, instance lifecycle, image ID, the VPC it is launched in, the Availability Zone it resides in, or the Auto Scaling group it is a part of.
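To make the grouping idea concrete, the following sketch buckets instance records by one attribute. It assumes records shaped like the `Instances` entries returned by boto3's `EC2.Client.describe_instances` (`InstanceType`, `State.Name`, `Placement.AvailabilityZone`, `VpcId`, and the optional `InstanceLifecycle`). The sample data and the `group_hosts` helper are hypothetical, and this is not how Resource Health groups hosts internally.

```python
from collections import defaultdict
from typing import Any, Callable

Instance = dict[str, Any]

# Key functions for a few group-by dimensions, following the describe_instances field layout.
GROUP_KEYS: dict[str, Callable[[Instance], str]] = {
    "instance_type": lambda i: i["InstanceType"],
    "instance_state": lambda i: i["State"]["Name"],
    "availability_zone": lambda i: i["Placement"]["AvailabilityZone"],
    "vpc": lambda i: i.get("VpcId", "no-vpc"),
    # InstanceLifecycle is only present for Spot and Scheduled instances.
    "instance_lifecycle": lambda i: i.get("InstanceLifecycle", "on-demand"),
}


def group_hosts(instances: list[Instance], by: str) -> dict[str, list[str]]:
    """Return {group value: [instance IDs]} for the chosen dimension."""
    key = GROUP_KEYS[by]
    groups: dict[str, list[str]] = defaultdict(list)
    for inst in instances:
        groups[key(inst)].append(inst["InstanceId"])
    return dict(groups)


# Hypothetical sample records in describe_instances shape.
sample = [
    {"InstanceId": "i-01", "InstanceType": "c6g.16xlarge", "State": {"Name": "running"},
     "Placement": {"AvailabilityZone": "us-east-1a"}, "VpcId": "vpc-1"},
    {"InstanceId": "i-02", "InstanceType": "c6gn.16xlarge", "State": {"Name": "running"},
     "Placement": {"AvailabilityZone": "us-east-1b"}, "VpcId": "vpc-1",
     "InstanceLifecycle": "spot"},
    {"InstanceId": "i-03", "InstanceType": "c6gn.16xlarge", "State": {"Name": "stopped"},
     "Placement": {"AvailabilityZone": "us-east-1a"}, "VpcId": "vpc-1"},
]

print(group_hosts(sample, "instance_type"))
# {'c6g.16xlarge': ['i-01'], 'c6gn.16xlarge': ['i-02', 'i-03']}
```

With live data, the records would come from flattening `Reservations[*].Instances` across the pages of a `describe_instances` paginator.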

Filter by capability

In addition to grouping EC2 hosts, you can also filter them by tags and properties, including Auto Scaling group, Availability Zone, CPU utilization, EBS volume, instance type, instance lifecycle, load balancer, memory utilization, security group, instance state, status check, and VPC.
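Filtering combines criteria, so a host must match all of them to stay in view. The following sketch applies a tag filter and a few property filters client-side to records shaped like `describe_instances` output. The `filter_hosts` helper and the sample data are hypothetical; the AND semantics are an assumption for illustration.

```python
from typing import Any, Optional

Instance = dict[str, Any]


def tag_value(inst: Instance, key: str) -> Optional[str]:
    """Look up a tag in the EC2 Tags list ([{'Key': ..., 'Value': ...}])."""
    for tag in inst.get("Tags", []):
        if tag["Key"] == key:
            return tag["Value"]
    return None


def filter_hosts(instances: list[Instance], *, tags: Optional[dict[str, str]] = None,
                 instance_type: Optional[str] = None,
                 state: Optional[str] = None) -> list[str]:
    """Return IDs of instances that match every supplied criterion."""
    matched = []
    for inst in instances:
        if tags and any(tag_value(inst, k) != v for k, v in tags.items()):
            continue
        if instance_type and inst["InstanceType"] != instance_type:
            continue
        if state and inst["State"]["Name"] != state:
            continue
        matched.append(inst["InstanceId"])
    return matched


# Hypothetical sample records.
instances = [
    {"InstanceId": "i-01", "InstanceType": "c6g.16xlarge", "State": {"Name": "running"},
     "Tags": [{"Key": "svcName", "Value": "SvcUserAuth"}]},
    {"InstanceId": "i-02", "InstanceType": "c6gn.16xlarge", "State": {"Name": "running"},
     "Tags": [{"Key": "svcName", "Value": "SvcManageGameState"}]},
    {"InstanceId": "i-03", "InstanceType": "c6gn.16xlarge", "State": {"Name": "stopped"},
     "Tags": [{"Key": "svcName", "Value": "SvcManageGameState"}]},
    {"InstanceId": "i-04", "InstanceType": "r6g.16xlarge", "State": {"Name": "running"}},
]

print(filter_hosts(instances, tags={"svcName": "SvcManageGameState"}, state="running"))
# ['i-02']
```

The EC2 API can apply equivalent filters server-side, for example `describe_instances(Filters=[{"Name": "tag:svcName", "Values": ["SvcManageGameState"]}, {"Name": "instance-state-name", "Values": ["running"]}])`.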

Sort by capability

You can also sort your hosts from left to right by status check, instance state, health, memory utilization, CPU utilization, or alarms, in increasing or decreasing order.
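The following sketch shows one reasonable way to sort host summaries by a chosen field in either direction. The `HostSummary` type is hypothetical, and placing hosts without data last (for example, memory on hosts without the CloudWatch agent) is an assumed design choice, not documented Resource Health behavior.

```python
from typing import NamedTuple, Optional


class HostSummary(NamedTuple):
    instance_id: str
    cpu: Optional[float]     # percent; None if there are no datapoints
    memory: Optional[float]  # percent; None without the CloudWatch agent
    in_alarm: bool


def sort_hosts(hosts: list[HostSummary], by: str, descending: bool = True) -> list[str]:
    """Sort by one field; hosts with no value for that field always go last."""
    with_data = [h for h in hosts if getattr(h, by) is not None]
    without_data = [h for h in hosts if getattr(h, by) is None]
    # Python's sort is stable, so ties keep their original order (also with reverse=True).
    with_data.sort(key=lambda h: getattr(h, by), reverse=descending)
    return [h.instance_id for h in with_data + without_data]


# Hypothetical host summaries.
hosts = [
    HostSummary("i-01", 45.0, 60.0, False),
    HostSummary("i-02", 91.0, 70.0, True),
    HostSummary("i-03", 12.0, None, False),
]

print(sort_hosts(hosts, "cpu"))                        # ['i-02', 'i-01', 'i-03']
print(sort_hosts(hosts, "memory", descending=False))   # ['i-01', 'i-02', 'i-03']
```

Because `True` sorts above `False`, sorting by `in_alarm` in descending order puts hosts in alarm first.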

By combining grouping, filtering, and sorting, you can slice and dice your hosts in whatever way suits your investigation.

Navigating Resource Health

Resource Health shows a summary of the health of all hosts in a single AWS Region, where each host is represented by a square cell. When you choose a cell, you see summary information for that host, including alarms and CPU and memory utilization. From there, you can dive deeper into host-level dashboards.

There are two ways to navigate in Resource Health and visualize your EC2 host metrics:

  • First, you can choose an EC2 host from the aggregated view and then choose View dashboard to navigate to the host overview page. This view provides instance metadata, metrics from the EC2 host and from the Amazon CloudWatch agent installed on it, and alerts such as alarms set up on the host. It also lists the resources attached to the host, such as EBS volumes, VPCs, and load balancers, so you can correlate the health of the host with these resources. From the host overview page, you can navigate to the EC2 console and take actions such as rebooting or terminating the selected host.
  • Second, you can group your hosts in the Resource Health aggregated view to visualize a subset of your infrastructure. Choose View dashboard to navigate to the group dashboard page. This view provides information for all EC2 hosts in the group, based on the group by and filter by properties you chose. The group dashboard provides easy-to-understand graphs for:
    • CPU utilization.
    • Disk utilization details, such as DiskReadOps, DiskReadBytes, DiskWriteOps, and DiskWriteBytes.
    • Network utilization details, such as the average NetworkPacketsIn, NetworkPacketsOut, NetworkIn, and NetworkOut.
    • Status details.
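If you want to pull the same kinds of EC2 metrics for a group of hosts programmatically, CloudWatch's `GetMetricData` API accepts a list of metric queries. The following sketch builds that list for boto3's `get_metric_data`; the metric selection, the 300-second period, and the `Average` statistic are assumptions, and the instance IDs are hypothetical. It does not reproduce the exact queries behind the group dashboard.

```python
from typing import Any

# AWS/EC2 metrics that roughly correspond to the group dashboard graphs.
GROUP_METRICS = [
    "CPUUtilization",
    "DiskReadOps", "DiskReadBytes", "DiskWriteOps", "DiskWriteBytes",
    "NetworkPacketsIn", "NetworkPacketsOut", "NetworkIn", "NetworkOut",
    "StatusCheckFailed",
]


def build_queries(instance_ids: list[str], period: int = 300,
                  stat: str = "Average") -> list[dict[str, Any]]:
    """Build MetricDataQueries for GetMetricData, one per (instance, metric) pair."""
    queries = []
    for i, instance_id in enumerate(instance_ids):
        for j, metric in enumerate(GROUP_METRICS):
            queries.append({
                # Query Ids must start with a lowercase letter and cannot contain
                # hyphens, so index-based Ids are used instead of instance IDs.
                "Id": f"m{i}_{j}",
                "Label": f"{instance_id} {metric}",
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/EC2",
                        "MetricName": metric,
                        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
                    },
                    "Period": period,
                    "Stat": stat,
                },
                "ReturnData": True,
            })
    return queries


queries = build_queries(["i-0123456789abcdef0", "i-0fedcba9876543210"])
print(len(queries))  # 20
```

The list could then be passed as `boto3.client("cloudwatch").get_metric_data(MetricDataQueries=queries, StartTime=..., EndTime=...)`. A single call accepts up to 500 queries, so larger groups would need batching.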

Troubleshooting with Resource Health

I’ll share an example of how I used Resource Health to troubleshoot an issue with my application.

I have a small gaming application that has three main services. Service A manages user authentication. Service B manages game state. Service C displays leaderboards. When the leaderboard was loading more slowly than usual, I wanted to identify which service was causing the slowdown. The EC2 instances in Service A are tagged with svcName: SvcUserAuth. In Service B, they are tagged with svcName: SvcManageGameState, and in Service C, they are tagged with svcName: SvcDisplayLeaderBoard.

From the left navigation pane of the Amazon CloudWatch console, I expand ServiceLens, and then choose Resource Health.

My application is configured to raise an alarm when a host uses more than 75% of its memory or 85% of its CPU. On the Resource Health page, I immediately noticed the hosts that were triggering alarms, because they were marked with the In alarm icon, as shown in Figure 2.
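The post does not show how these alarms were created. The following sketch builds keyword arguments for boto3's `put_metric_alarm` that match the stated thresholds (CPU above 85%, memory above 75%). It assumes that memory comes from the CloudWatch agent's `mem_used_percent` metric in the default `CWAgent` namespace with an `InstanceId` dimension. Alarm dimensions must match exactly what the agent publishes, so adjust them if your agent appends other dimensions. The naming scheme, period, and evaluation periods are hypothetical.

```python
from typing import Any

# (namespace, metric name) per alarm kind. mem_used_percent is a CloudWatch agent
# metric on Linux; CWAgent is the agent's default namespace.
METRICS = {
    "cpu": ("AWS/EC2", "CPUUtilization"),
    "memory": ("CWAgent", "mem_used_percent"),
}

# Thresholds taken from the scenario above.
THRESHOLDS = {"cpu": 85.0, "memory": 75.0}


def alarm_params(instance_id: str, kind: str, period: int = 300,
                 evaluation_periods: int = 3) -> dict[str, Any]:
    """Keyword arguments for CloudWatch PutMetricAlarm on one instance and one metric."""
    namespace, metric_name = METRICS[kind]
    return {
        "AlarmName": f"{instance_id}-{kind}-high",  # hypothetical naming scheme
        "Namespace": namespace,
        "MetricName": metric_name,
        # Assumes the agent is configured to append only InstanceId for memory.
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": period,
        "EvaluationPeriods": evaluation_periods,
        "Threshold": THRESHOLDS[kind],
        "ComparisonOperator": "GreaterThanThreshold",
    }


for kind in ("cpu", "memory"):
    params = alarm_params("i-0123456789abcdef0", kind)
    print(params["AlarmName"], params["MetricName"], params["Threshold"])
```

Each dictionary could be applied with `boto3.client("cloudwatch").put_metric_alarm(**params)`.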

Figure 2: Resource Health aggregate view showing EC2 CPU utilization, with hosts above 85% CPU utilization in alarm

I am interested in visualizing the health of my EC2 instances across the three services. When I filter the EC2 instances by svcName: SvcUserAuth, I see that none of the hosts in my user authentication service are raising any alarms, as shown in Figure 3.

Figure 3: Resource Health filter set to svcName: SvcUserAuth, showing no EC2 hosts in alarm

When I filter the EC2 instances by svcName: SvcManageGameState, I see that multiple hosts are raising alarms, as shown in Figure 4.

Figure 4: Resource Health filter set to svcName: SvcManageGameState, showing EC2 hosts in alarm

I can group the hosts in my SvcManageGameState service by instance type, instance state, and instance lifecycle. This allows me to quickly identify the instances that are raising alarms so I can dive deeper into the performance issue.
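The same "filter by service, then group by instance type" step can be sketched in code. The following example counts hosts in alarm per instance type for one `svcName` value. It assumes the set of alarming instance IDs is derived from alarm records shaped like the `MetricAlarms` entries returned by boto3's `describe_alarms(StateValue="ALARM")`. All instance IDs and alarm names are hypothetical.

```python
from collections import Counter
from typing import Any

Instance = dict[str, Any]


def alarming_instance_ids(metric_alarms: list[dict[str, Any]]) -> set[str]:
    """Collect InstanceId dimension values from alarm records."""
    ids = set()
    for alarm in metric_alarms:
        for dim in alarm.get("Dimensions", []):
            if dim["Name"] == "InstanceId":
                ids.add(dim["Value"])
    return ids


def alarm_summary(instances: list[Instance], in_alarm: set[str],
                  service: str) -> dict[str, tuple[int, int]]:
    """For one svcName, return {instance type: (hosts in alarm, total hosts)}."""
    total: Counter = Counter()
    alarming: Counter = Counter()
    for inst in instances:
        tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
        if tags.get("svcName") != service:
            continue
        itype = inst["InstanceType"]
        total[itype] += 1
        if inst["InstanceId"] in in_alarm:
            alarming[itype] += 1
    return {itype: (alarming[itype], total[itype]) for itype in sorted(total)}


def host(instance_id: str, itype: str, svc: str) -> Instance:
    """Build a minimal hypothetical instance record."""
    return {"InstanceId": instance_id, "InstanceType": itype,
            "Tags": [{"Key": "svcName", "Value": svc}]}


instances = [
    host("i-01", "c6gn.16xlarge", "SvcManageGameState"),
    host("i-02", "c6gn.16xlarge", "SvcManageGameState"),
    host("i-03", "c6g.16xlarge", "SvcManageGameState"),
    host("i-04", "r6g.16xlarge", "SvcManageGameState"),
    host("i-05", "c6g.16xlarge", "SvcUserAuth"),
]
alarms = [
    {"AlarmName": "i-01-cpu-high", "Dimensions": [{"Name": "InstanceId", "Value": "i-01"}]},
    {"AlarmName": "i-02-cpu-high", "Dimensions": [{"Name": "InstanceId", "Value": "i-02"}]},
]

summary = alarm_summary(instances, alarming_instance_ids(alarms), "SvcManageGameState")
print(summary)
# {'c6g.16xlarge': (0, 1), 'c6gn.16xlarge': (2, 2), 'r6g.16xlarge': (0, 1)}
```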

Figure 5: Resource Health filter set to svcName: SvcManageGameState and grouped by instance type, showing c6gn.16xlarge, c6g.16xlarge, and r6g.16xlarge groups; the c6gn.16xlarge instances are raising alarms

To investigate an instance that is in alarm, I choose the EC2 host and then choose View dashboard. The host overview page shows CPU utilization above 90%, which is delaying the delivery of game state to SvcDisplayLeaderBoard. After I scale out the hosts that serve the leaderboard display, CPU utilization drops to 75%, below the 85% threshold that triggers the alarm.

Figure 6: Resource Health host dashboard showing a CPU utilization spike alongside other metrics such as DiskReadBytes, DiskWriteBytes, NetworkIn, and NetworkOut

Resource Health helped me quickly identify the hosts in an alarm state and reduce the mean time to resolution (MTTR) for incidents affecting my application, because the dashboard surfaced the information I needed in one place.

Conclusion

In this blog post, I shared an example of how I used Resource Health to monitor and troubleshoot the performance of EC2 hosts across an application.

Resource Health is generally available in all AWS Regions for monitoring the performance of EC2 instances. Memory utilization insights require the Amazon CloudWatch agent, because EC2 does not publish memory metrics to CloudWatch by default. For more information, see Using Resource Health in the Amazon CloudWatch User Guide. To learn more about the observability features of Amazon CloudWatch, see the One Observability Demo workshop.
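For reference, a minimal CloudWatch agent configuration that collects memory utilization on a Linux host could look like the JSON generated below. This is a sketch that assumes the agent's standard configuration schema (`metrics_collected.mem` with the `mem_used_percent` measurement, and `append_dimensions` to tag datapoints with the instance ID). Whether Resource Health needs any additional dimensions is not stated in this post, so check the Using Resource Health guide before relying on it.

```python
import json


def memory_agent_config(interval_seconds: int = 60) -> dict:
    """Build a minimal CloudWatch agent config that publishes mem_used_percent (Linux)."""
    return {
        "agent": {"metrics_collection_interval": interval_seconds},
        "metrics": {
            # The agent substitutes ${aws:InstanceId} with the host's instance ID.
            "append_dimensions": {"InstanceId": "${aws:InstanceId}"},
            "metrics_collected": {
                "mem": {"measurement": ["mem_used_percent"]},
            },
        },
    }


config_text = json.dumps(memory_agent_config(), indent=2)
print(config_text)
```

On a Linux host, the agent can load a file like this with `amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:<path-to-config> -s`.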

About the authors

Purva Upasak

Purva Upasak is a Customer Solutions Manager lead for Strategic Accounts at AWS. She works with her team to ensure customers achieve the desired outcomes in their AWS journey. As their advocate, she partners with the customers to understand their concerns, address their challenges, and ensure their success.

Niketan Kapila

Niketan Kapila is a Senior Product Manager for Amazon CloudWatch based in Seattle. He loves building products that improve people’s lives, and strives for success by ensuring customer satisfaction and meeting customer requirements. He joined AWS in 2015 and is focused on making it easy for AWS users to monitor distributed applications built using microservices architecture. Outside of work, Niketan enjoys football, traveling, and technology.