Introducing Amazon CloudWatch Container Insights for Amazon ECS

Amazon Elastic Container Service (Amazon ECS) lets you monitor resources using Amazon CloudWatch, a service that provides metrics for CPU and memory reservation and cluster and services utilization. In the past, you had to enable custom monitoring of services and tasks. Now, you can monitor, troubleshoot, and set alarms for all your Amazon ECS resources using CloudWatch Container Insights. This fully managed service collects, aggregates, and summarizes Amazon ECS metrics and logs.

The CloudWatch Container Insights dashboard gives you access to the following information:

CPU and memory utilization
Task and service counts
Read/write storage
Network Rx/Tx
Container instance counts for clusters, services, and tasks
and more

Direct access to these metrics offers you much fuller insight into and control over your Amazon ECS resources.

With CloudWatch Container Insights, you can:

Gain access to CloudWatch Container Insights dashboard metrics
Integrate with CloudWatch Logs Insights to dynamically query and analyze container application and performance logs
Create CloudWatch alarm notifications to track performance and potential issues
Enable Container Insights with one click, no need for additional configuration or the use of sidecars to monitor your tasks.

CloudWatch Container Insights can also support Amazon Elastic Kubernetes Service (Amazon EKS).

Overview

In this post, I guide you through Container Insights setup and introduce you to the Container Insights dashboard. I demonstrate the Amazon ECS metrics available through this console. I show you how to query Container Insights performance logs to obtain specific metric data. Finally, I walk you through the default, automatically generated dashboard, explaining how to right-size tasks and scale services.

Enable Container Insights on new clusters by default

First, configure your Amazon ECS service to enable Container Insights by default for clusters created with your current IAM user or role.

Open the Amazon ECS console.
In the navigation pane, choose Account Settings.
To enable the Container Insights default opt-in, check the box at the bottom of the page. If this setting is not enabled, Container Insights can be enabled later when creating a cluster.

Next, follow the Amazon ECS Workshop for AWS Fargate instructions to create an Amazon ECS cluster with Container Insights enabled. You are creating a cluster with a web frontend and multiple backend services.

When completed, access the frontend application using the URL of the inbound Application Load Balancer, as in the following diagram.

You can also access it by using the output of the following command:

alb_url=$(aws cloudformation describe-stacks --stack-name fargate-demo-alb --query 'Stacks[0].Outputs[?OutputKey==`ExternalUrl`].OutputValue' --output text)
echo "Open $alb_url in your browser"

Explore CloudWatch Container Insights

After the cluster is running, run the container tasks to confirm that Container Insights is enabled and CloudWatch is collecting Amazon ECS metrics.

To view the newly collected metrics, navigate to the CloudWatch console and choose Container Insights. To view your automatic dashboard, select an ECS Resource dimension. The available Amazon ECS options are ECS clusters, ECS services, and ECS tasks. Choose ECS Clusters.

The dashboard should then display the available metrics for your cluster. These metrics should include CPU, memory, tasks, services, and network utilization.

In the following dashboard example, the tasks are running on Fargate and are required to use AWSVPC networking mode. Container Insights doesn’t currently support AWSVPC networking mode for network metrics in this first release. You can see in the following graph that these metrics are omitted. This cluster was set up to support only Fargate tasks, and the container instance count is equal to zero.

At the bottom of the page, select an ECS cluster. Choose Actions, View performance logs. This selection leads to CloudWatch Logs Insights, where you can quickly and effectively query CloudWatch Logs data metrics. CloudWatch Logs Insights includes a purpose-built query language with simple but powerful commands. For more information, see Analyzing Log Data with CloudWatch Logs Insights.

Container Insights provides a default query that can be executed by choosing Run query. This query reports all of the metrics that CloudWatch collects from the cluster, minute by minute. You can expand each data item to investigate individual metrics.

With CloudWatch Container Insights, you can monitor an ECS cluster from a unified view, quickly identifying and responding to any operational issues.

Explore use cases

In this section, I explore how to use CloudWatch and Container Insights to manage your cluster. Start by generating traffic to your cluster. Create one web request per second to your frontend URL:

alb_url=$(aws cloudformation describe-stacks --stack-name fargate-demo-alb --query 'Stacks[0].Outputs[?OutputKey==`ExternalUrl`].OutputValue' --output text)
while true; do curl -so /dev/null $alb_url ; sleep 1 ; done &;

In the CloudWatch Insights dashboard, choose ECS Services, and select your cluster. As the following dashboard screenshot shows, CPU and memory utilization are minimal, and the frontend service can handle this load.

Next, simulate high traffic to the cluster using ApacheBench. The following ApacheBench command generates nine concurrent requests (-c 9), for 60 seconds (-t 60), and ignores length variances (-l). Loop this repeatedly every second.

while true; do ab -l -c 9 -t 60 $alb_url ; sleep 1; done

Task CPU increases significantly while Memory Utilization remains low.

Adjust the dashboard time range to see the average CPU and memory utilization for each task in the past 30 minutes, as shown in the following screenshot.

To see individual resource utilization average over 30 minutes, scroll to the bottom of the dashboard.

Select any task and choose View performance logs. This selection opens CloudWatch Logs Insights for the ECS cluster. In the query box, enter the following query and choose Run query:

stats avg(CpuUtilized), avg(MemoryUtilized) by bin (30m) as period, TaskDefinitionFamily, TaskDefinitionRevision 
| filter Type = "Task" | sort period desc, TaskDefinitionFamily

From the query result, the frontend service has an average CPU utilization of 229.2273 unit and memory utilization of 71 megabytes. The current frontend task configuration uses 512 MiB of memory and 256 unit of CPU. Your frontend service has used almost all CPU resources that it has reserved.

To improve the frontend service performance based on the CPU metrics, one option is to re-size the task. First, create a new revision of this task definition with a new CPU and Memory value. Then, update the frontend service to use the new revision.

Step 1: Create a new task definition revision.

In the ECS console, choose Task Definitions.
For ecsdemo-frontend, select the check box and choose Create new revision.
On the Create new revision of Task Definition screen, under Container Definitions, choose ecsdemo-frontend.
For Memory Limits (MiB), choose Soft limit and enter the value of 1024.
Under Environment, for CPU units, enter 512.
Choose Update.
On the Create new revision of Task Definition screen, under Task size, choose 0.5 vCPU, which is 512 CPU unit.
For Memory, choose 1GB.
Choose Create.

Step 2: Update the frontend service with the new task definition.

Choose Cluster and select the cluster.
On the Services tab, for ecsdemo-frontend, select the check box and choose Update.
For Task Definition, select the revision that you previously created in the first step.
Choose Skip to review then Update Service.

ECS spins up the new tasks with this new revision of the task definition and removes the old ones.

As shown in the dashboard, the average percentage CPU Utilization remains high. The current load still stresses CPU. On the positive note, the load balancer for the frontend service can handle significantly more requests, from 25k–57k requests over 5 minutes.

The benchmarking result from ApacheBench shows the same evidence. From the client perspective, the frontend service is able to process over twice the requests. By increasing the CPU available for your task, you increase the frontend service’s ability to handle the load. Remember that the frontend service consists of three tasks and CPU usage remains high.

To continue to improve the frontend service performance based on the CPU metrics, increase task size or scale out the service. With the current load, the average RequestCountPerTarget value is around 8k/5-minute interval. Update the frontend service to automatically scale tasks and keep RequestCountPerTarget closed to 1000 requests per target.

Run the following command to update the frontend service, setting up Service Auto Scaling with a maximum of 25 tasks and a minimum of 3 tasks. For Scaling policy, use Target Tracking Scaling Policy with the target value of 1000 for RequestCountPerTarget.

cd ~/environment/fargate-demo
export clustername=$(aws cloudformation describe-stacks --stack-name fargate-demo --query 'Stacks[0].Outputs[?OutputKey==`ClusterName`].OutputValue' --output text)
export alb_arn=$(aws cloudformation describe-stack-resources --stack-name fargate-demo-alb | jq -r '.[][] | select(.ResourceType=="AWS::ElasticLoadBalancingV2::LoadBalancer").PhysicalResourceId')
export target_group_arn=$(aws cloudformation describe-stack-resources --stack-name fargate-demo-alb | jq -r '.[][] | select(.ResourceType=="AWS::ElasticLoadBalancingV2::TargetGroup").PhysicalResourceId')
export target_group_label=$(echo $target_group_arn |grep -o 'targetgroup.*')
export alb_label=$(echo $alb_arn |grep -o 'farga-Publi.*')
export resource_label="app/$alb_label/$target_group_label"

aws application-autoscaling register-scalable-target \
    --service-namespace ecs \
    --scalable-dimension ecs:service:DesiredCount \
    --resource-id service/${clustername}/ecsdemo-frontend \
    --min-capacity 3 \
    --max-capacity 25
 
envsubst <config.json.template >/tmp/config.json
 
aws application-autoscaling put-scaling-policy --cli-input-json file:///tmp/config.json

Now the frontend service starts to scale. CloudWatch Container Insights displays the number of tasks in each step of scaling.

From the load balancer of your frontend service request metric, you can see that your cluster can handle even more requests, which is approximately 153k requests or 6.1k request per target over 5 minutes. If you had not set the maximum number of tasks of 25 for the Auto Scaling policy, the frontend service would have scaled out even more as the threshold is 1000 requests per target.

You see the same evidence from ApacheBench: the frontend service is able to process more requests and the time per request is much smaller.

From the query result in CloudWatch Logs Insight, the average CPU utilization is at 158 out of the 512 CPU unit configured earlier, which is relatively low.

You can see that you can use these metrics from Container Insights to help fine-tune your cluster. Similarly to how you spotted a task that was suffering, you could spot an oversized task using similar techniques and reduce the configuration, saving money in turn.

To see the frontend service scale in, stop the load by pressing CRTL-C twice to cancel the ab loop. For curl loop, type fg to bring the process to the foreground, then press CRTL-C. Within a couple of minutes, the frontend services start and continue to scale in.

Diving deeper into ECS tasks with CloudWatch Logs Insights

When you enable CloudWatch Container Insights, your Amazon ECS cluster’s performance metrics are automatically collected in the form of logs using performance log events. You can use CloudWatch Logs Insights to query performance logs to dive deeper into metrics of each Task and the individual Containers inside the Task . One common use case is to see the resource utilization of your application container as well as the sidecar container. As described in this document, there are 4 types of Amazon ECS performance log events that describe performance metrics, these are Container, Task, Service and Cluster. Performance log events use CloudWatch Embedded Metric Format which helps ingestion of complex high-cardinality application data in the form of logs.

Navigate to CloudWatch Log Insights and select the log group and time duration. This query shows all types of performance events log.

fields @timestamp, @message
| limit 20

To see individual tasks for the frontend service, use the following query:

field @timestamp, TaskId, TaskDefinitionFamily, CpuUtilized, CpuReserved, MemoryUtilized, MemoryReserved, StorageReadBytes, StorageWriteBytes
| filter (Type = "Task" and TaskDefinitionFamily = "ecsdemo-frontend")
| sort @timestamp
| limit 20

The following query shows average CPU and Memory utilization by TaskId, ContainerName and TaskDefinitionFamily without filtering the log type. The query results include the data from all log types which show the resource usage of the task and the individual containers inside the task. The container name ~internal~ecs~pause is the Amazon ECS internal system container.

stats avg(CpuUtilized) as CPU, avg(MemoryUtilized) as Mem by bin(30m) as period, TaskId, ContainerName, TaskDefinitionFamily
| sort period, TaskId, Mem, CPU 
| limit 20

You can also see the CPU utilization trend for a specific TaskId using the following query and choosing the Visualization tab to see the query results as a graph.

stats avg(CpuUtilized) by bin (5m) as period
| filter (Type = "Task" and TaskId = "97fe0bd0-9267-4158-92cd-e6c68e619e61")
| sort period

Conclusion

In the past, you had to implement custom metrics. Now, CloudWatch Container Insights for Amazon ECS helps you focus on monitoring and managing your application, so that you can respond quickly to operational issues. The service provides sharper insight into your Amazon ECS clusters, services, and tasks through added CPU and memory metrics. You can use CloudWatch Log Insights for right-sizing, alarms, scaling, and query performance log metrics that drive more informed analysis.

In this post, I introduced these new CloudWatch container metrics. I walked you through the default, automatically generated dashboard, showing you how to use the CloudWatch Container Insights console to right-size tasks and scale services. I dived deep into a performance log event provided by CloudWatch Logs Insights. I showed you how to use query language to find a specific metric’s value and choose the best value for right-sizing purposes.

CloudWatch Container Insights is generally available for Amazon ECS, AWS Fargate, Amazon EKS, and Kubernetes. For more information, see the documentation on Using Container Insights.

About the Author

Sirirat Kongdee is a Sr. Solutions Architect at Amazon Web Services. She loves working with customers and helping them remove roadblocks from their cloud journey. She enjoys traveling (whether for work or not) as much as she enjoys hanging out with her pug in front of the TV.

AWS Cloud Operations & Migrations Blog