Containers

Effective use: Amazon ECS lifecycle events with Amazon CloudWatch Logs Insights

Introduction

We have observed a growing adoption of container services among both startups and established companies. This trend is driven by the ease of deploying applications and migrating from on-premises environments to the cloud. One platform of choice for many of our customers is Amazon Elastic Container Service (Amazon ECS). The powerful simplicity of Amazon ECS allows customers to scale from managing a single task to overseeing an entire enterprise application portfolio running thousands of tasks. Amazon ECS eliminates the management overhead associated with running your own container orchestration service.

When working with customers, we have observed that there is a valuable opportunity to make better use of Amazon ECS events. Lifecycle events offer troubleshooting insights by linking service events with metrics and logs. However, Amazon ECS displays only the latest 100 events, making it tricky to review them retrospectively. Enabling Amazon CloudWatch Container Insights resolves this by storing Amazon ECS lifecycle events in an Amazon CloudWatch log group. This integration lets you analyze events retroactively, enhancing operational efficiency.

Amazon EventBridge is a serverless event bus that connects applications seamlessly. Along with Container Insights, Amazon ECS can serve as an event source while Amazon CloudWatch Logs acts as the target in Amazon EventBridge. This enables post-incident analysis using Amazon CloudWatch Logs Insights.

This post explains how to effectively analyze Amazon ECS service events, delivered through Container Insights, Amazon EventBridge, or both, using Amazon CloudWatch Logs Insights queries. These queries can significantly enhance your development and operational workflows.

Prerequisites

To work through the techniques presented in this guide, you must have the following prerequisites in place in your account.

  1. An Amazon ECS cluster with an active workload.
  2. Amazon EventBridge configured to stream Amazon ECS events to Amazon CloudWatch Logs directly, or Amazon ECS CloudWatch Container Insights enabled.

Here is a detailed guide to set up Amazon EventBridge to stream events to Amazon CloudWatch Logs, or to enable Container Insights. If you prefer to script the EventBridge setup, a minimal sketch follows.
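
For those who prefer to script this part, the following is a minimal sketch of the EventBridge-to-CloudWatch-Logs setup using boto3. It is one way to do it, not the only one; the rule name ecs-lifecycle-events-to-logs, the log group name /aws/events/ecs, and the Region and account in the target ARN are illustrative assumptions that you should replace with your own values.

import boto3

logs = boto3.client("logs")
events = boto3.client("events")

# Log group that will receive the Amazon ECS events (name is an assumption).
log_group = "/aws/events/ecs"
logs.create_log_group(logGroupName=log_group)

# Rule matching all events emitted by Amazon ECS. The pattern can be narrowed with
# "detail-type" values such as "ECS Task State Change" or "ECS Service Action".
events.put_rule(
    Name="ecs-lifecycle-events-to-logs",
    EventPattern='{"source": ["aws.ecs"]}',
    State="ENABLED",
)

# Point the rule at the log group. The log group may also need a CloudWatch Logs
# resource policy that allows EventBridge to create log streams and put log events.
events.put_targets(
    Rule="ecs-lifecycle-events-to-logs",
    Targets=[
        {
            "Id": "ecs-events-log-group",
            "Arn": "arn:aws:logs:us-east-1:111122223333:log-group:/aws/events/ecs:*",
        }
    ],
)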

Walkthrough

Useful lifecycle events patterns

The events that Amazon Elastic Container Service (Amazon ECS) emits can be categorized into four groups:

  1. Container instance state change events – These events are triggered when there is a change in the state of an Amazon ECS container instance. This can happen due to various reasons, such as starting or stopping a task, upgrading the Amazon ECS agent, or other scenarios.
  2. Task state change events – These events are emitted whenever there is a change in the state of a task, such as when it transitions from pending to running or from running to stopped. Additionally, events are triggered when a container within a task stops or when a termination notice is received for AWS Fargate Spot capacity.
  3. Service action events – These events provide information about the state of the service and are categorized as info, warning, or error. They are generated when the service reaches a steady state, when the service consistently cannot place a task, when the Amazon ECS APIs are throttled, or when there are insufficient resources to place a task.
  4. Service deployment state change events – These events are emitted when a deployment is in progress, completed, or fails. They are typically triggered by the circuit breaker logic and rollback settings.

For a more detailed explanation and examples of these events and their potential use cases, please refer to the Amazon ECS events documentation.
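
To make the field paths used by the queries in this post easier to follow, here is a trimmed, hypothetical shape of a task state change event written as a Python dictionary. Only fields referenced later (for example detail.lastStatus, detail.stoppedReason, and detail.containers.0.taskArn) are shown, and every value is a placeholder rather than real data.

# Illustrative, trimmed shape of an "ECS Task State Change" event (all values are placeholders).
sample_task_state_change = {
    "detail-type": "ECS Task State Change",
    "resources": ["arn:aws:ecs:us-east-1:111122223333:task/CB-Demo/<task-id>"],
    "detail": {
        "clusterArn": "arn:aws:ecs:us-east-1:111122223333:cluster/CB-Demo",
        "group": "service:circuit-breaker-demo",
        "lastStatus": "STOPPED",
        "desiredStatus": "STOPPED",
        "stopCode": "ServiceSchedulerInitiated",
        "stoppedReason": "Task failed container health checks",
        "containers": [
            {"taskArn": "arn:aws:ecs:us-east-1:111122223333:task/CB-Demo/<task-id>"}
        ],
    },
}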

Let’s dive into some real-world examples of how to use events for operational support. We’ve organized these examples into four categories based on event patterns: Task Patterns, Service Action Patterns, Service Deployment Patterns, and ECS Container Instance Patterns. Each category includes common use cases and demonstrates specific queries and results.

Running Amazon CloudWatch Logs Insights query

Follow the steps below to run the Amazon CloudWatch Logs Insights queries covered in later sections of this post (a programmatic alternative using the AWS SDK is sketched after these steps):

  1. Open the Amazon CloudWatch console and choose Logs, and then choose Logs Insights.
  2. Choose log groups containing Amazon ECS events and performance logs to query.
  3. Enter the desired query and choose Run to view the results.
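
If you would rather run these queries programmatically, for example from a runbook or an operational script, the following is a minimal boto3 sketch. The log group name and the one-hour time window are assumptions, and the query string is a shortened variant of the task query shown later in this post.

import time
import boto3

logs = boto3.client("logs")

# Log group holding the Amazon ECS events (assumption; use the group targeted by your
# EventBridge rule or the Container Insights performance log group).
log_group = "/aws/events/ecs"

query = """
fields time as Timestamp, `detail-type` as Type, detail.lastStatus as `Last Status`
| sort @timestamp desc
| limit 10
"""

# Query the last hour of events.
end = int(time.time())
start = end - 3600

query_id = logs.start_query(
    logGroupName=log_group,
    startTime=start,
    endTime=end,
    queryString=query,
)["queryId"]

# Poll until the query finishes, then print each result row as a dictionary.
while True:
    response = logs.get_query_results(queryId=query_id)
    if response["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in response["results"]:
    print({field["field"]: field["value"] for field in row})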

Task event patterns

Scenario 1:

In this scenario, the operations team encounters a situation where they need to investigate the cause of HTTP status 5XX (server-side issue) errors that have been observed in their environment. To do so, they reach out to confirm whether an Amazon ECS task correctly followed its intended task lifecycle. The team suspects that a task’s lifecycle events might be contributing to the 5XX errors, and they need to narrow down the exact source of these issues to implement effective troubleshooting and resolution.

Required query

Query Inputs:

detail.containers.0.taskArn: Intended Task ARN

fields time as Timestamp, `detail-type` as Type, detail.lastStatus as `Last Status`, detail.desiredStatus as `Desired Status`, detail.stopCode as StopCode, detail.stoppedReason as Reason  
| filter detail.containers.0.taskArn = "arn:aws:ecs:us-east-1:111122223333:task/CB-Demo/6e81bd7083ad4d559f8b0b147f14753f"
| sort @timestamp desc
| limit 10

Result:

Let’s see how service events can aid in confirming the task lifecycle. From the results, we can see that the Last Status of the task progressed as follows:

PROVISIONING > PENDING > ACTIVATING > RUNNING > DEACTIVATING > STOPPING > DEPROVISIONING > STOPPED

This conforms to the documented task lifecycle flow: the task was first DEACTIVATED and then STOPPED. We can also see that the stoppage of this task was initiated by the scheduler (stop code ServiceSchedulerInitiated) because the task failed container health checks.

A table showing the status of ECS tasks. This table has the following columns: Timestamp: the time that the task state changed. Type: the type of task state change event. Last Status: the previous status of the task. Desired Status: the desired status of the task. StopCode: the stop code that was assigned to the task, if applicable. Reason: the reason why the task state changed. The final rows show an ECS task that failed container health checks.

Similarly, the query can also fetch the lifecycle details of a task failing load balancer health checks. The result will be as shown in the following:

In the query below, replace the detail.containers.0.taskArn value with the intended Task ARN:

fields time as Timestamp, `detail-type` as Type, detail.lastStatus as `Last Status`, detail.desiredStatus as `Desired Status`, detail.stopCode as StopCode, detail.stoppedReason as Reason  
| filter detail.containers.0.taskArn = "arn:aws:ecs:us-east-1:111122223333:task/CB-Demo/649e1d63f0db482bafa0087f6a3aa5ed"
| sort @timestamp desc
| limit 10

A table showing the status of ECS tasks. This table has the following columns: Timestamp: the time that the task state changed. Type: the type of task state change event. Last Status: the previous status of the task. Desired Status: the desired status of the task. StopCode: the stop code that was assigned to the task, if applicable. Reason: the reason why the task state changed. The final rows show an ECS task failing load balancer health checks.

Let’s see an example of another task that was stopped manually by calling StopTask; here the stop code is UserInitiated and the reason is Task stopped by user:

A table showing the status of ECS tasks. This table has the following columns: Timestamp, Type, Last Status, Desired Status, StopCode, and Reason. The final rows show an ECS task that was stopped by the user, with stop code UserInitiated.

In addition, in both cases we can see how the Desired Status (irrespective of who initiated the stop) drives the Last Status of the task.
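
For completeness, a user-initiated stop such as the one above originates from a single StopTask call. The following is a minimal boto3 sketch; the cluster name and task identifier are placeholders.

import boto3

ecs = boto3.client("ecs")

# Stopping a task manually results in a task state change event with stop code
# UserInitiated; the optional reason is recorded as the task's stoppedReason.
ecs.stop_task(
    cluster="CB-Demo",                # placeholder cluster name
    task="<task-id-or-arn>",          # placeholder task ID or full task ARN
    reason="Stopped manually while troubleshooting",
)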

Task Lifecycle for reference:

The flow chart below shows the task lifecycle. When a task is started, it moves from PROVISIONING to PENDING to ACTIVATING to RUNNING; at this point, the task is successfully running. When a task is stopped, it moves from DEACTIVATING to STOPPING to DEPROVISIONING to STOPPED; at this point, the task has been successfully stopped.

Scenario 2:

Let’s consider the scenario where you may encounter frequent task failures within a service, necessitating a means to diagnose the root causes behind these issues. Tasks might be terminating due to various reasons, such as resource limitations or application errors. To address this, you can query for the stop reasons for all tasks in the service to uncover underlying issues.

Required Query

Query Inputs:

detail.group  : Your intended service name

filter `detail-type` = "ECS Task State Change" and detail.desiredStatus = "STOPPED" and detail.group = "service:circuit-breaker-demo"
|fields  detail.stoppingAt as stoppingAt, detail.stoppedReason as stoppedReason,detail.taskArn as Task
| sort @timestamp desc
| limit 200

TIP: If you have service auto scaling enabled and there are frequent scaling events for the service, you can add another filter to the above query to exclude scaling-related events and focus solely on other stop reasons.

filter `detail-type` = "ECS Task State Change" and detail.desiredStatus = "STOPPED" and detail.stoppedReason not like "Scaling activity initiated by" and detail.group = "service:circuit-breaker-demo"
|fields  detail.stoppingAt as stoppingAt, detail.stoppedReason as stoppedReason,detail.taskArn as Task
| sort @timestamp desc
| limit 200

Result:

In the results, we can see the task stop reasons for tasks within the service, along with their respective task IDs. By analyzing these stop reasons, you can identify the specific issues leading to task terminations. Depending on the stop reasons, potential solutions might involve application tuning, adjusting resource allocations, optimizing task definitions, or fine-tuning scaling strategies.

A table showing ECS task stop reasons for tasks within the service, along with their respective task IDs. It shows that the tasks are failing load balancer health checks.

Scenario 3:

Let’s consider a scenario where your security team needs critical information about the usage of specific network interfaces, MAC addresses, or attachment IDs. It’s important to note that Amazon ECS automatically provisions and deprovisions Elastic Network Interfaces (ENIs) when tasks start and stop. However, once a task is stopped, there are no readily available records or associations to trace back to a specific Task ID using Elastic Network Interface (ENI) or Media Access Control (MAC) assigned to ENI information. This poses a challenge in meeting the security team’s request for such data, as the automatic nature of ENI management in Amazon ECS may limit historical tracking capabilities for these identifiers.

Required Query

Query Inputs:

detail.attachments.1.details.1.value : Intended ENI ID

Additional: replace the task ARN and cluster ARN prefixes in the parse statements with your own values.

fields @timestamp, `detail.attachments.1.details.1.value` as ENIId,`detail.attachments.1.status` as ENIStatus, `detail.lastStatus` as TaskStatus
| filter `detail.attachments.1.details.1.value` = "eni-0e2b348058ae3d639" 
| parse @message "arn:aws:ecs:us-east-1:111122223333:task/CB-Demo/*\"" as TaskId
| parse @message "arn:aws:ecs:us-east-1:111122223333:cluster/*\"," as Cluster
| parse @message "service:*\"," as Service
| display @timestamp, ENIId, ENIStatus, TaskId, Service, Cluster, TaskStatus

To look up by MAC address, replace the value of detail.attachments.1.details.2.value with the intended MAC address:

fields @timestamp, `detail.attachments.1.details.1.value` as ENIId, `detail.attachments.1.details.2.value` as MAC ,`detail.attachments.1.status` as ENIStatus, `detail.lastStatus` as TaskStatus
| filter `detail.attachments.1.details.2.value` = '12:eb:5f:5a:83:93'
| parse @message "arn:aws:ecs:us-east-1:111122223333:task/CB-Demo/*\"" as TaskId
| parse @message "arn:aws:ecs:us-east-1:111122223333:cluster/*\"," as Cluster
| parse @message "service:*\"," as Service
| display @timestamp, ENIId, MAC, ENIStatus, TaskId, Service, Cluster, TaskStatus

Result:

Looking up by ENI ID, the results show the task, service, and cluster for which the ENI was provisioned, along with the state of the task, so that we can correlate.

A table having columns for timestamp, ENI ID of the task, and the service and cluster for which the ENI was provisioned, along with the state of the task to correlate.

Just like the ENI ID, we can query by MAC address and retrieve the same details:

A table having columns for timestamp, MAC address of the task, and the service and cluster for which the ENI was provisioned, along with the state of the task to correlate.

Service action event patterns

Scenario 4:

You may encounter a situation where you need to identify and prioritize resolution for services with the highest number of faults. To achieve this, you want to query and determine the top N services that are experiencing issues.

Required Query:

filter `detail-type` = "ECS Service Action" and @message like /(?i)(WARN)/
| stats count(detail.eventName) as countOfWarnEvents by resources.0 as serviceArn, detail.eventName as eventFault
| sort countOfWarnEvents desc
| limit 20

Result:

By filtering for WARN events and aggregating occurrences per service, you can pinpoint the services that require immediate attention and prioritize resolution efforts. In this example, the service ecsdemo-auth-no-sd is facing the SERVICE_TASK_START_IMPAIRED error. This ensures that you can focus your resources on mitigating the most impactful issues and enhancing the overall reliability of your microservices ecosystem:

A table having Service ARN, Fault, and Count of Events as columns; the event fault is SERVICE_TASK_START_IMPAIRED.

Service deployment event patterns

Scenario 5:

Since we know that every Amazon ECS service event comes with an event type of INFO, WARN, or ERROR, we can use this as a search pattern to analyze our workloads for troubled services.

Required Query:

fields @timestamp as Time, `resources.0` as Service, 
`detail-type` as `lifecycleEvent`, `detail.reason` as `failureReason`, @message
| filter `detail.eventType` = "ERROR"
| sort @timestamp desc
| display Time, Service, lifecycleEvent, failureReason
| limit 100

Result:

In the results below, the ecsdemo-backend service is failing to successfully deploy tasks, which activates the Amazon ECS deployment circuit breaker mechanism that stops the deployment of the service. Using the expand arrow to the left of the table, we can get more details about the event:

This table has Time Stamp, Service name, lifecycle event and failure reasons as columns. It also shows that the ECS deployment circuit breaker was triggered.

A screenshot of the event with details about the ECS deployment circuit breaker triggered event, such as timestamp, event name, reason, and ID.

Scenario 6:

In this scenario, you have received a notification from the operations team indicating that, following a recent deployment to an Amazon ECS service, the previous version of the application is still visible. They are experiencing a situation where the new deployment did not replace the old one as expected, leading to confusion and potential issues. The operations team seeks to understand the series of events that occurred during the deployment process to determine what went wrong, identify the source of the issue, and implement the necessary corrective measures to ensure a successful deployment.

Required Query

Query Inputs:

resources.0 : intended service ARN

fields time as Timestamp, detail.deploymentId as DeploymentId , detail.eventType as Severity, detail.eventName as Name, detail.reason as Detail, `detail-type` as EventType
| filter `resources.0` ="arn:aws:ecs:us-east-1:12345678910:service/CB-Demo/circuit-breaker-demo"
| sort @timestamp desc
| limit 10

Result:

Let’s analyze the service events to understand what went wrong during the deployment. By examining the sequence of events, a clear timeline emerges:

  1. We can see that the service was initially in a steady state (line 7) and there was a good deployment (ecs-svc/6629184995452776901 in line 6).
  2. A new deployment (ecs-svc/4503003343648563919) occurred, possibly with a code bug (line 5).
  3. Tasks from this deployment were failing to start (line 3).
  4. This problematic deployment triggered the circuit breaker logic, which initiated a rollback to the previously known good deployment (ecs-svc/6629184995452776901 in line 4).
  5. The service eventually returned to a steady state (lines 1 and 2).

This sequence of events not only provides a chronological view of what happened but also offers specific insights into the deployments involved and the potential reasons for the issue. By analyzing these service events, the operations team can pinpoint the problematic deployment (i.e., ecs-svc/4503003343648563919) and investigate further to identify and address the underlying code issues, ensuring a more reliable deployment process in the future.

A table having Timestamp, Deployment ID, Severity, Name, Detail, and Event Type as columns. The event messages show the failure of the ECS deployment circuit breaker.

ECS container instance event patterns

Scenario 7:

You want to track the history of Amazon ECS agent updates for container instances in the cluster. A trackable history ensures compliance with security standards by verifying that the agent has the necessary patches and updates installed, and it also allows for verification of rollbacks in the event of problematic updates. This information is valuable for operational efficiency and service reliability.

Required Query:

fields @timestamp, detail.agentUpdateStatus as agentUpdateStatus, detail.containerInstanceArn as containerInstanceArn,detail.versionInfo.agentVersion as agentVersion
| filter `detail-type` = "ECS Container Instance State Change"
| sort @timestamp desc
| limit 200

Result:

As we can see from the results, the container instance initially operated with ECS agent version 1.75.0. When the agent update was triggered, the update process started at sequence 9 and, after a series of update actions, successfully concluded at sequence 1. This offers a clear snapshot of the version transition and update procedure, underlining the importance of tracking Amazon ECS agent updates to ensure the security, reliability, and functionality of the ECS cluster.

A table having timestamps, container instance ARN, agent version, and agent update status as columns, representing the change in the agent's status and version.

Cleaning up

Once you’ve finished exploring the sample queries, make sure that you disable any Amazon EventBridge rules and Amazon ECS CloudWatch Container Insights so that you do not incur further costs.
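
If you scripted the setup with the earlier sketch, a matching cleanup could look like the following minimal boto3 sketch; the rule name and cluster name are the same illustrative assumptions used earlier.

import boto3

events = boto3.client("events")
ecs = boto3.client("ecs")

# Disable (or delete) the EventBridge rule created for the Amazon ECS events.
events.disable_rule(Name="ecs-lifecycle-events-to-logs")

# Turn off CloudWatch Container Insights on the cluster ("CB-Demo" is a placeholder).
ecs.update_cluster_settings(
    cluster="CB-Demo",
    settings=[{"name": "containerInsights", "value": "disabled"}],
)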

Conclusion

In this post, we’ve explored ways to harness the full potential of Amazon ECS events, a valuable resource for troubleshooting. Amazon ECS provides useful information about tasks, services, deployments, and container instances. Analyzing ECS events in Amazon CloudWatch Logs enables you to identify patterns over time, correlate events with other logs, discover recurring issues, and conduct various forms of analysis.

We’ve outlined straightforward yet powerful methods for searching and utilizing Amazon ECS events. This includes tracking the lifecycle of tasks to swiftly diagnose unexpected stoppages, identifying tasks with specific network details to bolster security, pinpointing problematic services, understanding deployment issues, and ensuring the Amazon ECS agent is up-to-date for reliability. This broader perspective on your system’s operations equips you to proactively address problems, gain insights into your container performance, facilitate smooth deployments, and fortify your system’s security.

Additional references

Now that we have covered the basics of these lifecycle events, let’s look at best practices for querying them in the Amazon CloudWatch Logs Insights console for troubleshooting purposes. To learn more about the Amazon CloudWatch query domain-specific language (DSL), visit the documentation (CloudWatch Logs Insights query syntax).

You can further set up anomaly detection by processing Amazon ECS events with Amazon EventBridge, which is explained in detail in Amazon Elastic Container Service Anomaly Detector using Amazon EventBridge.

Rohan Mangal

Rohan Mangal, a seasoned IT professional, serves as a senior subject matter expert in container services at AWS, Bangalore. With over 12 years of industry experience, including 7 years with AWS, he specializes in guiding customers through the intricacies of containerized environments, high-performance computing (HPC), complex distributed systems, and cloud computing. Apart from his tech expertise, Rohan finds solace in gardening, often engaging in discussions and sharing tips when he's not immersed in the world of cutting-edge technology.

Ugur Kira

Ugur Kira is a Sr. Specialist Technical Account Manager (STAM) - Containers based out of Dublin, Ireland. He joined AWS 7 years ago, has been a containers enthusiast for over 3 years, and is passionate about helping AWS customers design modern container-based applications on AWS services. Ugur actively works with the EKS, ECS, and App Mesh services and conducts proactive operational reviews around those services. He also has a special interest in improving observability capabilities in container-based applications.