Monitoring the Amazon ECS Agent

Introduction

Amazon Elastic Container Service (Amazon ECS) is a fully managed container orchestration service that allows organizations to deploy, manage, and scale containerized workloads. It’s deeply integrated with the AWS ecosystem to provide a secure and easy-to-use solution for managing applications not only in the cloud but now also on your infrastructure with Amazon ECS Anywhere.

Among the Amazon ECS components, the ECS Agent is a vital piece in charge of communication between the Amazon ECS Container Instances and the ECS control plane. Among other tasks, the ECS Agent registers your ECS Container Instance within an ECS Cluster; receives instructions from the ECS Scheduler for placing, starting, and stopping Tasks; and reports Task and container status changes.

Platform operators often look for guidance on how to monitor the availability of the ECS Agent, and therefore of the ECS Container Instance, and on how to make sure the relevant failure alerts are in place. In this post, we provide an example solution that monitors and detects ECS Agent down events, alerts through Amazon Simple Notification Service (Amazon SNS), and provides metrics using Amazon CloudWatch. This architecture is flexible, with the option to add custom actions to be run on an ECS Agent failure event.

As a part of its normal operation, the Amazon ECS Agent disconnects and reconnects several times per hour. Several agent connection events should therefore be expected, and on their own they aren’t an indication of a problem with the container agent or the Container Instance. However, if the agent isn’t able to reconnect, this can be a clear sign of an ongoing issue.

Solution overview

In this solution, we combine Amazon EventBridge, Amazon Simple Queue Service (Amazon SQS), AWS Lambda, Amazon SNS, and Amazon CloudWatch to deliver a simple solution capable of detecting and alerting about container instances becoming disconnected from the cluster.

The monitoring setup is provisioned via AWS CloudFormation. For the sake of understanding the different components and internals of the design, please refer to the architecture diagram below.

Diagram shows the architecture and flow of the data, starting from the Amazon ECS Container agent till the execution of the AWS Lambda function.

In this approach, we’ll use Amazon EventBridge to capture Container Instance state change events. We are only interested in disconnection events for Amazon ECS Instances that are in ACTIVE status, in which they are ready to accept new Tasks. Any other status (e.g., DRAINING or INACTIVE) is either transitory or doesn’t require alerting. To achieve this, events are filtered by agentConnected set to false and Instance status set to ACTIVE. This reduces the number of events that the Amazon EventBridge rule catches.
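The filter described above can be sketched as an EventBridge event pattern. The following Python snippet is only an illustration under stated assumptions: the pattern uses the field names of the "ECS Container Instance State Change" event, but the exact rule shipped in the CloudFormation template may differ, and `matches` is a hypothetical, simplified local matcher that handles only the exact-value matching used here.

```python
# Event pattern for ECS container instance state changes where the agent is
# disconnected but the instance is still ACTIVE. The exact pattern used by
# the solution's template may differ (assumption).
EVENT_PATTERN = {
    "source": ["aws.ecs"],
    "detail-type": ["ECS Container Instance State Change"],
    "detail": {
        "agentConnected": [False],
        "status": ["ACTIVE"],
    },
}


def matches(pattern, event):
    """Hypothetical, simplified matcher: every key in the pattern must exist
    in the event, and the event value must be one of the listed values."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            # Nested pattern: recurse into the nested event object.
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True
```

With this pattern, an event for a DRAINING instance or for an instance whose agent is still connected never reaches the queue, so the Lambda function only sees candidate failures.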

Next, the Amazon EventBridge rule is integrated with an Amazon SQS delay queue. This kind of SQS queue delays the delivery of messages for a fixed period of time, which is a convenient and cost-effective way to implement a consistent grace period. It gives the ECS Agent enough time to reconnect in case the event was a transient disconnection or a regular reconnection (eliminating false positives). After the delay time has elapsed, an AWS Lambda function is invoked to consume and process the events from the SQS queue. The AWS Lambda function has the following logic:

By default, all Amazon ECS Clusters in the account are monitored. If the custom parameter MonitorAllECSClusters was disabled when deploying the AWS CloudFormation stack, the function validates that the Cluster the Instance belongs to has the proper tags in order to be monitored.
Confirm that the Amazon ECS Instance is still running and ACTIVE. This avoids alerting in scenarios where the instance has already been terminated (while scaling in, for example) or is DRAINING.
Describe the Container Instance and confirm that the ECS Agent is still disconnected.

If the ECS Instance passes all the checks and filters, there is an issue with the Agent on that specific instance and a notification email is sent.
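The last two checks can be sketched as follows. This is a minimal illustration, not the code from the repository: the function name `agent_still_disconnected` is hypothetical, `ecs_client` is assumed to be a boto3 ECS client (for example `boto3.client("ecs")`), and the sketch leaves out the tag validation and any check of the underlying EC2 instance state.

```python
def agent_still_disconnected(ecs_client, cluster_arn, container_instance_arn):
    """Return True only if the container instance is still registered,
    ACTIVE, and its ECS Agent is still reported as disconnected.

    ecs_client is assumed to expose describe_container_instances with the
    same keyword arguments as the boto3 ECS client.
    """
    response = ecs_client.describe_container_instances(
        cluster=cluster_arn,
        containerInstances=[container_instance_arn],
    )
    instances = response.get("containerInstances", [])
    if not instances:
        # Instance is gone (for example, terminated during scale-in).
        return False

    instance = instances[0]
    if instance.get("status") != "ACTIVE":
        # DRAINING or other statuses are not alerted on.
        return False

    # The agent may have reconnected during the grace period.
    return not instance.get("agentConnected", True)
```

Because the function runs only after the SQS delay has elapsed, an agent that reconnected in the meantime returns `False` here and no notification is sent.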

The solution is flexible and provides simple settings for tweaking the behavior:

There is a parameter in the AWS CloudFormation Stack for enabling or disabling the monitoring, in case you need to pause it temporarily. This disables the Amazon EventBridge rule, so no messages are delivered to the SQS queue and the AWS Lambda function isn’t invoked at all while the solution is disabled.
As mentioned above, you can decide whether you would like to monitor all the Amazon ECS Clusters within the region or only specific Clusters. Monitoring all Clusters covers every existing cluster by default, as well as clusters that you create afterwards, which is useful for production accounts or comprehensive monitoring. Alternatively, you can instruct the monitoring to only consider Clusters with specific Tags. In this scenario, the AWS Lambda function validates that the proper tags are present. If you choose to enable the Tag monitoring, the Amazon ECS Clusters you want to monitor must have the following Tag:
- Tag Key: ecs-agent-monitoring
- Tag Value: <any-valid-custom-tag-value>

You can specify your custom tag via the MonitoringTagKeyValue parameter in the AWS CloudFormation template. This also allows you to have multiple stacks deployed to notify different teams using the referenced tags.
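The tag validation can be sketched in a few lines of Python. This is an illustrative assumption about how such a check could look, not the repository’s implementation: `cluster_is_monitored` is a hypothetical helper, and `cluster_tags` is assumed to be the list of `{"key": ..., "value": ...}` dictionaries that the ECS `DescribeClusters` API returns when tags are requested.

```python
# Fixed tag key expected by the solution; the value is user-defined.
MONITORING_TAG_KEY = "ecs-agent-monitoring"


def cluster_is_monitored(cluster_tags, expected_value, monitor_all=False):
    """Decide whether a cluster should be monitored (hypothetical helper).

    cluster_tags: list of {"key": ..., "value": ...} dicts (assumed shape).
    expected_value: the custom tag value configured for this stack.
    monitor_all: True when every cluster in the account is monitored.
    """
    if monitor_all:
        return True
    return any(
        tag.get("key") == MONITORING_TAG_KEY and tag.get("value") == expected_value
        for tag in cluster_tags
    )
```

Matching on the value as well as the key is what allows several stacks to coexist: each stack only reacts to Clusters tagged with its own value.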

You can also decide whether you would like to encrypt Amazon SQS and Amazon SNS content at rest. This is useful in case you have specific compliance requirements that need to be met. In this scenario, while deploying the AWS CloudFormation template you need to provide a valid KMS Administrator Role, Group, or User. This is required for managing the KMS Key that is automatically provisioned. It’s unlikely that you will ever need to administer this key, but KMS Keys shouldn’t be created without an entity that can manage them. Please note that Amazon CloudWatch Logs always encrypts log data at rest via server-side encryption. Enabling this encryption option also uses your custom KMS key for CloudWatch Logs.

This is an open and scalable solution, which can be modified and extended. The tool gives you the option to execute custom actions on the affected Amazon ECS Instances besides alerting. To keep the solution maintainable and scalable, you should deploy and create these custom actions via AWS CloudFormation. The AWS Lambda code itself has a placeholder where you can develop any feature you require:

def custom_actions(expired_instance):
    """Allow the end user to implement any custom/personalized action and/or
    operation on the affected EC2 Instances.

    Args:
        expired_instance (list): A list of the affected EC2 Instances (ECS Agent disconnected)
    """
    pass

As a parameter, you’ll receive the ID list of the node(s) with the issue (i.e., either the Amazon EC2 Instance(s) ID(s) or the node(s) ID(s) in case of ECS Anywhere setups). For example, you can write a custom action for automatically terminating the faulty Amazon EC2 Instance. Please note that depending on what custom actions you define, you may need to add extra AWS Identity and Access Management (AWS IAM) permissions for these actions in the ECSEventBridgeMonitorECSAgentExecutionRole role within the AWS CloudFormation template.
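As an illustration of the termination example above, a custom action could look like the following sketch. Several things here are assumptions rather than part of the solution: the optional `ec2_client` parameter is added only to keep the example self-contained (in the Lambda function it would be a boto3 EC2 client such as `boto3.client("ec2")`), EC2 instance IDs are assumed to start with `i-` while ECS Anywhere node IDs are assumed to use a different prefix, and the execution role would need the `ec2:TerminateInstances` permission.

```python
def custom_actions(expired_instance, ec2_client=None):
    """Terminate the affected EC2 instances (illustrative sketch only).

    Args:
        expired_instance (list): IDs of the affected nodes (ECS Agent disconnected)
        ec2_client: assumed boto3-style EC2 client; hypothetical extra
            parameter that is not part of the original placeholder.

    Returns:
        list: the EC2 instance IDs for which termination was requested.
    """
    # Only act on EC2 instances; skip ECS Anywhere nodes, which cannot be
    # terminated through the EC2 API (assumes EC2 IDs start with "i-").
    ec2_ids = [node_id for node_id in expired_instance if node_id.startswith("i-")]
    if not ec2_ids or ec2_client is None:
        return []

    ec2_client.terminate_instances(InstanceIds=ec2_ids)
    return ec2_ids
```

Terminating an instance is destructive, so an action like this is best suited to instances managed by an Auto Scaling group that replaces them automatically.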

As a final remark, to gain further insights and metrics, you can run Amazon CloudWatch Logs Insights queries on the AWS Lambda logs and check the Amazon CloudWatch graph generated by the solution. Furthermore, you can configure your container instances to send log information to Amazon CloudWatch Logs, so that the logs are available when you need to dig into an issue and see what happened on a specific container instance.

Walkthrough

The major steps that need to be completed are the following:

Clone the repository and upload AWS CloudFormation artifacts to Amazon Simple Storage Service (Amazon S3).
Deploy the AWS CloudFormation template.
Check and validate the solution components.
Cleanup

Prerequisites

For this walkthrough, you should have the following prerequisites:

An AWS account with necessary permissions to create the resources.
AWS Command Line Interface (AWS CLI) with appropriate credentials
Amazon ECS Cluster with an ECS Container instance registered, up, and running.
A valid email to receive the notification.
Amazon S3 bucket in the same region to upload the AWS Lambda code.

1. Clone the repository and upload AWS CloudFormation artifacts to Amazon S3

Clone the project GitHub repository to your local machine. Before deploying the template, you need to package it. Packaging uploads the local artifacts to an Amazon S3 bucket and consolidates the project templates and the AWS Lambda function code.

Choose an Amazon S3 bucket in which to store the AWS CloudFormation artifacts. This bucket must be in the same region where you deploy the solution. If you use Linux or macOS, you can export the bucket name as a variable for convenience:

$ export BUCKET=<your-selected-s3-bucket>

$ git clone https://github.com/aws-samples/amazon-ecs-agent-connection-monitoring.git
$ cd amazon-ecs-agent-connection-monitoring
$ aws --region=<your-aws-region> cloudformation package \
  --template-file ./ecs-agent-monitoring.yaml \
  --s3-bucket $BUCKET \
  --output-template-file ./packaged-template.yaml

2. Deploy the AWS CloudFormation template

You can now deploy the packaged AWS CloudFormation template generated in step 1 (‘packaged-template.yaml’) and create the stack.

The screenshot shows the possible AWS CloudFormation parameters and their descriptions.

We briefly discuss each of the options and parameters below:

Do you want to enable the monitoring solution? – A simple on/off switch that allows you to quickly stop the monitoring engine without needing to delete all the resources. Useful during maintenance windows.
Do you want to only monitor tagged Amazon ECS Clusters? – This allows you to decide whether you would like to automatically monitor all the Clusters within the region or control which of them you want to monitor by tagging them accordingly. For production accounts, you may want to automatically monitor all Amazon ECS Clusters (including Clusters created in the future).
Tag Value – This allows you to specify a custom Tag Value that your Amazon ECS Clusters must have in order to be monitored. The Tag Key is fixed and expected to be ecs-agent-monitoring, whereas you have the freedom to choose your own Tag Value. This allows you to deploy multiple instances of the solution with different target Tag Values for monitoring different Amazon ECS Clusters.
Destination email for receiving notifications. – Email that receives disconnection updates.
Do you want to enable encryption for the Amazon SQS Queue? – Amazon SQS encryption at rest.
Amazon Resource Name (ARN) for the AWS Key Management Service (AWS KMS) Key administrator. – KMS Keys need to have an administrator. If you need to encrypt the Amazon SQS queue, this parameter is mandatory.

3. Check and validate the solution components

You can explore the created resources.

To test and validate the solution, you can simulate a failure of the Amazon ECS Agent by stopping the corresponding service within an Amazon ECS Container Instance. It’s strongly advised to perform this in a testing or development environment, with instances that aren’t running any relevant workload. Intentionally stopping the Amazon ECS Agent on a production cluster may affect your current workloads.

For achieving this, you can follow these instructions:

Connect via a shell session to the Amazon ECS Container Instance where you want to test.
Stop the Amazon ECS service within the Container Instance:
$ systemctl stop ecs
Validate that the service was effectively stopped:
$ systemctl status ecs
Wait till you receive the email alert/notification.
You can now restore the Amazon ECS Agent functionality by restarting the service:
$ systemctl start ecs

4. Cleaning up

To avoid unnecessary cost, make sure you clean up the resources that we just created for this walkthrough. You can delete the AWS CloudFormation Stack in case you don’t want to use the monitoring solution for now.

Conclusion

In this post, we showed you an example solution to monitor and detect ECS Agent down events, with alerts via Amazon SNS and metrics in Amazon CloudWatch. The Amazon ECS Agent is a core component of the Amazon ECS infrastructure, and its failure impacts workload availability and scalability. Our solution is flexible, open source, and available in our GitHub repository. Feel free to open issues and send pull requests in GitHub. We hope that this post helps increase your awareness of the notifications and actions that you can take to avoid workload issues.
