Monitor Cluster State with Amazon ECS Event Stream

September 8, 2021: Amazon Elasticsearch Service has been renamed to Amazon OpenSearch Service. See details.

Thanks to my colleague Jay Allen for this great blog on how to use the ECS Event stream for operational tasks.

---

In the past, in order to obtain updates on the state of a running Amazon ECS cluster, customers have had to rely on periodically polling the state of container instances and tasks using the AWS CLI or an SDK. With the new Amazon ECS event stream feature, it is now possible to retrieve near real-time, event-driven updates on the state of your Amazon ECS tasks and container instances. Events are delivered through Amazon CloudWatch Events, and can be routed to any valid CloudWatch Events target, such as an AWS Lambda function or an Amazon SNS topic.

In this post, I show you how to create a simple serverless architecture that captures, processes, and stores event stream updates. You first create a Lambda function that scans all incoming events to determine if there is an error related to any running tasks (for example, if a scheduled task failed to start); if so, the function immediately sends an SNS notification. Your function then stores the entire message as a document inside of an Elasticsearch cluster using Amazon Elasticsearch Service, where you and your development team can use the Kibana interface to monitor the state of your cluster and search for diagnostic information in response to issues reported by users.

Understanding the structure of event stream events

An ECS event stream sends two types of event notifications:

Task state change notifications, which ECS fires when a task starts or stops
Container instance state change notifications, which ECS fires when the resource utilization or reservation for an instance changes

A single event may result in ECS sending multiple notifications of both types. For example, if a new task starts, ECS first sends a task state change notification to signal that the task is starting, followed by a notification when the task has started (or has failed to start). ECS also fires container instance state change notifications when the utilization of the instance on which it launches the task changes.

Event stream events are sent using CloudWatch Events, which structures events as JSON messages divided into two sections: the envelope and the payload. The detail section of each event contains the payload data, and the structure of the payload is specific to the event being fired. The following example shows the JSON representation of a task state change event. Notice that the properties at the top level of the JSON document describe event properties, such as the event name and the time the event occurred, while the detail section contains the information about the task and container instance that triggered the event.

The following JSON depicts an ECS task state change event signifying that the essential container for a task running on an ECS cluster has exited, and thus the task has been stopped on the ECS cluster:

{
  "version": "0",
  "id": "8f07966c-b005-4a0f-9ee9-63d2c41448b3",
  "detail-type": "ECS Task State Change",
  "source": "aws.ecs",
  "account": "123456789012",
  "time": "2016-10-17T20:29:14Z",
  "region": "us-east-1",
  "resources": [
    "arn:aws:ecs:us-east-1:123456789012:task/cdf83842-a918-482b-908b-857e667ce328"
  ],
  "detail": {
    "clusterArn": "arn:aws:ecs:us-east-1:123456789012:cluster/eventStreamTestCluster",
    "containerInstanceArn": "arn:aws:ecs:us-east-1:123456789012:container-instance/f813de39-e42c-4a27-be3c-f32ebb79a5dd",
    "containers": [
      {
        "containerArn": "arn:aws:ecs:us-east-1:123456789012:container/4b5f2b75-7d74-4625-8dc8-f14230a6ae7e",
        "exitCode": 1,
        "lastStatus": "STOPPED",
        "name": "web",
        "networkBindings": [
          {
            "bindIP": "0.0.0.0",
            "containerPort": 80,
            "hostPort": 80,
            "protocol": "tcp"
          }
        ],
        "taskArn": "arn:aws:ecs:us-east-1:123456789012:task/cdf83842-a918-482b-908b-857e667ce328"
      }
    ],
    "createdAt": "2016-10-17T20:28:53.671Z",
    "desiredStatus": "STOPPED",
    "lastStatus": "STOPPED",
    "overrides": {
      "containerOverrides": [
        {
          "name": "web"
        }
      ]
    },
    "startedAt": "2016-10-17T20:29:14.179Z",
    "stoppedAt": "2016-10-17T20:29:14.332Z",
    "stoppedReason": "Essential container in task exited",
    "updatedAt": "2016-10-17T20:29:14.332Z",
    "taskArn": "arn:aws:ecs:us-east-1:123456789012:task/cdf83842-a918-482b-908b-857e667ce328",
    "taskDefinitionArn": "arn:aws:ecs:us-east-1:123456789012:task-definition/wpunconfiguredfail:1",
    "version": 3
  }
}
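
To make the envelope and payload split concrete, here is a minimal Python sketch that separates the two parts of an event. The helper name split_event is an illustrative assumption and not part of any AWS SDK; it relies only on the field names visible in the sample event above.

```python
def split_event(event):
    """Return (envelope, payload) for a CloudWatch Events message."""
    # Everything except "detail" describes the event itself.
    envelope = {key: value for key, value in event.items() if key != "detail"}
    # "detail" carries the ECS-specific payload; default to an empty dict.
    payload = event.get("detail", {})
    return envelope, payload


sample = {
    "detail-type": "ECS Task State Change",
    "source": "aws.ecs",
    "detail": {"lastStatus": "STOPPED", "version": 3},
}
envelope, payload = split_event(sample)
print(envelope["detail-type"])
print(payload["lastStatus"])
```

Keeping the two parts separate makes it easier to route on envelope fields such as detail-type while treating the payload as event-specific data.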

Setting up an Elasticsearch cluster

Before you dive into the code for handling events, set up your Elasticsearch cluster. In the console, choose Elasticsearch Service, Create a New Domain. For Elasticsearch domain name, type ecs-events-cluster (the IAM policy later in this post refers to the domain by this name), then choose Next.

For Step 2: Configure cluster, accept all of the defaults by choosing Next.

For Step 3: Set up access policy, choose Next. This page lets you establish a resource-based policy for accessing your cluster. For this post, you instead grant access to the cluster’s actions through an identity-based policy attached to your Lambda function’s execution role.

Finally, on the Review page, choose Confirm and create. This starts spinning up your cluster.

While your cluster is being created, set up the SNS topic and Lambda function you need to start capturing and issuing notifications about events.

Create an SNS topic

Because your Lambda function emails you when a task fails unexpectedly due to an error condition, you need to set up an Amazon SNS topic to which your Lambda function can write.

In the console, choose SNS, Create Topic. For Topic name, type ECSTaskErrorNotification, and then choose Create topic.

When you’re done, copy the Topic ARN value, and save it to a text editor on your local desktop; you need it to configure permissions for your Lambda function in the next step. Finally, choose Create subscription to subscribe an email address to which you have access, so that you receive these event notifications. Remember to click the link in the confirmation email, or you won’t receive any events.

The eagle-eyed among you may realize that you haven’t given your future Lambda function permission to call your SNS topic. You grant this permission to the Lambda execution role when you create your Lambda function in the following steps.

Handling event stream events in a Lambda function

For the next step, create your Lambda function to capture events. Here’s the code for your function (written in Python 2.7):

import requests
import json
from requests_aws_sign import AWSV4Sign
from boto3 import session, client
from elasticsearch import Elasticsearch, RequestsHttpConnection

es_host = '<insert your own Amazon Elasticsearch Service endpoint here>'
sns_topic = '<insert your own SNS topic ARN here>'

def lambda_handler(event, context):
    # Establish credentials
    session_var = session.Session()
    credentials = session_var.get_credentials()
    region = session_var.region_name or 'us-east-1'

    # Check to see if this event is a task event and, if so, if it contains
    # information about a task failure. If so, send an SNS notification.
    if "detail-type" not in event:
        raise ValueError("ERROR: event object is not a valid CloudWatch Events event")
    else:
        if event["detail-type"] == "ECS Task State Change":
            detail = event["detail"]
            if detail["lastStatus"] == "STOPPED":
                # Use .get() so a missing stoppedReason does not raise KeyError.
                if detail.get("stoppedReason") == "Essential container in task exited":
                    # Send an error status message.
                    sns_client = client('sns')
                    sns_client.publish(
                        TopicArn=sns_topic,
                        Subject="ECS task failure detected for container",
                        Message=json.dumps(detail)
                    )

    # Elasticsearch connection. Note that you must sign your requests with
    # Signature Version 4 in order to call the Amazon ES API using your IAM
    # credentials. Use the requests_aws_sign package for this.
    service = 'es'
    auth = AWSV4Sign(credentials, region, service)
    es_client = Elasticsearch(host=es_host,
                              port=443,
                              connection_class=RequestsHttpConnection,
                              http_auth=auth,
                              use_ssl=True,
                              verify_certs=True)

    es_client.index(index="ecs-index", doc_type="eventstream", body=event)

Break this down: First, the function inspects the event to see if it is a task change event. If so, it further looks to see if the event is reporting a stopped task, and whether that task stopped because one of its essential containers terminated. If these conditions are true, it sends a notification to the SNS topic that you created earlier.
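
The failure check described above can also be pulled out into a small pure function, which makes it easy to test without calling AWS. This is a sketch under the same assumptions as the handler; the names is_task_failure and FAILURE_REASON are illustrative and do not appear in the original function.

```python
FAILURE_REASON = "Essential container in task exited"


def is_task_failure(event):
    """Return True if the event reports a task stopped by an exiting essential container."""
    # Only task state change events can describe a task failure.
    if event.get("detail-type") != "ECS Task State Change":
        return False
    detail = event.get("detail", {})
    # Both conditions from the handler must hold.
    return (detail.get("lastStatus") == "STOPPED"
            and detail.get("stoppedReason") == FAILURE_REASON)


failed = {
    "detail-type": "ECS Task State Change",
    "detail": {"lastStatus": "STOPPED", "stoppedReason": FAILURE_REASON},
}
print(is_task_failure(failed))
```

A handler could call this function first and publish to SNS only when it returns True.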

Second, the function creates an Elasticsearch connection to your Amazon ES domain. The function uses the requests_aws_sign library to implement Signature Version 4 (SigV4) signing because, in order to call Amazon ES with IAM credentials, you need to sign all requests with the SigV4 signing process. After the signature is generated, the function calls Amazon ES and adds the event to an index for later retrieval and inspection.
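
For readers curious about what the signing library does on your behalf, the sketch below shows one step of Signature Version 4: deriving the signing key through a chain of HMAC-SHA256 operations. It uses only the Python standard library and is an illustration, not the code that requests_aws_sign runs; the function names and the example secret are placeholders, and a complete signature also requires building a canonical request and a string to sign.

```python
import hashlib
import hmac


def _hmac_sha256(key, message):
    """HMAC-SHA256 of a text message with a bytes key."""
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).digest()


def derive_signing_key(secret_key, date_stamp, region, service):
    """Derive the SigV4 signing key; date_stamp is formatted as YYYYMMDD."""
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    return _hmac_sha256(k_service, "aws4_request")


# Placeholder secret; never hard-code real credentials.
key = derive_signing_key("EXAMPLE-SECRET-KEY", "20161017", "us-east-1", "es")
print(len(key))
```

Because the key depends on the date, region, and service, a key derived for one of them cannot be reused to sign requests for another.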

To get this code to work, your Lambda function must have permission to perform HTTP POST requests against your Amazon ES instance, and to publish messages to your SNS topic. Configure this by setting up your Lambda function with an execution role that grants the appropriate permission to these resources in your account.

To get started, you need to prepare a ZIP file for the above code that contains both the code and its prerequisites. Create a directory named lambda_eventstream, and save the code above to a file named lambda_function.py. In your favorite text editor, replace the es_host and sns_topic variables with your own Amazon ES endpoint and SNS topic ARN, respectively.

Next, on the command line (Linux, Windows, or Mac), change to the directory that you just created, and run the following command for pip (the de facto standard Python installation utility) to download all of the required prerequisites for this code into the directory. You need to ship these dependencies with your code, as they are not pre-installed on the instance that runs your Lambda function.

NOTE: You need to be on a machine with Python and pip already installed. If you are using Python 2.7.9 or greater, pip is installed as part of your standard Python installation. If you are not using Python 2.7.9 or greater, consult the pip page for installation instructions.

pip install requests_aws_sign elasticsearch -t .

Finally, zip all of the contents of this directory into a single zip file. Make sure that the lambda_function.py file is at the top of the file hierarchy within the zip file, and that it is not contained within another directory. From within the lambda_eventstream directory, you can use the following command on Linux and macOS systems (the -r flag includes the contents of the dependency directories that pip created):

zip -r lambda-eventstream.zip *

On Windows clients with the 7-Zip utility installed, you can run the following command from PowerShell or, if you’re really so inclined, a command prompt:

7z a -tzip lambda-eventstream.zip *
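
If neither utility is available, the same packaging step can be sketched with Python's standard zipfile module. The function name build_package is an assumption made for illustration; the sketch stores every file with a path relative to the source directory, so lambda_function.py ends up at the top of the archive.

```python
import os
import zipfile


def build_package(source_dir, zip_path):
    """Zip everything under source_dir, with paths relative to source_dir."""
    zip_abs = os.path.abspath(zip_path)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(source_dir):
            for name in files:
                full_path = os.path.join(root, name)
                if os.path.abspath(full_path) == zip_abs:
                    continue  # do not add the archive to itself
                archive.write(full_path, os.path.relpath(full_path, source_dir))
    # Re-open the archive to report what was stored.
    with zipfile.ZipFile(zip_path) as archive:
        return archive.namelist()
```

Calling build_package(".", "lambda-eventstream.zip") from within the directory that holds lambda_function.py returns the list of archived paths, which you can inspect to confirm that lambda_function.py is not nested inside another directory.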

Now that your function and its dependencies are properly packaged, install and test it. Navigate to the Lambda console, choose Create a Lambda Function, and then on the Select Blueprint page, choose Blank function. Choose Next on the Configure triggers screen; you wire up your function to your ECS event stream in the next section.

On the Configure function page, for Name, enter lambda-eventstream. For Runtime, choose Python 2.7. Under Lambda function code, for Code entry type, choose Upload a .ZIP file, and choose Upload to select the ZIP file that you just created.

Under Lambda function handler and role, for Role, choose Create a custom role. This opens a new window for configuring your policy. For IAM Role, choose Create a New IAM Role, and type a name. Then choose View Policy Document, Edit. Paste in the IAM policy below, making sure to replace every instance of <AWSAccountID> with your own AWS account ID.

{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Effect":"Allow",
         "Action":"logs:CreateLogGroup",
         "Resource":"arn:aws:logs:us-east-1:<AWSAccountID>:*"
      },
      {
         "Effect":"Allow",
         "Action":[
            "logs:CreateLogStream",
            "logs:PutLogEvents"
         ],
         "Resource":[
            "arn:aws:logs:us-east-1:<AWSAccountID>:log-group:/aws/lambda/lambda-eventstream:*"
         ]
      },
      {
         "Effect":"Allow",
         "Action":[
            "es:ESHttpPost"
         ],
         "Resource":"arn:aws:es:us-east-1:<AWSAccountID>:domain/ecs-events-cluster/*"
      },
      {
         "Effect":"Allow",
         "Action":[
            "sns:Publish"
         ],
         "Resource":"arn:aws:sns:us-east-1:<AWSAccountID>:ECSTaskErrorNotification"
      }
   ]
}

This policy establishes every permission that your Lambda function requires for execution, including permission to:

Create a new CloudWatch Logs log group, and save all outputs from your Lambda function to this group
Perform HTTP POST requests against your Elasticsearch cluster
Publish messages to your SNS topic
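
Because the account ID has to be substituted in several places, it can help to generate the policy rather than edit it by hand. The following is a minimal sketch, assuming a hypothetical helper named render_policy; the resource names passed in the usage example are the ones used in this post, and you should adjust them if yours differ.

```python
import json


def render_policy(account_id, region, function_name, domain_name, topic_name):
    """Build the execution role policy as a dictionary."""
    logs_arn = "arn:aws:logs:{0}:{1}".format(region, account_id)
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": "logs:CreateLogGroup",
             "Resource": logs_arn + ":*"},
            {"Effect": "Allow",
             "Action": ["logs:CreateLogStream", "logs:PutLogEvents"],
             "Resource": [logs_arn + ":log-group:/aws/lambda/" + function_name + ":*"]},
            {"Effect": "Allow", "Action": ["es:ESHttpPost"],
             "Resource": "arn:aws:es:{0}:{1}:domain/{2}/*".format(
                 region, account_id, domain_name)},
            {"Effect": "Allow", "Action": ["sns:Publish"],
             "Resource": "arn:aws:sns:{0}:{1}:{2}".format(
                 region, account_id, topic_name)},
        ],
    }


# 123456789012 is a placeholder account ID.
policy = render_policy("123456789012", "us-east-1", "lambda-eventstream",
                       "ecs-events-cluster", "ECSTaskErrorNotification")
print(json.dumps(policy, indent=3))
```

The printed JSON can be pasted into the policy editor in place of the hand-edited document.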

When you’re done, you can test your configuration by scrolling up to the sample event stream message provided earlier in this post, and using it to test your Lambda function in the console. On the dashboard page for your new function, choose Test, and in the Input test event window, enter the JSON-formatted event from earlier.

Note that, if you haven’t entered your account ID in all of the right places in your IAM policy, you may receive a message along the lines of:

User: arn:aws:sts::123456789012:assumed-role/LambdaEventStreamTake2/awslambda_421_20161017203411268 is not authorized to perform: es:ESHttpPost on resource: ecs-events-cluster.

Edit the policy associated with your Lambda execution role in the IAM console and try again.

Send event stream events to your Lambda function

Almost there! Now with your SNS topic, Elasticsearch cluster, and Lambda function all in place, the only remaining element is to wire up your ECS event stream events and route them to your Lambda function. The CloudWatch Events console offers everything you need to set this up quickly and easily.

From the console, choose CloudWatch, Events. On Step 1: Create Rule, under Event selector, choose Amazon EC2 Container Service. CloudWatch Events enables you to filter by the type of message (task state change or container instance state change), as well as to select a specific cluster from which to receive events. For the purposes of this post, keep the default settings of Any detail type and Any cluster.

Under Targets, choose Lambda function. For Function, choose lambda-eventstream. Behind the scenes, this sends events from your ECS clusters to your Lambda function and also adds the permission required for CloudWatch Events to invoke your Lambda function.
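
The console builds an event pattern for the rule on your behalf. The sketch below shows what such a pattern could look like when narrowed to task state changes, and checks it against events with a deliberately simplified matcher. The matcher only compares top-level fields and is an assumption made for illustration; the matching that CloudWatch Events performs also handles nested fields such as those in detail.

```python
# Pattern that selects only ECS task state change events.
ECS_TASK_PATTERN = {
    "source": ["aws.ecs"],
    "detail-type": ["ECS Task State Change"],
}


def matches_pattern(event, pattern):
    """Simplified check: every pattern field must list the event's value."""
    return all(event.get(field) in allowed
               for field, allowed in pattern.items())


task_event = {"source": "aws.ecs", "detail-type": "ECS Task State Change"}
instance_event = {"source": "aws.ecs",
                  "detail-type": "ECS Container Instance State Change"}
print(matches_pattern(task_event, ECS_TASK_PATTERN))
print(matches_pattern(instance_event, ECS_TASK_PATTERN))
```

Leaving detail-type out of the pattern corresponds to the Any detail type setting used in this post, in which case both kinds of event reach the function.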

Verify your work

Now it’s time to verify that messages sent from your ECS cluster flow through your Lambda function, trigger an SNS message for failed tasks, and are stored in your Elasticsearch cluster for future retrieval. To test this workflow, you can use the following ECS task definition, which attempts to start the official WordPress image without configuring an SQL database for storage:

{
    "taskDefinition": {
        "status": "ACTIVE",
        "family": "wpunconfiguredfail",
        "volumes": [],
        "taskDefinitionArn": "arn:aws:ecs:us-east-1:123456789012:task-definition/wpunconfiguredfail:1",
        "containerDefinitions": [
            {
                "environment": [],
                "name": "web",
                "mountPoints": [],
                "image": "wordpress",
                "cpu": 99,
                "portMappings": [
                    {
                        "protocol": "tcp",
                        "containerPort": 80,
                        "hostPort": 80
                    }
                ],
                "memory": 100,
                "essential": true,
                "volumesFrom": []
            }
        ],
        "revision": 1
    }
}

Create this task definition using either the AWS Management Console or the AWS CLI, and then start a task from this task definition. For more detailed instructions, see Creating a Task Definition and Running Tasks.

A few minutes after launching a task from this task definition, you should receive an SNS message with the contents of the task state change JSON indicating that the task failed. You can also examine your Elasticsearch cluster in the console by selecting the name of your cluster and choosing Indices, ecs-index. For Count, you should see that you have multiple records stored.

You can also search the messages that have been stored by opening up access to your Kibana endpoint. Kibana provides a host of visualization and search capabilities for data stored in Amazon ES. To open up access to Kibana to your computer, find your computer’s IP address, and then choose Modify access policy for your Elasticsearch cluster. For Set the domain access policy to, choose Allow access to the domain from specific IP(s) and enter your IP address.
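
For reference, an IP-based access policy of the kind described above has roughly the following shape. This is a sketch only: the helper name ip_access_policy is hypothetical, the account ID and IP address are placeholders, and the policy that the console template generates for you may differ in its details.

```python
import json


def ip_access_policy(domain_arn, ip_addresses):
    """Build a resource-based policy allowing HTTP access from the given IPs."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": "*"},
                "Action": "es:ESHttp*",
                "Condition": {"IpAddress": {"aws:SourceIp": list(ip_addresses)}},
                "Resource": domain_arn + "/*",
            }
        ],
    }


policy = ip_access_policy(
    "arn:aws:es:us-east-1:123456789012:domain/ecs-events-cluster",
    ["192.0.2.10"],  # placeholder address from a documentation-only range
)
print(json.dumps(policy, indent=2))
```

Anyone connecting from a listed address can reach the domain without signing requests, so keep the list as short as possible.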

(A more robust and scalable solution for securing Kibana is to front it with a proxy. Details on this approach can be found in Karthi Thyagarajan’s post How to Control Access to Your Amazon Elasticsearch Service Domain.)

You should now be able to click the Kibana endpoint for your cluster, and search for messages stored in your cluster’s indexes.

Conclusion

After you have this basic, serverless architecture set up for consuming ECS cluster-related event notifications, the possibilities are limitless. For example, instead of storing the events in Amazon ES, you could store them in Amazon DynamoDB, and use the resulting tables to build a UI that materializes the current state of your clusters.
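
As a rough illustration of materializing current state, the sketch below keeps the latest known detail for each task in a plain dictionary that stands in for a DynamoDB table. It assumes that the version field inside detail increases each time a task changes state, and uses it to discard stale or duplicate events; the function and variable names are hypothetical.

```python
def apply_event(state, event):
    """Update state (task ARN -> latest detail) and report whether it changed."""
    if event.get("detail-type") != "ECS Task State Change":
        return False
    detail = event["detail"]
    current = state.get(detail["taskArn"])
    # Ignore events that are not newer than what is already stored.
    if current is not None and current["version"] >= detail["version"]:
        return False
    state[detail["taskArn"]] = detail
    return True


def task_event(arn, status, version):
    """Build a minimal task state change event for the example."""
    return {"detail-type": "ECS Task State Change",
            "detail": {"taskArn": arn, "lastStatus": status, "version": version}}


cluster_state = {}
apply_event(cluster_state, task_event("task/example", "RUNNING", 2))
apply_event(cluster_state, task_event("task/example", "PENDING", 1))  # arrives late
print(cluster_state["task/example"]["lastStatus"])
```

With a real table, the same comparison could be expressed as a conditional write so that an older event never overwrites a newer one.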

You could also use this information to drive container placement and scaling automatically, allowing you to “right-size” your clusters to a very granular level. By delivering cluster state information in near-real time using an event-driven model as opposed to a pull model, the new ECS event stream feature opens up a much wider array of possibilities for monitoring and scaling your container infrastructure.

If you have questions or suggestions, please comment below.

AWS Compute Blog