Implement Amazon ECS Anywhere enhanced workload resilience in disconnected scenarios

Introduction

Amazon Elastic Container Service (ECS) Anywhere is a feature of Amazon ECS that lets you run and manage container workloads on your infrastructure. This feature helps you meet compliance requirements and scale your business without sacrificing your on-premises investments.

When extending Amazon ECS to customer-managed infrastructure, external instances are registered to a managed Amazon ECS cluster hosted in an AWS Region. External instances are compute resources (i.e., hosts) outside an AWS Region on which Amazon ECS can schedule tasks to run. An external instance is typically an on-premises server or virtual machine (VM).

Amazon ECS Anywhere currently supports operation in deployment scenarios where there’s consistent and reliable network connectivity between external instances and the Amazon ECS cluster. Amazon ECS monitors for errors or failures that occur to managed containers running on external instances, and restarts any containers that have stopped due to an error. However, Amazon ECS Anywhere doesn’t currently support a disconnected mode of operation. This means that for any period of time that an external instance loses network connectivity to the Amazon ECS cluster, managed containers aren’t restarted after stopping due to an error condition.

The open source Amazon ECS External Instance Network Sentry (eINS) has been developed to augment the function of Amazon ECS Anywhere, by providing an additional layer of resilience for Amazon ECS external instances in deployment scenarios where connectivity to the Amazon ECS control-plane may be unreliable or intermittent.

The eINS is designed to detect any loss of network connectivity between an external instance and the associated Amazon ECS cluster, and to ensure that, for the duration of the outage, any Amazon ECS-managed containers are restarted in the following circumstances:

the container exits due to an error, which manifests as a non-zero exit code;
the Docker daemon is restarted;
the external instance is rebooted.

This post describes how to implement the eINS to provide an additional layer of resilience for Amazon ECS external instances in deployment scenarios where connectivity to the associated Amazon ECS cluster may be unreliable or intermittent.

Note: The eINS isn’t an officially supported feature of Amazon ECS. Please submit an eINS GitHub issue for any feature requests, bugs, or documentation improvements.

Solution overview

The eINS is a Python application that can either be run manually, or be configured to run as a service on Amazon ECS Anywhere external instances. See the Walkthrough section below for instructions for both deployment scenarios.

eINS regular operation with region connectivity

When running on an Amazon ECS external instance, the function of the eINS is entirely automatic. The following Connected and Disconnected Operation scenarios provide a detailed description of how the eINS functions as the availability of the on-region Amazon ECS control plane changes over time.

Connected operation

This scenario describes eINS behavior during periods when the on-region Amazon ECS control plane is reachable.

The eINS periodically attempts to establish a network connection with the Amazon ECS on-region control-plane to determine region availability status, and the on-region Amazon ECS control-plane responds without error.

In reference to the diagram:

eINS network connection with the Amazon ECS on-region control-plane [1] completes successfully:
- eINS takes no further action.
In communication with the on-region control-plane [2] the Amazon ECS agent on the external instance orchestrates local managed container lifecycle, including restarting containers which exit due to error condition [3].
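
The connectivity test in step [1] can be approximated with a plain TCP connection attempt. The following is a minimal sketch rather than the eINS implementation: it assumes the regional Amazon ECS public endpoint follows the ecs.<region>.amazonaws.com naming pattern and is probed on port 443, and the function names and default timeout are hypothetical.

```python
import socket


def ecs_endpoint(region):
    """Return the assumed public Amazon ECS endpoint host name for a region."""
    return f"ecs.{region}.amazonaws.com"


def is_reachable(host, port=443, timeout=10.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host name and connects in one call.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures (socket.gaierror), timeouts and refused connections.
        return False
```

Calling is_reachable(ecs_endpoint("ap-southeast-2")) once per check interval would give a simple reachable/unreachable signal. A TCP connection only shows that the endpoint accepts connections; it does not prove that the Amazon ECS API itself is healthy.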

Disconnected operation

This scenario describes eINS behavior during periods where the on-region Amazon ECS control plane is unreachable.

eINS operation with no region connectivity

In reference to the diagram:

eINS network connection with the Amazon ECS on-region control-plane [1] times out or returns an error:
- The Amazon ECS agent is paused [3] via the local Docker API [2]*.
- eINS updates the Docker restart policy to on-failure for each Amazon ECS-managed container [4]. This ensures that any Amazon ECS-managed container is restarted if it exits due to an error, the Docker daemon is restarted, or the external instance is rebooted.
When the Amazon ECS control-plane becomes reachable:
- Amazon ECS-managed containers that have been automatically restarted by the Docker daemon during network outage are stopped and removed.**
- Amazon ECS-managed containers that haven’t been automatically restarted during the network outage have their Docker restart policy set back to no.
- The local Amazon ECS agent is un-paused.

At this point the operational environment has been restored back to the Connected Operation scenario. eINS continues to monitor for network outage or Amazon ECS control-plane error.
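
The offline and online transitions described above can be sketched against the container objects returned by the Python Docker SDK. This is an illustration under stated assumptions, not the eINS source: the function names are hypothetical, and using the RestartCount field from the container inspect data to decide whether Docker restarted a container is an assumption about how such a check could be made.

```python
def restart_policy(name, retries=0):
    """Build a Docker restart policy dict (0 retries means no limit for on-failure)."""
    return {"Name": name, "MaximumRetryCount": retries}


def go_offline(agent, managed, retries=0):
    """Control plane unreachable: pause the agent, let Docker restart failures."""
    agent.pause()
    for container in managed:
        container.update(restart_policy=restart_policy("on-failure", retries))


def go_online(agent, managed):
    """Control plane reachable again: clean up, reset policies, resume the agent."""
    for container in managed:
        container.reload()  # refresh the cached inspect data
        if container.attrs.get("RestartCount", 0) > 0:
            # Assumed check: Docker restarted this container during the outage,
            # so remove it and let Amazon ECS relaunch the related task.
            container.stop()
            container.remove()
        else:
            container.update(restart_policy=restart_policy("no"))
    agent.unpause()


# Possible wiring (assumes the agent container is named "ecs-agent" and that
# managed containers carry the "com.amazonaws.ecs.cluster" label):
#   import docker
#   client = docker.from_env()
#   agent = client.containers.get("ecs-agent")
#   managed = client.containers.list(
#       all=True, filters={"label": "com.amazonaws.ecs.cluster"})
```

The functions only rely on the pause, unpause, update, reload, stop, and remove methods of the SDK's container objects, so they can be exercised with stand-in objects before being pointed at a real Docker daemon.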

Notes

*The Amazon ECS agent is paused because, if left running, it would detect and kill any Amazon ECS-managed containers that were restarted by the Docker daemon during the network outage.

**These containers are stopped and removed by eINS to avoid duplication:

Containers that have been restarted by the Docker daemon during a network outage become orphaned by Amazon ECS once back online.
The related Amazon ECS tasks are relaunched by Amazon ECS on the external instance once the Amazon ECS agent has established communication with the control-plane.

Configuration parameters

The eINS accepts configuration parameters as command line arguments. Running the application with the --help parameter generates a summary of available parameters:

$ python3 ecs-external-instance-network-sentry.py --help
usage: ecs-external-instance-network-sentry [-h] -r REGION [-i INTERVAL] [-n RETRIES] [-l LOGFILE] [-k LOGLEVEL]

Purpose:
--------------
For use on ECS Anywhere external hosts:
Configures ECS orchestrated containers to automatically restart
on failure when on-region ecs control-plane is detected to be unreachable.

Configuration Parameters:
--------------
  -h, --help            Show this help message and exit.
  -r REGION, --region REGION
                        AWS region where ecs cluster is located.
  -i INTERVAL, --interval INTERVAL
                        Interval in seconds sentry will sleep between connectivity checks.
  -n RETRIES, --retries RETRIES
                        Number of times Docker will restart a crashing container.
  -l LOGFILE, --logfile LOGFILE
                        Logfile name & location.
  -k LOGLEVEL, --loglevel LOGLEVEL
                        Log data event severity.

The configuration parameters are described in further detail below:

--region

Provide the name of the AWS region where the Amazon ECS cluster that manages the external instance is hosted. eINS attempts to establish a network connection to the Amazon ECS public endpoint at the nominated region to evaluate Amazon ECS control-plane availability.

optional=no
default=””

$ python3 ecs-external-instance-network-sentry.py --region ap-southeast-2

--interval

Specify the number of seconds between connectivity tests.

optional=yes
default=20

$ python3 ecs-external-instance-network-sentry.py --region ap-southeast-2 --interval 15

--retries

Specify the number of times failing containers will be restarted during periods where the Amazon ECS control-plane is unavailable. The default setting is 0, which configures the Docker daemon to restart containers an unlimited number of times.

optional=yes
default=0

$ python3 ecs-external-instance-network-sentry.py --region ap-southeast-2 --interval 15 --retries 5

--logfile

Specify the log file name and file-system path. The default value is /tmp/ecs-external-instance-network-sentry.log.

optional=yes
default=/tmp/ecs-external-instance-network-sentry.log

$ python3 ecs-external-instance-network-sentry.py --region ap-southeast-2 --interval 15 --retries 5 --logfile /mypath/myfile.log

--loglevel

Specify log data event severity.

optional=yes
default=INFO

$ python3 ecs-external-instance-network-sentry.py --region ap-southeast-2 --interval 15 --retries 5 --logfile /mypath/myfile.log --loglevel DEBUG
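
For readers who want to see how a command line like this maps onto Python, the parameters above could be declared with argparse roughly as follows. This is a sketch that mirrors the documented flags and defaults; the build_parser name is hypothetical and the actual eINS argument handling may differ.

```python
import argparse


def build_parser():
    """Declare the documented eINS flags with their documented defaults."""
    parser = argparse.ArgumentParser(prog="ecs-external-instance-network-sentry")
    parser.add_argument("-r", "--region", required=True,
                        help="AWS region where ecs cluster is located.")
    parser.add_argument("-i", "--interval", type=int, default=20,
                        help="Seconds to sleep between connectivity checks.")
    parser.add_argument("-n", "--retries", type=int, default=0,
                        help="Times Docker will restart a crashing container.")
    parser.add_argument("-l", "--logfile",
                        default="/tmp/ecs-external-instance-network-sentry.log",
                        help="Logfile name & location.")
    parser.add_argument("-k", "--loglevel", default="INFO",
                        help="Log data event severity.")
    return parser
```

Parsing ["--region", "ap-southeast-2"] with this parser yields the defaults listed above for every optional parameter, while omitting --region makes argparse exit with a usage error.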

Walkthrough

Prerequisites

The following prerequisites should be implemented prior to deploying the eINS.

External instance host operating system

Amazon ECS Anywhere has been certified to run on a range of supported operating systems and system architectures. The eINS installation commands and procedure herein have been tested for compatibility with external instances provisioned with Ubuntu 20 as the host operating system. As the eINS is a Python application, it functions on the other supported Linux based distributions and system architectures; however, installation commands and procedure may vary.

Amazon ECS Anywhere

Each external instance you register with an Amazon ECS cluster requires the AWS Systems Manager Agent (SSM Agent), the Amazon ECS container agent, and Docker to be installed. To register the external instance to an Amazon ECS cluster, it must first be registered as an AWS Systems Manager managed instance. You can generate the comprehensive installation script in a few clicks on the Amazon ECS console. Follow the instructions as described here.

Python

The eINS has been developed and tested running on Python version 3.8.10.

Python Docker SDK

The eINS interacts with the Docker API, which requires installation of the Python Docker SDK on each external instance where the eINS will run. To install the Python Docker SDK, run the commands as follows:

# update package index files..
$ apt-get update
# install python docker sdk..
$ python3 -m pip install docker
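
After installation, a quick check like the following can confirm that the SDK imports and that the local Docker daemon responds. This is an optional sketch; the docker_sdk_ready name is hypothetical and not part of the eINS.

```python
def docker_sdk_ready():
    """Return True if the Docker SDK is importable and the daemon answers a ping."""
    try:
        import docker  # provided by the "docker" package installed above

        client = docker.from_env()  # reads DOCKER_HOST or the default socket
        return bool(client.ping())
    except Exception:
        # Deliberately broad for a probe: a missing package, an unreachable
        # daemon, or a permission error all mean "not ready".
        return False
```

If the function returns False for a non-root user, a common cause is that the user lacks permission to access the Docker socket.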

Clone the eINS git repository

On the Amazon ECS external instance, clone the ecs-external-instance-network-sentry repository:

# clone eins git repo..
$ git clone https://github.com/aws-samples/ecs-external-instance-network-sentry.git

Commands from this point forward assume that you’re in the root directory of the local git repository clone.

Manual operation

At this point, the external instance host operating system is ready to run the eINS. For testing or evaluation, the application can be launched manually according to the below procedure. However, it is recommended to configure the eINS as a Background Service in production deployment scenarios to ensure that the application is running at all times.

The eINS is located within the /python directory of the git repository. See the Configuration Parameters section for required and optional parameters to be submitted at runtime, and Logging to validate successful operation. Remember to provide the correct AWS region code:

# manual launch..
$ python3 python/ecs-external-instance-network-sentry.py --region ap-southeast-2

Background service

Configuring the application as an OS background service is an effective mechanism to ensure that the eINS remains running in the background at all times.

Service configuration requires the implementation of a unit configuration file, which encodes information about the process that will be controlled and supervised by systemd.

Configuration procedure

The following describes configuring the eINS as an OS background service.

Copy application and configuration files

Run the following commands to copy application and configuration files to the appropriate locations on the external instance file system:

# copy eins application file..
$ cp python/ecs-external-instance-network-sentry.py /usr/bin
# copy eins service unit config file..
$ cp config/ecs-external-instance-network-sentry.service /lib/systemd/system

Update service unit configuration file

Next, update the service unit configuration file /lib/systemd/system/ecs-external-instance-network-sentry.service.

$ cat /lib/systemd/system/ecs-external-instance-network-sentry.service
[Unit]
Description=Amazon ECS External Instance Network Service
Documentation=https://github.com/aws-samples/ecs-external-instance-network-sentry
Requires=docker.service
After=ecs.service

[Service]
Type=simple
Restart=on-failure
RestartSec=10s
ExecStart=python3 /usr/bin/ecs-external-instance-network-sentry.py --region <INSERT-REGION-NAME-HERE>

[Install]
WantedBy=multi-user.target

Make the necessary modifications to the ExecStart directive on line 11 of the service unit config file as follows:

Update the --region configuration parameter with the AWS region name where your on-region Amazon ECS cluster is provisioned.
Optionally, include any additional Configuration Parameters to suit the particular requirements of your deployment scenario.

Configure and start service

# reload systemd..
$ systemctl daemon-reload
# enable eins service..
$ sudo systemctl enable ecs-external-instance-network-sentry.service
# start eins service..
$ systemctl start ecs-external-instance-network-sentry

Check service status

To validate that the service has started successfully, run the following command. If the service has started correctly, the output should be similar to the following:

$ systemctl status ecs-external-instance-network-sentry

● ecs-external-instance-network-sentry.service - Amazon ECS External Instance Network Service
     Loaded: loaded (/lib/systemd/system/ecs-external-instance-network-sentry.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2021-07-30 07:57:08 UTC; 22min ago
       Docs: https://github.com/aws-samples/ecs-external-instance-network-sentry
   Main PID: 28366 (python3)
      Tasks: 1 (limit: 9412)
     Memory: 19.7M
     CGroup: /system.slice/ecs-external-instance-network-sentry.service
             └─28366 /usr/bin/python3 /usr/bin/ecs-external-instance-network-sentry.py --region ap-southeast-2 --interval 10 --retries 3 --logfile /tmp/ecs->

Jul 30 07:57:08 ubu20 systemd[1]: Started Amazon ECS External Instance Network Service.

Logging

The eINS has been configured to provide basic logging regarding its operation.

The default logfile location is /tmp/ecs-external-instance-network-sentry.log, which can be modified by submitting the --logfile configuration parameter.

Log level

By default, the loglevel is set to logging.INFO and can be updated at runtime using the --loglevel configuration parameter.

Log output

The following eINS log file excerpt illustrates:

A detected loss of connectivity to on-region control-plane, and associated Docker policy configuration actions for Amazon ECS managed containers;
Container cleanup and Docker policy configuration once Amazon ECS control-plane becomes reachable.

2021-07-10 09:00:01,200 INFO PID_713928 [startup] ecs-external-instance-network-sentry - starting..
2021-07-10 09:00:01,200 INFO PID_713928 [startup] arg - aws region: ap-southeast-2
2021-07-10 09:00:01,200 INFO PID_713928 [startup] arg - interval: 10
2021-07-10 09:00:01,201 INFO PID_713928 [startup] arg - retries: 0
2021-07-10 09:00:01,201 INFO PID_713928 [startup] arg - logfile: /tmp/ecs-external-instance-network-sentry.log
2021-07-10 09:00:01,201 INFO PID_713928 [startup] arg - loglevel: logging.INFO
......
2021-07-10 09:39:33,756 INFO PID_713928 [begin] connectivity test..
2021-07-10 09:39:33,757 INFO PID_713928 [connect] connecting to ecs at ap-southeast-2..
2021-07-10 09:39:33,757 INFO PID_713928 [connect] create network socket..
2021-07-10 09:39:43,764 ERROR PID_713928 [connect] error creating network socket: [Errno -3] Temporary failure in name resolution
2021-07-10 09:39:43,764 INFO PID_713928 [connect] connecting to host..
2021-07-10 09:39:43,765 INFO PID_713928 [ecs-offline] ecs unreachable, configuring container restart policy..
2021-07-10 09:39:43,880 INFO PID_713928 [ecs-offline] container name: ecs-alpine-crash-test-9adba798f5f189968701
2021-07-10 09:39:43,881 INFO PID_713928 [ecs-offline] ecs cluster: ecs-anywhere-cluster-1
2021-07-10 09:39:43,882 INFO PID_713928 [ecs-offline] set container restart policy: {'Name': 'on-failure', 'MaximumRetryCount': 0}
2021-07-10 09:39:43,958 INFO PID_713928 [ecs-offline] container name: ecs-nginx-1-nginx-eaa6e7a9b0cd88988201
2021-07-10 09:39:43,959 INFO PID_713928 [ecs-offline] ecs cluster: ecs-anywhere-cluster-1
2021-07-10 09:39:43,959 INFO PID_713928 [ecs-offline] set container restart policy: {'Name': 'on-failure', 'MaximumRetryCount': 0}
2021-07-10 09:39:44,022 INFO PID_713928 [ecs-offline] ecs agent paused..
2021-07-10 09:39:44,022 INFO PID_713928 [end] sleeping for 10 seconds..
......
2021-07-10 09:41:14,298 INFO PID_713928 [begin] connectivity test..
2021-07-10 09:41:14,299 INFO PID_713928 [connect] connecting to ecs at ap-southeast-2..
2021-07-10 09:41:14,299 INFO PID_713928 [connect] create network socket..
2021-07-10 09:41:23,133 INFO PID_713928 [connect] connecting to host..
2021-07-10 09:41:23,258 INFO PID_713928 [connect] send/receive data..
2021-07-10 09:41:30,563 INFO PID_713928 [connect] ecs at ap-southeast-2 is available..
2021-07-10 09:41:30,564 INFO PID_713928 [ecs-online] ecs is reachable..
2021-07-10 09:41:30,621 INFO PID_713928 [ecs-online] container name: ecs-alpine-crash-test-9adba798f5f189968701
2021-07-10 09:41:30,621 INFO PID_713928 [ecs-online] ecs cluster: ecs-anywhere-cluster-1
2021-07-10 09:41:30,622 INFO PID_713928 [ecs-online] container has been restarted by docker, stopping & removing..
2021-07-10 09:41:41,330 INFO PID_713928 [ecs-online] container name: ecs-nginx-1-nginx-eaa6e7a9b0cd88988201
2021-07-10 09:41:41,330 INFO PID_713928 [ecs-online] ecs cluster: ecs-anywhere-cluster-1
2021-07-10 09:41:41,331 INFO PID_713928 [ecs-online] set container restart policy: {'Name': 'no', 'MaximumRetryCount': 0}
2021-07-10 09:41:41,470 INFO PID_713928 [ecs-online] ecs agent unpaused..
2021-07-10 09:41:41,471 INFO PID_713928 [end] sleeping for 10 seconds..

Log rotation

The log file rotates at 5 MB, and a history of the five most recent log files is maintained.
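
Size-based rotation of this kind is available in the Python standard library. The sketch below shows one way it could be configured, assuming that the 5 MB limit means 5 * 1024 * 1024 bytes and that "five most recent log files" maps to backupCount=5; the build_logger name and the format string (chosen to resemble the log excerpt above) are illustrative and not taken from the eINS source.

```python
import logging
from logging.handlers import RotatingFileHandler


def build_logger(logfile, level="INFO"):
    """Return a logger that rotates its file at about 5 MB and keeps five backups."""
    logger = logging.getLogger("ecs-external-instance-network-sentry")
    logger.setLevel(getattr(logging, level.upper()))
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = RotatingFileHandler(logfile, maxBytes=5 * 1024 * 1024, backupCount=5)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s PID_%(process)d %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

With this configuration the handler renames the active file to <logfile>.1, <logfile>.2, and so on as it fills, and discards the oldest backup once five exist.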

Considerations

The eINS currently has the following limitation:

As described in the Disconnected Operation section, containers restarted during a period where the Amazon ECS control-plane is unavailable will be stopped and relaunched once the Amazon ECS control-plane becomes available.

Cleaning up

In order to avoid incurring future costs associated with this solution, follow this procedure to deregister your external instance from both Amazon ECS and AWS Systems Manager.

Following deregistration, the external instance is no longer able to accept new tasks. If you have tasks that are running on the external instance when you deregister it, the tasks remain running until they stop through some other means. However, these tasks are no longer monitored or accounted for by Amazon ECS.

Conclusion

In this post, we provided a detailed overview of the open source Amazon ECS External Instance Network Sentry, and we showed you how to implement the eINS as an operating system background service on your ECS Anywhere external instances.

If you are deploying your workloads to Amazon ECS Anywhere external instances and require enhanced workload resiliency during periods where the on-region control plane isn’t contactable, then the eINS is a great open source solution that provides enhanced availability. This might include external instances deployed in well connected, but mission critical situations (e.g., data center, warehouse, manufacturing plant, etc.) or environments where internet connectivity may be more unreliable (e.g., maritime or rural use cases). To learn more, see ECS Anywhere in the Amazon ECS Developer Guide, and we encourage you to give it a try with the ECS Anywhere workshop as a next step.

A public version of our container services feature roadmap is available online. We know that our customers are making decisions and plans based on what we are developing, and we want to provide customers with the insights needed to appropriately plan for the future. If there are any features that you would like to be available, which are not currently on the feature roadmap, then please open an issue! Community submitted issues will be tagged “Proposed” and will be reviewed by the AWS team. You can read more information about how to contribute here.
