Implement Amazon ECS Anywhere enhanced workload resilience in disconnected scenarios
Amazon Elastic Container Service (ECS) Anywhere is a feature of Amazon ECS that lets you run and manage container workloads on your infrastructure. This feature helps you meet compliance requirements and scale your business without sacrificing your on-premises investments.
When extending Amazon ECS to customer-managed infrastructure, external instances are registered to a managed Amazon ECS cluster hosted in an AWS Region. External instances are compute resources (i.e., hosts) outside an AWS Region on which Amazon ECS can schedule tasks to run. An external instance is typically an on-premises server or virtual machine (VM).
Amazon ECS Anywhere currently supports operation in deployment scenarios where there’s consistent and reliable network connectivity between external instances and the Amazon ECS cluster. Amazon ECS monitors for errors or failures that occur to managed containers running on external instances, and restarts any containers that have stopped due to an error. However, Amazon ECS Anywhere doesn’t currently support a disconnected mode of operation. This means that for any period of time that an external instance loses network connectivity to the Amazon ECS cluster, managed containers aren’t restarted after stopping due to an error condition.
The open source Amazon ECS External Instance Network Sentry (eINS) has been developed to augment the function of Amazon ECS Anywhere, by providing an additional layer of resilience for Amazon ECS external instances in deployment scenarios where connectivity to the Amazon ECS control-plane may be unreliable or intermittent.
The eINS is designed to detect any loss of network connectivity between an external instance and the associated Amazon ECS cluster, and to ensure that, for the duration of the outage, any Amazon ECS-managed containers are restarted in the following circumstances:
- the container exits due to an error, which manifests as a non-zero exit code;
- the Docker daemon is restarted;
- the external instance is rebooted.
This post describes how to implement the eINS to provide an additional layer of resilience for Amazon ECS external instances in deployment scenarios where connectivity to the associated Amazon ECS cluster may be unreliable or intermittent.
Note: The eINS isn’t an officially supported feature of Amazon ECS. Please submit an eINS GitHub issue for any feature requests, bugs, or documentation improvements.
The eINS is a Python application that can either be run manually, or be configured to run as a service on Amazon ECS Anywhere external instances. See the Installation section below for instructions for both deployment scenarios.
When running on an Amazon ECS external instance, the function of the eINS is entirely automatic. The following Connected and Disconnected Operation scenarios provide a detailed description of how the eINS functions as the availability of the on-region Amazon ECS control plane changes over time.
This scenario describes eINS behavior during periods when the on-region Amazon ECS control plane is reachable.
The eINS periodically attempts to establish a network connection with the Amazon ECS on-region control-plane to determine region availability status, and the on-region Amazon ECS control-plane responds without error.
In reference to the diagram:
- The eINS network connection with the Amazon ECS on-region control-plane completes successfully:
  - The eINS takes no further action.
- In communication with the on-region control-plane, the Amazon ECS agent on the external instance orchestrates the local managed container lifecycle, including restarting containers that exit due to an error condition.
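The periodic connectivity test described above can be sketched in a few lines of Python. This is a minimal illustration, not the eINS implementation: it assumes a plain TCP connection to the regional public endpoint on port 443 is an adequate reachability signal, and the function names, the 5-second timeout, and the injectable `connect` argument are all hypothetical. The hostname pattern `ecs.<region>.amazonaws.com` applies to the standard commercial Regions.

```python
import socket


def ecs_endpoint(region: str) -> str:
    """Return the public Amazon ECS endpoint hostname for a Region."""
    return f"ecs.{region}.amazonaws.com"


def control_plane_reachable(region, port=443, timeout=5.0,
                            connect=socket.create_connection):
    """Return True if a TCP connection to the regional endpoint succeeds.

    `connect` is injectable (hypothetical design choice) so the check can be
    exercised without a network.
    """
    try:
        # create_connection raises OSError on failure; DNS errors and
        # timeouts are OSError subclasses, so one handler covers them all.
        with connect((ecs_endpoint(region), port), timeout=timeout):
            return True
    except OSError:
        return False
```

A monitoring loop would call `control_plane_reachable("us-west-2")` at the configured interval and switch between the connected and disconnected behaviors whenever the result changes.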
This scenario describes eINS behavior during periods where the on-region Amazon ECS control plane is unreachable.
The eINS periodically attempts to establish a network connection with the Amazon ECS on-region control-plane to determine region availability status, and the on-region Amazon ECS control-plane either does not respond, or returns an error.
In reference to the diagram:
- The eINS network connection with the Amazon ECS on-region control-plane times out or returns an error condition:
  - The Amazon ECS agent is paused via the local Docker API.*
  - The eINS updates the Docker restart policy to on-failure for each Amazon ECS-managed container. This ensures that any Amazon ECS-managed container is restarted if it exits due to an error, the Docker daemon is restarted, or the external instance is rebooted.
- When the Amazon ECS control-plane becomes reachable:
  - Amazon ECS-managed containers that have been automatically restarted by the Docker daemon during the network outage are stopped and removed.**
  - Amazon ECS-managed containers that haven’t been automatically restarted during the network outage have their Docker restart policy set back to no.
  - The local Amazon ECS agent is un-paused.
At this point the operational environment has been restored back to the Connected Operation scenario. eINS continues to monitor for network outage or Amazon ECS control-plane error.
*The Amazon ECS agent is paused because, if left in a running state, it would detect and kill any Amazon ECS-managed containers that had been restarted by the Docker daemon during the period of network outage.
**These containers are stopped and removed by eINS to avoid duplication:
- Containers that have been restarted by the Docker daemon during a network outage become orphaned by Amazon ECS once back online.
- The related Amazon ECS tasks are relaunched by Amazon ECS on the external instance once the Amazon ECS agent has established communication with the control-plane.
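The disconnect and reconnect steps above map naturally onto the Python Docker SDK. The sketch below is an assumption-laden illustration rather than the eINS source: it assumes the agent container is named `ecs-agent`, that managed containers can be selected by the `com.amazonaws.ecs.task-arn` label, and that a non-zero `RestartCount` in the container's inspect data identifies containers restarted by the Docker daemon during the outage. The function names are hypothetical; `client` is expected to behave like the object returned by `docker.from_env()`.

```python
ECS_LABEL = "com.amazonaws.ecs.task-arn"  # assumed selector for ECS-managed containers
AGENT_NAME = "ecs-agent"                  # assumed name of the ECS agent container


def managed_containers(client):
    """List ECS-managed containers, including stopped ones."""
    return client.containers.list(all=True, filters={"label": ECS_LABEL})


def on_disconnect(client, retries=0):
    """Pause the agent, then let Docker restart managed containers on failure."""
    client.containers.get(AGENT_NAME).pause()
    # With on-failure, a MaximumRetryCount of 0 means unlimited restarts.
    policy = {"Name": "on-failure", "MaximumRetryCount": retries}
    for container in managed_containers(client):
        container.update(restart_policy=policy)


def on_reconnect(client):
    """Remove restarted containers, reset the rest, then resume the agent."""
    for container in managed_containers(client):
        container.reload()  # refresh attrs so RestartCount is current
        if container.attrs.get("RestartCount", 0) > 0:
            # Restarted during the outage: ECS relaunches the task itself,
            # so this copy is removed to avoid duplication.
            container.stop()
            container.remove()
        else:
            container.update(restart_policy={"Name": "no"})
    client.containers.get(AGENT_NAME).unpause()
```

With the SDK installed, `on_disconnect(docker.from_env())` would apply the disconnected-mode changes. The sketch omits error handling; for example, pausing an already paused container raises an API error that a production tool would need to tolerate.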
The eINS accepts configuration parameters as command-line arguments. Running the application with the --help parameter generates a summary of available parameters:
Configuration parameters are described in further detail below:
Provide the name of the AWS region where the Amazon ECS cluster that manages the external instance is hosted. eINS attempts to establish a network connection to the Amazon ECS public endpoint at the nominated region to evaluate Amazon ECS control-plane availability.
Specify the number of seconds between connectivity tests.
Specify the number of times failing containers will be restarted during periods where the Amazon ECS control-plane is unavailable. The default setting is 0, which configures the Docker daemon to restart containers an unlimited number of times.
Specify log file name and file-system path. The default value is /tmp/ecs-anywhere-network-sentry.log.
Specify log data event severity.
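For readers who want a mental model of how these parameters fit together, the following `argparse` sketch mirrors the descriptions above. Only `--region`, `--logfile`, and `--loglevel` are named in this post; the `--interval` and `--retries` flag names, the 20-second interval default, the accepted `--loglevel` values, and treating `--region` as required are assumptions made for illustration. Run the eINS itself with --help for the authoritative names and defaults.

```python
import argparse


def build_parser():
    """Build an illustrative parser mirroring the parameters described above."""
    parser = argparse.ArgumentParser(
        description="Illustrative sketch of eINS-style configuration parameters")
    parser.add_argument("--region", required=True,  # assumed to be required
                        help="AWS Region hosting the Amazon ECS cluster")
    parser.add_argument("--interval", type=int, default=20,  # hypothetical name and default
                        help="seconds between connectivity tests")
    parser.add_argument("--retries", type=int, default=0,  # hypothetical name; 0 = unlimited
                        help="container restart attempts while disconnected")
    parser.add_argument("--logfile",
                        default="/tmp/ecs-anywhere-network-sentry.log",  # default as described above
                        help="log file name and file-system path")
    parser.add_argument("--loglevel", default="info",  # accepted values are assumed
                        choices=["debug", "info", "warning", "error", "critical"],
                        help="log data event severity")
    return parser
```

Calling `build_parser().parse_args(["--region", "us-west-2"])` yields a namespace with the remaining values filled from the defaults.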
The following prerequisites should be implemented prior to deploying the eINS.
External instance host operating system
Amazon ECS Anywhere has been certified to run on a range of supported operating systems and system architectures. The eINS installation commands and procedure herein have been tested for compatibility with external instances provisioned with Ubuntu 20 as the host operating system. As the eINS is a Python application, it functions on the other supported Linux-based distributions and system architectures; however, installation commands and procedure may vary.
Amazon ECS Anywhere
Each external instance you register with an Amazon ECS cluster requires the AWS Systems Manager Agent (SSM Agent), the Amazon ECS container agent, and Docker to be installed. To register the external instance to an Amazon ECS cluster, it must first be registered as an AWS Systems Manager managed instance. You can generate the comprehensive installation script in a few clicks on the Amazon ECS console. Follow the instructions as described here.
The eINS has been developed and tested running on Python version 3.8.10.
Python Docker SDK
The eINS interacts with the Docker API, which requires installation of the Python Docker SDK on each external instance where the eINS will run. To install the Python Docker SDK, run the commands as follows:
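Once the SDK is installed (it is published on PyPI as the `docker` package), a quick check from Python can confirm that the interpreter can import it and that the local Docker daemon responds. This is a hedged convenience sketch and not part of the eINS; the two function names are hypothetical.

```python
import importlib.util


def docker_sdk_installed() -> bool:
    """Return True if the `docker` package is importable by this interpreter."""
    return importlib.util.find_spec("docker") is not None


def docker_daemon_reachable() -> bool:
    """Return True if the SDK is installed and the local daemon answers a ping."""
    if not docker_sdk_installed():
        return False
    import docker  # imported lazily so this module loads without the SDK
    try:
        return bool(docker.from_env().ping())
    except Exception:
        # Any failure here (socket permissions, daemon not running, and so on)
        # is treated as "not reachable" for the purposes of this check.
        return False
```

If `docker_daemon_reachable()` returns False on an external instance, check that the user running the eINS has permission to access the Docker socket.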
Clone the eINS git repository
On the Amazon ECS external instance, clone the ecs-external-instance-network-sentry repository:
Commands from this point forward assume that you’re in the root directory of the local git repository clone.
At this point, the external instance host operating system is ready to run the eINS. For testing or evaluation, the application can be launched manually according to the below procedure. However, it is recommended to configure the eINS as a Background Service in production deployment scenarios to ensure that the application is running at all times.
The eINS is located within the /python directory of the git repository. See the Configuration Parameters section for required and optional parameters to be submitted at runtime, and Logging to validate successful operation. Remember to provide the correct AWS region code:
Configuring the application as an OS background service is an effective mechanism to ensure that the eINS remains running in the background at all times.
Service configuration requires the implementation of a unit configuration file, which encodes information about the process that will be controlled and supervised by systemd.
The following describes configuring the eINS as an OS background service.
Copy application and configuration files
Run the following commands to copy application and configuration files to the appropriate locations on the external instance file system:
Update service unit configuration file
Next, update the service unit configuration file /lib/systemd/system/ecs-external-instance-network-sentry.service.
Make the necessary modifications to the ExecStart directive on line 11 of the service unit config file as follows:
- Update the --region configuration parameter with the AWS region name where your on-region Amazon ECS cluster is provisioned.
- Optionally, include any additional Configuration Parameters to suit the particular requirements of your deployment scenario.
Configure and start service
Check service status
To validate that the service has started successfully, run the following command. If the service has started correctly, the output should be similar to the following:
The eINS has been configured to provide basic logging regarding its operation.
The default logfile location is /tmp/ecs-external-instance-network-sentry.log, which can be modified by submitting the --logfile configuration parameter.
By default, the loglevel is set to logging.INFO and can be updated at runtime using the --loglevel configuration parameter.
The following eINS log file excerpt illustrates:
- A detected loss of connectivity to on-region control-plane, and associated Docker policy configuration actions for Amazon ECS managed containers;
- Container cleanup and Docker policy configuration once Amazon ECS control-plane becomes reachable.
The log file rotates at 5 MB, and a history of the five most recent log files is maintained.
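The rotation behavior described here corresponds to what Python's standard `RotatingFileHandler` provides. The sketch below shows one way such a logger could be configured; it is not taken from the eINS source. The logger name, the message format, and the interpretation of the size limit as 5 * 1024 * 1024 bytes are assumptions.

```python
import logging
from logging.handlers import RotatingFileHandler


def build_logger(logfile, level=logging.INFO):
    """Create a logger that rotates at about 5 MB and keeps five old files."""
    logger = logging.getLogger("eins-sketch")  # hypothetical logger name
    logger.setLevel(level)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = RotatingFileHandler(
        logfile,
        maxBytes=5 * 1024 * 1024,  # rotate once the file reaches this size
        backupCount=5,             # keep logfile.1 through logfile.5
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

With this configuration, the active file is renamed with a `.1` suffix when it reaches the size limit, older backups shift up by one, and anything beyond the fifth backup is discarded.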
The eINS currently has the following limitation:
- As described in the Disconnected Operation section, containers restarted during a period where the Amazon ECS control-plane is unavailable will be stopped and relaunched once the Amazon ECS control-plane becomes available.
In order to avoid incurring future costs associated with this solution, follow this procedure to deregister your external instance from both Amazon ECS and AWS Systems Manager.
Following deregistration, the external instance is no longer able to accept new tasks. If you have tasks that are running on the external instance when you deregister it, the tasks remain running until they stop through some other means. However, these tasks are no longer monitored or accounted for by Amazon ECS.
In this post, we have provided a detailed overview of the open source Amazon ECS External Instance Network Sentry, and we’ve shown you how to implement the eINS as an operating system background service on your ECS Anywhere external instances.
If you are deploying your workloads to Amazon ECS Anywhere external instances and require enhanced workload resiliency during periods where the on-region control plane isn’t contactable, then the eINS is an open source solution that provides enhanced availability. This might include external instances deployed in well-connected but mission-critical situations (e.g., data center, warehouse, manufacturing plant) or environments where internet connectivity may be more unreliable (e.g., maritime or rural use cases). To learn more, see ECS Anywhere in the Amazon ECS Developer Guide, and we encourage you to give it a try with the ECS Anywhere workshop as a next step.
A public version of our container services feature roadmap is available online. We know that our customers are making decisions and plans based on what we are developing, and we want to provide customers with the insights needed to appropriately plan for the future. If there are any features that you would like to be available, which are not currently on the feature roadmap, then please open an issue! Community submitted issues will be tagged “Proposed” and will be reviewed by the AWS team. You can read more information about how to contribute here.