Containers

Improvements to Amazon ECS task launch behavior when tasks have prolonged shutdown

Amazon Elastic Container Service (Amazon ECS) now launches tasks faster on container instances that are running tasks that have a prolonged shutdown period. This enables customers to scale their workloads faster and improve infrastructure utilization.

About Amazon ECS scheduling

Amazon ECS is a container orchestrator that’s designed to launch and track application containers across the Amazon Elastic Compute Cloud (Amazon EC2) capacity of an entire AWS Region. While other container orchestrators give each customer their own dedicated control plane, the Amazon ECS control plane is designed with extreme efficiency in mind. Under the hood of the Amazon ECS control plane is a shared-tenancy scheduler that serves hundreds of thousands of customers who are collectively launching many thousands of tasks every second.

For Amazon ECS to work at such incredible scale, it is important to make sure that local state is properly synced to the control plane’s overall understanding of every Amazon EC2 instance under its management. Amazon ECS uses a local agent that runs on your Amazon EC2 instance or AWS Fargate task. This agent has the most immediate and accurate information about what is happening on your compute: it knows how much CPU and memory is available, and it knows when a container starts or stops. It communicates this information back to the Amazon ECS control plane, which collects it from every agent that you run. By keeping track of all of this information, the control plane has a total overview of the state of all of your Amazon EC2 instances. This allows Amazon ECS to make appropriate decisions about how to keep your services up and running.

For example, Amazon ECS can see if an Amazon EC2 instance is terminated, and reacts by relaunching any application containers that were running on that host onto other Amazon EC2 instances that run the Amazon ECS agent. Or it can see that one of your applications has crashed and therefore needs to be relaunched. If necessary, it can choose a different host in a different Availability Zone to launch the replacement on.

Keeping track of the current state of every Amazon EC2 instance in your cluster is a difficult, distributed problem that has unique challenges for optimization. In this article, we dive deeper into a recent improvement to how Amazon ECS handles resource state.

Expanding Amazon ECS to more workload types

At its launch in 2015, Amazon ECS was initially designed for the use case of running large numbers of lightweight microservice containers. These containers are typically designed to be stateless, fast to start, and fast to stop. Since the containerized application is just answering HTTP requests with fast and light responses, it can be stopped at any time without serious consequences.

Amazon ECS stops containers when an Amazon ECS service is being scaled in, or during a rolling deployment. To stop a container, the control plane sends an instruction to the Amazon ECS agent on the host, telling it to stop a local container. The agent then sends a Linux SIGTERM process signal to the container to warn it that it needs to terminate. Many applications respond to this signal by shutting down cleanly within a second or two. In some cases, the application ignores the signal. If the SIGTERM signal is ignored, the Amazon ECS agent waits 30 seconds and then sends a SIGKILL signal, which triggers a hard force quit that terminates the application process.
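
As an illustration, here is a minimal sketch (not taken from any particular ECS workload) of a containerized worker that traps SIGTERM so it can finish its current unit of work and exit cleanly before the agent escalates to SIGKILL:

```python
import signal
import sys
import time

shutting_down = False

def handle_sigterm(signum, frame):
    """Mark the worker for shutdown when the ECS agent sends SIGTERM."""
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

while not shutting_down:
    # Do one small unit of work per loop iteration (placeholder).
    time.sleep(1)

# Flush any remaining state and exit cleanly before the agent sends SIGKILL.
sys.exit(0)
```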

Over time we began to see another significant use case for Amazon ECS: long-running batch workloads, like machine learning or video processing. These workloads tend to deal with large amounts of data and do heavy computation against that data. The workload may or may not be safe to stop and restart. Some workloads, if stopped in the middle of doing work, have to be restarted from scratch, potentially losing hours of progress. Others can save a checkpoint file that work can resume from. However, 30 seconds may not be enough time to generate and save a large checkpoint file to durable storage.

For this reason, we launched the ability to extend the stop timeout for tasks that go into a stopping state. This allows you to configure up to a 2-minute wait period on AWS Fargate, or as long as you want on Amazon EC2. If the Amazon ECS agent sends a SIGTERM signal to your process, but your process chooses to ignore this signal because it is busy working, then the Amazon ECS agent waits for the stop timeout period before it sends the SIGKILL. For some customers this could be several hours, or even an entire day. This allows customers running heavy computing workloads to use Amazon ECS without worrying that an autoscaling or deployment action will interrupt in-progress work.
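
The stop timeout is configured per container with the stopTimeout parameter in the task definition. As a rough sketch (the family name and container image below are hypothetical), registering such a task definition with boto3 might look like this:

```python
import boto3

ecs = boto3.client("ecs")

# stopTimeout is the number of seconds the agent waits after SIGTERM before
# sending SIGKILL (up to 120 seconds on AWS Fargate, longer on Amazon EC2).
ecs.register_task_definition(
    family="batch-processing-task",  # hypothetical family name
    requiresCompatibilities=["EC2"],
    containerDefinitions=[
        {
            "name": "worker",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/worker:latest",  # hypothetical image
            "cpu": 1024,
            "memory": 2048,
            "essential": True,
            "stopTimeout": 600,  # give the container 10 minutes to shut down
        }
    ],
)
```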

The problem

When Amazon ECS launched in 2015, longer stop timeouts didn’t exist. The longest period that a task could take to stop was 30 seconds, and most tasks took far less time than that. We made the decision to have the control plane optimistically schedule new tasks onto hosts that had stopping tasks.

For example, if an instance is running a container that reserves 2 vCPU and 4096 MB of memory, but that task is currently being stopped, then the 2 vCPU and 4096 MB of memory are considered to be available soon. The control plane can instruct the agent to queue up another task to launch into that 2 vCPU and 4096 MB, since these resources will shortly be free. Or Amazon ECS might queue up two tasks, each consuming 1 vCPU and 2048 MB of memory. In the vast majority of cases, this behavior makes Amazon ECS feel a bit faster to respond to changes, and faster to relaunch tasks onto capacity as soon as that capacity is available. It also contributes to increased task density: rather than spreading tasks sparsely across two different Amazon EC2 instances, this queuing behavior helps Amazon ECS pack tasks more densely onto a single instance.
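
To make the accounting concrete, here is a toy sketch of the optimistic view described above. This is purely an illustration, not the actual Amazon ECS scheduler code:

```python
# An instance with 2 vCPU (2048 CPU units) and 4096 MB of memory.
instance_capacity = {"cpu": 2048, "memory": 4096}

# One task reserves the whole instance, but it is currently being stopped.
tasks = [
    {"cpu": 2048, "memory": 4096, "status": "STOPPING"},
]

# In the optimistic view, a STOPPING task's resources count as "available soon".
reserved_cpu = sum(t["cpu"] for t in tasks if t["status"] != "STOPPING")
reserved_mem = sum(t["memory"] for t in tasks if t["status"] != "STOPPING")

available_soon_cpu = instance_capacity["cpu"] - reserved_cpu
available_soon_mem = instance_capacity["memory"] - reserved_mem

# Two 1 vCPU / 2048 MB tasks fit into that soon-to-be-free capacity,
# so they can be queued up behind the stopping task.
queued_tasks = [
    {"cpu": 1024, "memory": 2048},
    {"cpu": 1024, "memory": 2048},
]

fits = (
    sum(t["cpu"] for t in queued_tasks) <= available_soon_cpu
    and sum(t["memory"] for t in queued_tasks) <= available_soon_mem
)
print(fits)  # True
```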

However, with a longer stop timeout period this queueing behavior breaks down, because Amazon ECS could queue a task up behind a task that takes hours to finish its work. Additionally, the Amazon ECS agent was initially designed to do operations in a serial fashion when there is a stopping container. If a task was in a STOPPING state, the agent would wait until the task had fully stopped before launching additional tasks on the host, to ensure that there were no race conditions or double-booked resources. This meant that a single stopping container could cause multiple other container launches to queue up and wait, even if the underlying Amazon EC2 instance had plenty of resources available to run additional containers.

These two behaviors meant that configuring a stop timeout could lead to delayed task launches if you operated a cluster that mixed long-running batch workloads with lightweight microservice containers.

What we are changing

Today we are announcing an improvement to the Amazon ECS agent. The ECS agent will now launch tasks onto an Amazon EC2 host even if there is a stopping task on the host. We have made improvements to the agent’s ability to do multiple concurrent operations on the instance without creating race conditions. This allows a stopping task to keep its resources reserved, while the rest of the instance capacity is treated as independently available for concurrent task launches.

To benefit from this change, you need to ensure that your Amazon EC2 instances are running the latest version of the Amazon ECS agent. You can launch fresh Amazon EC2 instances using the latest version of the Amazon ECS optimized Amazon Machine Image (AMI). Or you can use the instance refresh feature of Amazon EC2 Auto Scaling groups to update all your Amazon EC2 instances to the new AMI. If you are building your own customized AMI, then you need to update the Amazon ECS agent to the latest version inside of your custom AMI.
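
As a rough sketch of that workflow, you could look up the latest Amazon ECS optimized AMI from its public SSM parameter and then trigger an instance refresh with boto3. The Auto Scaling group name below is hypothetical, and this assumes you first update the group's launch template to use the new AMI ID:

```python
import boto3

ssm = boto3.client("ssm")
autoscaling = boto3.client("autoscaling")

# Look up the latest Amazon ECS optimized AMI (Amazon Linux 2 variant shown here).
ami_id = ssm.get_parameter(
    Name="/aws/service/ecs/optimized-ami/amazon-linux-2/recommended/image_id"
)["Parameter"]["Value"]
print("Latest ECS optimized AMI:", ami_id)

# After pointing the launch template at ami_id, roll the instances in the
# Auto Scaling group so every host picks up the new AMI and agent.
autoscaling.start_instance_refresh(
    AutoScalingGroupName="my-ecs-cluster-asg"  # hypothetical Auto Scaling group name
)
```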

For the vast majority of Amazon ECS customers, there’s either no noticeable difference or a slight performance improvement, with tasks in a rolling deployment launching a little bit faster than they did before. However, for some customers who used extremely long stop timeouts, there is a significant improvement in the rate of task launches, and a reduction in hung tasks that wait in a PENDING state for a long time. This is because the Amazon ECS agent now continues to launch tasks around a STOPPING task.

More to come

There are additional improvements planned. Today the Amazon ECS control plane still optimistically queues tasks up to launch after a task that is currently stopping. If you use an extremely long stop timeout period, and are running a cluster that is extremely close to full (that is, no unreserved capacity on any Amazon EC2 instance in the cluster), you may still see Amazon ECS queue up a PENDING task that waits behind a stopping task. We plan to make optimistic scheduling smarter, so that it avoids queuing behind tasks with a long stop timeout, and/or give you the ability to turn this optimistic queueing feature off entirely for your cluster.

Today’s launch already avoids almost all issues with long stop timeouts; however, there are two ways that you can ensure that stopping tasks never cause undesirable task queuing:

  • Use the task protection endpoint inside of your process, instead of long stop timeouts. Task protection gives the Amazon ECS agent and control plane enhanced information about whether your task is doing important work that should not be interrupted. This gives the control plane the information it needs to avoid even attempting to stop a task that is busy. When you use the task protection endpoint properly, you avoid situations where the task would be placed into a STOPPING state for an extended period of time while it still has work to do. The task remains in a RUNNING state, and other tasks won’t be queued up behind a RUNNING task. (See the sketch after this list.)
  • Run tasks that have a long stop timeout in a separate Amazon ECS cluster: one cluster for long-running workloads that have an extended stop timeout, and another cluster for lightweight tasks that all stop quickly.
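
For reference, a minimal sketch of calling the task scale-in protection endpoint from inside a task might look like the following. It assumes the requests library is available in the container image, and it uses the task protection endpoint that the agent exposes through the ECS_AGENT_URI environment variable:

```python
import os
import requests

# The Amazon ECS agent injects ECS_AGENT_URI into every task's environment.
agent_uri = os.environ["ECS_AGENT_URI"]

def set_task_protection(enabled: bool, expires_in_minutes: int = 60) -> None:
    """Toggle scale-in protection for this task via the agent's local endpoint."""
    body = {"ProtectionEnabled": enabled}
    if enabled:
        body["ExpiresInMinutes"] = expires_in_minutes
    response = requests.put(
        f"{agent_uri}/task-protection/v1/state",
        json=body,
        timeout=5,
    )
    response.raise_for_status()

# Protect the task before starting a long unit of work...
set_task_protection(True, expires_in_minutes=120)
# ...do the work, then release protection so ECS can stop the task
# during scale-in or a rolling deployment.
set_task_protection(False)
```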

Conclusion

This improvement is part of an ongoing focus this year to address top issues on our public GitHub roadmap. We value your feedback, and welcome you to submit any additional feature requests or improvements as issues on the roadmap.

Nathan Peck

Developer advocate for serverless containers at Amazon Web Services, working on Amazon Elastic Container Service and AWS Fargate. Likes building startup architecture and microservices.