Containers

Deep Dive on Amazon ECS Cluster Auto Scaling

Introduction

Up until recently, ensuring that the number of EC2 instances in your ECS cluster would scale as needed to accommodate your tasks and services could be challenging.  ECS clusters could not always scale out when needed, and scaling in could impact availability unless handled carefully. Sometimes, customers would resort to custom tooling such as Lambda functions, custom metrics, and other heavy lifting to address the challenges, but there was no single approach that could work in all situations. Of course, running your tasks on Fargate instead of EC2 instances eliminates the need for scaling clusters entirely, but not every customer is ready or able to adopt Fargate for all of their workloads.

ECS Cluster Auto Scaling (CAS) is a new capability for ECS to manage the scaling of EC2 Auto Scaling groups (ASGs). With CAS, you can configure ECS to scale your ASG automatically, and just focus on running your tasks. ECS will ensure the ASG scales in and out as needed with no further intervention required. CAS relies on ECS capacity providers, which provide the link between your ECS cluster and the ASGs you want to use. Each capacity provider is associated with exactly one ASG, but many capacity providers can be associated with a single ECS cluster. In order to scale the entire cluster automatically, each capacity provider manages the scaling of its associated ASG.

One of our goals in launching CAS is that scaling ECS clusters “just works” and you don’t have to think about it. However, you might still want to know what is happening behind the scenes. In this blog post, I’m going to deep dive on exactly how CAS works.

Design goals

Based on the feedback we had received from customers, we set out with three main design goals for CAS.

Design goal #1: CAS should scale the ASG out (adding more instances) whenever there is not enough capacity to run the tasks the customer is trying to run.

Design goal #2: CAS should scale in (removing instances) only if it can be done without disrupting any tasks (other than daemon tasks).

Design goal #3: customers should maintain full control of their ASGs, including the ability to set the minimum and maximum size, use other scaling policies, configure instance types, etc.

As we show later in this blog post, the design of CAS meets all three of these goals.

The core scaling logic

The core responsibility of CAS is to ensure that the “right” number of instances are running in an ASG to meet the needs of the tasks assigned to that ASG, including tasks already running as well as tasks the customer is trying to run that don’t fit on the existing instances. Let’s call this number M. Let’s also call the number of instances already running in the ASG N. We’ll make extensive use of M and N throughout the rest of the blog post, so it’s important to have a completely clear understanding of how to think about them. For now, we haven’t explained how we know what M should be, but for the purposes of discussion, let’s assume that M is what you need. Given this assumption, if N = M, scaling out is not required, and scaling in isn’t possible. On the other hand, if N < M, scale out is required because you don’t have enough instances. Lastly, if N > M, scale in is possible (but not necessarily required) because you have more instances than you need to run all of your ECS tasks. As we will see later, we also define a new CloudWatch metric based on N and M, called CapacityProviderReservation. Given N and M, this metric has a very simple definition:

CapacityProviderReservation = M / N X 100
To put it in plain language, the metric is the ratio of how big the ASG needs to be relative to how big it actually is, expressed as a percentage.  As explained later in this blog, this metric is used by CAS to control the scaling of the ASG.  In the formula above, the number M is the part that CAS controls; in turn, M is driven by the customer’s tasks (both already running and waiting to run). How M is calculated is key to how CAS actually does the scaling.

In order to determine M, we need to have a concept of tasks that the customer is trying to run that don’t fit on existing instances. To achieve this, we adapted the existing ECS task lifecycle. Previously, tasks would either run or not, depending on whether capacity was available. Now, tasks in the provisioning state include tasks that could not find sufficient resources on the existing instances. This means, for example, that if you call the RunTask API and the tasks don’t get placed on an instance because of insufficient resources (meaning no active instances had sufficient memory, vCPUs, ports, ENIs, and/or GPUs to run the tasks), instead of failing immediately, the tasks will go into the provisioning state (note, however, that the transition to provisioning only happens if you have enabled managed scaling for the capacity provider; otherwise, tasks that can’t find capacity will fail immediately, as they did previously). As more instances become available, tasks in the provisioning state will get placed onto those instances, reducing the number of tasks in provisioning. In some sense, you can think of the provisioning tasks as a queue: tasks that can’t be placed due to insufficient resources get added to the queue, and as more resources become available, tasks get removed from the queue.

At the present time, a maximum of 100 tasks can be in the provisioning state for any cluster, and provisioning tasks will wait for capacity for between 10 and 30 minutes before transitioning to “stopped.”
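
To make this concrete, here’s a minimal sketch using boto3 that runs tasks with a capacity provider strategy and then inspects their lifecycle state (the cluster, task definition, and capacity provider names are placeholders). With managed scaling enabled, tasks that can’t find room on the existing instances should report a lastStatus of PROVISIONING rather than failing:

```python
import boto3

ecs = boto3.client("ecs")

# Run tasks using a capacity provider strategy. With managed scaling
# enabled, tasks that don't fit on the existing instances enter the
# provisioning state instead of failing immediately.
response = ecs.run_task(
    cluster="my-cluster",            # placeholder
    taskDefinition="my-task-def:1",  # placeholder
    count=10,
    capacityProviderStrategy=[
        {"capacityProvider": "my-capacity-provider", "weight": 1},
    ],
)

# Inspect the lifecycle state of the tasks we just launched.
task_arns = [task["taskArn"] for task in response["tasks"]]
described = ecs.describe_tasks(cluster="my-cluster", tasks=task_arns)
for task in described["tasks"]:
    print(task["taskArn"], task["lastStatus"])  # e.g. PROVISIONING, PENDING, RUNNING
```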

Given this new task lifecycle behavior, how does CAS determine the desired number of instances M? At a high level, the logic is quite simple:

  1. If every instance is running at least one task (not including daemon service tasks), and there are no tasks in the provisioning state, then M = N. (We exclude daemon service tasks because we don’t want scaling to be driven by daemon service tasks, which are supposed to run on every instance. Otherwise you could end up with an endless scale out). Figure 1 shows a graphical example.

    Figure 1. The ASG has three instances (purple boxes, N = 3), each running non-daemon tasks (green boxes). There are no provisioning tasks. Because no more instances are needed, but no instances can be terminated without disrupting existing tasks, M = N = 3.

  2. If there is at least one task in the provisioning state, then M > N. We describe in more detail below exactly how M is calculated in this case. Figure 2 shows a graphical example.

    Figure 2. The existing instances (N = 3) have no more room for the three provisioning tasks. In this case more instances are needed to run the provisioning tasks, so M > 3; more work is needed to determine a desirable value for M.

  3. If at least one instance is not running any tasks (other than daemon service tasks), and there are no tasks in the provisioning state, then M < N. More specifically, M = the number of instances running at least one task (again, we exclude daemon service tasks because they are supposed to run on every instance. Scale in would never happen if we included daemon services). Figure 3 shows a graphical example.

    Figure 3. The green boxes represent non-daemon tasks, and the blue boxes represent daemon tasks. The existing instances are not all running non-daemon tasks. The third instance can be terminated without disrupting any non-daemon tasks, so M = 2.

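Here is a rough, self-contained sketch of this three-case logic (illustrative only, not the actual service implementation; the scale-out case simply steps up by one instance here, standing in for the lower-bound estimate described in the next section):

```python
def desired_instance_count(n, instances_with_tasks, provisioning_task_count):
    """Toy sketch of the three CAS cases; task counts exclude daemon tasks.

    n: current number of instances in the ASG (N)
    instances_with_tasks: instances running at least one non-daemon task
    provisioning_task_count: tasks waiting in the provisioning state
    """
    if provisioning_task_count > 0:
        # Case 2: M > N. The real value comes from the lower-bound
        # estimate described below; a single-instance step stands in here.
        return n + 1
    if instances_with_tasks == n:
        # Case 1: every instance is running at least one task, so M = N.
        return n
    # Case 3: some instances run no non-daemon tasks, so M < N.
    return instances_with_tasks
```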

Let’s think more about how M is calculated if there is at least one task in the provisioning state. We know that M should be bigger than N, but how much bigger? Ideally, CAS would calculate a value for M that is optimal – that is, no bigger and no smaller than it needs to be to run all provisioning tasks. Unfortunately, this is often impractical or impossible – the tasks might all have different resource requirements, placement constraints and placement strategies, and the ASG might even have multiple instance types with different vCPU, memory, and other resources available. Although a full discussion of this topic would require a graduate course in mathematical optimization, for now we can just say that any algorithm that can solve for an optimal value of M with all possible variations of tasks and instances is not computationally feasible.

Since we can’t in general know the optimal value of M, CAS instead tries to make a good estimate. CAS can estimate a lower bound on the number of additional instances needed, based on the instance types that the ASG is configured to use, and set M to N plus that estimate. In other words, you will need at least M - N more instances to run all of the provisioning tasks. CAS calculates M in this case as follows:

For ASGs configured to use a single instance type

  1. Group all of the provisioning tasks so that each group has the exact same resource requirements.
  2. Fetch the instance type and its attributes that the ASG is configured to use.
  3. For each group with identical resource requirements, calculate the number of instances required if a binpack placement strategy were used (placement strategies can’t change the lower bound of the number of instances required, only the distribution of tasks on those instances). This calculation accounts for vCPU, memory, ENI, ports, and GPUs of the tasks and the instances. Task placement constraints are considered.  However, any placement constraint other than distinctInstance is not recommended.
  4. Calculate the number of additional instances needed as the maximum value of step 3 across all task groups.
  5. Finally, set M to N plus the value from step 4, subject to N + minimumScalingStepSize <= M <= N + maximumScalingStepSize. (These two parameters are defined in the capacity provider configuration.)

For ASGs configured to use multiple instance types

  1. Group all of the provisioning tasks so that each group has the exact same resource requirements.
  2. Fetch the instance types and their attributes that the ASG is configured to use.
    • A. Sort these instance types by each attribute, i.e., vCPU, memory, ENI, ports, and GPU.
    • B. Select the largest instance type for each attribute.
  3. For each group with identical resource requirements, calculate the number of instances required based on each of the largest instance types identified in step 2 if a binpack placement strategy were used (placement strategies can’t change the lower bound of the number of instances required, only the distribution of tasks on those instances). This calculation accounts for vCPU, memory, ENI, ports, and GPUs of the tasks and the instances. Task placement constraints are considered.  However, any placement constraint other than distinctInstance is not recommended.
  4. For each task group, take the minimum instance count across the instance types selected in step 2; the number of additional instances needed is the maximum of those values across all task groups.
  5. Finally, set M to N plus the value from step 4, subject to N + minimumScalingStepSize <= M <= N + maximumScalingStepSize. (These two parameters are defined in the capacity provider configuration.)

This algorithm results in M generally being a lower bound on the number of instances needed, and in some cases it will actually be the exact number of instances needed – for example, if all of the provisioning tasks are identical, your ASG is configured to use a single instance type, and your tasks have no placement constraints, then this algorithm results in exactly the right number of instances needed (assuming M falls within the bounds defined in step 5).
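
To illustrate, here’s a hedged sketch of the single-instance-type calculation, considering only vCPU and memory (ENIs, ports, GPUs, and placement constraints are omitted for brevity, every task is assumed to fit on the instance type, and the example numbers are made up):

```python
import math

def estimate_m(n, task_groups, instance_cpu, instance_mem, min_step, max_step):
    """Lower-bound estimate of M for a single-instance-type ASG.

    task_groups: list of (count, cpu, mem) tuples, one per group of
    provisioning tasks with identical resource requirements.
    """
    extra = 0
    for count, cpu, mem in task_groups:
        # Binpack: how many tasks of this shape fit on one instance?
        fit_per_instance = min(instance_cpu // cpu, instance_mem // mem)
        extra = max(extra, math.ceil(count / fit_per_instance))
    # Clamp: N + minimumScalingStepSize <= M <= N + maximumScalingStepSize.
    return min(max(n + extra, n + min_step), n + max_step)

# Example in the spirit of Figure 2 (sizes are made up): three provisioning
# tasks needing 1 vCPU and 2 GiB each, an ASG of 4-vCPU/16-GiB instances,
# and N = 3 means one additional instance is needed, so M = 4.
print(estimate_m(3, [(3, 1, 2)], 4, 16, 1, 10))  # -> 4
```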

If M turns out not to be enough instances to run all of the provisioning tasks, all is not lost. As we demonstrate later, with a target capacity of 100, the ASG will scale out to M instances. Some, but not all, of the provisioning tasks will get placed on the new instances. Since M wasn’t enough instances, there will still be some additional tasks in provisioning. This will kick off another scaling action, where a new value of M is computed; this process will repeat until there are no more tasks in provisioning. Ideally, this process completes in one step, since each scaling action takes time and the ASG reaches the correct size more quickly when CAS can scale there directly. However, even if it takes multiple steps, the ASG will still eventually reach the correct size.

What if your ASG isn’t confined to a single Availability Zone? In that case, the estimates described above aren’t necessarily lower bounds, so CAS falls back to a much simpler approach: M = N + minimumScalingStepSize. While this may be less efficient, the ASG will still reach the correct size eventually.

AWS Auto Scaling and the scaling metric

Once CAS has determined M, why don’t we just directly set the desired capacity of the ASG (in other words, just force an update to N so that N = M)? The reason is that this would not allow us to achieve design goal #3. Directly setting the desired capacity would override any other scaling policies in place, and would require that you hand over all scaling completely to CAS. So, in order to achieve all three design goals, CAS relies on AWS Auto Scaling in addition to instance termination protection. More specifically, when you enable managed scaling and managed termination protection with an ASG capacity provider, ECS does the following for you:

  1. Creates a scaling plan for the ASG.
  2. Creates a target tracking scaling policy and attaches it to the scaling plan. The scaling policy uses a new CloudWatch metric called CapacityProviderReservation that ECS publishes for every ASG capacity provider that has managed scaling enabled (You can use additional scaling policies with the same ASG by attaching them to the scaling plan, and you can even use EC2 predictive scaling).
  3. Begins publishing the CapacityProviderReservation metric periodically (one-minute frequency).
  4. Manages instance termination protection to prevent instances running non-daemon tasks from being terminated due to ASG scaling.
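
For example, here’s what enabling these features might look like when creating a capacity provider and associating it with a cluster (a sketch using boto3; the names and ASG ARN are placeholders):

```python
import boto3

ecs = boto3.client("ecs")

# Create a capacity provider with managed scaling and managed termination
# protection. The latter requires the ASG itself to have instance
# scale-in protection enabled for new instances.
ecs.create_capacity_provider(
    name="my-capacity-provider",  # placeholder
    autoScalingGroupProvider={
        "autoScalingGroupArn": "arn:aws:autoscaling:us-east-1:123456789012:"
        "autoScalingGroup:example-uuid:autoScalingGroupName/my-asg",  # placeholder
        "managedScaling": {
            "status": "ENABLED",
            "targetCapacity": 100,  # target value for the scaling policy
            "minimumScalingStepSize": 1,
            "maximumScalingStepSize": 100,
        },
        "managedTerminationProtection": "ENABLED",
    },
)

# Associate the capacity provider with a cluster and make it the default.
ecs.create_cluster(
    clusterName="my-cluster",  # placeholder
    capacityProviders=["my-capacity-provider"],
    defaultCapacityProviderStrategy=[
        {"capacityProvider": "my-capacity-provider", "weight": 1},
    ],
)
```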

The purpose of the CapacityProviderReservation metric is to control the number of instances in the ASG, while also allowing other scaling policies to work with the ASG. In other words, if you aren’t using any other scaling policies, then the desired count of the ASG should be M (the number of instances CAS has determined are needed). Recall that N is the number of instances already up and running in the ASG. In order to convert M and N into a metric that is compatible with target tracking scaling, we must obey the requirement that the “metric value must increase or decrease proportionally to the number of instances in the Auto Scaling group.” With this requirement in mind, our formula for CapacityProviderReservation is (as stated previously):

CapacityProviderReservation = M / N X 100
There are a few special cases where this formula is not used. If M and N are both zero, meaning no instances, no running tasks, and no provisioning tasks, then CapacityProviderReservation = 100.  If M > 0 and N = 0, meaning no instances, no running tasks, but at least one provisioning task, then CapacityProviderReservation = 200. (Target tracking scaling has a special case for scaling from zero capacity, where it assumes for the purposes of scaling that the current capacity is one and not zero).

Let’s look at the scenarios in Figures 1, 2, and 3 again.

  • In Figure 1, CapacityProviderReservation = 3/3 X 100 = 100.
  • In Figure 2, let’s suppose that M = 4, because we need one additional instance to run the three provisioning tasks.  Then, CapacityProviderReservation = 4/3 X 100 = 133.
  • In Figure 3, CapacityProviderReservation = 2/3 X 100 = 66.
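
These values fall straight out of the formula and its special cases, which can be expressed compactly (a sketch, not the service’s actual code):

```python
def capacity_provider_reservation(m, n):
    """CapacityProviderReservation as defined above, with special cases."""
    if n == 0:
        # No instances yet: 100 if nothing needs to run, 200 if at least
        # one task is provisioning (target tracking treats the current
        # capacity as one, not zero, when scaling from zero).
        return 200 if m > 0 else 100
    return 100 * m / n

print(capacity_provider_reservation(3, 3))  # Figure 1 -> 100.0
print(capacity_provider_reservation(4, 3))  # Figure 2 -> ~133.3
print(capacity_provider_reservation(2, 3))  # Figure 3 -> ~66.7
```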

Target tracking scaling policy

Target tracking scaling policies manage the capacity of your ASG. Given a metric and a target value for that metric, the scaling policy will increase or decrease the size of the ASG (in other words, adjust N) as the metric increases and decreases, with the goal of keeping the metric close to or equal to the target value. The scaling behavior is built on the assumption that the “metric value must increase or decrease proportionally to the number of instances in the Auto Scaling group.” CapacityProviderReservation is designed to satisfy this assumption.

Given a target value of 100 for CapacityProviderReservation, the scaling policy will adjust the ASG size (N) up or down until N = M. To see why, note that the equation CapacityProviderReservation = target value (or equivalently M / N X 100 = 100) holds only if N = M. If M changes, because you either try to run more tasks or shut down existing tasks, the scaling policy will adjust N to keep it equal to M. Scaling to and from zero is even possible: if M = 0, meaning no tasks other than daemon service tasks are running, then N will adjust down to 0 also. Likewise, if N = 0 and M > 0, meaning tasks are provisioning but no instances are running, then CapacityProviderReservation = 200 and N will adjust upwards to add instances to the ASG.

Target values less than 100 enable spare capacity in the ASG. For example, if you set the target value to 50, the scaling policy will try to adjust N so that the equation M / N X 100 = 50 is true. (It’s important to note here that M, which is CAS’ estimate of how many instances are needed to run all of the tasks, is not based on the target value of the scaling policy.) Doing a little algebra, we see that N = 2 X M. In other words, with a target value of 50, the scaling policy will adjust N until it is exactly twice the number of instances that CAS has estimated are needed to run all of the tasks. This means that half of the instances will not be running any tasks. These instances are available for running additional tasks immediately, without having to add instances before starting the tasks. Once those spare instances are running tasks, the scaling policy will adjust N again (making it bigger) to maintain N = 2 X M. Likewise, if instances that were running tasks become idle, N will adjust downwards accordingly.

More generally, the smaller the target value, the more spare capacity you will have available in your ASG.  For example, a target value of 10 means that the scaling policy will adjust N (within the limits available) so that about 90% of your ASG’s instances will not be running any tasks, regardless of how many tasks you run.  Note that if you use a target value less than 100, scaling to zero is not possible, because the goal of maintaining spare capacity is not compatible with scaling to zero.

An important point to note about target tracking scaling policies is that they cannot always guarantee the metric is exactly equal to the target value. For example, if the target value is 75, and M = 10 instances, it is not possible for M / N X 100 to equal 75, since N must be a whole number. Instead, the scaling policy will adjust N to achieve a value close to the target value, with a preference for the metric to be less than the target value if possible.
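
One way to see the steady-state behavior is to compute the smallest whole N for which the metric does not exceed the target value (a sketch based on the stated preference for the metric to sit at or below the target):

```python
import math

def steady_state_n(m, target_value):
    # Smallest whole N such that (M / N) x 100 <= target_value.
    return math.ceil(m * 100 / target_value)

print(steady_state_n(2, 50))    # 4: N = 2 x M, so half the instances are spare
print(steady_state_n(10, 75))   # 14: metric = 10/14 x 100 ~ 71.4, just under 75
print(steady_state_n(10, 100))  # 10: N = M, no spare capacity
```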

Scale in and termination protection

When the scaling policy reduces N, it is adjusting the number of instances but it has no control over which instances actually terminate. The default behavior of the ASG may well terminate instances that are running tasks, even though there are instances not running tasks. This is where managed termination protection comes into the picture.

Figure 4. Three instances, two of which are running tasks. M = 2 and N = 3, so CapacityProviderReservation = 66. (It’s important to note that, even though in this particular case the four running tasks could theoretically run on a single instance, M is computed solely based on the number of instances currently running tasks – not on a hypothetical optimal distribution of tasks on a minimal number of instances). If the target capacity is 100, then the ASG will scale in by one instance.

Consider the example shown in Figure 4. With a metric value of 66 and a target value of 100, the ASG will scale in to reduce N from 3 to 2. With no additional input, there is no way to guarantee that the instance running no tasks will be terminated; the third instance in Figure 4 may well be the instance that is terminated during the scaling action. For this reason, we implemented the option of having ECS dynamically manage instance termination protection on your behalf (thus achieving design goal #2). If enabled for a capacity provider, ECS will protect any instance from scale-in if it is running at least one non-daemon task. Note that this does NOT prevent a Spot Instance from being reclaimed, or the instance being terminated manually; it only prevents the ASG from terminating the instance due to scaling.
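
You can observe this behavior directly on the ASG: instances that ECS has protected report ProtectedFromScaleIn as true (a sketch using boto3; the ASG name is a placeholder):

```python
import boto3

autoscaling = boto3.client("autoscaling")

# List each instance in the ASG and whether it is currently protected
# from scale-in. ECS manages this flag when managed termination
# protection is enabled on the capacity provider.
response = autoscaling.describe_auto_scaling_groups(
    AutoScalingGroupNames=["my-asg"]  # placeholder
)
for instance in response["AutoScalingGroups"][0]["Instances"]:
    print(instance["InstanceId"], instance["ProtectedFromScaleIn"])
```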

Figure 5. With managed termination protection, ECS will prevent instances running non-daemon tasks from terminating due to ASG scaling in. This reduces the disruption of running tasks (design goal #2).

Scaling in action

Now that we have defined all of the pieces of CAS, let’s walk through a complete example of scaling out and scaling in.

Scaling out

Step 1. The cluster has one capacity provider, with an ASG with three instances (as shown above), all of which are running tasks. Managed scaling is enabled with a target capacity of 100, and managed termination protection is enabled. There is only one task definition running in the cluster, so all tasks have the same resource requirements. At this point, M = 3, N = 3, and CapacityProviderReservation = 100.

Step 2. RunTask is called with nine more tasks. Six of them can be placed on the existing instances, and three go to provisioning. Now, M = 4, N = 3, and CapacityProviderReservation = 133.

Step 3. Once the metric goes above the target value of 100, the scaling policy kicks in to adjust the desired count of the ASG upwards from N = 3 to N = 4. The tasks are still (briefly) in the provisioning state because ECS has not yet placed them on the instances.

Step 4. ECS recognizes that additional capacity is available, and places the provisioning tasks on the new instance.

Step 5. The metric updates, because M = 4 and N = 4, so CapacityProviderReservation = 100.  No further scaling is required.
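
If you want to watch a scaling action like this one unfold, you can pull the metric from CloudWatch (a sketch using boto3; the cluster and capacity provider names are placeholders, and the metric is assumed to live in the AWS/ECS/ManagedScaling namespace):

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

# Fetch the last 30 minutes of one-minute CapacityProviderReservation
# datapoints for a capacity provider.
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/ECS/ManagedScaling",
    MetricName="CapacityProviderReservation",
    Dimensions=[
        {"Name": "ClusterName", "Value": "my-cluster"},                    # placeholder
        {"Name": "CapacityProviderName", "Value": "my-capacity-provider"}, # placeholder
    ],
    StartTime=datetime.now(timezone.utc) - timedelta(minutes=30),
    EndTime=datetime.now(timezone.utc),
    Period=60,
    Statistics=["Average"],
)
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])
```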

Scaling in

Step 1. This is the same step 1 as the previous scenario. The cluster has one capacity provider, with an ASG with three instances (as shown above), all of which are running tasks. Managed scaling is enabled with a target capacity of 100, and managed termination protection is enabled. There is only one task definition running in the cluster, so all tasks have the same resource requirements. At this point, M = 3, N = 3, and CapacityProviderReservation = 100.

Step 2. One task is stopped (due to service scaling for example). Now, the first two instances are still protected from termination but the third is not. No scaling has been triggered yet, so all three instances are still running.

Step 3. Now that one instance is free of non-daemon tasks, the scaling metric is updated:  M=2, N=3, so CapacityProviderReservation=66.

Step 4. After 15 minutes, meaning 15 consecutive one-minute metric values of 66, the target tracking policy triggers a scale-in of the ASG. Since the third instance is not protected from scale-in, it terminates. No existing tasks were disrupted during this scale-in action.

Step 5. Now that the instance has terminated, the metric updates again:  N=2, M=2, so CapacityProviderReservation = 100. No further scaling is required.

Conclusion

In this blog post, I gave a high-level view of the design goals of ECS cluster auto scaling, and showed the details of how CAS works to achieve those goals. CAS is more than just some new APIs; it encompasses a whole new set of behaviors for ECS, and I encourage you to keep this blog post handy so that you can better understand the behavior of your clusters as they scale.

Finally, this isn’t the end of the story for CAS and capacity providers. Not only do we plan on publishing some additional deep dive posts here on the containers blog covering other aspects of ECS and capacity providers, but we also are actively working on expanding the capabilities we offer.  If you have requests for new functionality or want to see our roadmap, please visit the AWS Containers roadmap on GitHub.  Thanks!