Containers

Announcing Amazon ECS Task Scale-in protection

Introduction

We are excited to launch Amazon Elastic Container Service (Amazon ECS) task scale-in protection, a new capability that gives customers control over protecting Amazon ECS service tasks from being terminated by scale-in events from Amazon ECS service Auto Scaling or deployments. Customers can simply mark their mission-critical tasks as protected using a new Amazon ECS agent endpoint or new Amazon ECS application programming interfaces (APIs).

Background

Service Auto Scaling is an Amazon ECS feature that lets you set policies to automatically adjust the desired count of tasks in an Amazon ECS service in response to changes in traffic patterns. This enables customers to build applications that scale for peak traffic conditions and reduce compute costs during periods of low utilization. With Amazon ECS service Auto Scaling, customers can use Amazon ECS-published Amazon CloudWatch metrics that track a service’s average central processing unit (CPU) and memory usage, or use their own custom metrics to track other aspects of an application (e.g., HTTP requests received, or the number of messages retrieved from a queue or topic).

Customers have told us that certain applications require mechanisms to safeguard tasks during times of low utilization. For instance, a queue-processing asynchronous application such as a video transcoding service may have tasks that run for hours even when cumulative service utilization is low; a gaming application that runs game servers as Amazon ECS tasks may need to keep servers running even after all users have logged out, to avoid the startup latency of a server reboot; or a new code version may be deployed while some tasks have not yet finished ongoing work that would be expensive to reprocess. Until today, customers had no easy way to safeguard such tasks from being terminated by service Auto Scaling or new deployments, which resulted in complex workarounds to resume and restart jobs, or custom scaling policies that left services underutilized.

Solution overview

Introducing Task Scale-in protection

With Amazon ECS task scale-in protection, customers can now use a new attribute, protectionEnabled, to protect tasks belonging to their Amazon ECS services from termination by service deployment or Auto Scaling scale-in events. Customers can set the protectionEnabled attribute from within a container using a new Amazon ECS task scale-in protection endpoint, or by calling the new Amazon ECS UpdateTaskProtection API. Amazon ECS keeps Auto Scaling and deployment events from terminating tasks that have scale-in protection set. Once a task finishes its requisite work, customers can unset the attribute, which makes the task eligible for termination by subsequent scale-in events. This gives customers a simplified mechanism for orchestrating their long-running applications with Amazon ECS, while also benefiting from the performance and cost savings of service Auto Scaling, without needing to invest in custom tooling. Customers can use the following two mechanisms to set task scale-in protection:

  • Using the Amazon ECS agent endpoint: This is the recommended approach when a task can self-determine that it needs to be protected, and is a great fit for queue-based or job-processing workloads. When a container starts processing work, e.g., by consuming an SQS message, you can set the protectionEnabled attribute via the task scale-in protection endpoint path $ECS_AGENT_URI/task-protection/v1/state from within the container. Amazon ECS won’t terminate this task during scale-in events. After your task finishes its work, you can unset the protectionEnabled attribute using the same endpoint, which makes the task eligible for termination during subsequent scale-in events.
  • Using the Amazon ECS APIs: You can use this approach if your application has a component that tracks the status of active tasks. Using the UpdateTaskProtection API, you can mark one or more tasks as protected. An example of this approach would be if your application hosts game server sessions as Amazon ECS tasks. When a user logs in to a session on the server (i.e., the task), you can mark the task as protected. After the user logs out, you can either unset protection specifically for this task or periodically unset protection for similar tasks that no longer have active sessions, depending on whether you need to keep idle servers available. You can even combine this approach with the Amazon ECS agent endpoint to set task protection from within a container, and unset it from your external controller service.
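For the external-controller approach, the UpdateTaskProtection API takes the cluster, a list of tasks, the desired protection state, and an optional expiration. The helper below is a hypothetical sketch that only builds the parameter object; a real controller would pass the result to the AWS SDK's UpdateTaskProtection call:

```javascript
// Hypothetical helper: builds the parameter object for the
// UpdateTaskProtection API. The field names (cluster, tasks,
// protectionEnabled, expiresInMinutes) are the API's documented inputs.
function buildUpdateTaskProtectionParams(cluster, taskArns, protect, expiresInMinutes) {
  const params = {
    cluster,                     // cluster name or ARN
    tasks: taskArns,             // task IDs or full task ARNs
    protectionEnabled: protect,  // true to protect, false to unset
  };
  if (protect && expiresInMinutes !== undefined) {
    // Protection expires automatically after this many minutes
    // if the application never unsets it explicitly
    params.expiresInMinutes = expiresInMinutes;
  }
  return params;
}
```

For example, a game-session controller might call this with the task ARN of the server a user just joined, then call it again with protect set to false after the last user logs out.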

Sample applications built with Task Scale-in protection

To demonstrate the usage of Amazon ECS task scale-in protection, there are two open-source demonstration applications you can use as a reference. You can find these applications on GitHub. The GitHub repo contains prerequisite information for the sample applications as well as cleanup instructions.

SQS queue consumer

The first application is an SQS queue consumer. Without Amazon ECS task scale-in protection, Amazon ECS has no way of knowing whether your application is currently busy working on a job that it pulled off of an SQS queue. If Amazon ECS chooses to scale in a task that is currently busy, then the SQS message that it was working on is dropped. SQS handles this gracefully by delivering the message to a different queue consumer after the message’s visibility timeout expires. However, if the job being processed is long-lived, then it isn’t ideal to lose progress on work that has already been done and have to restart the job.

Instead, we use Amazon ECS Task Scale-in protection to protect the task whenever the queue consumer is working on a message from SQS. The sample code demonstrates how this works:

async function pollForWork() {
  // Protect this task before picking up a job, so a scale-in
  // event can't stop it mid-message
  console.log('Acquiring task protection');
  await TaskProtection.acquire();

  const message = await receiveMessage();

  if (message) {
    await processMessage(message);
  }

  // Work is done; make the task eligible for scale-in again
  console.log('Releasing task protection');

  await TaskProtection.release();
  return maybeContinuePolling();
}

Before grabbing a message from SQS, the worker makes an API call to set task protection on itself. It then grabs a message from SQS and processes it. After it’s done working on the message, the worker releases the task protection. By setting and then releasing task protection for each job, the worker gives Amazon ECS a chance to stop older tasks in between messages, while ensuring that Amazon ECS won’t prematurely terminate a task that is currently working on a message.
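The TaskProtection helper used above wraps the agent endpoint. The following is a minimal sketch of such a helper, not the repo's actual implementation: it assumes Node.js 18+ (for the global fetch) and uses the documented $ECS_AGENT_URI/task-protection/v1/state path, with acquire/release names mirroring the sample.

```javascript
// Minimal sketch of a TaskProtection helper built on the ECS agent's
// task scale-in protection endpoint. The endpoint accepts a PUT with a
// JSON body containing ProtectionEnabled.
const TaskProtection = {
  endpoint() {
    // ECS_AGENT_URI is injected into containers by the ECS agent
    return `${process.env.ECS_AGENT_URI}/task-protection/v1/state`;
  },
  async set(enabled) {
    const response = await fetch(this.endpoint(), {
      method: 'PUT',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ ProtectionEnabled: enabled }),
    });
    if (!response.ok) {
      throw new Error(`Task protection update failed: HTTP ${response.status}`);
    }
    return response.json();
  },
  acquire() { return this.set(true); },   // protect this task
  release() { return this.set(false); },  // allow scale-in again
};
```

Because acquire and release are just HTTP calls from inside the container, the worker needs no AWS credentials or SDK dependency to protect itself.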

When running the sample application, we see how this flow works in the application logs:

2022-11-07T17:11:18.593-05:00    Acquiring task protection
2022-11-07T17:11:18.617-05:00    Long polling for messages
2022-11-07T17:11:37.641-05:00    0744dc66-c746-4561-ad38-7c2d90717d4a - Received
2022-11-07T17:11:37.641-05:00    0744dc66-c746-4561-ad38-7c2d90717d4a - Working for 10000 milliseconds
2022-11-07T17:11:47.655-05:00    0744dc66-c746-4561-ad38-7c2d90717d4a - Done
2022-11-07T17:11:47.655-05:00    Releasing task protection
2022-11-07T17:11:47.698-05:00    Task protection released
2022-11-07T17:11:47.698-05:00    Acquiring task protection
2022-11-07T17:11:47.725-05:00    Long polling for messages
2022-11-07T17:11:47.731-05:00    c701269a-07cd-4a83-bb71-3734c5306d88 - Received
2022-11-07T17:11:47.731-05:00    c701269a-07cd-4a83-bb71-3734c5306d88 - Working for 10000 milliseconds
2022-11-07T17:11:57.739-05:00    c701269a-07cd-4a83-bb71-3734c5306d88 - Done
2022-11-07T17:11:57.739-05:00    Releasing task protection

As a test, you can send an SQS message with a body of 360000 to create a simulated job that takes six minutes to complete. The queue consumer protects itself, then picks up the job and begins its six-minute wait for the simulated job to complete. Throughout this time the task stays protected. To verify that the task is protected, use the Amazon ECS console to scale the service down to zero. Normally, without task protection, Amazon ECS would begin stopping the task immediately, whether or not it had work in progress. With task protection, the task remains running, and you’ll see a service event message like this:

(service task-protection-test-queue-consumer-Service-tjlfn1NmI0yU, taskSet ecs-svc/6163164058718610164) was unable to scale in due to (reason 1 tasks under protection)

After the six-minute simulated job finishes, the task removes its own protection, and Amazon ECS then stops the task and finalizes scaling in the service.

WebSocket service

Another use case for task protection is long-lived WebSocket services. These types of connections are generally used for low-latency, bidirectional communication, for example in chat servers or game servers. You generally don’t want to interrupt live connections, because doing so drops a player’s connection to the online game, or causes the chat client to disconnect and reconnect.

The sample application code solves this by counting the number of concurrent connected clients, and setting task protection whenever there are connected clients. You can see this in action in the application logs:

2022-11-07T16:53:26.124-05:00    New client connection opened. There are 1 connections
2022-11-07T16:53:26.170-05:00    Task protection acquired
2022-11-07T16:53:28.153-05:00    received: ping
2022-11-07T16:53:30.153-05:00    received: ping
2022-11-07T16:53:32.156-05:00    received: ping
2022-11-07T16:53:32.451-05:00    Task protection acquired
2022-11-07T16:53:34.156-05:00    received: ping
2022-11-07T16:57:04.160-05:00    received: ping
2022-11-07T16:57:05.557-05:00    Client connection closed. There are 0 connections
2022-11-07T16:57:05.605-05:00    Task protection released
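The connection-counting logic can be sketched as a small tracker that toggles protection only at the boundaries (first client in, last client out). The factory shape here is our own illustration rather than the repo's exact code, and setTaskProtection stands in for whatever helper actually calls the agent endpoint:

```javascript
// Sketch: toggle task protection only when the connection count crosses
// zero, so repeated connects/disconnects don't spam the ECS agent.
// setTaskProtection is an injected stand-in for the real endpoint call.
function createConnectionTracker(setTaskProtection) {
  let connections = 0;
  return {
    async onConnect() {
      connections += 1;
      if (connections === 1) {
        await setTaskProtection(true);   // first client: protect the task
      }
    },
    async onDisconnect() {
      connections -= 1;
      if (connections === 0) {
        await setTaskProtection(false);  // last client left: allow scale-in
      }
    },
    count: () => connections,
  };
}
```

Wiring onConnect and onDisconnect into the WebSocket server's connection open and close handlers reproduces the log sequence above: protection is acquired when the first client arrives and released when the last one leaves.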

To test out task protection, open a browser tab to the service, which establishes a WebSocket connection between your browser and the server. Then edit the service in the Amazon ECS console and adjust the desired count to zero. Just as with the queue consumer service, you’ll see that Amazon ECS waits instead of immediately stopping the task. You’ll also see a message in the service events tab similar to this:

(service task-protection-test-websocket-Service-M9Q0bnYMjlQ4, taskSet ecs-svc/1888478596191510698) was unable to scale in due to (reason 1 tasks under protection)

Once you close all browser tabs, the client connections to the server are closed. When the WebSocket server has zero connected clients, it removes its own task protection, which allows Amazon ECS to stop the task.

Conclusion

In this post, we showed you how to use the new Amazon ECS task scale-in protection feature to safeguard your tasks from being terminated by scale-in events. With it, you have more control over the resiliency of your workloads while saving costs. Task scale-in protection has been one of the most requested Amazon ECS features, and we are excited to bring it to you. We look forward to hearing your feedback on the AWS Container services roadmap on GitHub.

Abhishek Nautiyal

Abhishek Nautiyal is a Senior Product Manager-Technical on the Amazon ECS team. His areas of focus include compute capacity management, task scheduling, and performance and scalability.

Nathan Peck

Developer advocate for container services at Amazon Web Services. Talks about AWS ECS, Kubernetes, Fargate, Docker, and microservices.