Improving daemon services in Amazon ECS
When using Amazon EC2 for compute capacity in Amazon Elastic Container Service (Amazon ECS) clusters, a common pattern is to schedule a single instance of a task on all, or a selected subset, of the nodes in the cluster. Typical examples are tasks that handle log or metrics collection (such as Fluentd or the Datadog agent), node monitoring, security agents, or custom in-house services. Regardless of the use case, achieving this on your own would be challenging, so it makes sense to rely on the scheduler to do the heavy lifting.
To support this requirement, Amazon ECS lets you schedule tasks using the daemon strategy when defining a service. The daemon scheduling strategy deploys exactly one task on each active container instance that meets all of the task placement constraints specified in your cluster. The service scheduler also evaluates the task placement constraints for running tasks and stops tasks that do not meet them. With this strategy, there is no need to specify a desired number of tasks, define a task placement strategy, or use Service Auto Scaling policies. For more information on daemon scheduling, see the Amazon ECS documentation.
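To make the daemon strategy concrete, here is a minimal sketch in Python of the request you might pass to the ECS CreateService API through boto3. The cluster name, service name, task definition, and constraint expression are hypothetical placeholders, and the helper function is our own illustration, not part of any SDK. Note that the request carries no desired count and no placement strategy, only an optional placement constraint.

```python
def daemon_service_request(cluster, service_name, task_definition, constraints=None):
    """Build keyword arguments for boto3's ecs.create_service for a daemon service.

    Hypothetical helper for illustration; all argument values are placeholders.
    """
    request = {
        "cluster": cluster,
        "serviceName": service_name,
        "taskDefinition": task_definition,
        # DAEMON asks the scheduler for one task per eligible container instance,
        # so no desiredCount or placementStrategy is supplied.
        "schedulingStrategy": "DAEMON",
        "launchType": "EC2",
    }
    if constraints:
        # memberOf constraints use the cluster query language to select instances.
        request["placementConstraints"] = [
            {"type": "memberOf", "expression": expression}
            for expression in constraints
        ]
    return request


request = daemon_service_request(
    cluster="example-cluster",            # assumed name
    service_name="log-collector",         # assumed name
    task_definition="fluentd-daemon:1",   # assumed task definition
    constraints=["attribute:ecs.instance-type =~ t3.*"],
)
print(request)

# To create the service for real (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("ecs").create_service(**request)
```

Building the request separately from the API call keeps the example runnable without AWS access and makes it easy to review what will be sent.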
One of the many benefits of using Amazon ECS is that customers receive feature updates and enhancements to the control plane without having to plan or perform cluster-level upgrades. This value is often understated. Over the past few months we have made several improvements to how ECS handles daemon services and prioritizes them when scheduling. Let’s take a look at some of the optimizations we’ve made.
Daemon task launch improvements
A common scenario in EC2-backed ECS clusters is scale-out, and in particular how the scheduler places daemon tasks onto the new hosts. In this scenario the scheduler is placing both replica tasks and daemon tasks at the same time, and depending on how many replica services are waiting for placement, daemon tasks could lose out on getting placed onto a host. We heard this feedback from our customers and have improved how the scheduler works with daemon tasks.
The most impactful improvement relates to how frequently the scheduler launches daemon tasks relative to replica tasks. We recently shortened the interval between the scheduler’s attempts to place daemon tasks, so those attempts now happen more often relative to replica task placement. This has improved the responsiveness of daemon services during scale-out events and increased the probability that daemon tasks get scheduled. In our internal testing we saw a 96% success rate for daemon task scheduling, with a 7.6 second interval from instance launch to daemon task placement. This is just a small step in improving daemon services, but it provides a much more reliable experience.
As mentioned above, it’s expected that EC2 instances will come in and out of ECS clusters as demand ebbs and flows. There are many reasons for this type of scaling activity. Whether it’s an autoscaling event, an AMI update, or an OS patch, it’s a good practice to set your instances to a draining state when you want to recycle them. This is important for a couple of reasons:
1) The scheduler will not attempt to schedule new tasks onto an EC2 instance that is in a draining state.
2) The scheduler will stop the tasks running on that instance, so containers that handle interruptions have the chance to exit gracefully.
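As a rough illustration of putting instances into a draining state, the sketch below wraps the ECS UpdateContainerInstancesState API, which boto3 exposes as `update_container_instances_state`. The `drain_instances` helper, the cluster name, and the ARNs are hypothetical, and the batch size of 10 is an assumption about the per-call limit that you should confirm against the API reference. A small stand-in client is used so the example runs without AWS access; in practice you would pass `boto3.client("ecs")`.

```python
def drain_instances(ecs_client, cluster, container_instance_arns, batch_size=10):
    """Set container instances to DRAINING, a batch at a time.

    Hypothetical helper; batch_size=10 is an assumed per-call limit.
    Returns the ARNs that the API reported back.
    """
    drained = []
    for start in range(0, len(container_instance_arns), batch_size):
        batch = container_instance_arns[start:start + batch_size]
        response = ecs_client.update_container_instances_state(
            cluster=cluster,
            containerInstances=batch,
            status="DRAINING",
        )
        drained.extend(
            instance["containerInstanceArn"]
            for instance in response.get("containerInstances", [])
        )
    return drained


class FakeEcsClient:
    """Stand-in for boto3.client("ecs") so the sketch runs offline."""

    def __init__(self):
        self.calls = []

    def update_container_instances_state(self, cluster, containerInstances, status):
        self.calls.append((cluster, list(containerInstances), status))
        return {
            "containerInstances": [
                {"containerInstanceArn": arn, "status": status}
                for arn in containerInstances
            ]
        }


client = FakeEcsClient()
# Placeholder ARNs for twelve container instances in an assumed cluster.
arns = [
    f"arn:aws:ecs:us-east-1:111122223333:container-instance/example-cluster/{i:032x}"
    for i in range(12)
]
drained = drain_instances(client, "example-cluster", arns)
print(len(drained), "instances set to DRAINING in", len(client.calls), "calls")
```

Passing the client in as an argument keeps the helper easy to test and leaves credential and region handling to the caller.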
As we mentioned earlier, a daemon service usually runs a task that other tasks on the same host rely on, such as log aggregation or metrics collection. If a daemon task is stopped before the non-daemon tasks on that host, the tasks that depend on it can no longer communicate with it. Customers raised this as an issue because the scheduler would stop tasks without regard to whether they were daemon or replica tasks. In November of 2020, we updated the ECS scheduler to drain daemon tasks last on a host. This ensures that when an instance is set to drain, ECS terminates all other tasks before terminating the daemon task.
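The ordering described above can be pictured with a small toy model. This is not the ECS scheduler’s implementation, which is internal to the service. It is only a sketch, with hypothetical task names, of what "daemon tasks last" means for the order in which tasks on a draining host are stopped.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    scheduling_strategy: str  # "REPLICA" or "DAEMON"


def drain_order(tasks):
    """Return tasks in stop order: non-daemon tasks first, daemon tasks last.

    sorted() is stable, so tasks keep their relative order within each group.
    """
    return sorted(tasks, key=lambda task: task.scheduling_strategy == "DAEMON")


# Hypothetical tasks running on one container instance.
tasks = [
    Task("fluentd", "DAEMON"),
    Task("web", "REPLICA"),
    Task("datadog-agent", "DAEMON"),
    Task("worker", "REPLICA"),
]

order = [task.name for task in drain_order(tasks)]
print(order)  # ['web', 'worker', 'fluentd', 'datadog-agent']
```

In this model the replica tasks that ship logs or metrics are stopped while the collectors they depend on are still running, which is the behavior the update is meant to guarantee.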
We are continually iterating on ECS features and enhancements. While some features garner more attention than others, users of the service get all of them on day one without performing cluster control plane upgrades. The daemon service enhancements described here have been rolled out over the past few months, and we will continue to improve them going forward. This means you may already be benefiting from these updates without being aware that they were put into place. Ultimately this helps our customers focus more on what matters to their business, and less on the boilerplate that takes time away from what’s important.