AWS Compute Blog

Cluster Management with Amazon ECS

In previous blog posts, we have talked a lot about Amazon EC2 Container Service (Amazon ECS) as a way to run Docker containers in the cloud on AWS, but not as much has been said about the cluster management options exposed through the ECS API.

Today we want to talk a bit about what cluster management is, how ECS empowers it, and — for those familiar with other cluster management systems like Apache Mesos — an example of how existing workloads may take advantage of the ECS API. At the end of this post, you’ll find links to some example code we are open sourcing today; we think it will empower you and your business to create large scale distributed systems with ECS as your cluster management solution.

Cluster management is becoming an important task as developers and businesses increasingly develop and deploy distributed applications in the cloud. Cluster management systems schedule work and manage the state of each cluster resource. A common example of developers interacting with a cluster management system is when you run a MapReduce job via Apache Hadoop or Apache Spark. Both of these systems typically manage a coordinated cluster of machines working together to perform a large task. In the case of Hadoop or Spark, these tasks are most often data analysis jobs or machine learning.

Cluster management systems have two challenges. First, there is a lot of overhead from managing the state of the cluster. For example, software like Hadoop and Spark typically have a Leader, or a part of the software that runs in one place and is in charge of coordination. They’ll then have many, often hundreds or even thousands of Followers, or a part of the software that receives commands from the Leader, executes them, and reports state of their sub-task.

When machines fail, the Leader must detect these failures, replace machines, and restart the Followers that receive commands. This can be a significant portion of code written for applications which need access to a large pool of resources. The second challenge is that each of these applications typically assumes full ownership of the machine where their tasks are running. You will often end up with multiple clusters of machines, each dedicated fully to the management system in use. This can lead to inefficient distribution of resources, and jobs taking longer to run than if a shared pool of resources could be used.

ECS provides a simple solution to cluster state management: the management of followers (using the ECS Agent), dispatching of sub-tasks to the proper location, and state inspection of the cluster are all exposed through the API. Rather than Spark or Hadoop having to manage a set of machines directly, ECS manages your instances. If you need to find out if a sub-task is still running, or what instances are available, you can simply call the ECS List* and Describe* API actions. This allows distributed systems to cut down on the amount of code needed to go from idea to implementation. Much of the undifferentiated heavy lifting and housekeeping has been abstracted behind a set of APIs. The ability to run multiple tasks on a shared pool of resources can also lead to higher utilization and faster task completion than if compute resources are statically partitioned.

One of the core principles behind the design of ECS is the separation of the scheduling logic from the state management. This allows you to use the ECS schedulers, write your own schedulers, or integrate with third party schedulers. A common solution for use cases such as data analysis, batch jobs, and machine learning is the open-source cluster management system Apache Mesos, which “provides applications (e.g., Hadoop, Spark, Kafka, Elasticsearch) with APIs for resource management and scheduling across entire datacenter and cloud environments.”

As an initial proof of concept of how we could start integrating Apache Mesos with ECS, we built an Apache Mesos scheduler driver called ECSSchedulerDriver. This driver allows the Mesos cluster management “start task” commands to be sent directly to ECS. This demonstrates how we can quickly extend ECS based on customer feedback, in some cases, to co-exist and collaborate with existing open source tools such as Mesos and Marathon. You can also write your own schedulers for ECS if you have specific needs.

If this kind of integration is of interest, or you are interested in integration with other cluster management frameworks or schedulers, please give us feedback via the ECS forum.