
Mobileye’s journey towards scaling Amazon EKS to thousands of nodes

This post was coauthored by David Peer, DevOps Specialist, AI Engineering, Mobileye and Tsahi Duek, Specialist Solutions Architect for AWS Container services.

This blog post reviews how Mobileye’s AI Engineering Group seamlessly runs their workflows on Amazon Elastic Kubernetes Service (Amazon EKS), supporting around 250 workflows daily.

What is Mobileye?

Mobileye develops self-driving technology and Advanced Driver Assistance Systems (ADAS) using state-of-the-art cameras, computer chips, and software. The Mobileye AI Engineering Group supports the different engineering teams in running diverse types of workloads, such as DAG-based workflows, ML/DL training jobs, and basic batch jobs. When the platform to support the engineering teams and their workloads was first designed, we defined the following key design principles and capabilities that the platform needed to provide:

  1. Platform abstraction for the engineering teams—We did not want the engineering teams to handle infrastructure, nor did we want them to have to deep dive into the platform architecture before using it (they can do that if they want, but it is not mandatory for using the platform).
  2. Battle-tested solution that supports running diverse types of workloads concurrently—Since there are multiple requirements for diverse types of workloads from the different teams, we needed a platform that could be easily configured for specific workloads while having commonality across platform management, deployment, and operations.
  3. Scalable platform solution—At Mobileye, we run tens of data processes per hour. Each data process can launch anything between hundreds to thousands of pods simultaneously. An example of such a process can be a Spark cluster spinning up thousands of Spark executors. Therefore, a highly scalable solution was a basic requirement for us to support our engineering teams.
  4. Cost-efficient solution—We needed a solution to reduce costs wherever possible.
  5. Solution to support our network architecture—At Mobileye, we run our workloads on a private network with no internet access for security and regulatory reasons.
  6. Scale and use different instance types to support several types of workloads—These include workloads that require accelerators such as GPUs and Habana Gaudi devices, and in general we wanted as much compute capacity as we could get at any given time.

Taking all of the above requirements into consideration, we decided to use Amazon EKS to run our production workloads. We considered alternatives such as on-premises or cloud batch processing systems, but we had already decided that other services would run on Kubernetes. That meant that any solution or platform not based on Kubernetes would increase our operational burden, since we would have to learn how to deploy, configure, and operate multiple different platforms.

Choosing Amazon EKS as our main and, eventually, only containerized workload platform allowed us to better design our workloads for scale, reliability, availability, security, monitoring, and other Day 2 operations all in a single type of infrastructure. It allowed our developers and engineers to focus on the aspects that were important for them to run their workloads while abstracting the deployment and running part for them.

Solution architecture

The Amazon EKS managed service and managed node groups allowed us to focus on the workload instead of the infrastructure, given an architecture well designed for our needs. We like to think of Kubernetes and Amazon EKS as an OS for containerized workloads. This allows us to do the following:

  1. Deploy and run critical services.
  2. Scale our workloads and infrastructure while using different pricing options, such as Amazon EC2 Spot Instances, and have the ability to fall back by instance type, need, and price.
  3. Easily integrate with storage systems such as Amazon Elastic Block Store (Amazon EBS) and Amazon Elastic File System (Amazon EFS) using the Container Storage Interface (CSI), which allows applications to request several types of storage systems depending on their needs (see the sketch after this list).
  4. Keep our workloads and infrastructure secure while adhering to the Shared Responsibility Model.
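
For example, with the Amazon EFS CSI driver installed, a workload can request shared storage simply by referencing a StorageClass in a PersistentVolumeClaim. The following is a minimal sketch; the claim name and the efs-sc StorageClass name are hypothetical:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data            # hypothetical claim name
spec:
  accessModes:
    - ReadWriteMany            # EFS supports concurrent access from many pods
  storageClassName: efs-sc     # assumes a StorageClass backed by the EFS CSI driver
  resources:
    requests:
      storage: 100Gi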

The following diagram describes the configuration of our Amazon EKS cluster. For simplicity, we describe how a single Availability Zone is configured; the same configuration applies to all other Availability Zones in which our workloads run.

Architecture diagram of Amazon EKS and its data-plane of managed node groups used by Mobileye

The diagram demonstrates using different node groups to diversify the instance types as much as possible, according to the best practices for Spot Instances. This configuration allowed us to scale our clusters to ~3,200 nodes used by ~40,000 pods and more than 100,000 vCPUs in a single cluster, while more than 95 percent of the cluster data plane uses Spot Instances.
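
As a rough sketch, a diversified Spot node group of this kind can be expressed in eksctl as follows; the node group name and instance types here are illustrative rather than our exact configuration:

managedNodeGroups:
  - name: spot-workflows-8xl-a   # hypothetical node group name
    spot: true
    instanceTypes:               # diversify across similarly sized instance types
      - m5.8xlarge
      - m5a.8xlarge
      - m5n.8xlarge
      - r5.8xlarge
    availabilityZones: ["us-east-1a"]
    minSize: 0
    maxSize: 400
    labels:
      autoscaler: "true"
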
Another design requirement that was important for us was supporting fully private networks, and the Amazon EKS recommendations for private clusters were helpful in achieving that.

Argo Workflows—exposing Kubernetes to the crowds

Given the above Amazon EKS architecture, our developers are not, and should not be, aware of the node group setup, Availability Zones, or any other infrastructure aspects. To achieve that abstraction, we use Argo Workflows. It allows our developers to define their processing either declaratively with Argo Workflows manifests or programmatically with Python SDKs for Argo (Hera Workflows, Couler). This lets them focus on their application-specific requirements, while Argo Workflows ensures that the workflow runs to completion. The result is workloads that spread over diverse types of instances and GPU accelerators and interact with different data sources.

The following diagram demonstrates a complete workflow generated by a developer. As previously mentioned, the developer can specify requirements for the different types of workload steps, such as container image, resources, specific nodes to use, secrets, and artifacts. Argo Workflows will ensure that the entire workflow runs to completion, including retries or restarts. Kubernetes will make sure the pods are scheduled on the right type of infrastructure.

A workflow design generated by a developer in the Argo Workflows product
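
As an illustration, a single workflow step can look roughly like the following; the image name is hypothetical, and the node label and toleration match the node group examples shown later in this post:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: preprocess-
spec:
  entrypoint: preprocess
  templates:
    - name: preprocess
      nodeSelector:
        instance-type: "gaudi"        # custom node-group label (see the eksctl example below)
      tolerations:
        - key: "Habana.ai/gaudi"
          operator: "Exists"
          effect: "NoSchedule"
      container:
        image: registry.example.com/preprocess:latest   # hypothetical image
        resources:
          requests:
            cpu: "4"
            memory: 8Gi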

In the following sections, we share the configurations that allowed us to scale our Amazon EKS clusters to more than 3,000 nodes to support our diverse types of workloads.

Large cluster considerations

Using Cluster Autoscaler

Cluster Autoscaler is a well-known automatic scaling add-on that scales the infrastructure needed by the workloads running in the cluster. In our case, Cluster Autoscaler was the add-on of choice and was able to support our list of tens of managed node groups. Still, we had to pay attention to some of its configuration, described as follows:

Generated list of instance types

Cluster Autoscaler relies on the Amazon EC2 API to generate a supported instance type list at runtime. When running Cluster Autoscaler in a private VPC with no internet access, a VPC endpoint for Amazon EC2 is required so that this list can be generated.
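
With eksctl, a fully private cluster can be declared as sketched below. When privateCluster is enabled, eksctl provisions the interface endpoints the cluster needs in a VPC with no internet access (including Amazon EC2), and additional endpoint services can be listed explicitly. The cluster name and region are illustrative:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: workflows-cluster        # hypothetical cluster name
  region: us-east-1
privateCluster:
  enabled: true
  additionalEndpointServices:
    - "autoscaling"              # used by Cluster Autoscaler to manage Auto Scaling groups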

Using priority expander to handle Spot and On-Demand pricing offering

Cost is a significant factor when running large-scale clusters. Therefore, we chose to run most of our workload on Amazon EC2 Spot Instances. To support this, we used the priority expander so that if some instance types are temporarily unavailable as Spot Instances, our production services are not impacted. As shown in the following configuration, we defined three priority levels, where spot-workflows.* is the highest priority. (For more information on how the priority expander works in Cluster Autoscaler, please refer to the project documentation on GitHub.)

To minimize the use of On-Demand Instances in case of insufficient Spot capacity, we limited the size of the On-Demand node group to around 5 percent of the size of the Spot node groups. In addition, we separated management systems running on Spot Instances (such as the Argo server, Elasticsearch data nodes, and so on) from the workflows' workloads. The reason for this was to reduce the possibility of management systems' pods being evicted, because Cluster Autoscaler performs bin packing of pods across all node groups. We achieved this by separating the spot-workflows node groups from the spot-mgmt-workflows node groups, as shown in the following Cluster Autoscaler configuration.

Note: Since we run workflow processing on Amazon EKS, we do not need to handle moving workloads back from On-Demand to Spot Instances after a period of insufficient Spot capacity. The reason is that the current workload simply continues to run on On-Demand Instances, whereas new workloads will again try to run on Spot Instances first. The following is our priority expander configuration:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    80:
    - spot-workflows.*
    50:
    - spot-mgmt-workflows.*
    10:
    - .*

Labeling node groups to provide scheduling flexibility to our users

As described earlier, at Mobileye, we run several types of workflows that require different instance configurations. This is where node labels, taints and tolerations, and node affinity come in handy. First, they allow us to ensure that the right workload gets the right infrastructure and that no other type of infrastructure is used. Second, allocating the wrong instance type for a workflow can lead to delays in processing and unwanted costs. Third, designing the node groups with custom node labels allows us to add support for newer instance types without changing the pod manifests. The following are examples of node group configurations in eksctl.

Here is an example of a managed node group:

- name: spot-workflows-gaudi-a3
  labels:
    accel: "dl1.24xlarge"
    instance-type: "gaudi"
    Habana.ai/gaudi: "true"
    zone: "us-east-1a"
    autoscaler: "true"
    k8s.amazonaws.com/accelerator: "gaudi"
  taints:
    - key: "Habana.ai/gaudi"
      value: "true"
      effect: "NoSchedule"

If a specific instance type is needed, it can be specified in the nodeSelector parameter. For example, when a pod needs a Habana Gaudi-based dl1.24xlarge instance, the following pod manifest snippet achieves just that:

  nodeSelector:
    accel: "dl1.24xlarge"
  tolerations:
    - key: "Habana.ai/gaudi"
      operator: "Exists"
      effect: "NoSchedule"

Configuring and monitoring CoreDNS deployment

DNS queries and resolution at massive scale can be an issue as well. We encountered a situation in which even Amazon Simple Storage Service (Amazon S3) bucket name resolution failed, and we immediately suspected the cluster DNS service, which is served by the CoreDNS deployment.

First, there were not enough CoreDNS pods running in the cluster, so we increased the number of replicas. This was the first fix for the issue, and we are now considering implementing automatic scaling for the CoreDNS deployment per the Amazon EKS Best Practices Guide recommendations.

Second, AWS limits the number of DNS queries originating from each elastic network interface (ENI), so the CoreDNS pods must be spread across different instances. This can be done using podAntiAffinity.
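
The following is a minimal sketch of the relevant part of the CoreDNS Deployment pod template; CoreDNS pods in Amazon EKS carry the k8s-app: kube-dns label, and the anti-affinity rule forces replicas onto different nodes:

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: k8s-app
              operator: In
              values:
                - kube-dns
        topologyKey: kubernetes.io/hostname   # at most one CoreDNS pod per node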

Third, we had to increase the resource allocation for the CoreDNS pods, as the CPU request defaults to 100 millicores. After experiencing several restarts of the CoreDNS pods, we increased the CPU allocation to 1 CPU, and since then, the cluster has operated as expected. Monitoring CoreDNS latency, errors, and resource consumption is also something to keep an eye on; the Amazon EKS Best Practices Guide provides recommendations for that as well.
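
The change itself is a small edit to the CoreDNS container's resources; a minimal sketch reflecting the 1-CPU request described above:

resources:
  requests:
    cpu: "1"          # raised from the 100m default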

Future improvement: We are considering using the NodeLocal DNSCache capability, although the current setup of our CoreDNS deployment works well for our 3,200-node cluster.

Use On-Demand (Reserved) capacity type for your cluster-critical workloads

While most of our workload uses Spot Instances, we do use a small number of Reserved Instances, which provide a significant price reduction compared to standard On-Demand pricing. You can refer to the Cluster Autoscaler expander configuration above: the lowest priority, 10, is the catch-all .* entry, so node groups without "spot" in their name are used only as a last resort. We mainly configure this for critical services, where a Spot Instance interruption or pod eviction scenario can impact the normal operation or availability of our services.

Amazon EKS managed node groups automatically label nodes with the eks.amazonaws.com/capacityType label key and the ON_DEMAND/SPOT label values.
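
For example, a critical Deployment can be pinned to On-Demand capacity with a simple nodeSelector on that label (a minimal sketch):

nodeSelector:
  eks.amazonaws.com/capacityType: ON_DEMAND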

Our CoreDNS, AWS Load Balancer Controller, Amazon EBS CSI controller, and Amazon EFS CSI controller all run on such instances.

Other critical services, such as Elasticsearch primary nodes (referred to as “master nodes” by Elasticsearch) and Argo workflow controller, use these Reserved Instances as well.

Future improvements

Reaching a steady state never means we are done; the system keeps evolving and improving all the time. Therefore, we also have a backlog of tasks that should introduce improvements and enhancements to our architecture. The following are topics we are considering implementing next.

Karpenter—node provisioning project for Kubernetes

Karpenter is a new project that aims to improve the efficiency and cost of running Kubernetes clusters. Karpenter does not rely on constructs such as managed node groups or Auto Scaling groups (ASGs); instead, it creates the right infrastructure for the workload on the fly. The significant improvements that Karpenter brings are faster scaling (reducing the time between a pod being created and the pod reaching a running state) and the fact that it does not require node groups, which, at the scale of our architecture, can be challenging to manage.
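
At the time of writing, Karpenter is configured through a Provisioner resource rather than node groups. The following is a minimal sketch under that assumption; the values and the referenced AWSNodeTemplate name are illustrative:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]    # allow both Spot and On-Demand capacity
  limits:
    resources:
      cpu: "100000"                    # cap total provisioned vCPUs
  ttlSecondsAfterEmpty: 30             # remove empty nodes quickly
  providerRef:
    name: default                      # AWSNodeTemplate with subnets and security groups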

NodeLocal DNSCache

Another improvement to the DNS deployment in the cluster is to implement caching for the DNS request on every node. The Amazon EKS Best Practices Guide covers this topic as well.

Migrate from Fluentd to Fluent Bit

Fluentd is an open-source data collector for a unified logging layer. When running in Kubernetes, the metadata filter adds metadata to the logs from the Kubernetes API server. Metadata added by the plugin can include labels, annotations, the namespace, and more. However, the metadata plugin defaults to watching pods on the API server for changes; on exceptionally large clusters, this can place a high load on the API server and add latency to other API requests handled by the control plane. This WATCH command can be disabled, as described in this Containers blog post. Additionally, we are considering migrating from Fluentd to Fluent Bit because it has better performance with minimal impact on the API server.

Sharding clusters

Maintaining a single, very large Amazon EKS cluster can be challenging overall. Consider cluster upgrades and node group updates, which can lead to service degradation for the developers.

We plan to extend our VPC CIDR range so we can support more Amazon EKS clusters and manage more Argo Workflows deployments. By doing so, we will be able to load balance and schedule workflows based on a cluster’s load, state, or any other characteristics we implement in our placement policy. Although it will add more complexity to our solution, it should allow Mobileye to better use AWS and Amazon EKS resources in some cases and scale more efficiently.

Conclusion

New technology brings new challenges with it. Our decision to start with a managed service lowered the amount of effort needed to get started. With Amazon EKS, we have less work to do than if we were building our clusters from the ground up. Also, using resources such as the Amazon EKS Best Practices Guide and communicating with the Kubernetes and AWS Containers community through the different GitHub projects helped us gain the knowledge and experience needed to support such scale.

David Peer, DevOps Specialist, AI Engineering, Mobileye

David Peer is a DevOps team leader and DevOps Specialist at the AI Engineering Group, Mobileye Vision Technologies. He has 20+ years of experience in developing and optimizing large-scale Unix-based environments, on premises and in the cloud. He has a background in Unix system programming, web services, microservices, and DevOps methodologies.

Tsahi Duek

Tsahi Duek is a Principal Container Specialist Solutions Architect at Amazon Web Services. He has over 20 years of experience building systems, applications, and production environments, with a focus on reliability, scalability, and operational aspects. He is a system architect with a software engineering mindset.