
Spark on Amazon EKS networking – Part 2

This post was co-authored by James Fogel, Staff Software Engineer on the Cloud Architecture Team at Pinterest.

Part 2: Spark on EKS network design at scale

Introduction

In this two-part series, my counterpart, James Fogel (Staff Software Engineer on the Cloud Architecture Team at Pinterest), and I share Pinterest’s journey designing and implementing their networking topology for running large-scale Spark workloads on Amazon Elastic Kubernetes Service (Amazon EKS). In Part 1, we gathered requirements, decided on design tenets, and defined the design process. In this post, we cover the design options, the chosen design, and the implementation that came out of that process, as well as the results and the forward-looking vision for Pinterest’s data workloads on Amazon EKS. If you haven’t already, we recommend reading Part 1, as we refer back to elements in that post, such as the tenets we used to guide our design decisions.

With the requirements- and tenet-gathering phase complete, and Pinterest Engineering and Leadership aligned, we moved on to the design phase. Pinterest presented their Spark on EKS platform at re:Invent 2023, which is worth a watch.

Solution overview

Design options

For the Spark on Amazon EKS workload, there are three foundational layers of network design that overlap: 1/ the Amazon EKS cluster network, which includes the host and Pod (that is, container) networks and is generally defined by the Kubernetes Container Network Interface (CNI) driver; 2/ the Virtual Private Cloud (VPC) network and cross-account connectivity; and 3/ the ingress and egress strategy. One interesting challenge at Pinterest is that these layers are owned by separate engineering teams, but the layers have complex interdependencies, so the design needed to be jointly developed.

For example, the IP allocation configuration defined in the CNI driver, owned by the Spark team, has implications for Network Address Usage (NAU), subnet sizes, and potentially the use of shared Amazon Elastic Compute Cloud (Amazon EC2) mutating APIs (AttachNetworkInterface, and so on). Setting cell, VPC, and account size boundaries and segmenting the environment is the typical response, but those design decisions are owned by the Cloud Architecture team.

The cluster networking options were: native IPv6, an overlay CNI driver, IPv4 network address translation (NAT), custom networking, or host networking. In the end we chose custom networking, which we’ll explain in detail.

For VPC connectivity, the options came down to VPC peering, Transit Gateway, Shared VPC, or Amazon VPC Lattice. Our preferred architecture is to deploy a single VPC per cell and connect back to the rest of the environment via Transit Gateway.

Finally, the internet egress options were to use a direct Internet Gateway (IGW) plus NAT Gateway in the sub-account, or to centralize egress via a transit account model. Egress to the rest of the Pinterest network is handled by the VPC options discussed above. For ingress, we only had a couple of entry points, namely Spark jobs via the Spark Operator, which arrived through another Pinterest job-routing layer, Archer. There was also the Spark live user interface (UI), which needed an ingress point. The options there were to use the k8s service framework or to integrate directly with Pinterest’s Envoy-based service mesh. We chose the native k8s service framework; we provide further detail in the next section.

In Part 1 of this series, we called out how important cross-team collaboration was for this complex design. Solving ingress for the Spark Live UI was a great example of this cross-team collaboration in action. This one decision required a deep understanding of the Spark Operator, Spark History Server, Kubernetes, ingress, and the proposed cross-account networking design. This necessitated collaboration and decisions from AWS Data on EKS, Pinterest Data, Cloud Architecture, and Traffic teams. It had to happen in a short period of time because this requirement surfaced late in the design process, so having our escalation paths and ongoing dialog established paid off here.

Chosen design and why

Diagram showing inter-cell communication. The workload cell contains a single VPC, a /16 non-routable CGNAT range for Pods, a /20 routable subnet for EKS hosts, and 2x /24 routable ranges for the EKS control plane. The workload VPC connects back to other cells via VPC peering today, with a pathway to TGW later if needed.

IPv4, for now: While IPv6 simplifies the design, it was deemed a bridge too far for the initial design. Unfortunately, the Spark Operator used for this design did not support IPv6, and Spark itself only added support in 3.4, which would have required additional work to re-validate existing jobs already validated on 3.2. But for anyone following in these footsteps, we’d strongly recommend considering IPv6 as the primary option before moving on to the other, more complicated options.

Custom networking: We decided on an IP allocation model of non-routable CGNAT IP ranges for Pods; Carrier-Grade NAT (CGNAT) ranges are an extension of the RFC 1918 private address ranges. These non-routable ranges are restricted to within each production cell boundary, with routable IPs allocated only to the k8s nodes and ingress controllers. For pod egress out to the wider Pinterest network, NAT on the Amazon EKS host would be used.
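
To make the custom networking model concrete, the following is a minimal sketch, assuming hypothetical subnet and security group IDs, of how this pattern is commonly wired up with the Amazon VPC CNI: one ENIConfig per Availability Zone points pod ENIs at the non-routable CGNAT subnets, and on-node SNAT covers egress toward routable space. This is illustrative, not Pinterest’s exact configuration.

```python
# Minimal sketch: create one ENIConfig per Availability Zone so pods land in
# the non-routable CGNAT subnets while nodes keep routable addresses.
# Subnet and security group IDs below are placeholders.
from kubernetes import client, config

POD_SUBNETS = {                       # hypothetical CGNAT pod subnets, one per AZ
    "us-east-1a": "subnet-aaaa1111",
    "us-east-1b": "subnet-bbbb2222",
}
POD_SECURITY_GROUP = "sg-0000aaaa"    # hypothetical

def apply_eniconfigs() -> None:
    config.load_kube_config()
    crds = client.CustomObjectsApi()
    for zone, subnet_id in POD_SUBNETS.items():
        body = {
            "apiVersion": "crd.k8s.amazonaws.com/v1alpha1",
            "kind": "ENIConfig",
            "metadata": {"name": zone},   # named after the AZ for automatic selection
            "spec": {"subnet": subnet_id, "securityGroups": [POD_SECURITY_GROUP]},
        }
        crds.create_cluster_custom_object(
            group="crd.k8s.amazonaws.com", version="v1alpha1",
            plural="eniconfigs", body=body)

# aws-node (VPC CNI) settings that pair with the ENIConfigs above:
#   AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true   -> pod ENIs use the ENIConfig subnets
#   ENI_CONFIG_LABEL_DEF=topology.kubernetes.io/zone
#   AWS_VPC_K8S_CNI_EXTERNALSNAT=false        -> SNAT on the node for traffic leaving
#                                                the VPC (no NAT gateway in the path)

if __name__ == "__main__":
    apply_eniconfigs()
```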

Utilizing custom networking with built-in NAT removed one additional service, NAT Gateway, from the architecture, further optimizing costs. For most customers, NAT Gateway costs would be in the noise, but for Pinterest, Spark-to-other-Pinterest-service flows equated to more than 120 petabytes per month of traffic, which can become a non-trivial cost component.
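
To give a sense of scale, here is the rough arithmetic, using an assumed NAT gateway data-processing price of $0.045/GB (illustrative only; actual pricing varies by Region and over time):

```python
# Back-of-the-envelope NAT gateway data-processing cost if Spark egress flows
# crossed NAT gateways. The per-GB price is an assumption for illustration only.
TRAFFIC_PB_PER_MONTH = 120
GB_PER_PB = 1_000_000                  # decimal units, good enough for a rough estimate
NAT_GW_USD_PER_GB = 0.045              # assumed data-processing rate

monthly_usd = TRAFFIC_PB_PER_MONTH * GB_PER_PB * NAT_GW_USD_PER_GB
print(f"~${monthly_usd:,.0f}/month")   # ~$5,400,000/month at these assumptions
```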

By moving pod IPs to the CGNAT range, we reduce the routable internal IP exhaustion caused by moving to an IP-per-pod model and, in time, can potentially re-use these ranges. For now, since VPC peering is being used, the ranges must be non-overlapping.
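
A simple sketch of what carving those non-overlapping ranges looks like; the /16-per-cell size matches the diagram above, and the cell names are illustrative:

```python
# Carve non-overlapping pod CIDRs per cell out of the CGNAT block
# (100.64.0.0/10, RFC 6598). Non-overlap is required while VPC peering is used.
import ipaddress

CGNAT = ipaddress.ip_network("100.64.0.0/10")

def pod_cidrs(num_cells: int, prefixlen: int = 16):
    """Return one non-overlapping pod CIDR per cell."""
    return list(CGNAT.subnets(new_prefix=prefixlen))[:num_cells]

if __name__ == "__main__":
    for i, cidr in enumerate(pod_cidrs(4)):
        print(f"spark-cell-{i}: {cidr} ({cidr.num_addresses:,} pod IPs)")
```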

Static clusters, pre-warmed IPs: One potential scaling consideration we identified early in experimentation was Amazon EC2 mutating API throttling, specifically AttachNetworkInterface and AssignPrivateIpAddresses. To eliminate this risk during regular operation, Pinterest decided to statically provision the clusters and pre-allocate IPs during node provisioning. Additionally, the CNI was configured to not allow IPs to be provisioned beyond the pre-allocation settings. In effect, this moves the Amazon EC2 calls to provisioning time instead of pod launch, helping achieve a form of static stability.
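
As a rough illustration (not Pinterest’s published configuration), this pre-allocation behavior maps to a handful of Amazon VPC CNI (aws-node) environment variables; the values below are assumptions, and the exact interaction of the warm and minimum targets should be checked against the VPC CNI documentation.

```python
# Illustrative (assumed) Amazon VPC CNI settings that pre-allocate pod IPs at
# node provisioning time rather than at pod launch.
PRE_WARM_CNI_ENV = {
    "ENABLE_PREFIX_DELEGATION": "true",  # hand out /28 prefixes instead of single IPs
    "MINIMUM_IP_TARGET": "96",           # pre-allocate full per-node pod capacity (6 x /28)
    "WARM_PREFIX_TARGET": "0",           # don't keep extra prefixes warm beyond the minimum
}

if __name__ == "__main__":
    # One way to apply these to the aws-node DaemonSet; an EKS add-on
    # configuration or a GitOps manifest would be equivalent.
    flags = " ".join(f"{k}={v}" for k, v in PRE_WARM_CNI_ENV.items())
    print(f"kubectl -n kube-system set env daemonset/aws-node {flags}")
```

Combined with a kubelet max-pods of 96, this is one plausible way to keep IP-allocation calls at node start-up, approximating the behavior described above.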

Prefix Delegation, six prefixes per node (96 IPs): To reduce the mutating API calls and NAU utilization, six prefixes per host were specified, which allows up to 96 pods per host.
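
The per-node ceiling falls straight out of the prefix size: the VPC CNI delegates /28 IPv4 prefixes, so six prefixes give 96 addresses.

```python
# Per-node pod capacity with prefix delegation: six /28 prefixes per node.
PREFIXES_PER_NODE = 6
IPS_PER_PREFIX = 2 ** (32 - 28)            # a /28 holds 16 addresses

print(PREFIXES_PER_NODE * IPS_PER_PREFIX)  # 96 -> kubelet max-pods set accordingly
```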

Single VPC per cell, evolving VPC interconnect: When weighing inter-VPC connectivity options, several were considered: Shared-VPC, Transit-Gateway (TGW), and VPC Peering. Our chosen architecture uses TGW to connect VPCs. However, in the short term we will use VPC Peering for VPC interconnect due to a lower cost profile and Pinterest’s experience with this pattern. VPC Peering addresses all of our known scale limits for now at the lower cost profile, but eventually Pinterest will reach a scale where a new solution (TGW, IPv6, etc) will be necessary. Building and maintaining this network will be an iterative process, and as the solution proves its value, Pinterest will evolve the interconnect strategy. The choice to use peering in the short term presents an easy two-way-door decision for future iterations.

Shared VPCs were also considered, but they place several of the quotas and limits in a single blast radius, which is not a good fit for a large multi-cell workload like Spark on Amazon EKS. Shared VPCs also do not support cross-AWS Organizations connections, which was a requirement due to an upcoming migration to a new AWS Organization; we will need to operate across two AWS Organizations during the migration. Crucially, shared VPCs for this class of workload would also be a one-way door. Once large workloads are moved to a single shared VPC, if any scale issues are discovered in the future, it would be difficult to add segmentation between them.

VPC peering does come with two major tradeoffs: 1/ no segmentation of NAU impact radius across peered VPCs, and 2/ high operational overhead due to its mesh topology. We mitigated the NAU impact for Spark on EKS by sizing each cell such that a misconfiguration of prefix delegation would not exceed NAU limits for the peered mesh (discussed in more depth in Cell shape and multi-account cellular architecture).

Ingress routing: One proverbial fly in the ointment for this design was an ingress endpoint for the Spark Live UI, which provides telemetry for running jobs via a web interface. All other ingress to the pods was via operators like the Spark Operator. To solve this case, we used an Amazon EKS ingress controller in front of an NGINX service, plus a small shim to register and deregister the Spark driver pods as new service URLs on the NGINX service. We will eventually achieve ingress via the Pinterest service mesh.

This ingress design allows for static provisioning of the ingress load-balancing layer, with dynamic job ID routing as jobs start and complete. This design addressed the scale and capacity challenges by ensuring the high job mutation rate didn’t flow on as calls to either the Amazon EKS control plane or other AWS APIs (for example, registering targets on Elastic Load Balancing (ELB), registering additional cluster IPs, or provisioning load balancers per job).
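
The following is a minimal sketch of what such a registration shim could look like, written against the Kubernetes API rather than reflecting Pinterest’s actual implementation; the namespace, Ingress name, per-job UI Service naming convention, and label keys are assumptions, and the NGINX rewrite and proxy details are omitted.

```python
# Sketch of a Spark UI registration shim: watch Spark driver pods and upsert or
# remove one Ingress path per job so the NGINX layer stays statically provisioned.
from kubernetes import client, config, watch

NAMESPACE, INGRESS = "spark-jobs", "spark-live-ui"   # assumed names

def set_ui_route(net: client.NetworkingV1Api, job_id: str, svc: str | None) -> None:
    """Upsert (svc given) or remove (svc=None) the path /<job_id> on the Ingress."""
    ing = net.read_namespaced_ingress(INGRESS, NAMESPACE)
    http = ing.spec.rules[0].http
    http.paths = [p for p in (http.paths or []) if p.path != f"/{job_id}"]
    if svc:
        http.paths.append(client.V1HTTPIngressPath(
            path=f"/{job_id}", path_type="Prefix",
            backend=client.V1IngressBackend(
                service=client.V1IngressServiceBackend(
                    name=svc, port=client.V1ServiceBackendPort(number=4040)))))
    net.replace_namespaced_ingress(INGRESS, NAMESPACE, ing)

def main() -> None:
    config.load_kube_config()            # or load_incluster_config() in-cluster
    core, net = client.CoreV1Api(), client.NetworkingV1Api()
    stream = watch.Watch().stream(core.list_namespaced_pod,
                                  namespace=NAMESPACE,
                                  label_selector="spark-role=driver")
    for event in stream:
        pod = event["object"]
        job_id = (pod.metadata.labels or {}).get("spark-app-selector",
                                                 pod.metadata.name)
        if event["type"] in ("ADDED", "MODIFIED") and pod.status.phase == "Running":
            set_ui_route(net, job_id, f"{pod.metadata.name}-ui-svc")  # assumed Service name
        elif event["type"] == "DELETED":
            set_ui_route(net, job_id, None)

if __name__ == "__main__":
    main()
```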

Cell shape and multi-account cellular architecture

As an example of this process in action, we proposed a cell-based strawman architecture, then iterated on the precise cell sizing. For the Spark-EKS workload, a Cell in this context is a deployable unit of the Spark on Amazon EKS platform, bounded by a single AWS account and containing a single VPC and a set ceiling of Amazon EKS nodes.

Some questions we asked: 1/ How big could the Spark-EKS clusters be? 2/ What are our peers at the cutting edge doing? 3/ What guidance does AWS/EKS have for us? We used these inputs to come up with a fairly accurate strawman for key parameters like Amazon EKS nodes per cluster and number of IP addresses per VPC. Designing across these parameters is an example of how we made trade-offs based on the “treat limits as finite resources” tenet from Part 1 of this series. The finite resources we considered were: pod IPv4 addresses, NAUs, Amazon EC2 mutating API request rate, and cluster rebuild time.

Rebuild time versus cluster size: Our boundaries were determined by assessing the limits listed above. To address the AWS API limit, we considered an account-level boundary. We considered the rebuild time with a goal of keeping an individual cell rebuild under 30 minutes. By roughly calculating the number of API calls made by an Amazon EKS cluster deployment with our chosen networking options, we determined we would consume the EC2 mutating API throttles with approximately 1000 to 1500 nodes within the 30-minute target rebuild period. This node limit defines our account cell size.
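
The shape of that calculation is simple enough to sketch. The per-node call count and sustained mutating-API rate below are placeholder assumptions (the real throttles are account-specific token buckets), chosen only so the output lands near the 1000 to 1500 node figure above; the point is the structure of the estimate, not the constants.

```python
# Rebuild-time estimate: nodes x mutating EC2 calls per node, paced by an
# assumed sustained API rate. Both constants are illustrative placeholders.
CALLS_PER_NODE = 7             # e.g. RunInstances, AttachNetworkInterface, AssignPrivateIpAddresses, ...
SUSTAINED_CALLS_PER_SEC = 5.0  # assumed steady-state rate once the burst bucket drains
TARGET_REBUILD_MINUTES = 30

def rebuild_minutes(nodes: int) -> float:
    return nodes * CALLS_PER_NODE / SUSTAINED_CALLS_PER_SEC / 60

def max_nodes_within_target(minutes: int = TARGET_REBUILD_MINUTES) -> int:
    return int(minutes * 60 * SUSTAINED_CALLS_PER_SEC / CALLS_PER_NODE)

if __name__ == "__main__":
    print(f"1300 nodes -> ~{rebuild_minutes(1300):.0f} minute rebuild")
    print(f"~{max_nodes_within_target()} nodes rebuildable in {TARGET_REBUILD_MINUTES} minutes")
```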

Peered VPC limit versus cell size: To address the relevant AWS networking limits, we began to draw our VPC cell boundaries. The most important limits were the NAU cap in a single VPC and in a peered mesh, and the IP limit per VPC. The cluster design variables that push these two limits are the number of nodes, pod density, and use of prefix delegation. The risk we are managing is a maximum limit of 512,000 NAUs in a peered mesh. We developed a calculator that let us plug in the Spark team’s proposed node group sizes, pod density, and prefix delegation configuration, and iterate with all stakeholders on sizing the cells such that we de-risked the NAU limit scenario. Replacing VPC peering with Transit Gateway or implementing IPv6 will help manage this risk in the future.
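
A stripped-down version of that calculator, assuming each node’s primary ENI address and each delegated prefix count as one NAU (which mirrors the published NAU accounting rules, but should be verified against current AWS documentation), could look like this:

```python
# NAU estimate for a cell, checked against the peered-mesh ceiling.
PEERED_MESH_NAU_LIMIT = 512_000          # maximum NAUs across a peered mesh

def cell_nau(nodes: int, prefixes_per_node: int = 6, enis_per_node: int = 1) -> int:
    # 1 NAU per ENI primary private IPv4 address + 1 NAU per delegated /28 prefix
    return nodes * (enis_per_node + prefixes_per_node)

def cells_per_mesh(nodes_per_cell: int) -> int:
    """How many peered cells of this size fit under the mesh-wide NAU cap."""
    return PEERED_MESH_NAU_LIMIT // cell_nau(nodes_per_cell)

if __name__ == "__main__":
    for nodes in (1000, 1300, 1500):
        print(f"{nodes} nodes -> {cell_nau(nodes):,} NAUs per cell, "
              f"up to {cells_per_mesh(nodes)} cells per peered mesh")
```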

The shape of the cell VPCs was bounded by NAUs and the total number of IPs, from which we derived a total number of nodes at a given density. The account boundaries were determined by mutating API calls, from which we again derived a number of nodes at a given density. These two boundaries intersect at 1300 nodes, so we were able to simplify our definition of a cell to one VPC and one account. In the end, we were able to draw a simple limit of 1300 nodes per cell, which addressed all the more complex scenarios.

Conclusion

In this post, we showed how an iterative and collaborative approach provides a strong foundation not only for Pinterest’s new data platform, but also for the future of Pinterest on the cloud. This is a point-in-time design that serves us well today, and we can and will iterate with new technologies, different workloads, and new scale.

While we presently have a single cell for the Moka project, this will eventually become multi-cell in order to meet the capacity demands of the platform. As we grow, we will likely introduce technologies like Transit Gateway or VPC Lattice to add additional risk segmentation and allow for larger cells. This foundation also sets us up for IPv6 adoption. More broadly, this pattern will be used to expand Pinterest’s cellular strategy to other parts of the business. We can expand with velocity because we translated a strong understanding of our boundaries into the architecture, and that design is captured in Infrastructure as Code for automated deployment.

We already have plans to extend this pattern to additional data workloads in 2024; ultimately, the vision is for the majority of data and analytics workloads to run on the Amazon EKS platform.

"Headshot

James Fogel, Pinterest

James Fogel is a Staff Software Engineer on the Cloud Architecture Team at Pinterest. He helps Pinterest’s Infrastructure organization build always-on, planet-scale cloud infrastructure with his deep knowledge in AWS services, infrastructure-as-code, cloud networking, and extreme-scale systems design.

Doug Youd

Doug Youd is a Principal Solutions Architect in the AWS strategic accounts group, focused on data platforms, advanced networking, containers, and resiliency. He has a strong background in datacenter infrastructure, networking, and virtualization, and is passionate about open source.