
Spark on Amazon EKS networking – Part 1

This post was co-authored by James Fogel, Staff Software Engineer on the Cloud Architecture Team at Pinterest.

Part 1: Design process for Amazon EKS networking at scale

Introduction

Pinterest is a platform that helps inspire people to live a life they love. Big data and machine learning (ML) are core to Pinterest’s platform and product, gathering insights from over 480 million monthly active users (MAU) (Source: Pinterest, Global analysis, Q3 2023) and curating content to improve their experience. In this two-part series, my counterpart, James Fogel (Staff Software Engineer on the Cloud Architecture Team at Pinterest), and I share Pinterest’s journey through designing and implementing their network and cell topology for running large-scale Spark workloads on Amazon Elastic Kubernetes Service (Amazon EKS). In this post, we focus on the process: gathering requirements, deciding on design tenets, and defining the design process.

There were many overlapping design parameters to consider: Kubernetes, Amazon EKS, Spark, Amazon Virtual Private Cloud (Amazon VPC) networking, and more. Additionally, the network and cell design crosses many domains owned by several different areas of Pinterest’s engineering team. The design process was a challenge, but between Pinterest’s Cloud Architecture team (James Fogel, Sekou Doumboya), the Spark project team (led by Soam Acharya), Site Recovery Engineers (Ashim Shrestha), Amazon EKS specialists (Vara Bonthu, Alan Halcyon), and Principal Account Solutions Architect (yours truly, Doug Youd), we were able to come up with a network design optimized for Pinterest’s environment and roadmap. In part two of the series, we’ll cover the design options, the chosen design, and the implementation that came out of this process, as well as the results and the forward-looking vision for Pinterest’s data workloads on Amazon EKS.

Scoping the problem

In 2022, Pinterest decided to build a new data platform to replace their legacy Hadoop platform. Pinterest data teams decided that Spark running on Kubernetes, specifically Amazon EKS, would be the future of their batch processing systems. A well-thought-out network design was crucial to addressing scaling problems.

Early in Spark and Amazon EKS testing, we discovered that the Spark workload presented challenges with: 1/ Amazon Elastic Compute Cloud (Amazon EC2) API request rates for Elastic Network Interface (ENI) IP provisioning, 2/ Network Address Usage (NAU), and 3/ private IPv4 exhaustion. All three issues are interrelated and require a design that addresses these scale dimensions together. Scaling limits and quotas horizontally through a cell-based architecture became a driving force for accelerating Pinterest’s multi-account modernization initiatives and qualified the Spark on Amazon EKS project as an early adopter of this new architectural pattern.
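
To make these dimensions concrete, the following is a minimal sketch (not Pinterest’s actual tooling) of how a team might audit a VPC’s ENI and private IPv4 footprint with boto3 while testing at scale; the VPC ID is a placeholder.

```python
import boto3

# Count ENIs, individually assigned private IPv4 addresses, and IPv4 prefixes
# in a single VPC. Each of these draws down quotas and NAU differently.
ec2 = boto3.client("ec2")
paginator = ec2.get_paginator("describe_network_interfaces")

eni_count = ip_count = prefix_count = 0
for page in paginator.paginate(
    Filters=[{"Name": "vpc-id", "Values": ["vpc-0123456789abcdef0"]}]  # placeholder
):
    for eni in page["NetworkInterfaces"]:
        eni_count += 1
        ip_count += len(eni.get("PrivateIpAddresses", []))
        prefix_count += len(eni.get("Ipv4Prefixes", []))

print(f"ENIs: {eni_count}, private IPv4: {ip_count}, IPv4 prefixes: {prefix_count}")
```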

We sought to design a solution around some basic architectural considerations: scalability, performance, security, cost efficiency, repeatability/extensibility, and resiliency. We used these considerations to establish a list of specific tenets for our design. For example, when we thought about cost efficiency, we considered not only the cost of AWS resources but also the operational and engineering overhead of the system. As another example, with the principle of repeatability and extensibility, we thought about automation to make deploying Spark on Kubernetes repeatable, but also about making our patterns applicable to other Pinterest systems. Pinterest presented their Spark on EKS platform at re:Invent 2023, which is worth a watch.

Solution overview

Design Process

We took an iterative, scenario-based approach to the design for the Spark on Amazon EKS project. In the process, we also tried to extrapolate a pattern from the design for other comparable workloads, while being careful not to overthink the scenarios and fall into analysis paralysis. This was the design process we followed:

  • Tenets: We set ourselves a list of goals and tenets. Some tenets would be in contention with each other, and there are always tradeoffs to be made to achieve the best overall blend.
  • List design options: We investigated, listed all options, and evaluated baseline pros and cons of each. We were able to rule out some options in this phase by comparing the pros, cons, and capabilities.
  • Draft strawman designs: Working from step 2, we applied our options to our scenario and worked out rough designs.
  • Prototype: We created a development AWS account and built up the infrastructure put forward in step 3. Critically, we took an Infrastructure-as-Code (IaC) first approach so that experimentation was captured in code and could be rapidly redeployed once we moved to building production infrastructure (see the sketch after this list).
  • Evaluate in scenarios: We evaluated the prototyping findings against our goals and tenets, extrapolated across reasonable scenarios, and took the design to boundary conditions to see where it breaks down.
  • Iterate: We communicated findings and cell-boundary decisions back to the development team and adapted the designs to each other’s feedback.
  • Present shortlist to business and wider group: This pattern would be foundational to many large new platforms at Pinterest, so we sought buy-in and support from Pinterest Engineering and Leadership. Cost was a significant dimension of the equation, so input from business decision-makers was critical.
  • Make a joint decision: Once all stakeholders were aligned, we published the agreed-upon design internally and got to work building.
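
As a concrete example of the IaC-first approach from step 4, here is a minimal sketch of what capturing an experimental cell VPC in code can look like, using AWS CDK in Python. This is illustrative only and not Pinterest’s actual tooling; the construct names and CIDR are hypothetical.

```python
from aws_cdk import App, Stack, aws_ec2 as ec2
from constructs import Construct

class CellVpcStack(Stack):
    """One VPC per cell keeps NAU, IP space, and API quotas scoped to the cell."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        ec2.Vpc(
            self,
            "CellVpc",
            ip_addresses=ec2.IpAddresses.cidr("10.64.0.0/16"),  # hypothetical CIDR
            max_azs=3,
            subnet_configuration=[
                # Small public subnets for NAT egress; large private subnets
                # for the Amazon EKS nodes and pods.
                ec2.SubnetConfiguration(
                    name="public", subnet_type=ec2.SubnetType.PUBLIC, cidr_mask=24
                ),
                ec2.SubnetConfiguration(
                    name="nodes",
                    subnet_type=ec2.SubnetType.PRIVATE_WITH_EGRESS,
                    cidr_mask=20,
                ),
            ],
        )

app = App()
CellVpcStack(app, "SparkCellPrototype")  # hypothetical stack name
app.synth()
```

Because the whole cell is a single stack, a prototype like this can be destroyed and redeployed in minutes, which kept experimentation cheap and made the eventual move to production structures repeatable.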

Design tenets

An important phase of our development was establishing a set of tenets for the design. We framed this as, “What standards do we want to hold ourselves to? What are the goals and outcomes?” We iterated on these tenets at the beginning of the project, after having already completed some preliminary prototyping and experimentation with Spark on Amazon EKS in 2022, which helped us speak from a position of experience.

Each member of the cross-functional team brought their experience working on the prototype, consulted internally within their group to gather lessons learned from similar projects, and then agreed on the below set of tenets. Going in, we knew that many of these tenets would be in contention with each other, so the goal was to hit as many as possible and understand the tradeoffs being made.

  • Address all known scale limits: Evaluate all components of the architecture and make sure we fit within the scale limits (described in depth below, since that was an important consideration).
  • Modern, scalable architecture: Utilize a cell-based architectural pattern considering both AWS accounts and networking.
  • Scale headroom: Allow scaling to thousands of nodes initially, and eventually to tens of thousands of nodes in the R instance family at 16XL sizing.
  • Treat limits as finite resources: We tried to quantify limits, quotas, request rates, and capacity as finite resources to be designed around. This allowed us to make tradeoffs, using more of one resource to address challenges in another related resource. It also ended up being a helpful way to frame complex, interdependent issues for a wider group in a consumable way.
  • Cost optimized
    • Use financial instruments such as Reserved Instances, Instance/Compute Savings Plans, etc.
    • Avoid network architectures with sub-optimal cost profiles wherever possible, without compromising on capabilities.
  • Low operational overhead: Pinterest prides itself on operating a large and sophisticated system with a relatively lean operations team. Ongoing operational burden is a metric against which designs are measured. Automation and repeatable templates were an outcome of this tenet; everything was written as IaC first to make cells quickly repeatable, with enough flexibility to adapt to iterations on the data platform.
  • Amazon EKS nodes to remain directly routable from the production fabric: Making the nodes unroutable was seen as potentially disruptive to operations teams’ existing workflows and tooling, such as node deployment and sidecars.
  • Pods generally don’t need ingress or routable IPs: Spark executor pods often require egress to upstream dependencies to perform their work (for example, reading from and writing to databases and endpoints other than Amazon Simple Storage Service [Amazon S3]). Some pods, like Spark driver pods, do host services that require some ingress, but for this workload that is the exception.
  • Support up to 96 pods per host: In early testing, we evaluated the current Hadoop/Spark environment and estimated the number of pods per node needed to fully utilize the Amazon EC2 nodes (see the sketch after this list).
  • Avoid one-way doors wherever possible: We determined some options would be difficult to change later, once production load had shifted to the platform. We took particular care to identify one-way doors in the design and avoid them where possible.
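
To ground the 96-pods-per-host tenet, here is a back-of-the-envelope sketch of the ENI/IP math under the default VPC CNI’s per-IP mode, using the published max-pods formula. The r5.16xlarge ENI and IP-per-ENI figures are from the Amazon EC2 documentation at the time of writing; treat the numbers as illustrative.

```python
def max_pods(enis: int, ips_per_eni: int) -> int:
    """Max pods per node under the default VPC CNI in per-IP mode:
    maxPods = ENIs * (IPv4 addresses per ENI - 1) + 2."""
    return enis * (ips_per_eni - 1) + 2

# r5.16xlarge supports 15 ENIs with 50 IPv4 addresses each.
print(max_pods(15, 50))  # 737, so 96 pods per host fits comfortably

# The catch: every pod still consumes a routable VPC IPv4 address, so at
# 96 pods per node a 10,000-node fleet needs roughly a million addresses.
print(96 * 10_000)  # 960,000
```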

Known scale limits

  • IPv4 exhaustion: The default CNI driver supported by Kubernetes/Amazon EKS allocates an IP address per pod (one per job executor in Spark terminology), leading to high IP utilization.
  • Amazon EC2 and other AWS API quotas and limits at an account level: On any workload of a sufficient size, architects must consider the request limits of the AWS APIs. This is especially true with this particular Spark on Amazon EKS workload.
  • VPC Network Address Usage (NAU): A core limit to be aware of with designs of this scale is the scalability of the VPC in terms of Network Address Usage (NAU) units. For example, with 10,000 EKS nodes and 96 pod IPs assigned individually per node, NAU usage would be 960,000 across the workload. Each VPC can have a maximum of 256,000 NAU units, and 512,000 in total across peered VPCs, so the network design must address this constraint (see the NAU sketch after this list).
  • Amazon EKS and Spark control plane scaling: It’s possible for Amazon EKS node groups to scale to thousands of nodes; however, with Spark, several other pieces of software limit optimal cluster scale. For example, logging and metrics agents commonly probe the Kubernetes/Amazon EKS control plane APIs and can ultimately lower the ceiling a single Amazon EKS control plane can support. During large bursts of pod launches, these secondary agents can lead to scaling challenges.
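
To show how quickly these units add up, and why we treated NAU as a budget, here is a minimal sketch of NAU accounting under two VPC CNI addressing modes. The unit costs come from the Amazon VPC documentation (one unit per individually assigned IPv4 address, one unit per assigned /28 prefix). Prefix delegation is shown as one commonly used mitigation, not necessarily the design Pinterest chose; part 2 covers the chosen design.

```python
import math

NAU_LIMIT_PER_VPC = 256_000  # maximum NAU units per VPC

nodes, pods_per_node = 10_000, 96

# Per-IP mode: one node IP plus one IP per pod, each costing 1 NAU unit.
per_ip_nau = nodes * (1 + pods_per_node)

# Prefix delegation: pod IPs come from /28 prefixes (16 addresses each),
# and each assigned prefix costs a single NAU unit.
prefixes_per_node = math.ceil(pods_per_node / 16)
prefix_nau = nodes * (1 + prefixes_per_node)

print(f"per-IP mode:  {per_ip_nau:,} NAU units")  # 970,000 -> over the limit
print(f"prefix mode:  {prefix_nau:,} NAU units")  # 70,000  -> fits easily
print(f"VPC maximum:  {NAU_LIMIT_PER_VPC:,}")
```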

Conclusion

In this post, we showed you how Pinterest and AWS went about gathering requirements and forming tenets for a networking design project to support large-scale, real-world deployments. The process was key to defining the scope of the problem, establishing the boundaries to operate within, aligning with stakeholders on requirements, and establishing a process for collaborating across teams to achieve our goals.

In the next part of this series, we’ll dive into the outputs of this process. We’ll share the options we considered in designing our architecture, along with our chosen design and reasoning. We’ll also take a close look at one element of the design, cell shape and sizing, to demonstrate how we implemented the tenet of treating limits as finite resources by balancing our requirements against our limitations.

"Headshot

James Fogel, Pinterest

James Fogel is a Staff Software Engineer on the Cloud Architecture Team at Pinterest. He helps Pinterest’s Infrastructure organization build always-on, planet-scale cloud infrastructure with his deep knowledge in AWS services, infrastructure-as-code, cloud networking, and extreme-scale systems design.

Doug Youd

Doug Youd is a Principal Solutions Architect in the AWS strategic accounts group, focused on data platforms, advanced networking, containers, and resiliency. He has a strong background in data center infrastructure, networking, and virtualization, and is passionate about open source.