Spark | Containers

Spark on Amazon EKS networking – Part 2

This post was co-authored by James Fogel, Staff Software Engineer on the Cloud Architecture Team at Pinterest Part 2: Spark on EKS network design at scale Introduction In this two-part series, my counterpart, James Fogel (Staff Cloud Architect at Pinterest), and I share Pinterest’s journey designing and implementing their networking topology for running large-scale Spark […]

Spark on Amazon EKS networking – Part 1

This post was co-authored by James Fogel, Staff Software Engineer on the Cloud Architecture Team at Pinterest Part 1: Design process for Amazon EKS networking at scale Introduction Pinterest is a platform that helps inspire people to live a life they love. Big data and machine learning (ML) are core to Pinterest’s platform and product, […]

Run Spark-RAPIDS ML workloads with GPUs on Amazon EMR on EKS

Introduction Apache Spark revolutionized big data processing with its distributed computing capabilities, which enabled efficient data processing at scale. It offers the flexibility to run on traditional Central Processing Unit (CPUs) as well as specialized Graphic Processing Units (GPUs), which provides distinct advantages for various workloads. As the demand for faster and more efficient machine […]

Best practices for running Spark on Amazon EKS

Amazon EKS is becoming a popular choice among AWS customers for scheduling Spark applications on Kubernetes. It’s fully managed but still offers full Kubernetes capabilities for consolidating different workloads and getting a flexible scheduling API to optimize resources consumption. But Kubernetes is complex, and not all data engineers are familiar with how to set up […]

Advertising click-prediction modeling on Amazon EKS

In digital advertising, the ad click-through rate (CTR) model predicts the probability of a click given the ads and context x (for example, shopping query, time of the day, device). The output of a CTR model can be seen as a conditional probability p(y = click|x). A precise estimation of this probability influences our ability […]

Optimizing Spark performance on Kubernetes

Apache Spark is an open source project that has achieved wide popularity in the analytical space. It is used by well-known big data and machine learning workloads such as streaming, processing wide array of datasets, and ETL, to name a few. Kubernetes is a popular open source container management system that provides basic mechanisms for […]

Containers