AWS Open Source Blog

Deploying Spark jobs on Amazon EKS

UPDATE, March 2021: This blog post describes how to deploy self-managed Apache Spark jobs on Amazon EKS. AWS now provides a fully managed service with Amazon EMR on Amazon EKS. This new deployment option allows customers to automate the provisioning and management of Spark on Amazon EKS, and to benefit from advanced features such as the Amazon EMR runtime performance, fine-grained permissions, AWS Glue Data Catalog integration, and the optimized Amazon EMRFS connector.


Kubernetes has gained a great deal of traction for deploying applications in containers in production, because it provides a powerful abstraction for managing container lifecycles, optimizing infrastructure resources, improving agility in the delivery process, and facilitating dependency management.

Now that a custom Spark scheduler for Kubernetes is available, many AWS customers are asking how to use Amazon Elastic Kubernetes Service (Amazon EKS) for their analytical workloads, especially for their Spark ETL jobs. This post explains the current status of the Kubernetes scheduler for Spark, covers the different use cases for deploying Spark jobs on Amazon EKS, and guides you through the steps to deploy a Spark ETL example job on Amazon EKS.

Available functionalities in Spark 2.4

Before the native integration of Spark with Kubernetes, developers used the Spark standalone deployment. In this configuration, the Spark cluster is long-lived and runs on a Kubernetes ReplicationController. Because the cluster is static, job parallelism is fixed and concurrency is limited, resulting in wasted resources.

With Spark 2.3, Kubernetes became a native Spark resource scheduler. Spark applications interact with the Kubernetes API to configure and provision Kubernetes objects, including the Pods and Services for the different Spark components. This simplifies Spark cluster management by relying on Kubernetes' native features for resiliency, scalability, and security.

The Spark Kubernetes scheduler is still experimental: it provides some promising capabilities while still lacking others. Spark 2.4 currently supports the following (illustrated in the spark-submit sketch after this list):

  • Spark applications in client and cluster mode.
  • Launching drivers and executors in Kubernetes Pods with customizable configurations.
  • Interacting with the Kubernetes API server via TLS authentication.
  • Mounting Kubernetes secrets in drivers and executors for sensitive information.
  • Mounting Kubernetes volumes (hostPath, emptyDir, and persistentVolumeClaim).
  • Exposing the Spark UI via a Kubernetes service.
  • Integrating with Kubernetes RBAC, enabling the Spark job to run as a serviceAccount.
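As a minimal sketch of how these capabilities surface in practice, the following hypothetical spark-submit invocation launches an application in cluster mode with a dedicated ServiceAccount, a mounted Kubernetes secret, and a persistentVolumeClaim. The API server endpoint, image, ServiceAccount, secret, claim, class, and jar path are placeholders rather than values used later in this post; the configuration property names are the ones documented for the Spark 2.4 Kubernetes scheduler.

spark-submit \
  --master k8s://https://<K8S_API_SERVER_ENDPOINT>:443 \
  --deploy-mode cluster \
  --name spark-example \
  --class <MAIN_CLASS> \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=<REPO_NAME>/<IMAGE_NAME>:v1.0 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.driver.secrets.<SECRET_NAME>=/etc/secrets \
  --conf spark.kubernetes.executor.secrets.<SECRET_NAME>=/etc/secrets \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.checkpoint.mount.path=/checkpoint \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.checkpoint.options.claimName=<PVC_NAME> \
  local:///opt/spark/jars/<APPLICATION_JAR>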

It does not support:

  • Driver resiliency. The Spark driver runs in a Kubernetes Pod. In case of a node failure, Kubernetes doesn't reschedule the driver Pod to another node; the Kubernetes restartPolicy only restarts containers on the same Kubelet (same node). This issue can be mitigated with a Kubernetes custom controller that monitors the status of the driver Pod and applies a restart policy at the cluster level.
  • Dynamic Resource Allocation. Auto-scaling a Spark application requires an External Shuffle Service to persist shuffle data outside of the Spark executors, which the Kubernetes scheduler does not yet provide (see SPARK-24432).

For more information, refer to the official Spark documentation.

Additionally, for streaming application resiliency, Spark uses a checkpoint directory to store metadata and data so that it can recover its state. The checkpoint directory needs to be accessible from both the Spark driver and the executors. A common approach is to use HDFS, but HDFS is not natively available on Kubernetes. Spark deployments on Kubernetes can instead use PersistentVolumes, which survive Pod termination, with the ReadWriteMany access mode to allow concurrent access. Amazon EKS supports Amazon EFS and Amazon FSx for Lustre as storage backends for PersistentVolumes.
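For example, a shared checkpoint location could be backed by an EFS-based PersistentVolumeClaim. The following is only a sketch: it assumes the Amazon EFS CSI driver is installed and that a ReadWriteMany-capable storage class (named efs-sc here) or a pre-provisioned EFS PersistentVolume exists; none of these objects are part of the demo later in this post.

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: spark-checkpoint        # hypothetical claim name
spec:
  accessModes:
    - ReadWriteMany             # required for concurrent driver/executor access
  storageClassName: efs-sc      # assumed EFS-backed storage class
  resources:
    requests:
      storage: 5Gi              # required by the API; EFS itself is elastic
EOF

The claim can then be mounted on the driver and executors with the spark.kubernetes.driver.volumes.persistentVolumeClaim.* and spark.kubernetes.executor.volumes.persistentVolumeClaim.* properties shown in the earlier sketch.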

Use cases for Spark jobs on Amazon EKS

Using Amazon EKS for running Spark jobs provides benefits for the following types of use cases:

  • Workloads with high availability requirements. The Amazon EKS control plane is deployed across multiple Availability Zones, and the Amazon EKS worker nodes can be as well.
  • Multi-tenant environments that isolate workloads and optimize resource consumption. Amazon EKS supports Kubernetes Namespaces, Network Policies, Resource Quotas, and Pod Priority and Preemption.
  • Development environments using Docker and workloads already running in Kubernetes. Spark on Kubernetes provides a unified approach between big data and already containerized workloads.
  • Focus on application development without worrying about sizing and configuring clusters. Amazon EKS is fully managed and simplifies the maintenance of clusters including Kubernetes version upgrades and security patches.
  • Spiky workloads that need fast autoscaling response times. Amazon EKS supports the Kubernetes Cluster Autoscaler and can provide additional compute capacity in minutes.

Deploying Spark applications on EKS

In this section, I will run a demo to show you the steps to deploy a Spark application on Amazon EKS. The demo application is an ETL job written in Scala that processes New York taxi rides to calculate the most valuable zones for drivers by hour. It reads data from the public New York City taxi dataset Amazon S3 bucket, writes its output to another Amazon S3 bucket, and creates a table in the AWS Glue Data Catalog so the results can be analyzed with Amazon Athena. The demo application has the following prerequisites:

  • A copy of the source code, available from:

git clone https://github.com/aws-samples/amazon-eks-apache-spark-etl-sample.git

  • Docker >17.05 on the computer that will be used to build the Docker image.
  • An existing Amazon S3 bucket name.
  • An AWS IAM policy with permissions to write to that Amazon S3 bucket (a minimal sketch follows this list).
  • An Amazon EKS cluster on which you can create a Kubernetes ServiceAccount bound to the edit ClusterRole, and whose worker node instance role has the previous AWS IAM policy attached.
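The AWS IAM policy mentioned above can be as simple as read and write access to the output bucket. The following is a minimal sketch created with the AWS CLI; the bucket name and policy name are placeholders, and your security requirements may call for tighter scoping.

cat > spark-s3-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::<MY_BUCKET>",
        "arn:aws:s3:::<MY_BUCKET>/*"
      ]
    }
  ]
}
EOF

aws iam create-policy --policy-name spark-on-eks-s3 --policy-document file://spark-s3-policy.json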

If you don't have an Amazon EKS cluster, you can create one using the eksctl tool following the example below (after updating the AWS IAM policy ARN):

eksctl create cluster -f example/eksctl.yaml
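For reference, a configuration file of this kind is an eksctl ClusterConfig manifest in which the worker node group attaches the AWS IAM policy by ARN. The sketch below is not the repository's example/eksctl.yaml: the cluster name, region, instance type, and sizes are illustrative, and when attachPolicyARNs is set, the default Amazon EKS worker node policies must be listed explicitly alongside your own policy.

cat > eksctl-example.yaml <<'EOF'
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: spark-on-eks            # illustrative cluster name
  region: us-east-1             # illustrative region
nodeGroups:
  - name: spark-workers
    instanceType: m5.xlarge
    desiredCapacity: 3
    iam:
      attachPolicyARNs:
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
        - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
        - <S3_POLICY_ARN>       # the Amazon S3 policy created earlier
EOF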

In the first step, I package my Spark application in a Docker image. My Dockerfile is a multi-stage build: the first stage compiles and builds my binary with the SBT tool, the second stage is a Spark base layer, and the last stage produces my final deployable image. Neither SBT nor Apache Spark maintains a Docker base image, so I need to include the necessary files in my layers:

  • Build the Docker image for the application and push it to my repository:

docker build -t <REPO_NAME>/<IMAGE_NAME>:v1.0 .

docker push <REPO_NAME>/<IMAGE_NAME>:v1.0

Now that I have a package to deploy as a Docker image, I need to prepare the infrastructure for deployment:

  • Create the Kubernetes service account and the cluster role binding to give Kubernetes edit permissions to the Spark job:

kubectl apply -f example/kubernetes/spark-rbac.yaml
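For reference, roughly equivalent imperative commands are shown below. The ServiceAccount name spark and the default namespace are assumptions for illustration; the repository's spark-rbac.yaml is the authoritative definition.

kubectl create serviceaccount spark

kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark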

Finally, I need to deploy my Spark application to Amazon EKS. There are several ways to deploy Spark jobs to Kubernetes:

  • Use the spark-submit command from the server responsible for the deployment. Spark currently only supports authenticating to the Kubernetes API with client certificates, so this method is not compatible with Amazon EKS, which only supports IAM and bearer token authentication.
  • Use a Kubernetes job that runs the spark-submit command from within the Kubernetes cluster and ensures that the job reaches a success state. In this approach, spark-submit is run from a Kubernetes Pod and authentication relies on Kubernetes RBAC, which is fully compatible with Amazon EKS.
  • Use a Kubernetes custom controller (also called a Kubernetes Operator) to manage the Spark job lifecycle based on a declarative approach with Custom Resource Definitions (CRDs). This method is compatible with Amazon EKS and adds some functionalities that aren't currently supported in the Spark Kubernetes scheduler, like driver resiliency and advanced resource scheduling.

I will use the Kubernetes job method to avoid the additional dependency on a custom controller. For this, I need a Spark base image that contains the Spark binaries, including the spark-submit command:

  • Build the Docker image for Spark and push it to my repository:

docker build --target=spark -t <REPO_NAME>/spark:v2.4.4 .

docker push <REPO_NAME>/spark:v2.4.4

  • Use kubectl to create the job based on the description file, modifying the target Amazon S3 bucket and the Docker repositories to pull the images from (a rough sketch of what such a manifest can contain follows the command):

kubectl apply -f example/kubernetes/spark-job.yaml
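For illustration only, a Kubernetes Job wrapping spark-submit can look like the sketch below. This is not the repository's spark-job.yaml: the names, image references, main class, jar path, and output bucket are placeholders, and the spark-submit flags follow the same pattern as the sketch shown earlier in this post.

cat <<'EOF' | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: spark-on-eks-example
spec:
  backoffLimit: 0
  template:
    spec:
      serviceAccountName: spark                    # ServiceAccount created in the RBAC step
      restartPolicy: Never
      containers:
        - name: spark-submit
          image: <REPO_NAME>/spark:v2.4.4          # Spark base image built above
          command: ["/opt/spark/bin/spark-submit"] # assumes Spark is installed under /opt/spark
          args:
            - --master
            - k8s://https://kubernetes.default.svc:443
            - --deploy-mode
            - cluster
            - --name
            - spark-on-eks-example
            - --class
            - <MAIN_CLASS>
            - --conf
            - spark.kubernetes.container.image=<REPO_NAME>/<IMAGE_NAME>:v1.0
            - --conf
            - spark.kubernetes.authenticate.driver.serviceAccountName=spark
            - local:///opt/spark/jars/<APPLICATION_JAR>
            - s3a://<MY_BUCKET>/output/            # application argument: output location
EOF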

  • Use Kubernetes port forwarding to monitor the job via the Spark UI hosted on the Spark driver. First get the driver Pod name, then forward its port to localhost, and access the UI by pointing your browser to http://localhost:4040:

kubectl get pods | awk '/spark-on-eks/{print $1}'

kubectl port-forward <SPARK_DRIVER_POD_NAME> 4040:4040

After the job has finished, I can check the results via the Amazon Athena console by crawling the output data to create the tables and then querying them:

  • Add a crawler in the AWS Glue console to crawl the data in the Amazon S3 bucket and create two tables, one for raw rides and one for ride statistics per location:
    • Enter a crawler name, and click Next.
    • Select Data stores and click Next.
    • Select the previously used Amazon S3 bucket and click Next.
    • Enter a name for the AWS Glue IAM role and click Next.
    • Select Run on demand and click Next.
    • Choose the database where you want to add the tables, select Create a single schema for each Amazon S3 path, click Next and then Finish.
  • Run the crawler and wait for completion (a rough AWS CLI equivalent of these console steps is sketched after the queries below).
  • After selecting the right database in the Amazon Athena console, execute the following queries to preview the raw rides and to see which zone is the most attractive for taxi drivers to pick up passengers:

SELECT * FROM RAW_RIDES LIMIT 100;

SELECT locationid, yellow_avg_minute_rate, yellow_count, pickup_hour FROM "test"."value_rides" WHERE pickup_hour=TIMESTAMP '2018-01-01 5:00:00.000' ORDER BY yellow_avg_minute_rate DESC; 
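If you prefer the command line to the console, a rough AWS CLI equivalent of the crawler setup is sketched below. The crawler name, IAM role, and bucket are placeholders (the database name test matches the example query above), and the schema-grouping option selected in the console is not shown here.

aws glue create-crawler --name spark-on-eks-crawler --role <GLUE_CRAWLER_ROLE> --database-name test --targets '{"S3Targets":[{"Path":"s3://<MY_BUCKET>/"}]}'

aws glue start-crawler --name spark-on-eks-crawler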

Summary

In this blog post, you learned how Apache Spark can use Amazon EKS to run Spark applications on top of Kubernetes. You learned the current state of the Amazon EKS and Kubernetes integration with Spark, the limits of this approach, and the steps to build and run your ETL job, including building the Docker images, configuring Amazon EKS RBAC, and configuring the related AWS services.

 

Vincent Gromakowski


Vincent is an Analytics Specialist Solutions Architect at AWS where he enjoys solving customers' analytics, NoSQL, and streaming challenges. He has strong expertise in distributed data processing engines and resource orchestration platforms.