AWS Big Data Blog
Use Batch Processing Gateway to automate job management in multi-cluster Amazon EMR on EKS environments
AWS customers often process petabytes of data using Amazon EMR on EKS. In enterprise environments with diverse workloads or varying operational requirements, customers frequently choose a multi-cluster setup due to the following advantages:
- Better resiliency and no single point of failure – If one cluster fails, other clusters can continue processing critical workloads, maintaining business continuity
- Better security and isolation – Increased isolation between jobs enhances security and simplifies compliance
- Better scalability – Distributing workloads across clusters enables horizontal scaling to handle peak demands
- Performance benefits – Minimizing Kubernetes scheduling delays and network bandwidth contention improves job runtimes
- Increased flexibility – You can enjoy straightforward experimentation and cost optimization through workload segregation to multiple clusters
However, one of the disadvantages of a multi-cluster setup is that there is no straightforward method to distribute workloads and support effective load balancing across multiple clusters. This post proposes a solution to this challenge by introducing the Batch Processing Gateway (BPG), a centralized gateway that automates job management and routing in multi-cluster environments.
Challenges with multi-cluster environments
In a multi-cluster environment, Spark jobs on Amazon EMR on EKS need to be submitted to different clusters from various clients. This architecture introduces several key challenges:
- Endpoint management – Clients must maintain and update connections for each target cluster
- Operational overhead – Managing multiple client connections individually increases the complexity and operational burden
- Workload distribution – There is no built-in mechanism for job routing across multiple clusters, which impacts configuration, resource allocation, cost transparency, and resilience
- Resilience and high availability – Without load balancing, the environment lacks fault tolerance and high availability
BPG addresses these challenges by providing a single point of submission for Spark jobs. BPG automates job routing to the appropriate EMR on EKS clusters, providing effective load balancing, simplified endpoint management, and improved resilience. The proposed solution is particularly beneficial for customers with multi-cluster Amazon EMR on EKS setups using the Spark Kubernetes Operator with or without Yunikorn scheduler.
However, although BPG offers significant benefits, it is currently designed to work only with Spark Kubernetes Operator. Additionally, BPG has not been tested with the Volcano scheduler, and the solution is not applicable in environments using native Amazon EMR on EKS APIs.
Solution overview
Martin Fowler describes a gateway as an object that encapsulates access to an external system or resource. In this case, the resource is the EMR on EKS clusters running Spark. A gateway acts as a single point to confront this resource. Any code or connection interacts with the interface of the gateway only. The gateway then translates the incoming API request into the API offered by the resource.
BPG is a gateway specifically designed to provide a seamless interface to Spark on Kubernetes. It’s a REST API service to abstract the underlying Spark on EKS clusters details from users. It runs in its own EKS cluster communicating to Kubernetes API servers of different EKS clusters. Spark users submit an application to BPG through clients, then BPG routes the application to one of the underlying EKS clusters.
The process for submitting Spark jobs using BPG for Amazon EMR on EKS is as follows:
- The user submits a job to BPG using a client.
- BPG parses the request, translates it into a custom resource definition (CRD), and submits the CRD to an EMR on EKS cluster according to predefined rules.
- The Spark Kubernetes Operator interprets the job specification and initiates the job on the cluster.
- The Kubernetes scheduler schedules and manages the run of the jobs.
The following figure illustrates the high-level details of BPG. You can read more about BPG in the GitHub README.
The proposed solution involves implementing BPG for multiple underlying EMR on EKS clusters, which effectively resolves the drawbacks discussed earlier. The following diagram illustrates the details of the solution.
Source Code
You can find the code base in the AWS Samples and Batch Processing Gateway GitHub repository.
In the following sections, we walk through the steps to implement the solution.
Prerequisites
Before you deploy this solution, make sure the following prerequisites are in place:
- Access to a valid AWS account
- The AWS Command Line Interface (AWS CLI) installed on your local machine
- git, docker, eksctl, kubectl, helm, jq and yq utilities installed on the local machine
- Permission to create AWS resources
- Familiarity with Kubernetes, Amazon EKS, and Amazon EMR on EKS
Clone the repositories to your local machine
We assume that all repositories are cloned into the home directory (~/
). All relative paths provided are based on this assumption. If you have cloned the repositories to a different location, adjust the paths accordingly.
- Clone the BPG on EMR on EKS GitHub repo with the following command:
The BPG repository is currently under active development. To provide a stable deployment experience consistent with the provided instructions, we have pinned the repository to the stable commit hash aa3e5c8be973bee54ac700ada963667e5913c865
.
Before cloning the repository, verify any security updates and adhere to your organization’s security practices.
- Clone the BPG GitHub repo with the following command:
Create two EMR on EKS clusters
The creation of EMR on EKS clusters is not the primary focus of this post. For comprehensive instructions, refer to Running Spark jobs with the Spark operator. However, for your convenience, we have included the steps for setting up the EMR on EKS virtual clusters named spark-cluster-a-v
and spark-cluster-b-v
in the GitHub repo. Follow these steps to create the clusters.
After successfully completing the steps, you should have two EMR on EKS virtual clusters named spark-cluster-a-v
and spark-cluster-b-v
running on the EKS clusters spark-cluster-a
and spark-cluster-b
, respectively.
To verify the successful creation of the clusters, open the Amazon EMR console and choose Virtual clusters under EMR on EKS in the navigation pane.
Set up BPG on Amazon EKS
To set up BPG on Amazon EKS, complete the following steps:
- Change to the appropriate directory:
- Set up the AWS Region:
- Create a key pair. Make sure you follow your organization’s best practices for key pair management.
Now you’re ready to create the EKS cluster.
By default, eksctl
creates an EKS cluster in dedicated virtual private clouds (VPCs). To avoid reaching the default soft limit on the number of VPCs in an account, we use the --vpc-public-subnets
parameter to create clusters in an existing VPC. For this post, we use the default VPC for deploying the solution. Modify the following code to deploy the solution in the appropriate VPC in accordance with your organization’s best practices. For official guidance, refer to Create a VPC.
- Get the public subnets for your VPC:
- Create the cluster:
- On the Amazon EKS console, choose Clusters in the navigation pane and check for the successful provisioning of the
bpg-cluster
In the next steps, we make the following changes to the existing batch-processing-gateway code base:
- Because the bpg code base doesn’t come bundled with MySQL drivers, update the Project Object Model or POM xml file to include mysql-connector-java
- Update POM xml to use latest versions of all dependencies
- Update the SQL syntax in Data Access Object or DAO java file to match MySQL syntax
- Add instruction to download Apache Maven in the Dockerfile
For your convenience, we have provided the updated files in the batch-processing-gateway-on-emr-on-eks
repository. You can copy these files into the batch-processing-gateway
repository.
- Replace POM xml file:
- Replace DAO java file:
- Replace the Dockerfile:
Now you’re ready to build your Docker image.
- Create a private Amazon Elastic Container Registry (Amazon ECR) repository:
- Get the AWS account ID:
- Authenticate Docker to your ECR registry:
- Build your Docker image:
- Tag your image:
- Push the image to your ECR repository:
The ImagePullPolicy
in the batch-processing-gateway GitHub repo is set to IfNotPresent
. Update the image tag in case you need to update the image.
- To verify the successful creation and upload of the Docker image, open the Amazon ECR console, choose Repositories under Private registry in the navigation pane, and locate the
bpg
repository:
Set up an Amazon Aurora MySQL database
Complete the following steps to set up an Amazon Aurora MySQL-Compatible Edition database:
- List all default subnets for the given Availability Zone in a specific format:
- Create a subnet group. Refer to create-db-subnet-group for more details.
- List the default VPC:
- Create a security group:
- List the
bpg-rds-securitygroup
security group ID:
- Create the Aurora DB Regional cluster. Refer to create-db-cluster for more details.
- Create a DB Writer instance in the cluster. Refer to create-db-instance for more details.
- To verify the successful creation of the RDS Regional cluster and Writer instance, on the Amazon RDS console, choose Databases in the navigation pane and check for the
bpg
database.
Set up network connectivity
Security groups for EKS clusters are typically associated with the nodes and the control plane (if using managed nodes). In this section, we configure the networking to allow the node security group of the bpg-cluster
to communicate with spark-cluster-a
, spark-cluster-b
, and the bpg Aurora RDS cluster
.
- Identify the security groups of
bpg-cluster
,spark-cluster-a
,spark-cluster-b
, and thebpg Aurora RDS cluster
:
- Allow the node security group of the
bpg-cluster
to communicate withspark-cluster-a
,spark-cluster-b
, and thebpg Aurora RDS cluster
:
Deploy BPG
We deploy BPG for weight-based cluster selection. spark-cluster-a-v
and spark-cluster-b-v
are configured with a queue named dev
and weight=50
. We expect statistically equal distribution of jobs between the two clusters. For more information, refer to Weight Based Cluster Selection.
- Get the bpg-cluster context:
- Create a Kubernetes namespace for BPG:
The helm chart for BPG requires a values.yaml
file. This file includes various key-value pairs for each EMR on EKS clusters, EKS cluster, and Aurora cluster. Manually updating the values.yaml
file can be cumbersome. To simplify this process, we’ve automated the creation of the values.yaml
file.
- Run the following script to generate the
values.yaml
file:
- Use the following code to deploy the helm chart. Make sure the tag value in both
values.template.yaml
andvalues.yaml
matches the Docker image tag specified earlier.
- Verify the deployment by listing the pods and viewing the pod logs:
- Exec into the BPG pod and verify the health check:
We get the following output:
{"status":"OK"}
BPG is successfully deployed on the EKS cluster.
Test the solution
To test the solution, you can submit multiple Spark jobs by running the following sample code multiple times. The code submits the SparkPi
Spark job to the BPG, which in turn submits the jobs to the EMR on EKS cluster based on the set parameters.
- Set the kubectl context to the bpg cluster:
- Identify the bpg pod name:
- Exec into the bpg pod:
kubectl exec -it "<BPG-PODNAME>" -n bpg -- bash
- Submit multiple Spark jobs using the curl. Run the below curl command to submit jobs to
spark-cluster-a
andspark-cluster-b
:
After each submission, BPG will inform you of the cluster to which the job was submitted. For example:
- Verify that the jobs are running in the EMR cluster
spark-cluster-a
andspark-cluster-b
:
You can view the Spark Driver logs to find the value of Pi as shown below:
After successful completion of the job, you should be able to see the below message in the logs:
We have successfully tested the weight-based routing of Spark jobs across multiple clusters.
Clean up
To clean up your resources, complete the following steps:
- Delete the EMR on EKS virtual cluster:
- Delete the AWS Identity and Access Management (IAM) role:
- Delete the RDS DB instance and DB cluster:
- Delete the
bpg-rds-securitygroup
security group andbpg-rds-subnetgroup
subnet group:
- Delete the EKS clusters:
- Delete
bpg
ECR repository:
- Delete the key pairs:
Conclusion
In this post, we explored the challenges associated with managing workloads on EMR on EKS cluster and demonstrated the advantages of adopting a multi-cluster deployment pattern. We introduced Batch Processing Gateway (BPG) as a solution to these challenges, showcasing how it simplifies job management, enhances resilience, and improves horizontal scalability in multi-cluster environments. By implementing BPG, we illustrated the practical application of the gateway architecture pattern for submitting Spark jobs on Amazon EMR on EKS. This post provides a comprehensive understanding of the problem, the benefits of the gateway architecture, and the steps to implement BPG effectively.
We encourage you to evaluate your existing Spark on Amazon EMR on EKS implementation and consider adopting this solution. It allows users to submit, examine, and delete Spark applications on Kubernetes with intuitive API calls, without needing to worry about the underlying complexities.
For this post, we focused on the implementation details of the BPG. As a next step, you can explore integrating BPG with clients such as Apache Airflow, Amazon Managed Workflows for Apache Airflow (Amazon MWAA), or Jupyter notebooks. BPG works well with the Apache Yunikorn scheduler. You can also explore integrating BPG to use Yunikorn queues for job submission.
About the Authors
Umair Nawaz is a Senior DevOps Architect at Amazon Web Services. He works on building secure architectures and advises enterprises on agile software delivery. He is motivated to solve problems strategically by utilizing modern technologies.
Ravikiran Rao is a Data Architect at Amazon Web Services and is passionate about solving complex data challenges for various customers. Outside of work, he is a theater enthusiast and amateur tennis player.
Sri Potluri is a Cloud Infrastructure Architect at Amazon Web Services. He is passionate about solving complex problems and delivering well-structured solutions for diverse customers. His expertise spans across a range of cloud technologies, ensuring scalable and reliable infrastructure tailored to each project’s unique challenges.
Suvojit Dasgupta is a Principal Data Architect at Amazon Web Services. He leads a team of skilled engineers in designing and building scalable data solutions for AWS customers. He specializes in developing and implementing innovative data architectures to address complex business challenges.