Accelerating Spark workloads on Amazon EMR with Windjammer’s Spark plugin

AWS customers often use Apache Spark for distributed big data processing. Spark has gained popularity due to its fast in-memory computing that enables parallel computation of tasks across multiple nodes. To aid customers with running Spark workloads, Amazon EMR provides a managed cluster platform that makes it easy to run frameworks such as Apache Hadoop, Apache Spark, and Apache Hive.

However, Spark users sometimes experience issues with Spark’s Java Virtual Machine (JVM) being very processing-intensive, resulting in high operating costs and server sprawl. Other common issues are Spark not fully exploiting the bandwidth of Amazon S3 and the need for expensive shuffle operations for fault tolerance.

Windjammer is a member of the AWS Partner Network (APN) and is available in AWS Marketplace. Windjammer EMR Accelerator is a compatible native Spark plugin that you can implement on top of Amazon EMR. Windjammer can help address Spark bottlenecks, provide cost reduction, enhance performance, and add fault tolerance.

In this post, Kai and I share how to implement the Windjammer Spark plugin on your EMR cluster to achieve performance improvements and cost reduction on your Spark workloads. We will also show how to perform A/B testing to measure the impact the Accelerator has on your workloads.

Prerequisites

You must complete the following prerequisites before implementing the Windjammer EMR Accelerator:

Subscribe to EMR Accelerator (EMR 6.X/SPARK) via AWS Marketplace.
Choose View purchase options, then choose Subscribe. To access the Windjammer Welcome page, choose Set up your account.
On the Welcome page, enter your email address and select Submit. You should see the Windjammer EMR Accelerator: Guide to Deployment, Management, and A/B Benchmarking Proof of Concepts (POC).

Solution overview

You can implement the Windjammer EMR Accelerator as a Spark plugin that optimizes Spark SQL query execution in a compatible way. You install the Accelerator on the EMR cluster via an EMR bootstrap action, which requires no changes to the Spark application or code.

Solution walkthrough: Accelerating Spark workloads on Amazon EMR with Windjammer’s Spark plugin

Following is a step-by-step guide for creating EMR clusters with the EMR Accelerator, running A/B benchmarking, and running Spark workloads with the Accelerator.

Step 1: Create and configure an EMR cluster with EMR Accelerator

In this section, we will show you how to create an EMR cluster through the cluster creation wizard in the Amazon EMR console.

As of publish date, the sample datasets for A/B benchmarking are available only in the following Regions:

US East (N. Virginia)
US East (Ohio)
US West (Oregon)

If you would like to run the A/B benchmarking with sample data in an unavailable Region, you must copy the sample Transaction Processing Performance Council Decision Support (TPC-DS) dataset from an available Region to a local bucket in your account. Data transfer costs may be incurred. See the Windjammer Deployment and POC Guide for more details.

Step 1.1 Select the software and steps for the cluster

In the Amazon EMR console, under EMR on EC2 in the left navigation pane, choose Clusters, and then choose Create cluster.
Select Go to advanced options.
Under Software configuration, select an Amazon EMR 6.X release (6.2 and above are supported).
At minimum, select Hadoop 3.2.1+, Hive 3.1.2+, and Spark 3.1.2+ applications.
In the Steps section, under Step type, select Custom JAR. Select Add step. In the Add step dialog box:
- Set JAR location to command-runner.jar.
- Set Arguments to /tmp/wjm-prep.
- Select Add.

On the lower right side, select Next.

Step 1.2 Select the hardware for the cluster

The following Amazon Elastic Compute Cloud (Amazon EC2) instance type selections show a balanced cluster configuration. This configuration is running the EMR Accelerator framework with the Transaction Processing Performance Council-Decision Support (TPC-DS) workload at 3 terabyte (TB) or 10 TB scale.

When selecting different EC2 instance types, select instances that have networking performance of approximately 1 Gbps per virtual central processing unit (vCPU). For example, you might choose EC2 instances with an “n” in their name. EMR Accelerator does not require instances with local solid-state drives (SSD) Typically, EMR clusters running the EMR Accelerator require fewer EC2 instances than stock EMR clusters while still accelerating performance.

To select hardware for the cluster, do the following:

Under the Cluster Nodes and Instances section, select the following instances:
1. For the Master node instance type select m5.xlarge.
2. For the core or task node instance types, select r5dn.4xlarge.
3. In the Node type column, select the Edit icon to optionally provision the cluster with two or four core or task nodes.
Configure the Root device EBS volume size to at least 50 GiB.
Select Next.

Step 1.3 Select the general cluster settings

You may optionally choose the available cluster settings; however, I will only show the necessary steps here. To select the general cluster settings, do the following:

In the cluster creation wizard General Options section, select Logging.
In the Bootstrap Actions section, choose the following bootstrap actions:
1. Set Add bootstrap action to Custom action.
2. Choose Configure and Add.
3. In Script location, enter s3://wjm-build-1-5/bootstrap.

- Optional arguments: If you would like to optionally run A/B benchmarking and your cluster is in one of the AWS Regions listed in Step 1, you do not need any other bootstrap arguments.
- If you are not running in one of these Regions and have copied the dataset to a bucket in your account and Region, in the Optional arguments dialog, specify the bucket name WJM_DATASETS=<local TPC-DS bucket name>.
- WJM_DATASETS must be in capital letters, and the bucket is specified without the s3://

1. Select Add.
Select Next.

Step 1.4 Select the security for the cluster

To choose your security for the cluster, do the following:

In the Security Options section, select an EC2 key pair. You may proceed without a key pair, but you will be unable connect to the nodes by using SSH.
Select Create cluster.

Step 2: Running A/B benchmarks on Spark workloads

The following steps to running TPC-DS queries also generate a performance report comparing EMR Accelerator to Amazon EMR Spark. This helps you evaluate the EMR Accelerator performance against your own workloads.

To run A/B benchmarks on pre-installed TPC-DS queries, do the following:

In the Amazon EMR console, under EMR on EC2 in the left navigation pane, choose Clusters, and select the cluster that you want to view.
Select the Steps tab.
Select Add step.
In the Add step dialog:
- Set JAR location to command-runner.jar.
- Set Arguments to: /opt/wjm/bin/sperfjob <size of dataset in TB> <query number> “sne emr”. For example: /opt/wjm/bin/sperfjob 3 3 “sne emr”
- Select Add.
- For a full list of the queries available, on the cluster’s master node, see the /opt/wjm/queries directory.
The Steps tab will show the status of the executing query step.
To see a full report comparing the two benchmarks after step completion, select View logs and then the stdout link under the Log files column for the step.

To execute custom queries with A/B benchmarking, copy your SQL query and query results file to the master node. See the Windjammer Deployment and POC Guide for more details.

Step 3: Running Spark workloads with EMR Accelerator

With Windjammer’s Spark plug-in, do not add additional Spark jars or parameters to the spark-submit invocations or make any modifications to the Spark code. Amazon EMR notebooks should be used as is. As always, perform sufficient testing prior to running production workloads.

Cleanup

To release resources and prevent additional charges, do the following:

Terminate the EMR cluster that you set up for use with the EMR Accelerator.
If you made a local copy of the TPC-DS datasets bucket (s3:// wjm-datasets-<region>), empty and delete the S3 bucket.

Conclusion

In this post, Kai and I showed you how to use Windjammer EMR Accelerator to improve performance while reducing costs on Spark workloads. We also showed you how to perform A/B benchmarking with the Windjammer framework to observe the impact on Spark workloads.

To learn more about the EMR Accelerator, please see the Deployment/POC Guide or the Technical whitepaper.

About the authors

Kai Rothauge is Chief Technology Officer at Windjammer Technologies, where he leads Windjammer Accelerator architecture and design. Prior to Windjammer, Kai was a Postdoctoral Researcher at the University of California, Berkeley, in the RISELab, where he developed Alchemist to accelerate Spark for large-scale data analysis by offloading to high-performance computing libraries. Kai received his PhD from the University of British Columbia in Applied Mathematics and his MMath from the University of Bath. Prior to his doctoral studies, he also completed scientific internships at the Max Planck Institute, the Fraunhofer Institute, and CSIRO.

Amber Runnels is a Senior Analytics Specialist Solutions Architect at AWS specializing in big data and distributed systems. She helps customers optimize workloads in the AWS data ecosystem to achieve a scalable, performant, and cost-effective architecture. Aside from technology, she is passionate about exploring the many places and cultures this world has to offer, reading novels, and building terrariums.

AWS Marketplace