How do I submit Spark jobs to an Amazon EMR cluster from a remote machine or edge node?

Last updated: 2019-10-22

I want to submit Apache Spark jobs to an Amazon EMR cluster from a remote machine, such as an Amazon Elastic Compute Cloud (Amazon EC2) instance.

Short Description

To submit Spark jobs to an EMR cluster from a remote machine, the following must be true:

1.    Network traffic is allowed from the remote machine to all cluster nodes.

2.    All Spark and Hadoop binaries are installed on the remote machine.

3.    The configuration files on the remote machine point to the EMR cluster.

Resolution

Confirm that network traffic is allowed from the remote machine to all cluster nodes

  • If you are using an EC2 instance as a remote machine or edge node: Allow inbound traffic from that instance's security group to the security groups for each cluster node.
  • If you are using your own machine: Allow inbound traffic from your machine's IP address to the security groups for each cluster node.
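
For the EC2 case, the rule can be added with the AWS CLI. The following is a minimal sketch: both security group IDs are placeholders, and the command is composed and echoed rather than executed so that you can review it first. In practice you might scope the rule to specific protocols and ports instead of all traffic.

```shell
# Sketch: allow the edge node's security group to reach the EMR cluster nodes.
# Both group IDs below are placeholders; replace them with your own.
EDGE_SG="sg-0123456789abcdef0"     # security group of the remote EC2 instance
CLUSTER_SG="sg-0fedcba9876543210"  # security group of the EMR cluster nodes

# Compose the command and echo it; run it yourself once the IDs are real.
CMD="aws ec2 authorize-security-group-ingress --group-id $CLUSTER_SG --protocol all --source-group $EDGE_SG"
echo "$CMD"
```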

Install Spark and other dependent binaries on the remote machine

To install the binaries, copy the files from the EMR cluster's master node, as explained in the following steps. This is the easiest way to be sure that the same versions are installed on both the EMR cluster and the remote machine.

1.    Run the following commands to create the folder structure on the remote machine:

sudo mkdir -p /var/aws/emr/
sudo mkdir -p /etc/hadoop/conf
sudo mkdir -p /etc/spark/conf
sudo mkdir -p /var/log/spark/user/
sudo chmod -R 777 /var/log/spark/

2.    Copy the following files from the EMR cluster's master node to the remote machine. Don't change the folder structure or file names.
/etc/yum.repos.d/emr-apps.repo
/var/aws/emr/repoPublicKey.txt

3.    Run the following commands to install the Spark and Hadoop binaries:

sudo yum install -y hadoop-client
sudo yum install -y hadoop-hdfs
sudo yum install -y spark-core
sudo yum install -y java-1.8.0-openjdk

If you want to use the AWS Glue Data Catalog with Spark, run the following command on the remote machine to install the AWS Glue libraries:

sudo yum install -y libgssglue

Create the configuration files and point them to the EMR cluster

Note: You can also use tools such as rsync to copy the configuration files from the EMR cluster's master node to the remote instance.

1.    Run the following commands on the EMR cluster's master node to copy the configuration files to Amazon Simple Storage Service (Amazon S3). Replace yours3bucket with the name of the bucket that you want to use.

aws s3 cp /etc/spark/conf s3://yours3bucket/emrhadoop-conf/sparkconf/ --recursive
aws s3 cp /etc/hadoop/conf s3://yours3bucket/emrhadoop-conf/hadoopconf/ --recursive

2.    Download the configuration files from the S3 bucket by running the following commands on the remote machine. Replace yours3bucket with the name of the bucket that you used in the previous step.

sudo aws s3 cp s3://yours3bucket/emrhadoop-conf/hadoopconf/ /etc/hadoop/conf/ --recursive
sudo aws s3 cp s3://yours3bucket/emrhadoop-conf/sparkconf/ /etc/spark/conf/ --recursive
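
After copying the files, it's worth confirming that the configuration actually points at the EMR cluster. The sketch below parses fs.defaultFS out of core-site.xml; the sample file and master hostname are stand-ins for your real /etc/hadoop/conf/core-site.xml.

```shell
# Sketch: confirm that fs.defaultFS points at the EMR master node.
# The sample core-site.xml written here is an assumption standing in for
# the real file copied from the cluster.
CONF_DIR=$(mktemp -d)
cat > "$CONF_DIR/core-site.xml" <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://ip-10-0-0-1.ec2.internal:8020</value>
  </property>
</configuration>
EOF

# Pull the value element that follows the fs.defaultFS property name.
DEFAULT_FS=$(grep -A1 '<name>fs.defaultFS</name>' "$CONF_DIR/core-site.xml" \
    | sed -n 's/.*<value>\(.*\)<\/value>.*/\1/p')
echo "fs.defaultFS = $DEFAULT_FS"
```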

3.    Create the HDFS home directory for the user who will submit the Spark job to the EMR cluster. In the following commands, replace sparkuser with the name of your user.

hdfs dfs -mkdir /user/sparkuser
hdfs dfs -chown sparkuser:sparkuser /user/sparkuser

The remote machine is now ready for a Spark job.

Submit the Spark job

Run the following command to submit a Spark job to the EMR cluster. Replace these values:
  • org.apache.spark.examples.SparkPi: the class that serves as the entry point for the job
  • /usr/lib/spark/examples/jars/spark-examples.jar: the path to the Java .jar file

spark-submit --master yarn --deploy-mode cluster --class org.apache.spark.examples.SparkPi /usr/lib/spark/examples/jars/spark-examples.jar
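
The same call can be wrapped in a small script so that the class and jar are easy to swap. This is a sketch, not part of the article's procedure; the DRY_RUN switch is an assumption added so the command can be inspected without a cluster.

```shell
# Sketch: a parameterized form of the spark-submit call above. The defaults
# are the SparkPi example that ships with Spark; override MAIN_CLASS and
# APP_JAR for your own job. DRY_RUN=1 (the default here) prints the command
# instead of running it.
MAIN_CLASS="${MAIN_CLASS:-org.apache.spark.examples.SparkPi}"
APP_JAR="${APP_JAR:-/usr/lib/spark/examples/jars/spark-examples.jar}"

CMD="spark-submit --master yarn --deploy-mode cluster --class $MAIN_CLASS $APP_JAR"

if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$CMD"    # inspect the command without a cluster
else
    $CMD           # actually submit the job
fi
```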

You can also access HDFS data from the remote machine using hdfs commands.

Common errors

Standalone mode

Amazon EMR doesn't support standalone mode for Spark. It's not possible to submit a Spark application to a remote Amazon EMR cluster with code like the following:

SparkConf conf = new SparkConf().setMaster("spark://master_url:7077").setAppName("Word Count");

Instead, set up your local machine as explained earlier in this article. Then, submit the application using the spark-submit command.

java.lang.UnsupportedClassVersionError

The following error occurs when the remote EC2 instance is running Java version 1.7 and the EMR cluster is running Java 1.8:

Exception in thread "main" java.lang.UnsupportedClassVersionError:
org/apache/spark/launcher/Main : Unsupported major.minor version 52.0
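
Major version 52 corresponds to Java 8 class files (51 is Java 7), which is why a Java 7 runtime can't load them. As a sketch of where that number lives: the major version sits in bytes 6-7 of the class-file header. The snippet below builds a minimal header with printf (an assumption standing in for a real .class file) and reads the version back, the same check you could run on a real class file.

```shell
# Sketch: read the class-file major version from the header.
# Bytes 0-3: magic 0xCAFEBABE; bytes 4-5: minor; bytes 6-7: major (0x34 = 52).
HDR=$(mktemp)
printf '\312\376\272\276\000\000\000\064' > "$HDR"

# Skip 6 bytes, read the 2-byte big-endian major version as unsigned decimals.
MAJOR=$(od -An -j6 -N2 -tu1 "$HDR" | awk '{print $1*256 + $2}')
echo "class file major version: $MAJOR"   # 52 -> Java 8, 51 -> Java 7
```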

To resolve this error, run the following commands to upgrade the Java version on the EC2 instance:

sudo yum install java-1.8.0
sudo yum remove java-1.7.0-openjdk
