How do I submit Spark jobs to a remote Amazon EMR cluster?

Last updated: 2019-04-18

I want to submit Apache Spark jobs to a remote Amazon EMR cluster. How can I do this?

Resolution

Prepare your local machine

Note: You can submit Spark jobs with deploy-mode set to either client or cluster. In cluster mode, the driver runs on the cluster nodes, which already have all of the Hadoop binaries installed. In client mode, the driver runs on your local machine, so you must download the Hadoop binaries and install them there.

1.    Install all Spark client libraries on your local machine. For example, if you are using an emr-5.10.0 cluster (which has Spark 2.2.0 installed), download spark-2.2.0-bin-hadoop2.7.tgz. Then, extract the archive on your local machine and add its bin directory to the PATH environment variable. To determine which version of Spark and Apache Hadoop you're using (and therefore which Spark binary you must download), see Spark Release History and Hadoop Version History.
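For example, a minimal sketch of this step for Spark 2.2.0 might look like the following. The download URL and the /opt installation directory are assumptions; adjust them for your Spark version and environment:

# Download and extract the Spark binary distribution (verify the URL against the Apache archive)
wget https://archive.apache.org/dist/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.7.tgz
tar -xzf spark-2.2.0-bin-hadoop2.7.tgz -C /opt
# Make spark-submit available on PATH (example directory)
export SPARK_HOME=/opt/spark-2.2.0-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH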

2.    (Optional) If you plan to use the AWS Glue Data Catalog with Spark, copy the AWS Glue client libraries and dependencies to the remote Amazon EMR cluster.

3.    Create an environment variable called HADOOP_CONF_DIR.

Note: If you want to use the AWS Glue Data Catalog with Spark, create an environment variable called SPARK_CONF_DIR instead.

4.    Point the environment variable to a directory on your local machine.
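For example, steps 3 and 4 together might look like the following sketch. The directory name emr-hadoop-conf is just an example; any local directory works:

# Create a local directory for the cluster configuration files and point HADOOP_CONF_DIR to it
mkdir -p ~/emr-hadoop-conf
export HADOOP_CONF_DIR=~/emr-hadoop-conf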

5.    Copy all files in /etc/hadoop/conf on the remote Amazon EMR cluster. Then, add them to the directory that HADOOP_CONF_DIR points to. Spark uses the configuration files, such as yarn-site.xml for YARN settings and hdfs-site.xml for HDFS settings, that are in the directory that HADOOP_CONF_DIR points to.

Note: If you want to use the AWS Glue Data Catalog with Spark, copy all files in /etc/spark/conf on the cluster. Then, add them to the directory that SPARK_CONF_DIR points to.
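For example, you might copy the configuration files with scp, as in the following sketch. The key file and master node public DNS name are placeholders for your own values:

# Copy the cluster's Hadoop configuration files to the local HADOOP_CONF_DIR
scp -i ~/mykey.pem hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com:/etc/hadoop/conf/* $HADOOP_CONF_DIR/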

6.    To connect to the Amazon EMR cluster from your local machine, follow the steps in Set Up an SSH Tunnel to the Master Node Using Dynamic Port Forwarding.
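For example, a tunnel that uses dynamic port forwarding might look like the following sketch. The key file, master node public DNS name, and local port 8157 are placeholders:

# Open an SSH tunnel to the master node with dynamic port forwarding on local port 8157
ssh -i ~/mykey.pem -N -D 8157 hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com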

Submit the Spark job

Your local machine is now ready to submit a Spark job to a remote Amazon EMR cluster. Use a command similar to the following:

spark-submit --master yarn --deploy-mode cluster --class <your class> <your_jar>.jar
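For example, to test the setup you could submit the SparkPi example that ships with the Spark binary distribution. The jar path below assumes the Spark 2.2.0 distribution from step 1:

spark-submit --master yarn --deploy-mode cluster --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.11-2.2.0.jar 10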

Common errors

Standalone mode

Amazon EMR doesn't support standalone mode for Spark. It's not possible to submit a Spark application to a remote Amazon EMR cluster with code like the following:

SparkConf conf = new SparkConf().setMaster("spark://<master url>:7077").setAppName("Word Count");

Instead, set up your local machine as explained above. Then, submit the application using the spark-submit command.

java.lang.NoClassDefFoundError

When you run spark-submit in cluster mode, Spark might throw the following exception:

Exception in thread "main" java.lang.NoClassDefFoundError: com/sun/jersey/api/client/config/ClientConfig

To resolve this error, set the yarn.timeline-service.enabled property to false in yarn-site.xml in HADOOP_CONF_DIR on the local machine:

<property>
    <name>yarn.timeline-service.enabled</name>
    <value>false</value>
</property>
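Alternatively, assuming that your Spark version supports passing Hadoop properties with the spark.hadoop. prefix, you might be able to set the same property on the command line instead of editing yarn-site.xml:

spark-submit --master yarn --deploy-mode cluster --conf spark.hadoop.yarn.timeline-service.enabled=false --class <your class> <your_jar>.jar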

AccessControlException

The following error occurs when the local machine user ("Administrator," in the following example) doesn't have write permission to HDFS:

Exception in thread "main" org.apache.hadoop.security.AccessControlException: Permission denied: user=Administrator, access=WRITE, inode="/user/Administrator/.sparkStaging/application_1509666570568_0068":hdfs:hadoop:drwxr-xr-x

To resolve this error, create the user on the Amazon EMR cluster, and then add the user to the hadoop group:

1.    Connect to the master node of your Amazon EMR cluster using SSH.

2.    Run a command similar to the following to add the user to the cluster:

[hadoop@ip-10-0-0-171 ~]$ sudo adduser Administrator

3.    Run a command similar to the following to add the user to the hadoop group:

[hadoop@ip-10-0-0-171 ~]$ sudo usermod -g hadoop Administrator

4.    Verify that the user was added to the hadoop group by running the following command:

[hadoop@ip-10-0-0-171 ~]$ hdfs groups Administrator

You should get output similar to the following:

Administrator : hadoop
