I want to submit Apache Spark jobs to a remote Amazon EMR cluster. How can I do this?

Prepare your local machine

Note: Spark jobs can be submitted when deploy-mode is set to client or cluster.

1.    Install the Spark client libraries on your local machine. For example, if you are using an emr-5.10.0 cluster (which has Spark 2.2.0 installed), download spark-2.2.0-bin-hadoop2.7.tgz, extract it, and add its bin directory to your local machine's PATH environment variable. To determine which versions of Spark and Apache Hadoop your cluster runs (and therefore which Spark binary you need to download), see Spark Release History and Hadoop Version History.

2.    Create an environment variable called HADOOP_CONF_DIR, and then point it to a directory on your local machine. All files in /etc/hadoop/conf on the Amazon EMR cluster must be present in the directory that HADOOP_CONF_DIR points to. Spark uses the configuration files, such as yarn-site.xml for YARN settings and hdfs-site.xml for HDFS settings, that are in the directory that HADOOP_CONF_DIR points to.

Note: When you submit a Spark job in cluster mode, the driver runs on cluster nodes that have all Hadoop binaries installed. When you submit a Spark job in client mode, all Hadoop binaries must be downloaded and installed on your local machine.

3.    To connect to the Amazon EMR cluster from your local machine, set up an SSH tunnel to the master node using dynamic port forwarding. For details, see Set Up an SSH Tunnel to the Master Node Using Dynamic Port Forwarding.
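The three preparation steps above might look like the following on a Linux or macOS machine. This is only a sketch under stated assumptions, not an official procedure: the download URL (the Apache release archive), the install directory, the key file path, the master node DNS name, and the local SOCKS port 8157 are all illustrative values that you would replace with your own. The network operations are wrapped in functions so that sourcing the snippet only sets environment variables.

```shell
# Step 1: Spark client that matches emr-5.10.0 (Spark 2.2.0, Hadoop 2.7).
SPARK_PKG="spark-2.2.0-bin-hadoop2.7"
# Assumed download location (Apache release archive).
SPARK_URL="https://archive.apache.org/dist/spark/spark-2.2.0/${SPARK_PKG}.tgz"
# Hypothetical install location on the local machine.
INSTALL_DIR="${INSTALL_DIR:-$HOME/opt}"

# Hypothetical connection details; replace with your key pair and master node.
EMR_KEY_FILE="${EMR_KEY_FILE:-$HOME/mykey.pem}"
EMR_MASTER_DNS="${EMR_MASTER_DNS:-master-public-dns-name}"

# Download and extract the Spark binary distribution.
install_spark() {
  mkdir -p "$INSTALL_DIR"
  curl -fSL "$SPARK_URL" -o "/tmp/${SPARK_PKG}.tgz"
  tar -xzf "/tmp/${SPARK_PKG}.tgz" -C "$INSTALL_DIR"
}

# Put spark-submit on the PATH.
export SPARK_HOME="$INSTALL_DIR/$SPARK_PKG"
export PATH="$SPARK_HOME/bin:$PATH"

# Step 2: local copy of the cluster's /etc/hadoop/conf directory.
export HADOOP_CONF_DIR="${HADOOP_CONF_DIR:-$HOME/emr-hadoop-conf}"

fetch_hadoop_conf() {
  mkdir -p "$HADOOP_CONF_DIR"
  scp -i "$EMR_KEY_FILE" -r \
    "hadoop@${EMR_MASTER_DNS}:/etc/hadoop/conf/*" "$HADOOP_CONF_DIR/"
}

# Step 3: SSH tunnel to the master node with dynamic port forwarding.
# -N opens no remote shell; -D starts a local SOCKS proxy on the given port.
open_emr_tunnel() {
  ssh -i "$EMR_KEY_FILE" -N -D "${SOCKS_PORT:-8157}" "hadoop@${EMR_MASTER_DNS}"
}
```

After sourcing the snippet, you would run install_spark and fetch_hadoop_conf once, and then run open_emr_tunnel in a separate terminal, because the ssh command stays in the foreground while the tunnel is open.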

Submit the Spark job

Your local machine is now ready to submit a Spark job to a remote Amazon EMR cluster, using a command similar to the following: 

spark-submit --master yarn --deploy-mode cluster --class <your class> <your_jar>.jar
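As a concrete illustration of the command above, the following sketch submits the SparkPi example that ships with the Spark 2.2.0 binary distribution. It assumes that SPARK_HOME points to the extracted spark-2.2.0-bin-hadoop2.7 directory (the default path shown is hypothetical) and that HADOOP_CONF_DIR already holds the cluster's configuration files. The SPARK_SUBMIT variable is an illustrative convenience, not a Spark feature: it lets you point at a specific binary, or at echo for a dry run.

```shell
# Hypothetical default; set SPARK_HOME to wherever you extracted Spark.
SPARK_HOME="${SPARK_HOME:-$HOME/opt/spark-2.2.0-bin-hadoop2.7}"
# Override with SPARK_SUBMIT=echo to print the arguments instead of submitting.
SPARK_SUBMIT="${SPARK_SUBMIT:-spark-submit}"
# Examples jar bundled with the Spark 2.2.0 (Scala 2.11) distribution.
EXAMPLES_JAR="$SPARK_HOME/examples/jars/spark-examples_2.11-2.2.0.jar"

# Submit SparkPi to YARN in cluster mode; 100 is the number of partitions
# that SparkPi uses for its estimate.
submit_spark_pi() {
  "$SPARK_SUBMIT" \
    --master yarn \
    --deploy-mode cluster \
    --class org.apache.spark.examples.SparkPi \
    "$EXAMPLES_JAR" 100
}
```

In cluster mode the driver runs on the cluster, so the job's output appears in the YARN application logs rather than in the local terminal.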

Common errors

Standalone mode:

Amazon EMR doesn't support standalone mode for Spark. It's not possible to submit a Spark application to a remote Amazon EMR cluster with code similar to the following: 

SparkConf conf = new SparkConf().setMaster("spark://<master url>:7077").setAppName("Word Count");

Instead, set up your local machine as explained above and submit the application using the spark-submit command.


java.lang.NoClassDefFoundError:

When you run spark-submit in cluster mode, Spark might throw the following exception: 

Exception in thread "main" java.lang.NoClassDefFoundError: com/sun/jersey/api/client/config/ClientConfig

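To resolve this error, set the yarn.timeline-service.enabled property to false in the yarn-site.xml file in HADOOP_CONF_DIR on the local machine: 

<property>
  <name>yarn.timeline-service.enabled</name>
  <value>false</value>
</property>

org.apache.hadoop.security.AccessControlException:

The following error occurs when the local machine user ("Administrator" in the following example) doesn't have write permission to HDFS: 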

Exception in thread "main" org.apache.hadoop.security.AccessControlException: Permission denied: user=Administrator, access=WRITE, inode="/user/Administrator/.sparkStaging/application_1509666570568_0068":hdfs:hadoop:drwxr-xr-x

To resolve this problem, create the user on the Amazon EMR cluster and add it to the hadoop group:

1.    Connect to the master node of your Amazon EMR cluster using SSH.

2.    Run a command similar to the following to add the user to the cluster: 

[hadoop@ip-10-0-0-171 ~]$ sudo adduser Administrator

3.    Run a command similar to the following to add the user to the hadoop group:

[hadoop@ip-10-0-0-171 ~]$ sudo usermod -g hadoop Administrator

4.    Verify that the user was added to hadoop by running the following command: 

[hadoop@ip-10-0-0-171 ~]$ hdfs groups Administrator

The output should look similar to the following: 

Administrator : hadoop

Published: 2018-09-11