How do I submit Spark jobs to a remote Amazon EMR cluster?
Last updated: 2019-04-18
I want to submit Apache Spark jobs to a remote Amazon EMR cluster. How can I do this?
Prepare your local machine
Note: Spark jobs can be submitted with deploy-mode set to either client or cluster. In cluster mode, the driver runs on cluster nodes, which already have all of the Hadoop binaries installed. In client mode, the driver runs on your local machine, so you must download and install all of the Hadoop binaries there.
1. Install all Spark client libraries on your local machine. For example, if you're using an emr-5.10.0 cluster (which has Spark 2.2.0 installed), download spark-2.2.0-bin-hadoop2.7.tgz, extract it, and add the extracted bin directory to your local machine's PATH environment variable. To determine which versions of Spark and Apache Hadoop your cluster is running (and therefore which Spark binary you must download), see Spark Release History and Hadoop Version History.
2. (Optional) If you plan to use the AWS Glue Data Catalog with Spark, copy the AWS Glue client libraries and dependencies to the remote Amazon EMR cluster.
3. Create an environment variable called HADOOP_CONF_DIR.
Note: If you want to use the AWS Glue Data Catalog with Spark, create an environment variable called SPARK_CONF_DIR instead.
4. Point the environment variable to a directory on your local machine.
5. Copy all files in /etc/hadoop/conf on the remote Amazon EMR cluster. Then, add them to the directory that HADOOP_CONF_DIR points to. Spark uses the configuration files, such as yarn-site.xml for YARN settings and hdfs-site.xml for HDFS settings, that are in the directory that HADOOP_CONF_DIR points to.
Note: If you want to use the AWS Glue Data Catalog with Spark, copy all files in /etc/spark/conf on the cluster. Then, add them to the directory that SPARK_CONF_DIR points to.
6. To connect to the Amazon EMR cluster from your local machine, Set Up an SSH Tunnel to the Master Node Using Dynamic Port Forwarding.
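Steps 3 through 5 can be sketched as the following shell commands. The local directory, key file, and master node DNS name are placeholders, not values from this article; substitute your own:

```shell
# Create a local directory to hold the cluster's Hadoop configuration
# (hypothetical path; any directory works).
mkdir -p "$HOME/emr-hadoop-conf"

# Create the HADOOP_CONF_DIR environment variable and point it at that
# directory. Add this line to your shell profile to make it persistent.
export HADOOP_CONF_DIR="$HOME/emr-hadoop-conf"

# Copy the cluster's configuration files into the directory. Uncomment
# and substitute your own key file and master node public DNS name.
# scp -i ~/mykey.pem hadoop@<master-public-dns>:/etc/hadoop/conf/* "$HADOOP_CONF_DIR/"

echo "$HADOOP_CONF_DIR"
```

If you're using the AWS Glue Data Catalog with Spark, substitute SPARK_CONF_DIR and /etc/spark/conf, as described in the notes above.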
Submit the Spark job
Your local machine is now ready to submit a Spark job to a remote Amazon EMR cluster. Use a command similar to the following:
spark-submit --master yarn --deploy-mode cluster --class <your class> <your_jar>.jar
Amazon EMR doesn't support standalone mode for Spark. It's not possible to submit a Spark application to a remote Amazon EMR cluster with the following command:
SparkConf conf = new SparkConf().setMaster("spark://<master url>:7077").setAppName("Word Count");
Instead, set up your local machine as explained above. Then, submit the application using the spark-submit command.
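If you installed the Hadoop binaries on your local machine as described in the preparation steps, you can also submit in client mode; only the --deploy-mode flag changes (the class and JAR names are placeholders, as above):

```
spark-submit --master yarn --deploy-mode client --class <your class> <your_jar>.jar
```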
When you run spark-submit in cluster mode, Spark might throw the following exception:
Exception in thread "main" java.lang.NoClassDefFoundError: com/sun/jersey/api/client/config/ClientConfig
To resolve this error, set the yarn.timeline-service.enabled property to false in yarn-site.xml in HADOOP_CONF_DIR on the local machine:
<property>
  <name>yarn.timeline-service.enabled</name>
  <value>false</value>
</property>
The following error occurs when the local machine user ("Administrator," in the following example) doesn't have write permission to HDFS:
Exception in thread "main" org.apache.hadoop.security.AccessControlException: Permission denied: user=Administrator, access=WRITE, inode="/user/Administrator/.sparkStaging/application_1509666570568_0068":hdfs:hadoop:drwxr-xr-x
To resolve this problem, create the user on the Amazon EMR cluster, and then add the user to the hadoop group:
1. Connect to the master node using SSH.
2. Run a command similar to the following to create the user on the cluster:
[hadoop@ip-10-0-0-171 ~]$ sudo adduser Administrator
3. Run a command similar to the following to add the user to the hadoop group:
[hadoop@ip-10-0-0-171 ~]$ sudo usermod -g hadoop Administrator
4. Verify that the user was added to hadoop by running the following command:
[hadoop@ip-10-0-0-171 ~]$ hdfs groups Administrator
You should see output similar to the following:
Administrator : hadoop