How do I set Spark parameters in Amazon EMR?

Last updated: 2023-01-09

I want to configure Apache Spark parameters in Amazon EMR.

Short description

There are two methods for configuring Spark applications:

  • Pass the values as command line arguments, such as to the spark-submit command, to avoid hardcoding them.
  • Set the values in the spark-defaults.conf file to make the changes permanent.

Resolution

Configure Spark parameters using spark-submit

The Spark shell and the spark-submit command support two ways to load configurations dynamically:

  • Use command line options, such as --num-executors.
  • Use the flag --conf.

Note: Run spark-submit --help to show the complete list of options.

The spark-submit command also reads configuration options from spark-defaults.conf. In the spark-defaults.conf file, each line consists of a key and a value separated by white space.
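As a quick illustration of that key/value format, the following sketch writes a spark-defaults-style file to a hypothetical location under /tmp (not an actual EMR path) and reads one value back with awk:

```shell
# Write a hypothetical spark-defaults-style file
# (each line: a key and a value separated by white space).
cat > /tmp/spark-defaults-example.conf <<'EOF'
spark.executor.memory 4G
spark.driver.memory   2G
EOF

# Look up the value for one key; awk splits each line on white space.
awk '$1 == "spark.executor.memory" { print $2 }' /tmp/spark-defaults-example.conf
```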

For more information, see Submitting user applications with spark-submit.

For more information on the parameters supported by Spark, see Spark configuration.

The following are some of the most common configuration options:

--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
--num-executors <value> \
--executor-memory <value>G \
--driver-memory <value>G \
--executor-cores <number of cores> \
--driver-cores <number of cores> \
--jars <comma-separated list of JARs> \
--packages <comma-separated list of Maven coordinates> \
--py-files <comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH for Python apps>

When you use spark-submit, the application JAR and any JARs included with the --jars option are automatically transferred to the cluster. URLs supplied after --jars must be separated by commas. The list is included in the driver and executor class paths, and the JARs and files are copied to the working directory of each SparkContext on the executor nodes. Keep in mind that directory expansion doesn't work with --jars.
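For example, a submission that ships two extra JARs might look like the following sketch. The application class and all JAR paths here are placeholders, not files that exist on every cluster:

```shell
# Hypothetical paths and class name: replace with JARs and a class
# that exist on your cluster. The two dependency JARs after --jars
# are comma-separated and will be copied to each executor's working directory.
spark-submit --deploy-mode cluster \
  --class com.example.MyApp \
  --master yarn \
  --jars /home/hadoop/libs/dep1.jar,/home/hadoop/libs/dep2.jar \
  /home/hadoop/my-app.jar
```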

Example

spark-submit --deploy-mode cluster --class org.apache.spark.examples.SparkPi --conf spark.dynamicAllocation.enabled=false --master yarn --num-executors 4 --driver-memory 4G --executor-memory 4G --executor-cores 1 /usr/lib/spark/examples/jars/spark-examples.jar 10

You can pass the memory parameters using the flag --conf as shown in the following example:

spark-submit --deploy-mode cluster --class org.apache.spark.examples.SparkPi --conf spark.dynamicAllocation.enabled=false --master yarn --conf spark.driver.memory=1G --conf spark.executor.memory=1G /usr/lib/spark/examples/jars/spark-examples.jar 10

Launch spark-shell and pyspark shell using custom Spark parameters

To launch the spark-shell or pyspark shell with custom Spark parameters, run the following commands:

spark-shell

spark-shell --conf spark.driver.maxResultSize=1G --conf spark.driver.memory=1G --deploy-mode client --conf spark.executor.memory=1G --conf spark.executor.heartbeatInterval=10000000s --conf spark.network.timeout=10000001s --executor-cores 1 --num-executors 5 --packages org.apache.spark:spark-avro_2.12:3.1.2 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'

pyspark shell

pyspark --conf spark.driver.maxResultSize=1G --conf spark.driver.memory=1G --deploy-mode client --conf spark.executor.memory=1G --conf spark.executor.heartbeatInterval=10000000s --conf spark.network.timeout=10000001s --executor-cores 1 --num-executors 5 --packages org.apache.spark:spark-avro_2.12:3.1.2 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'

Configure Spark parameters using spark-defaults.conf

To make the configuration changes permanent, append the configuration to the file /etc/spark/conf/spark-defaults.conf. Then, restart the Spark History Server. The following example configures the executor memory and driver memory in spark-defaults.conf. In this example, each line consists of a key and a value separated by white space.

Example

spark.executor.memory     9486M
spark.driver.memory       9486M
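After appending lines like these, restart the Spark History Server so the change takes effect. On Amazon EMR releases that use systemd (5.30.0 and later), that might look like the following sketch; verify the service name on your release:

```shell
# Run on the primary (master) node. Assumes the systemd service name
# spark-history-server used on recent EMR releases.
sudo systemctl restart spark-history-server
sudo systemctl status spark-history-server
```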

The following example configuration sets the Spark driver and executor memory during cluster launch:

[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.executor.memory": "9486M",
      "spark.driver.memory": "9486M"
    }
  }
]
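A configuration object like this is typically supplied when the cluster is created, for example through the AWS CLI's --configurations option. In the following sketch, the release label, instance settings, and JSON file name are placeholders to adjust for your environment:

```shell
# Save the spark-defaults classification JSON to a local file
# (here, a hypothetical spark-config.json), then reference it at launch.
aws emr create-cluster \
  --release-label emr-6.9.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --configurations file://spark-config.json
```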