How can I customize the configuration for an Apache Spark job in an Amazon EMR notebook?
Short description
An Amazon EMR notebook is a serverless Jupyter notebook. The notebook uses the Sparkmagic kernel as a client to work interactively with Spark on a remote EMR cluster through an Apache Livy server. You can use Sparkmagic commands to customize the Spark configuration. A custom configuration is useful when you want to do the following:
- Change executor memory and executor cores for a Spark job
- Change resource allocation for Spark
Resolution
Modify the current session
1. In a Jupyter notebook cell, run the %%configure magic command to modify the job configuration. The -f (force) flag restarts the Livy session so that the new settings take effect. In the following example, the command changes the executor memory for the Spark job.
%%configure -f
{"executorMemory":"4G"}
2. For Spark properties that you would normally pass with the spark-submit --conf option, use a nested conf object, as shown in the following example. Use this method instead of explicitly passing a configuration to a SparkContext or SparkSession in your code.
%%configure -f
{"conf":{"spark.dynamicAllocation.enabled":"false"}}
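You can combine top-level session fields and nested Spark properties in a single cell. The following is a minimal sketch; the values shown (4G memory, 2 cores) are illustrative, not recommendations:

```
%%configure -f
{"executorMemory":"4G","executorCores":2,"conf":{"spark.dynamicAllocation.enabled":"false"}}
```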
Confirm that the configuration change was successful
1. On the client side, run the %%info command in a Jupyter notebook cell to see the current session configuration. Example output:
Current session configs: {'executorMemory': '4G', 'conf': {'spark.dynamicAllocation.enabled': 'false'}, 'kind': 'pyspark'}
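Because the session configuration that %%info prints is a Python dict literal, you can sanity-check it outside the notebook with ast.literal_eval. A minimal sketch, using a copy of the example output above for illustration:

```python
import ast

# Session config string as printed by %%info (copied from the example above).
info = ("{'executorMemory': '4G', "
        "'conf': {'spark.dynamicAllocation.enabled': 'false'}, "
        "'kind': 'pyspark'}")

# Safely parse the dict literal and check the settings we applied.
config = ast.literal_eval(info)
print(config['executorMemory'])                              # 4G
print(config['conf']['spark.dynamicAllocation.enabled'])     # false
```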
2. On the server side, check the /var/log/livy/livy-livy-server.out log on the EMR cluster's master node. If the SparkSession started successfully, you'll see a log entry similar to the following:
20/06/24 10:11:22 INFO InteractiveSession$: Creating Interactive session 2: [owner: null, request: [kind: pyspark, proxyUser: None, executorMemory: 4G, conf: spark.dynamicAllocation.enabled -> false, heartbeatTimeoutInSecond: 0]]
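If you want to verify a setting from the log programmatically, the request fields can be pulled out with a regular expression. A minimal sketch against a copy of the sample entry above; the pattern is illustrative and not part of the Livy API:

```python
import re

# Sample Livy log entry in the format shown above (illustrative copy).
log_line = (
    "20/06/24 10:11:22 INFO InteractiveSession$: Creating Interactive session 2: "
    "[owner: null, request: [kind: pyspark, proxyUser: None, executorMemory: 4G, "
    "conf: spark.dynamicAllocation.enabled -> false, heartbeatTimeoutInSecond: 0]]"
)

# Extract the executor memory requested for the session.
match = re.search(r"executorMemory: ([^,\]]+)", log_line)
executor_memory = match.group(1) if match else None
print(executor_memory)  # 4G
```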
Related information
Apache Livy - REST API