Best practices for successfully managing memory for Apache Spark applications on Amazon EMR
In the world of big data, a common use case is performing extract, transform (ET) and data analytics on huge amounts of data from a variety of data sources. Often, you then analyze the data to get insights. One of the most popular cloud-based solutions to process such vast amounts of data is Amazon EMR.
Amazon EMR is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS. Amazon EMR enables organizations to spin up a cluster with multiple instances in a matter of minutes. It also enables you to process various data engineering and business intelligence workloads through parallel processing. This can greatly reduce the data processing times, effort, and costs involved in establishing and scaling a cluster.
Apache Spark is a cluster-computing software framework that is open-source, fast, and general-purpose. It is widely used in distributed processing of big data. Apache Spark relies heavily on cluster memory (RAM) as it performs parallel computing in memory across nodes to reduce the I/O and execution times of tasks.
Generally, you perform the following steps when running a Spark application on Amazon EMR:
- Upload the Spark application package to Amazon S3.
- Configure and launch the Amazon EMR cluster with configured Apache Spark.
- Install the application package from Amazon S3 onto the cluster and then run the application.
- Terminate the cluster after the application is completed.
It’s important to configure the Spark application appropriately based on data and processing requirements for it to be successful. With default settings, Spark might not use all the available resources of the cluster and might end up with physical or virtual memory issues, or both. There are thousands of questions on stackoverflow.com related to this specific topic.
This blog post is intended to assist you by detailing best practices to prevent memory-related issues with Apache Spark on Amazon EMR.
Common memory issues in Spark applications with default or improper configurations
Listed following are a few sample out-of-memory errors that can occur in a Spark application with default or improper configurations.
Out of Memory Error, Java Heap Space
Out of Memory Error, Exceeding Physical Memory
Out of Memory Error, Exceeding Virtual Memory
Out of Memory Error, Exceeding Executor Memory
These issues occur for various reasons, some of which are listed following:
- When the number of Spark executor instances, the amount of executor memory, the number of cores, or parallelism is not set appropriately to handle large volumes of data.
- When the Spark executor’s physical memory exceeds the memory allocated by YARN. In this case, the total of Spark executor instance memory plus memory overhead is not enough to handle memory-intensive operations. Memory-intensive operations include caching, shuffling, and aggregating (using groupBy, and so on). Or, in some cases, the total of Spark executor instance memory plus memory overhead can be more than the maximum container size that YARN is configured to allow.
- The memory required to perform system operations such as garbage collection is not available in the Spark executor instance.
In the following sections, I discuss how to properly configure Spark and YARN to prevent out-of-memory issues, including but not limited to those preceding.
Configuring for a successful Spark application on Amazon EMR
The following steps can help you configure a successful Spark application on Amazon EMR.
1. Determine the type and number of instances based on application needs
Amazon EMR has three types of nodes:
- Master: An EMR cluster has one master, which acts as the resource manager and manages the cluster and tasks.
- Core: The core nodes are managed by the master node. Core nodes run YARN NodeManager daemons, Hadoop MapReduce tasks, and Spark executors to manage storage, execute tasks, and send a heartbeat to the master.
- Task: The optional task-only nodes perform tasks and don’t store any data, in contrast to core nodes.
Best practice 1: Choose the right type of instance for each of the node types in an Amazon EMR cluster. Doing this is one key to success in running any Spark application on Amazon EMR.
There are numerous instance types offered by AWS with varying ranges of vCPUs, storage, and memory, as described in the Amazon EMR documentation. Based on whether an application is compute-intensive or memory-intensive, you can choose the right instance type with the right compute and memory configuration.
For memory-intensive applications, prefer R type instances over the other instance types. For compute-intensive applications, prefer C type instances. For applications balanced between memory and compute, prefer M type general-purpose instances.
To understand the possible use cases for each instance type offered by AWS, see Amazon EC2 Instance Types on the EC2 service website.
After deciding the instance type, determine the number of instances for each of the node types. You do this based on the size of the input datasets, application execution times, and frequency requirements.
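The instance-family guidance above can be captured in a small lookup. This is only a sketch: the profile names and the helper `suggest_instance_family` are hypothetical and are not part of any AWS API.

```python
# Hypothetical helper: maps a workload profile to the EC2 instance family
# suggested in the text. The profile names are illustrative only.
FAMILY_BY_PROFILE = {
    "memory": "R",    # memory-intensive applications
    "compute": "C",   # compute-intensive applications
    "balanced": "M",  # general-purpose workloads
}


def suggest_instance_family(profile: str) -> str:
    """Return the instance family letter for a workload profile."""
    try:
        return FAMILY_BY_PROFILE[profile]
    except KeyError:
        raise ValueError(f"unknown workload profile: {profile!r}") from None


print(suggest_instance_family("memory"))
```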
2. Determine the Spark configuration parameters
Before we dive into the details on Spark configuration, let’s get an overview of how the executor container memory is organized using the diagram following.
As the preceding diagram shows, the executor container has multiple memory compartments. Of these, only one (execution memory) is actually used for executing the tasks. These compartments should be properly configured for running the tasks efficiently and without failure.
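To make the compartments concrete, the following sketch estimates their sizes for one executor heap. It assumes Spark's unified memory manager with its fixed 300 MB reservation and the documented defaults of 0.6 for spark.memory.fraction and 0.5 for spark.memory.storageFraction. The compartment names and the 37 GB sample heap are assumptions for illustration, and memory overhead sits outside the heap, so it is not included here.

```python
RESERVED_MB = 300  # fixed reservation made by Spark's unified memory manager


def executor_memory_regions(heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    """Estimate the size in MB of each region inside one executor heap."""
    usable = heap_mb - RESERVED_MB
    unified = usable * memory_fraction    # shared by execution and storage
    storage = unified * storage_fraction  # cached data, protected from eviction
    execution = unified - storage         # shuffles, joins, sorts, aggregations
    user = usable - unified               # user data structures and metadata
    return {
        "reserved": float(RESERVED_MB),
        "user": user,
        "storage": storage,
        "execution": execution,
    }


# Sample heap of 37 GB per executor (an illustrative value).
regions = executor_memory_regions(37 * 1024)
for name, size_mb in regions.items():
    print(f"{name:>9}: {size_mb:,.0f} MB")
```

Execution and storage can borrow from each other at run time, so these numbers are starting boundaries rather than hard limits.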
Calculate and set the following Spark configuration parameters carefully for the Spark application to run successfully:
spark.executor.memory – Size of memory to use for each executor that runs the task.
spark.executor.cores – Number of virtual cores to use for each executor.
spark.driver.memory – Size of memory to use for the driver.
spark.driver.cores – Number of virtual cores to use for the driver.
spark.executor.instances – Number of executors. Set this parameter unless spark.dynamicAllocation.enabled is set to true.
spark.default.parallelism – Default number of partitions in resilient distributed datasets (RDDs) returned by transformations like parallelize when no partition number is set by the user.
Amazon EMR provides high-level information on how it sets the default values for Spark parameters in the release guide. These values are automatically set in the spark-defaults settings based on the core and task instance types in the cluster.
To use all the resources available in a cluster, set the maximizeResourceAllocation parameter to true. This EMR-specific option calculates the maximum compute and memory resources available for an executor on an instance in the core instance group. It then sets these parameters in the spark-defaults settings. Even with this setting, generally the default numbers are low and the application doesn’t use the full strength of the cluster. For example, the default for spark.default.parallelism is only 2 x the number of virtual cores available, though parallelism can be higher for a large cluster.
Spark on YARN can dynamically scale the number of executors used for a Spark application based on the workloads. Using Amazon EMR release version 4.4.0 and later, dynamic allocation is enabled by default (as described in the Spark documentation).
The problem with the spark.dynamicAllocation.enabled property is that it requires you to set subproperties. Some example subproperties are spark.dynamicAllocation.initialExecutors, minExecutors, and maxExecutors. Subproperties are required for most cases to use the right number of executors in a cluster for an application, especially when you need multiple applications to run simultaneously. Setting subproperties requires a lot of trial and error to get the numbers right. If they’re not right, the capacity might be reserved but never actually used. This wastes resources or causes memory errors for other applications.
Best practice 2: Set spark.dynamicAllocation.enabled to true only if the numbers are properly determined for the spark.dynamicAllocation.initialExecutors/minExecutors/maxExecutors parameters. Otherwise, set spark.dynamicAllocation.enabled to false and control the driver memory, executor memory, and CPU parameters yourself. To do this, calculate and set these properties manually for each application (see the example following).
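The either/or choice in best practice 2 can be sketched as a small helper that builds the matching Spark properties. The function `allocation_settings` and the sample executor counts are hypothetical; only the property names come from Spark.

```python
def allocation_settings(dynamic, executor_instances=None,
                        initial=None, minimum=None, maximum=None):
    """Build Spark properties for dynamic or static executor allocation."""
    if dynamic:
        # Dynamic allocation only makes sense when all three bounds are known.
        if None in (initial, minimum, maximum):
            raise ValueError("set initial, minimum, and maximum executors")
        if not minimum <= initial <= maximum:
            raise ValueError("expected minimum <= initial <= maximum")
        return {
            "spark.dynamicAllocation.enabled": "true",
            "spark.dynamicAllocation.initialExecutors": str(initial),
            "spark.dynamicAllocation.minExecutors": str(minimum),
            "spark.dynamicAllocation.maxExecutors": str(maximum),
        }
    if executor_instances is None:
        raise ValueError("set executor_instances for static allocation")
    return {
        "spark.dynamicAllocation.enabled": "false",
        "spark.executor.instances": str(executor_instances),
    }


# Static allocation with an illustrative executor count.
print(allocation_settings(False, executor_instances=170))
```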
Let’s assume that we are going to process 200 terabytes of data spread across thousands of file stores in Amazon S3. Further, let’s assume that we do this through an Amazon EMR cluster with 1 r5.12xlarge master node and 19 r5.12xlarge core nodes. Each r5.12xlarge instance has 48 virtual cores (vCPUs) and 384 GB RAM. All these calculations are for the --deploy-mode cluster, which we recommend for production use.
The following list describes how to set some important Spark properties, using the preceding case as an example.
Assigning executors with a large number of virtual cores leads to a low number of executors and reduced parallelism. Assigning a low number of virtual cores leads to a high number of executors, causing a larger amount of I/O operations. Based on historical data, we suggest that you have five virtual cores for each executor to achieve optimal results in any sized cluster.
For the preceding cluster, the property spark.executor.cores should be assigned as follows:
spark.executor.cores = 5 (vCPU)
After you decide on the number of virtual cores per executor, calculating the executor memory (spark.executor.memory) is much simpler. First, get the number of executors per instance using the total number of virtual cores and the executor virtual cores. Subtract one virtual core from the total number of virtual cores to reserve it for the Hadoop daemons.
Number of executors per instance = (total number of virtual cores per instance - 1) / spark.executor.cores
Number of executors per instance = (48 - 1) / 5 = 47 / 5 = 9 (rounded down)
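The same arithmetic, written as a short sketch using the r5.12xlarge figures from the example (the variable names are mine):

```python
vcpus_per_instance = 48   # r5.12xlarge
executor_cores = 5        # spark.executor.cores
reserved_vcpus = 1        # left for the Hadoop daemons

# Integer division rounds down, matching the calculation in the text.
executors_per_instance = (vcpus_per_instance - reserved_vcpus) // executor_cores
print(executors_per_instance)  # prints 9
```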
Then, get the total executor memory by using the total RAM per instance and number of executors per instance. Leave 1 GB for the Hadoop daemons.
This total executor memory includes the executor memory and overhead (spark.yarn.executor.memoryOverhead). Assign 10 percent from this total executor memory to the memory overhead and the remaining 90 percent to executor memory.
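As a worked sketch of the memory split for the example cluster: the rounding choices below (round executor memory down, give the remainder to overhead) are my assumptions, chosen so that the executor memory stays within the per-executor total.

```python
ram_per_instance_gb = 384     # r5.12xlarge
reserved_gb = 1               # left for the Hadoop daemons
executors_per_instance = 9    # from the vCPU calculation

# Memory available to each executor container, rounded down.
total_executor_memory_gb = (ram_per_instance_gb - reserved_gb) // executors_per_instance

# 90 percent goes to spark.executor.memory (rounded down) and the
# remainder goes to spark.yarn.executor.memoryOverhead.
executor_memory_gb = total_executor_memory_gb * 90 // 100
memory_overhead_gb = total_executor_memory_gb - executor_memory_gb

print(total_executor_memory_gb, executor_memory_gb, memory_overhead_gb)  # prints 42 37 5
```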
We recommend setting this to equal
We recommend setting this to equal
Calculate this by multiplying the number of executors and total number of instances. Leave one executor for the driver.
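For the example cluster, this works out as follows (a sketch; the variable names are mine):

```python
executors_per_instance = 9
core_instances = 19   # r5.12xlarge core nodes
driver_executors = 1  # one executor slot left for the driver

executor_instances = executors_per_instance * core_instances - driver_executors
print(executor_instances)  # prints 170
```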
Set this property using the following formula.
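The sketch below shows one way to compute this. The multiplier of 2 is an assumption: it mirrors the "2 x the number of virtual cores" default mentioned earlier and yields 1,700 partitions for the example cluster. The per-partition size estimate is an extra step I added to show why the number might need adjusting for a 200 TB input.

```python
executor_instances = 170
executor_cores = 5
partitions_per_core = 2  # assumed multiplier, see the note above

default_parallelism = executor_instances * executor_cores * partitions_per_core
print(default_parallelism)  # prints 1700

# Rough size of each partition if the whole input were split evenly.
input_size_gb = 200 * 1024  # 200 TB expressed in GB
gb_per_partition = input_size_gb / default_parallelism
print(f"about {gb_per_partition:.0f} GB per partition")
```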
Warning: Although this calculation gives partitions of 1,700, we recommend that you estimate the size of each partition and adjust this number accordingly by using
In case of dataframes, configure the parameter
spark.sql.shuffle.partitions along with
Though the preceding parameters are critical for any Spark application, the following parameters also help in running the applications smoothly to avoid other timeout and memory-related errors. We advise that you set these in the spark-defaults configuration file.
spark.network.timeout – Timeout for all network transactions.
spark.executor.heartbeatInterval – Interval between each executor’s heartbeats to the driver. This value should be significantly less than
spark.memory.fraction – Fraction of JVM heap space used for Spark execution and storage. The lower this is, the more frequently spills and cached data eviction occur.
spark.memory.storageFraction – Expressed as a fraction of the size of the region set aside by spark.memory.fraction. The higher this is, the less working memory might be available to execution. This means that tasks might spill to disk more often.
spark.yarn.scheduler.reporterThread.maxFailures – Maximum number of executor failures allowed before YARN can fail the application.
spark.rdd.compress – When set to true, this property can save substantial space at the cost of some extra CPU time by compressing the RDDs.
spark.shuffle.compress – When set to true, this property compresses the map output to save space.
spark.shuffle.spill.compress – When set to true, this property compresses the data spilled during shuffles.
spark.sql.shuffle.partitions – Sets the number of partitions for joins and aggregations.
spark.serializer – Sets the serializer to serialize or deserialize data. As a serializer, I prefer Kryo (org.apache.spark.serializer.KryoSerializer), which is faster and more compact than the Java default serializer.
To understand more about each of the parameters mentioned preceding, see the Spark documentation.
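As a sanity check on the timeout settings, the following sketch verifies that the heartbeat interval is well below the network timeout, which is the relationship the Spark configuration documentation describes. The helper names and the `min_ratio` threshold are hypothetical, the parser handles only the "s" and "ms" suffixes, and the sample values are Spark's documented defaults rather than tuned recommendations.

```python
def to_seconds(value: str) -> float:
    """Parse a duration such as '120s' or '500ms' (only these two suffixes)."""
    if value.endswith("ms"):
        return float(value[:-2]) / 1000.0
    if value.endswith("s"):
        return float(value[:-1])
    raise ValueError(f"unsupported duration: {value!r}")


def check_heartbeat(conf, min_ratio=5.0):
    """Return timeout/heartbeat, or raise if the heartbeat is too close to the timeout."""
    timeout = to_seconds(conf["spark.network.timeout"])
    heartbeat = to_seconds(conf["spark.executor.heartbeatInterval"])
    if heartbeat * min_ratio > timeout:
        raise ValueError("heartbeat interval is too close to the network timeout")
    return timeout / heartbeat


sample_conf = {
    "spark.network.timeout": "120s",            # Spark default
    "spark.executor.heartbeatInterval": "10s",  # Spark default
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.rdd.compress": "true",
}
print(check_heartbeat(sample_conf))  # prints 12.0
```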
We recommend you consider these additional programming techniques for efficient Spark processing:
coalesce – Reduces the number of partitions to allow for less data movement.
repartition – Reduces or increases the number of partitions and performs full shuffle of data as opposed to
partitionBy – Distributes data horizontally across partitions.
bucketBy – Decomposes data into more manageable parts (buckets) based on hashed columns.
cache/persist – Pulls datasets into a clusterwide in-memory cache. Doing this is useful when data is accessed repeatedly, such as when querying a small lookup dataset or when running an iterative algorithm.
Best practice 3: Carefully calculate the preceding additional properties based on application requirements. Set these properties appropriately in
spark-defaults, when submitting a Spark application (
spark-submit), or within a
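For the spark-submit route, each property can be passed with a `--conf key=value` argument. The sketch below assembles such a command line from a dictionary; the helper `build_spark_submit` and the S3 path are hypothetical, and the property values are the sample figures for the example cluster.

```python
import shlex


def build_spark_submit(app_path, conf, deploy_mode="cluster"):
    """Return the spark-submit argument list for the given properties."""
    cmd = ["spark-submit", "--master", "yarn", "--deploy-mode", deploy_mode]
    for key, value in sorted(conf.items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app_path)
    return cmd


conf = {
    "spark.executor.cores": "5",
    "spark.executor.memory": "37g",
    "spark.executor.instances": "170",
    "spark.default.parallelism": "1700",
}
# Hypothetical application location, for illustration only.
cmd = build_spark_submit("s3://my-bucket/my_app.py", conf)
print(shlex.join(cmd))
```

Keeping the properties in a dictionary makes it easy to reuse the same values in spark-defaults or in application code.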
3. Implement a proper garbage collector to clear memory effectively
Garbage collection can lead to out-of-memory errors in certain cases. These include cases when there are multiple large RDDs in the application. Other cases occur when there is an interference between the task execution memory and RDD cached memory.
You can use multiple garbage collectors to evict the old objects and place the new ones into the memory. However, the latest Garbage First Garbage Collector (G1GC) overcomes the latency and throughput limitations with the old garbage collectors.
Best practice 4: Always set up a garbage collector when handling large volumes of data through Spark.
-XX:+UseG1GC specifies that the G1GC garbage collector should be used. (The default is -XX:+UseParallelGC.) To understand the frequency and execution time of the garbage collection, use the parameters -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps. To initiate garbage collection sooner, set InitiatingHeapOccupancyPercent to 35 (the default is 45). Doing this helps avoid a potential full garbage collection of the total memory, which can take a significant amount of time. An example follows.
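One way to pass these JVM flags is through the spark.executor.extraJavaOptions and spark.driver.extraJavaOptions properties, as in this sketch. Applying the same flags to the driver is my assumption. The two Print flags are Java 8 style GC logging options, so they might need replacing on newer JVMs that use unified logging.

```python
gc_flags = [
    "-XX:+UseG1GC",                           # use the G1 collector
    "-XX:InitiatingHeapOccupancyPercent=35",  # start concurrent GC cycles earlier
    "-verbose:gc",                            # log each collection
    "-XX:+PrintGCDetails",                    # Java 8 style detailed GC logging
    "-XX:+PrintGCDateStamps",                 # Java 8 style timestamps in GC logs
]
java_options = " ".join(gc_flags)

gc_conf = {
    "spark.executor.extraJavaOptions": java_options,
    "spark.driver.extraJavaOptions": java_options,
}
print(gc_conf["spark.executor.extraJavaOptions"])
```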
4. Set the YARN configuration parameters
Even if all the Spark configuration properties are calculated and set correctly, virtual out-of-memory errors can still occur rarely as virtual memory is bumped up aggressively by the OS. To prevent these application failures, set the following flags in the YARN site settings.
Best practice 5: Always set the virtual and physical memory check flags to false.
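In an EMR configuration, these flags belong to the yarn-site classification. The sketch below builds that entry, assuming the standard YARN NodeManager property names for the virtual and physical memory checks.

```python
import json

yarn_site = {
    "Classification": "yarn-site",
    "Properties": {
        # Stop YARN from killing containers that exceed virtual memory limits.
        "yarn.nodemanager.vmem-check-enabled": "false",
        # Stop YARN from killing containers that exceed physical memory limits.
        "yarn.nodemanager.pmem-check-enabled": "false",
    },
}
print(json.dumps([yarn_site], indent=2))
```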
5. Perform debugging and monitoring
To get details on where the Spark configuration options are coming from, you can run spark-submit with the --verbose option. Also, you can use Ganglia and the Spark UI to monitor the application progress, cluster RAM usage, network I/O, and so on.
In the following example, we compare the outcomes between configured and non-configured Spark applications using Ganglia graphs.
When configured following the methods described, a Spark application can process 10 TB of data successfully without any memory issues on an Amazon EMR cluster whose specs are as follows:
- 1 r5.12xlarge master node
- 19 r5.12xlarge core nodes
- 8 TB total RAM
- 960 total virtual CPUs
- 170 executor instances
- 5 virtual CPUs/executor
- 37 GB memory/executor
- Parallelism equals 1,700
Following, you can find Ganglia graphs for reference.
If you run the same Spark application with default configurations on the same cluster, it fails with an out-of-physical-memory error. This is because the default configurations (two executor instances, parallelism of 2, one vCPU/executor, 8-GB memory/executor) aren’t enough to process 10 TB of data. Though the cluster had 7.8 TB of memory, the default configurations limited the application to only 16 GB of memory, leading to the following out-of-memory error.
Also, for large datasets, the default garbage collectors don’t clear the memory efficiently enough for the tasks to run in parallel, causing frequent failures. The following charts help in comparing the RAM usage and garbage collection with the default and G1GC garbage collectors. With G1GC, the RAM used is maintained below 5 TB (see the blue area in the graph).
With the default garbage collector (CMS), the RAM used goes above 5 TB. This can lead to the failure of the Spark job when running many tasks continuously.
Example: EMR instance template with configuration
There are different ways to set the Spark and YARN configuration parameters. One way is to pass these when creating the EMR cluster.
To do this, in the Amazon EMR console’s Edit software settings section, you can enter the appropriately updated configuration template (Enter configuration). Or the configuration can be passed from S3 (Load JSON from S3).
Following is a configuration template with sample values. At a minimum, calculate and set the following parameters for a successful Spark application.
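A minimal sketch of such a template, generated from the values calculated for the example cluster, might look like the following. The helper `emr_configurations` is hypothetical. Sizing the driver like an executor and expressing the overhead in MiB are my assumptions, so adjust them for your application and Spark version.

```python
import json


def emr_configurations(executor_memory_gb, overhead_mb, executor_cores,
                       executor_instances, parallelism):
    """Build an EMR configuration list with yarn-site and spark-defaults entries."""
    spark_defaults = {
        "spark.dynamicAllocation.enabled": "false",
        "spark.executor.memory": f"{executor_memory_gb}g",
        "spark.yarn.executor.memoryOverhead": str(overhead_mb),
        "spark.executor.cores": str(executor_cores),
        "spark.executor.instances": str(executor_instances),
        "spark.default.parallelism": str(parallelism),
        # Assumption: the driver is sized like a single executor.
        "spark.driver.memory": f"{executor_memory_gb}g",
        "spark.driver.cores": str(executor_cores),
    }
    yarn_site = {
        "yarn.nodemanager.vmem-check-enabled": "false",
        "yarn.nodemanager.pmem-check-enabled": "false",
    }
    return [
        {"Classification": "yarn-site", "Properties": yarn_site},
        {"Classification": "spark-defaults", "Properties": spark_defaults},
    ]


# Sample values for 19 r5.12xlarge core nodes.
template = emr_configurations(37, 5 * 1024, 5, 170, 1700)
print(json.dumps(template, indent=2))
```

The printed JSON can be pasted into the console's configuration box or saved to Amazon S3 and loaded from there.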
In this blog post, I detailed the possible out-of-memory errors, their causes, and a list of best practices to prevent these errors when submitting a Spark application on Amazon EMR.
My colleagues and I formed these best practices after thorough research and understanding of various Spark configuration properties and testing multiple Spark applications. These best practices apply to most out-of-memory scenarios, though there might be some rare scenarios where they don’t apply. However, we believe that this blog post provides all the details needed so you can tweak parameters and successfully run a Spark application.
About the Author
Karunanithi Shanmugam is a data engineer with AWS Tech and Finance.