How do I stop a Hadoop or Spark job's user cache from using too much disk space in Amazon EMR?

The user cache for my Apache Hadoop or Apache Spark job uses all the disk space on the partition. The Amazon EMR job fails or the HDFS NameNode service is in safe mode.

Short description

On an Amazon EMR cluster, YARN is configured to allow jobs to write cache data to /mnt/yarn/usercache. When you process a large amount of data or run multiple concurrent jobs, the /mnt file system can fill up. This causes NodeManager to fail on some nodes, and the job freezes or fails.

Use one of these methods to resolve this issue:

If you don't have long-running or streaming jobs, then adjust the user cache retention settings for YARN NodeManager.
If you have long-running or streaming jobs, then scale up the Amazon Elastic Block Store (Amazon EBS) volumes.

Resolution

Adjust the user cache retention settings for NodeManager

The following attributes define the cache cleanup settings:

yarn.nodemanager.localizer.cache.cleanup.interval-ms: This is the cache cleanup interval. The default value is 600,000 milliseconds (10 minutes). After each interval, if the cache size exceeds the value that's set in yarn.nodemanager.localizer.cache.target-size-mb, then files that running containers don't use are deleted.
yarn.nodemanager.localizer.cache.target-size-mb: This is the maximum disk space that's allowed for the cache. The default value is 10,240 MB (10 GB). When the cache disk size exceeds this value, files that running containers don't use are deleted on the interval that's set in yarn.nodemanager.localizer.cache.cleanup.interval-ms.
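
The interaction between the two settings can be illustrated with a small model. This is a simplified sketch, not YARN's actual implementation: the `CachedFile` type and the `run_cleanup_pass` function are hypothetical names, and the sketch assumes that a cleanup pass stops deleting as soon as the cache is back under the target size.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CachedFile:
    """Hypothetical stand-in for one localized file in the user cache."""
    name: str
    size_mb: int
    in_use: bool  # True if a running container still references the file


def run_cleanup_pass(files: List[CachedFile], target_size_mb: int) -> List[CachedFile]:
    """Model one cleanup interval and return the files that remain.

    Assumption: unused files are deleted in list order until the cache
    size is at or below target_size_mb. Files in use are never deleted.
    """
    total_mb = sum(f.size_mb for f in files)
    kept = []
    for f in files:
        if total_mb > target_size_mb and not f.in_use:
            total_mb -= f.size_mb  # file is deleted
        else:
            kept.append(f)
    return kept


cache = [
    CachedFile("job-a.jar", 4000, in_use=True),
    CachedFile("job-b.jar", 4000, in_use=False),
    CachedFile("job-c.jar", 4000, in_use=False),
]
kept = run_cleanup_pass(cache, target_size_mb=5120)
print([f.name for f in kept])  # prints ['job-a.jar']
```

In this model, a cache that holds only files in use stays above the target size no matter how often cleanup runs, which is why lowering the two values helps mainly with short-lived jobs.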

To set the cleanup interval and maximum disk space size on your cluster, complete the following steps:

Open /etc/hadoop/conf/yarn-site.xml on each core and task node. Reduce the values for yarn.nodemanager.localizer.cache.cleanup.interval-ms and yarn.nodemanager.localizer.cache.target-size-mb for each core and task node. For example:

sudo vim /etc/hadoop/conf/yarn-site.xml
yarn.nodemanager.localizer.cache.cleanup.interval-ms 400000
yarn.nodemanager.localizer.cache.target-size-mb 5120
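
If you prefer to script the edit instead of changing the file by hand, the following is a minimal sketch that uses only the Python standard library. The `set_yarn_properties` function name is hypothetical, and the sketch assumes the usual Hadoop layout of `<property>` elements with `<name>` and `<value>` children under a `<configuration>` root. Parsing with ElementTree drops comments and the XML declaration, so back up the file before you write the result over it.

```python
import xml.etree.ElementTree as ET


def set_yarn_properties(xml_text, updates):
    """Return yarn-site.xml content with the given properties set.

    Existing properties are updated in place. Properties that are
    missing from the file are appended to the <configuration> root.
    """
    root = ET.fromstring(xml_text)
    remaining = dict(updates)
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name in remaining:
            value = prop.find("value")
            if value is None:
                value = ET.SubElement(prop, "value")
            value.text = str(remaining.pop(name))
    for name, new_value in remaining.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(new_value)
    return ET.tostring(root, encoding="unicode")


# Small in-memory sample that stands in for /etc/hadoop/conf/yarn-site.xml.
SAMPLE = (
    "<configuration>"
    "<property>"
    "<name>yarn.nodemanager.localizer.cache.target-size-mb</name>"
    "<value>10240</value>"
    "</property>"
    "</configuration>"
)

updated = set_yarn_properties(SAMPLE, {
    "yarn.nodemanager.localizer.cache.cleanup.interval-ms": 400000,
    "yarn.nodemanager.localizer.cache.target-size-mb": 5120,
})
print(updated)
```

On a node, you would read the real file, pass its content to the function, and write the result back with root permissions before you restart NodeManager.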

Run the following commands on each core and task node to restart NodeManager:
```
sudo stop hadoop-yarn-nodemanager
sudo start hadoop-yarn-nodemanager
```
Note: In Amazon EMR versions 5.21.0 and later, you can also use a configuration object to override the cluster configuration or specify additional configuration classifications. For more information, see Reconfigure an instance group in a running cluster.
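
As an illustration of the reconfiguration approach, the following sketch builds the request for the boto3 `modify_instance_groups` call. The function names are hypothetical, and the cluster ID and instance group ID in the usage comment are placeholders that you must replace with your own values. The sketch assumes that boto3 is installed and that credentials are configured where you run it.

```python
def build_reconfiguration(instance_group_id, interval_ms, target_size_mb):
    """Build the InstanceGroups payload for a yarn-site reconfiguration."""
    return [
        {
            "InstanceGroupId": instance_group_id,
            "Configurations": [
                {
                    "Classification": "yarn-site",
                    "Properties": {
                        "yarn.nodemanager.localizer.cache.cleanup.interval-ms": str(interval_ms),
                        "yarn.nodemanager.localizer.cache.target-size-mb": str(target_size_mb),
                    },
                }
            ],
        }
    ]


def apply_reconfiguration(cluster_id, instance_groups):
    """Send the reconfiguration request to Amazon EMR."""
    import boto3  # imported here so the payload builder works without boto3

    emr = boto3.client("emr")
    emr.modify_instance_groups(ClusterId=cluster_id, InstanceGroups=instance_groups)


# Placeholder IDs for illustration only:
# payload = build_reconfiguration("ig-EXAMPLE", 400000, 5120)
# apply_reconfiguration("j-EXAMPLE", payload)
```

Property values are passed as strings because configuration classifications store them as strings. Repeat the call for each core and task instance group that you want to change.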

To set the cleanup interval and maximum disk space size on a new cluster at launch, add a configuration object similar to the following one:

[
    {
      "Classification": "yarn-site",
     "Properties": {
       "yarn.nodemanager.localizer.cache.cleanup.interval-ms": "400000",
       "yarn.nodemanager.localizer.cache.target-size-mb": "5120"
      }
    }
]

The deletion service doesn't remove files that belong to running containers. This means that even after you adjust the user cache retention settings, data might still spill to the following path and fill the file system:

{'yarn.nodemanager.local-dirs'}/usercache/user/appcache/application_id
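
To see which applications hold the most space under that path, you can total the size of each application directory. The following is a minimal sketch with hypothetical function names. It assumes a single local directory such as /mnt/yarn, and it might need root permissions to read other users' cache directories.

```python
import os


def directory_size_bytes(path):
    """Return the total size of all files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                pass  # the file was removed while the walk was in progress
    return total


def appcache_usage(local_dir):
    """Map (user, application_id) to bytes used under local_dir/usercache."""
    usage = {}
    usercache = os.path.join(local_dir, "usercache")
    if not os.path.isdir(usercache):
        return usage
    for user in os.listdir(usercache):
        appcache = os.path.join(usercache, user, "appcache")
        if not os.path.isdir(appcache):
            continue
        for app_id in os.listdir(appcache):
            app_dir = os.path.join(appcache, app_id)
            usage[(user, app_id)] = directory_size_bytes(app_dir)
    return usage


def report(local_dir="/mnt/yarn"):
    """Print applications sorted by space used, largest first."""
    ranked = sorted(appcache_usage(local_dir).items(), key=lambda kv: kv[1], reverse=True)
    for (user, app_id), size in ranked:
        print(f"{size / 1024 ** 2:10.1f} MB  {user}  {app_id}")
```

Call `report()` on a core or task node. If the largest entries belong to applications that are still running, then the retention settings won't free that space.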

Scale up the EBS volumes on the EMR cluster nodes

To scale up storage on a running EMR cluster, see Dynamically scale up storage on Amazon EMR clusters.

To scale up storage on a new EMR cluster, specify a larger volume size when you create the EMR cluster. You can also do this when you add nodes to an existing cluster:

Amazon EMR versions 5.22.0 and later: The default amount of EBS storage increases based on the size of the Amazon Elastic Compute Cloud (Amazon EC2) instance. For more information about the default amount of storage and number of volumes for each instance type, see Default Amazon EBS storage for instances.
Amazon EMR versions 5.21 and earlier: The default EBS volume size is 32 GB, and 27 GB is reserved for the /mnt partition. HDFS, YARN, the user cache, and all applications use the /mnt partition. Increase the size of your EBS volume as needed, such as 100-500 GB. You can also specify multiple EBS volumes that are mounted as /mnt1, /mnt2, and so on.

For Spark streaming jobs, you can also call RDD.unpersist() on data that you no longer need. Or, explicitly call System.gc() in Scala or sc._jvm.System.gc() in Python to start JVM garbage collection and remove the intermediate shuffle files.
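
In PySpark, the two calls can be wrapped in a small helper. This is a sketch under assumptions: `release_cached_rdd` is a hypothetical name, `sc` is your existing SparkContext, and `rdd` is an RDD that you previously cached or persisted.

```python
def release_cached_rdd(sc, rdd):
    """Drop an RDD's cached blocks and request a driver JVM garbage collection.

    sc:  an existing pyspark SparkContext
    rdd: a cached or persisted RDD that is no longer needed
    """
    rdd.unpersist()      # remove the cached blocks from memory and disk
    sc._jvm.System.gc()  # ask the JVM to run garbage collection


# Hypothetical usage inside a job, after the last action that needs the data:
# cached = sc.textFile("s3://amzn-s3-demo-bucket/input/").cache()
# cached.count()
# release_cached_rdd(sc, cached)
```

`System.gc()` is only a request to the JVM, so treat it as a best-effort cleanup rather than a guarantee that space is freed immediately.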
