Why is the core node in my Amazon EMR cluster running out of disk space?

Last updated: 2020-07-08

I'm running Apache Spark jobs on an Amazon EMR cluster. The core node is almost out of disk space. Why is this happening?

Resolution

Check for these common causes of high disk space utilization on the core node:

Local and temporary files from the Spark application

When you run Spark jobs, Spark applications create local files that can consume the rest of the disk space on the core node. Run the following command on the core node to see the 10 directories that are using the most disk space:

sudo du -hsx * | sort -rh | head -10
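
For example, to check usage under /mnt, where Amazon EMR mounts the instance store and Amazon EBS volumes by default, run the command from that directory. The path is only an example; run the command from whichever directory you want to inspect.

cd /mnt
sudo du -hsx * | sort -rh | head -10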

If local files are consuming the rest of the disk space, scale your cluster. For more information, see Scaling cluster resources.

Note: If the number of Spark executors doesn't increase as expected, increase the storage capacity of the Amazon Elastic Block Store (Amazon EBS) volumes that are attached to the core node. Or, add more EBS volumes to the core node.
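
Before you resize or add volumes, you can confirm which block devices are attached to the core node and how full each file system is. These are generic Linux checks, not Amazon EMR-specific tools; connect to the core node and run:

lsblk
df -h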

Spark application logs and job history files

When you run Spark jobs, Spark creates application logs and job history files in HDFS. These logs can consume the rest of the disk space on the core node. To resolve this problem, check the directories where the logs are stored and change the retention parameters, if necessary.

Spark application logs, which are the YARN container logs for your Spark jobs, are located in /var/log/hadoop-yarn/apps on the core node. These logs are copied to HDFS when the application finishes running. By default, YARN keeps application logs in HDFS for 48 hours. To reduce the retention period:

  1. Connect to the master node using SSH.
  2. Open the /etc/hadoop/conf/yarn-site.xml file on each node in your Amazon EMR cluster (master, core, and task nodes).
  3. Reduce the value of the yarn.log-aggregation.retain-seconds property on all nodes (see the example configuration after these steps).
  4. Restart the ResourceManager daemon. For more information, see Viewing and restarting Amazon EMR and application processes (daemons).
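
For example, the following yarn-site.xml snippet sets the retention period to 12 hours (43,200 seconds). The value is only an example; choose a retention period that fits your workload.

<property>
  <name>yarn.log-aggregation.retain-seconds</name>
  <value>43200</value>
</property>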

Note: After Spark copies the application logs to HDFS, they remain on the local disk so that Log Pusher can push the logs to Amazon Simple Storage Service (Amazon S3). The default retention period is four hours. To reduce the retention period, modify the /etc/logpusher/hadoop.config file.

Spark job history files are located in /var/log/spark/apps on the core node. When the filesystem history cleaner runs, Spark deletes job history files that are older than seven days. To reduce the default retention period:

  1. Connect to the master node using SSH.
  2. Open the /etc/spark/conf/spark-defaults.conf file on the master node.
  3. Reduce the value of the spark.history.fs.cleaner.maxAge property (see the example configuration at the end of this article).

By default, the filesystem history cleaner runs once a day. The frequency is specified in the spark.history.fs.cleaner.interval property. For more information, see Monitoring and instrumentation in the Spark documentation.
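
For example, the following spark-defaults.conf entries keep job history files for three days and run the cleaner every 12 hours. These values are examples only; adjust them for your workload. The history server reads these properties at startup, so restart the Spark history server after you change them.

spark.history.fs.cleaner.enabled   true
spark.history.fs.cleaner.maxAge    3d
spark.history.fs.cleaner.interval  12h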