Why is the core node in my Amazon EMR cluster running out of disk space?

Last updated: 2020-07-08

I'm running Apache Spark jobs on an Amazon EMR cluster. The core node is almost out of disk space. Why is this happening?

Resolution

Check for these common causes of high disk space utilization on the core node:

Local and temporary files from the Spark application

When you run Spark jobs, Spark applications create local files that can consume the rest of the disk space on the core node. Run the following command on the core node to see the 10 directories that are using the most disk space:

sudo du -hsx * | sort -rh | head -10
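
For example, to check usage under /mnt, where Amazon EMR mounts the instance store and Amazon EBS volumes by default, run the command from that directory. The path is only an example; run the command from whichever directory you want to inspect.

cd /mnt
sudo du -hsx * | sort -rh | head -10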

If local files are consuming the rest of the disk space, scale your cluster. For more information, see Scaling cluster resources.

Note: If the number of Spark executors doesn't increase as expected, increase the storage capacity of the Amazon Elastic Block Store (Amazon EBS) volumes that are attached to the core node. Or, add more EBS volumes to the core node.
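
Before you resize or add volumes, you can confirm which block devices are attached to the core node and how full each file system is. These are generic Linux checks, not Amazon EMR-specific tools; connect to the core node and run:

lsblk
df -h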

Spark application logs and job history files

When you run Spark jobs, Spark creates application logs and job history files in HDFS. These logs can consume the rest of the disk space on the core node. To resolve this problem, check the directories where the logs are stored and change the retention parameters, if necessary.

Spark application logs, which are the YARN container logs for your Spark jobs, are located in /var/log/hadoop-yarn/apps on the core node. These logs are copied to HDFS when the application finishes running. By default, YARN keeps application logs in HDFS for 48 hours. To reduce the retention period:

  1. Connect to the master node using SSH.
  2. Open the /etc/hadoop/conf/yarn-site.xml file on each node in your Amazon EMR cluster (master, core, and task nodes).
  3. Reduce the value of the yarn.log-aggregation.retain-seconds property on all nodes (see the example configuration after these steps).
  4. Restart the ResourceManager daemon. For more information, see Viewing and restarting Amazon EMR and application processes (daemons).
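
For example, the following yarn-site.xml snippet sets the retention period to 12 hours (43,200 seconds). The value is only an example; choose a retention period that fits your workload.

<property>
  <name>yarn.log-aggregation.retain-seconds</name>
  <value>43200</value>
</property>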

Note: After Spark copies the application logs to HDFS, they remain on the local disk so that Log Pusher can push the logs to Amazon Simple Storage Service (Amazon S3). The default retention period is four hours. To reduce the retention period, modify the /etc/logpusher/hadoop.config file.

Spark job history files are located in /var/log/spark/apps on the core node. When the filesystem history cleaner runs, Spark deletes job history files that are older than seven days. To reduce the default retention period:

  1. Connect to the master node using SSH.
  2. Open the /etc/spark/conf/spark-defaults.conf file on the master node.
  3. Reduce the value of the spark.history.fs.cleaner.maxAge property (see the example configuration at the end of this article).

By default, the filesystem history cleaner runs once a day. The frequency is specified in the spark.history.fs.cleaner.interval property. For more information, see Monitoring and instrumentation in the Spark documentation.
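
For example, the following spark-defaults.conf entries keep job history files for three days and run the cleaner every 12 hours. These values are examples only; adjust them for your workload. The history server reads these properties at startup, so restart the Spark history server after you change them.

spark.history.fs.cleaner.enabled   true
spark.history.fs.cleaner.maxAge    3d
spark.history.fs.cleaner.interval  12h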