How can I turn off Safemode for the NameNode service on my Amazon EMR cluster?

Last updated: 2021-09-03

When I try to run an Apache Hadoop or Apache Spark job on an Amazon EMR cluster, I get one of the following error messages:

  • Cannot create file/user/test.txt._COPYING_. Name node is in safe mode.
  • org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete /user/hadoop/.sparkStaging/application_15xxxxxxxx_0001. Name node is in safe mode. It was turned on manually. Use "hdfs dfsadmin -safemode leave" to turn safe mode off. NamenodeHostName:ip-xxx-xx-xx-xx.ec2.internal

I tried turning Safemode off, but it comes back on immediately. I want to get NameNode out of Safemode.

Short description

Safemode for the NameNode is essentially a read-only mode for the Hadoop Distributed File System (HDFS) cluster. NameNode might enter Safemode for different reasons, such as the following:

  • Available space is less than the amount of space required for the NameNode storage directory. The amount of space required for the NameNode directory is defined in the parameter dfs.namenode.resource.du.reserved.
  • NameNode is unable to load the FsImage and EditLog into memory.
  • NameNode didn't receive the block report from DataNode.
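To see the reserved-space threshold that's in effect on your cluster, you can query the configuration on the master node. This is a quick sketch; the getconf subcommand is standard HDFS tooling, and the default value is 104857600 bytes (100 MB):

```shell
# Print the configured NameNode reserved-space threshold, in bytes.
# The default is 104857600 (100 MB).
hdfs getconf -confKey dfs.namenode.resource.du.reserved

# Compare it with the space currently available on the /mnt volume.
df -h /mnt
```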

Check the NameNode logs, located in /var/log/hadoop-hdfs/, to find the root cause of the issue.
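For example, you can search the logs for Safemode-related messages. The exact log file name varies by cluster (it includes the hostname), so a wildcard is used here:

```shell
# Scan the NameNode logs for Safemode-related warnings; show the most recent.
grep -i "safe mode" /var/log/hadoop-hdfs/*namenode*.log | tail -20
```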

Resolution

Try one or more of the following troubleshooting options based on your use case.

Switch to a cluster with multiple master nodes

Checkpointing isn't automatic in clusters with a single master node. This means that HDFS edit logs aren't backed up to a new snapshot (FsImage) and then removed. HDFS uses edit logs to record filesystem changes between snapshots. If you have a cluster with a single master node and you don't remove the edit logs manually, these logs can eventually use all of the disk space in /mnt. To resolve this issue, launch a cluster with multiple master nodes. Clusters with multiple master nodes support high availability for HDFS NameNode, which resolves the checkpointing issue. For more information, see Plan and configure master nodes.

Remove unnecessary files from /mnt

The minimum available disk space for /mnt is specified by the dfs.namenode.resource.du.reserved parameter. When the available disk space in /mnt drops below the value set in dfs.namenode.resource.du.reserved, NameNode enters Safemode. The default value for dfs.namenode.resource.du.reserved is 100 MB. When NameNode is in Safemode, no filesystem or block modifications are allowed. Therefore, removing unnecessary files from /mnt might help resolve the issue. To delete the files that you no longer need, do the following:

1.    Connect to the master node using SSH.

2.    To verify that NameNode is in Safemode because of insufficient disk space, check the NameNode logs. These logs are located in /var/log/hadoop-hdfs. If the available disk space has dropped below the reserved threshold, the logs might look similar to the following:

2020-08-28 19:14:43,540 WARN org.apache.hadoop.hdfs.server.namenode.NameNodeResourceChecker (org.apache.hadoop.hdfs.server.namenode.FSNamesystem$NameNodeResourceMonitor@5baaae4c): Space available on volume '/dev/xvdb2' is 76546048, which is below the configured reserved amount 104857600

If NameNode has already entered Safemode because of low disk space, the logs might look similar to the following:

2020-09-28 19:14:43,540 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem (org.apache.hadoop.hdfs.server.namenode.FSNamesystem$NameNodeResourceMonitor@5baaae4c): NameNode low on available disk space. Already in safe mode.

3.    Confirm that NameNode is still in Safemode by running the following command:

[root@ip-xxx-xx-xx-xxx mnt]# hdfs dfsadmin -safemode get
Safe mode is ON

4.    Delete unnecessary files from /mnt. For example, on a cluster with one master node, if the /mnt/namenode/current directory is using a large amount of space, then you can create a new snapshot (FsImage) and then remove the old edit logs.

The following example script generates a new snapshot, backs up old edit logs to an Amazon Simple Storage Service (Amazon S3) bucket, and then removes the edit logs. The script doesn't remove logs for edits that are in progress. Run the following script as the hadoop user:

#!/bin/bash
# Enter Safemode so the namespace can be saved consistently.
hdfs dfsadmin -safemode enter
# Create a new FsImage snapshot so the old edit logs are no longer needed.
hdfs dfsadmin -saveNamespace
# Back up the finalized edit logs to Amazon S3, then remove them locally.
sudo su - root -c "hdfs dfs -put /mnt/namenode/current/*edits_[0-9]* s3://doc-example-bucket/backup-hdfs/"
sudo su - root -c "rm -f /mnt/namenode/current/*edits_[0-9]*"
sudo su - root -c "rm -f /mnt/namenode/current/seen*"
# Leave Safemode so normal operations resume.
hdfs dfsadmin -safemode leave

5.    Verify the amount of available disk space in /mnt. If the available space is more than 100 MB, check the status of Safemode again, and then turn Safemode off:

[hadoop@ip-xxx-xx-xx-xxx ~]$ hdfs dfsadmin -safemode get
Safe mode is ON
[hadoop@ip-xxx-xx-xx-xxx ~]$ hdfs dfsadmin -safemode leave
Safe mode is OFF

If /mnt still has less than 100 MB of available space, then do one or more of the following:

  • Remove more files as explained in the following section.
  • Increase the size of the /mnt volume.
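To check whether /mnt now clears the 100 MB default threshold, you can print the available space in megabytes (a sketch using GNU df, which ships with Amazon Linux on EMR master nodes):

```shell
# Print available space on /mnt in megabytes. Safemode is triggered when
# this value drops below dfs.namenode.resource.du.reserved (default 100 MB).
df -BM --output=avail /mnt | tail -1
```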

Remove more files

1.    Connect to the master node using SSH.

2.    Navigate to the /mnt directory:

cd /mnt

3.    Determine which folders are using the most disk space:

sudo du -hsx * | sort -rh | head -10

4.    Keep investigating until you find the source of the disk space issue. For example, if the var folder is using a large amount of disk space, check the largest subfolders in var:

cd var
sudo du -hsx * | sort -rh | head -10

5.    After you determine which file or folder is using the disk space, delete the files that you no longer need. Be sure that you delete only files that you no longer need. The compressed log files in /mnt/var/log/hadoop-hdfs/ and /mnt/var/log/hadoop-yarn/ that are already backed up to the Amazon S3 logging bucket are good candidates for deletion.
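For example, you can first list old compressed log files as a dry run before deleting anything. The 7-day cutoff here is an assumption; adjust it for your retention needs, and remove the files only after you confirm that they exist in your S3 logging bucket:

```shell
# Dry run: list compressed log files older than 7 days without deleting them.
# The 7-day cutoff is an example; tune it to your own retention policy.
sudo find /mnt/var/log/hadoop-hdfs /mnt/var/log/hadoop-yarn \
  -name "*.gz" -mtime +7 -ls

# After reviewing the list, replace -ls with -delete to remove the files.
```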

6.    After you delete the unnecessary files, check the status of Safemode again, and then turn Safemode off:

[hadoop@ip-xxx-xx-xx-xxx ~]$ hdfs dfsadmin -safemode get
Safe mode is ON
[hadoop@ip-xxx-xx-xx-xxx ~]$ hdfs dfsadmin -safemode leave
Safe mode is OFF

Related information

Hadoop documentation for HDFS Users Guide
