How can I resolve node label and YARN ResourceManager failures in Amazon EMR?
Last updated: 2022-04-26
I enabled node labels on an Amazon EMR cluster. Then, YARN ResourceManager failed.
This issue affects Amazon EMR release versions 5.19.0 through 5.21.0. In these versions, Amazon EMR stores node label files in HDFS under the following directory and file names:
- DEFAULT_DIR_NAME = "node-labels"
- MIRROR_FILENAME = "nodelabel.mirror"
- EDITLOG_FILENAME = "nodelabel.editlog"
The storage location is set in yarn-site.xml on all nodes: yarn.node-labels.fs-store.root-dir: '/apps/yarn/nodelabels'. The issue occurs when a resize operation removes every node that holds the files' blocks, which corrupts the files. ResourceManager then restarts and gets stuck in a restart loop, and CommonNodeLabelsManager throws an exception.
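To confirm which directory a node is configured to use, a small helper like the following can print the property from yarn-site.xml. This is a minimal sketch: the function name show_node_label_root_dir is hypothetical, the default path /etc/hadoop/conf/yarn-site.xml is an assumption about where the file lives on the node, and the helper assumes the property's name and value sit on consecutive lines.

```shell
# Print the node label store setting from a yarn-site.xml file.
# Assumption: the <name> and <value> elements are on consecutive lines.
show_node_label_root_dir() {
  # Assumed default location of yarn-site.xml; pass another path as $1.
  local conf="${1:-/etc/hadoop/conf/yarn-site.xml}"
  # -F matches the property name literally; -A 1 also prints the next line.
  grep -F -A 1 'yarn.node-labels.fs-store.root-dir' "$conf"
}
```

Call show_node_label_root_dir with no arguments to use the assumed default path, or pass the path to a different yarn-site.xml file.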
To find the exception, search for "org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager" in /var/log/hadoop-yarn/yarn-yarn-resourcemanager-*.log.
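The search described above can be sketched with grep. The function name find_node_label_errors is hypothetical; the default log path and the class name are the ones given in this article. The helper assumes that the log path contains no spaces.

```shell
# Search ResourceManager logs for CommonNodeLabelsManager entries.
find_node_label_errors() {
  # Default to the log location named in this article; pass another glob as $1.
  local log_glob="${1:-/var/log/hadoop-yarn/yarn-yarn-resourcemanager-*.log}"
  # -H prints the file name and -n the line number of each match.
  # $log_glob is intentionally unquoted so that the shell expands the wildcard.
  grep -Hn -F 'org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager' $log_glob
}
```

Run find_node_label_errors on the master node. Each matching line is prefixed with its log file name and line number, which makes it easier to find the surrounding stack trace.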
To resolve this error, delete the node label files. Then, restart ResourceManager to recreate the files.
1. Check file system health and locate the blocks:
hdfs fsck /apps/yarn/nodelabels/ -locations -blocks -files
2. Remove the files:
hdfs dfs -rm -skipTrash /apps/yarn/nodelabels/*
3. Restart ResourceManager:
sudo stop hadoop-yarn-resourcemanager
sudo start hadoop-yarn-resourcemanager
4. When ResourceManager restarts, it recreates the node label files, which resolves the restart loop. However, you can't submit YARN applications until you manually add the node label entries:
yarn rmadmin -addToClusterNodeLabels "CORE(exclusive=false)"
5. List the labels to confirm that ResourceManager recreated them:
yarn cluster --list-node-labels
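The steps above can be collected into one script, sketched below. The helper names run and recover_node_labels and the DRY_RUN variable are hypothetical; the commands themselves are the ones from the steps. Set DRY_RUN=1 to print the commands without running them. The sketch doesn't wait or retry, so the rmadmin call might fail if ResourceManager hasn't finished starting.

```shell
#!/usr/bin/env bash
# Sketch of the recovery steps. Set DRY_RUN=1 to print commands only.
NODE_LABEL_DIR="/apps/yarn/nodelabels"  # value of yarn.node-labels.fs-store.root-dir

# run (hypothetical helper): print the command in dry-run mode, else execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

recover_node_labels() {
  # Step 1: report file system health and block locations (informational only).
  run hdfs fsck "$NODE_LABEL_DIR/" -locations -blocks -files
  # Step 2: remove the files. The glob is quoted so that HDFS, not the
  # local shell, expands it.
  run hdfs dfs -rm -skipTrash "$NODE_LABEL_DIR/*" || return 1
  # Step 3: restart ResourceManager. A failed stop is ignored because the
  # service might not be running at that moment.
  run sudo stop hadoop-yarn-resourcemanager
  run sudo start hadoop-yarn-resourcemanager || return 1
  # Step 4: add the node label entry again.
  run yarn rmadmin -addToClusterNodeLabels "CORE(exclusive=false)" || return 1
  # Step 5: list the labels to confirm the result.
  run yarn cluster --list-node-labels
}
```

To preview the sequence, source the script and run DRY_RUN=1 recover_node_labels. Review the output before you run the function without DRY_RUN, because step 2 permanently deletes the node label files.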