How can I resolve "Exit status: -100. Diagnostics: Container released on a *lost* node" errors in Amazon EMR?

Last updated: 2019-12-09

My Amazon EMR job fails with an error message like this:

ExecutorLostFailure (executor 12 exited caused by one of the running tasks) Reason: Container marked as failed: container_1572839353552_0008_01_000002 on host: ip-xx-xxx-xx-xx Exit status: -100. Diagnostics: Container released on a *lost* node

What does this mean, and how do I fix this error?

Short Description

This error commonly occurs when a core or task node is terminated because of high disk space utilization, or when a node becomes unresponsive due to prolonged high CPU utilization or low available memory. This article focuses on disk space issues.

When disk usage on a core or task node disk (for example, /mnt or /mnt1) exceeds 90%, the disk is considered unhealthy. If fewer than 25% of a node's disks are healthy, YARN ResourceManager gracefully decommissions the node. To resolve this problem, add more Amazon Elastic Block Store (Amazon EBS) capacity to the EMR cluster. You can do this when you launch a new cluster or by modifying a running cluster.
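
As a rough illustration of the thresholds described above, the following sketch reads `df -P` output and flags data mounts that are above 90% usage. The function name `check_mnt_health` and the `/mnt`, `/mnt1`, ... naming pattern are assumptions for this example. YARN's own disk health checker evaluates its configured local and log directories, so treat the result as an approximation.

```shell
#!/usr/bin/env bash
# Sketch: flag data mounts above a usage threshold (default 90%) and
# report what share of them is still healthy.
# Reads `df -P` style output on stdin, so it can be tried without a cluster.
check_mnt_health() {
  local threshold="${1:-90}"
  awk -v t="$threshold" '
    # Skip the header line; only consider /mnt, /mnt1, /mnt2, ...
    NR > 1 && $6 ~ /^\/mnt[0-9]*$/ {
      use = $5; sub(/%/, "", use)
      total++
      if (use + 0 > t) { printf "UNHEALTHY %s %s%%\n", $6, use }
      else             { healthy++ }
    }
    END {
      if (total > 0) printf "healthy_pct=%d\n", (healthy * 100) / total
    }'
}

# Usage on a core or task node:
#   df -P | check_mnt_health 90
```

A `healthy_pct` value below 25 corresponds to the condition under which the node is treated as unhealthy.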

Resolution

Determine the root cause

To determine the cause of the error, check the following Amazon CloudWatch metrics for the EMR cluster:

  • MR unhealthy nodes (MRUnhealthyNodes): If this metric shows an unhealthy node, the likely cause is a lack of disk space.
  • MR lost nodes (MRLostNodes): If this metric shows a lost node, the node was lost due to a hardware failure, or it couldn't be reached due to high CPU or high memory utilization.
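
If you prefer the command line to the CloudWatch console, the two metrics can be read with the AWS CLI. The sketch below assumes the AWS CLI is configured, that GNU `date` is available, and that you supply your own cluster ID; the function name `emr_node_metric` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: fetch the 5-minute maximum of an EMR node metric for the last hour.
# The cluster ID (for example j-XXXXXXXXXXXXX) is a placeholder you must supply.
emr_node_metric() {
  local cluster_id="$1" metric="$2"   # metric: MRUnhealthyNodes or MRLostNodes
  aws cloudwatch get-metric-statistics \
    --namespace AWS/ElasticMapReduce \
    --metric-name "$metric" \
    --dimensions "Name=JobFlowId,Value=$cluster_id" \
    --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --period 300 \
    --statistics Maximum
}

# Usage:
#   emr_node_metric j-XXXXXXXXXXXXX MRUnhealthyNodes
#   emr_node_metric j-XXXXXXXXXXXXX MRLostNodes
```

A nonzero maximum in either series points to the corresponding cause listed above.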

Use one of the following options to resolve lost node errors that are caused by a lack of disk space.

New clusters: Add more EBS capacity

To add more EBS capacity when you launch an EMR cluster, choose a larger Amazon Elastic Compute Cloud (Amazon EC2) instance type. Larger EC2 instances include more EBS storage capacity. For more information, see Default EBS Storage for Instances. (You can also modify the volume size or add more volumes when you create the cluster, regardless of the instance type that you choose.)
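
When you create the cluster with the AWS CLI, the volume size and count can be set in the `--instance-groups` option. The following sketch only builds that option value for a core instance group. The helper name `core_group_with_ebs`, the gp2 volume type, and the sizes in the usage line are illustrative assumptions, not recommendations.

```shell
#!/usr/bin/env bash
# Sketch: build an --instance-groups entry that requests extra EBS storage
# for the core instance group. All values are supplied by the caller.
core_group_with_ebs() {
  local instance_type="$1" count="$2" size_gb="$3" volumes="$4"
  printf 'InstanceGroupType=CORE,InstanceCount=%s,InstanceType=%s,' "$count" "$instance_type"
  printf 'EbsConfiguration={EbsBlockDeviceConfigs=[{VolumeSpecification={VolumeType=gp2,SizeInGB=%s},VolumesPerInstance=%s}]}\n' "$size_gb" "$volumes"
}

# Usage: pass the result to your own `aws emr create-cluster` command
# (the master group and all other required options are not shown here):
#   --instance-groups "$(core_group_with_ebs m5.xlarge 2 200 2)"
```

Quoting the value keeps the shell from interpreting the braces and brackets.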

New or running clusters: Add more core or task nodes

To spread data across more disks, choose more core or task nodes when you launch the cluster, or add core or task nodes to a running cluster.

Running clusters: Add more EBS volumes

To attach more EBS volumes to a running cluster:

1.    If larger EBS volumes don't resolve the problem, attach more EBS volumes to the core and task nodes.

2.    Connect to the node using SSH.

3.    Format and mount the attached volumes. Be sure to use the correct disk number (for example, /mnt1 or /mnt2 instead of /data).

4.    Add the path /mnt1/yarn inside the yarn.nodemanager.local-dirs property of /etc/hadoop/conf/yarn-site.xml. Example:

<property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/mnt/yarn,/mnt1/yarn</value>
</property>

5.    Restart the NodeManager service:

sudo stop hadoop-yarn-nodemanager
sudo start hadoop-yarn-nodemanager

6.    Enable termination protection.
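
The node-side steps above can be sketched as a small script. It runs in dry-run mode by default and only prints the commands. The device name in the usage line (`/dev/xvdf`), the XFS file system, and the `yarn:yarn` ownership of the new directory are assumptions; confirm the device with `lsblk` before formatting anything. The restart helper uses `systemctl` where it is available (newer, Amazon Linux 2 based EMR releases) and otherwise falls back to the `stop`/`start` commands shown in step 5.

```shell
#!/usr/bin/env bash
# Print each command unless DRY_RUN=0 is set explicitly.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi; }

# Format a newly attached volume, mount it, and create the YARN local dir.
prepare_yarn_volume() {
  local device="$1" mount_point="$2"
  run sudo mkfs -t xfs "$device"              # destroys any data on the device
  run sudo mkdir -p "$mount_point"
  run sudo mount "$device" "$mount_point"
  run sudo mkdir -p "$mount_point/yarn"
  run sudo chown yarn:yarn "$mount_point/yarn"
}

# Restart NodeManager after yarn-site.xml has been edited.
restart_nodemanager() {
  if command -v systemctl >/dev/null 2>&1; then
    run sudo systemctl restart hadoop-yarn-nodemanager
  else
    run sudo stop hadoop-yarn-nodemanager
    run sudo start hadoop-yarn-nodemanager
  fi
}

# Usage (review the printed commands first, then rerun with DRY_RUN=0):
#   prepare_yarn_volume /dev/xvdf /mnt1
#   restart_nodemanager
```

The script does not edit yarn-site.xml; add the new path to yarn.nodemanager.local-dirs by hand as shown in step 4.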

If you still have disk space issues, try the following:

  • Remove unnecessary files.
  • Increase the disk utilization threshold from 90% to 99%. To do this, set the yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage property in /etc/hadoop/conf/yarn-site.xml on all nodes (yarn-default.xml only documents the default value). Then, restart the hadoop-yarn-nodemanager service.
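
To decide which files are unnecessary, it can help to see which directories use the most space first. The following sketch lists the largest directories under a mount point. The helper name `largest_dirs`, the depth of 2, and the default of 10 results are arbitrary choices for this example, and the `--max-depth` option assumes GNU `du`.

```shell
#!/usr/bin/env bash
# Sketch: list the largest directories (sizes in KiB) under a mount point.
# -x keeps du on one file system so other mounts are not counted.
largest_dirs() {
  local path="$1" count="${2:-10}"
  du -x -k --max-depth=2 "$path" 2>/dev/null | sort -rn | head -n "$count"
}

# Usage on a node (run as root to include directories owned by other users):
#   largest_dirs /mnt 10
```

Review the results before deleting anything, because running applications may still be using files under the YARN directories.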