How can I resolve "Exit status: -100. Diagnostics: Container released on a lost node" errors in Amazon EMR?


My Amazon EMR job fails with an error message similar to the following:

ExecutorLostFailure (executor 12 exited caused by one of the running tasks) Reason: Container marked as failed: container_1572839353552_0008_01_000002 on host: ip-xx-xxx-xx-xx Exit status: -100. Diagnostics: Container released on a lost node

Short description

This error commonly occurs in either of the following situations:

A core or task node is terminated because of high disk space utilization.
A node becomes unresponsive due to prolonged high CPU utilization or low available memory.

This article focuses on disk space issues.

When disk usage on a core or task node disk (for example, /mnt or /mnt1) exceeds 90%, the disk is considered unhealthy. If fewer than 25% of a node's disks are healthy, YARN ResourceManager gracefully decommissions the node. To resolve this problem, add more Amazon Elastic Block Store (Amazon EBS) capacity to the EMR cluster. You can do this when you launch a new cluster or by modifying a running cluster.
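The two thresholds described above can be sketched as a small check. This is an illustration of the rule only, not YARN's actual health checker: the function names are hypothetical, the 90% and 25% values are the ones stated above, and shutil.disk_usage only approximates what the NodeManager measures.

```python
import shutil

MAX_DISK_UTILIZATION_PCT = 90.0   # a disk above this is considered unhealthy
MIN_HEALTHY_DISK_FRACTION = 0.25  # fraction of disks that must remain healthy


def disk_utilization_pct(path):
    """Return the used percentage of the filesystem that contains path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def node_is_healthy(utilization_by_disk):
    """Apply both thresholds to a {mount_point: used_pct} mapping."""
    if not utilization_by_disk:
        raise ValueError("at least one disk is required")
    healthy = [disk for disk, pct in utilization_by_disk.items()
               if pct <= MAX_DISK_UTILIZATION_PCT]
    fraction = len(healthy) / len(utilization_by_disk)
    return fraction >= MIN_HEALTHY_DISK_FRACTION
```

For example, node_is_healthy({"/mnt": disk_utilization_pct("/mnt"), "/mnt1": disk_utilization_pct("/mnt1")}) reports whether a node with those two mount points would still pass under these assumptions.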

Resolution

Determine the root cause

To determine the cause of the error, check the following Amazon CloudWatch metrics for the EMR cluster:

MR unhealthy nodes (MRUnhealthyNodes): If this metric shows an unhealthy node, the issue is caused by a lack of disk space.
MR lost nodes (MRLostNodes): If this metric shows a lost node, the node was lost because of a hardware failure, or it couldn't be reached because of high CPU or memory utilization.
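The same two metrics can also be read programmatically. The following is a minimal sketch that assumes boto3 is installed and AWS credentials are configured; the function names, the six-hour window, and the five-minute period are illustrative choices, not values from this article.

```python
from datetime import datetime, timedelta, timezone


def node_metric_request(cluster_id, metric_name, hours=6):
    """Build get_metric_statistics arguments for one EMR node metric."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/ElasticMapReduce",
        "MetricName": metric_name,  # "MRUnhealthyNodes" or "MRLostNodes"
        "Dimensions": [{"Name": "JobFlowId", "Value": cluster_id}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 300,  # seconds per datapoint (assumed)
        "Statistics": ["Maximum"],
    }


def max_node_metric(cluster_id, metric_name, hours=6):
    """Return the highest reported value, or 0 if there are no datapoints."""
    import boto3  # imported here so the request builder works without boto3

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_statistics(
        **node_metric_request(cluster_id, metric_name, hours))
    return max((point["Maximum"] for point in response["Datapoints"]), default=0)
```

Calling max_node_metric with your cluster ID and "MRUnhealthyNodes", then again with "MRLostNodes", shows which of the two cases applies: a nonzero value for the first points to disk space, and a nonzero value for the second points to a lost node.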

Use one of the following options to resolve lost node errors that are caused by a lack of disk space.

New clusters: Add more EBS capacity

To add more EBS capacity when you launch an EMR cluster, choose a larger Amazon Elastic Compute Cloud (Amazon EC2) instance type. Larger EC2 instances include more EBS storage capacity. For more information, see Default EBS Storage for Instances. (You can also modify the volume size or add more volumes when you create the cluster, regardless of the instance type that you choose.)
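If you create clusters with boto3, the volume size and count mentioned above go in the EbsConfiguration of an instance group. The sketch below only builds that structure; the helper names, the gp2 volume type, and the sample sizes are assumptions chosen for illustration.

```python
def core_group_with_ebs(instance_type, instance_count, size_gb, volumes_per_instance):
    """Build a core instance group entry for Instances["InstanceGroups"]."""
    return {
        "Name": "Core nodes",
        "InstanceRole": "CORE",
        "InstanceType": instance_type,
        "InstanceCount": instance_count,
        "EbsConfiguration": {
            "EbsBlockDeviceConfigs": [{
                # Volume type and size are illustrative assumptions.
                "VolumeSpecification": {"VolumeType": "gp2", "SizeInGB": size_gb},
                "VolumesPerInstance": volumes_per_instance,
            }],
        },
    }


def total_ebs_gb(group):
    """Return the total EBS capacity, in GB, requested by an instance group."""
    per_instance = sum(
        config["VolumeSpecification"]["SizeInGB"] * config.get("VolumesPerInstance", 1)
        for config in group["EbsConfiguration"]["EbsBlockDeviceConfigs"])
    return per_instance * group["InstanceCount"]
```

The returned dictionary would be passed as one element of Instances["InstanceGroups"] in a run_job_flow call, alongside the primary node's instance group.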

New or running clusters: Add more core or task nodes

Choose a larger number of core or task nodes when you launch a new cluster.
Add more core or task nodes to a running cluster.
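For a running cluster that uses instance groups, the resize can be sketched with boto3 as follows. This assumes boto3 is installed and credentials are configured; the function names are hypothetical, and clusters that use instance fleets need a different API call.

```python
def resize_request(cluster_id, instance_group_id, new_count):
    """Build modify_instance_groups arguments for a new target node count."""
    if new_count < 0:
        raise ValueError("new_count must not be negative")
    return {
        "ClusterId": cluster_id,
        "InstanceGroups": [{
            "InstanceGroupId": instance_group_id,
            "InstanceCount": new_count,
        }],
    }


def resize_instance_group(cluster_id, instance_group_id, new_count):
    """Ask Amazon EMR to resize one core or task instance group."""
    import boto3  # imported here so the request builder works without boto3

    emr = boto3.client("emr")
    emr.modify_instance_groups(
        **resize_request(cluster_id, instance_group_id, new_count))
```

The instance group ID can be looked up with the list_instance_groups call of the same client.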

Running clusters: Add more EBS volumes

Do the following to attach more EBS volumes to a running cluster:

1. If larger EBS volumes don't resolve the problem, attach more EBS volumes to the core and task nodes.

2. Connect to the node using SSH.

3. Format and mount the attached volumes. Be sure to use the correct disk number (for example, /mnt1 or /mnt2 instead of /data).

4. Add the path /mnt1/yarn to the yarn.nodemanager.local-dirs property in /etc/hadoop/conf/yarn-site.xml. The directory must exist and be writable by the yarn user. Example:

<property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/mnt/yarn,/mnt1/yarn</value>
</property>

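5. Restart the NodeManager service. The following commands apply to Amazon EMR release versions 4.x through 5.29.0. On 5.30.0 and later, run sudo systemctl restart hadoop-yarn-nodemanager instead: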

sudo stop hadoop-yarn-nodemanager
sudo start hadoop-yarn-nodemanager

6. Enable termination protection.
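The yarn-site.xml change in step 4 can be scripted when many nodes need the same edit. The following is a minimal sketch using only the Python standard library; the function name is hypothetical, and ElementTree drops XML comments and the XML declaration when it rewrites the text, so keep a backup of the original file.

```python
import xml.etree.ElementTree as ET

LOCAL_DIRS = "yarn.nodemanager.local-dirs"


def add_local_dir(yarn_site_xml, new_dir):
    """Return yarn-site.xml text with new_dir appended to the local-dirs list."""
    root = ET.fromstring(yarn_site_xml)
    for prop in root.findall("property"):
        if prop.findtext("name") == LOCAL_DIRS:
            value = prop.find("value")
            # Split the comma-separated list and skip empty entries.
            dirs = [d.strip() for d in (value.text or "").split(",") if d.strip()]
            if new_dir not in dirs:
                dirs.append(new_dir)
            value.text = ",".join(dirs)
            return ET.tostring(root, encoding="unicode")
    raise KeyError(LOCAL_DIRS + " is not set in this file")
```

Read /etc/hadoop/conf/yarn-site.xml, pass its contents with "/mnt1/yarn", write the result back with sufficient privileges, and then restart the NodeManager as described in step 5.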

If you still have disk space issues, try the following:

Remove unnecessary files.
Increase the disk utilization threshold from 90% to 99%. To do this, set the yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage property in /etc/hadoop/conf/yarn-site.xml on all nodes (yarn-default.xml only defines the default value). Then, restart the hadoop-yarn-nodemanager service.
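As an alternative to editing files on every node, the same property can be supplied as a configuration object when you create a cluster. This sketch assumes the yarn-site configuration classification; the function name is hypothetical, and the 99 value is the one suggested above.

```python
PROPERTY = ("yarn.nodemanager.disk-health-checker."
            "max-disk-utilization-per-disk-percentage")


def disk_threshold_configuration(max_pct):
    """Build an EMR configuration list that overrides the disk threshold."""
    if not 0 <= max_pct <= 100:
        raise ValueError("max_pct must be between 0 and 100")
    return [{
        "Classification": "yarn-site",
        # EMR configuration properties are passed as strings.
        "Properties": {PROPERTY: str(max_pct)},
    }]
```

The returned list can be passed as the Configurations argument of run_job_flow, or serialized with json.dumps for use in the console or the AWS CLI.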

Related information

Cluster terminates with NO_SLAVE_LEFT and core nodes FAILED_BY_MASTER

Why is the core node in my Amazon EMR cluster running out of disk space?
