Why was my EMR cluster terminated?

My Amazon EMR cluster terminated unexpectedly.

Resolution

Review Amazon EMR provisioning logs stored in Amazon S3

Amazon EMR cluster logs are stored in an Amazon Simple Storage Service (Amazon S3) bucket that's specified at cluster launch. The logs are stored at s3://example-log-location/example-cluster-ID/node/example-EC2-instance-ID/.

Note: Replace example-log-location, example-cluster-ID, and example-EC2-instance-ID with your own values.
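For example, you can list the logs for a specific node with the AWS SDK for Python (Boto3). The following is a minimal sketch that assumes the placeholder bucket, cluster ID, and instance ID names used above:

import boto3

s3 = boto3.client("s3")

# List provisioning and daemon logs for one cluster node.
# Replace the bucket, cluster ID, and instance ID with your own values.
response = s3.list_objects_v2(
    Bucket="example-log-location",
    Prefix="example-cluster-ID/node/example-EC2-instance-ID/",
)

for obj in response.get("Contents", []):
    print(obj["Key"])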

The following is a list of common errors:

  • SHUTDOWN_STEP_FAILED (USER_ERROR)
  • NO_SLAVES_LEFT (SYSTEM_ERROR)
  • The master failed: Error occurred: <html>??<head><title>502 Bad Gateway</title></head>??<body>??<center><h1>502 Bad Gateway</h1></center>??<hr><center>nginx/1.16.1</center>??</body>??</html>??
  • KMS_ISSUE (USER_ERROR)
  • Terminated with errors, The master node was terminated by user.

Note: The preceding are the most common termination errors. EMR clusters might be terminated due to errors other than those listed. For more information, see Resource errors.

SHUTDOWN_STEP_FAILED (USER_ERROR)

When you submit a step job in your EMR cluster, you can specify the step failure behavior in the ActionOnFailure parameter. The EMR cluster terminates if you select TERMINATE_CLUSTER or TERMINATE_JOB_FLOW for the ActionOnFailure parameter. For more information, see StepConfig.

The following is an example error message from AWS CloudTrail:

{
  "severity": "ERROR",
  "actionOnFailure": "TERMINATE_JOB_FLOW",
  "stepId": "s-2I0GXXXXXXXX",
  "name": "Example Step",
  "clusterId": "j-2YJXXXXXXX",
  "state": "FAILED",
  "message": "Step s-2I0GXXXXXXXX (Example Step) in Amazon EMR cluster j-2YJXXXXXXX failed at 202X-1X-0X 0X:XX UTC."
}

To avoid this error, specify CONTINUE or CANCEL_AND_WAIT for the ActionOnFailure parameter when you submit the step job.
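The following is a minimal sketch using the AWS SDK for Python (Boto3) that submits a step with ActionOnFailure set to CONTINUE. The cluster ID, step name, JAR, and arguments are placeholders:

import boto3

emr = boto3.client("emr")

# Submit a step with ActionOnFailure set to CONTINUE so that a step
# failure doesn't terminate the cluster.
emr.add_job_flow_steps(
    JobFlowId="j-2YJXXXXXXX",
    Steps=[
        {
            "Name": "Example Step",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://example-bucket/example-app.py"],
            },
        }
    ],
)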

NO_SLAVES_LEFT (SYSTEM_ERROR)

This error occurs when:

  • Termination protection is turned off in the EMR cluster.
  • All core nodes exceed disk storage capacity as specified by a maximum utilization threshold in the yarn-site configuration classification. The default maximum utilization threshold is 90%.
  • The CORE instance is a Spot Instance, and the Spot Instance is TERMINATED_BY_SPOT_DUE_TO_NO_CAPACITY.

For information on Spot Instance termination, see Why did Amazon EC2 interrupt my Spot Instance?

For more information on the NO_SLAVES_LEFT error, see Cluster terminated with NO_SLAVE_LEFT and core nodes FAILED_BY_MASTER.

The following is an example error message from the instance-controller:

202X-0X-0X 1X:5X:5X,968 INFO Poller: InstanceJointStatusMap contains X entries (DD:5 R:3):
i-0e336xxxxxxxxxxxx 25d21h R  25d21h ig-22 ip-1x-2xx-xx-1xx.local.xxx.com  I:   52s Y:U    98s c: 0 am:    0 H:R  1.1%Yarn unhealthy Reason : 1/4 local-dirs usable space is below configured utilization percentage/no more usable space [ /mnt/yarn : used space above threshold of 90.0% ] ; 1/1 log-dirs usable space is below configured utilization percentage/no more usable space [ /var/log/hadoop-yarn/containers : used space above threshold of 90.0% ]

To resolve this error:

  • Turn on termination protection for the cluster (see the example after this list).
  • Free up or add disk capacity on the core nodes, for example by adding Amazon Elastic Block Store (Amazon EBS) volumes, so that disk utilization stays below the yarn-site threshold.
  • If appropriate for your workload, raise the maximum utilization threshold in the yarn-site configuration classification.
  • Use On-Demand Instances instead of Spot Instances for core nodes so that Spot capacity interruptions don't terminate the cluster.
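The following is a minimal sketch using the AWS SDK for Python (Boto3). It turns on termination protection for an existing cluster and shows a yarn-site configuration classification that raises the disk utilization threshold. The cluster ID and the 95% threshold value are example placeholders:

import boto3

emr = boto3.client("emr")

# Turn on termination protection for an existing cluster. The cluster
# ID is a placeholder.
emr.set_termination_protection(
    JobFlowIds=["j-2YJXXXXXXX"],
    TerminationProtected=True,
)

# Example yarn-site configuration classification that raises the disk
# utilization threshold from the default 90% to 95%. Pass this in the
# Configurations parameter when you launch a new cluster.
yarn_site_configuration = [
    {
        "Classification": "yarn-site",
        "Properties": {
            "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage": "95"
        },
    }
]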

502 Bad Gateway

The 502 Bad Gateway error occurs when Amazon EMR internal systems can't reach the primary node for a period of time. The cluster terminates if termination protection is turned off. If the instance-controller service is down, check the latest instance-controller logs and the instance state logs. If the instance-controller standard output shows that the service terminated because of insufficient memory, then the cluster's primary node is low on memory.

The following is an example error message from the instance state log:

# dump instance controller stdout
tail -n 100 /emr/instance-controller/log/instance-controller.out
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00007fb46c7c8000, 12288, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 12288 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /tmp/hs_err_pid16110.log

# whats memory usage look like
free -m
              total        used        free      shared  buff/cache   available
Mem:          15661       15346         147           0         167          69
Swap:             0           0           0

To avoid the preceding error, launch the EMR cluster with a larger instance type so that the primary node has more memory for your cluster's requirements. Also, clean up disk space to avoid memory outages in long-running clusters. For more information, see How do I troubleshoot primary node failure with error "502 Bad Gateway" or "504 Gateway Time-out" in Amazon EMR?
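For example, the following is a minimal sketch using the AWS SDK for Python (Boto3) that launches a cluster with a larger instance type for the primary node. The cluster name, release label, instance types, roles, and log location are placeholders:

import boto3

emr = boto3.client("emr")

# Launch a cluster with a larger primary (MASTER) instance type so the
# primary node has more memory. All names and values are placeholders.
emr.run_job_flow(
    Name="example-cluster",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.2xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole_V2",
    LogUri="s3://example-log-location/",
)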

KMS_ISSUE (USER_ERROR)

When you use an Amazon EMR security configuration to encrypt the Amazon EBS root device and storage volumes, the EMR service role must have the required AWS KMS permissions. If the necessary permissions are missing, then you receive the KMS_ISSUE error.

The following is an example error message from AWS CloudTrail:

The EMR Service Role must have the kms:GenerateDataKey* and kms:ReEncrypt* permission for the KMS key configuration when you enabled EBS encryption by default. You can retrieve that KMS key's ID by using the ec2:GetEbsDefaultKmsKeyId API.

To avoid the preceding error, make sure that security configurations that are used to encrypt the Amazon EBS root device and storage volumes have the necessary permissions. For these configurations, be sure that the Amazon EMR service role (EMR_DefaultRole_V2) has permissions to use the specified AWS Key Management Service (AWS KMS) key.
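For example, the following is a minimal sketch using the AWS SDK for Python (Boto3) that attaches an inline policy with the kms:GenerateDataKey* and kms:ReEncrypt* permissions to the EMR service role. The policy name and KMS key ARN are placeholders:

import json

import boto3

iam = boto3.client("iam")

# Inline policy that grants the KMS permissions named in the error
# message, scoped to the KMS key used for EBS encryption. The key ARN
# is a placeholder.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["kms:GenerateDataKey*", "kms:ReEncrypt*"],
            "Resource": "arn:aws:kms:us-east-1:111122223333:key/example-key-id",
        }
    ],
}

# Attach the inline policy to the EMR service role.
iam.put_role_policy(
    RoleName="EMR_DefaultRole_V2",
    PolicyName="example-emr-ebs-encryption-kms",
    PolicyDocument=json.dumps(policy),
)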

Terminated with errors, The master node was terminated by user

When the EMR cluster primary node stops for any reason, the cluster terminates with the error The master node was terminated by user.

The following is an example error message from AWS CloudTrail:

eventTime": "2023-01-18T08:07:02Z",
    "eventSource": "ec2.amazonaws.com",
    "eventName": "StopInstances",
    "awsRegion": "us-east-1",
    "sourceIPAddress": "52.xx.xx.xx",
    "userAgent": "AWS Internal",
    "requestParameters": {
        "instancesSet": {
            "items": [
                {
                    "instanceId": "i-xxf6c5xxxxxxxxxxx"
                }
            ]
        },
        "force": false
},

Because stopping the EMR primary or all core nodes leads to cluster termination, avoid stopping or rebooting cluster nodes.
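To find out who or what stopped the primary node, you can search AWS CloudTrail for StopInstances events. The following is a minimal sketch using the AWS SDK for Python (Boto3):

import boto3

cloudtrail = boto3.client("cloudtrail")

# Look up recent StopInstances events to identify who or what stopped
# the instance.
response = cloudtrail.lookup_events(
    LookupAttributes=[
        {"AttributeKey": "EventName", "AttributeValue": "StopInstances"}
    ],
    MaxResults=20,
)

for event in response["Events"]:
    print(event["EventTime"], event.get("Username"), event["EventName"])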

