Why was my EMR cluster terminated?

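My Amazon EMR cluster terminated unexpectedly.

Resolution

Review the Amazon EMR provisioning logs stored in Amazon S3

Amazon EMR cluster logs are stored in the Amazon Simple Storage Service (Amazon S3) bucket that you specified when you launched the cluster. The logs are stored at s3://example-log-location/example-cluster-ID/node/example-EC2-instance-ID/.

**Note:** Replace example-log-location, example-cluster-ID, and example-EC2-instance-ID with the values for your system.

The following errors are common: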
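As a small illustration of the log path layout described above, the following sketch assembles the prefix from its three parts. The function name `emr_node_log_prefix` is hypothetical, and the arguments are the same placeholders used in this article.

```python
def emr_node_log_prefix(log_location: str, cluster_id: str, instance_id: str) -> str:
    """Return the S3 prefix that holds the logs for one EC2 instance of a cluster."""
    # Accept the log location with or without the s3:// scheme and trailing slash.
    base = log_location.rstrip("/")
    if not base.startswith("s3://"):
        base = "s3://" + base
    return f"{base}/{cluster_id}/node/{instance_id}/"


print(emr_node_log_prefix("example-log-location", "example-cluster-ID", "example-EC2-instance-ID"))
```

You can pass the resulting prefix to whichever S3 listing tool you already use.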

SHUTDOWN_STEP_FAILED (USER_ERROR)

NO_SLAVES_LEFT (SYSTEM_ERROR)

The master failed: Error occurred: <html><head><title>502 Bad Gateway</title></head><body><center><h1>502 Bad Gateway</h1></center><hr><center>nginx/1.16.1</center></body></html>

KMS_ISSUE (USER_ERROR)

Terminated with errors, The master node was terminated by user.

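**Note:** The preceding errors are the most common termination errors. EMR clusters can also terminate because of errors that aren't listed here. For more information, see Resource errors.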

SHUTDOWN_STEP_FAILED (USER_ERROR)

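When you submit a step job to an EMR cluster, you can specify the step failure behavior in the ActionOnFailure parameter. If you select TERMINATE_CLUSTER or TERMINATE_JOB_FLOW for the ActionOnFailure parameter, then the EMR cluster terminates when the step fails. For more information, see StepConfig.

The following is an example error message from AWS CloudTrail: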

{
  "severity": "ERROR",
  "actionOnFailure": "TERMINATE_JOB_FLOW",
  "stepId": "s-2I0GXXXXXXXX",
  "name": "Example Step",
  "clusterId": "j-2YJXXXXXXX",
  "state": "FAILED",
  "message": "Step s-2I0GXXXXXXXX (Example Step) in Amazon EMR cluster j-2YJXXXXXXX failed at 202X-1X-0X 0X:XX UTC."
}

To avoid this error, use the CONTINUE or CANCEL_AND_WAIT option in the ActionOnFailure parameter when you submit the step job.
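The following sketch shows one way to check step definitions before you submit them. It assumes that each step is a dictionary with the Name and ActionOnFailure fields from StepConfig. The helper name `steps_that_terminate_cluster`, the step names, and the S3 path are hypothetical and used only for illustration.

```python
# ActionOnFailure values that this article identifies as terminating the cluster.
TERMINATING_ACTIONS = {"TERMINATE_CLUSTER", "TERMINATE_JOB_FLOW"}


def steps_that_terminate_cluster(steps):
    """Return the names of steps whose failure would terminate the cluster."""
    return [s["Name"] for s in steps if s.get("ActionOnFailure") in TERMINATING_ACTIONS]


# Hypothetical step definitions for illustration only.
example_steps = [
    {
        "Name": "Example Step",
        "ActionOnFailure": "TERMINATE_JOB_FLOW",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://example-bucket/job.py"],
        },
    },
    {
        "Name": "Safer Step",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://example-bucket/job.py"],
        },
    },
]

print(steps_that_terminate_cluster(example_steps))
```

A check like this can run in a deployment script so that a terminating failure action is a deliberate choice rather than a leftover setting.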

NO_SLAVES_LEFT (SYSTEM_ERROR)

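This error occurs in the following situations:

Termination protection is turned off for the EMR cluster.
All core nodes exceed the disk storage capacity set by the maximum utilization threshold in the yarn-site configuration classification. The default maximum utilization threshold is 90%.
The core instances are Spot Instances, and the Spot Instances are in the TERMINATED_BY_SPOT_DUE_TO_NO_CAPACITY state.

For information about Spot Instance termination, see Why did Amazon EC2 interrupt my Spot Instance?

For more information about the NO_SLAVES_LEFT error, see Cluster terminates with NO_SLAVE_LEFT and core nodes FAILED_BY_MASTER.

The following is an example error message from the instance-controller: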

202X-0X-0X 1X:5X:5X,968 INFO Poller: InstanceJointStatusMap contains X entries (DD:5 R:3):
i-0e336xxxxxxxxxxxx 25d21h R  25d21h ig-22 ip-1x-2xx-xx-1xx.local.xxx.com  I:   52s Y:U    98s c: 0 am:    0 H:R  1.1%Yarn unhealthy Reason : 1/4 local-dirs usable space is below configured utilization percentage/no more usable space [ /mnt/yarn : used space above threshold of 90.0% ] ; 1/1 log-dirs usable space is below configured utilization percentage/no more usable space [ /var/log/hadoop-yarn/containers : used space above threshold of 90.0% ]
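To pick out the affected directories from a message like the preceding one, you can use a short parser. The following is a minimal sketch that assumes the bracketed "used space above threshold" format shown in this example. The function name `dirs_over_threshold` is hypothetical, and other YARN versions might format the message differently.

```python
import re

# Matches segments such as "[ /mnt/yarn : used space above threshold of 90.0% ]".
_DIR_PATTERN = re.compile(
    r"\[\s*(\S+)\s*:\s*used space above threshold of ([\d.]+)%\s*\]"
)


def dirs_over_threshold(log_line):
    """Return (directory, threshold_percent) pairs reported as over the threshold."""
    return [(path, float(pct)) for path, pct in _DIR_PATTERN.findall(log_line)]


sample = (
    "Yarn unhealthy Reason : 1/4 local-dirs usable space is below configured "
    "utilization percentage/no more usable space "
    "[ /mnt/yarn : used space above threshold of 90.0% ] ; "
    "1/1 log-dirs usable space is below configured utilization percentage/"
    "no more usable space "
    "[ /var/log/hadoop-yarn/containers : used space above threshold of 90.0% ]"
)
for directory, threshold in dirs_over_threshold(sample):
    print(directory, threshold)
```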

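To troubleshoot this error, take the following actions:

Keep termination protection turned on for your cluster. For more information, see Termination protection and unhealthy YARN nodes.
Use Amazon EMR scaling policies (automatic scaling and managed scaling) to scale core nodes based on your requirements. For more information, see Use cluster scaling.
Add more Amazon Elastic Block Store (Amazon EBS) capacity to your cluster. For more information, see How do I resolve "Exit status: -100. Diagnostics: Container released on a *lost* node" errors in Amazon EMR?
Create an alarm for the MRUnhealthyNodes Amazon CloudWatch metric. You can set up a notification for this alarm to warn you about unhealthy nodes before the 45-minute timeout is reached. For more information, see Create a CloudWatch alarm based on a static threshold.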
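As a sketch of the MRUnhealthyNodes alarm described in the last item, the following function builds the parameters for a CloudWatch PutMetricAlarm request. It assumes the AWS/ElasticMapReduce namespace and the JobFlowId dimension that Amazon EMR uses for cluster metrics. The function name, alarm name, threshold, period, and SNS topic ARN are illustrative choices, not values from this article.

```python
def unhealthy_node_alarm_params(cluster_id, sns_topic_arn, period_seconds=300):
    """Build PutMetricAlarm parameters that fire when any node is unhealthy."""
    return {
        "AlarmName": f"emr-{cluster_id}-unhealthy-nodes",  # hypothetical naming scheme
        "Namespace": "AWS/ElasticMapReduce",
        "MetricName": "MRUnhealthyNodes",
        "Dimensions": [{"Name": "JobFlowId", "Value": cluster_id}],
        "Statistic": "Maximum",
        "Period": period_seconds,
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


params = unhealthy_node_alarm_params(
    "j-2YJXXXXXXX", "arn:aws:sns:us-east-1:111122223333:example-topic"
)
print(params["MetricName"], params["Dimensions"])
```

If you use boto3, you could pass these parameters to `put_metric_alarm(**params)` on a CloudWatch client. The block itself makes no AWS calls.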

502 Bad Gateway

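A 502 Bad Gateway error occurs when the Amazon EMR internal systems can't reach the primary node for a period of time. If termination protection is turned off, then Amazon EMR terminates the cluster. When the instance-controller service is down, check the latest instance-controller logs and the instance state logs. In the following example, the instance-controller standard output shows that the service stopped because of insufficient memory. This means that the cluster's primary node ran out of memory.

The following is an example error message from the instance state logs: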

# dump instance controller stdout
tail -n 100 /emr/instance-controller/log/instance-controller.out
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00007fb46c7c8000, 12288, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 12288 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /tmp/hs_err_pid16110.log

# whats memory usage look like
free -m
              total        used        free      shared  buff/cache   available
Mem:          15661       15346         147           0         167          69
Swap:             0           0           0

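To avoid the preceding error, launch the EMR cluster with a larger instance type so that more memory is available to meet the cluster's requirements. Also, clean up disk space to avoid memory outages in long-running clusters. For more information, see How do I troubleshoot primary node failure with the error "502 Bad Gateway" or "504 Gateway Time-out" in Amazon EMR?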
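To spot memory pressure like the preceding example before the instance-controller fails, you can compute the available memory from the `free -m` output. The following is a minimal sketch. It assumes the column layout shown in the example, where "available" is the last column of the Mem: line. The function name `available_memory_fraction` is hypothetical.

```python
def available_memory_fraction(free_output):
    """Return available memory divided by total memory from `free -m` output."""
    for line in free_output.splitlines():
        if line.startswith("Mem:"):
            fields = line.split()
            # fields: Mem:, total, used, free, shared, buff/cache, available
            total, available = int(fields[1]), int(fields[6])
            return available / total
    raise ValueError("no 'Mem:' line found in the free output")


sample_free = """\
              total        used        free      shared  buff/cache   available
Mem:          15661       15346         147           0         167          69
Swap:             0           0           0
"""
fraction = available_memory_fraction(sample_free)
print(f"{fraction:.1%} of memory is available")
```

For the example output in this article, less than 1 percent of memory is available, which is consistent with the allocation failure in the instance-controller log.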

KMS_ISSUE (USER_ERROR)

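When you use an Amazon EMR security configuration to encrypt the Amazon EBS root device and storage volumes, the Amazon EMR service role must have the appropriate permissions. If the required permissions are missing, then you receive the KMS_ISSUE error.

The following is an example error message from AWS CloudTrail: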

The EMR Service Role must have the kms:GenerateDataKey* and kms:ReEncrypt* permission for the KMS key configuration when you enabled EBS encryption by default. You can retrieve that KMS key's ID by using the ec2:GetEbsDefaultKmsKeyId API.

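To avoid the preceding error, make sure that the security configuration that you use to encrypt the Amazon EBS root device and storage volumes has the required permissions. For these configurations, make sure that the Amazon EMR service role (EMR_DefaultRole_V2) has permission to use the specified AWS Key Management Service (AWS KMS) key.

Terminated with the error "The master node was terminated by user"

When the EMR cluster's primary node stops for any reason, the cluster terminates with the error The master node was terminated by user.

The following is an example error message from AWS CloudTrail: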
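The following sketch checks whether a set of IAM policy statements grants the two action patterns named in the error message, kms:GenerateDataKey* and kms:ReEncrypt*. It is a simplified check that looks only at the Action lists of Allow statements. It ignores Deny statements, conditions, resources, and the KMS key policy, so it can't replace a real policy evaluation. The function name `missing_kms_actions` and the sample statement are hypothetical.

```python
from fnmatch import fnmatchcase

# Action patterns named in the KMS_ISSUE error message.
REQUIRED_KMS_ACTIONS = ("kms:GenerateDataKey*", "kms:ReEncrypt*")


def missing_kms_actions(policy_statements):
    """Return the required action patterns that no Allow statement covers."""
    allowed = []
    for stmt in policy_statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        # Lowercase both sides so that the comparison ignores case.
        allowed.extend(a.lower() for a in actions)
    return [
        required
        for required in REQUIRED_KMS_ACTIONS
        if not any(fnmatchcase(required.lower(), pattern) for pattern in allowed)
    ]


# Hypothetical statement that grants only one of the two required patterns.
statements = [
    {"Effect": "Allow", "Action": ["kms:Decrypt", "kms:GenerateDataKey*"], "Resource": "*"}
]
print(missing_kms_actions(statements))
```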

    "eventTime": "2023-01-18T08:07:02Z",
    "eventSource": "ec2.amazonaws.com",
    "eventName": "StopInstances",
    "awsRegion": "us-east-1",
    "sourceIPAddress": "52.xx.xx.xx",
    "userAgent": "AWS Internal",
    "requestParameters": {
        "instancesSet": {
            "items": [
                {
                    "instanceId": "i-xxf6c5xxxxxxxxxxx"
                }
            ]
        },
        "force": false
    },

Stopping the EMR primary node or all core nodes terminates the cluster, so avoid stopping or restarting cluster nodes.
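To find out whether a StopInstances call targeted a cluster node, you can filter CloudTrail event records by the fields shown in the preceding example. The following is a minimal sketch that assumes each record is already parsed into a dictionary with that structure. The function name `stop_events_for_instance` is hypothetical.

```python
def stop_events_for_instance(events, instance_id):
    """Return the EC2 StopInstances events that include the given instance ID."""
    matches = []
    for event in events:
        if event.get("eventSource") != "ec2.amazonaws.com":
            continue
        if event.get("eventName") != "StopInstances":
            continue
        params = event.get("requestParameters") or {}
        items = params.get("instancesSet", {}).get("items", [])
        if any(item.get("instanceId") == instance_id for item in items):
            matches.append(event)
    return matches


# Record shaped like the example in this article.
sample_events = [
    {
        "eventTime": "2023-01-18T08:07:02Z",
        "eventSource": "ec2.amazonaws.com",
        "eventName": "StopInstances",
        "requestParameters": {
            "instancesSet": {"items": [{"instanceId": "i-xxf6c5xxxxxxxxxxx"}]},
            "force": False,
        },
    }
]
for event in stop_events_for_instance(sample_events, "i-xxf6c5xxxxxxxxxxx"):
    print(event["eventTime"], event["eventName"])
```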

AWS OFFICIAL, updated 1 year ago
