How do I troubleshoot a failed Spark step in Amazon EMR?

Last updated: November 18, 2019

How do I troubleshoot a failed Apache Spark step in Amazon EMR?

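Short description

To troubleshoot a failed Spark step:

  • For Spark jobs submitted with --deploy-mode client: check the step logs to identify the root cause of the step failure.
  • For Spark jobs submitted with --deploy-mode cluster: check the step logs to identify the application ID. Then, check the application master logs to identify the root cause of the step failure.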

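Resolution

Client mode jobs

When a Spark job is deployed in client mode, the step logs provide the job parameters and the step error messages. These logs are archived to Amazon Simple Storage Service (Amazon S3). For example:

  • s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/steps/s-2M809TD67U2IA/controller.gz: This file contains the spark-submit command. Check this log to see the parameters for the job.
  • s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/steps/s-2M809TD67U2IA/stderr.gz: This file provides the driver logs. (When the Spark job runs in client mode, the Spark driver runs on the master node.)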
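
The two example paths share a common layout, which can be sketched as a small shell helper. This is only an illustration: step_log_prefix is a hypothetical function name, and the layout is inferred from the example paths above, so a cluster configured with a different log URI may not match it.

```shell
# Hypothetical helper: build the S3 prefix that holds the logs for one step.
# Assumed layout (taken from the example paths in this article):
#   s3://<bucket>/elasticmapreduce/<cluster-id>/steps/<step-id>/
step_log_prefix() {
  bucket="$1"
  cluster_id="$2"
  step_id="$3"
  printf 's3://%s/elasticmapreduce/%s/steps/%s/\n' "$bucket" "$cluster_id" "$step_id"
}

# Print the prefix for the example cluster and step used in this article.
step_log_prefix aws-logs-111111111111-us-east-1 j-35PUYZBQVIJNM s-2M809TD67U2IA
```

The printed prefix is the directory that contains controller.gz and stderr.gz for that step.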

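To find the root cause of the step failure:

Run the following commands to download the step logs to an Amazon Elastic Compute Cloud (Amazon EC2) instance, and then search for warnings and errors: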

#Download the step logs:
aws s3 sync s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/steps/s-2M809TD67U2IA/ s-2M809TD67U2IA/
#Open the step log folder:
cd s-2M809TD67U2IA/
#Uncompress the log file:
find . -type f -exec gunzip {} \;
#Get the yarn application id from the cluster mode log:
grep "Client: Application report for" * | tail -n 1
#Get the errors and warnings from the client mode log:
egrep "WARN|ERROR" *

For example, this file:

s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/containers/application_1572839353552_0008/container_1572839353552_0008_01_000001/stderr.gz

indicates a memory problem:

19/11/04 05:24:45 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: Executor memory 134217728 must be at least 471859200. Please increase executor memory using the --executor-memory option or spark.executor.memory in Spark configuration.
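
The error message reports both memory values in bytes, which are easier to compare after conversion. The following is a minimal sketch of that arithmetic; bytes_to_mib is a hypothetical helper name, not part of Spark or the AWS CLI.

```shell
# Hypothetical helper: convert a byte count to whole mebibytes
# (1 MiB = 1024 * 1024 bytes).
bytes_to_mib() {
  echo $(( $1 / 1024 / 1024 ))
}

bytes_to_mib 134217728   # executor memory the job requested; prints 128
bytes_to_mib 471859200   # minimum reported in the error message; prints 450
```

In this example the job requested 128 MiB of executor memory, which is below the 450 MiB minimum that the message reports.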

Use the information in the logs to resolve the error. For example, to resolve the memory problem, submit the job with more executor memory:

spark-submit --deploy-mode client --executor-memory 4g --class org.apache.spark.examples.SparkPi /usr/lib/spark/examples/jars/spark-examples.jar

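Cluster mode jobs

1.    Check the stderr step log to identify the ID of the application that is associated with the failed step. The step logs are archived to Amazon S3. For example, this log: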

s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/steps/s-2M809TD67U2IA/stderr.gz

identifies application_1572839353552_0008:

19/11/04 05:24:42 INFO Client: Application report for application_1572839353552_0008 (state: ACCEPTED)
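
Pulling the application ID out of a log line like this one can be scripted. The following is a rough sketch that assumes the ID always has the form application_<digits>_<digits>, as in the line above; extract_app_id is a hypothetical helper name, and it relies on the -o option that GNU and BSD grep provide.

```shell
# Hypothetical helper: read log text on stdin and print the last YARN
# application ID that appears in it.
# Assumes IDs look like application_<digits>_<digits>.
extract_app_id() {
  grep -oE 'application_[0-9]+_[0-9]+' | tail -n 1
}

# Feed the example log line from this article through the helper.
printf '%s\n' '19/11/04 05:24:42 INFO Client: Application report for application_1572839353552_0008 (state: ACCEPTED)' | extract_app_id
```

On an uncompressed step log, the same helper could be used as extract_app_id < stderr.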

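2.    Identify the application master logs. When a Spark job runs in cluster mode, the Spark driver runs inside the application master. The application master is the first container that runs when the Spark job executes. The following is an example list of Spark application logs. In this list, container_1572839353552_0008_01_000001 is the first container, which means that it is the application master.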

s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/containers/application_1572839353552_0008/container_1572839353552_0008_01_000001/stderr.gz

s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/containers/application_1572839353552_0008/container_1572839353552_0008_01_000001/stdout.gz

s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/containers/application_1572839353552_0008/container_1572839353552_0008_01_000002/stderr.gz

s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/containers/application_1572839353552_0008/container_1572839353552_0008_01_000002/stdout.gz
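
The relationship between the application ID and the first container in this listing can be sketched as follows. This is an illustration under stated assumptions: am_container_id is a hypothetical helper, it assumes the first application attempt (the _01 segment), and it assumes the container naming pattern seen in the example paths. If YARN retried the application, or if the cluster uses a different container ID format, the result will not match.

```shell
# Hypothetical helper: derive the ID of the first container (the application
# master) from a YARN application ID.
# Assumes attempt 01 and the naming pattern shown in the example listing.
am_container_id() {
  app_id="$1"
  # Strip the "application_" prefix and reuse the remaining "<timestamp>_<sequence>".
  printf 'container_%s_01_000001\n' "${app_id#application_}"
}

am_container_id application_1572839353552_0008
```

For the example application, this prints container_1572839353552_0008_01_000001, which matches the first container in the listing.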

3.    After you identify the application master logs, download the logs to an EC2 instance. Then, search for warnings and errors. For example:

#Download the Spark application logs:
aws s3 sync s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/containers/application_1572839353552_0008/ application_1572839353552_0008/
#Open the Spark application log folder:
cd application_1572839353552_0008/
#Uncompress the log file:
find . -type f -exec gunzip {} \;
#Search for warning and errors inside all the container logs. Then, open the container logs returned in the output of this command.
egrep -Ril "ERROR|WARN" . | xargs egrep "WARN|ERROR"
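
When an application has many containers, it can help to see which log files contain the most warnings and errors before opening any of them. The following is a minimal sketch of one way to do that; count_warn_error is a hypothetical helper name, and it assumes that the log file paths do not contain a colon.

```shell
# Hypothetical helper: list each log file under a directory together with the
# number of lines that contain WARN or ERROR, highest count first.
# Files with no matching lines are left out.
count_warn_error() {
  grep -rcE 'WARN|ERROR' "$1" | grep -v ':0$' | sort -t: -k2,2nr
}

# Example usage from inside the uncompressed application log folder:
#   count_warn_error .
```

Each output line has the form path:count, so the files at the top of the list are usually the first ones to inspect.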

For example, this log:

s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/containers/application_1572839353552_0008/container_1572839353552_0008_01_000001/stderr.gz

indicates a memory problem:

19/11/04 05:24:45 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: Executor memory 134217728 must be at least 471859200. Please increase executor memory using the --executor-memory option or spark.executor.memory in Spark configuration.

4.    Resolve the issue identified in the logs. For example, to fix the memory problem, submit the job with more executor memory:

spark-submit --deploy-mode cluster --executor-memory 4g --class org.apache.spark.examples.SparkPi /usr/lib/spark/examples/jars/spark-examples.jar 1000
