How do I troubleshoot a failed Apache Spark step in Amazon EMR?

I want to troubleshoot a failed Apache Spark step in Amazon EMR.

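Short description

To troubleshoot a failed Spark step:

For Spark jobs submitted with --deploy-mode client: Check the step logs to identify the root cause of the step failure.
For Spark jobs submitted with --deploy-mode cluster: Check the step logs to identify the application ID. Then, check the application master logs to identify the root cause of the step failure.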

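Resolution

Client mode jobs

When you deploy a Spark job in client mode, the step logs provide the job parameters and the step error messages. These logs are archived to Amazon Simple Storage Service (Amazon S3). For example:

**s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/steps/s-2M809TD67U2IA/controller.gz:** This file contains the spark-submit command. Check this log to see the parameters for the job.
**s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/steps/s-2M809TD67U2IA/stderr.gz:** This file provides the driver logs. (When a Spark job runs in client mode, the Spark driver runs on the master node.)

To find the root cause of the step failure, run the following commands to download the step logs to an Amazon Elastic Compute Cloud (Amazon EC2) instance. Then, search for warnings and errors: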

# Download the step logs:
aws s3 sync s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/steps/s-2M809TD67U2IA/ s-2M809TD67U2IA/
# Open the step log folder:
cd s-2M809TD67U2IA/
# Uncompress the log files:
find . -type f -exec gunzip {} \;
# Get the YARN application ID from the cluster mode log:
grep "Client: Application report for" * | tail -n 1
# Get the errors and warnings from the client mode log
# (grep -E replaces the deprecated egrep alias):
grep -E "WARN|ERROR" *

For example, this file:

s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/containers/application_1572839353552_0008/container_1572839353552_0008_01_000001/stderr.gz

indicates a memory problem:

19/11/04 05:24:45 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: Executor memory 134217728 must be at least 471859200. Please increase executor memory using the --executor-memory option or spark.executor.memory in Spark configuration.
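The byte counts in this message are easier to compare once they are converted to mebibytes. The following is a minimal sketch, assuming a Bash shell; the `to_mib` helper is a hypothetical name introduced here for illustration and is not part of Spark or the AWS CLI.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert a byte count to whole mebibytes (MiB).
to_mib() {
  echo $(( $1 / 1024 / 1024 ))
}

# Values copied from the error message above.
echo "configured executor memory: $(to_mib 134217728) MiB"   # 128 MiB
echo "minimum accepted by Spark:  $(to_mib 471859200) MiB"   # 450 MiB
```

Read this way, the job requested 128 MiB per executor, which is below the 450 MiB minimum reported in the message.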

Use the information in the logs to resolve the error.

For example, to resolve the memory problem, submit the job with more executor memory:

spark-submit --deploy-mode client --executor-memory 4g --class org.apache.spark.examples.SparkPi /usr/lib/spark/examples/jars/spark-examples.jar

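Cluster mode jobs

1. Check the stderr step log to identify the ID of the application that's associated with the failed step. The step logs are archived to Amazon S3. For example, this log: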

s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/steps/s-2M809TD67U2IA/stderr.gz

identifies application_1572839353552_0008:

19/11/04 05:24:42 INFO Client: Application report for application_1572839353552_0008 (state: ACCEPTED)
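Pulling the application ID out of the log can be scripted. The following is a rough sketch, assuming a Bash shell and an already uncompressed stderr step log; `extract_app_id` is a hypothetical helper name, and the regular expression assumes IDs of the form shown in the sample line above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the YARN application ID found in the last
# "Application report" line of an uncompressed step stderr log.
extract_app_id() {
  grep "Client: Application report for" "$1" \
    | tail -n 1 \
    | grep -oE 'application_[0-9]+_[0-9]+'
}

# Demo: write the sample log line from this article to a temporary file.
sample_log="$(mktemp)"
echo '19/11/04 05:24:42 INFO Client: Application report for application_1572839353552_0008 (state: ACCEPTED)' > "$sample_log"
extract_app_id "$sample_log"   # prints application_1572839353552_0008
rm -f "$sample_log"
```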

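2. Identify the application master logs. When a Spark job runs in cluster mode, the Spark driver runs inside the application master. The application master is the first container that runs when the Spark job executes. The following is an example list of Spark application logs.

In this list, container_1572839353552_0008_01_000001 is the first container, which means that it's the application master.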

s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/containers/application_1572839353552_0008/container_1572839353552_0008_01_000001/stderr.gz

s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/containers/application_1572839353552_0008/container_1572839353552_0008_01_000001/stdout.gz

s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/containers/application_1572839353552_0008/container_1572839353552_0008_01_000002/stderr.gz

s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/containers/application_1572839353552_0008/container_1572839353552_0008_01_000002/stdout.gz
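When an application has many containers, filtering the list for the first container can be quicker than reading it by eye. The following is a sketch under stated assumptions: a Bash shell, container IDs in the `container_<timestamp>_<app>_<attempt>_<number>` form shown above, and a hypothetical helper named `filter_app_master_logs`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read log paths on stdin and keep only those that
# belong to the first container of an application attempt (suffix _000001).
filter_app_master_logs() {
  grep -E 'container_[0-9]+_[0-9]+_[0-9]+_000001/'
}

# Demo with two of the paths from the list above (bucket prefix omitted).
printf '%s\n' \
  'containers/application_1572839353552_0008/container_1572839353552_0008_01_000001/stderr.gz' \
  'containers/application_1572839353552_0008/container_1572839353552_0008_01_000002/stderr.gz' \
  | filter_app_master_logs
```

Only the path that ends in container_1572839353552_0008_01_000001/stderr.gz is printed. The same filter could presumably be applied to the output of `aws s3 ls --recursive` on the application's containers prefix.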

3. After you identify the application master logs, download the logs to an Amazon EC2 instance. Then, search for warnings and errors. For example:

# Download the Spark application logs:
aws s3 sync s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/containers/application_1572839353552_0008/ application_1572839353552_0008/
# Open the Spark application log folder:
cd application_1572839353552_0008/
# Uncompress the log files:
find . -type f -exec gunzip {} \;
# Search for warnings and errors inside all the container logs. Then, open the container logs returned in the output of this command.
# (grep -E replaces the deprecated egrep alias.)
grep -ERil "ERROR|WARN" . | xargs grep -E "WARN|ERROR"

For example, this log:

s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/containers/application_1572839353552_0008/container_1572839353552_0008_01_000001/stderr.gz

indicates a memory problem:

19/11/04 05:24:45 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: Executor memory 134217728 must be at least 471859200. Please increase executor memory using the --executor-memory option or spark.executor.memory in Spark configuration.
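When an application produces many container logs, a per-file count of matching lines can suggest which log to open first. The following is a minimal sketch, assuming a Bash shell with a grep that supports recursive search (`-r`) and logs that are already uncompressed; `count_problems` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: for every file under a directory, print
# "<path>:<count>" where count is the number of lines containing
# ERROR or WARN. Files with no matching lines are skipped.
count_problems() {
  grep -rcE 'ERROR|WARN' "$1" | grep -v ':0$'
}

# Demo: two fake container folders built from log lines quoted in this article.
demo_dir="$(mktemp -d)"
mkdir -p "$demo_dir/container_01" "$demo_dir/container_02"
echo '19/11/04 05:24:45 ERROR SparkContext: Error initializing SparkContext.' > "$demo_dir/container_01/stderr"
echo '19/11/04 05:24:42 INFO Client: Application report for application_1572839353552_0008 (state: ACCEPTED)' > "$demo_dir/container_02/stderr"
count_problems "$demo_dir"   # lists only container_01/stderr, with a count of 1
rm -rf "$demo_dir"
```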

4. Resolve the issue identified in the logs. For example, to fix the memory problem, submit the job with more executor memory:

spark-submit --deploy-mode cluster --executor-memory 4g --class org.apache.spark.examples.SparkPi /usr/lib/spark/examples/jars/spark-examples.jar 1000
