How do I troubleshoot a failed Spark step in Amazon EMR?

Last updated: 2019-11-18

How do I troubleshoot a failed Apache Spark step in Amazon EMR?

Short Description

To troubleshoot failed Spark steps:

  • For Spark jobs submitted with --deploy-mode client: Check the step logs to identify the root cause of the step failure.
  • For Spark jobs submitted with --deploy-mode cluster: Check the step logs to identify the application ID. Then, check the application master logs to identify the root cause of the step failure.
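The deploy mode decides which logs to read first. As a minimal sketch, the helper below classifies a spark-submit command string by its --deploy-mode flag. The name deploy_mode is hypothetical and is not part of Spark or the AWS CLI. The sketch assumes that the mode is set only through that flag, and it treats a missing flag as client mode, which is the spark-submit default.

```shell
# deploy_mode: hypothetical helper that prints "cluster" or "client"
# for a given spark-submit command line.
deploy_mode() {
  case "$1" in
    *"--deploy-mode cluster"*|*"--deploy-mode=cluster"*) echo "cluster" ;;
    *) echo "client" ;;  # spark-submit runs in client mode unless told otherwise
  esac
}

client_cmd="spark-submit --deploy-mode client --executor-memory 4g --class org.apache.spark.examples.SparkPi /usr/lib/spark/examples/jars/spark-examples.jar"
cluster_cmd="spark-submit --deploy-mode cluster --executor-memory 4g --class org.apache.spark.examples.SparkPi /usr/lib/spark/examples/jars/spark-examples.jar 1000"

deploy_mode "$client_cmd"
deploy_mode "$cluster_cmd"
```

A mode set through the spark.submit.deployMode property would not be detected by this check.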

Resolution

Client mode jobs

When a Spark job is deployed in client mode, the step logs provide the job parameters and step error messages. These logs are archived to Amazon Simple Storage Service (Amazon S3). For example:

  • s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/steps/s-2M809TD67U2IA/controller.gz: This file contains the spark-submit command. Check this log to see the parameters for the job.
  • s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/steps/s-2M809TD67U2IA/stderr.gz: This file provides the driver logs. (When the Spark job runs in client mode, the Spark driver runs on the master node.)
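Both example paths follow the pattern <log URI>/<cluster ID>/steps/<step ID>/. The sketch below builds that prefix from its three parts so that it can be reused for other steps. The function name step_log_prefix is hypothetical, and the sketch assumes that your cluster's log layout matches the example paths above.

```shell
# step_log_prefix: hypothetical helper, not an AWS CLI command.
# Arguments: cluster log URI, cluster ID, step ID.
step_log_prefix() {
  # ${1%/} drops one trailing slash from the log URI, if present.
  printf '%s/%s/steps/%s/\n' "${1%/}" "$2" "$3"
}

prefix="$(step_log_prefix "s3://aws-logs-111111111111-us-east-1/elasticmapreduce/" "j-35PUYZBQVIJNM" "s-2M809TD67U2IA")"
echo "${prefix}controller.gz"   # contains the spark-submit command
echo "${prefix}stderr.gz"       # contains the driver logs in client mode
```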

To find the root cause of the step failure:

Run the following commands to download the step logs to an Amazon Elastic Compute Cloud (Amazon EC2) instance and then search for warnings and errors:

#Download the step logs:
aws s3 sync s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/steps/s-2M809TD67U2IA/ s-2M809TD67U2IA/
#Open the step log folder:
cd s-2M809TD67U2IA/
#Uncompress the log files:
find . -type f -exec gunzip {} \;
#Cluster mode only: get the YARN application ID from the step log:
grep "Client: Application report for" * | tail -n 1
#Client mode: get the errors and warnings from the step logs:
grep -E "WARN|ERROR" *

For example, this file:

s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/steps/s-2M809TD67U2IA/stderr.gz

indicates a memory problem:

19/11/04 05:24:45 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: Executor memory 134217728 must be at least 471859200. Please increase executor memory using the --executor-memory option or spark.executor.memory in Spark configuration.
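The error message reports both values in bytes, which makes them hard to compare with the units used by --executor-memory. As a small worked example, the sketch below converts the two numbers from the message to mebibytes. The helper name to_mib is hypothetical.

```shell
# to_mib: hypothetical helper that converts a byte count to whole mebibytes.
to_mib() {
  echo $(( $1 / 1024 / 1024 ))
}

configured_mib="$(to_mib 134217728)"   # value the job was submitted with
required_mib="$(to_mib 471859200)"     # minimum named in the error message
echo "configured: ${configured_mib} MiB, minimum: ${required_mib} MiB"
```

The job asked for 128 MiB of executor memory, which is well under the 450 MiB minimum reported in the error.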

Use the information in the logs to resolve the error. For example, to resolve the memory issue, submit a job with more executor memory:

spark-submit --deploy-mode client --executor-memory 4g --class org.apache.spark.examples.SparkPi /usr/lib/spark/examples/jars/spark-examples.jar

Cluster mode jobs

1.    Check the stderr step log to identify the ID of the application that's associated with the failed step. The step logs are archived to Amazon S3. For example, this log:

s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/steps/s-2M809TD67U2IA/stderr.gz

identifies application_1572839353552_0008:

19/11/04 05:24:42 INFO Client: Application report for application_1572839353552_0008 (state: ACCEPTED)
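To avoid copying the ID by hand, you can extract it from the log line. The following is a minimal sketch that runs against the sample line above. The variable names are illustrative, and the pattern assumes that application IDs have the form application_<digits>_<digits>, as in this example.

```shell
# Sample line copied from the step's stderr log.
line='19/11/04 05:24:42 INFO Client: Application report for application_1572839353552_0008 (state: ACCEPTED)'

# grep -o prints only the matching part of each line; tail keeps the last match.
app_id="$(printf '%s\n' "$line" | grep -oE 'application_[0-9]+_[0-9]+' | tail -n 1)"
echo "$app_id"
```

On a downloaded and uncompressed step log, replace the printf with the log file name as the input to grep.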

2.    Identify the application master logs. When the Spark job runs in cluster mode, the Spark driver runs inside the application master. The application master is the first container that runs when the Spark job executes. The following is an example list of Spark application logs. In this list, container_1572839353552_0008_01_000001 is the first container, which means that it's the application master.

s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/containers/application_1572839353552_0008/container_1572839353552_0008_01_000001/stderr.gz

s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/containers/application_1572839353552_0008/container_1572839353552_0008_01_000001/stdout.gz

s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/containers/application_1572839353552_0008/container_1572839353552_0008_01_000002/stderr.gz

s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/containers/application_1572839353552_0008/container_1572839353552_0008_01_000002/stdout.gz
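The application master's container ID can be derived from the application ID, which helps when the list of containers is long. The sketch below is one way to do that for the example list above. It assumes the first application attempt (the 01 segment) and a container ID without an epoch prefix; both are assumptions that might not hold for a retried application or a restarted ResourceManager.

```shell
app_id="application_1572839353552_0008"

# Assumption: first attempt (01), first container (000001), no epoch prefix.
am_container="container_${app_id#application_}_01_000001"

# The example list of Spark application logs from this article.
log_paths="s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/containers/application_1572839353552_0008/container_1572839353552_0008_01_000001/stderr.gz
s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/containers/application_1572839353552_0008/container_1572839353552_0008_01_000001/stdout.gz
s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/containers/application_1572839353552_0008/container_1572839353552_0008_01_000002/stderr.gz
s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/containers/application_1572839353552_0008/container_1572839353552_0008_01_000002/stdout.gz"

# Keep only the paths that belong to the application master container.
am_logs="$(printf '%s\n' "$log_paths" | grep -F "/${am_container}/")"
echo "$am_logs"
```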

3.    After you identify the application master logs, download the logs to an EC2 instance. Then, search for warnings and errors. For example:

#Download the Spark application logs:
aws s3 sync s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/containers/application_1572839353552_0008/ application_1572839353552_0008/
#Open the Spark application log folder:
cd application_1572839353552_0008/
#Uncompress the log files:
find . -type f -exec gunzip {} \;
#Search for warnings and errors inside all the container logs. Each matching line is prefixed with the name of the container log that contains it.
grep -rE "WARN|ERROR" .
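When an application has many containers, a per-file count can show which log to open first. The following sketch demonstrates the idea on a throwaway directory. The directory layout and the file contents are sample data built from lines quoted in this article, not real cluster output.

```shell
# Build a temporary directory that mimics two uncompressed container logs.
workdir="$(mktemp -d)"
mkdir -p "$workdir/container_1572839353552_0008_01_000001" \
         "$workdir/container_1572839353552_0008_01_000002"
printf '%s\n' '19/11/04 05:24:45 ERROR SparkContext: Error initializing SparkContext.' \
  > "$workdir/container_1572839353552_0008_01_000001/stderr"
printf '%s\n' '19/11/04 05:24:42 INFO Client: Application report for application_1572839353552_0008 (state: ACCEPTED)' \
  > "$workdir/container_1572839353552_0008_01_000002/stderr"

# grep -c prints "<file>:<count>" for every file searched.
# sort orders the result numerically on the count, highest first.
counts="$(cd "$workdir" && grep -rcE 'WARN|ERROR' . | sort -t: -k2 -nr)"
echo "$counts"

rm -rf "$workdir"
```

On real logs, run only the grep and sort pipeline from inside the application log folder.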

For example, this log:

s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/containers/application_1572839353552_0008/container_1572839353552_0008_01_000001/stderr.gz

indicates a memory problem:

19/11/04 05:24:45 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: Executor memory 134217728 must be at least 471859200. Please increase executor memory using the --executor-memory option or spark.executor.memory in Spark configuration.

4.    Resolve the issue identified in the logs. For example, to fix the memory issue, submit a job with more executor memory:

spark-submit --deploy-mode cluster --executor-memory 4g --class org.apache.spark.examples.SparkPi /usr/lib/spark/examples/jars/spark-examples.jar 1000
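The error message also names the spark.executor.memory property as a way to raise the limit. As a sketch, the same job can pass that property through the --conf option instead of --executor-memory. The block below only builds and prints the command, because spark-submit is available only on the cluster. The variable names are illustrative.

```shell
# Executor memory to request; 4g matches the example above.
executor_memory="4g"

# Same job as above, with the memory set through a Spark property.
submit_cmd="spark-submit --deploy-mode cluster --conf spark.executor.memory=${executor_memory} --class org.apache.spark.examples.SparkPi /usr/lib/spark/examples/jars/spark-examples.jar 1000"

# Print the command; run the printed line on the master node.
echo "$submit_cmd"
```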
