How do I troubleshoot a failed Spark step in Amazon EMR?

I want to troubleshoot a failed Apache Spark step in Amazon EMR.

Short description

To troubleshoot failed Spark steps:

For Spark jobs submitted with --deploy-mode client: Check the step logs to identify the root cause of the step failure.
For Spark jobs submitted with --deploy-mode cluster: Check the step logs to identify the application ID. Then, check the application master logs to identify the root cause of the step failure.

Resolution

Client mode jobs

When a Spark job is deployed in client mode, the step logs provide the job parameters and step error messages. These logs are archived to Amazon Simple Storage Service (Amazon S3). For example:

s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/steps/s-2M809TD67U2IA/controller.gz: This file contains the spark-submit command. Check this log to see the parameters for the job.
s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/steps/s-2M809TD67U2IA/stderr.gz: This file provides the driver logs. (When the Spark job runs in client mode, the Spark driver runs on the master node.)

To find the root cause of the step failure, run the following commands to download the step logs to an Amazon Elastic Compute Cloud (Amazon EC2) instance. Then, search for warnings and errors:

# Download the step logs:
aws s3 sync s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/steps/s-2M809TD67U2IA/ s-2M809TD67U2IA/
# Change to the step log folder:
cd s-2M809TD67U2IA/
# Uncompress the log files:
find . -type f -exec gunzip {} \;
# Cluster mode jobs only: get the YARN application ID from the step log:
grep "Client: Application report for" * | tail -n 1
# Client mode jobs: get the errors and warnings from the step log:
grep -E "WARN|ERROR" *
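The search above prints only the lines that contain WARN or ERROR, but a Java exception message and its stack trace usually sit on the lines that follow the ERROR line. The following is a minimal sketch that uses the standard grep -A option to print a few trailing lines with each match. The function name show_log_errors and the default of three context lines are illustrative assumptions, not part of the original procedure.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print WARN/ERROR lines plus trailing context
# from every file under a folder of uncompressed logs.
show_log_errors() {
  local dir="$1"
  local context="${2:-3}"   # lines to print after each match (assumed default)
  # -r: recurse into subfolders, -E: extended regex, -A: trailing context lines
  grep -rE -A "$context" "WARN|ERROR" "$dir"
}
```

For example, show_log_errors s-2M809TD67U2IA/ 5 prints each match together with the five lines that follow it.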

For example, because the driver runs on the master node in client mode, the step's stderr log contains the driver error. This file:

s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/steps/s-2M809TD67U2IA/stderr.gz

indicates a memory problem:

19/11/04 05:24:45 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: Executor memory 134217728 must be at least 471859200. Please increase executor memory using the --executor-memory option or spark.executor.memory in Spark configuration.
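The two numbers in this message are byte counts, which are easier to compare after conversion to mebibytes. The following is a small sketch of that arithmetic. The helper name bytes_to_mib is hypothetical and only illustrates the conversion.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert a byte count to whole mebibytes (MiB).
bytes_to_mib() {
  echo $(( $1 / 1024 / 1024 ))
}

bytes_to_mib 134217728   # executor memory that was requested: prints 128
bytes_to_mib 471859200   # minimum that Spark accepts: prints 450
```

In this example, the job requested 128 MiB of executor memory, which is below the 450 MiB minimum reported in the message.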

Use the information in the logs to resolve the error.

For example, to resolve the memory issue, submit a job with more executor memory:

spark-submit --deploy-mode client --executor-memory 4g --class org.apache.spark.examples.SparkPi /usr/lib/spark/examples/jars/spark-examples.jar

Cluster mode jobs

1. Check the stderr step log to identify the ID of the application that's associated with the failed step. The step logs are archived to Amazon S3. For example, this log:

s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/steps/s-2M809TD67U2IA/stderr.gz

identifies application_1572839353552_0008:

19/11/04 05:24:42 INFO Client: Application report for application_1572839353552_0008 (state: ACCEPTED)
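If you want only the application ID rather than the whole log line, the ID can be extracted with a pattern match. The following is a minimal sketch that assumes the step stderr log is already downloaded and uncompressed. The function name last_application_id is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the last YARN application ID that appears
# in an uncompressed step stderr file.
last_application_id() {
  local log_file="$1"
  # -o prints only the matched text, -E enables extended regex
  grep -oE 'application_[0-9]+_[0-9]+' "$log_file" | tail -n 1
}
```

For example, last_application_id s-2M809TD67U2IA/stderr prints application_1572839353552_0008 for the log line shown above.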

2. Identify the application master logs. When the Spark job runs in cluster mode, the Spark driver runs inside the application master. The application master is the first container that runs when the Spark job executes. The following is an example list of Spark application logs.

In this list, container_1572839353552_0008_01_000001 is the first container, which means that it's the application master.

s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/containers/application_1572839353552_0008/container_1572839353552_0008_01_000001/stderr.gz

s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/containers/application_1572839353552_0008/container_1572839353552_0008_01_000001/stdout.gz

s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/containers/application_1572839353552_0008/container_1572839353552_0008_01_000002/stderr.gz

s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/containers/application_1572839353552_0008/container_1572839353552_0008_01_000002/stdout.gz
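After the application log folder is downloaded to a local folder (see the next step), the first container can also be picked out by name. The following is a rough sketch that assumes container folder names end in a six-digit container number, as in the listing above. The function name list_am_containers is hypothetical. If YARN retried the application master, more than one folder might match, one for each attempt.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the folders of the first container (number 000001)
# directly under a downloaded application log folder.
list_am_containers() {
  local app_dir="$1"
  # Assumption: container folder names end with _000001 for the first container.
  find "$app_dir" -maxdepth 1 -type d -name 'container_*_000001' | sort
}
```

For example, list_am_containers application_1572839353552_0008/ prints the path of the container_1572839353552_0008_01_000001 folder.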

3. After you identify the application master logs, download the logs to an Amazon EC2 instance. Then, search for warnings and errors. For example:

#Download the Spark application logs:
aws s3 sync s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/containers/application_1572839353552_0008/ application_1572839353552_0008/
# Change to the Spark application log folder:
cd application_1572839353552_0008/
# Uncompress the log files:
find . -type f -exec gunzip {} \;
# Search for warnings and errors in all the container logs. Then, open the container logs returned in the output of this command.
grep -RilE "ERROR|WARN" . | xargs grep -E "WARN|ERROR"
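When an application has many containers, it can help to see which log files contain the most errors before you open them. The following is a minimal sketch that counts ERROR lines in each file and sorts the result. The function name rank_files_by_errors is hypothetical, and the sketch assumes that the log paths contain no colon characters.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count ERROR lines per log file, highest count first.
rank_files_by_errors() {
  local dir="$1"
  # grep -rc prints "path:count" for every file; drop files with zero matches,
  # then sort numerically on the count field in descending order.
  grep -rc "ERROR" "$dir" | grep -v ':0$' | sort -t: -k2,2 -nr
}
```

For example, rank_files_by_errors . run from the application log folder lists each log file that contains at least one ERROR line, followed by its count.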

For example, this log:

s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/containers/application_1572839353552_0008/container_1572839353552_0008_01_000001/stderr.gz

indicates a memory problem:

19/11/04 05:24:45 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: Executor memory 134217728 must be at least 471859200. Please increase executor memory using the --executor-memory option or spark.executor.memory in Spark configuration.

4. Resolve the issue identified in the logs. For example, to fix the memory issue, submit a job with more executor memory:

spark-submit --deploy-mode cluster --executor-memory 4g --class org.apache.spark.examples.SparkPi /usr/lib/spark/examples/jars/spark-examples.jar 1000

Related information

Adding a Spark step
