How can I troubleshoot stage failures in Spark jobs on Amazon EMR?

Last updated: 2022-01-31

I want to troubleshoot stage failures in Apache Spark applications on Amazon EMR.

Short description

In Spark, a stage failure happens when there's a problem processing a Spark task. Stage failures can be caused by hardware problems, incorrect Spark configurations, or code issues. When a stage failure occurs, the Spark driver logs report an exception similar to the following:

org.apache.spark.SparkException: Job aborted due to stage failure: Task XXX in stage YYY failed 4 times, most recent failure: Lost task XXX in stage YYY (TID ZZZ, ip-xxx-xx-x-xxx.compute.internal, executor NNN): ExecutorLostFailure (executor NNN exited caused by one of the running tasks) Reason: ...
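As a minimal illustration of isolating the reason code from such an exception, the following sketch greps a saved copy of the driver log. The file name `driver.log` and the sample log line are assumptions for the example:

```shell
# Hypothetical driver log containing a stage-failure exception (sample data,
# not real cluster output).
cat > driver.log <<'EOF'
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3 failed 4 times, most recent failure: Lost task 0 in stage 3 (TID 12, ip-172-31-0-10.compute.internal, executor 5): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits.
EOF

# Print only the reason portion of the failure message.
grep -o 'Reason: .*' driver.log
# → Reason: Container killed by YARN for exceeding memory limits.
```

The text after "Reason:" is the reason code you use in the troubleshooting steps that follow.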

Resolution

Find the reason code

For Spark jobs submitted with --deploy-mode client, the reason code is in the exception that's displayed in the terminal.

For Spark jobs submitted with --deploy-mode cluster, run the following command on the master node to find stage failures in the YARN application logs. Replace application_id with the ID of your Spark application (for example, application_1572839353552_0008).

yarn logs -applicationId application_id | grep "Job aborted due to stage failure" -A 10
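When the aggregated logs are large, it can help to save them to a file and then count occurrences of common reason codes to see which failure dominates. A sketch, assuming the logs were already saved to `app.log` (the file name and the sample log contents are illustrative):

```shell
# Assumption: the YARN logs were saved first, for example with
#   yarn logs -applicationId application_id > app.log
# A short sample log stands in for real output here.
cat > app.log <<'EOF'
... Lost task 0 in stage 3 (TID 12, executor 5): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits.
... java.lang.OutOfMemoryError: Java heap space
EOF

# Count lines matching common stage-failure reason codes.
for pattern in ExecutorLostFailure OutOfMemoryError 'Container killed by YARN'; do
  printf '%s: %s\n' "$pattern" "$(grep -c "$pattern" app.log)"
done
```

The pattern with the highest count is usually the best starting point for root-cause analysis.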

You can also get this information from the YARN ResourceManager web UI, in the logs for the application master container.

Resolve the root cause

