How can I troubleshoot stage failures in Spark jobs on Amazon EMR?
Last updated: 2020-01-08
In Spark, a stage failure occurs when a Spark task can't be completed. These failures can be caused by hardware problems, incorrect Spark configurations, or code issues. When a stage failure happens, the Spark driver logs report an exception similar to the following:
org.apache.spark.SparkException: Job aborted due to stage failure: Task XXX in stage YYY failed 4 times, most recent failure: Lost task XXX in stage YYY (TID ZZZ, ip-xxx-xx-x-xxx.compute.internal, executor NNN): ExecutorLostFailure (executor NNN exited caused by one of the running tasks) Reason: ...
Find the reason code
For Spark jobs submitted with --deploy-mode client, the reason code is in the exception that's displayed in the terminal.
For Spark jobs submitted with --deploy-mode cluster, run the following command on the master node to find stage failures in the YARN application logs. Replace application_id with the ID of your Spark application (for example, application_1572839353552_0008).
yarn logs -applicationId application_id | grep "Job aborted due to stage failure" -A 10
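To pull out just the reason text, you can filter the aggregated log output further. The following is a minimal, self-contained sketch: it writes one sample failure line to app.log (a hypothetical filename standing in for redirected `yarn logs` output) and extracts everything after "Reason: " with sed.

```shell
# Hypothetical file simulating saved output of:
#   yarn logs -applicationId application_id > app.log
cat > app.log <<'EOF'
org.apache.spark.SparkException: Job aborted due to stage failure: Task 12 in stage 3 failed 4 times, most recent failure: Lost task 12 in stage 3 (TID 98, ip-xxx-xx-x-xxx.compute.internal, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits.
EOF

# Print only the text after "Reason: " on lines that contain it
reason=$(sed -n 's/.*Reason: //p' app.log)
echo "$reason"
```

Against real logs, replace the sample file with your redirected `yarn logs` output; the same sed expression applies.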
You can also find this information in the YARN ResourceManager web interface, in the logs for the application master container.
Resolve the root cause
After you find the exception, use one of the following articles to resolve the root cause: