Why did my Spark job in Amazon EMR fail?
Last updated: 2022-12-23
My Apache Spark job in Amazon EMR failed.
ERROR ShuffleBlockFetcherIterator: Failed to get block(s) from ip-192-168-14-250.us-east-2.compute.internal:7337
org.apache.spark.network.client.ChunkFetchFailureException: Failure while fetching StreamChunkId[streamId=842490577174,chunkIndex=0]: java.lang.RuntimeException: Executor is not registered (appId=application_1622819239076_0367, execId=661)
This issue might occur when the executor worker node in Amazon EMR is in an unhealthy state. When disk utilization for a worker node exceeds the 90% utilization threshold, the YARN NodeManager health service reports the node as UNHEALTHY. Unhealthy nodes are included in Amazon EMR deny lists. In addition, YARN containers aren't allocated to those nodes.
To troubleshoot this issue, do the following:
- Review the resource manager logs from the EMR cluster master node for unhealthy worker nodes. For more information, see How do I resolve ExecutorLostFailure "Slave lost" errors in Spark on Amazon EMR? and review the High disk utilization section.
- Verify impacted node disk space utilization, review files consuming disk space, and perform data recovery to return nodes to a healthy state. For more information, see Why is the core node in my Amazon EMR cluster running out of disk space?
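As a rough illustration of the disk check described above, the following Python sketch compares the utilization of one mount point with the 90% threshold. The function names and the default path "/" are assumptions for illustration. On a real node you would check the volumes that YARN uses for its local and log directories, and the NodeManager's own calculation might differ slightly from this one.

```python
import shutil

UNHEALTHY_THRESHOLD_PERCENT = 90.0  # threshold named in the text above


def disk_utilization_percent(path="/"):
    """Return used space on the filesystem that contains `path`, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def exceeds_threshold(percent_used, threshold=UNHEALTHY_THRESHOLD_PERCENT):
    """Return True when utilization is above the threshold."""
    return percent_used > threshold


if __name__ == "__main__":
    pct = disk_utilization_percent("/")
    state = "above" if exceeds_threshold(pct) else "below"
    print(f"/ is {pct:.1f}% full ({state} the {UNHEALTHY_THRESHOLD_PERCENT:.0f}% threshold)")
```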
ERROR [Executor task launch worker for task 631836] o.a.s.e.Executor:Exception in task 24.0 in stage 13028.0 (TID 631836) java.util.NoSuchElementException: None.get
This error occurs when there is a problem within the application code and the SparkContext initialization.
Make sure that multiple SparkContext instances aren't active within the same session. You can have only one active SparkContext per Java Virtual Machine (JVM). If you want to initialize another SparkContext, then you must stop the active SparkContext before you create a new one.
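The following PySpark sketch shows one way to avoid a second active SparkContext: reuse the active context when one exists, or stop it before creating a replacement. It assumes that the pyspark package is available, as it typically is on EMR nodes with Spark installed. The function names and the application name argument are hypothetical.

```python
def get_shared_context():
    """Reuse the active SparkContext, or create one if none exists."""
    from pyspark import SparkContext  # imported lazily; requires pyspark

    return SparkContext.getOrCreate()


def replace_context(app_name):
    """Stop the active SparkContext, then start a new one with a new app name."""
    from pyspark import SparkConf, SparkContext  # requires pyspark

    # Stop the active context first so that only one is ever active.
    SparkContext.getOrCreate().stop()
    return SparkContext(conf=SparkConf().setAppName(app_name))
```

In most application code, calling get_shared_context() everywhere (instead of constructing SparkContext directly) is probably enough to avoid the conflict.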
Container killed on request. Exit code is 137
This exception occurs when a task in a YARN container exceeds the physical memory allocated for that container. This commonly happens when you have large shuffle partitions, inconsistent partition sizes, or a large number of executor cores.
Review error details in the Spark driver logs to determine the cause of the error. For more information, see How can I access Spark driver logs on an Amazon EMR cluster?
The following is an example error from the driver log:
ERROR YarnScheduler: Lost executor 19 on ip-10-109-xx-xxx.aws.com : Container from a bad node: container_1658329343444_0018_01_000020 on host: ip-10-109-xx-xxx.aws.com . Exit status: 137. Diagnostics:
Container killed on request. Exit code is 137
Container exited with a non-zero exit code 137.
Killed by external signal
Executor container 'container_1658329343444_0018_01_000020' was killed with exit code 137. To understand the root cause, you can analyze executor container log.
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 23573"...
The preceding error stack trace indicates that there isn't enough available memory on the executor to continue processing data. This error might happen in different job stages in both narrow and wide transformations.
To resolve this issue, do one of the following:
- Increase executor memory.
Note: Executor memory includes the memory required for executing the tasks plus overhead memory. The sum of these must not be greater than the size of the JVM or the YARN maximum container size.
- Add more Spark partitions.
- Increase the number of shuffle partitions.
- Reduce the number of executor cores.
For more information, see How do I resolve "Container killed on request. Exit code is 137" errors in Spark on Amazon EMR?
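To make the note about executor memory and overhead concrete, here is a minimal Python sketch of the arithmetic. It assumes Spark's default overhead rule (10% of executor memory, with a 384 MiB minimum, unless spark.executor.memoryOverhead is set explicitly) and compares the total with a YARN maximum such as the value of yarn.scheduler.maximum-allocation-mb. The function names and the sample numbers are hypothetical, and the sketch ignores YARN's rounding of container sizes.

```python
MIN_OVERHEAD_MB = 384    # Spark's default minimum overhead
OVERHEAD_FACTOR = 0.10   # Spark's default overhead fraction


def container_request_mb(executor_memory_mb, overhead_mb=None):
    """Return executor memory plus overhead, in MiB."""
    if overhead_mb is None:
        # No explicit overhead: apply the default rule.
        overhead_mb = max(MIN_OVERHEAD_MB, int(executor_memory_mb * OVERHEAD_FACTOR))
    return executor_memory_mb + overhead_mb


def fits_in_yarn(executor_memory_mb, yarn_max_mb, overhead_mb=None):
    """Return True when the total request is within the YARN maximum container size."""
    return container_request_mb(executor_memory_mb, overhead_mb) <= yarn_max_mb


if __name__ == "__main__":
    # Hypothetical values: 8 GiB executors on a node with a 12 GiB YARN maximum.
    print(container_request_mb(8192))
    print(fits_in_yarn(8192, 12288))
```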
Spark jobs are in a hung state and not completing
Spark jobs might be stuck for multiple reasons. For example, the Spark driver (application master) process might be impacted, or executor containers might be lost.
This commonly happens when you have high disk space utilization, or when you use Spot Instances for cluster nodes and the Spot Instance is terminated. For more information, see How do I resolve ExecutorLostFailure "Slave lost" errors in Spark on Amazon EMR?
To troubleshoot this issue, do the following:
- Review Spark application master or driver logs for any exceptions.
- Validate the YARN node list for any unhealthy nodes. When disk utilization for a core node exceeds the 90% utilization threshold, the YARN NodeManager health service reports the node as UNHEALTHY. Unhealthy nodes are included in Amazon EMR deny lists. In addition, YARN containers aren't allocated to those nodes.
- Monitor disk space utilization, and configure Amazon Elastic Block Store (Amazon EBS) volumes to keep utilization below 90% for EMR cluster worker nodes.
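The node list check above can be sketched in Python as follows. The sketch assumes that you captured the text output of a command such as yarn node -list -all, and that each node row starts with the node ID followed by the node state. The sample text, its hostnames, and the function name are hypothetical, so verify the column layout against the output on your own cluster.

```python
# Illustrative sample only; real output comes from the YARN CLI on the cluster.
SAMPLE_NODE_LIST = """\
Total Nodes:3
         Node-Id      Node-State  Node-Http-Address  Number-of-Running-Containers
ip-10-0-0-11.example.internal:8041   RUNNING    ip-10-0-0-11.example.internal:8042   4
ip-10-0-0-12.example.internal:8041   UNHEALTHY  ip-10-0-0-12.example.internal:8042   0
ip-10-0-0-13.example.internal:8041   RUNNING    ip-10-0-0-13.example.internal:8042   2
"""


def unhealthy_nodes(node_list_text):
    """Return the node IDs whose state column reads UNHEALTHY."""
    nodes = []
    for line in node_list_text.splitlines():
        fields = line.split()
        # Assumed layout: node ID first, node state second.
        if len(fields) >= 2 and fields[1] == "UNHEALTHY":
            nodes.append(fields[0])
    return nodes


if __name__ == "__main__":
    for node in unhealthy_nodes(SAMPLE_NODE_LIST):
        print(node)
```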
WARN Executor: Issue communicating with driver in heartbeater org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10000 milliseconds]. This timeout is controlled by spark.executor.heartbeatInterval
Spark executors send heartbeat signals to the Spark driver at intervals specified by the spark.executor.heartbeatInterval property. During long garbage collection pauses, the executor might not send a heartbeat signal. The driver kills executors that fail to send a heartbeat signal for longer than the period specified by spark.network.timeout.
Timeout exceptions occur when the executor is under memory constraint or is running into out-of-memory (OOM) issues while processing data. Memory pressure also slows the garbage collection process, which causes further delay.
Use one of the following methods to resolve heartbeat timeout errors:
- Increase executor memory. Also, depending on the application process, repartition your data.
- Tune garbage collection.
- Increase the interval for spark.executor.heartbeatInterval.
- Specify a longer spark.network.timeout period.
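As one possible way to apply the last two options, the following Python sketch builds spark-submit --conf arguments for both properties and rejects combinations where the heartbeat interval is not below the network timeout, which Spark's documentation advises against. The function name and the 30-second and 300-second values are hypothetical examples, not recommended settings.

```python
def heartbeat_conf_args(heartbeat_interval_s, network_timeout_s):
    """Return spark-submit --conf arguments for the two timeout properties."""
    # The heartbeat interval should stay well below spark.network.timeout,
    # otherwise the driver can time out executors between heartbeats.
    if heartbeat_interval_s >= network_timeout_s:
        raise ValueError(
            "spark.executor.heartbeatInterval must be less than spark.network.timeout"
        )
    return [
        "--conf", f"spark.executor.heartbeatInterval={heartbeat_interval_s}s",
        "--conf", f"spark.network.timeout={network_timeout_s}s",
    ]


if __name__ == "__main__":
    # Hypothetical values for illustration.
    print(" ".join(heartbeat_conf_args(30, 300)))
```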
ExecutorLostFailure "Exit status: -100. Diagnostics: Container released on a *lost* node"
This error occurs when a core or task node is terminated because of high disk space utilization. The error also occurs when a node becomes unresponsive due to prolonged high CPU utilization or low available memory. For troubleshooting steps, see How can I resolve "Exit status: -100. Diagnostics: Container released on a *lost* node" errors in Amazon EMR?
Note: This error might also occur when you use Spot Instances for cluster nodes, and a Spot Instance is terminated. In this scenario, the EMR cluster provisions an On-Demand Instance to replace the terminated Spot Instance. This means that the application might recover on its own. For more information, see Spark enhancements for elasticity and resiliency on Amazon EMR.
executor 38: java.sql.SQLException (Network error IOException: Connection timed out (Connection timed out)
This issue occurs when the executor can't establish a socket connection to the SQL database while reading or writing data. Verify that the DB host can receive incoming connections on port 1433 (the default port for Microsoft SQL Server) from your EMR cluster security groups.
Also, review the maximum number of parallel database connections configured for the SQL DB and the memory allocation for the DB instance class. Database connections also consume memory. If utilization is high, then review the database configuration and the number of allowed connections. For more information, see Maximum number of database connections.
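To check the network path described above, a simple TCP connection test run from an EMR node can help separate security group problems from database-side limits. The following Python sketch is one way to do that. The function name is hypothetical, and the host you pass in is your own database endpoint.

```python
import socket


def can_connect(host, port=1433, timeout_s=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

A result of False suggests a routing, security group, or listener problem. A result of True with continued job failures points more toward connection limits or load on the database.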
Amazon S3 exceptions
HTTP 503 "Slow Down"
HTTP 503 exceptions occur when you exceed the Amazon Simple Storage Service (Amazon S3) request rate for the prefix. A 503 exception doesn't always mean that a failure will occur. However, mitigating the problem can improve your application's performance.
For more information, see Why does my Spark or Hive job on Amazon EMR fail with an HTTP 503 "Slow Down" AmazonS3Exception?
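One general mitigation for throttling responses is to retry with exponential backoff and jitter. The following Python sketch shows the technique in isolation. SlowDownError is a hypothetical stand-in for a 503 "Slow Down" response, and the function names and default values are assumptions for illustration. This is not the retry logic that Amazon EMR or the S3 client uses internally.

```python
import random
import time


class SlowDownError(Exception):
    """Hypothetical stand-in for an S3 503 "Slow Down" response."""


def backoff_ceiling(attempt, base_s=0.5, cap_s=20.0):
    """Return the maximum delay for an attempt: base * 2^attempt, capped."""
    return min(cap_s, base_s * (2 ** attempt))


def call_with_backoff(operation, max_attempts=5, sleep=time.sleep, rng=random.random):
    """Call operation(), retrying throttled calls with jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except SlowDownError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Full jitter: sleep a random fraction of the current ceiling.
            sleep(rng() * backoff_ceiling(attempt))
```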
HTTP 403 "Access Denied"
HTTP 403 errors are caused by incorrect credentials or permissions in any of the following:
- Credentials or roles that are specified in your application code.
- The policy that's attached to the Amazon EC2 instance profile role.
- Amazon Virtual Private Cloud (Amazon VPC) endpoints for Amazon S3.
- S3 source and destination bucket policies.
To resolve 403 errors, be sure that the relevant AWS Identity and Access Management (IAM) role or policy allows access to Amazon S3. For more information, see Why does my Amazon EMR application fail with an HTTP 403 "Access Denied" AmazonS3Exception?
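For reference, the following Python sketch builds a minimal IAM policy document that allows listing a bucket and reading and writing its objects. The bucket name and function name are hypothetical, and a real job might need more actions (for example, deleting objects or using an encryption key), so treat this as a starting point rather than a complete policy.

```python
import json


def s3_read_write_policy(bucket):
    """Return a minimal policy document for list, read, and write access."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                # Reads and writes apply to the objects in the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
        ],
    }


if __name__ == "__main__":
    # "example-bucket" is a placeholder name.
    print(json.dumps(s3_read_write_policy("example-bucket"), indent=2))
```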
HTTP 404 "Not Found"
HTTP 404 errors indicate that the application expected to find an object in S3, but at the time of the request, the object wasn't found. Common causes include:
- Incorrect S3 paths (for example, a mistyped path).
- The file was moved or deleted by a process outside of the application.
- An operation that caused eventual consistency problems, such as an overwrite.
For more information, see Why does my Amazon EMR application fail with an HTTP 404 "Not Found" AmazonS3Exception?
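Because mistyped paths are a common cause, it can help to validate S3 URIs before a job reads them. The following Python sketch splits a URI into its bucket and key and rejects malformed input. The function name, the accepted scheme list, and the sample path are assumptions for illustration. The sketch also doesn't handle keys that contain "?" or "#" characters.

```python
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"s3", "s3a", "s3n"}  # assumed set of schemes to accept


def split_s3_uri(uri):
    """Split an S3 URI into (bucket, key), or raise ValueError if malformed."""
    parsed = urlparse(uri)
    if parsed.scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"unexpected scheme in {uri!r}")
    if not parsed.netloc:
        # Catches typos such as a single slash after the scheme.
        raise ValueError(f"missing bucket name in {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


if __name__ == "__main__":
    # Hypothetical path for illustration.
    print(split_s3_uri("s3://example-bucket/data/2022/part-0000.parquet"))
```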