Posted On: Feb 9, 2023

Today, we are introducing job retries, a new capability for Amazon EMR on EKS that increases job execution resiliency. Until now, users had to build their own retry mechanism outside of Amazon EMR on EKS to keep their Spark jobs running after a failure. With this feature, Amazon EMR on EKS automatically resubmits failed jobs, so users save time and keep their business-critical and long-running streaming workloads running.

With job retries, you define a retry policy by specifying the maximum number of attempts allowed for a job. Amazon EMR on EKS then enforces and monitors this policy during each job execution, and gives you visibility into each retry through the DescribeJobRun API and Amazon CloudWatch Events.
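As an illustration, the retry policy might be expressed as part of a job submission request along the lines of the sketch below. This is a minimal example, not the definitive way to submit a job: the helper `build_start_job_run_request`, the default job name, and the release label value are assumptions made for the example, and the field names (`retryPolicyConfiguration`, `maxAttempts`) reflect our understanding of the StartJobRun request shape, so check them against the documentation before relying on them.

```python
from typing import Any, Dict


def build_start_job_run_request(
    virtual_cluster_id: str,
    execution_role_arn: str,
    entry_point: str,
    max_attempts: int,
    release_label: str = "emr-6.9.0-latest",  # assumed label; retries need EMR 6.9 or later
    name: str = "retry-demo",                 # hypothetical job name
) -> Dict[str, Any]:
    """Build keyword arguments for a StartJobRun call that carries a retry policy.

    Hypothetical helper for illustration only; it performs no AWS calls.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    return {
        "name": name,
        "virtualClusterId": virtual_cluster_id,
        "executionRoleArn": execution_role_arn,
        "releaseLabel": release_label,
        "jobDriver": {
            "sparkSubmitJobDriver": {"entryPoint": entry_point},
        },
        # The retry policy: the upper bound on attempts for this job run.
        "retryPolicyConfiguration": {"maxAttempts": max_attempts},
    }


# Placeholder identifiers below are made up for the example.
request = build_start_job_run_request(
    virtual_cluster_id="example-virtual-cluster-id",
    execution_role_arn="arn:aws:iam::111122223333:role/example-job-role",
    entry_point="s3://example-bucket/scripts/streaming_job.py",
    max_attempts=5,
)
```

Assuming the AWS SDK for Python is installed and configured, a dictionary like this could be passed to the `emr-containers` client as `start_job_run(**request)`. A later `describe_job_run` call for the same job run should then report the configured policy and the current attempt count, which is the visibility described above.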

Job retries are now generally available in all AWS Regions where Amazon EMR on EKS is available, on Amazon EMR 6.9 and later releases. To learn more about how to use job retries, please visit our documentation.