Posted On: Oct 20, 2020
AWS Batch now allows users to configure retry strategies based on defined exit codes. Customers can determine whether their AWS Batch jobs retry based on specified events such as infrastructure failure or application failure. This gives customers tight control over the actions taken on job failure, which lowers costs and improves throughput by retrying only when necessary.
There are a variety of reasons why batch jobs might need to retry. An application may be consuming too much memory on an instance, or a job may not have access to specific files needed to successfully execute its code. Regardless of the reason for the failure, allowing for retries is often a necessity, particularly for customers wanting to take advantage of cost savings by using interruptible Spot instances.
Starting today, AWS Batch allows users to specify whether a job should retry based on a range of failure codes. This allows customers to set up simple retry strategies. For example, if a job fails because its instance was reclaimed by Spot, the job retries. If it fails because it is consuming too much memory, the job fails without retry and the user is notified.
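As a rough illustration of the Spot and memory example above, a retry strategy might look like the following Python sketch. It follows the `retryStrategy` shape of the AWS Batch API (`attempts` plus an `evaluateOnExit` list of conditions and actions). The specific match patterns and the attempt count are assumptions chosen for illustration, not values taken from this announcement, so check them against the failure reasons your own jobs report.

```python
# Minimal sketch of an AWS Batch retry strategy as a plain Python dict.
# The patterns and attempt count are illustrative assumptions.
retry_strategy = {
    # Upper bound on how many times the job may be attempted.
    "attempts": 3,
    "evaluateOnExit": [
        # Assumed pattern: retry when the status reason indicates the
        # host instance was terminated, as happens on a Spot reclaim.
        {"onStatusReason": "Host EC2*", "action": "RETRY"},
        # Assumed pattern: do not retry when the container was killed
        # for using too much memory.
        {"onReason": "OutOfMemoryError*", "action": "EXIT"},
        # Catch-all: any other failure exits without retry.
        {"onReason": "*", "action": "EXIT"},
    ],
}

# Show the configured rules in order.
for rule in retry_strategy["evaluateOnExit"]:
    print(rule["action"], rule)
```

A dict like this could be passed as the `retryStrategy` argument when registering a job definition or submitting a job, for example through boto3's Batch client. The catch-all rule at the end is a design choice in this sketch that makes the default outcome explicit instead of relying on the service's behavior when no condition matches.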
Learn more about configuring your retry strategy in AWS Batch in our documentation.