Introducing new alerts to help users detect and react to blocked job queues in AWS Batch

As many readers will know, AWS Batch provides functionality that enables you to run batch workloads on the managed container orchestration services in AWS: Amazon ECS and Amazon EKS. One of the core concepts of Batch is that it provides a job queue you can submit your work to. Batch is designed to transition your jobs from SUBMITTED to RUNNABLE if they pass preliminary checks, and from RUNNING to either FAILED or SUCCEEDED after the job is placed on a compute resource and completes. Batch also sends an event to Amazon CloudWatch Events for each job state update.

Sometimes, though, a RUNNABLE job at the head of the queue can block all other jobs behind it from running. This could be caused by a misconfiguration in your AWS account, or it might be because your account doesn’t have access to the specific instances required for the job (like a GPU, for example). We’ve documented several common causes – and their resolutions – but to fix the issue you first need to know that the blocked job queue condition exists.

Today we’re introducing CloudWatch Events notifications for blocked job queues. We’ve designed this new feature to provide you with an event any time Batch detects that a job queue is blocked by a RUNNABLE job at the head of the queue. Even better, the feature is designed to show the reason the job is stuck in both the CloudWatch event and the statusReason of the job returned from DescribeJobs and ListJobs API calls.
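As a rough sketch of what reading that statusReason could look like from Python, the function below pages through ListJobs for RUNNABLE jobs and keeps those whose reason starts with one of the codes listed in Table 1. The client is passed in (for example `boto3.client("batch")`); the function name and the prefix list are our own illustrative choices, not part of the Batch API.

```python
from typing import Any, Dict, Iterator

# Reason prefixes copied from Table 1 of this post (illustrative, not exhaustive).
BLOCKED_PREFIXES = ("CAPACITY:", "MISCONFIGURATION:", "ACTION_REQUIRED", "UNDETERMINED")


def blocked_runnable_jobs(batch_client: Any, job_queue: str) -> Iterator[Dict[str, str]]:
    """Yield RUNNABLE jobs whose statusReason looks like a blocked-queue reason."""
    kwargs: Dict[str, Any] = {"jobQueue": job_queue, "jobStatus": "RUNNABLE"}
    while True:
        page = batch_client.list_jobs(**kwargs)
        for summary in page.get("jobSummaryList", []):
            # Jobs that are simply waiting their turn carry no blocked reason.
            reason = summary.get("statusReason", "")
            if reason.startswith(BLOCKED_PREFIXES):
                yield {"jobId": summary["jobId"], "statusReason": reason}
        token = page.get("nextToken")
        if not token:
            break
        kwargs["nextToken"] = token  # fetch the next page of results
```

Passing the client in as an argument keeps the function easy to exercise with a stub before pointing it at a real queue.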

Common causes of blocked job queues

There are many reasons why a job at the head of the queue can block other jobs behind it from running.

IAM roles, network, and security settings are often the culprits, but there are several more reasons for a blocked job queue that Batch can detect in your environment:

your queue’s attached compute environments (CEs) have all received insufficient capacity errors
your CEs have a maxVcpu that’s too small for the job requirements
your CEs lack any instances that meet the job requirements
your service role has a permission issue
all your CEs are in an INVALID state (which usually means networking or IAM roles are preventing instances joining the compute fleet)

With the new CloudWatch Events notifications we’re announcing today, Batch can now send a notification when it detects a common reason for jobs being stuck in RUNNABLE, but not for every RUNNABLE job simply waiting for its turn. Finally, sometimes Batch will detect a blocked job queue but can’t determine the specific reason. In this case, Batch can still send a notification of a blocked job queue, but you’ll need to do a bit of detective work to figure out the root cause.

Taking action

There are two ways you can programmatically act on the blocked job queue events that Batch sends to CloudWatch Events.

Acting on events with Amazon EventBridge

The first way you can automate an action (based on a matching event pattern) is to define Amazon EventBridge rules with different EventBridge targets for each event type.
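To make the idea of a matching event pattern concrete, here is a minimal sketch that builds one in Python. It assumes the events arrive with source `aws.batch` and the “Batch Job Queue Blocked” detail type mentioned later in this post, and that the reason code sits at the start of `detail.statusReason`; check both against the Batch event documentation before relying on them. The helper name is hypothetical.

```python
import json
from typing import Any, Dict, Optional


def blocked_queue_pattern(reason_prefix: Optional[str] = None) -> str:
    """Return an EventBridge event pattern (as a JSON string) for blocked-queue events."""
    pattern: Dict[str, Any] = {
        "source": ["aws.batch"],
        "detail-type": ["Batch Job Queue Blocked"],
    }
    if reason_prefix is not None:
        # Assumption: the reason code is the leading part of detail.statusReason,
        # so a prefix filter narrows the rule to one class of blockage.
        pattern["detail"] = {"statusReason": [{"prefix": reason_prefix}]}
    return json.dumps(pattern)
```

The resulting string can be supplied as the `EventPattern` of a rule, for instance through the boto3 `events` client's `put_rule` call, with one rule (and one target) per reason you want to handle differently.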

For example, when you receive a message about a job that requires more memory than any instance can provide, you could use an AWS Lambda function to terminate the job, which unblocks the job queue, and then send the event information to an Amazon SQS queue so you can inspect the job details later.

When you review those SQS messages, you can decide if you want to adjust the CE to meet the needs of these types of jobs, or create a new Batch environment to handle more resource-intensive jobs.
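The terminate-then-archive flow described above could be sketched as the Lambda function below. Treat it as an illustration under assumptions: the event is assumed to carry `jobId` and `statusReason` inside `detail`, and `REVIEW_QUEUE_URL` is a hypothetical environment variable holding your SQS queue URL. The `terminate_job` and `send_message` calls are the standard boto3 Batch and SQS operations.

```python
import json
import os
from typing import Any, Dict

# Reason code from Table 1 for jobs whose requirements no attached CE can meet.
TERMINATE_PREFIX = "MISCONFIGURATION:JOB_RESOURCE_REQUIREMENT"


def handle_blocked_event(event: Dict[str, Any], batch: Any, sqs: Any, queue_url: str) -> bool:
    """Terminate the blocking job and archive the event; return True if we acted."""
    detail = event.get("detail", {})
    reason = detail.get("statusReason", "")
    if not reason.startswith(TERMINATE_PREFIX):
        return False  # leave other reasons to a different rule or to a human
    batch.terminate_job(jobId=detail["jobId"], reason="Unblocking queue: " + reason)
    # Keep the full event so the job details can be reviewed later.
    sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps(event))
    return True


def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, bool]:
    import boto3  # imported lazily so the core logic can be tested without AWS

    acted = handle_blocked_event(
        event,
        boto3.client("batch"),
        boto3.client("sqs"),
        os.environ["REVIEW_QUEUE_URL"],  # hypothetical variable name
    )
    return {"terminated": acted}
```

Keeping the decision logic in `handle_blocked_event` separate from the Lambda entry point means you can test it locally with stub clients before granting the function permission to terminate real jobs.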

The Batch user guide has examples showing how to listen for and react to the CloudWatch events using EventBridge.

Auto terminating stuck jobs

The second way you can act to address blocked job queues is at the level of the job queue itself.

Batch now has a jobStateTimeLimitActions parameter for job queues that lets you automatically cancel a stuck job after a defined period of time, maxTimeSeconds. If you opt to define this parameter, Batch is designed to start the timeout clock ticking when it detects that a job is blocking the queue. Batch will also update the statusReason field at this time. Once the stuck job at the head of the queue is cancelled, Batch emits a “Batch Job State Change” CloudWatch event with the underlying reason. If Batch then detects that another job at the head of the queue is blocking it, Batch emits a “Batch Job Queue Blocked” CloudWatch event and starts a new maxTimeSeconds timer, taking the action you defined once the limit is reached.

Here’s an example showing how to tell the job queue to cancel a job once it has been blocked for 4 hours (maxTimeSeconds=14400) because the attached compute environments’ maximum resources are smaller than the job requires.

"jobStateTimeLimitActions": [
    {
        "reason": "MISCONFIGURATION:COMPUTE_ENVIRONMENT_MAX_RESOURCE",
        "state": "RUNNABLE",
        "maxTimeSeconds": 14400,
        "action": "CANCEL"
    }
]
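If you want different time limits for different reasons, a small helper can generate the list for you. The sketch below is one possible shape, using only the reason codes that Table 1 lists as valid for jobStateTimeLimitActions. The helper name is ours, and the only numeric check it makes is that the limit is positive; consult the Batch API reference for the actual allowed range of maxTimeSeconds.

```python
from typing import Dict, List, Union

# Reasons that Table 1 lists as usable in jobStateTimeLimitActions.
CANCELLABLE_REASONS = (
    "CAPACITY:INSUFFICIENT_INSTANCE_CAPACITY",
    "MISCONFIGURATION:COMPUTE_ENVIRONMENT_MAX_RESOURCE",
    "MISCONFIGURATION:JOB_RESOURCE_REQUIREMENT",
    "MISCONFIGURATION:SERVICE_ROLE_PERMISSIONS",
)

Action = Dict[str, Union[str, int]]


def build_time_limit_actions(limits: Dict[str, int]) -> List[Action]:
    """Turn a {reason: seconds} mapping into a jobStateTimeLimitActions list."""
    actions: List[Action] = []
    for reason, seconds in limits.items():
        if reason not in CANCELLABLE_REASONS:
            raise ValueError(f"{reason!r} cannot be used in jobStateTimeLimitActions")
        if seconds <= 0:
            raise ValueError("maxTimeSeconds must be positive")
        actions.append(
            {
                "reason": reason,
                "state": "RUNNABLE",
                "maxTimeSeconds": seconds,
                "action": "CANCEL",
            }
        )
    return actions
```

The returned list can then be passed as the `jobStateTimeLimitActions` argument when you create or update a job queue, for example with the boto3 Batch client's `update_job_queue` call.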

Table 1 lists the scenarios we discussed, the output message in CloudWatch Events, and the jobStateTimeLimitActions.reason that you can specify to cancel the stuck job in the job queue. The table also lists the “Batch Job State Change” CloudWatch event message emitted if the job was automatically cancelled.

Table 1: AWS Batch CloudWatch Events messages for blocked job queue events. The **Scenario** column describes the context in which Batch determined that a queue was blocked. **Status Parameter** refers to the key fields that the event messages provide. **Status Parameter Value** lists an example value for the corresponding parameter. Note that the CAPACITY:INSUFFICIENT_INSTANCE_CAPACITY message will name the specific instance type involved, rather than the [instanceTypeName] placeholder shown in the example.
Scenario	Status Parameter	Status Parameter Value
All your job queue’s connected compute environments have received insufficient capacity errors.	CloudWatch Event `statusReason`	CAPACITY:INSUFFICIENT_INSTANCE_CAPACITY – Service cannot fulfill the capacity requested for instance type [instanceTypeName].
	`jobStateTimeLimitActions.reason`	CAPACITY:INSUFFICIENT_INSTANCE_CAPACITY
	CloudWatch Event `statusReason` after job cancellation	Canceled by JobStateTimeLimit action due to reason: CAPACITY:INSUFFICIENT_INSTANCE_CAPACITY
All compute environments have maxVcpu that is smaller than job requirements.	CloudWatch Event `statusReason`	MISCONFIGURATION:COMPUTE_ENVIRONMENT_MAX_RESOURCE – CE(s) associated with the job queue cannot meet the CPU requirement of the job.
	`jobStateTimeLimitActions.reason`	MISCONFIGURATION:COMPUTE_ENVIRONMENT_MAX_RESOURCE
	CloudWatch Event `statusReason` after job cancellation	Canceled by JobStateTimeLimit action due to reason: MISCONFIGURATION:COMPUTE_ENVIRONMENT_MAX_RESOURCE
All compute environments have no connected instances that meet job requirements.	CloudWatch Event `statusReason`	MISCONFIGURATION:JOB_RESOURCE_REQUIREMENT – The job resource requirement (vCPU/memory/GPU) is higher than that can be met by the CE(s) attached to the job queue.
	`jobStateTimeLimitActions.reason`	MISCONFIGURATION:JOB_RESOURCE_REQUIREMENT
	CloudWatch Event `statusReason` after job cancellation	Canceled by JobStateTimeLimit action due to reason: MISCONFIGURATION:JOB_RESOURCE_REQUIREMENT
All compute environments have service role issues.	CloudWatch Event `statusReason`	MISCONFIGURATION:SERVICE_ROLE_PERMISSIONS – Batch service role has a permission issue.
	`jobStateTimeLimitActions.reason`	MISCONFIGURATION:SERVICE_ROLE_PERMISSIONS
	CloudWatch Event `statusReason` after job cancellation	Canceled by JobStateTimeLimit action due to reason: MISCONFIGURATION:SERVICE_ROLE_PERMISSIONS
All connected compute environments are invalid.	CloudWatch Event `statusReason`	ACTION_REQUIRED – CE(s) associated with the job queue are invalid.
	`jobStateTimeLimitActions.reason`	Not applicable
	CloudWatch Event `statusReason` after job cancellation	Not applicable
Batch has detected a blocked queue, but is unable to determine the reason.	CloudWatch Event `statusReason`	UNDETERMINED – Batch job is blocked, root cause is undetermined.
	`jobStateTimeLimitActions.reason`	Not applicable
	CloudWatch Event `statusReason` after job cancellation	Not applicable

You’ll note that some reasons can’t be used in the jobStateTimeLimitActions parameter: when all your queue’s attached compute environments are INVALID, and when Batch is unable to determine the root cause of the blockage. In both of those cases, we recommend setting up EventBridge rules to notify you when they occur.
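One way to route events along these lines is to split the statusReason into its code and message, then check whether the code is one of the two that Table 1 marks as “Not applicable” for automatic cancellation. The sketch below assumes the “CODE, separator, message” layout shown in Table 1 and accepts either a hyphen or an en dash as the separator, since the exact character in live events may differ; the function names are hypothetical.

```python
import re
from typing import NamedTuple

# Codes that Table 1 marks "Not applicable" for jobStateTimeLimitActions.
NOTIFY_ONLY_CODES = {"ACTION_REQUIRED", "UNDETERMINED"}


class BlockedReason(NamedTuple):
    code: str
    message: str


def parse_status_reason(status_reason: str) -> BlockedReason:
    """Split a statusReason into its reason code and human-readable message."""
    # The separator is a hyphen or en dash (U+2013) surrounded by whitespace.
    parts = re.split(r"\s+[-\u2013]\s+", status_reason, maxsplit=1)
    code = parts[0].strip()
    message = parts[1].strip() if len(parts) > 1 else ""
    return BlockedReason(code, message)


def needs_human(status_reason: str) -> bool:
    """True when the reason cannot be auto-cancelled and should trigger a notification."""
    return parse_status_reason(status_reason).code in NOTIFY_ONLY_CODES
```

A target that receives every blocked-queue event could call `needs_human` first and send a notification when it returns True, leaving the remaining reasons to the time limit actions configured on the queue.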

Conclusion

AWS Batch has introduced new functionality designed to help you detect and act on blocked job queues, where a job that’s at the head of a queue can’t run for one reason or another and prevents all the others behind it from running, too. We shared how to use EventBridge to catch these new CloudWatch Events, and the Batch job queue API to automatically terminate the stuck job and unblock the queue of jobs behind it.

Finally, we covered the most common root causes along with the error messages to look for. In most cases Batch can automatically determine the root cause of the blockage, allowing you to define a specific automated action for each class of error to unblock the queue.

To get started using AWS Batch, log into the AWS Management Console, or read the AWS Batch User Guide.

For specific guidance on using these new events, refer to the AWS Batch Troubleshooting guide for jobs stuck in RUNNABLE, the blocked job queue events documentation, and the Batch EventBridge documentation on how to react to these events.

AWS HPC Blog