Introducing new alerts to help users detect and react to blocked job queues in AWS Batch
As many readers will know, AWS Batch provides functionality that enables you to run batch workloads on the managed container orchestration services in AWS: Amazon ECS and Amazon EKS. One of the core concepts of Batch is that it provides a job queue you can submit your work to. Batch is designed to transition your jobs from SUBMITTED to RUNNABLE if they pass preliminary checks, and from RUNNING to either FAILED or SUCCEEDED after the job is placed on a compute resource and completes. Batch also sends an event to Amazon CloudWatch Events for each corresponding job state update.
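For example, you can already catch jobs that end in FAILED with an EventBridge rule that matches these "Batch Job State Change" events. Here's a minimal sketch using the AWS CLI (the rule name is a placeholder):

# Match Batch jobs that transition to FAILED (the rule name is a placeholder)
aws events put-rule \
    --name batch-failed-jobs \
    --event-pattern '{
        "source": ["aws.batch"],
        "detail-type": ["Batch Job State Change"],
        "detail": {"status": ["FAILED"]}
    }'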
Sometimes, though, a RUNNABLE job at the head of the queue can block all other jobs behind it from running. This could be caused by a misconfiguration in your AWS account, or it might be because your account doesn't have access to the specific instances required for the job (a GPU, for example). We've documented several common causes – and their resolutions – but to fix the issue you first need to know that the blocked job queue condition exists.
Today we're introducing CloudWatch Events notifications for blocked job queues. We've designed this new feature to provide you with an event any time Batch detects that a job queue is blocked by a RUNNABLE job at the head of the queue. Even better, the feature is designed to show the reason the job is stuck in both the CloudWatch event and the statusReason of the job returned from DescribeJobs and ListJobs API calls.
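That means you can pull the reason for a stuck job straight from the command line. A minimal sketch (the job ID is a placeholder):

# Print the statusReason for a specific job (the job ID is a placeholder)
aws batch describe-jobs \
    --jobs 12345678-aaaa-bbbb-cccc-000000000000 \
    --query 'jobs[0].statusReason' \
    --output text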
Common causes of blocked job queues
There are many reasons why a job at the head of the queue can block other jobs behind it from running.
IAM roles, network, and security settings are often the culprits, but there are several more reasons for a blocked job queue that Batch can detect in your environment:
- your queue's attached compute environments (CEs) have all received insufficient capacity errors
- your CEs have a maxVcpu that's too small for the job requirements
- your CEs lack any instances that meet the job requirements
- your service role has a permission issue
- all your CEs are in an INVALID state (which usually means networking or IAM roles are preventing instances from joining the compute fleet)
With the new CloudWatch Events notifications we're announcing today, Batch can now send a notification when it detects a common reason for jobs being stuck in RUNNABLE, but not for every RUNNABLE job simply waiting for its turn. Finally, sometimes Batch will detect a blocked job queue but can't determine the specific reason. In this case, Batch can still send a notification of a blocked job queue, but you'll need to do a bit of detective work to figure out the root cause.
Taking action
There are two ways you can programmatically act on the blocked job queue events that Batch sends to CloudWatch Events.
Acting on events with Amazon EventBridge
The first way you can automate an action (based on a matching event pattern) is to define Amazon EventBridge rules with different EventBridge targets for each event type.
For example, when you receive a message about a job that requires more memory than any instance can provide, you could use an AWS Lambda function to terminate the job request—to unblock the job queue—and then send the event information into an Amazon SQS queue so you can inspect the job details later.
When you review those SQS messages, you can decide if you want to adjust the CE to meet the needs of these types of jobs, or create a new Batch environment to handle more resource-intensive jobs.
The Batch user guide has examples showing how to listen for and react to the CloudWatch events using EventBridge.
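As a minimal sketch of what such a rule can look like, the following AWS CLI commands create a rule that matches the new "Batch Job Queue Blocked" events and point it at a Lambda function (the rule name and function ARN are placeholders; you'd also need to grant EventBridge permission to invoke the function):

# Match the new "Batch Job Queue Blocked" events from AWS Batch
aws events put-rule \
    --name batch-blocked-job-queue \
    --event-pattern '{
        "source": ["aws.batch"],
        "detail-type": ["Batch Job Queue Blocked"]
    }'

# Route matching events to a Lambda function (the ARN is a placeholder)
aws events put-targets \
    --rule batch-blocked-job-queue \
    --targets 'Id=unblock-queue-lambda,Arn=arn:aws:lambda:us-east-1:111122223333:function:unblock-queue'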
Auto-terminating stuck jobs
The second way you can act to address blocked job queues is at the level of the job queue itself.
Batch now has a jobStateTimeLimitActions parameter for job queues that lets you automatically cancel a stuck job after a defined period of time, maxTimeSeconds. If you opt to define this parameter, Batch is designed to start the timeout clock ticking when it detects that a job is blocking the queue. Batch will also update the statusReason field at this time. Once the stuck job at the head of the queue is cancelled, a "Batch Job State Change" CloudWatch event is emitted with the underlying reason. If Batch detects that another job at the head of the queue is blocking the queue, a "Batch Job Queue Blocked" CloudWatch event is emitted, and Batch starts a new maxTimeSeconds timer to take the action you defined once the limit is reached.
Here's an example showing how to tell the job queue to wait four hours (maxTimeSeconds=14400) for a job blocked by compute environments that have reached maximum capacity, then cancel it.
"jobStateTimeLimitActions": [
{
"reason" : "MISCONFIGURATION:COMPUTE_ENVIRONMENT_MAX_RESOURCE",
"state": "RUNNABLE",
"maxTimeSeconds" : 14400,
"action" : "CANCEL"
}
]
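You can set this parameter when you create a job queue, or attach it to an existing queue with UpdateJobQueue. Here's a sketch using the AWS CLI shorthand syntax (the queue name is a placeholder, and this assumes your CLI version is recent enough to include the new parameter):

# Attach the timeout action to an existing job queue (queue name is a placeholder)
aws batch update-job-queue \
    --job-queue my-job-queue \
    --job-state-time-limit-actions reason=MISCONFIGURATION:COMPUTE_ENVIRONMENT_MAX_RESOURCE,state=RUNNABLE,maxTimeSeconds=14400,action=CANCEL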
Table 1 lists the scenarios we discussed, the output message in CloudWatch Events, and the jobStateTimeLimitActions.reason that you can specify to cancel the stuck job in the job queue. The table also lists the "Batch Job State Change" CloudWatch event message if the job was automatically canceled.
| Scenario | Status Parameter | Status Parameter value |
| --- | --- | --- |
| All your job queue's connected compute environments have received insufficient capacity errors. | CloudWatch Event statusReason | CAPACITY:INSUFFICIENT_INSTANCE_CAPACITY – Service cannot fulfill the capacity requested for instance type [instanceTypeName]. |
| | jobStateTimeLimitActions.reason | CAPACITY:INSUFFICIENT_INSTANCE_CAPACITY |
| | CloudWatch Event statusReason after job cancellation | Canceled by JobStateTimeLimit action due to reason: CAPACITY:INSUFFICIENT_INSTANCE_CAPACITY |
| All compute environments have a maxVcpu that is smaller than the job requirements. | CloudWatch Event statusReason | MISCONFIGURATION:COMPUTE_ENVIRONMENT_MAX_RESOURCE – CE(s) associated with the job queue cannot meet the CPU requirement of the job. |
| | jobStateTimeLimitActions.reason | MISCONFIGURATION:COMPUTE_ENVIRONMENT_MAX_RESOURCE |
| | CloudWatch Event statusReason after job cancellation | Canceled by JobStateTimeLimit action due to reason: MISCONFIGURATION:COMPUTE_ENVIRONMENT_MAX_RESOURCE |
| All compute environments have no connected instances that meet the job requirements. | CloudWatch Event statusReason | MISCONFIGURATION:JOB_RESOURCE_REQUIREMENT – The job resource requirement (vCPU/memory/GPU) is higher than that can be met by the CE(s) attached to the job queue. |
| | jobStateTimeLimitActions.reason | MISCONFIGURATION:JOB_RESOURCE_REQUIREMENT |
| | CloudWatch Event statusReason after job cancellation | Canceled by JobStateTimeLimit action due to reason: MISCONFIGURATION:JOB_RESOURCE_REQUIREMENT |
| All compute environments have service role issues. | CloudWatch Event statusReason | MISCONFIGURATION:SERVICE_ROLE_PERMISSIONS – Batch service role has a permission issue. |
| | jobStateTimeLimitActions.reason | Not applicable |
| | CloudWatch Event statusReason after job cancellation | Not applicable |
| All connected compute environments are invalid. | CloudWatch Event statusReason | ACTION_REQUIRED – CE(s) associated with the job queue are invalid. |
| | jobStateTimeLimitActions.reason | Not applicable |
| | CloudWatch Event statusReason after job cancellation | Not applicable |
| Batch has detected a blocked queue, but is unable to determine the reason. | CloudWatch Event statusReason | UNDETERMINED – Batch job is blocked, root cause is undetermined. |
| | jobStateTimeLimitActions.reason | Not applicable |
| | CloudWatch Event statusReason after job cancellation | Not applicable |
You'll note that some reasons can't be used in the jobStateTimeLimitActions parameter: for example, when all your queue's attached compute environments are INVALID, or when Batch is unable to determine the root cause of the blockage. In both of those cases, we recommend setting up EventBridge rules to notify you when they occur.
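For instance, a single EventBridge rule can catch both of those cases by filtering on the reason carried in the event. This sketch assumes the blocked job queue event carries the reason in a statusReason field of its detail (that field name is our assumption) and uses EventBridge prefix matching; the rule name and SNS topic ARN are placeholders:

# Notify a human for blocked queues Batch can't auto-cancel:
# invalid CEs (ACTION_REQUIRED) or an undetermined root cause (UNDETERMINED)
aws events put-rule \
    --name batch-blocked-queue-needs-attention \
    --event-pattern '{
        "source": ["aws.batch"],
        "detail-type": ["Batch Job Queue Blocked"],
        "detail": {
            "statusReason": [{"prefix": "ACTION_REQUIRED"}, {"prefix": "UNDETERMINED"}]
        }
    }'

# Fan matching events out to an SNS topic (the ARN is a placeholder)
aws events put-targets \
    --rule batch-blocked-queue-needs-attention \
    --targets 'Id=notify,Arn=arn:aws:sns:us-east-1:111122223333:blocked-queue-alerts'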
Conclusion
AWS Batch has introduced new functionality designed to help you detect and act on blocked job queues, where a job that’s at the head of a queue can’t run for one reason or another and prevents all the others behind it from running, too. We shared how to use EventBridge to catch these new CloudWatch Events, and the Batch job queue API to automatically terminate the stuck job and unblock the queue of jobs behind it.
Finally, we covered the most common root causes along with the error messages to look for. In most cases Batch can automatically determine the root cause of the blockage, allowing you to define a specific automated action for each class of error to unblock the queue.
To get started using AWS Batch, log into the AWS Management Console, or read the AWS Batch User Guide.
For specific guidance on using these new events, refer to the AWS Batch Troubleshooting guide for jobs stuck in RUNNABLE, the blocked job queue events documentation, and the Batch EventBridge documentation on how to react to these events.