Why is my AWS Batch job stuck in RUNNABLE status?

Last updated: 2021-10-06

My AWS Batch job is stuck in RUNNABLE status. Why is this happening, and how do I get my AWS Batch job unstuck?

Short description

AWS Batch moves a job to RUNNABLE status when the job has no outstanding dependencies and is ready to be scheduled to a host. RUNNABLE jobs are started as soon as sufficient resources are available in one of the compute environments that are mapped to the job's queue.

If enough resources to run a job aren't available, then the job can remain in RUNNABLE status indefinitely. For more information, see Jobs stuck in RUNNABLE status in the AWS Batch User Guide.
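To see which jobs have been waiting in RUNNABLE the longest, a short script can help. The following is a minimal sketch, not an official tool. It assumes a boto3-style AWS Batch client is passed in, and the function name and the 30-minute threshold are hypothetical choices made for illustration.

```python
import time


def find_long_runnable_jobs(batch_client, job_queue, max_age_seconds=1800, now=None):
    """Return (jobId, jobName, age_in_seconds) for RUNNABLE jobs older than the threshold.

    batch_client is assumed to expose list_jobs() like the boto3 Batch client.
    """
    now = time.time() if now is None else now
    stuck = []
    kwargs = {"jobQueue": job_queue, "jobStatus": "RUNNABLE"}
    while True:
        page = batch_client.list_jobs(**kwargs)
        for job in page.get("jobSummaryList", []):
            # createdAt is a Unix timestamp in milliseconds.
            age = now - job["createdAt"] / 1000.0
            if age > max_age_seconds:
                stuck.append((job["jobId"], job["jobName"], int(age)))
        token = page.get("nextToken")
        if not token:
            return stuck
        # Follow pagination until every RUNNABLE job has been inspected.
        kwargs["nextToken"] = token
```

With boto3 installed and credentials configured, this could be called as find_long_runnable_jobs(boto3.client("batch"), "my-queue"), where my-queue is a placeholder for your job queue name.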

To troubleshoot AWS Batch jobs stuck in RUNNABLE status, do the following.

Resolution

Verify that your compute environment has enough resources to run your job

1.    Open the AWS Batch console.

2.    Choose Dashboard.

3.    In the Job queue overview pane, in the RUNNABLE column, choose the job that's stuck in RUNNABLE status. The Job details page appears.

4.    On the Job details page, in the Environment section, review the values for vCPUs and Memory. You need these values to complete steps 8 through 10.

5.    In the left navigation pane, choose Compute environments. Then, in the Name column, find the name of the compute environment where your job needs to run.

6.    Review the Status column for the compute environment. Make sure that it's set to VALID.

7.    Review the State column for the compute environment. Make sure that it's set to ENABLED.

8.    Review the Max vCPUs column and the Desired vCPUs column for the compute environment. Make sure that the Max vCPUs value is set high enough to allow AWS Batch to increase the number of Desired vCPUs to run jobs.

9.    Verify that the Desired vCPUs value is equal to or higher than the number of vCPUs that the job needs to run.

10.    If the Desired vCPUs value is 0, check the amount of memory and CPU resources available for your Amazon Elastic Compute Cloud (Amazon EC2) instance type.

-or-

If the Desired vCPUs value is higher than 0 or your job is still in RUNNABLE status, complete the steps in the following section of this article.

Important: At least one of the instance types for your compute environment must have more memory than what your job specifies. Also, the instance type must have CPU resources that are equal to or more than what your job specifies. If none of the instance types has enough memory or CPU resources to run your job, cancel the job. Then, run a new job that requires less CPU or memory. Or, create a new compute environment with enough resources to run the job, and then assign the job to the appropriate job queue.
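The checks in steps 6 through 9 can also be run against the output of the DescribeComputeEnvironments API. The following is a minimal sketch that assumes one entry from the computeEnvironments list in that response is passed in as a dictionary. The function name is hypothetical, and the sketch doesn't compare memory or per-instance-type capacity.

```python
def diagnose_compute_environment(env, job_vcpus):
    """Return a list of problems found in one compute environment description."""
    problems = []
    if env.get("status") != "VALID":
        problems.append(f"status is {env.get('status')}, expected VALID")
    if env.get("state") != "ENABLED":
        problems.append(f"state is {env.get('state')}, expected ENABLED")

    resources = env.get("computeResources", {})
    max_vcpus = resources.get("maxvCpus", 0)
    desired_vcpus = resources.get("desiredvCpus", 0)

    # Max vCPUs caps how far AWS Batch can scale the environment out.
    if max_vcpus < job_vcpus:
        problems.append(f"maxvCpus {max_vcpus} is lower than the {job_vcpus} vCPUs the job needs")
    # Desired vCPUs below the job's requirement means capacity has not been scaled up yet.
    if desired_vcpus < job_vcpus:
        problems.append(f"desiredvCpus {desired_vcpus} is lower than the {job_vcpus} vCPUs the job needs")
    return problems
```

With boto3, the input could come from boto3.client("batch").describe_compute_environments(computeEnvironments=["my-env"])["computeEnvironments"][0], where my-env is a placeholder name. An empty result means only that these four checks passed.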

Verify that your compute environment has instances and the instances are available to run your job

1.    Open the Amazon Elastic Container Service (Amazon ECS) console.

2.    In the left navigation pane, choose Clusters. Then, choose the cluster that contains your job.

Note: The name of the cluster starts with the name of the compute environment, followed by _Batch_ and a random hash of numbers and letters.

3.    Choose the ECS Instances view. Then, verify that container instances are available to run your job.

4.    If the cluster has a container instance available to run your job, check the status of the Docker daemon and the Amazon ECS container agent. For more information, see Why are my Amazon ECS container instances with Amazon Linux 1 AMIs disconnected?
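The container instance check in steps 3 and 4 can be approximated with the Amazon ECS API. The following is a minimal sketch that assumes a boto3-style ECS client is passed in. The function name is hypothetical, and the sketch reads only the first page of container instances.

```python
def summarize_container_instances(ecs_client, cluster):
    """Return one summary dict per container instance in the cluster (first page only)."""
    listing = ecs_client.list_container_instances(cluster=cluster)
    arns = listing.get("containerInstanceArns", [])
    if not arns:
        # No instances registered: continue with the instance creation checks.
        return []

    detail = ecs_client.describe_container_instances(cluster=cluster, containerInstances=arns)
    summary = []
    for instance in detail.get("containerInstances", []):
        # remainingResources is a list of named resources; keep only CPU and MEMORY.
        remaining = {
            res["name"]: res.get("integerValue")
            for res in instance.get("remainingResources", [])
            if res.get("name") in ("CPU", "MEMORY")
        }
        summary.append({
            "ec2InstanceId": instance.get("ec2InstanceId"),
            "status": instance.get("status"),
            # agentConnected is False when the ECS agent has lost contact with the service.
            "agentConnected": instance.get("agentConnected"),
            "remainingCpu": remaining.get("CPU"),
            "remainingMemory": remaining.get("MEMORY"),
        })
    return summary
```

An entry with agentConnected set to False suggests checking the Docker daemon and the Amazon ECS container agent on that instance.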

If no instances are in the Amazon ECS cluster, verify that your instances can be created in your compute environment. To verify that your instances can be created, do one of the following based on your compute environment:

To verify that your instances can be created in an On-Demand compute environment

1.    Open the Amazon EC2 console.

2.    In the left navigation pane, choose Auto Scaling Groups.

3.    For Filter, enter the name of your compute environment.

Note: Amazon EC2 could create more than one Auto Scaling group for the same compute environment.

4.    For each Auto Scaling group, choose the Activity History view. Then, look for any blocking issues. The Status column shows Unsuccessful if there are any issues blocking the instances from launching. For example, if your account reaches the maximum number of instances, then Amazon EC2 could return a message similar to the following:

Launching a new EC2 instance. Status Reason: Your quota allows for 0 more running instance(s). You requested at least 1. Launching EC2 instance failed.

The event includes a timestamp in UTC from when you submitted the job. For example:

At 2018-09-03T05:54:30Z a user request update of AutoScalingGroup constraints to min: 0, max: 1, desired: 1 changing the desired capacity from 0 to 1.
At 2018-09-03T05:54:52Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 0 to 1.

Note: AWS Batch requests instances on your behalf. If you modify the Auto Scaling groups manually, your compute environment could become INVALID. For more information on instance limits and how to request a limit increase, see Amazon EC2 service quotas.

5.    If the most recent events of the Auto Scaling group show only successful events, complete the steps in the following section.
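The Activity History review in step 4 can also be done on the output of the Auto Scaling DescribeScalingActivities API. The following is a minimal sketch that assumes the Activities list from that response is passed in. The function name is hypothetical, and treating only the Failed and Cancelled status codes as blocking is an assumption of this sketch.

```python
def unsuccessful_scaling_activities(activities):
    """Return (StartTime, StatusCode, StatusMessage) for activities that did not succeed."""
    blocking_codes = {"Failed", "Cancelled"}
    return [
        (
            activity.get("StartTime"),
            activity.get("StatusCode"),
            # StatusMessage carries the reason, such as an instance quota error.
            activity.get("StatusMessage", ""),
        )
        for activity in activities
        if activity.get("StatusCode") in blocking_codes
    ]
```

With boto3, the input could come from boto3.client("autoscaling").describe_scaling_activities(AutoScalingGroupName="my-asg")["Activities"], where my-asg is a placeholder for one of the Auto Scaling groups found in step 3.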

To verify that your instances can be created in a Spot compute environment

1.    Open the Amazon EC2 console.

2.    In the left navigation pane, choose Instances. Then, choose Spot Requests.

3.    In the filter, for Request type, choose fleet.

4.    For Status, choose active.

5.    Choose Description. Then, review the Total target capacity value to see if the Spot Instance request was fulfilled. If no instance was created, check the History view to see a message that explains why. For example, requests that didn't reach the bid price return a message similar to the following:

m4.large, ami-aff65ad2, Linux/UNIX (Amazon VPC), us-east-1a, Spot bid price is less than Spot market price $0.0324

6.    Choose a bid percentage for your compute environment that's high enough for your Spot Instance requests to be fulfilled. If you change the bid percentage, make sure that you create a new compute environment. For more information, see Spot Instance pricing history.

Note: AWS Batch creates Spot Fleet requests on your behalf. Avoid modifying Spot Fleet requests manually, or your compute environment could become INVALID.

7.    If the history of the Spot Fleet request shows only successful events, complete the steps in the next section.
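The History review in step 5 can also be done on the output of the Amazon EC2 DescribeSpotFleetRequestHistory API. The following is a minimal sketch that assumes the HistoryRecords list from that response is passed in. The function name is hypothetical.

```python
def spot_fleet_error_events(history_records):
    """Return (Timestamp, EventSubType, EventDescription) for error events in the history."""
    errors = []
    for record in history_records:
        # EventType is one of instanceChange, fleetRequestChange, error, or information.
        if record.get("EventType") != "error":
            continue
        info = record.get("EventInformation", {})
        errors.append((
            record.get("Timestamp"),
            info.get("EventSubType"),
            info.get("EventDescription"),
        ))
    return errors
```

With boto3, the input could come from boto3.client("ec2").describe_spot_fleet_request_history(SpotFleetRequestId=..., StartTime=...)["HistoryRecords"], with the request ID and start time supplied for your own Spot Fleet request.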

Verify the container instance IAM role

1.    Open the AWS Batch console.

2.    In the navigation pane, choose Compute environments. Then, choose your compute environment.

3.    In the Compute environment details section, copy the Instance role name.

4.    Open the AWS Identity and Access Management (IAM) console.

5.    In the search box, enter the Instance role name. Then, choose your instance role from the results.

6.    Choose the Permissions view. Then, confirm that the AmazonEC2ContainerServiceforEC2Role managed policy is attached to the role. If the policy is attached, your instance role has the required permissions and you can skip to step 10.

7.    Choose Attach Policies.

8.    In the search box, enter AmazonEC2ContainerServiceforEC2Role.

9.    For the AmazonEC2ContainerServiceforEC2Role policy, select the check box. Then, choose Attach Policy.

10.    Choose the Trust Relationships view. Then, choose Edit trust relationship.

11.    Confirm that the trust relationship contains the following policy:

{
  "Version": "2008-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

12.    If the trust relationship matches the policy in the preceding example, then choose Cancel.

-or-

If the trust relationship doesn't match the policy in the preceding example, copy the policy into the Policy Document box. Then, choose Update Trust Policy.
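The comparison in steps 11 and 12 can also be scripted. The following is a minimal sketch that checks whether a trust policy, passed in as a dictionary, lets ec2.amazonaws.com call sts:AssumeRole. The function names are hypothetical, and the sketch ignores Condition blocks and Deny statements.

```python
def _as_list(value):
    """IAM policy fields can hold a single value or a list; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def trusts_ec2(trust_policy):
    """Return True if any Allow statement lets ec2.amazonaws.com assume the role."""
    for statement in _as_list(trust_policy.get("Statement")):
        if statement.get("Effect") != "Allow":
            continue
        principal = statement.get("Principal", {})
        if not isinstance(principal, dict):
            # A bare string principal such as "*" names no service explicitly.
            continue
        services = _as_list(principal.get("Service"))
        actions = _as_list(statement.get("Action"))
        if "ec2.amazonaws.com" in services and "sts:AssumeRole" in actions:
            return True
    return False
```

With boto3, the policy could be read from the AssumeRolePolicyDocument field of boto3.client("iam").get_role(RoleName=...)["Role"], using the instance role name copied in step 3.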

If your instance still isn't joining the ECS cluster, complete the steps in the next section.

Verify the network and security settings of the compute environment

1.    Open the AWS Batch console.

2.    In the left navigation pane, choose Compute environments. Then, choose your compute environment.

3.    In the Compute resources section, copy the Subnets and Security groups values.

4.    Open the Amazon Virtual Private Cloud (Amazon VPC) console.

5.    In the left navigation pane, choose Subnets.

6.    For each subnet in the compute environment, choose Description. Then, review the Auto-assign public IPv4 address value.

If the Auto-assign public IPv4 address value is "Yes"

The instances launched in the subnet need the following to reach the internet:

  • A public IPv4 address
  • A route table with a route that has a Destination of 0.0.0.0/0
  • An internet gateway as the Target of that route (for example: igw-1a2b3c4d)

If the Auto-assign public IPv4 address value is "No"

The instances launched in the subnet need the following to reach the internet:

  • A private IPv4 address
  • A route table with a route that has a Destination of 0.0.0.0/0
  • A NAT gateway as the Target of that route (for example: nat-12345678901234567)

Note: For more information, see Routing.

7.    In the left navigation pane, choose Security Groups.

8.    For each security group specified in the compute environment, choose the Outbound Rules view. Then, verify that a rule with the following settings exists:
  • Type: ALL Traffic
  • Protocol: ALL
  • Port Range: ALL
  • Destination: 0.0.0.0/0

Important: If the rule doesn't exist, choose Edit. Then, create the rule. For a more restrictive rule for outbound traffic, choose HTTPS (443) for Type and 0.0.0.0/0 for Destination.

9.    In the left navigation pane, choose Network ACLs.

10.    Choose the VPC's access control list (ACL).

11.    Confirm that the default network ACL is configured to allow all traffic to flow in and out of associated subnets.

Important: If you modified the ACL, add a rule that allows outbound IPv4 HTTPS traffic from the subnet to the internet. For more information, see Security groups for your VPC and Network ACLs. To change the VPC, subnets, or security groups, create a new compute environment.
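The outbound rule check in step 8 can also be run on the output of the Amazon EC2 DescribeSecurityGroups API. The following is a minimal sketch that assumes one entry from the SecurityGroups list in that response is passed in. The function name is hypothetical, and the sketch looks only at IPv4 CIDR ranges, not at prefix lists, IPv6 ranges, or network ACLs.

```python
def allows_outbound_https(security_group):
    """Return True if an egress rule allows HTTPS (TCP 443) to 0.0.0.0/0."""
    for permission in security_group.get("IpPermissionsEgress", []):
        open_to_internet = any(
            ip_range.get("CidrIp") == "0.0.0.0/0"
            for ip_range in permission.get("IpRanges", [])
        )
        if not open_to_internet:
            continue
        protocol = permission.get("IpProtocol")
        if protocol == "-1":
            # "-1" means all protocols and all ports.
            return True
        # Missing port fields default to an empty range so that they never match.
        from_port = permission.get("FromPort", 65536)
        to_port = permission.get("ToPort", -1)
        if protocol == "tcp" and from_port <= 443 <= to_port:
            return True
    return False
```

With boto3, the input could come from boto3.client("ec2").describe_security_groups(GroupIds=[...])["SecurityGroups"], using the security group IDs copied in step 3.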

If your instance still isn't joining the ECS cluster, connect to your instance. Then, check the status of the Docker daemon and the Amazon ECS container agent.

