Manuel shows you why your AWS Batch job is stuck in RUNNABLE status

My AWS Batch job has been stuck in RUNNABLE status for a long time. How can I fix this?

AWS Batch moves your job to RUNNABLE status when the job has no outstanding dependencies and is ready to be scheduled to a host. Jobs in RUNNABLE status are started when enough resources are available in one of the compute environments that are mapped to your job’s queue. If enough resources are not available, jobs can remain in RUNNABLE status indefinitely.
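To see which jobs have been waiting the longest, you can read the job summaries that the AWS Batch ListJobs API returns for a queue. The following is a minimal sketch, not an official tool. The function name long_waiting_runnable_jobs and the one-hour threshold are illustrative assumptions. The createdAt field is the job creation time in milliseconds, so the computed age is an upper bound on the time spent in RUNNABLE status.

```python
import time


def long_waiting_runnable_jobs(job_summaries, min_age_seconds, now_ms=None):
    """Return (jobId, age_in_seconds) for RUNNABLE jobs at least min_age_seconds old.

    job_summaries is assumed to have the shape of "jobSummaryList" in a
    ListJobs response. Age is measured from createdAt (epoch milliseconds).
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    waiting = []
    for job in job_summaries:
        if job.get("status") != "RUNNABLE":
            continue
        age_seconds = (now_ms - job["createdAt"]) / 1000
        if age_seconds >= min_age_seconds:
            waiting.append((job["jobId"], age_seconds))
    return waiting


# Hypothetical usage with boto3 (queue name is a placeholder):
#   import boto3
#   batch = boto3.client("batch")
#   resp = batch.list_jobs(jobQueue="my-job-queue", jobStatus="RUNNABLE")
#   print(long_waiting_runnable_jobs(resp["jobSummaryList"], 3600))
```

The function takes plain dictionaries, so you can test it without AWS credentials.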

If you're using a valid managed compute environment with a default Amazon Machine Image (AMI), your job could be stuck in RUNNABLE status for the following reasons:

  • Insufficient resources: Your job specifies more CPU or memory resources than the compute environment can allocate.
  • No assigned container instance: Instances cannot be created, and networking or security issues can prevent the container instance from joining the underlying Amazon Elastic Container Service (Amazon ECS) cluster.
  • Host-level problems: There could be problems inside the container instance at the level of the host or Docker daemon. For example, the volumes of the instance might be full, or the Docker daemon or Amazon ECS container agent can have stop or start issues.

Verify that your compute environment has enough resources to run your job

1.    Open the AWS Batch console.

2.    For Job queues, in the RUNNABLE column, choose the job queue with your job that's stuck in RUNNABLE status.

3.    Choose your job that's stuck in RUNNABLE status.

4.    For Job details, in the Environment section, get the values for vCPUs and Memory.

5.    In the navigation pane, choose Compute environments.

6.    For the compute environment where your job needs to run, verify that Status is set to VALID and State is set to ENABLED.

7.    Verify that Max vCPUs is set to a value high enough to allow AWS Batch to increase the number of Desired vCPUs to run jobs.

8.    Verify that the value of Desired vCPUs is equal to or higher than the number of vCPUs that the job needs to run.

9.    If Desired vCPUs is 0, check the amount of memory and CPU resources available for your Amazon EC2 instance type. If Desired vCPUs is higher than 0 and your job is still in RUNNABLE status, complete the steps in the Verify that your compute environment has instances and the instances are available to run your job section.

Important: At least one of the instance types for your compute environment must have more memory than what your job specifies. Additionally, that instance type must have CPU resources that are equal to or more than what your job specifies. If none of the instance types has enough memory and CPU resources to run your job, cancel the job. Then, run a new job that requires less CPU or memory. Or, you can create a new compute environment with enough resources to run the job, and then assign the job to the appropriate job queue.
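The console checks in this section can also be applied to the output of the AWS Batch DescribeComputeEnvironments API. The following is a minimal sketch under stated assumptions: the function name check_compute_environment is illustrative, and job_vcpus is the vCPU count that you read from your job's details. The function doesn't check memory against instance types.

```python
def check_compute_environment(job_vcpus, compute_env):
    """Return a list of problems found; an empty list means none were found.

    compute_env is assumed to be one entry of "computeEnvironments" from a
    DescribeComputeEnvironments response.
    """
    problems = []
    if compute_env.get("status") != "VALID":
        problems.append(f"status is {compute_env.get('status')}, expected VALID")
    if compute_env.get("state") != "ENABLED":
        problems.append(f"state is {compute_env.get('state')}, expected ENABLED")

    resources = compute_env.get("computeResources", {})
    max_vcpus = resources.get("maxvCpus", 0)
    desired_vcpus = resources.get("desiredvCpus", 0)
    if max_vcpus < job_vcpus:
        problems.append(f"maxvCpus ({max_vcpus}) is lower than the job's {job_vcpus} vCPUs")
    if desired_vcpus < job_vcpus:
        problems.append(f"desiredvCpus ({desired_vcpus}) is lower than the job's {job_vcpus} vCPUs")
    return problems


# Hypothetical usage with boto3 (environment name is a placeholder):
#   import boto3
#   batch = boto3.client("batch")
#   resp = batch.describe_compute_environments(computeEnvironments=["my-env"])
#   print(check_compute_environment(2, resp["computeEnvironments"][0]))
```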

Verify that your compute environment has instances and the instances are available to run your job

1.    Open the Amazon ECS console.

2.    In the navigation pane, choose Clusters, and then choose the cluster that contains your job.

Note: The name of the cluster starts with the name of the compute environment, followed by _Batch_ and a random hash of numbers and letters.

3.    Choose the ECS Instances view, and then confirm that container instances are available to run your job.

4.    If the cluster has a container instance available to run your job, check the status of the Docker daemon and the Amazon ECS container agent. For more information, see Why is my Amazon ECS agent listed as disconnected?.

If no instances are in the Amazon ECS cluster, complete the steps in either of the following sections depending on your compute environment: Verify that your instances can be created in an On-Demand compute environment or Verify that your instances can be created in a Spot compute environment.
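The container instance check above can be approximated from the Amazon ECS DescribeContainerInstances API. The following is a minimal sketch. The function name usable_container_instances is an illustrative assumption, and treating an instance as usable only when its status is ACTIVE and its agent is connected is a simplification.

```python
def usable_container_instances(container_instances):
    """Return EC2 instance IDs of container instances that look able to run tasks.

    container_instances is assumed to have the shape of "containerInstances"
    in a DescribeContainerInstances response.
    """
    return [
        instance.get("ec2InstanceId")
        for instance in container_instances
        if instance.get("status") == "ACTIVE" and instance.get("agentConnected")
    ]


# Hypothetical usage with boto3 (cluster_arn is a placeholder):
#   import boto3
#   ecs = boto3.client("ecs")
#   arns = ecs.list_container_instances(cluster=cluster_arn)["containerInstanceArns"]
#   if arns:  # DescribeContainerInstances rejects an empty list
#       detail = ecs.describe_container_instances(cluster=cluster_arn,
#                                                 containerInstances=arns)
#       print(usable_container_instances(detail["containerInstances"]))
#   else:
#       print("No container instances are registered in the cluster")
```

An empty result with registered instances points to an agent or Docker problem. An empty cluster points to an instance launch or networking problem.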

Verify that your instances can be created in an On-Demand compute environment

1.    Open the Amazon EC2 console.

2.    In the navigation pane, choose Auto Scaling Groups.

3.    For Filter, enter the name of your compute environment.

Note: More than one Auto Scaling group might be created for the same compute environment.

4.    For each Auto Scaling group, choose the Activity History view, and then look for any blocking issues.

The Status column shows Unsuccessful if there are any issues blocking the instances from launching. For example, if your account reached the maximum number of instances, then Amazon EC2 might return a message similar to the following:

Launching a new EC2 instance. Status Reason: Your quota allows for 0 more running instance(s). You requested at least 1. Launching EC2 instance failed.

The event should include a timestamp in UTC from when you submitted the job, as in the following example:

At 2018-09-03T05:54:30Z a user request update of AutoScalingGroup constraints to min: 0, max: 1, desired: 1 changing the desired capacity from 0 to 1.  
At 2018-09-03T05:54:52Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 0 to 1.

Note: Instances are requested on your behalf by AWS Batch. Avoid modifying the Auto Scaling groups manually or your compute environment might become INVALID. For more information on instance limits and how to request a limit increase, see Amazon EC2 Service Limits.

If the most recent events of the Auto Scaling group show only successful events, complete the steps in the Verify the container instance IAM role section. Then, you can find out why your instance did not join the Amazon ECS cluster.
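The Activity History review can also be done against the Amazon EC2 Auto Scaling DescribeScalingActivities API. The following is a minimal sketch. The function name unsuccessful_activities is an illustrative assumption, as is the choice to treat the Failed and Cancelled status codes as the equivalent of the Unsuccessful status shown in the console.

```python
def unsuccessful_activities(activities):
    """Return (start_time, message) for scaling activities that did not succeed.

    activities is assumed to have the shape of "Activities" in a
    DescribeScalingActivities response.
    """
    blocked = []
    for activity in activities:
        if activity.get("StatusCode") in {"Failed", "Cancelled"}:
            # StatusMessage usually carries the reason; fall back to Cause.
            message = activity.get("StatusMessage") or activity.get("Cause", "")
            blocked.append((activity.get("StartTime"), message))
    return blocked


# Hypothetical usage with boto3 (group name is a placeholder):
#   import boto3
#   autoscaling = boto3.client("autoscaling")
#   resp = autoscaling.describe_scaling_activities(AutoScalingGroupName="my-asg")
#   for start_time, message in unsuccessful_activities(resp["Activities"]):
#       print(start_time, message)
```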

Verify that your instances can be created in a Spot compute environment

1.    Open the Amazon EC2 console.

2.    In the navigation pane, for Instances, choose Spot Requests.

3.    In the filter, for Request type, choose fleet.

4.    For Status, choose active.

5.    Choose Description, and then look for the value of Total target capacity to see if the request of the Spot instance was fulfilled. If no instance has been created, check the History view for a message that might explain why. For example, requests that did not reach the bid price return a message similar to the following:

m4.large, ami-aff65ad2, Linux/UNIX (Amazon VPC), us-east-1a, Spot bid price is less than Spot market price $0.0324

6.    Choose the right bid percent for your compute environment, and be sure to create a new compute environment if you change the bid price. For more information, see Spot Instance Pricing History.

Note: AWS Batch creates Spot Fleet requests on your behalf. Avoid modifying Spot Fleet requests manually, or your compute environment might become INVALID.

If the History view of the Spot Fleet request shows only successful events, complete the steps in the Verify the container instance IAM role section.
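The Spot request history can also be read from the Amazon EC2 DescribeSpotFleetRequestHistory API. The following is a minimal sketch. The function name spot_fleet_messages is an illustrative assumption, and filtering on the error and information event types is a guess at which records explain an unfulfilled request.

```python
def spot_fleet_messages(history_records):
    """Return (sub_type, description) for records that might explain a problem.

    history_records is assumed to have the shape of "HistoryRecords" in a
    DescribeSpotFleetRequestHistory response.
    """
    messages = []
    for record in history_records:
        if record.get("EventType") in {"error", "information"}:
            info = record.get("EventInformation", {})
            messages.append((info.get("EventSubType"), info.get("EventDescription")))
    return messages


# Hypothetical usage with boto3 (request ID is a placeholder):
#   import boto3
#   from datetime import datetime, timedelta, timezone
#   ec2 = boto3.client("ec2")
#   resp = ec2.describe_spot_fleet_request_history(
#       SpotFleetRequestId="sfr-EXAMPLE",
#       StartTime=datetime.now(timezone.utc) - timedelta(days=1),
#   )
#   print(spot_fleet_messages(resp["HistoryRecords"]))
```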

Verify the container instance IAM role

1.    Open the AWS Batch console.

2.    In the navigation pane, choose Compute environments, and then choose your compute environment.

3.    In the Compute environment details section, get the name of the Instance role.

4.    Open the IAM console.

5.    In the search box, search for the name of your instance role, and then choose your instance role from the results.

6.    Choose the Permissions view, and then confirm that the AmazonEC2ContainerServiceforEC2Role managed policy is attached to the role. If the policy is attached, skip to step 10.

7.    Choose Attach Policies.

8.    In the search box, type AmazonEC2ContainerServiceforEC2Role.

9.    For the AmazonEC2ContainerServiceforEC2Role policy, select the check box, and then choose Attach Policy.

10.    Choose the Trust Relationships view, and then choose Edit trust relationship.

11.    Confirm that the trust relationship contains the following policy:

{
  "Version": "2008-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

12.    If the trust relationship matches the policy in the above example, choose Cancel. If the trust relationship doesn't match the policy in the above example, copy the policy into the Policy Document console. Then, choose Update Trust Policy.

If your instance still isn't joining the Amazon ECS cluster, complete the steps in the Verify the network and security settings of the compute environment section.
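Both role checks in this section can be expressed over the output of the IAM ListAttachedRolePolicies and GetRole APIs. The following is a minimal sketch. The function names are illustrative assumptions, and the code assumes that the trust policy is already parsed into a dictionary.

```python
ECS_INSTANCE_POLICY_ARN = (
    "arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role"
)


def _as_list(value):
    """IAM policy fields can be a single string or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def has_ecs_instance_policy(attached_policies):
    """attached_policies: "AttachedPolicies" from ListAttachedRolePolicies."""
    return any(p.get("PolicyArn") == ECS_INSTANCE_POLICY_ARN for p in attached_policies)


def trusts_ec2(trust_policy):
    """Return True if a statement allows ec2.amazonaws.com to call sts:AssumeRole."""
    for statement in _as_list(trust_policy.get("Statement", [])):
        principal = statement.get("Principal", {})
        if statement.get("Effect") != "Allow" or not isinstance(principal, dict):
            continue
        services = _as_list(principal.get("Service", []))
        actions = _as_list(statement.get("Action", []))
        if "ec2.amazonaws.com" in services and "sts:AssumeRole" in actions:
            return True
    return False


# Hypothetical usage with boto3 (role name is a placeholder):
#   import boto3
#   iam = boto3.client("iam")
#   attached = iam.list_attached_role_policies(RoleName="my-instance-role")
#   role = iam.get_role(RoleName="my-instance-role")["Role"]
#   print(has_ecs_instance_policy(attached["AttachedPolicies"]))
#   print(trusts_ec2(role["AssumeRolePolicyDocument"]))
```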

Verify the network and security settings of the compute environment

1.    Open the AWS Batch console.

2.    In the navigation pane, choose Compute environments, and then choose your compute environment.

3.    In the Compute resources section, get the value of Subnets and Security groups.

4.    Open the Amazon VPC console.

5.    In the navigation pane, choose Subnets.

6.    For each subnet in the compute environment, choose Description, and then check the value for the Auto-assign public IPv4 address property.

If the value is Yes, instances launched in the subnet receive a public IPv4 address. Confirm that the subnet's route table has a route with a destination of 0.0.0.0/0 and an internet gateway set to Target (for example, igw-1a2b3c4d).

If the value is No, instances launched in the subnet receive only a private IPv4 address. Confirm that the subnet's route table has a route with a destination of 0.0.0.0/0 and a NAT gateway set to Target (for example, nat-12345678901234567). For more details, see Routing.

7.    In the navigation pane, choose Security Groups.

8.    For each security group specified in the compute environment, choose the Outbound Rules view, and then confirm that a rule exists with the following settings:

  • Type: ALL Traffic
  • Protocol: ALL
  • Port Range: ALL
  • Destination: 0.0.0.0/0

Important: If the rule doesn't exist, choose Edit, and then create the rule. If you want a more restrictive rule for outbound traffic, choose HTTPS (443) for Type and 0.0.0.0/0 for Destination.

9.    In the navigation pane, choose Network ACLs.

10.    Choose the access control list (ACL) of the VPC specified in the compute environment.

11.    Confirm that the default network ACL is configured to allow all traffic to flow in and out of associated subnets.

Important: If you modified the ACL, add a rule that allows outbound IPv4 HTTPS traffic from the subnet to the internet. For more details, see Security Groups for Your VPC and Network ACLs. To change the VPC, subnets, or security groups, create a new compute environment.
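The route and outbound rule checks in this section can be approximated from the Amazon EC2 DescribeRouteTables and DescribeSecurityGroups APIs. The following is a minimal sketch. The function names are illustrative assumptions, and the code checks only IPv4 rules and doesn't evaluate network ACLs.

```python
def default_route_target(routes):
    """Return the gateway ID behind an active 0.0.0.0/0 route, or None.

    routes is assumed to have the shape of "Routes" in one route table from a
    DescribeRouteTables response.
    """
    for route in routes:
        if route.get("DestinationCidrBlock") != "0.0.0.0/0":
            continue
        if route.get("State", "active") != "active":
            continue
        target = route.get("GatewayId") or route.get("NatGatewayId")
        if target:
            return target
    return None


def allows_https_egress(egress_permissions):
    """Return True if an outbound rule permits TCP 443 to 0.0.0.0/0.

    egress_permissions is assumed to have the shape of "IpPermissionsEgress"
    in one security group from a DescribeSecurityGroups response.
    """
    for permission in egress_permissions:
        to_anywhere = any(
            ip_range.get("CidrIp") == "0.0.0.0/0"
            for ip_range in permission.get("IpRanges", [])
        )
        if not to_anywhere:
            continue
        protocol = permission.get("IpProtocol")
        if protocol == "-1":  # "-1" means all protocols and all ports
            return True
        if protocol == "tcp" and permission.get("FromPort", 0) <= 443 <= permission.get("ToPort", 65535):
            return True
    return False


# Hypothetical usage with boto3 (IDs are placeholders):
#   import boto3
#   ec2 = boto3.client("ec2")
#   tables = ec2.describe_route_tables(
#       Filters=[{"Name": "association.subnet-id", "Values": ["subnet-EXAMPLE"]}])
#   groups = ec2.describe_security_groups(GroupIds=["sg-EXAMPLE"])
```

A target that starts with igw- indicates an internet gateway, and a target that starts with nat- indicates a NAT gateway. A subnet without an explicit association uses the main route table of the VPC, so an empty result from the filter above doesn't by itself mean that the route is missing.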

If your instance still isn't joining the Amazon ECS cluster, connect to your instance. Then, check the status of the Docker daemon and the Amazon ECS container agent.


Published: 2019-02-12