Why is my Amazon ECS task stuck in the PENDING state?

Last updated: 2020-12-17

My Amazon Elastic Container Service (Amazon ECS) task is stuck in the PENDING state.

Short description

Some common scenarios that can cause your ECS task to be stuck in the PENDING state include:

  • The Docker daemon is unresponsive
  • The Docker image is large
  • The Amazon ECS container agent lost connectivity with the Amazon ECS service in the middle of a task launch
  • The Amazon ECS container agent takes a long time to stop an existing task

To see why your task is stuck in the PENDING state, complete the following troubleshooting steps based on the issue you're having.

Note: If you receive errors when running AWS Command Line Interface (AWS CLI) commands, make sure that you’re using the most recent AWS CLI version.

Resolution

The Docker daemon is unresponsive

For CPU issues, complete the following steps:

1.    Use Amazon CloudWatch metrics to see if your container instance exceeded its maximum CPU utilization (see the example command after these steps).

2.    Increase the size of your container instance as needed.
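For example, you can query the CPUUtilization metric of the underlying Amazon EC2 instance with the AWS CLI. The instance ID and time range in the following command are placeholders for illustration:

$ aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
    --start-time 2020-12-17T00:00:00Z \
    --end-time 2020-12-17T01:00:00Z \
    --period 300 \
    --statistics Maximum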

For memory issues, complete the following steps:

1.    Run the free command to see how much memory is available on your system (see the example after these steps).

2.    Increase the size of your container instance as needed.
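For example, to show the available memory in mebibytes, run the following command:

$ free -m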

For I/O issues, complete the following steps:

1.    Run the iotop command.

2.    Identify which tasks in which services are using the most IOPS. Then, distribute those tasks across distinct container instances by using task placement constraints and strategies.

-or-

Use CloudWatch to create an alarm for your Amazon Elastic Block Store (Amazon EBS) BurstBalance metrics. Then, use an AWS Lambda function or your own custom logic to balance tasks.
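For example, the following AWS CLI command sketches a BurstBalance alarm for a single EBS volume. The volume ID, threshold, and SNS topic in this command are placeholders for illustration:

$ aws cloudwatch put-metric-alarm \
    --alarm-name ebs-burst-balance-low \
    --namespace AWS/EBS \
    --metric-name BurstBalance \
    --dimensions Name=VolumeId,Value=vol-0123456789abcdef0 \
    --statistic Average \
    --period 300 \
    --evaluation-periods 1 \
    --threshold 20 \
    --comparison-operator LessThanOrEqualToThreshold \
    --alarm-actions arn:aws:sns:us-east-1:111122223333:example-topic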

The Docker image is large

Larger images take longer to download and increase the amount of time the task is in the PENDING state.

To speed up the transition time, tune the ECS_IMAGE_PULL_BEHAVIOR parameter to take advantage of image caching.

Note: For example, set the ECS_IMAGE_PULL_BEHAVIOR parameter to prefer-cached in /etc/ecs/ecs.config. If prefer-cached is specified, then the image is pulled remotely if there's no cached image. Otherwise, the cached image on the instance is used.
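For example, on an Amazon Linux 2 container instance, you can append the parameter to the agent configuration file and then restart the agent for the change to take effect:

$ echo "ECS_IMAGE_PULL_BEHAVIOR=prefer-cached" | sudo tee -a /etc/ecs/ecs.config
$ sudo systemctl restart ecs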

The Amazon ECS container agent lost connectivity with the Amazon ECS service in the middle of a task launch

1.    To verify the status and connectivity of the Amazon ECS container agent, run the following commands for your Amazon Linux version on your container instance.

For Amazon Linux 1:

$ sudo status ecs
$ sudo docker ps -f name=ecs-agent

For Amazon Linux 2:

$ sudo systemctl status ecs
$ sudo docker ps -f name=ecs-agent

Note: You should see active/running in the output.
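You can also check agent connectivity from the Amazon ECS API side with the AWS CLI. The cluster name and container instance ID in the following command are placeholders; the command returns true when the agent is connected:

$ aws ecs describe-container-instances \
    --cluster example-cluster \
    --container-instances 0123456789abcdef0123456789abcdef \
    --query 'containerInstances[].agentConnected'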

2.    To view metadata for your Amazon ECS container instance, run the following command on your container instance:

$ curl http://localhost:51678/v1/metadata

You receive output similar to the following:

{
  "Cluster": "CLUSTER_ID",
  "ContainerInstanceArn": "arn:aws:ecs:REGION:ACCOUNT_ID:container-instance/TASK_ID",
  "Version": "Amazon ECS Agent - AGENT "
}

3.    To view information on running tasks, run the following command on your container instance:

$ curl http://localhost:51678/v1/tasks

You receive output similar to the following:

{
  "Tasks": [
    {
      "Arn": "arn:aws:ecs:REGION:ACCOUNT_ID:task/TASK_ID",
      "DesiredStatus": "RUNNING",
      "KnownStatus": "RUNNING",
      ... ...
    }
  ]
}

4.    If the issue is related to a disconnected agent, then restart your container agent by running the following commands for your Amazon Linux version.

For Amazon Linux 1:

$ sudo stop ecs
$ sudo start ecs

For Amazon Linux 2:

$ sudo systemctl stop ecs
$ sudo systemctl start ecs

For Amazon Linux 1, you receive output similar to the following:

ecs start/running, process xxxx

5.    To determine agent connectivity, check the following logs during the relevant time frame for keywords such as "error," "warn," or "agent transition state" (see the example command after this step):

  • The Amazon ECS container agent log at /var/log/ecs/ecs-agent.log.yyyy-mm-dd-hh
  • The Amazon ECS init log at /var/log/ecs/ecs-init.log
  • The Docker logs at /var/log/docker

Note: You can also use the Amazon ECS logs collector to collect general operating system logs, Docker logs, and container agent logs for Amazon ECS.
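For example, to scan the agent logs for errors and warnings (the log file names vary by date and hour), run the following command:

$ sudo grep -iE "error|warn" /var/log/ecs/ecs-agent.log*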

The Amazon ECS container agent takes a long time to stop an existing task

If the Amazon ECS container agent receives new tasks to start from Amazon ECS (moving them from PENDING to RUNNING) while it still has older tasks to stop, then the agent doesn't start the new tasks until the old tasks have stopped.

You can set the following two parameters to control container stop and start timeout at the container instance level:

1.    In /etc/ecs/ecs.config, set the value of the ECS_CONTAINER_STOP_TIMEOUT parameter to the amount of time that you want to pass before your containers are forcibly killed if they don't exit normally on their own (see the example configuration after these steps).

Note: The default value for Linux and Windows is 30s.

2.    In /etc/ecs/ecs.config, set the value of the ECS_CONTAINER_START_TIMEOUT parameter to the amount of time that you want to pass before the Amazon ECS container agent stops trying to start the container.

Note: The default value is 3m for Linux and 8m for Windows.
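For example, the following lines in /etc/ecs/ecs.config set both timeouts at the container instance level. The values shown are illustrative only; restart the container agent for the changes to take effect:

ECS_CONTAINER_STOP_TIMEOUT=10s
ECS_CONTAINER_START_TIMEOUT=5m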

If your agent version is 1.26.0 or newer, then you can define the preceding stop and start timeout parameters per task. This can result in the task transitioning to a STOPPED state. For example, suppose that containerA has a dependency on containerB reaching a COMPLETE, SUCCESS, or HEALTHY status. If you specify a startTimeout value for containerB and containerB doesn't reach the desired status within that time, then containerA doesn't start.
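For example, the following container definitions sketch the containerA and containerB dependency described above, with per-container startTimeout and stopTimeout values in seconds. The container names, images, and timeout values are placeholders for illustration:

"containerDefinitions": [
  {
    "name": "containerB",
    "image": "example/containerB",
    "essential": false,
    "startTimeout": 120,
    "stopTimeout": 60
  },
  {
    "name": "containerA",
    "image": "example/containerA",
    "essential": true,
    "dependsOn": [
      {
        "containerName": "containerB",
        "condition": "COMPLETE"
      }
    ]
  }
]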

For an example of container dependency, see Example: Container dependency on AWS GitHub.