Why is my Amazon ECS task stuck in the PENDING state?

Last updated: 2019-08-20

Why is my Amazon Elastic Container Service (Amazon ECS) task stuck in the PENDING state?

Short Description

Some common scenarios that can cause your ECS task to be stuck in the PENDING state include:

  • The Docker daemon is unresponsive
  • The Docker image is large
  • The ECS container agent lost connectivity with the Amazon ECS service in the middle of a task launch
  • The ECS container agent takes a long time to stop an existing task

To see why your task is stuck in the PENDING state, complete the following troubleshooting steps based on the issue you're having.

Resolution

The Docker daemon is unresponsive

For CPU issues, complete the following steps:

1.    Use Amazon CloudWatch metrics to check whether your container instance has reached its maximum CPU utilization.

2.    Increase the size of your container instance as needed.

Note: To change the instance type, you might need to remove the agent state file (/var/lib/ecs/data/ecs_agent_data.json) so that the agent can re-register with the ECS cluster.
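
For step 1, you can check the instance's maximum CPU utilization from the command line with a CloudWatch query similar to the following. The instance ID and time window are placeholders; adjust them for your environment.

$ aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value=YOUR_INSTANCE_ID \
    --statistics Maximum \
    --period 300 \
    --start-time 2019-08-20T00:00:00Z \
    --end-time 2019-08-20T06:00:00Z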

For memory issues, complete the following steps:

1.    Run the free command to see how much memory is available on your system.

2.    Increase the size of your container instance as needed.
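
For example, the following command prints memory usage in megabytes. Low values in the free and available columns suggest that the instance is under memory pressure.

$ free -m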

For I/O issues, complete the following steps:

1.    Run the iotop command.

2.    Identify which tasks in which services are using the most IOPS. Then, distribute these tasks across separate container instances by using task placement constraints and strategies.

OR

Use CloudWatch to create an alarm for your Amazon Elastic Block Store (Amazon EBS) BurstBalance metrics. Then, use an AWS Lambda function or your own custom logic to balance tasks.
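
For example, an alarm on the BurstBalance metric of an instance's EBS volume might look similar to the following. The volume ID, threshold, and SNS topic are placeholders.

$ aws cloudwatch put-metric-alarm \
    --alarm-name ecs-ebs-burst-balance-low \
    --namespace AWS/EBS \
    --metric-name BurstBalance \
    --dimensions Name=VolumeId,Value=YOUR_VOLUME_ID \
    --statistic Average \
    --period 300 \
    --evaluation-periods 1 \
    --threshold 20 \
    --comparison-operator LessThanThreshold \
    --alarm-actions arn:aws:sns:REGION:ACCOUNT_ID:YOUR_TOPIC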

The Docker image is large

Larger images take longer to download and increase the amount of time that the task stays in the PENDING state.

To speed up the transition time, tune the ECS_IMAGE_PULL_BEHAVIOR parameter to take advantage of the image cache.

Note: For example, set the ECS_IMAGE_PULL_BEHAVIOR parameter to prefer-cached in /etc/ecs/ecs.config. If prefer-cached is specified, then the image is pulled remotely if there's no cached image. Otherwise, the cached image on the instance is used.
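
A minimal example, assuming an Amazon Linux 2 instance, is to append the parameter to the agent configuration file and then restart the agent so that the change takes effect:

$ echo "ECS_IMAGE_PULL_BEHAVIOR=prefer-cached" | sudo tee -a /etc/ecs/ecs.config
$ sudo systemctl restart ecs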

The ECS container agent lost connectivity with the Amazon ECS service in the middle of a task launch

1.    To verify the status and connectivity of the container agent, run the following commands on your container instance for your version of Amazon Linux.

For Amazon Linux 1:

$ sudo status ecs
$ sudo docker ps -f name=ecs-agent

For Amazon Linux 2:

$ sudo systemctl status ecs
$ sudo docker ps -f name=ecs-agent

Note: The expected output shows the agent with a status of active or running.

2.    To view metadata for your ECS container instance, run the following command on your container instance:

$ curl http://localhost:51678/v1/metadata

You should receive output similar to the following:

{
  "Cluster": "CLUSTER_ID",
  "ContainerInstanceArn": "arn:aws:ecs:REGION:ACCOUNT_ID:container-instance/TASK_ID",
  "Version": "Amazon ECS Agent - AGENT "
}

3.    To view information on running tasks, run the following command on your container instance:

$ curl http://localhost:51678/v1/tasks

You should receive output similar to the following:

{
  "Tasks": [
    {
      "Arn": "arn:aws:ecs:REGION:ACCOUNT_ID:task/TASK_ID",
      "DesiredStatus": "RUNNING",
      "KnownStatus": "RUNNING",
      ... ...
    }
  ]
}

4.    If the issue is related to a disconnected agent, restart your container agent with the following commands for your version of Amazon Linux.

For Amazon Linux 1:

$ sudo stop ecs
$ sudo start ecs

For Amazon Linux 2:

$ sudo systemctl stop ecs
$ sudo systemctl start ecs

On Amazon Linux 1, you should receive output similar to the following:

ecs start/running, process xxxx

5.    To determine agent connectivity, check the following logs during the relevant time frame for keywords such as "error", "warn", or "agent transition state":

View the ECS container agent log at /var/log/ecs/ecs-agent.log.yyyy-mm-dd-hh.
View the ECS init log at /var/log/ecs/ecs-init.log.
View the Docker logs at /var/log/docker.

Note: You can also use the Amazon ECS logs collector to collect general operating system logs, Docker logs, and Amazon ECS container agent logs.
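
For example, you can search the agent logs for these keywords with a command similar to the following:

$ grep -iE "error|warn|agent transition state" /var/log/ecs/ecs-agent.log.*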

The ECS container agent takes a long time to stop an existing task

If the ECS container agent receives new tasks to start from the ECS backend while it still has older tasks to stop, then the new tasks remain in the PENDING state (rather than moving to RUNNING) until the older tasks are stopped.

You can set the following two parameters to control container stop and start timeout at the container instance level:

1.    In /etc/ecs/ecs.config, set the value of the ECS_CONTAINER_STOP_TIMEOUT parameter to the amount of time that you want to pass before your containers are forcibly killed if they don't exit normally on their own.

Note: The default value for Linux and Windows is 30s.

2.    In /etc/ecs/ecs.config, set the value of the ECS_CONTAINER_START_TIMEOUT parameter to the amount of time that you want to pass before the ECS container agent stops trying to start the container.

Note: The default value is 3m for Linux and 8m for Windows.
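
For example, the following entries in /etc/ecs/ecs.config give containers up to 2 minutes to stop and up to 10 minutes to start. The values are only illustrative; restart the ECS container agent after changing them so that they take effect.

ECS_CONTAINER_STOP_TIMEOUT=2m
ECS_CONTAINER_START_TIMEOUT=10m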

If your agent version is 1.26.0 or newer, you can also define the preceding stop and start timeouts for each container in your task definition. This can result in the task transitioning to a STOPPED state. For example, suppose containerA has a dependency on containerB reaching a COMPLETE, SUCCESS, or HEALTHY status. If a startTimeout value is specified for containerB and containerB doesn't reach the desired status within that time, then containerA gives up and doesn't start.
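
The following task definition snippet is a minimal sketch of that scenario. The container names and images are placeholders, startTimeout and stopTimeout are specified in seconds, and containerA waits for containerB to run to completion before it starts.

"containerDefinitions": [
  {
    "name": "containerB",
    "image": "IMAGE_B",
    "essential": false,
    "startTimeout": 120,
    "stopTimeout": 60
  },
  {
    "name": "containerA",
    "image": "IMAGE_A",
    "essential": true,
    "dependsOn": [
      {
        "containerName": "containerB",
        "condition": "COMPLETE"
      }
    ]
  }
]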

For an example of container dependency, see Example: Container Dependency.