Why is my Amazon ECS task stopped?

Last updated: 2022-03-29

My Amazon Elastic Container Service (Amazon ECS) task stopped. How do I troubleshoot the cause?

Short description

Your Amazon ECS tasks might stop for a variety of reasons. The most common reasons are:

  • Essential container exited
  • Failed Elastic Load Balancing (ELB) health checks
  • Failed container health checks
  • Unhealthy container instance
  • Underlying infrastructure maintenance
  • Service scaling event triggered
  • ResourceInitializationError
  • CannotPullContainerError
  • Task stopped by user

Matching a stopped task to its stopped reason helps reduce the effort needed to troubleshoot.

Resolution

You can view the details of a stopped task using the DescribeTasks API. However, the details for the stopped task appear only for one hour in the returned results. To keep stopped task details for longer, you can use this AWS CloudFormation template, which stores in Amazon CloudWatch Logs the Amazon EventBridge event that is emitted when a task is stopped.
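As a rough illustration of the EventBridge side, the sketch below shows an event pattern that matches ECS task state changes to STOPPED, plus two small helpers that apply the same test to an event in code. The pattern uses the documented `aws.ecs` source and `ECS Task State Change` detail type. The helper names (`is_stopped_task_event`, `summarize_stopped_task`) are hypothetical and are not taken from the CloudFormation template mentioned above.

```python
import json

# Event pattern for an EventBridge rule: match ECS task state changes
# whose lastStatus is STOPPED.
STOPPED_TASK_PATTERN = {
    "source": ["aws.ecs"],
    "detail-type": ["ECS Task State Change"],
    "detail": {"lastStatus": ["STOPPED"]},
}


def is_stopped_task_event(event: dict) -> bool:
    """Return True if the event is an ECS task state change to STOPPED."""
    return (
        event.get("source") == "aws.ecs"
        and event.get("detail-type") == "ECS Task State Change"
        and event.get("detail", {}).get("lastStatus") == "STOPPED"
    )


def summarize_stopped_task(event: dict) -> dict:
    """Pick out the fields that are most useful when troubleshooting."""
    detail = event.get("detail", {})
    return {
        "taskArn": detail.get("taskArn"),
        "stopCode": detail.get("stopCode"),
        "stoppedReason": detail.get("stoppedReason"),
    }


if __name__ == "__main__":
    # Print the pattern as JSON, ready to paste into a rule definition.
    print(json.dumps(STOPPED_TASK_PATTERN, indent=2))
```

A rule with this pattern can send matching events to a CloudWatch Logs log group, where they remain available after the DescribeTasks results have expired.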

Stopped reasons

Essential container in task exited

All tasks must have at least one essential container. If the essential parameter of a container is marked as true and that container fails or stops for any reason, then all other containers that are part of that task are stopped. To understand why a task exited with this reason, identify the exit code using the DescribeTasks API, and then see the Common exit codes section of this article.
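The lookup described above can be sketched as follows. The function works on plain dictionaries shaped like the `containerDefinitions` list of a task definition and the `containers` list of a task returned by DescribeTasks. It assumes that you have already fetched both, for example with an AWS SDK. The function name is hypothetical.

```python
def essential_container_exits(container_definitions, task_containers):
    """Return (name, exit_code, reason) for each essential container that exited.

    container_definitions: the "containerDefinitions" list of the task definition.
    task_containers: the "containers" list of one task from DescribeTasks.
    """
    # A container is treated as essential when the parameter is omitted.
    essential_names = {
        definition["name"]
        for definition in container_definitions
        if definition.get("essential", True)
    }
    exits = []
    for container in task_containers:
        exit_code = container.get("exitCode")
        # exitCode is absent when the container never started.
        if container.get("name") in essential_names and exit_code is not None:
            exits.append((container["name"], exit_code, container.get("reason")))
    return exits
```

The essential flag comes from the task definition rather than from the DescribeTasks output, which is why the sketch takes both inputs.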

Task failed ELB health checks

When a task fails ELB health checks, confirm that your container's security group allows traffic from the load balancer. Also consider the following:

  • Define a minimum health check grace period. This instructs the service scheduler to ignore Elastic Load Balancing health checks for a predefined time period after a task has been instantiated.
  • By default, a target starts to receive its full share of requests as soon as it's registered with a target group and passes an initial health check. Using slow start mode gives targets time to warm up before the load balancer sends them a full share of requests.
  • Monitor the CPU and memory metrics of the service. For example, high CPU can make your application unresponsive and result in a 502 error.
  • Check your application logs for application errors.
  • Check that the ping port and the health check path are configured correctly.
  • From within the Amazon Elastic Compute Cloud (Amazon EC2) instance, use curl to request the health check path, and then confirm the response code.
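If curl isn't available on the instance, the last check in the list can be approximated with a short script. This is a minimal sketch: the function name and the example URL are assumptions, and the script doesn't follow redirects, so a 3xx response is reported as is.

```python
import http.client
from typing import Optional
from urllib.parse import urlsplit


def probe_health_check(url: str, timeout: float = 5.0) -> Optional[int]:
    """Send one GET request to the health check URL.

    Returns the HTTP status code, or None if the connection failed or timed out.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        connection = http.client.HTTPSConnection(parts.hostname, parts.port, timeout=timeout)
    else:
        connection = http.client.HTTPConnection(parts.hostname, parts.port, timeout=timeout)
    # Rebuild the request target from the path and optional query string.
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    try:
        connection.request("GET", target)
        return connection.getresponse().status
    except (OSError, http.client.HTTPException):
        return None
    finally:
        connection.close()


if __name__ == "__main__":
    # Hypothetical address: replace with your container's port and health check path.
    print(probe_health_check("http://localhost:8080/health"))
```

A result of None usually points to a port or security group problem, whereas an unexpected status code points to the health check path or the application itself.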

Failed container health checks

You can define container health checks in the task definition or in the Dockerfile.

You can view the health status of both individual containers and the task with the DescribeTasks API operation.

Be sure that the health check command returns an exit status that indicates that the container is healthy. To find application errors, check your container logs using the log driver settings specified in the task definition. The possible exit status values are:

  • 0: success – The container is healthy and ready for use.
  • 1: unhealthy – The container isn't working correctly.
  • 2: reserved – Don't use this exit code.
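To make the exit status values concrete, the following sketch shows a `healthCheck` block as it might appear in a container definition, and a helper that runs a shell command locally and maps its exit status to the meanings listed above. The command, the timing values, and the function names are illustrative assumptions. Running the command on your own machine only approximates what happens inside the container.

```python
import subprocess

# Example healthCheck block for a container definition (values are illustrative).
HEALTH_CHECK = {
    "command": ["CMD-SHELL", "curl -f http://localhost/ || exit 1"],
    "interval": 30,     # seconds between health checks
    "timeout": 5,       # seconds to wait for each check
    "retries": 3,       # consecutive failures before the container is unhealthy
    "startPeriod": 60,  # grace period in seconds after the container starts
}

EXIT_STATUS_MEANING = {0: "healthy", 1: "unhealthy", 2: "reserved"}


def interpret_health_exit_status(status: int) -> str:
    """Map a health check exit status to its meaning; anything else is 'unknown'."""
    return EXIT_STATUS_MEANING.get(status, "unknown")


def run_health_command(shell_command: str, timeout_seconds: float = 5.0) -> str:
    """Run a shell command and interpret its exit status as a health result."""
    try:
        completed = subprocess.run(shell_command, shell=True, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # In this sketch, a check that exceeds its timeout counts as a failure.
        return "unhealthy"
    return interpret_health_exit_status(completed.returncode)
```

For example, `run_health_command(HEALTH_CHECK["command"][1])` runs the shell portion of the sample command with the default 5-second timeout.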

(instance i-xx) (port x) is unhealthy in (reason Health checks failed)

This message indicates that the container is unhealthy. To troubleshoot this issue:

  • Verify that the security group attached to the container instance permits traffic from the load balancer.
  • Confirm that the back end returns a successful response without delay.
  • Set the health check response timeout to a value that fits your application's response time.
  • Check the access logs of your load balancer for more information.

Service ABCService: ECS is performing maintenance on the underlying infrastructure hosting the task

This message indicates that the task was stopped for task maintenance. For more information, see AWS Fargate task maintenance.

A service makes sure that the specified scheduling strategy is followed and that tasks are rescheduled when they stop or fail. If the container instance is part of an Auto Scaling group, then a new container instance must be launched before the tasks can be placed. For more information, see Verifying a scaling activity for an Auto Scaling group.

ECS service scaling event triggered

This is a standard service message. Amazon ECS uses the Application Auto Scaling service to automatically increase or decrease the desired count of tasks in a service.

ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed

To troubleshoot this error, see How do I troubleshoot the error "unable to pull secrets or registry auth" in Amazon ECS?

CannotPullContainerError

This error indicates that the task can't pull the container image. A common cause is that the task execution role doesn't have permission to communicate with Amazon Elastic Container Registry (Amazon ECR). To troubleshoot this issue:

  • Verify that the task execution role has the needed permissions. Amazon ECS provides the managed policy named AmazonECSTaskExecutionRolePolicy, which contains the permissions for most use cases.
  • Verify that the Amazon ECR service endpoints are accessible: ecr.region.amazonaws.com and dkr.ecr.region.amazonaws.com
  • For private images that need authentication, make sure that repositoryCredentials and credentialsParameter are defined with the correct information. For more information, see Private registry authentication for tasks.
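The private registry check in the last item can be partly automated. The sketch below inspects one container definition (a plain dictionary, as it appears in a task definition) and reports obvious problems with `repositoryCredentials`. The function name and the specific checks are assumptions. The sketch can't confirm that the secret exists or that the task execution role is allowed to read it.

```python
def check_repository_credentials(container_definition: dict) -> list:
    """Return a list of problems found in a container's repositoryCredentials."""
    problems = []
    credentials = container_definition.get("repositoryCredentials")
    if not isinstance(credentials, dict):
        return ["repositoryCredentials is not set"]

    parameter = credentials.get("credentialsParameter")
    if not isinstance(parameter, str) or not parameter.strip():
        problems.append("credentialsParameter is missing or empty")
    elif parameter.startswith("arn:"):
        # A full ARN should point at an AWS Secrets Manager secret.
        fields = parameter.split(":")
        if len(fields) < 6 or fields[2] != "secretsmanager":
            problems.append("credentialsParameter is not a Secrets Manager ARN")
    return problems
```

An empty list means only that the definition looks structurally complete. Still compare the secret's contents with the registry user name and password.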

Task stopped by user

This message indicates that the task received a StopTask API call. To identify who initiated the call, view the StopTask event in AWS CloudTrail and check its userIdentity information.
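If you export CloudTrail records as JSON, a small filter like the following can pull out the caller of each StopTask event. It relies only on standard CloudTrail record fields (`eventSource`, `eventName`, `eventTime`, `userIdentity`, `requestParameters`). The function name is hypothetical, and how you obtain the records (console download, S3 log files, or an SDK) is up to you.

```python
def stop_task_callers(records):
    """Return one summary per StopTask event found in CloudTrail records."""
    callers = []
    for record in records:
        if record.get("eventSource") != "ecs.amazonaws.com":
            continue
        if record.get("eventName") != "StopTask":
            continue
        identity = record.get("userIdentity") or {}
        parameters = record.get("requestParameters") or {}
        callers.append(
            {
                "eventTime": record.get("eventTime"),
                # Fall back to principalId when the identity has no ARN.
                "caller": identity.get("arn") or identity.get("principalId"),
                "task": parameters.get("task"),
                "reason": parameters.get("reason"),
            }
        )
    return callers
```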

Common exit codes

  • 0 – Success. The ENTRYPOINT or CMD completed its execution, so the container stopped.
  • 1 – Refers to an application error. For more information, review the application logs.
  • 137 – Occurs when the container was forcibly stopped with SIGKILL. Possible causes:
    The container didn't respond to SIGTERM within the default 30-second period, after which SIGKILL is sent and the container is forcibly stopped. You can configure this period on the ECS container agent with the ECS_CONTAINER_STOP_TIMEOUT parameter.
    The container ran out of memory (OOM). Review your CloudWatch metrics to verify whether an OOM condition occurred.
  • 139 – Occurs when a segmentation fault is experienced. Likely, the application tried to access a memory region that isn't available, or there is an unset or invalid environment variable.
  • 255 – Occurs when the ENTRYPOINT or CMD command in your container failed due to an error. Review your CloudWatch Logs to confirm this.
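Exit codes above 128 conventionally encode a signal number as 128 plus the signal, which is where 137 (128 + 9, SIGKILL) and 139 (128 + 11, SIGSEGV) come from. The helper below is a small sketch of that arithmetic together with the codes listed above. The function name and the wording of the returned strings are illustrative.

```python
# Signal numbers that commonly show up in container exit codes on Linux.
SIGNAL_NAMES = {9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}


def describe_exit_code(code: int) -> str:
    """Give a short, human-readable hint for a container exit code."""
    if code == 0:
        return "success: ENTRYPOINT or CMD completed"
    if code == 1:
        return "application error: review the application logs"
    if code == 255:
        return "ENTRYPOINT or CMD failed with an error"
    if code > 128:
        # By convention, 128 + N means the process was ended by signal N.
        signal_number = code - 128
        name = SIGNAL_NAMES.get(signal_number, f"signal {signal_number}")
        return f"terminated by {name}"
    return "application-defined exit code"
```

For example, `describe_exit_code(137)` reports SIGKILL, which matches either a stop timeout or an out-of-memory kill. The exit code alone can't tell the two apart.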

Common error messages

No Container Instances were found in your cluster

Review the container instances section for your cluster. If needed, you can launch a container instance.

InvalidParameterException

Be sure that any parameters referenced in the task definition are present and that their ARNs are correct. Verify that the task role and the task execution role have sufficient permissions.
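A malformed ARN is easy to miss by eye. The sketch below splits an ARN into its standard fields (`arn:partition:service:region:account-id:resource`) and adds a narrow check for IAM role ARNs such as a task role or task execution role. The names `parse_arn` and `is_iam_role_arn` are hypothetical. The check validates only the shape of the string, not whether the resource exists.

```python
from typing import NamedTuple, Optional


class Arn(NamedTuple):
    partition: str
    service: str
    region: str
    account_id: str
    resource: str


def parse_arn(value: str) -> Optional[Arn]:
    """Split an ARN into its fields, or return None if the shape is wrong."""
    parts = value.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        return None
    partition, service, region, account_id, resource = parts[1:]
    # Partition, service, and resource are never empty in a valid ARN.
    if not partition or not service or not resource:
        return None
    return Arn(partition, service, region, account_id, resource)


def is_iam_role_arn(value: str) -> bool:
    """Return True if the value looks like an IAM role ARN."""
    arn = parse_arn(value)
    return (
        arn is not None
        and arn.service == "iam"
        and arn.region == ""  # IAM is global, so the region field is empty
        and len(arn.account_id) == 12
        and arn.account_id.isdigit()
        and arn.resource.startswith("role/")
    )
```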

You've reached the limit of the number of tasks you can run concurrently

For more information on limits, see Amazon ECS service quotas.

For all other quota increase requests, create a case in the AWS Support console, and then choose Service limit increase.

