Why is my Amazon ECS task stopped?

My Amazon Elastic Container Service (Amazon ECS) task stopped. How do I troubleshoot why my Amazon ECS task stopped?

Short description

Your Amazon ECS tasks might stop due to a variety of reasons. The most common reasons are:

  • Essential container exited
  • Failed Elastic Load Balancing (ELB) health checks
  • Failed container health checks
  • Unhealthy container instance
  • Underlying infrastructure maintenance
  • Service scaling event triggered
  • ResourceInitializationError
  • CannotPullContainerError
  • Task stopped by user

Understanding the stopped reason that's reported for a stopped task can reduce the effort needed to troubleshoot.

Resolution

You can view the details of a stopped task with the DescribeTasks API. However, the details for a stopped task appear in the returned results for only one hour. To keep stopped task details longer, you can use this AWS CloudFormation template to store Amazon CloudWatch Logs from an EventBridge event that is triggered when a task is stopped.
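
For example, you can retrieve the stop code and stopped reason for a recently stopped task with the AWS CLI. This is a minimal sketch; the cluster name and task ID are placeholders for your own values:

    # List recently stopped tasks in the cluster
    aws ecs list-tasks --cluster my-cluster --desired-status STOPPED

    # Show the task-level stop code and stopped reason
    aws ecs describe-tasks \
        --cluster my-cluster \
        --tasks 0123456789abcdef0 \
        --query 'tasks[].{arn:taskArn,stopCode:stopCode,reason:stoppedReason}'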

Stopped reasons

Essential container in task exited

All tasks must have at least one essential container. If the essential parameter of a container is set to true and that container fails or stops for any reason, then all other containers that are part of the task are stopped. To understand why a task exited with this reason, identify the container exit code with the DescribeTasks API (see the sketch below), and then see the Common exit codes section of this article.
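
The following AWS CLI sketch (the cluster name and task ID are placeholders) returns the exit code reported for each container in a stopped task:

    # Show the exit code and reason reported for each container in the task
    aws ecs describe-tasks \
        --cluster my-cluster \
        --tasks 0123456789abcdef0 \
        --query 'tasks[].containers[].{name:name,exitCode:exitCode,reason:reason}' \
        --output table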

Task failed ELB health checks

When a task fails ELB health checks, confirm that your container security group allows traffic originating from the load balancer. Consider the following, and see the example commands after this list:

  • Define a minimum health check grace period. This instructs the service scheduler to ignore Elastic Load Balancing health checks for a predefined time period after a task has been instantiated.
  • By default, a target starts to receive its full share of requests as soon as it's registered with a target group and passes an initial health check. Using slow start mode gives targets time to warm up before the load balancer sends them a full share of requests.
  • Monitor the CPU and memory metrics of the service. For example, high CPU can make your application unresponsive and result in a 502 error.
  • Check your application logs for application errors.
  • Check if the ping port and the health check path are configured correctly.
  • Curl the health check path from an Amazon Elastic Compute Cloud (Amazon EC2) instance in the same VPC, and confirm the response code.
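
For example, you can set a health check grace period on an existing service and probe the health check path directly. This is a sketch; the cluster, service, address, port, and path are placeholders for your own values:

    # Give newly started tasks 120 seconds before ELB health checks count
    aws ecs update-service \
        --cluster my-cluster \
        --service my-service \
        --health-check-grace-period-seconds 120

    # From an EC2 instance in the same VPC, confirm the health check response code
    curl -v http://10.0.1.23:8080/health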

Failed container health checks

Health checks can be defined in the task definition through the TaskDefinition API, or in the container image's Dockerfile.

You can view the health status of both individual containers and the task with the DescribeTasks API operation.

Be sure that the health check command's exit status indicates that the container is healthy. Check your container logs for application errors using the log driver settings specified in the task definition. The following are the possible values (see the sketch after this list for a sample health check definition):

  • 0: success – The container is healthy and ready for use.
  • 1: unhealthy – The container isn't working correctly.
  • 2: reserved – Don't use this exit code.
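
The following is a minimal sketch of a container health check as it appears in a task definition, along with a command to check the health status that Amazon ECS reports. The curl command, interval, and other values are examples that you would adapt to your application; the cluster name and task ID are placeholders:

    # Save a container healthCheck snippet for use in a task definition
    cat > healthcheck-snippet.json <<'EOF'
    {
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost/ || exit 1"],
        "interval": 30,
        "timeout": 5,
        "retries": 3,
        "startPeriod": 60
      }
    }
    EOF

    # Check the health status that Amazon ECS reports for a running task
    aws ecs describe-tasks \
        --cluster my-cluster \
        --tasks 0123456789abcdef0 \
        --query 'tasks[].{task:healthStatus,containers:containers[].{name:name,health:healthStatus}}'

The command in the healthCheck block must exit with 0 (healthy) or 1 (unhealthy), matching the values listed above.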

(instance i-xx) (port x) is unhealthy in (reason Health checks failed)

This indicates the container status is unhealthy. To troubleshoot this issue:

  • Verify that the security group attached to the container instance permits traffic from the load balancer.
  • Confirm that the backend returns a successful response without delay.
  • Confirm that the health check timeout and response time values are configured correctly.
  • Check the access logs of your load balancer for more information.
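
You can also check what the load balancer reports for each target with the AWS CLI. This is a sketch; the target group ARN is a placeholder:

    # Show each target's health state and the reason it's reported unhealthy
    aws elbv2 describe-target-health \
        --target-group-arn arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/my-targets/0123456789abcdef \
        --query 'TargetHealthDescriptions[].{id:Target.Id,port:Target.Port,state:TargetHealth.State,reason:TargetHealth.Reason}'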

Service ABCService: ECS is performing maintenance on the underlying infrastructure hosting the task

This indicates that the task was stopped due to a task maintenance issue. For more information, see AWS Fargate task maintenance.

A service makes sure that the specified scheduling strategy is followed and that tasks are rescheduled when they stop or fail. If the container instance is part of an Auto Scaling group, then a new container instance is launched and the tasks are placed on it. For more information, see Verifying a scaling activity for an Auto Scaling group.
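
To confirm that the Auto Scaling group replaced the instance, you can review its recent scaling activities. This is a sketch; the group name is a placeholder:

    # Show the most recent scaling activities for the Auto Scaling group
    aws autoscaling describe-scaling-activities \
        --auto-scaling-group-name my-ecs-asg \
        --max-items 5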

ECS service scaling event triggered

This is a standard service message. Amazon ECS uses the Application Auto Scaling service to provide this functionality. The service scheduler can automatically increase or decrease the desired count of tasks for a service. To confirm that a scaling policy stopped the task, review the scaling activities for the service, as shown below.
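
The following AWS CLI sketch lists the scaling activities that Application Auto Scaling ran against the service; the cluster and service names are placeholders:

    # Show recent Application Auto Scaling activity for the ECS service
    aws application-autoscaling describe-scaling-activities \
        --service-namespace ecs \
        --resource-id service/my-cluster/my-service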

ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed

To troubleshoot this error, see How do I troubleshoot the error "unable to pull secrets or registry auth" in Amazon ECS?

CannotPullContainerError

This error indicates that the task couldn't pull the specified container image, often because the task execution role doesn't have the required permissions or the task has no network path to the registry. To troubleshoot this issue (see the example after this list):

  • Verify that the task execution role has the needed permissions. Amazon ECS provides the managed policy named AmazonECSTaskExecutionRolePolicy, which contains the permissions for most common use cases.
  • Verify that the Amazon ECR service endpoints ecr.region.amazonaws.com and dkr.ecr.region.amazonaws.com are reachable from your network.
  • For private images that need authentication, make sure that repositoryCredentials and credentialsParameter are defined with the correct information. For more information, see Private registry authentication for tasks.
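
For example, the following AWS CLI sketch checks the policies on the task execution role and attaches the managed policy if it's missing. The role name ecsTaskExecutionRole is a placeholder for your own task execution role:

    # List the policies currently attached to the task execution role
    aws iam list-attached-role-policies --role-name ecsTaskExecutionRole

    # Attach the Amazon ECS managed task execution policy if it's missing
    aws iam attach-role-policy \
        --role-name ecsTaskExecutionRole \
        --policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy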

Task stopped by user

This indicates that the task received a StopTask API call. You can identify who initiated the call by viewing the StopTask event in AWS CloudTrail and checking its userIdentity information.
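
For example, you can look up recent StopTask calls with the AWS CLI and inspect the userIdentity field in the returned event records:

    # Find recent StopTask API calls recorded by CloudTrail
    aws cloudtrail lookup-events \
        --lookup-attributes AttributeKey=EventName,AttributeValue=StopTask \
        --max-results 5 \
        --query 'Events[].CloudTrailEvent'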

Common exit codes

  • 0 – The ENTRYPOINT or CMD command completed its execution successfully, and the container stopped.
  • 1 – Indicates an application error. For more information, review your application logs.
  • 137 – Occurs when the container is forcibly stopped (SIGKILL):
    This happens when a container fails to respond to a SIGTERM within the stop timeout, after which SIGKILL is sent and the container is forcibly stopped. The default 30-second timeout can be configured on the ECS container agent with the ECS_CONTAINER_STOP_TIMEOUT parameter (see the sketch after this list).
    This could also occur in an out-of-memory (OOM) situation. Review your CloudWatch metrics to verify whether OOM occurred.
  • 139 – Occurs when the container experienced a segmentation fault. The application likely tried to access a memory region that isn't available, or an environment variable is unset or invalid.
  • 255 – Occurs when the ENTRYPOINT or CMD command in your container failed due to an error. Review your CloudWatch Logs to confirm this.
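
For example, on the EC2 launch type you can raise the agent's stop timeout in the agent configuration file. This is a sketch for an ECS-optimized Amazon Linux instance; the 2m value is an example:

    # Allow containers up to 2 minutes to exit after SIGTERM before SIGKILL
    echo "ECS_CONTAINER_STOP_TIMEOUT=2m" | sudo tee -a /etc/ecs/ecs.config
    sudo systemctl restart ecs

You can also set a per-container stopTimeout parameter in the task definition instead of changing the agent-level setting.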

Common error messages

No Container Instances were found in your cluster

Review the container instances section for your cluster. If needed, you can launch a container instance.
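
You can confirm whether the cluster has registered container instances with the AWS CLI; the cluster name is a placeholder:

    # Count and list the container instances registered to the cluster
    aws ecs describe-clusters --clusters my-cluster \
        --query 'clusters[].registeredContainerInstancesCount'
    aws ecs list-container-instances --cluster my-cluster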

InvalidParameterException

Be sure that any parameters defined in the task definition are present and that the ARNs are correct. Verify that the task role and task execution role have sufficient permissions.
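
For example, you can confirm the role ARNs that a task definition references; the family name and revision are placeholders:

    # Show the task role and task execution role ARNs for a task definition
    aws ecs describe-task-definition \
        --task-definition my-task:1 \
        --query 'taskDefinition.{taskRole:taskRoleArn,executionRole:executionRoleArn}'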

You've reached the limit of the number of tasks you can run concurrently

For more information on limits, see Amazon ECS service quotas.

For all other quota increase requests, create a case in the AWS Support console, and then choose Service limit increase.
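
You can view your current quota values with the Service Quotas CLI. This is a sketch; note that AWS Fargate task quotas are listed under the fargate service code:

    # List Amazon ECS quotas, such as tasks per service
    aws service-quotas list-service-quotas --service-code ecs

    # List AWS Fargate quotas, such as concurrent On-Demand tasks
    aws service-quotas list-service-quotas --service-code fargate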

