How do I troubleshoot scaling issues with my Amazon ECS capacity provider?

Last updated: 2022-04-14

I set up a capacity provider for my Amazon Elastic Container Service (Amazon ECS) cluster. However, the capacity provider doesn't scale out when the cluster runs out of resources, and it doesn't scale in when capacity is underused.

Short description

The capacity provider for your Amazon ECS cluster doesn't automatically scale in or scale out due to one or more of the following reasons:

  • The Amazon ECS service isn't associated with the capacity provider.
  • The scaling policies related to the capacity provider aren't attached to the Auto Scaling group.
  • The target tracking scaling policies aren't configured correctly.
  • The target capacity percentage isn't configured in the capacity provider correctly.
  • The task placement strategy isn't defined according to the workload.
  • The ECS service is failing with errors and blocking the capacity provider from scaling.
  • You're using managed scaling for the capacity provider, and the Auto Scaling group has custom scaling policies attached to it.
  • The Auto Scaling group has launched the container instance, but the instance is unable to join the cluster.
  • Your container instances are protected from scaling in.
  • The capacity provider is stuck in a failed state.
  • The Auto Scaling group is stuck in a loop of scaling out and scaling in.

Resolution

The Amazon ECS service isn't associated with the capacity provider

To check if the ECS service is associated with the capacity provider, run the AWS Command Line Interface (AWS CLI) command describe-services.

aws ecs describe-services --cluster example-cluster --services example-service --region example-region --query 'services[].capacityProviderStrategy'

If your ECS service is associated with the capacity provider, then the output must look similar to the following:

[
  [
    {
      "capacityProvider": "example-capacity-provider",
      "weight": 1,
      "base": 1
    }
  ]
]

Be sure that the capacityProviderStrategy field is not null in the output. You can view the configuration of the service by reviewing AWS CloudTrail events for CreateService and UpdateService API calls.

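To resolve this issue, associate the capacity provider with your workload by using one of the following AWS CLI commands: update-service for an existing service, run-task for a standalone task, or put-cluster-capacity-providers to set the cluster's default capacity provider strategy. You can also update the service using the Amazon ECS console.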

Note: If you receive errors when running AWS CLI commands, make sure that you’re using the most recent version of the AWS CLI.
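As a rough sketch of the update-service route, the following wraps the call in a small shell function. The function name attach_capacity_provider and all example-* values are placeholders, not names from this article. The weight and base of 1 mirror the sample output above, and --force-new-deployment is included on the assumption that you want running tasks replaced under the new strategy.

```shell
# Hypothetical helper: associate a capacity provider with an existing service.
# Arguments: cluster, service, capacity provider name, Region.
attach_capacity_provider() {
  aws ecs update-service \
    --cluster "$1" \
    --service "$2" \
    --capacity-provider-strategy "capacityProvider=$3,weight=1,base=1" \
    --force-new-deployment \
    --region "$4"
}

# Example call (placeholder values):
# attach_capacity_provider example-cluster example-service example-capacity-provider example-region
```

After the deployment finishes, rerunning the describe-services command shown earlier should return a non-null capacityProviderStrategy.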

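The scaling policies related to the capacity provider aren't attached to the Auto Scaling group

When you create a capacity provider with managed scaling turned on and associate it with an Auto Scaling group, Amazon ECS creates a target tracking scaling policy on that group. The policy adjusts the group's desired capacity to accommodate the cluster load.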

To troubleshoot this issue, review CloudTrail events for UpdateAutoScalingGroup, CreateCapacityProvider, UpdateCapacityProvider, and PutScalingPolicy APIs.
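One possible way to review these events from the command line is the CloudTrail lookup-events command. The sketch below is illustrative: the function name recent_events is hypothetical, and the limit of 10 results and the selected output fields are arbitrary choices.

```shell
# Hypothetical helper: list recent CloudTrail events for one API call name.
# Arguments: event name (for example, PutScalingPolicy), Region.
recent_events() {
  aws cloudtrail lookup-events \
    --lookup-attributes "AttributeKey=EventName,AttributeValue=$1" \
    --max-results 10 \
    --region "$2" \
    --query 'Events[].[EventTime,EventName,Username]' \
    --output text
}

# Example loop over the API calls named above (placeholder Region):
# for name in UpdateAutoScalingGroup CreateCapacityProvider UpdateCapacityProvider PutScalingPolicy; do
#   recent_events "$name" example-region
# done
```

To inspect error details for a particular event, drop the --query option and read the CloudTrailEvent field of the full output.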

Verify that the Auto Scaling group is created as a cluster attachment by running the following command:

aws ecs describe-clusters --clusters example-cluster --include ATTACHMENTS --region example-region --query 'clusters[].attachments[]'

The output of the command must look similar to the following:

[
  {
    "id": "100a23456-5f0b-4abc-b998-d6789d111a",
    "type": "asp",
    "status": "CREATED",
    "details": [
      {
        "name": "capacityProviderName",
        "value": "example-capacityProvider"
      },
      {
        "name": "scalingPlanName",
        "value": "ECSManagedAutoScalingPlan-bb60c8fa-3ed7-4808-b39c-abcdef2345"
      }
    ]
  }
]

If you're using a managed scaling policy, then check whether the policy is attached to the Auto Scaling group by doing the following:

  1. Open the Amazon ECS console.
  2. In the navigation pane, choose Clusters.
  3. Open the cluster that you want to check.
  4. Choose the Capacity Providers tab.
  5. For the capacity provider that you want to check, choose the Auto Scaling group.
    You are directed to the Auto Scaling groups page in the Amazon EC2 console.
  6. Choose the Automatic Scaling tab.
    You can view the scaling policies.
  7. Check whether the scaling policy that you're using is included.

Also, confirm that the name of the scaling policy on the Auto Scaling group begins with the prefix AutoScaling-ECSManagedAutoScalingPlan. If it doesn't, then the Auto Scaling group is using a scaling policy other than the one that the capacity provider manages. Note that capacity providers can be used along with other types of scaling policies. For more information, see Service auto scaling.
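The same check can be approximated from the AWS CLI. This is a minimal sketch: the function names list_policy_names and has_managed_policy are hypothetical, and the prefix is the one quoted in this article.

```shell
# Hypothetical helper: print the scaling policy names on an Auto Scaling group,
# one per line. Arguments: Auto Scaling group name, Region.
list_policy_names() {
  aws autoscaling describe-policies \
    --auto-scaling-group-name "$1" \
    --region "$2" \
    --query 'ScalingPolicies[].PolicyName' \
    --output text | tr '\t' '\n'
}

# Succeeds if at least one policy name starts with the managed prefix.
has_managed_policy() {
  list_policy_names "$1" "$2" | grep -q '^AutoScaling-ECSManagedAutoScalingPlan'
}

# Example call (placeholder values):
# has_managed_policy example-asg example-region && echo "managed policy found"
```

If has_managed_policy fails, list_policy_names shows which policies are attached instead.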

The target tracking scaling policies aren't configured correctly

A target tracking scaling policy tracks a target value for the metric you define. Amazon ECS service auto scaling creates and manages the Amazon CloudWatch alarms that trigger the scaling policy and calculates the scaling adjustment based on the metric and target value. If the target tracking policy isn't configured correctly, tasks might not automatically scale as required.

Suppose that the target tracking scaling policy tracks the CPUUtilization metric in CloudWatch, and you specify a target value of 60. In this case, the capacity provider works on a best-effort basis to keep the aggregate CPU utilization at 60%. This results in a scale-out event when the CPU utilization is greater than 60% and a scale-in event when the utilization is less than 60%.

To resolve this issue, choose the right metric and set the correct scale-in and scale-out values in the target tracking policy based on your workload. For more information, see Target tracking scaling policies.
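To see what a service's target tracking policy currently tracks, one option is the Application Auto Scaling describe-scaling-policies command. The sketch below assumes that the policy uses a predefined metric. The function name show_service_scaling_policies and the output keys are illustrative choices.

```shell
# Hypothetical helper: summarize the target tracking policies on an ECS service.
# Arguments: cluster, service, Region.
show_service_scaling_policies() {
  aws application-autoscaling describe-scaling-policies \
    --service-namespace ecs \
    --resource-id "service/$1/$2" \
    --region "$3" \
    --query 'ScalingPolicies[].{name:PolicyName,metric:TargetTrackingScalingPolicyConfiguration.PredefinedMetricSpecification.PredefinedMetricType,target:TargetTrackingScalingPolicyConfiguration.TargetValue,scaleInCooldown:TargetTrackingScalingPolicyConfiguration.ScaleInCooldown,scaleOutCooldown:TargetTrackingScalingPolicyConfiguration.ScaleOutCooldown}'
}

# Example call (placeholder values):
# show_service_scaling_policies example-cluster example-service example-region
```

A policy that tracks a customized metric returns null for the metric key in this summary, so review the full output in that case.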

The target capacity percentage isn't configured in the capacity provider correctly

The target capacity value is used as the target value for the CloudWatch metric in the Amazon ECS-managed target tracking scaling policy. Amazon ECS matches this value on a best-effort basis. Allowed values are integers from 1 to 100. For example, if you set the target capacity to 100%, then all instances are fully used, and any instances that aren't running tasks are scaled in. However, this behavior isn't guaranteed at all times. If you need spare capacity, set the target capacity to a value that's lower than 100%, based on your requirements.
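For context, the CloudWatch metric behind managed scaling is CapacityProviderReservation. The Amazon ECS cluster auto scaling documentation describes it as the number of instances needed divided by the number of instances running, expressed as a percentage. The arithmetic can be sketched as follows. The function name reservation and the sample numbers are illustrative, and the special case of zero running instances is deliberately left out.

```shell
# Sketch: CapacityProviderReservation = instances needed / instances running * 100.
# Uses integer arithmetic; zero running instances is rejected, not modeled.
reservation() {
  needed=$1
  running=$2
  if [ "$running" -eq 0 ]; then
    echo "running instance count must be greater than zero" >&2
    return 1
  fi
  echo $(( needed * 100 / running ))
}

reservation 4 5    # prints 80: below a target of 100, so a scale-in is expected
reservation 5 4    # prints 125: above a target of 100, so a scale-out is expected
```

With a target capacity of 80, the first case (4 needed, 5 running) is already at the target, which leaves one instance as spare capacity.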

To update the capacity provider with the correct target capacity percentage, follow the instructions in Updating an Auto Scaling group capacity provider using the classic console.
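As an alternative to the console, the target capacity can be changed with the update-capacity-provider command. This is a sketch that assumes managed scaling should stay turned on (status=ENABLED). The function name set_target_capacity is hypothetical.

```shell
# Hypothetical helper: set the managed scaling target capacity.
# Arguments: capacity provider name, target capacity (1 to 100), Region.
set_target_capacity() {
  if [ "$2" -lt 1 ] || [ "$2" -gt 100 ]; then
    echo "target capacity must be an integer from 1 to 100" >&2
    return 1
  fi
  aws ecs update-capacity-provider \
    --name "$1" \
    --auto-scaling-group-provider "managedScaling={status=ENABLED,targetCapacity=$2}" \
    --region "$3"
}

# Example call (placeholder values):
# set_target_capacity example-capacity-provider 90 example-region
```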

The task placement strategy isn't defined according to the workload

Task placement strategies can be specified when you create a service or run a task. You can also update the task placement strategies for existing services. For example, if your workload is memory-intensive and you didn't configure the task placement strategy accordingly, the tasks don't scale in or out based on your memory usage. Be sure to check the task placement strategy types and define these strategies according to your workload.
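The following sketch shows one way to inspect and change a service's placement strategy from the AWS CLI. The function names are hypothetical, and binpack on memory is only one illustrative choice for a memory-intensive workload, so choose the strategy type and field that fit your own workload.

```shell
# Hypothetical helper: show the placement strategy of a service.
# Arguments: cluster, service, Region.
show_placement_strategy() {
  aws ecs describe-services \
    --cluster "$1" --services "$2" --region "$3" \
    --query 'services[].placementStrategy'
}

# Hypothetical helper: place tasks by packing instances on memory.
# Arguments: cluster, service, Region.
set_binpack_memory() {
  aws ecs update-service \
    --cluster "$1" --service "$2" --region "$3" \
    --placement-strategy type=binpack,field=memory
}

# Example calls (placeholder values):
# show_placement_strategy example-cluster example-service example-region
# set_binpack_memory example-cluster example-service example-region
```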

The ECS service is failing with errors and blocking the capacity provider from scaling

If your ECS service fails with any errors, then the capacity provider is blocked from scaling in and scaling out. To troubleshoot why the ECS service failed, check the service event messages in the Amazon ECS console.
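The same service event messages can also be read with the describe-services command. In this sketch, the function name recent_service_events is hypothetical, and the limit of 10 events is an arbitrary choice.

```shell
# Hypothetical helper: print the most recent service event messages.
# Arguments: cluster, service, Region.
recent_service_events() {
  aws ecs describe-services \
    --cluster "$1" --services "$2" --region "$3" \
    --query 'services[0].events[:10].[createdAt,message]' \
    --output text
}

# Example call (placeholder values):
# recent_service_events example-cluster example-service example-region
```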

You're using managed scaling for the capacity provider, and the Auto Scaling group has custom scaling policies attached to it

When your cluster doesn't automatically scale, you might get the following error:

"StatusCode": "ActiveWithProblems"
"StatusMessage": "Scaling plan has been created but failed to be applied to all resources. Problems were encountered for 1 resource. See scaling plan resources for the failure details."

This error occurs when both of the following conditions are true:
  • You're using managed scaling for the capacity provider.
  • The Auto Scaling group has custom scaling policies attached that weren't created by Amazon ECS.

To resolve this error, see Avoiding the ActiveWithProblems error. When you enable managed scaling, Amazon ECS manages the scale-in and scale-out actions of the Auto Scaling group with Auto Scaling scaling plans. It's a best practice to always create a new Auto Scaling group and attach this group to the capacity provider.

The Auto Scaling group has launched the container instance, but the instance is unable to join the cluster

Your container instances are protected from scaling in

If you enabled managed termination protection when you configured the capacity provider, Amazon ECS prevents the Amazon EC2 instances in an Auto Scaling group that contain tasks from being terminated during a scale-in action.

To make sure that the Auto Scaling group can terminate old instances when you change the desired capacity, do the following:

For more information, see How do I resolve the error "The managed termination protection setting for the capacity provider is invalid" in Amazon ECS?
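To see which instances currently have scale-in protection, and to remove it from a specific instance, something like the following may help. Both function names are hypothetical. Removing protection from an instance that still runs tasks allows a scale-in action to terminate those tasks, so treat the second function as a sketch to adapt, not a recommended default.

```shell
# Hypothetical helper: list instance IDs that are protected from scale in.
# Arguments: Auto Scaling group name, Region.
list_protected_instances() {
  aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names "$1" --region "$2" \
    --query 'AutoScalingGroups[].Instances[?ProtectedFromScaleIn].InstanceId' \
    --output text
}

# Hypothetical helper: remove scale-in protection from one instance.
# Arguments: Auto Scaling group name, instance ID, Region.
remove_scale_in_protection() {
  aws autoscaling set-instance-protection \
    --auto-scaling-group-name "$1" --instance-ids "$2" \
    --no-protected-from-scale-in --region "$3"
}

# Example calls (placeholder values):
# list_protected_instances example-asg example-region
# remove_scale_in_protection example-asg i-0123456789abcdef0 example-region
```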

The capacity provider is stuck in a failed state

It's a best practice to create a new Auto Scaling group to use with your capacity provider instead of using an existing group. If you use an existing Auto Scaling group, you might have issues using the capacity provider. This is because the Amazon EC2 instances in the running state that are associated with the existing group and registered to an Amazon ECS cluster might not be properly registered with the capacity provider.

To see the status of the capacity provider, run the AWS CLI command describe-capacity-providers.
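A minimal sketch of that call follows. The function name show_capacity_provider_status is hypothetical, and the --query expression is one way to narrow the output to the status fields.

```shell
# Hypothetical helper: show the status fields of a capacity provider.
# Arguments: capacity provider name, Region.
show_capacity_provider_status() {
  aws ecs describe-capacity-providers \
    --capacity-providers "$1" --region "$2" \
    --query 'capacityProviders[].{name:name,status:status,updateStatus:updateStatus,updateStatusReason:updateStatusReason}'
}

# Example call (placeholder values):
# show_capacity_provider_status example-capacity-provider example-region
```

When an update or delete operation fails, the updateStatusReason field is a reasonable first place to look for the cause.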

Also, review CloudTrail events, and check for errors related to the CreateCapacityProvider API.

The Auto Scaling group is stuck in a loop of scaling out and scaling in

When the metric value that's specified in the scaling policy for your ECS service spikes, the Auto Scaling group scales out and launches instances as required. However, if the value of the metric drops after the sudden spike, the Auto Scaling group tries to scale in the instances. If the metric value fluctuates several times within a short time frame, then the Auto Scaling group might get stuck in a loop of scaling out and scaling in. To avoid this issue, be sure to define the threshold value of the metric in the scaling policy according to your workload.
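To check whether the group is alternating between launches and terminations, you can list its recent scaling activities. In this sketch, the function name recent_scaling_activities is hypothetical, and the limit of 20 items is arbitrary.

```shell
# Hypothetical helper: list recent scaling activities for an Auto Scaling group.
# Arguments: Auto Scaling group name, Region.
recent_scaling_activities() {
  aws autoscaling describe-scaling-activities \
    --auto-scaling-group-name "$1" --region "$2" \
    --max-items 20 \
    --query 'Activities[].[StartTime,StatusCode,Description]' \
    --output text
}

# Example call (placeholder values):
# recent_scaling_activities example-asg example-region
```

Launch and terminate descriptions that alternate within a few minutes of each other suggest the loop described above.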