AWS Storage Blog

Improve Kubernetes pod scheduling accuracy using Amazon EBS

In the cloud-native landscape, Amazon Elastic Block Store (Amazon EBS) volumes serve as the backbone for persistent storage in containerized applications. As organizations scale their Kubernetes workloads on Amazon Elastic Kubernetes Service (Amazon EKS), they increasingly rely on EBS volumes to provide high-performance, durable storage for stateful applications such as databases, message queues, and data processing pipelines.

However, Kubernetes users have had to overcome a persistent challenge: inaccurate pod scheduling due to outdated volume attachment capacity information. This is particularly acute in dense, multi-tenant clusters where efficient resource usage is critical.

In this post, we demonstrate how the new Mutable CSI Node Allocatable Count feature in Amazon EKS 1.34 solves this persistent scheduling problem by enabling real-time capacity updates and automatic recovery from volume attachment failures. We walk through the technical implementation, how to configure and monitor the feature, and provide a hands-on demonstration of how it prevents pods from getting stuck in ContainerCreating state.

The challenge

Imagine this scenario: your Kubernetes scheduler confidently places a critical database pod on a node, believing that it has 15 available volume attachment slots. However, in reality, only 2 slots remain available. The stateful pod gets stuck in the ContainerCreating state indefinitely, and your application fails to scale when you need it most.

This happens because of a fundamental disconnect between what the scheduler thinks is available and what’s actually available on the node. More precisely, kube-scheduler makes decisions based on static capacity information that was accurate when the CSI driver started but becomes stale as the cluster operates.

The following explains how attachment capacity may be consumed without the scheduler’s knowledge:

  • Manual volume attachments: Operations teams attach EBS volumes directly to instances outside of Kubernetes.
  • Shared device limits: On older Amazon Elastic Compute Cloud (Amazon EC2) instance families, network interfaces and EBS volumes compete for the same attachment slots.
  • Dynamic ENI scaling: The VPC CNI and Amazon Web Services (AWS) Load Balancer Controller attach network interfaces as workloads scale, consuming attachment slots on the instance.

As a result, the Kubernetes scheduler schedules pods to nodes that have no more attachment capacity, leading to persistently stuck workloads and manual recovery procedures.

How Kubernetes volume scheduling works

To understand why this enhancement is crucial, you must understand how Kubernetes makes volume scheduling decisions:

  1. Kubelet blocks pod creation until volumes are attached and mounted.
  2. kube-scheduler infers volume usage from the pod specs of existing pods on the node.
  3. The scheduler calculates the total volume usage as:
    • the number of CSI volumes already in use by pods on the node, plus
    • the number of new CSI volumes required by the pod being scheduled.
  4. This sum is compared to the CSINode.Spec.Drivers[].Allocatable.Count property, which is reported by CSI drivers during plugin registration.

Because of these scheduling mechanics, Kubernetes needs an accurate view of the max attachable volume count for each node through CSINode.Spec.Drivers[].Allocatable.Count. Otherwise, the scheduler may place pods on nodes that have no capacity to support the attachment.
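
You can see the value the scheduler compares against by reading the CSINode object for a node. The fragment below is illustrative; the driver entry and reported count vary by instance type and driver configuration:

# Inspect the per-driver allocatable count on a node
kubectl get csinode $NODE_NAME -o yaml
# Truncated example output (values are illustrative):
#   spec:
#     drivers:
#     - name: ebs.csi.aws.com
#       allocatable:
#         count: 25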

Solution overview

Amazon EKS 1.34 enables the Mutable CSI Node Allocatable Count feature by default. This functionality allows the EBS CSI driver and kubelet to periodically refresh a node’s maximum attachable volume count and react immediately when an attach attempt fails due to hitting the limit. This prevents pods from lingering in ContainerCreating.

What you get out of the box

  • Automatic driver configuration: Newer EBS CSI driver releases include nodeAllocatableUpdatePeriodSeconds (default 10 seconds) in the CSIDriver object, so kubelet periodically calls the driver’s NodeGetInfo and updates CSINode.spec.drivers[].allocatable.count. You can verify this setting as shown after this list.
  • Configurable interval: You can override the re-sync period through Helm or the Amazon EKS add‑on configuration. The minimum allowed value is 10 seconds (enforced by Kubernetes).
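
For example, the following command reads the driver configuration back from the CSIDriver object. It assumes the driver's standard object name, ebs.csi.aws.com:

kubectl get csidriver ebs.csi.aws.com -o yaml
# Truncated example output:
#   spec:
#     attachRequired: true
#     nodeAllocatableUpdatePeriodSeconds: 10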

How it works

  • Periodic updates: These keep the allocatable count in sync with reality, such as attachments performed outside Kubernetes and ENIs that consume shared device limits on older Nitro instance families.
  • Reactive updates on failure: When an attach returns a terminal ResourceExhausted error, kubelet triggers an immediate capacity refresh so that future scheduling decisions use the corrected numbers.
  • Recovery behavior: On terminal attach‑limit errors, kubelet marks the pod Failed so that controllers can reschedule promptly—no more indefinite ContainerCreating.
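
To monitor this behavior on a running node, you can poll the reported count and watch it change after attachments happen outside of Kubernetes. This is a simple sketch; $NODE_NAME is assumed to be set to the node you are watching:

# Poll the EBS CSI driver's allocatable count every 10 seconds
while true; do
  kubectl get csinode "$NODE_NAME" \
    -o jsonpath='{.spec.drivers[?(@.name=="ebs.csi.aws.com")].allocatable.count}'
  echo ""
  sleep 10
done
# With the feature enabled, the count drops shortly after an out-of-band
# ENI or EBS volume attachment consumes a shared slot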

Feature behavior matrix

Feature gate      | nodeAllocatableUpdatePeriodSeconds | Behavior
Enabled (default) | Set (default)                      | Periodic updates and reactive updates on ResourceExhausted
Enabled           | Not set                            | No updates; allocatable count remains static
Disabled          | Set                                | Field ignored; allocatable count remains static and immutable
Disabled          | Not set                            | No updates; allocatable count remains static and immutable

Prerequisites

The following prerequisites are necessary to complete this solution:

  • An Amazon EKS cluster: a pre-1.34 cluster on older Nitro instances to reproduce the legacy behavior, or a cluster running version 1.34 and above for the new behavior.
  • The Amazon EBS CSI driver installed as an Amazon EKS add-on or through Helm (v1.46.0 and above for the new behavior), with IMDS access for the node plugin.
  • kubectl and the AWS CLI configured with access to the cluster and its underlying Amazon EC2 instances.

Walkthrough

The following sections walk you through reproducing this issue and describe the expected behavior on Amazon EKS 1.34.

Reproduce the legacy behavior (pre‑1.34)

To understand this problem firsthand, you can reproduce it on an Amazon EKS cluster (pre-1.34) running on older Nitro instances where network interfaces and EBS volumes share attachment limits.

Step 1: Capture initial state

Examine the current capacity and attachments by running the following command in your CLI:

# Check the CSI driver's reported capacity
kubectl get csinode $NODE_NAME -o jsonpath='{.spec.drivers[?(@.name=="ebs.csi.aws.com")].allocatable.count}'
# Example output: 39 (for m5.large)
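
It also helps to record how many attachment slots are actually in use on the underlying instance. The following AWS CLI commands are one way to count attached ENIs and EBS volumes, assuming $INSTANCE_ID holds the node's EC2 instance ID:

# Count network interfaces attached to the instance
aws ec2 describe-network-interfaces \
  --filters "Name=attachment.instance-id,Values=$INSTANCE_ID" \
  --query 'length(NetworkInterfaces)'

# Count EBS volumes attached to the instance
aws ec2 describe-volumes \
  --filters "Name=attachment.instance-id,Values=$INSTANCE_ID" \
  --query 'length(Volumes)'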

Step 2: Consume shared slot

Simulate the scenario where ENIs are attached after CSI driver initialization by running the following command in your CLI:

# Attach one additional ENI to consume a shared slot
aws ec2 attach-network-interface \
  --network-interface-id $ENI_ID \
  --instance-id $INSTANCE_ID \
  --device-index 2
echo "Attached ENI: $ENI_ID"

Step 3: Trigger the failure mode

Scale up a StatefulSet by running the following command in your CLI:

kubectl scale statefulset my-database --replicas=15
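
This step assumes a StatefulSet named my-database already exists. Any StatefulSet works for this test as long as each replica requests its own EBS-backed PersistentVolumeClaim. The manifest below is a minimal sketch; the postgres image and the gp3 StorageClass are illustrative and assume a matching StorageClass backed by the EBS CSI driver exists in your cluster:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-database
spec:
  serviceName: my-database   # assumes a headless Service named my-database (not shown)
  replicas: 3
  selector:
    matchLabels:
      app: my-database
  template:
    metadata:
      labels:
        app: my-database
    spec:
      containers:
      - name: db
        image: postgres:16
        env:
        - name: POSTGRES_PASSWORD
          value: example-only              # illustrative; use a Secret in practice
        - name: PGDATA
          value: /var/lib/postgresql/data/pgdata   # avoid initdb issues on a fresh volume
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: gp3                # illustrative; must map to the EBS CSI driver
      resources:
        requests:
          storage: 10Gi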

Step 4: Observe the failure mode

When this issue occurs, you can observe the following error patterns:

Pods

default       my-database-0            1/1     Running             0          10m
default       my-database-1            1/1     Running             0          9m
default       my-database-2            1/1     Running             0          8m
default       my-database-3            0/1     ContainerCreating   0          7m
default       my-database-4            0/1     ContainerCreating   0          6m
default       my-database-5            0/1     ContainerCreating   0          5m
kube-system   aws-node-12345           1/1     Running             0          2d
kube-system   coredns-abc123           1/1     Running             0          2d
kube-system   ebs-csi-controller-xyz   6/6     Running             0          2d

Kubelet

Unable to attach or mount volumes: unmounted volumes=[test-volume], 
unattached volumes=[kube-api-access-redact test-volume]: timed out waiting 
for the condition

Attach/detach controller

AttachVolume.Attach failed for volume "pvc-redact": rpc error: 
code = Internal desc = Could not attach volume "vol-redact" to node "i-redact": 
attachment of disk "vol-redact" failed, expected device to be attached but was attaching

CSI driver controller

GRPC error: rpc error: code = RESOURCE_EXHAUSTED desc = 
Could not attach volume "vol-redact" to node "i-redact": attachment of disk 
"vol-redact" failed, expected device to be attached but was attaching

Manual recovery process

To fix this failure mode on Amazon EKS versions prior to 1.34, cluster operators must intervene manually (a sketch of the corresponding commands follows the list):

  1. Cordoning nodes with incorrect allocatable properties.
  2. Deleting CSINode objects for affected nodes.
  3. Re-registering the CSI plugin.
  4. Un-cordoning affected nodes.
  5. Deleting stuck pods so they can be rescheduled.
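
The following is a sketch of those steps with kubectl. The node name, pod name, and the app=ebs-csi-node label (used by the default EBS CSI driver DaemonSet) are illustrative and may differ in your installation:

# 1. Cordon the node so no new pods are scheduled onto it
kubectl cordon $NODE_NAME

# 2. Delete the stale CSINode object so it can be re-created
kubectl delete csinode $NODE_NAME

# 3. Re-register the CSI plugin by restarting the EBS CSI node pod on that node
kubectl delete pod -n kube-system -l app=ebs-csi-node \
  --field-selector spec.nodeName=$NODE_NAME

# 4. Un-cordon the node
kubectl uncordon $NODE_NAME

# 5. Delete the stuck pod so its controller reschedules it
kubectl delete pod my-database-3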

Reinstalling the CSI driver without deleting the CSINode object is not sufficient, because the API server rejects updates to the immutable CSINode Allocatable.Count field:


Figure 1 shows an example of the CSINode Allocatable.Count attribute.

On Amazon EKS 1.34, this entire scenario self-heals automatically.

Migrate to the new behavior (Amazon EKS 1.34 and above)

On Amazon EKS 1.34, the Mutable CSI Node Allocatable Count feature is enabled by default. You can use this new functionality by upgrading your EBS CSI driver installation to v1.46.0 or later and ensuring that the CSI node plugin container can access the Instance Metadata Service (IMDS).
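
If you use the Amazon EKS add-on, you can check the installed driver version with the following command (the cluster name is illustrative):

aws eks describe-addon \
  --cluster-name my-cluster \
  --addon-name aws-ebs-csi-driver \
  --query 'addon.addonVersion' \
  --output text
# Example output: v1.46.0-eksbuild.1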

(Optional) Tune the update interval

The EBS CSI driver automatically configures a 10-second update interval. You can customize this through Helm or the Amazon EKS advanced add-on configuration:

Helm configuration

helm upgrade aws-ebs-csi-driver aws-ebs-csi-driver/aws-ebs-csi-driver \
  --set nodeAllocatableUpdatePeriodSeconds=60 \
  --namespace kube-system

Amazon EKS advanced add-on configuration

aws eks update-addon \
  --cluster-name my-cluster \
  --addon-name aws-ebs-csi-driver \
  --configuration-values '{"nodeAllocatableUpdatePeriodSeconds": 60}'

The minimum allowed value is 10 seconds. This parameter is supported in Kubernetes 1.33 and above and needs the MutableCSINodeAllocatableCount feature gate, which is enabled by default in Amazon EKS 1.34. This feature is supported in aws-ebs-csi-driver v1.46.0 and above with IMDS access.
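
After applying either configuration, you can read the value back from the CSIDriver object to confirm the interval that kubelet will use:

kubectl get csidriver ebs.csi.aws.com \
  -o jsonpath='{.spec.nodeAllocatableUpdatePeriodSeconds}'
# Example output: 60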

Self-managed clusters

If you operate self-managed clusters, enable the feature gate on both kube-apiserver and kubelet:

--feature-gates=MutableCSINodeAllocatableCount=true 

Rollback considerations

To disable the feature and revert the allocatable value to static/immutable, set the new field in the CSIDriver object to null:

nodeAllocatableUpdatePeriodSeconds: null
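
One way to apply this, assuming the driver's standard CSIDriver object name and that your Helm or add-on configuration does not immediately re-apply the field, is a merge patch:

# Clear the update period so the allocatable count is no longer refreshed
kubectl patch csidriver ebs.csi.aws.com --type merge \
  -p '{"spec": {"nodeAllocatableUpdatePeriodSeconds": null}}'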

Conclusion

The introduction of Mutable CSI Node Allocatable Count in Amazon EKS 1.34 represents a fundamental improvement in how Kubernetes handles storage capacity management and pod scheduling. This enhancement addresses a long-standing architectural limitation where static capacity reporting led to scheduling mismatches and stuck workloads.

The combination of synchronous attachment validation, dynamic capacity reporting, and enhanced error handling provides a comprehensive solution for reliable stateful workload scheduling on Amazon EKS. This represents a significant evolution in Kubernetes storage architecture, moving from static capacity assumptions to dynamic, real-time awareness of node storage capabilities.

Anuj Butail

Anuj Butail is a Principal Solutions Architect at AWS. He is based out of San Francisco and helps customers in San Francisco and Silicon Valley design and build large-scale applications on AWS. He has expertise in the areas of AWS, edge services, and containers. He enjoys playing tennis, reading, and spending time with his family.

Eddie Torres

Eddie is a member of the Amazon EBS team and a maintainer of the AWS EBS CSI driver. He is an active contributor to the Kubernetes community, where he participates as a member of the Storage Special Interest Group (SIG-Storage).