Improve Kubernetes pod scheduling accuracy using Amazon EBS
In the cloud-native landscape, Amazon Elastic Block Store (Amazon EBS) volumes serve as the backbone for persistent storage in containerized applications. As organizations scale their Kubernetes workloads on Amazon Elastic Kubernetes Service (Amazon EKS), they increasingly rely on EBS volumes to provide high-performance, durable storage for stateful applications such as databases, message queues, and data processing pipelines.
However, Kubernetes users have had to overcome a persistent challenge: inaccurate pod scheduling due to outdated volume attachment capacity information. This is particularly acute in dense, multi-tenant clusters where efficient resource usage is critical.
In this post, we demonstrate how the new Mutable CSI Node Allocatable Count feature in Amazon EKS 1.34 solves this persistent scheduling problem by enabling real-time capacity updates and automatic recovery from volume attachment failures. We walk through the technical implementation, how to configure and monitor the feature, and provide a hands-on demonstration of how it prevents pods from getting stuck in ContainerCreating state.
The challenge
Imagine this scenario: your Kubernetes scheduler confidently places a critical database pod on a node, believing that it has 15 available volume attachment slots. However, in reality, only 2 slots remain available. The stateful pod gets stuck in the ContainerCreating state indefinitely, and your application fails to scale when you need it most.
This happens because of a fundamental disconnect between what the scheduler thinks is available and what’s actually available on the node. More precisely, kube-scheduler makes decisions based on static capacity information that was accurate when the CSI driver started but becomes stale as the cluster operates.
The following explains how attachment capacity may be consumed without the scheduler’s knowledge:
- Manual volume attachments: Operations teams attach EBS volumes directly to instances outside of Kubernetes.
- Shared device limits: On older Amazon Elastic Compute Cloud (Amazon EC2) instance families, network interfaces and EBS volumes compete for the same attachment slots.
- Dynamic ENI scaling: The VPC CNI and Amazon Web Services (AWS) Load Balancer Controller attach network interfaces as workloads scale, consuming attachment slots on the instance.
As a result, the Kubernetes scheduler schedules pods to nodes that have no more attachment capacity, leading to persistently stuck workloads and manual recovery procedures.
How Kubernetes volume scheduling works
To understand why this enhancement is crucial, it helps to know how Kubernetes makes volume scheduling decisions:
- Kubelet blocks pod creation until volumes are attached and mounted.
- kube-scheduler infers volume usage from the pod specs of existing pods on the node.
- The scheduler calculates the total current usage as:
  - The number of CSI volumes already in use by pods on the node.
  - The number of additional CSI volumes needed by the new pod.
- This sum is compared to the CSINode.Spec.Drivers[].Allocatable.Count property, which is reported by CSI drivers during plugin registration.
Because of these scheduling mechanics, Kubernetes needs an accurate view of the max attachable volume count for each node through CSINode.Spec.Drivers[].Allocatable.Count. Otherwise, the scheduler may place pods on nodes that have no capacity to support the attachment.
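For reference, the following is a sketch of what a CSINode object might look like for a node running the EBS CSI driver. The node name and instance ID shown are placeholders:
apiVersion: storage.k8s.io/v1
kind: CSINode
metadata:
  name: ip-192-168-10-25.us-west-2.compute.internal  # placeholder node name
spec:
  drivers:
  - name: ebs.csi.aws.com
    nodeID: i-0123456789abcdef0  # placeholder EC2 instance ID
    topologyKeys:
    - topology.ebs.csi.aws.com/zone
    allocatable:
      count: 25  # the value the scheduler compares against current volume usage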
Solution overview
Amazon EKS 1.34 enables the Mutable CSI Node Allocatable Count feature by default. This functionality allows the EBS CSI driver and kubelet to periodically refresh a node’s maximum attachable volume count and react immediately when an attach attempt fails due to hitting the limit. This prevents pods from lingering in ContainerCreating.
What you get out of the box
- Automatic driver configuration: Newer EBS CSI driver releases include nodeAllocatableUpdatePeriodSeconds (default 10 seconds) in the CSIDriver object. Therefore, kubelet periodically calls the driver's NodeGetInfo and updates CSINode.spec.drivers[].allocatable.count (see the example after this list).
- Configurable interval: You can override the re-sync period through Helm or the Amazon EKS add-on configuration. The minimum allowed value is 10 seconds (enforced by Kubernetes).
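You can confirm this configuration by inspecting the CSIDriver object. The following sketch shows the relevant field; other spec fields are omitted for brevity:
kubectl get csidriver ebs.csi.aws.com -o yaml
# Abbreviated output:
# apiVersion: storage.k8s.io/v1
# kind: CSIDriver
# metadata:
#   name: ebs.csi.aws.com
# spec:
#   attachRequired: true
#   nodeAllocatableUpdatePeriodSeconds: 10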
How it works
- Periodic updates: These keep the allocatable count in sync with reality, such as attachments performed outside Kubernetes and ENIs that consume shared device limits on older Nitro instance families.
- Reactive updates on failure: When an attach returns a terminal ResourceExhausted error, kubelet triggers an immediate capacity refresh so that future scheduling decisions use the corrected numbers.
- Recovery behavior: On terminal attach-limit errors, kubelet marks the pod Failed so that controllers can reschedule promptly, with no more indefinite ContainerCreating. You can observe both update paths with the watch command after this list.
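To watch these updates land on the CSINode object, you can follow the reported count as attachments change; a quick check, assuming $NODE_NAME is set to the node you are testing:
# Print the allocatable count every time the CSINode object changes
kubectl get csinode $NODE_NAME --watch \
  -o jsonpath='{.spec.drivers[?(@.name=="ebs.csi.aws.com")].allocatable.count}{"\n"}'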
Feature behavior matrix
| Feature gate | nodeAllocatableUpdatePeriodSeconds | Behavior |
| --- | --- | --- |
| Enabled (default) | Set (default) | Periodic updates and reactive updates on ResourceExhausted |
| Enabled | Not set | No updates; allocatable count remains static |
| Disabled | Set | Field ignored; allocatable count remains static and immutable |
| Disabled | Not set | No updates; allocatable count remains static and immutable |
Prerequisites
The following prerequisites are necessary to complete this solution:
- An Amazon EKS cluster with nodes using older Nitro instance types (where ENIs and EBS volumes share attachment slots).
- An Amazon EKS version prior to 1.34, to reproduce the legacy behavior.
- AWS Command Line Interface (AWS CLI) and kubectl configured.
Walkthrough
The following sections walk you through reproducing this issue and describe the expected behavior in EKS 1.34.
Reproduce the legacy behavior (pre‑1.34)
To understand this problem firsthand, you can reproduce it on an EKS cluster (pre 1.34) running on older Nitro instances where network interfaces and EBS volumes share attachment limits.
Step 1: Capture initial state
Examine the current capacity and attachments by running the following command in your CLI:
# Check the CSI driver's reported capacity
kubectl get csinode $NODE_NAME -o jsonpath='{.spec.drivers[?(@.name=="ebs.csi.aws.com")].allocatable.count}'
# Example output: 39 (for m5.large)
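If you haven't already set the shell variables used throughout this walkthrough, the following sketch captures a node name and its EC2 instance ID (here we pick the first node in the cluster; adjust as needed):
# Pick a node and derive its EC2 instance ID from the providerID
# (providerID format: aws:///<az>/<instance-id>)
NODE_NAME=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')
INSTANCE_ID=$(kubectl get node $NODE_NAME -o jsonpath='{.spec.providerID}' | awk -F/ '{print $NF}')
echo "Node: $NODE_NAME, Instance: $INSTANCE_ID"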
Step 2: Consume a shared slot
Simulate the scenario where ENIs are attached after CSI driver initialization by running the following command in your CLI:
# Attach one additional ENI to consume a shared slot
aws ec2 attach-network-interface \
--network-interface-id $ENI_ID \
--instance-id $INSTANCE_ID \
--device-index 2
echo "Attached ENI: $ENI_ID"
Step 3: Trigger the failure mode
Scale up a StatefulSet by running the following command in your CLI:
kubectl scale statefulset my-database --replicas=15
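If you don't already have a StatefulSet named my-database, the following minimal manifest works for this test. Each replica requests its own EBS-backed PVC, so each pod consumes one attachment slot. The container image and the gp3 StorageClass are placeholders; substitute whatever your cluster provides:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-database
spec:
  serviceName: my-database
  replicas: 3
  selector:
    matchLabels:
      app: my-database
  template:
    metadata:
      labels:
        app: my-database
    spec:
      containers:
      - name: db
        image: public.ecr.aws/docker/library/postgres:16  # placeholder image
        env:
        - name: POSTGRES_PASSWORD
          value: demo-only  # for demonstration only
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: gp3  # assumed to exist in your cluster
      resources:
        requests:
          storage: 1Gi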
Step 4: Observe the failure mode
When this issue occurs, you can observe the following error patterns:
Pods
default my-database-0 1/1 Running 0 10m
default my-database-1 1/1 Running 0 9m
default my-database-2 1/1 Running 0 8m
default my-database-3 0/1 ContainerCreating 0 7m
default my-database-4 0/1 ContainerCreating 0 6m
default my-database-5 0/1 ContainerCreating 0 5m
kube-system aws-node-12345 1/1 Running 0 2d
kube-system coredns-abc123 1/1 Running 0 2d
kube-system ebs-csi-controller-xyz 6/6 Running 0 2d
Kubelet
Unable to attach or mount volumes: unmounted volumes=[test-volume],
unattached volumes=[kube-api-access-redact test-volume]: timed out waiting
for the condition
Attach/detach controller
AttachVolume.Attach failed for volume "pvc-redact": rpc error:
code = Internal desc = Could not attach volume "vol-redact" to node "i-redact":
attachment of disk "vol-redact" failed, expected device to be attached but was attaching
CSI driver controller
GRPC error: rpc error: code = RESOURCE_EXHAUSTED desc =
Could not attach volume "vol-redact" to node "i-redact": attachment of disk
"vol-redact" failed, expected device to be attached but was attaching
Manual recovery process
To fix this failure mode in Amazon EKS prior to 1.34, cluster operators must manually intervene (sketched in the commands after this list) by:
- Cordoning nodes with incorrect allocatable properties.
- Deleting CSINode objects for affected nodes.
- Re-registering the CSI plugin.
- Un-cordoning affected nodes.
- Deleting stuck pods so that they can be rescheduled.
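A sketch of that procedure, assuming $NODE_NAME is the affected node and the app=ebs-csi-node label matches your EBS CSI driver DaemonSet pods:
# 1. Cordon the node so that the scheduler stops placing pods on it
kubectl cordon $NODE_NAME
# 2. Delete the stale CSINode object
kubectl delete csinode $NODE_NAME
# 3. Restart the node plugin pod so the driver re-registers and recreates the CSINode object
kubectl delete pod -n kube-system -l app=ebs-csi-node --field-selector spec.nodeName=$NODE_NAME
# 4. Un-cordon the node
kubectl uncordon $NODE_NAME
# 5. Delete stuck pods so that their controller reschedules them
kubectl delete pod my-database-3 my-database-4 my-database-5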
Reinstalling the CSI driver without deleting the CSINode object is not sufficient. The object is immutable, so the API server rejects updates to the CSINode Allocatable.Count field:

Figure 1 shows an example of the CSINode Allocatable.Count attribute.
On Amazon EKS 1.34, this entire scenario self-heals automatically.
Migrate to the new behavior (Amazon EKS 1.34 and above)
On Amazon EKS 1.34, the Mutable CSI Node Allocatable Count feature is enabled by default. You can use this new functionality by upgrading your EBS CSI driver installation to v1.46.0 or later and ensuring that the CSI node plugin container can access the Instance Metadata Service (IMDS).
(Optional) Tune the update interval
The EBS CSI driver automatically configures a 10-second update interval. You can customize this through Helm or the Amazon EKS advanced add-on configuration:
Helm configuration
helm upgrade aws-ebs-csi-driver aws-ebs-csi-driver/aws-ebs-csi-driver \
--set nodeAllocatableUpdatePeriodSeconds=60 \
--namespace kube-system
Amazon EKS advanced add-on configuration
aws eks update-addon \
--cluster-name my-cluster \
--addon-name aws-ebs-csi-driver \
--configuration-values '{"nodeAllocatableUpdatePeriodSeconds": 60}'
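With either method, you can verify that the new interval landed on the CSIDriver object:
kubectl get csidriver ebs.csi.aws.com \
  -o jsonpath='{.spec.nodeAllocatableUpdatePeriodSeconds}'
# Expected output: 60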
The minimum allowed value is 10 seconds. This parameter is supported in Kubernetes 1.33 and above and needs the MutableCSINodeAllocatableCount feature gate, which is enabled by default in Amazon EKS 1.34. This feature is supported in aws-ebs-csi-driver v1.46.0 and above with IMDS access.
Self-managed clusters
If you operate self‑managed clusters, enable the feature gate on both kube‑apiserver and kubelet:
--feature-gates=MutableCSINodeAllocatableCount=true
Rollback considerations
To disable the feature and revert the allocatable value to static/immutable, set the new field in the CSIDriver object to null:
nodeAllocatableUpdatePeriodSeconds: null
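One way to apply that, assuming the API server permits in-place updates to this field (otherwise, redeploy the driver with the value unset):
kubectl patch csidriver ebs.csi.aws.com --type=merge \
  -p '{"spec":{"nodeAllocatableUpdatePeriodSeconds":null}}'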
Conclusion
The introduction of Mutable CSI Node Allocatable Count in Amazon EKS 1.34 represents a fundamental improvement in how Kubernetes handles storage capacity management and pod scheduling. This enhancement addresses a long-standing architectural limitation where static capacity reporting led to scheduling mismatches and stuck workloads.
The combination of synchronous attachment validation, dynamic capacity reporting, and enhanced error handling provides a comprehensive solution for reliable stateful workload scheduling on Amazon EKS. This represents a significant evolution in Kubernetes storage architecture, moving from static capacity assumptions to dynamic, real-time awareness of node storage capabilities.