Faster nodes, smarter scaling: What’s new inside Amazon Elastic Kubernetes Service (Amazon EKS) Auto Mode
When you’re running production Kubernetes workloads, every second matters. The time a node takes to become ready, how quickly your cluster scales in response to a traffic spike, or how fast DNS resolves from a new pod—these non-functional characteristics aren’t flashy feature announcements, but they determine whether your applications feel responsive or sluggish under real-world conditions.
Since launching Amazon Elastic Kubernetes Service (Amazon EKS) Auto Mode, we’ve been focused on making the infrastructure beneath your workloads faster, more efficient, and more resilient—without requiring changes on your part.
In this post, we walk through the performance and scalability improvements we shipped across the four pillars of EKS Auto Mode: runtime, compute, storage, and networking.
Key takeaways:
- Node boot time reduced 39 percent (13 seconds faster) through startup detection optimization.
- Karpenter, the node lifecycle manager in EKS Auto Mode, delivers 43 percent faster scale-out. Consolidation is up to 69 percent faster, with 30 percent more cluster capacity.
- Node-local DNS delivers sub-millisecond resolution without cluster-wide bottlenecks.
- Separate pod subnets and security groups bring enterprise networking to Auto Mode.
- All improvements ship automatically. No configuration changes are required for clusters already running EKS Auto Mode.
Runtime: Faster nodes, fewer surprises
EKS Auto Mode manages the node operating system, bootstrap process, and system daemons on your behalf. The improvements in this section reduce startup latency, improve memory resilience, and accelerate container image pulls.
39 percent faster node startup
When you scale up in EKS Auto Mode, new Amazon Elastic Compute Cloud (Amazon EC2) instances must bootstrap Kubernetes components before they can accept workloads. We profiled the boot sequence and identified that our service-readiness detection was adding unnecessary latency. The system was designed to poll services at conservative intervals appropriate for steady-state health monitoring, but these same intervals were being used during startup. This caused systemd to wait several seconds after a service was actually ready before starting dependent services.
The fix: a fast-path startup detection mode that checks readiness at sub-second intervals during boot, then transitions to standard health-check intervals for ongoing monitoring. The result: mean Node Ready latency dropped by 39 percent (13 seconds). For clusters scaling dozens or hundreds of nodes simultaneously, this compounds into significantly faster time-to-workload.
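To see why the polling interval matters during boot, the following sketch models the delay between a service becoming ready and the next poll noticing it. The 5-second and 0.25-second intervals, the 2.1-second readiness time, and the function name are illustrative assumptions, not the actual EKS Auto Mode settings.

```python
import math


def detection_delay(ready_at: float, poll_interval: float) -> float:
    """Seconds between a service becoming ready and the next poll observing it.

    Simplification: polls are assumed to fire at t = interval, 2 * interval, ...
    """
    next_poll = math.ceil(ready_at / poll_interval) * poll_interval
    return next_poll - ready_at


# Hypothetical service that becomes ready 2.1 seconds into boot.
ready_at = 2.1
steady_state = detection_delay(ready_at, poll_interval=5.0)  # conservative interval
fast_path = detection_delay(ready_at, poll_interval=0.25)    # sub-second interval

print(f"steady-state polling adds {steady_state:.2f} s before dependents start")
print(f"fast-path polling adds {fast_path:.2f} s before dependents start")
```

Because dependent services start in a chain, a delay like this can repeat at every link. As a rough cross-check on the reported numbers, a 13-second saving that equals 39 percent implies a baseline of about 13 / 0.39 ≈ 33 seconds and roughly 20 seconds after the change.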
Memory stability with zram
On smaller instance types, EKS Auto Mode’s system components (kubelet, containerd, the Amazon Virtual Private Cloud (Amazon VPC) CNI agent, CoreDNS, and kube-proxy) share memory with your workload pods. Under normal operation this is fine, but transient spikes can temporarily push the node past its memory limit. For example, kubelet performing garbage collection, a large pod listing, or a burst of DNS queries can cause pressure. Without intervention, the Linux out-of-memory (OOM) killer terminates a process, often one of the system components themselves, causing the node to transition to NotReady and triggering unnecessary pod rescheduling.
Auto Mode nodes now run zram to absorb these transient spikes. zram creates a compressed swap device backed entirely by memory: no disk I/O, no Amazon Elastic Block Store (Amazon EBS) volumes, no added latency. When memory pressure rises, the kernel identifies pages that haven’t been accessed recently and compresses them in place using LZ4 (typically achieving 2–4x compression). A page occupying 4 KB compresses down to approximately 1–2 KB, and the freed space becomes immediately available to whichever process needs it. If the compressed page is accessed again later, decompression takes microseconds, effectively invisible to the application.
The key insight is that zram protects the infrastructure layer without affecting workload performance. Pods with properly configured resource limits and requests behave identically: their memory accounting doesn’t change. The system daemons keeping the node healthy now have a safety buffer against brief memory contention. They no longer become the first casualty of an OOM event.
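As a back-of-the-envelope illustration of the compression figures in this section, the sketch below estimates how much memory is released when cold pages are compressed at 2x and 4x. The count of 50,000 cold pages is a hypothetical input, and the model ignores zram's own metadata overhead.

```python
PAGE_KB = 4  # page size used in the text


def freed_mb(cold_pages: int, compression_ratio: float) -> float:
    """Memory released (in MB) when `cold_pages` pages are compressed in memory."""
    original_kb = cold_pages * PAGE_KB
    compressed_kb = original_kb / compression_ratio
    return (original_kb - compressed_kb) / 1024


# Hypothetical node with 50,000 cold pages (about 195 MB) eligible for compression.
for ratio in (2.0, 4.0):
    print(f"{ratio:.0f}x compression frees about {freed_mb(50_000, ratio):.1f} MB")
```

Even at the low end of the range, the released memory is on the order of what a transient spike in a system daemon might need, which is the buffer this section describes.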
Faster container image pulls
Three improvements speed up how quickly containers start on Auto Mode nodes. First, we increased kubelet’s registryPullQPS from 5 to 25 and registryBurst from 10 to 50. This removes an artificial throttle that prevented nodes from pulling images in parallel at full network speed.
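The effect of the new limits can be approximated with an idealized token bucket, sketched below. This is a simplified model under stated assumptions: the bucket starts full, requests wait for a token, and the figure of 60 simultaneous pulls is hypothetical. The kubelet's real behavior when the limit is exceeded (including retries) is not modeled.

```python
def time_to_admit(pulls: int, qps: float, burst: int) -> float:
    """Seconds until the last of `pulls` simultaneous requests gets a token.

    Idealized token bucket: starts with `burst` tokens and refills at `qps` per second.
    """
    if pulls <= burst:
        return 0.0
    return (pulls - burst) / qps


pulls = 60  # hypothetical: 60 image pulls requested at once on a fresh node
before = time_to_admit(pulls, qps=5, burst=10)   # previous registryPullQPS / registryBurst
after = time_to_admit(pulls, qps=25, burst=50)   # new registryPullQPS / registryBurst
print(f"before: {before:.1f} s of throttling, after: {after:.1f} s")
```

In this model the old limits hold the last pull back for 10 seconds before any bytes move, while the new limits reduce that to 0.4 seconds.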
Second, for instance types with local NVMe storage (common for GPU and machine learning (ML) workloads), we optimized image decompression to take advantage of the faster local disk. Container layers are decompressed directly onto NVMe rather than network-attached EBS, significantly reducing image pull time for large ML framework images that can exceed 20 GB.
Third, we turned on Seekable OCI (SOCI) parallel pull and unpack for G, P, and Trn family instances with local NVMe storage. SOCI allows containers to start before the full image is downloaded. Only the layers needed for initial execution are pulled first, with the rest streaming in the background. SOCI runs by default for these instance families in EKS Auto Mode. No configuration is required.
Automatic security hardening
When you configure a custom AWS Key Management Service (AWS KMS) key on your NodeClass, EKS Auto Mode encrypts the entire disk surface, covering both the read-only root volume and the read/write data volume. This provides full encryption coverage with no additional configuration.
Compute: Scaling faster and smarter with Karpenter
EKS Auto Mode uses Karpenter to manage node lifecycle: provisioning right-sized instances for pending pods and consolidating underutilized nodes to reduce cost. In 2025 and 2026, we shipped dozens of optimizations that make Karpenter faster at both scaling out and scaling in. The benchmark results in the following section quantify the gains.
What changed
The improvements span five areas:
- Scheduling simulation: We cache pod resource requests and requirements in memory, eliminating recomputations during scheduling loops. Hostname topology operations went from O(n) to O(1), and instance types are pre-filtered based on NodePool requirements before simulation begins.
- Memory efficiency: We removed unnecessary object copies in hot paths, reducing the garbage collection pressure that was causing latency spikes in large clusters.
- Parallelization: Node filtering, pod eviction queues, and disruption execution now run concurrently rather than sequentially.
- Smarter disruption: Empty nodes are consolidated first (no simulation needed). NodePool-aware candidate shuffling prevents one pool from being starved during consolidation. Scheduling simulation times out after 60 seconds to avoid blocking other operations.
- Reduced API calls: We cache EC2 instance data and limit redundant DescribeInstances calls in the drift controller.
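The caching idea in the scheduling simulation item can be illustrated with a small sketch. Karpenter itself is written in Go, so this Python version is not its implementation. The pod shape, the class name, and the counter are hypothetical and exist only to show how memoizing per-pod requests removes repeated work across scheduling loops.

```python
class RequestCache:
    """Memoizes per-pod CPU requests so repeated scheduling loops skip the sum."""

    def __init__(self) -> None:
        self._cpu_by_uid = {}
        self.computations = 0  # counts cache misses, for illustration only

    def cpu_millis(self, pod: dict) -> int:
        uid = pod["uid"]
        if uid not in self._cpu_by_uid:
            self.computations += 1
            self._cpu_by_uid[uid] = sum(c["cpu_m"] for c in pod["containers"])
        return self._cpu_by_uid[uid]


# Hypothetical workload: 1,000 pods with two containers requesting 250m CPU each.
pods = [
    {"uid": f"pod-{i}", "containers": [{"cpu_m": 250}, {"cpu_m": 250}]}
    for i in range(1000)
]

cache = RequestCache()
for _ in range(3):  # three scheduling loops over the same pending pods
    total = sum(cache.cpu_millis(p) for p in pods)

print(f"total request: {total}m, sums computed: {cache.computations}")
```

Without the cache, three loops would recompute 3,000 sums. With it, each pod is summed once. A production cache would also need invalidation when a pod's spec changes, which this sketch omits.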
Results
The benchmark scales a workload from 0 to 1,000 pods across 250 nodes (m5.xlarge instances in us-east-1, running a CPU-bound scheduling simulation workload), then consolidates back down:
| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| Scale-out time (0→1,000 pods) | 254 sec | 145 sec | 43% faster |
| Total pending-pod-seconds | ~68,000 sec | ~46,000 sec | 33% less wait |
| First consolidation round | 44 sec | 18 sec | 59% faster |
| Scale-in (100%→70% load) | 406 sec | 263 sec | 35% faster |
| Scale-in (70%→2% load) | 302 sec | 93 sec | 69% faster |
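The improvement column can be reproduced from the before and after values. The short sketch below recomputes each percentage. The function name is ours, and the pending-pod-seconds inputs are the approximate values shown in the table.

```python
def improvement_pct(before: float, after: float) -> float:
    """Percentage reduction going from `before` to `after`."""
    return (before - after) / before * 100


rows = {
    "Scale-out time": (254, 145),
    "Total pending-pod-seconds (approx.)": (68_000, 46_000),
    "First consolidation round": (44, 18),
    "Scale-in 100% to 70% load": (406, 263),
    "Scale-in 70% to 2% load": (302, 93),
}

for name, (before, after) in rows.items():
    print(f"{name}: {improvement_pct(before, after):.1f}% reduction")
```

The timing rows round to the percentages in the table. The pending-pod-seconds row comes out slightly above 32 percent from the rounded inputs, so the 33 percent figure presumably reflects the unrounded measurements.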
Customer impact
One enterprise customer with over 10,000 pending pods was seeing 23-minute provisioning delays. The scheduling simulation was exceeding the expiration window for capacity-error instance types, causing repeated retries against unavailable capacity. After these optimizations, their scale-out improved significantly.
Storage: Smoother EBS integration
Many teams adopt EKS Auto Mode incrementally, running Auto Mode nodes alongside existing managed node groups or self-managed nodes during migration. The following improvements help Amazon Elastic Block Store (Amazon EBS) volumes work without disruption in these mixed environments.
Topology-aware volume scheduling
If you’re running EKS Auto Mode alongside traditional managed node groups during migration, you might need to restrict EBS volumes so they only attach to Auto Mode nodes. We added support for allowedTopologies on StorageClasses for this purpose.
Note: For full YAML examples and additional StorageClass configurations, see the Amazon EKS Auto Mode storage documentation or the GitHub samples repository.
This prevents scheduling conflicts in clusters where a pod is bound to a volume in an Availability Zone only reachable by Auto Mode nodes.
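As a rough illustration of the allowedTopologies restriction, the sketch below builds a StorageClass manifest as a Python dictionary and prints it as JSON, which kubectl accepts. Treat every specific value as an assumption to verify against the storage documentation: the provisioner name, the node label key and value, the parameters, and the StorageClass name are not taken from this post.

```python
import json

# Assumed values throughout: verify the provisioner and label against the
# EKS Auto Mode storage documentation before using this in a cluster.
storage_class = {
    "apiVersion": "storage.k8s.io/v1",
    "kind": "StorageClass",
    "metadata": {"name": "auto-ebs-sc"},  # hypothetical name
    "provisioner": "ebs.csi.eks.amazonaws.com",  # assumed Auto Mode EBS provisioner
    "volumeBindingMode": "WaitForFirstConsumer",
    "allowedTopologies": [
        {
            "matchLabelExpressions": [
                # Assumed label that identifies Auto Mode nodes.
                {"key": "eks.amazonaws.com/compute-type", "values": ["auto"]}
            ]
        }
    ],
    "parameters": {"type": "gp3", "encrypted": "true"},  # illustrative parameters
}

print(json.dumps(storage_class, indent=2))
```

Saved as a script, the output could be piped to `kubectl apply -f -`. The allowedTopologies term limits provisioning to nodes that carry the matching label.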
Migration tooling
For clusters transitioning to Auto Mode, we released a migration tool that converts existing ebs.csi.aws.com StorageClass volumes to the Auto Mode EBS StorageClass. The migration runs with no data loss and no workload disruption.
Networking: Local-first, zero-configuration
EKS Auto Mode runs core networking components (CoreDNS, VPC CNI, and kube-proxy) as systemd services on each node rather than as cluster pods. This removes circular dependencies during startup, scales networking with node count automatically, and improves reliability through OS-level process management.
Node-local DNS
In traditional EKS clusters, pod DNS queries traverse the cluster network to reach CoreDNS pods. In Auto Mode, every node runs its own CoreDNS instance. DNS queries always resolve locally and never leave the node. Because CoreDNS runs as a systemd process rather than a pod, it doesn’t consume a pod IP address from your VPC subnet. At scale, this saves one IP per node that would otherwise be allocated to a CoreDNS pod.
Earlier this year, we fixed two DNS issues on EKS Auto Mode nodes. First, a CoreDNS pod scheduled on the same Auto Mode node as a querying pod could hijack queries from that pod. Those queries now always go through the node-local DNS, giving consistent low-latency resolution. Second, Auto Mode nodes can now resolve kube-dns.kube-system.svc.cluster.local correctly even when no kube-dns Service is installed in the cluster.
Separate pod subnets and security groups
You can now specify podSubnetSelectorTerms and podSecurityGroupSelectorTerms on your Auto Mode NodeClass. The two fields must be set together, and they’re additive to subnetSelectorTerms/securityGroupSelectorTerms (which still apply to the node’s primary Elastic Network Interface (ENI)).
With these fields set, pods run in dedicated subnets with distinct security groups while Auto Mode’s zero-configuration defaults stay in place. For a full example with YAML, see Navigating enterprise networking challenges with Amazon EKS Auto Mode.
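The rule that the two pod-level fields must be set together lends itself to a small pre-flight check before applying a NodeClass. The following is a hypothetical client-side lint, not part of EKS. The four field names come from this section, while the tag-based selector shape and the tag values are assumptions.

```python
def check_pod_network_fields(spec: dict) -> None:
    """Raise ValueError if only one of the two pod-level selector fields is set."""
    has_subnets = bool(spec.get("podSubnetSelectorTerms"))
    has_groups = bool(spec.get("podSecurityGroupSelectorTerms"))
    if has_subnets != has_groups:
        raise ValueError(
            "podSubnetSelectorTerms and podSecurityGroupSelectorTerms must be set together"
        )


# Hypothetical NodeClass spec fragment; selector shape and tag values are assumed.
node_class_spec = {
    "subnetSelectorTerms": [{"tags": {"Name": "node-subnet"}}],
    "securityGroupSelectorTerms": [{"tags": {"Name": "node-sg"}}],
    "podSubnetSelectorTerms": [{"tags": {"Name": "pod-subnet"}}],
    "podSecurityGroupSelectorTerms": [{"tags": {"Name": "pod-sg"}}],
}

check_pod_network_fields(node_class_spec)  # passes: both pod-level fields are present
print("pod network fields are consistent")
```

A spec that sets neither field also passes, which matches the additive behavior described in this section: the node-level selectors keep applying to the primary ENI either way.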
IPv4 egress in IPv6 clusters
For clusters running in IPv6 mode, IPv4 traffic from pods automatically translates through the node’s primary ENI IPv4 address. This lets you adopt IPv6 while still reaching legacy IPv4 endpoints without manual network address translation (NAT) configuration. Teams can run IPv6-native clusters today and maintain connectivity to IPv4-only services (external APIs, on-premises systems, third-party SaaS) without deploying additional infrastructure.
DNS-based network policies
DNS-based network policies let you define egress rules using fully qualified domain names (FQDNs) such as api.example.com or *.example.com instead of IP Classless Inter-Domain Routing (CIDR) blocks. Your policies keep working as upstream IPs change. Enforcement happens through the Network Policy Agent on each node, which ties into the node-local DNS path described earlier in this post. For the full walkthrough, see Enhance Amazon EKS network security posture with DNS and admin network policies.
Network Flow Monitor
Launched at re:Invent, Network Flow Monitor provides pod-level visibility into cluster traffic (service maps, flow tables, and Prometheus-scrapable metrics) without deploying additional observability infrastructure. For a deep dive, see Track inter-AZ and NAT gateway traffic with EKS Container Network Observability.
EFA support for ML/HPC on Auto Mode
Request vpc.amazonaws.com/efa and Auto Mode provisions Elastic Fabric Adapter (EFA)-capable nodes with interfaces configured as type EFA, brought up by a systemd unit at boot. Install the device plugin and you’re done.
See it in your cluster
These improvements are already live in every EKS Auto Mode cluster. Here’s how to observe them:
Boot time: Run kubectl get nodes -w during a scale-up event and watch Node Ready timestamps. Compare them to your baseline. You should see nodes ready in under 25 seconds on most instance types.
Karpenter scheduling: Query the karpenter_provisioner_scheduling_simulation_duration_seconds metric in your monitoring stack (Amazon CloudWatch Container Insights or Prometheus) to see scheduling round times.
Node-local DNS: From any pod on an Auto Mode node, run nslookup kubernetes.default. Resolution should finish in sub-millisecond time because it never leaves the node.
Network Flow Monitor: Navigate to Amazon CloudWatch, then Network Flow Monitor in the AWS Management Console to see traffic flows for your EKS Auto Mode cluster.
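For the boot-time check, reading timestamps by eye gets tedious at scale. The sketch below computes the gap between a Node object's creation and its Ready condition from the JSON that `kubectl get nodes -o json` returns. The sample node is fabricated for illustration, and the helper names are ours.

```python
from datetime import datetime
from typing import Optional


def _parse(ts: str) -> datetime:
    # Kubernetes timestamps look like 2025-01-01T12:00:00Z
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


def seconds_to_ready(node: dict) -> Optional[float]:
    """Seconds from Node object creation to its Ready condition turning True."""
    created = _parse(node["metadata"]["creationTimestamp"])
    for cond in node["status"].get("conditions", []):
        if cond["type"] == "Ready" and cond["status"] == "True":
            return (_parse(cond["lastTransitionTime"]) - created).total_seconds()
    return None  # node is not Ready yet


# Fabricated sample shaped like one entry of `kubectl get nodes -o json` items.
sample_node = {
    "metadata": {"name": "example-node", "creationTimestamp": "2025-01-01T12:00:00Z"},
    "status": {
        "conditions": [
            {"type": "MemoryPressure", "status": "False",
             "lastTransitionTime": "2025-01-01T12:00:00Z"},
            {"type": "Ready", "status": "True",
             "lastTransitionTime": "2025-01-01T12:00:21Z"},
        ]
    },
}

print(f"{sample_node['metadata']['name']}: {seconds_to_ready(sample_node)} s to Ready")
```

Two caveats apply. The Node object is created when the node registers with the API server, not when the instance launches, so this covers only part of the boot sequence. Also, lastTransitionTime reflects the most recent transition, so a node that briefly went NotReady will report a later time.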
Conclusion
In this post, we showed how Amazon EKS Auto Mode improved across every layer of the stack in 2025 and 2026. Nodes boot 39 percent faster. Karpenter scales out 43 percent more quickly and delivers up to 30 percent more cluster capacity. Storage handles mixed-cluster scenarios gracefully. Networking delivers node-local DNS along with separate pod subnets and security groups.
These improvements shipped automatically to every EKS Auto Mode cluster. For clusters already running Auto Mode, no upgrades or configuration changes were required.
Getting started
If you’re already using EKS Auto Mode, you’re benefiting from these improvements today. If you haven’t tried it yet:
- Amazon EKS Auto Mode documentation: Learn how Auto Mode works and how to turn it on.
- EKS Auto Mode public change log: Track ongoing improvements.
- Karpenter documentation: Deep dive into compute lifecycle management.
- EKS Best Practices Guide: Operational guidance for production clusters.
We’d love to hear about your experience. Share your thoughts in the comments or reach out through AWS re:Post.