Containers
Category: Amazon Elastic Kubernetes Service
Unlocking next-generation AI performance with Dynamic Resource Allocation on Amazon EKS and Amazon EC2 P6e-GB200
In this post, we explore how Amazon EC2 P6e-GB200 UltraServers are transforming distributed AI workloads through seamless Kubernetes integration, featuring NVIDIA GB200 Grace Blackwell architecture that enables memory-coherent domains of up to 72 GPUs. The post demonstrates how Dynamic Resource Allocation (DRA) on Amazon EKS enables sophisticated GPU topology management and cross-node GPU communication through IMEX channels, making it possible to efficiently train and deploy trillion-parameter AI models at scale.
Introducing Seekable OCI Parallel Pull mode for Amazon EKS
In this post, we explore how SOCI Parallel Pull Mode transforms container image pulls through configurable parallelization strategies, addressing performance bottlenecks in both the download and unpacking phases. The solution demonstrates significant improvements in pull times, showing nearly 60% acceleration when tested with a 10 GB Deep Learning Container image, making it particularly valuable for AI/ML workloads with large, complex images.
Migrate to Amazon EKS: Data plane cost modeling with Karpenter and KWOK
In this post, we demonstrate how to use Karpenter and KWOK to simulate Kubernetes migrations to Amazon EKS, enabling organizations to estimate compute costs before actual migration. The solution involves creating a test environment, backing it up with Velero, restoring it in a new EKS cluster, and analyzing Karpenter’s node provisioning decisions to build accurate cost estimates.
Canary delivery with Argo Rollout and Amazon VPC Lattice for Amazon EKS
This post explores how to implement progressive delivery using Amazon VPC Lattice, Amazon CloudWatch Synthetics, and Argo Rollouts for canary deployments in Amazon EKS environments. The solution enables gradual traffic shifting between service versions, real-time health monitoring through synthetic tests, and automated rollbacks if issues are detected, providing a comprehensive approach to safe and reliable application updates.
Simplify network connectivity using Tailscale with Amazon EKS Hybrid Nodes
This post guides readers through integrating Tailscale with Amazon EKS Hybrid Nodes to simplify and secure network connectivity between on-premises infrastructure and AWS. The integration enables encrypted point-to-point connections using the WireGuard protocol, creating a peer-to-peer mesh network that streamlines the network architecture needed for EKS Hybrid Nodes.
Scaling beyond IPv4: integrating IPv6 Amazon EKS clusters into existing Istio Service Mesh
Organizations are increasingly adopting IPv6 for their Amazon Elastic Kubernetes Service (Amazon EKS) deployments, driven by three key factors: depletion of private IPv4 addresses, the need to streamline or eliminate overlay networks, and improved network security requirements on Amazon Web Services (AWS). In IPv6-enabled EKS clusters, each pod receives a unique IPv6 address from the […]
Deep dive into cluster networking for Amazon EKS Hybrid Nodes
In this post, we dive deep into cluster networking configurations for Amazon EKS Hybrid Nodes, exploring different Container Network Interface (CNI) options and load balancing solutions to meet various networking requirements. The post demonstrates how to implement BGP routing with Cilium CNI, static routing with Calico CNI, and set up both on-premises load balancing using MetalLB and external load balancing using AWS Load Balancer Controller.
Under the hood: Amazon EKS ultra scale clusters
This post was co-authored by Shyam Jeedigunta, Principal Engineer, Amazon EKS; Apoorva Kulkarni, Sr. Specialist Solutions Architect, Containers; and Raghav Tripathi, Sr. Software Dev Manager, Amazon EKS. Today, Amazon Elastic Kubernetes Service (Amazon EKS) announced support for clusters with up to 100,000 nodes. With Amazon EC2’s new generation accelerated computing instance types, this translates to […]
Amazon EKS enables ultra scale AI/ML workloads with support for 100K nodes per cluster
We’re excited to announce that Amazon Elastic Kubernetes Service (Amazon EKS) now supports up to 100,000 worker nodes in a single cluster, enabling customers to scale up to 1.6 million AWS Trainium accelerators or 800K NVIDIA GPUs to train and run the largest AI/ML models. This capability empowers customers to pursue their most ambitious AI […]
Amazon EKS Pod Identity streamlines cross account access
This post was co-authored by Ashok Srirama, Principal Container Specialist SA, and George John, Senior Product Manager EKS. Today, we’re excited to announce a significant enhancement to Amazon EKS Pod Identity: streamlined cross-account access for Kubernetes applications. This new feature simplifies the process of granting pods permission to access AWS resources in other accounts. […]