Containers
Extending deployment pipelines with Amazon ECS blue green deployments and lifecycle hooks
In this post, we explore how Amazon ECS’s native support for blue/green deployments can be extended using lifecycle hooks to integrate test suites, manual approvals, and metrics into deployment pipelines.
Kubernetes right-sizing with metrics-driven GitOps automation
In this post, we introduce an automated, GitOps-driven approach to resource optimization in Amazon EKS using AWS services such as Amazon Managed Service for Prometheus and Amazon Bedrock. The solution helps optimize Kubernetes resource allocation through metrics-driven analysis, pattern-aware optimization strategies, and automated pull request generation while maintaining GitOps principles of collaboration, version control, and auditability.
How to build highly available Kubernetes applications with Amazon EKS Auto Mode
In this post, we explore how to build highly available Kubernetes applications using Amazon EKS Auto Mode by implementing critical features like Pod Disruption Budgets, Pod Readiness Gates, and Topology Spread Constraints. Through various test scenarios including pod failures, node failures, AZ failures, and cluster upgrades, we demonstrate how these implementations maintain service continuity and maximize uptime in EKS Auto Mode environments.
How to run AI model inference with GPUs on Amazon EKS Auto Mode
In this post, we show you how to swiftly deploy inference workloads on EKS Auto Mode and demonstrate key features that streamline GPU management. We walk through a practical example by deploying open-weight models from OpenAI using vLLM, while showing best practices for model deployment and maintaining operational efficiency.
Dynamic Kubernetes request right sizing with Kubecost
In this post, we demonstrate how to use the Kubecost Amazon EKS add-on to reduce infrastructure costs and enhance Kubernetes efficiency through Container Request Right Sizing, which helps identify and fix inefficient container resource configurations. We explore how to review Kubecost’s right-sizing recommendations and implement them through either one-time updates or scheduled automated resizing within Amazon EKS environments for continuous resource optimization.
Unlocking next-generation AI performance with Dynamic Resource Allocation on Amazon EKS and Amazon EC2 P6e-GB200
In this post, we explore how Amazon EC2 P6e-GB200 UltraServers are transforming distributed AI workloads through seamless Kubernetes integration, featuring NVIDIA GB200 Grace Blackwell architecture that enables memory-coherent domains of up to 72 GPUs. The post demonstrates how Dynamic Resource Allocation (DRA) on Amazon EKS enables sophisticated GPU topology management and cross-node GPU communication through IMEX channels, making it possible to efficiently train and deploy trillion-parameter AI models at scale.
Implementing usage and security reporting for Amazon ECR
In this post, we demonstrate how to generate comprehensive reports for Amazon ECR repositories that include cost breakdowns, usage metrics, security scan results, and compliance status across all repositories. The solution provides two types of reports: a Repository Summary report containing attributes for tracking and optimizing cost, usage, and OS vulnerabilities, and an Image-Level report for detailed analysis of specific repository images.
Introducing Seekable OCI Parallel Pull mode for Amazon EKS
In this post, we explore how SOCI Parallel Pull mode transforms container image pulls through configurable parallelization strategies, addressing performance bottlenecks in both download and unpacking phases. The solution demonstrates significant improvements in pull times, showing nearly 60% acceleration when tested with a 10 GB Deep Learning Container image, making it particularly valuable for AI/ML workloads with large, complex images.
Migrate to Amazon EKS: Data plane cost modeling with Karpenter and KWOK
In this post, we demonstrate how to use Karpenter and KWOK to simulate Kubernetes migrations to Amazon EKS, enabling organizations to estimate compute costs before actual migration. The solution involves creating a test environment, backing it up with Velero, restoring it in a new EKS cluster, and analyzing Karpenter’s node provisioning decisions to build accurate cost estimates.
Best practices for resilience and availability on Amazon ECS
In this post, we explore advanced implementation patterns for building highly available services on Amazon ECS, including idempotency, resilience to transient failures, static stability across Availability Zones, deployment safety, and chaos engineering techniques. The post provides detailed guidance on how these patterns can be implemented when deploying applications on Amazon ECS to ensure maximum resilience and availability.