How to run AI model inference with GPUs on Amazon EKS Auto Mode
AI model inference using GPUs is becoming a core part of modern applications, powering real-time recommendations, intelligent assistants, content generation, and other latency-sensitive AI features. Kubernetes has become the orchestrator of choice for running inference workloads, and organizations want to use its capabilities while still maintaining a strong focus on rapid innovation and time-to-market. But here’s the challenge: while teams see the value of Kubernetes for its dynamic scaling and efficient resource management, they often get slowed down by the need to learn Kubernetes concepts, manage cluster configurations, and handle security updates. This shifts focus away from what matters most: deploying and optimizing AI models. That is where Amazon Elastic Kubernetes Service (Amazon EKS) Auto Mode comes in. EKS Auto Mode automates node creation, manages core cluster capabilities, and handles upgrades and security patching, which lets you run your inference workloads without the operational overhead.
In this post, we show you how to swiftly deploy inference workloads on EKS Auto Mode. We also demonstrate key features that streamline GPU management, show best practices for model deployment, and walk through a practical example by deploying open weight models from OpenAI using vLLM. Whether you’re building a new AI/machine learning (ML) platform or optimizing existing workflows, these patterns help you accelerate development while maintaining operational efficiency.
Key features that make EKS Auto Mode ideal for AI/ML workloads
In this section, we take a closer look at the GPU-specific features that come pre-configured and ready to use with an EKS Auto Mode cluster. These capabilities are also available in self-managed Amazon EKS environments, but there they typically require manual setup and tuning; EKS Auto Mode has them enabled and configured out of the box.
Dynamic autoscaling with Karpenter: EKS Auto Mode includes a managed, conformant version of open source Karpenter that provisions right-sized Amazon Elastic Compute Cloud (Amazon EC2) instances, such as GPU‑accelerated options, based on pod requirements. It supports just-in-time scaling and allows you to configure provisioning behavior so that you can optimize for cost, performance, or instance placement. EKS Auto Mode supports a predefined set of instance types and sizes, along with node labels and taints for scheduling control.
Automatic GPU failure handling: EKS Auto Mode includes Node Monitoring Agent (NMA) and Node Auto Repair, which detect GPU failures and initiate automated recovery 10 minutes after detection. The repair process cordons the affected node and either reboots or replaces it, while respecting Pod Disruption Budgets. GPU telemetry tools, either DCGM-Exporter for NVIDIA or Neuron Monitor for Amazon Web Services (AWS) Inferentia and AWS Trainium, are pre-installed and integrated with NMA for device-level health monitoring.
Amazon EKS-optimized AMIs for accelerated instances: EKS Auto Mode allows you to create a Karpenter NodePool using GPU instance types. Furthermore, when a workload requests a GPU, it automatically launches the appropriate Bottlerocket Accelerated Amazon Machine Image (AMI)—with no need to configure AMI IDs, launch templates, or software components. These AMIs come pre-installed with the necessary drivers, runtimes, and plugins, whether you’re using NVIDIA GPUs or AWS Inferentia and Trainium, so that your AI workloads are ready to run by default.
Together, these features remove the heavy lifting of configuring and operating GPU infrastructure, so that teams can focus on building, scaling, and running AI/ML workloads without becoming Kubernetes experts.
Walkthrough
In this section, you’ll walk through deploying an open-source large language model (LLM) on a GPU-enabled EKS Auto Mode cluster. You’ll create the cluster, configure a GPU NodePool, deploy the model, and send a test prompt, all with minimal setup.
Prerequisites
To get started, make sure that you have the following prerequisites installed and configured:
- AWS Command Line Interface (AWS CLI) (v2.27.11 or later)
- kubectl
- eksctl (v0.195.0 or later)
- jq
Set up environment variables
Configure the following environment variables, replacing the placeholder values as appropriate for your setup:
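The variable names below are assumptions used throughout the rest of this walkthrough; substitute values that match your environment.

```bash
# Assumed variable names; replace the values with your own.
export CLUSTER_NAME=automode-gpu-demo   # name for the new EKS Auto Mode cluster
export AWS_REGION=us-west-2             # Region with capacity for the GPU instance types you plan to use
```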
Set up EKS Auto Mode cluster and run a model
Step 1: Create an EKS Auto Mode cluster using eksctl
Begin by creating your EKS cluster with Auto Mode enabled by running the following command:
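The following is a minimal sketch that assumes an eksctl release supporting the --enable-auto-mode flag; if yours does not, a ClusterConfig file with autoModeConfig.enabled set to true achieves the same result.

```bash
# Create an EKS cluster with Auto Mode enabled, using the variables set earlier.
eksctl create cluster \
  --name "${CLUSTER_NAME}" \
  --region "${AWS_REGION}" \
  --enable-auto-mode
```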
This process takes a few minutes to complete. After completion, eksctl automatically updates your kubeconfig and targets your newly created cluster. To verify that the cluster is operational, use the following:
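For example, list the pods across all namespaces:

```bash
# In an Auto Mode cluster this list is noticeably short.
kubectl get pods -A
```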
Sample output:
You won’t see components such as VPC CNI, kube-proxy, Karpenter, and CoreDNS in the pod list. In EKS Auto Mode, AWS runs these components on the fully managed infrastructure layer, alongside the Amazon EKS control plane.
Step 2: Create a GPU NodePool with Karpenter
Deploy a GPU NodePool tailored to run ML models. Apply the following NodePool manifest:
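The manifest below is a sketch that matches the constraints described next; it assumes EKS Auto Mode's built-in NodeClass named default and the Auto Mode label keys for instance category and generation, so verify those keys against the EKS Auto Mode documentation before applying.

```bash
kubectl apply -f - <<'EOF'
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-node-pool
spec:
  template:
    spec:
      nodeClassRef:
        group: eks.amazonaws.com   # EKS Auto Mode's built-in NodeClass
        kind: NodeClass
        name: default
      taints:
        - key: nvidia.com/gpu      # keep non-GPU pods off these nodes
          effect: NoSchedule
      requirements:
        - key: eks.amazonaws.com/instance-category
          operator: In
          values: ["g"]            # GPU-accelerated g family
        - key: eks.amazonaws.com/instance-generation
          operator: Gt
          values: ["4"]            # G5, G6, G6e, and newer
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
EOF
```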
This NodePool targets GPU-based EC2 instances in the g category with a generation greater than four, such as G5 and G6e instances. These instance families offer powerful NVIDIA GPUs and high-bandwidth networking, making them well-suited for demanding ML inference and generative AI workloads. The applied taint makes sure that only GPU-eligible pods are scheduled on these nodes, maintaining efficient resource isolation. Allowing both On-Demand and Spot capacity types gives EKS Auto Mode the flexibility to optimize for cost while maintaining performance.
Validate the NodePool:
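One way to do this with kubectl:

```bash
# Confirm that the NodePool exists.
kubectl get nodepools
```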
Sample output:
The gpu-node-pool is created with zero nodes initially. To inspect available nodes, use:
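For example:

```bash
# List the nodes currently in the cluster.
kubectl get nodes
```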
Sample output:
EKS Auto Mode runs two c6g instances using the non-accelerated Bottlerocket AMI variant (aws-k8s-1.32-standard); these instances are CPU-only and used to run the metrics server.
Step 3: Deploy the gpt-oss-20b model using vLLM
vLLM is a high-throughput, open source inference engine optimized for large language models (LLMs). The following YAML deploys the vllm/vllm-openai:gptoss container image, which is model-agnostic. In this example, we specify openai/gpt-oss-20b as the model for vLLM to serve.
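The manifest below is a sketch of such a deployment and its ClusterIP service; the names (gptoss, gptoss-service), the single-GPU resource request, and the shared-memory sizing are assumptions that you may need to adjust for your instance type and model.

```bash
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gptoss
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gptoss
  template:
    metadata:
      labels:
        app: gptoss
    spec:
      tolerations:
        - key: nvidia.com/gpu       # matches the taint on the GPU NodePool
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm
          image: vllm/vllm-openai:gptoss
          args: ["--model", "openai/gpt-oss-20b"]
          ports:
            - containerPort: 8000   # vLLM's OpenAI-compatible API port
          resources:
            limits:
              nvidia.com/gpu: "1"   # request a single GPU
          volumeMounts:
            - name: shm
              mountPath: /dev/shm   # vLLM benefits from extra shared memory
      volumes:
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 8Gi
---
apiVersion: v1
kind: Service
metadata:
  name: gptoss-service
spec:
  selector:
    app: gptoss
  ports:
    - port: 8000
      targetPort: 8000
EOF
```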
This deployment uses a toleration for nvidia.com/gpu, matching the taint on your GPU NodePool. Initially, no GPU nodes are present, so the pod enters the Pending state. Karpenter detects the unschedulable pod and automatically provisions a GPU node. When the instance is ready, the pod is scheduled and transitions to the ContainerCreating state, at which point it begins pulling the vllm container image. When the container image is pulled and unpacked, the container enters the Running state.
Wait for the pod to show Running. To monitor the pod status, use the following:
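One way to watch it:

```bash
# Watch the pod until it reaches the Running state (Ctrl+C to stop).
kubectl get pods -w
```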
Sample output:
To check the pod events, use the following:
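For example, using the assumed app label from the manifest above:

```bash
# Show scheduling and node provisioning events for the vLLM pod.
kubectl describe pod -l app=gptoss
```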
Sample output:
It may take a few minutes for the pod to reach the Running state. In the preceding example, Karpenter provisioned the instance and scheduled the pod in under a minute. The remaining time was spent downloading the roughly 17 GB vllm/vllm-openai:gptoss image over the internet.
When the container is Running, the model weights start loading into GPU memory, which takes a few minutes. View logs to track the loading progress:
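Assuming the deployment name from the sketch above:

```bash
# Follow the vLLM logs while the model weights load into GPU memory.
kubectl logs -f deployment/gptoss
```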
When the model has loaded, you should see output similar to the following:
After applying the manifest, Karpenter provisioned a GPU instance that satisfies the constraints defined in the NodePool. To see which instance was launched, run the following command:
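One way to check, assuming Karpenter's well-known capacity-type label:

```bash
# Show each node's instance type and capacity type as extra columns.
kubectl get nodes -L node.kubernetes.io/instance-type -L karpenter.sh/capacity-type
```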
Sample output:
In this case, Karpenter determined that a g6e.xlarge Spot instance was the most cost-efficient instance type that satisfies the constraints defined in the NodePool.
Step 4: Test the model endpoint
First, set up port forwarding to the gptoss-service service using kubectl:
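Assuming the service port from the sketch above:

```bash
# Forward local port 8000 to the gptoss-service port 8000.
kubectl port-forward svc/gptoss-service 8000:8000
```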
In another terminal, send a test prompt using curl:
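The following is a sample request against vLLM's OpenAI-compatible chat completions endpoint; the prompt is arbitrary:

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "openai/gpt-oss-20b",
        "messages": [{"role": "user", "content": "In one sentence, what is Amazon EKS Auto Mode?"}],
        "max_tokens": 128
      }'
```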
Sample output:
This setup allows you to test and interact with your inference server without exposing it externally. To make it accessible to other applications or users, you can update the service type to LoadBalancer for access either from outside or from within your VPC. If you expose the service, make sure to implement appropriate access controls such as authentication, authorization, and rate limiting.
Step 5: Clean up
When you have finished your experiments, you must clean up the resources you created to avoid incurring ongoing charges. To delete the cluster and all associated resources managed by EKS Auto Mode, run the following command:
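Using the variables set at the start of the walkthrough:

```bash
# Delete the cluster and all resources EKS Auto Mode provisioned for it.
eksctl delete cluster --name "${CLUSTER_NAME}" --region "${AWS_REGION}"
```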
This command removes the entire EKS cluster along with its control plane, data plane nodes, NodePools, and all resources managed by EKS Auto Mode.
Reducing model cold start time in AI inference workloads
As you saw in the preceding section, it took a few minutes for the container image to download, the model to be fetched, and the weights to load into GPU memory. This delay is often caused by large container images (over 17 GB in this case), model downloads from external sources, and the time needed to load the model into memory, all of which add latency to pod startup and scaling events. In production scenarios, especially when running inference at scale, you need Kubernetes autoscaling and must minimize this startup time to ensure fast, responsive scaling. In this section, we walk through techniques to optimize model startup time and reduce cold start delays.
Store vLLM container image in Amazon ECR and use a VPC endpoint: Pulling container images from public registries over the internet introduces latency during pod startup, especially when images are large or network bandwidth is constrained. To reduce this overhead:
- Store your container image in Amazon Elastic Container Registry (Amazon ECR), a fully managed container registry that is regionally available and optimized for use with Amazon EKS (see the sketch after this list).
- Configure an Amazon ECR VPC endpoint so that nodes pull images over the AWS backbone rather than the public internet.
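As a sketch of the first step, the commands below mirror the public vLLM image into a private ECR repository; the repository name vllm-openai and the ACCOUNT_ID variable are assumptions.

```bash
# Resolve the current account ID (assumed variable name).
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)

# Create a private repository and authenticate Docker to ECR.
aws ecr create-repository --repository-name vllm-openai --region "${AWS_REGION}"
aws ecr get-login-password --region "${AWS_REGION}" | \
  docker login --username AWS --password-stdin "${ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com"

# Mirror the public vLLM image into the private repository.
docker pull vllm/vllm-openai:gptoss
docker tag vllm/vllm-openai:gptoss "${ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/vllm-openai:gptoss"
docker push "${ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/vllm-openai:gptoss"
```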
Prefetch model artifacts using AWS Storage options: To reduce the startup time introduced by model downloads and loading from Hugging Face, store the model artifacts in an AWS storage option that supports concurrent access and high-throughput reads across multiple nodes and AWS Availability Zones (AZs). This is essential when multiple replicas of your inference service, possibly running across different nodes or AZs, need to read the same model weights simultaneously. Shared storage avoids the need to download and store duplicate copies of the model per pod or per node.
- Amazon S3 with Mountpoint and S3 Express One Zone: Express One Zone stores data in a single AZ, although it can be accessed from other AZs in the same Region. This is the lowest-cost storage option and the simplest to set up. It is ideal for general-purpose inference workloads where performance requirements are moderate and simplicity is key. For best results, configure a VPC endpoint to make sure that traffic stays within the AWS network.
- Amazon Elastic File System (Amazon EFS): A natively multi-AZ service that automatically replicates data across AZs. Amazon EFS is an easy-to-use shared file system that offers a good balance of cost, latency, and throughput. It is suitable for workloads that need consistent access to models from multiple AZs with built-in high availability.
- Amazon FSx for Lustre: Deployed in a single AZ and accessible from other AZs within the same VPC. It delivers the highest-performance option for shared storage. Although FSx for Lustre may have a higher storage cost, its speed in loading model weights can reduce overall GPU idle time, often balancing out the cost while providing the fastest model loading performance.
Separating model artifacts from container images, storing your containers in Amazon ECR, and choosing the right storage backend for your models, such as Mountpoint for Amazon S3, Amazon EFS, or Amazon FSx for Lustre, allows you to significantly reduce startup time and improve the responsiveness of your inference workloads. To explore more strategies for optimizing container and model startup time on Amazon EKS, refer to the AI on Amazon EKS guidance.
Conclusion
Amazon EKS Auto Mode streamlines running GPU-powered AI inference workloads by handling cluster provisioning, node scaling, and GPU configuration for you. Dynamic autoscaling through Karpenter, pre-configured AMIs, and built-in GPU monitoring and recovery enable you to deploy models faster—without needing to configure or maintain the underlying infrastructure.
To further explore running inference workloads on EKS Auto Mode, the following are a few next steps:
- Learn more: Visit the EKS Auto Mode documentation for full capabilities, supported instance types, and configuration options. You can also check out the EKS Auto Mode post for a hands-on introduction.
- Get hands-on experience: Join an instructor-led AWS virtual workshop from the Amazon EKS series, featuring dedicated sessions on Auto Mode and AI inference.
- Explore best practices: Review the Amazon EKS best practices guide for AI/ML workloads.
- Plan for scale and costs: If you’re running LLMs or other high-demand GPU workloads, then connect with your AWS account team for pricing guidance, right-sizing recommendations, and planning support. AWS recently announced up to a 45% price reduction on NVIDIA GPU-accelerated instances, including the P4d, P4de, and P5, so it’s a good time to evaluate your options.
These tools and guidance allow you to run inference workloads at scale with less effort and more confidence.
About the authors
Shivam Dubey is a Specialist Solutions Architect at AWS, where he helps customers build scalable, AI-powered solutions on Amazon EKS. He is passionate about open-source technologies and their role in modern cloud-native architectures. Outside of work, Shivam enjoys hiking, visiting national parks, and exploring new genres of music.
Bharath Gajendran is a Technical Account Manager at AWS, where he empowers customers to design and operate highly scalable, cost-effective, and fault-tolerant workloads utilizing AWS. He is passionate about Amazon EKS and open-source technologies, and specializes in enabling organizations to run and scale AI workloads on EKS.
Christina Andonov is a Sr. Specialist Solutions Architect at AWS, helping customers run AI workloads on Amazon EKS with open source tools. She’s passionate about Kubernetes and known for making complex concepts easy to understand.