
Bottlerocket support for NVIDIA GPUs

Today, we are happy to announce that Bottlerocket, a Linux-based, open-source, container-optimized operating system, now supports NVIDIA GPUs for accelerated computing workloads. You can now use NVIDIA-based Amazon Elastic Compute Cloud (EC2) instance types with Bottlerocket to accelerate your machine learning (ML), artificial intelligence (AI), and similar workloads that require GPU compute devices.

This release includes a new NVIDIA Bottlerocket variant for Amazon Elastic Kubernetes Service (Amazon EKS). The variant comes with the GPU drivers pre-installed and configured for the containerd runtime. You don’t have to install or configure the GPU driver, or run the k8s-device-plugin, because all of the libraries and kernel modules are already available inside the image. By including the driver directly in the AMI, you can speed up the provisioning time of a GPU-based EC2 instance, avoid external dependencies, and reduce device and kernel compatibility errors.

The NVIDIA Bottlerocket variant supports self-managed node groups on Amazon EKS and the Karpenter node autoscaler. You can also use the provided AMIs with custom provisioning tools or community tools, such as kops, for any Kubernetes cluster running on EC2 instances.
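
If you’re provisioning instances yourself, you can look up the latest NVIDIA variant AMI through Bottlerocket’s public SSM parameters. The following lookup is a minimal sketch that assumes a Kubernetes 1.21 cluster, x86_64 nodes, and the us-west-2 Region; swap in your own Kubernetes version, architecture, and Region:

# Look up the latest Bottlerocket NVIDIA variant AMI ID (Kubernetes 1.21, x86_64)
aws ssm get-parameter --region us-west-2 \
    --name "/aws/service/bottlerocket/aws-k8s-1.21-nvidia/x86_64/latest/image_id" \
    --query Parameter.Value --output text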

Let’s see how you can create your first Amazon EKS cluster with NVIDIA GPU instances using Bottlerocket as the node operating system.

Create a cluster

We’ll use eksctl, the official Amazon EKS command line interface, to create our example cluster. You need eksctl 0.86.0 or newer to use the new Bottlerocket NVIDIA variant.
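
You can check which version you have installed with:

eksctl version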

In this example, we will create a cluster called “br-gpu” with an Amazon EC2 G4dn node group powered by NVIDIA T4 Tensor Core GPUs. We use the Bottlerocket AMI family, which automatically selects the correct Bottlerocket variant for the GPU instance type. The NVIDIA Bottlerocket variant supports both x86_64 and arm64 instance types, so make sure your container images are built for the architecture you’ll be using.

eksctl create cluster --name br-gpu \
  --node-type g4dn.xlarge \
  --node-ami-family Bottlerocket \
  --managed=false
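
The same command works for arm64 nodes. As a sketch, assuming G5g instances (Graviton2 processors with NVIDIA T4G Tensor Core GPUs) are available in your Region, you could run:

# Create an arm64 GPU node group instead (hypothetical cluster name)
eksctl create cluster --name br-gpu-arm \
  --node-type g5g.xlarge \
  --node-ami-family Bottlerocket \
  --managed=false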

Once the cluster is created, you can see the nodes in the cluster with:

kubectl get no -L node.kubernetes.io/instance-type

NAME                                          STATUS   ROLES    AGE   VERSION   INSTANCE-TYPE
ip-192-168-27-24.us-west-2.compute.internal   Ready    <none>   18m   v1.21.9   g4dn.xlarge
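
You can also confirm that the node advertises the nvidia.com/gpu resource, which the Kubernetes scheduler uses to place GPU workloads:

# The Capacity and Allocatable sections should list nvidia.com/gpu
kubectl describe nodes | grep -i "nvidia.com/gpu"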

Deploy a GPU-accelerated workload

Now that we have a node with a GPU attached, we can deploy our first GPU workload and request the GPU resource for the container.

cat << EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
spec:
  restartPolicy: OnFailure
  containers:
  - name: vectoradd
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.1
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

This pod runs the nvidia-smi command so we can see which NVIDIA GPUs are available inside the pod.

Once the pod has run to completion, you can see the output with:

kubectl logs nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   27C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

You’ll notice that the 470.X driver and an NVIDIA T4 GPU are available to the container.

You are now ready to run any of your GPU-accelerated workloads on Kubernetes with Bottlerocket. If you’re looking for more example workloads, you can check out the NVIDIA GPU-optimized containers in the NVIDIA NGC Catalog on AWS Marketplace.
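
For a quick end-to-end check, the CUDA sample image used above runs a vector addition test as its default command, so you can drop the command override. This is a sketch reusing the same image and resource limit:

cat << EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: vectoradd
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.1
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

You can then check the result with kubectl logs vectoradd.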

Delete the cluster

To delete the cluster and all provisioned EC2 instances, run:

eksctl delete cluster --name br-gpu

Learn more

Now is a great time to use Bottlerocket with your AI/ML workloads. The new Bottlerocket NVIDIA variant helps you run GPU-accelerated workloads quickly and securely. A minimal operating system that includes the required drivers and libraries reduces configuration and compatibility issues. Integrated drivers also enable seamless operating system updates and improve provisioning time.

We will support Amazon EKS managed node groups and Amazon Elastic Container Service (ECS) in a future update. We want to hear your feedback on use cases with Bottlerocket and NVIDIA GPUs. Let us know what workloads you would like to run and how Bottlerocket can help secure them in the Bottlerocket GitHub repo.

Justin Garrison

Justin Garrison is a Sr Developer Advocate on the AWS containers team. He is a longtime open source contributor and cares deeply about open communities. Before AWS, Justin built infrastructure for Disney+ and animated movies such as Frozen II and Moana. You can reach him on Twitter via @rothgar.