Containers

Utilizing NVIDIA Multi-Instance GPU (MIG) in Amazon EC2 P4d Instances on Amazon Elastic Kubernetes Service (EKS)

In November 2020, AWS released the Amazon EC2 P4d instances, which deliver the highest performance for machine learning (ML) training and high performance computing (HPC) applications in the cloud. Each p4d.24xlarge instance comes with the following characteristics:

  • Eight NVIDIA A100 Tensor Core GPUs
  • 96 vCPUs
  • 1.1 TB of RAM
  • 400 Gbps Elastic Fabric Adapter (EFA) networking with support for GPUDirect RDMA

One of the primary benefits of AWS is elasticity: you can scale workloads according to demand, with increased compute utilization triggering additional scale. With P4d instances, you can now also reshape compute resources by partitioning each NVIDIA GPU into additional slices for various workloads, a feature called Multi-Instance GPU (MIG).

With MIG, you can partition a GPU into instances with dedicated streaming multiprocessors and isolated memory, based on a set of predefined profiles. With this option, you can dispatch multiple diverse workloads, none of which requires the memory footprint of a full GPU, onto the same GPU without performance interference.

Scheduling workloads on these slices, while elastically scaling the nodes through Amazon EC2 Auto Scaling, allows you to reshape compute as you scale. With MIG, EC2 P4d instances can be used for scalable, mixed-topology workloads. This post walks through an example of running an ML inferencing workload with and without MIG on Amazon Elastic Kubernetes Service (Amazon EKS).

MIG Profiles

Each GPU in the P4d instance supports several MIG profiles. Recall that each p4d.24xlarge comes with eight NVIDIA A100s, and each A100 can be split into up to seven 5 GB slices, so a single node can expose up to 56 accelerators. By spreading requests across all 56 GPU slices, you can run many diverse workloads per node. The following nvidia-smi output shows the available profiles per A100 GPU.

$ sudo nvidia-smi mig -lgip
+--------------------------------------------------------------------------+
| GPU instance profiles:                                                   |
| GPU   Name          ID    Instances   Memory     P2P    SM    DEC   ENC  |
|                           Free/Total   GiB              CE    JPEG  OFA  |
|==========================================================================|
|   0  MIG 1g.5gb     19     7/7        4.95       No     14     0     0   |
|                                                          1     0     0   |
+--------------------------------------------------------------------------+
|   0  MIG 2g.10gb    14     3/3        9.90       No     28     1     0   |
|                                                          2     0     0   |
+--------------------------------------------------------------------------+
|   0  MIG 3g.20gb     9     2/2        19.79      No     42     2     0   |
|                                                          3     0     0   |
+--------------------------------------------------------------------------+
|   0  MIG 4g.20gb     5     1/1        19.79      No     56     2     0   |
|                                                          4     0     0   |
+--------------------------------------------------------------------------+
|   0  MIG 7g.40gb     0     1/1        39.59      No     98     5     0   |
|                                                          7     1     1   |
+--------------------------------------------------------------------------+

As an added feature, you can mix multiple profiles on the same GPU for further reshaping and scheduling flexibility. For the rest of this post, I refer to each MIG profile by its profile ID (the third column in the preceding output) for simplicity.
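As an illustration of mixing profiles (this is not the partition scheme used later in this post), you could carve a single A100 into one 3g.20gb, one 2g.10gb, and one 1g.5gb slice directly with nvidia-smi:

# Enable MIG mode on GPU 0 (requires that no processes are running on the GPU)
sudo nvidia-smi -i 0 -mig 1
# Create GPU instances for profile IDs 9 (3g.20gb), 14 (2g.10gb), and 19 (1g.5gb)
# on GPU 0, along with their corresponding compute instances (-C)
sudo nvidia-smi mig -i 0 -cgi 9,14,19 -C
# List the resulting GPU instances
sudo nvidia-smi mig -i 0 -lgi

In the EKS deployment that follows, this partitioning is automated through a systemd service at instance launch instead of being run by hand.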

Deployment on EKS

Amazon EC2 P4d instances are supported on EKS, so with a few configuration changes you can deploy a node group on which to schedule jobs. In the example here, I use Argo Workflows on top of EKS with MIG to show how you can quickly run DAG workflows that use MIG slices in the backend. The configuration changes can be found in the aws-samples/aws-efa-nccl-baseami-pipeline GitHub repository. The repository requires Packer; if you build the components of the Packer script and save the result as an Amazon Machine Image (AMI), these changes are available by default, as sketched below.
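A rough sketch of that build step, assuming Packer is already installed; the template file name is left as a placeholder because it is specific to the repository:

# Clone the sample pipeline and build the AMI with Packer
# (<packer-template> is a placeholder; use the template shipped in the repository)
git clone https://github.com/aws-samples/aws-efa-nccl-baseami-pipeline.git
cd aws-efa-nccl-baseami-pipeline
packer build <packer-template>.json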

Step 1. Start an EKS cluster with the following command:

eksctl create cluster --name=${cluster_name} \
                      --region=us-west-2 \
                      --ssh-access --ssh-public-key ~/.ssh/id_rsa.pub \
                      --without-nodegroup

Step 2. Next, create a managed node group with a p4d.24xlarge node:

eksctl create nodegroup --cluster adorable-rainbow-1613757615 \
                        --name p4d-mig --nodes 1 --ssh-access \
                        --instance-types p4d.24xlarge \
                        --full-ecr-access --managed
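Before moving on, confirm that the P4d node has joined the cluster and is in the Ready state. A quick check, using the standard instance-type node label:

# List the p4d.24xlarge nodes in the cluster and confirm they report Ready
kubectl get nodes -l node.kubernetes.io/instance-type=p4d.24xlarge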

It is important to note that MIG is disabled by default when launching a P4d instance. The AMI therefore includes a systemd service that enables MIG and sets up a default partition scheme. The following unit file defines that service; it starts before the nvidia-fabricmanager service unit in the systemd chain.

[Unit]
Description=Create a default MIG configuration
Before=nvidia-fabricmanager.service
Requires=nvidia-persistenced.service
After=nvidia-persistenced.service

[Service]
Type=oneshot
EnvironmentFile=/etc/default/mig
RemainAfterExit=yes
ExecStartPre=/bin/nvidia-smi -mig 1
ExecStart=/opt/mig/create_mig.sh $MIG_PARTITION
TimeoutStartSec=0

[Install]
WantedBy=multi-user.target

The environment file /etc/default/mig defines the $MIG_PARTITION variable that is passed to the script /opt/mig/create_mig.sh:

#!/bin/bash -xe
# Create GPU instances from the comma-separated profile IDs in $1 and the
# matching compute instances (-C) on every GPU
nvidia-smi mig -cgi $1 -C

The $MIG_PARTITION value is set by the user data in our AWS Launch Template (LT). By iterating over launch template versions, you can create LTs with different MIG partition profiles. The following example creates seven slices of the 5 GB A100 profile (profile ID 19) on each GPU.

--==BOUNDARY==
Content-Type: text/x-shellscript; charset="us-ascii"

#!/bin/bash
set -o xtrace
echo -e "MIG_PARTITION=19,19,19,19,19,19,19" >> /etc/default/mig
systemctl start aws-gpu-mig.service

--==BOUNDARY==
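To confirm the service did its job, you can SSH to the node and check that MIG mode is enabled and that the expected instances exist. A minimal sanity check, assuming the default partition of seven 1g.5gb slices per GPU:

# Confirm MIG mode is enabled on all eight GPUs
nvidia-smi --query-gpu=index,mig.mode.current --format=csv
# Count the GPU instances created by the oneshot service (expect 8 x 7 = 56)
sudo nvidia-smi mig -lgi | grep -c "MIG 1g.5gb"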

Step 3. Once the EKS cluster is running and the node group is created with its nodes in the Ready state, you can install the NVIDIA device plugin and GPU feature discovery for Kubernetes through Helm.

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo add nvgfd https://nvidia.github.io/gpu-feature-discovery
helm repo update 

Now, verify that the repositories were added and that the latest versions of the nvidia-device-plugin and gpu-feature-discovery charts are available.

helm search repo nvdp --devel
NAME                       CHART VERSION APP VERSION DESCRIPTION 
nvdp/nvidia-device-plugin  0.8.2         0.8.2 A Helm chart for the nvidia...
 
helm search repo nvgfd --devel
NAME                        CHART VERSION APP VERSION DESCRIPTION
nvgfd/gpu-feature-discovery 0.4.1         0.4.1       A Helm chart for gpu-feature-...

You can set the MIG strategy to mixed, which allows you to address each MIG GPU slice as its own Kubernetes resource type.
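The helm install commands below read the strategy from the ${MIG_STRATEGY} shell variable, so set it first:

# The device plugin's migStrategy accepts "none", "single", or "mixed"
export MIG_STRATEGY=mixed

Then install the plugins: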

helm install --version=0.8.2 --generate-name --set migStrategy=${MIG_STRATEGY} nvdp/nvidia-device-plugin
helm install --version=0.4.1 --generate-name --set migStrategy=${MIG_STRATEGY} nvgfd/gpu-feature-discovery
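A quick way to confirm that both charts deployed their DaemonSets (the namespace depends on how you ran helm install):

kubectl get daemonsets -A | grep -E "nvidia-device-plugin|gpu-feature-discovery"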

Step 4. After a few minutes, kubectl describe node should report the 56 GPU slices, which can be used for allocation.

Capacity:
  attachable-volumes-aws-ebs:  39
  cpu:                         96
  ephemeral-storage:           104845292Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               10562Mi
  memory:                      1176334124Ki
  nvidia.com/mig-1g.5gb:       56
  pods:                        737
Allocatable:
  attachable-volumes-aws-ebs:  39
  cpu:                         95690m
  ephemeral-storage:           95551679124
  hugepages-1Gi:               0
  hugepages-2Mi:               10562Mi
  memory:                      1156853548Ki
  nvidia.com/mig-1g.5gb:       56
  pods:                        737

Step 5. Argo Deployment and Testing

With the base cluster in place, you can go ahead and deploy Argo Workflows and run through a few tests. Argo Workflows is a workflow engine for Kubernetes purpose-built for orchestrating parallel jobs. In this example, I use the idealo image-super-resolution workload, an ML inferencing example that performs GAN-based image upscaling.

kubectl create namespace argo
kubectl apply -n argo -f https://raw.githubusercontent.com/argoproj/argo-workflows/stable/manifests/install.yaml
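Step 6 below also uses the argo command line client, which is distributed as a gzipped binary on the argoproj/argo-workflows GitHub releases page. A sketch for a Linux x86_64 workstation, with the release tag chosen here only as an example:

# Download, unpack, and install the argo CLI (pick the tag that matches your server install)
curl -sLO https://github.com/argoproj/argo-workflows/releases/download/v3.0.2/argo-linux-amd64.gz
gunzip argo-linux-amd64.gz
chmod +x argo-linux-amd64
sudo mv argo-linux-amd64 /usr/local/bin/argo
argo version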

After deploying the Argo Workflows components, you can submit the example workflow below. This directed acyclic graph (DAG) generates a loop launching a variable number of ML upscaling jobs and schedules each of them on a single MIG slice.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: super-resolution-example-
spec:
  entrypoint: super-resolution-result-example
  templates:
  - name: super-resolution-result-example
    steps:
    - - name: generate
        template: gen-number-list
    # Iterate over the list of numbers generated by the preceding generate step
    - - name: super-resolution-mig
        template: super-resolution-mig
        arguments:
          parameters:
          - name: super-resolution-mig
            value: "{{item}}"
        withParam: "{{steps.generate.outputs.result}}"

  # Generate a list of numbers in JSON format
  - name: gen-number-list
    script:
      image: python:alpine3.6
      command: [python]
      source: |
        import json
        import sys
        json.dump([i for i in range(0, 56)], sys.stdout)

  - name: super-resolution-mig
    retryStrategy:
      limit: 10
      retryPolicy: "Always"
    inputs:
      parameters:
      - name: super-resolution-mig
    container:
      image: 231748552833.dkr.ecr.us-east-1.amazonaws.com/super-res-gpu:latest
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1
      workingDir: /root/image-super-resolution
      command: ["python"]
      args: ["super-resolution-predict.py"]

This workflow includes a resource limit that tells the Kubernetes scheduler to place each pod onto an instance that can fulfill the request, i.e., the P4d, and to allocate a single 5 GB MIG slice to each super-resolution step.

Step 6. Submit the job.

argo submit super-res-5g.argo --watch

The loop expands to one job per member of the range. You can see that all 56 GPU slices are allocated by running kubectl describe node, as shown in the following code block:

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests    Limits
  --------                    --------    ------
  cpu                         310m (0%)   0 (0%)
  memory                      140Mi (0%)  340Mi (0%)
  ephemeral-storage           0 (0%)      0 (0%)
  attachable-volumes-aws-ebs  0           0
  nvidia.com/mig-1g.5gb       56          56

Check the status of the workflow:

ServiceAccount: default
Status: Succeeded
Conditions: 
 Completed True
Created: Tue Jan 05 13:57:35 -0500
Started: Tue Jan 05 13:57:35 -0500
Finished: Tue Jan 05 13:59:02 -0500
Duration: 1 minute 27 seconds
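The same summary, along with the per-step container logs, can also be pulled back after the fact with the argo CLI, for example:

# Show the most recently submitted workflow and tail its logs
argo get @latest
argo logs @latest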

Including the workflow and job overhead, all 56 jobs complete in about 1 minute 27 seconds, compared with about four minutes for the same workflow under whole-GPU allocation through the standard nvidia-device-plugin. In the whole-GPU case, each full GPU is allocated to a single job and is blocked from scheduling further jobs until the eight in flight complete, regardless of whether the GPUs are fully utilized; this highlights one of the benefits of MIG.

Cleanup

To clean up the deployment, use eksctl to delete the cluster:

eksctl delete cluster --name <cluster-name>

Conclusion

With NVIDIA Multi-Instance GPU (MIG) on P4d instances and Amazon Elastic Kubernetes Service, it is now possible to execute large-scale, disparate inferencing workloads that handle multiple requests from a single endpoint. With MIG on P4d, you can have up to 56 individual accelerators per P4d instance, improving utilization in multiuser and/or multi-request architectures. We are excited to see what our customers build with MIG on P4d.

Amr Ragab

Sr. Solutions Architect | EC2 Accelerated Platforms for AWS, devoted to helping customers run computational workloads at scale. In my spare time I like traveling and finding new ways to integrate technology into daily life.