Delivering video content with fractional GPUs in containers on Amazon EKS

Video encoding and transcoding are critical workloads for media and entertainment companies. Delivering high-quality video content to viewers across devices and networks requires efficient and scalable encoding infrastructure. As video resolutions continue to increase to 4K and 8K, GPU acceleration is essential for real-time encoding workflows where parallel encoding tasks are necessary. Although encoding on the CPU is possible, it is better suited to smaller-scale, sequential encoding tasks or cases where encoding speed is less of a concern. AWS offers GPU instance families, such as G4dn, G5, and G5g, which are well suited for these real-time encoding workloads.

Modern GPUs offer users thousands of shading units and the ability to process billions of pixels per second. Running a single encoding job on the GPU often leaves resources under-used, which presents an optimization opportunity. By running multiple processes simultaneously on a single GPU, processes can be bin-packed and use a fraction of the GPU. This practice is known as fractionalization.

This post explores how to build a video encoding pipeline on AWS that uses fractional GPUs in containers using Amazon Elastic Kubernetes Service (Amazon EKS). By splitting the GPU into fractions, multiple encoding jobs can share the GPU concurrently. This improves resource use and lowers costs. This post also looks at using Bottlerocket and Karpenter to achieve fast scaling of heterogeneous encoding capacity. With Bottlerocket’s image caching capabilities, new instances can start up rapidly to handle spikes in demand. By combining fractional GPUs, containers, and Bottlerocket on AWS, media companies can achieve the performance, efficiency, and scale they need for delivering high-quality video streams to viewers.

The examples in this post are using the following software versions:

  • Amazon EKS version 1.28
  • Karpenter version 0.33.0
  • Bottlerocket version 1.16.0

To view and deploy the full example, see the GitHub repository.

Configuring GPU time-slicing in Amazon EKS

The concept of sharing or time-slicing a GPU is not new. To achieve maximum use, multiple processes can be run on the same physical GPU. By using as much of the available GPU capacity as possible, the cost per streaming session decreases. Therefore, the density – the number of simultaneous transcoding or encoding processes – is an important dimension for cost-effective media-streaming.

With the popularity of Kubernetes, GPU vendors have invested heavily in developing plugins to make this process easier. Some of the benefits of using Kubernetes over running processes directly on the virtual machine (VM) are:

  • Resiliency – By using Kubernetes DaemonSets and Deployments, you can rely on Kubernetes to automatically restart any crashed or failed tasks.
  • Security – Network policies can be defined to prevent inter-pod communication. Kubernetes namespaces can also be used to provide further isolation, which is useful in multi-tenant environments for software vendors and Software-as-a-Service (SaaS) providers.
  • Elasticity – Kubernetes Deployments allow you to easily scale out and scale in based on changing traffic volumes. Event-driven autoscaling, such as with KEDA, allows for responsive provisioning of additional resources. Tools such as the Cluster Autoscaler and Karpenter automatically provision compute capacity based on resource use.

A device plugin is needed to expose GPU resources to Kubernetes; its primary job is to make the details of the available GPUs visible to the scheduler. Multiple plugins are available for allocating fractions of a GPU in Kubernetes. In this post, the NVIDIA device plugin for Kubernetes is used, as it provides a lightweight mechanism to expose the GPU resources. As of version 0.12.0, this plugin supports time-slicing. Additional wrappers for the device plugin are available, such as the NVIDIA GPU Operator for Kubernetes, which provide further management and monitoring capabilities if needed.
To configure the NVIDIA device plugin with time-slicing, follow these steps.

Remove any existing NVIDIA device plugin from the cluster:

kubectl delete daemonset nvidia-device-plugin-daemonset -n kube-system

Next, create a ConfigMap to define how many “slices” to split the GPU into. The number of slices needed can be calculated by reviewing the GPU use for a single task. For example, if your workload uses at most 10% of the available GPU, you could split the GPU into 10 slices. This is shown in the following example config:

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config-all
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 10

Apply the ConfigMap to the cluster:

kubectl create -n kube-system -f time-slicing-config-all.yaml

Finally, deploy the latest version of the plugin, using the created ConfigMap:

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --version=0.14.1 \
    --namespace kube-system \
    --create-namespace \
    --set config.name=time-slicing-config-all

If the nodes in the cluster are inspected, they show an updated GPU resource count, despite each node having only one physical GPU:

Capacity:
  cpu:                8
  ephemeral-storage:  104845292Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             32386544Ki
  nvidia.com/gpu:     10
  pods:               29
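
Each of these advertised slices is consumed in the same way as a full GPU: a pod simply requests the nvidia.com/gpu resource. The following is a minimal sketch (the pod name and container image are placeholders); with the ConfigMap above, up to 10 such pods can be bin-packed onto a single physical GPU:

apiVersion: v1
kind: Pod
metadata:
  name: encoder-example            # placeholder name
spec:
  containers:
    - name: encoder
      image: <your-ffmpeg-image>   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1        # one slice of the time-sliced GPU

Note that time-slicing provides no memory or fault isolation between the pods sharing a GPU, which is why oversubscribing the GPU eventually leads to the CUDA memory errors described later in this post.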

As the goal is to bin-pack as many tasks onto a single GPU as possible, the Max Pods limit is likely to be hit next. On the machine used in this post (g4dn.2xlarge), the default max pods value is 29. For testing purposes, this is increased to 110 pods, which is the maximum recommended for nodes smaller than 32 vCPUs. To increase this, the following steps need to be followed.
Pass the max-pods flag to the kubelet in the node bootstrap script:

/etc/eks/bootstrap.sh my-cluster --use-max-pods false --kubelet-extra-args '--max-pods=110'

When using Karpenter for auto-scaling, the NodePool resource definition passes this configuration to new nodes:

kubelet:
    maxPods: 110 

The number of pods is now limited by the maximum number of Elastic Network Interfaces (ENIs) and IP addresses per interface. See the ENI documentation for the limits for each instance type. The formula is: Number of ENIs * (Number of IPv4 addresses per ENI - 1) + 2. To increase the max pods per node beyond this, prefix delegation must be used. This is configured using the following command:

kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true

For more details on prefix delegation, see Amazon VPC CNI increases pods per node limits.
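
The default max pods value for a given instance type can be derived from these limits. As a quick sketch of the calculation, the ENI and IPv4-per-ENI limits can be queried with the AWS CLI (shown here for the g4dn.2xlarge used in this post):

# Query the ENI and IPv4-per-ENI limits for the instance type
aws ec2 describe-instance-types \
    --instance-types g4dn.2xlarge \
    --query "InstanceTypes[0].NetworkInfo.[MaximumNetworkInterfaces,Ipv4AddressesPerInterface]" \
    --output text
# Returns 3 and 10, so: 3 * (10 - 1) + 2 = 29 pods,
# matching the default of 29 noted earlier for the g4dn.2xlarge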

Amazon EC2 instance type session density

The next decision is which instance type to use. GPU instances are often in high demand because of their use in both media and machine learning (ML) workloads. It is a best practice to diversify across as many instance types as possible, in all the Availability Zones (AZs) in an AWS Region.

At the time of writing, the three current-generation NVIDIA GPU-powered instance families most used for media workloads are G4dn, G5, and G5g. The latter uses the arm64 CPU architecture with AWS Graviton2 processors.

The examples in this post use 1080p25 (1080p resolution at 25 frames per second) as the frame-rate profile. If you are using a different resolution or frame rate, your results will vary. To test this, ffmpeg was run in the container using H.264 hardware encoding with CUDA and the following arguments:

ffmpeg -nostdin -y -re -vsync 0 -c:v h264_cuvid -hwaccel cuda -i <input_file> -c:v h264_nvenc -preset p1 -profile:v baseline -b:v 5M -an -f rtp -payload_type 98 rtp://192.168.58.252:5000?pkt_size=1316

The key options used in this example are as follows, and you may want to change these based on your requirements:

  • `-re`: Read input at the native frame rate. This is particularly useful for real-time streaming scenarios.
  • `-c:v h264_cuvid`: Use NVIDIA CUVID for decoding.
  • `-hwaccel cuda`: Specify CUDA as the hardware acceleration API.
  • `-c:v h264_nvenc`: Use NVIDIA NVENC for video encoding.
  • `-preset p1`: Set the encoding preset to “p1” (you might want to adjust this based on your requirements).
  • `-profile:v baseline`: Set the H.264 profile to baseline.
  • `-b:v 5M`: Set the video bitrate to 5 Mbps.

To view the full deployment definition, see the GitHub repository. All instances were using NVIDIA driver version 535 and CUDA version 12.2. Then, the output was monitored on a remote instance using the following command:

ffmpeg -protocol_whitelist file,crypto,udp,rtp -i input.sdp -f null -
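
To test different session densities, the number of concurrent encoding jobs can be varied by scaling the encoder Deployment. A minimal sketch, assuming the Deployment created from the repository's definition is named ffmpeg (the name is a placeholder):

kubectl scale deployment ffmpeg --replicas=28

The following table shows the average FPS observed on each instance type as the number of concurrent sessions increases.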

| Concurrent Sessions | Average g4dn.2xlarge FPS | Average g5g.2xlarge FPS | Average g5.2xlarge FPS |
| --- | --- | --- | --- |
| 26 | 25 | 25 | 25 |
| 27 | 25 | 25 | 25 |
| 28 | 25 | 25 | 25 |
| 29 | 23 | 24 | 25 |
| 30 | 23 | 24 | 25 |
| 31 | 23 | 23 | 25 |
| 32 | 22 | 23 | 24 |
| 33 | 22 | 21 | 23 |
| 35 | 21 | 20 | 22 |
| 40 | 19 | 19 | 19 |
| 50 | 12 | 12 | 15 |

The maximum number of concurrent sessions at which the desired frame rate was consistently achieved was 28 on the g4dn.2xlarge and g5g.2xlarge, and 31 on the g5.2xlarge.

G4dn.2xlarge
The T4 GPU in the g4dn instance has a single encoder, which means the encoder consistently reaches capacity at around 28 concurrent jobs. On a 2xlarge, there is still spare VRAM, CPU, and memory available at this density. This spare capacity could be used to encode additional sessions on the CPU, run the application pods, or the instance could be scaled down to a smaller size. Besides monitoring the FPS, the stream can be monitored manually using ffplay or VLC. Note that although additional sessions can be run beyond the preceding numbers, frame rate drops become more common. Eventually, the GPU becomes saturated and CUDA memory exceptions are thrown, causing the container to crash and restart. The following stream quality was observed when manually watching the stream through VLC:

  • 25-28 sessions – high quality, minimal drops in frame rate, optimal viewing experience
  • >=30 sessions – some noticeable drops in frame rate and resolution.
  • >=50 sessions – frequent stutters and heavy artifacts, mostly unwatchable (at this density, CPU, memory, and network could all become bottlenecks)

G5g.2xlarge
The Graviton-based instance performs nearly identically to the G4dn. This is expected, as the T4g GPU in the G5g instance has similar specifications to the T4 GPU. The key difference is that the G5g uses Arm-based AWS Graviton2 processors instead of x86, which gives the G5g instances approximately 25% better price/performance than the equivalent G4dn. When deploying ffmpeg in a containerized environment, multi-arch container images can be built to target both the x86 and ARM architectures. Hardware encoding with H.264 and CUDA works well using cross-compiled libraries for ARM.

G5.2xlarge
The G5 instances use the newer A10G GPU. This adds an additional 8 GB of VRAM and doubles the memory bandwidth compared to the T4, up to 600 GB/s, and moves to PCIe Gen4. This means it can produce lower-latency, higher-resolution video. However, it still has a single encoder, so when running concurrent encoding jobs the bottleneck remains the encoder capacity. The higher memory bandwidth allows a couple of extra concurrent sessions, but the density that can be achieved is similar. It is, however, possible to achieve the same density at a slightly higher frame rate or resolution.
The cost per session for each instance is shown in the following table (based on On-Demand pricing in the us-east-1 Region):

| Instance Type | Cost per hour ($) | Max sessions at 1080p25 | Cost per session per hour ($) |
| --- | --- | --- | --- |
| G4dn | 0.752 | 28 | 0.027 |
| G5 | 1.212 | 31 | 0.039 |
| G5g | 0.556 | 28 | 0.02 |

By mixing different instance families and sizes, and deploying across all AZs in a Region or across multiple Regions, you can improve resiliency and scalability. This also allows you to unlock the maximum Spot discount by using the price-capacity-optimized allocation strategy, provided your application can gracefully handle Spot interruptions.

Horizontal node auto-scaling

As media-streaming workloads fluctuate with viewing habits, it’s important to have elastically scalable rendering capacity. The more responsively additional compute capacity can be provisioned, the better the user experience. This also optimizes the cost by reducing the need to provision for peak. Note that this section explores scaling of the underlying compute resources, not auto-scaling the workloads themselves. The latter is covered in the Horizontal Pod Autoscaler documentation.

Container images that need video drivers or frameworks are often large, typically ranging from 500 MiB to over 3 GiB. Fetching these large container images over the network can be time intensive, which impairs the ability to scale responsively to sudden changes in activity.

There are some tools that can be leveraged to make scaling more responsive:

  • Karpenter – Karpenter allows for scaling using heterogeneous instance types. This means pools of G4dn, G5, and G5g instances can be used, with Karpenter picking the most cost-effective option to place the pending pods.
    • As the resource type used by the device plugin presents as a standard GPU resource, Karpenter can scale based on this resource.
    • At the time of writing, Karpenter does not support scaling based on custom resources. New nodes are initially launched with the default of one GPU resource until the node is labelled by the device plugin, so during spikes in scaling, nodes may be over-provisioned until Karpenter reconciles the workload.
  • Bottlerocket – Bottlerocket is a minimal container OS that contains only the software needed to run container images. Due to this smaller footprint, Bottlerocket nodes can start faster than general-purpose Linux distributions in some scenarios; see the following table for a comparison:

| Stage | General-purpose Linux elapsed time, g4dn.xlarge (s) | Bottlerocket elapsed time, g4dn.xlarge (s) | Bottlerocket elapsed time, g5g.xlarge (s) |
| --- | --- | --- | --- |
| Instance Launch | 0 | 0 | 0 |
| Kubelet Starting | 33.36 | 17.5 | 16.54 |
| Kubelet Started | 37.36 | 21.25 | 19.85 |
| Node Ready | 51.71 | 34.19 | 32.38 |

By using a multi-arch container build, multiple Amazon Elastic Compute Cloud (Amazon EC2) instance types can be targeted using the same NodePool configuration in Karpenter. This allows for cost-effective scaling of resources. The example workload was built using the following command:

docker buildx build --platform "linux/amd64,linux/arm64" --tag ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_DEFAULT_REGION}.amazonaws.com/ffmpeg:1.0 --push  . -f Dockerfile

This allows for a NodePool defined in Karpenter as follows:

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
        - key: "node.kubernetes.io/instance-type"
          operator: In
          values: ["g5g.2xlarge", "g4dn.2xlarge", "g5.2xlarge"]
      nodeClassRef:
        name: default
      kubelet:
        maxPods: 110

In this NodePool, all three instance types are available for Karpenter to use. Karpenter can choose the most efficient option, regardless of processor architecture, because the deployment uses the multi-arch image built previously. The capacity type includes Spot Instances to reduce cost; if the workload cannot tolerate interruptions, then spot can be removed so that only on-demand instances are provisioned.

This would work with any supported operating system. To make Karpenter use Bottlerocket-based Amazon Machine Images (AMIs), the corresponding EC2NodeClass is defined as follows:

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: Bottlerocket
...

This automatically selects the latest AMI in the specified family, in this case Bottlerocket. For the full example and more details on this configuration, see the example in the GitHub repository.
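
As mentioned in the introduction, Bottlerocket’s image caching capabilities can further reduce scale-out time. One approach is to pre-pull the encoder images onto a Bottlerocket data volume, create an Amazon EBS snapshot of that volume, and have new nodes restore it at launch. The following is a minimal sketch of the corresponding EC2NodeClass block device mappings; the snapshot ID and volume sizes are placeholders and assume such a snapshot has already been prepared:

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: Bottlerocket
  blockDeviceMappings:
    # Bottlerocket OS volume
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 4Gi
        volumeType: gp3
    # Bottlerocket data volume that stores container images,
    # restored from a snapshot containing the pre-pulled encoder images
    - deviceName: /dev/xvdb
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        snapshotID: snap-0123456789abcdef0   # placeholder snapshot ID

Nodes launched from this configuration start with the encoder images already present on the data volume, so large images do not have to be pulled over the network during a scaling event.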

Conclusion

By leveraging fractional GPUs, container orchestration, and purpose-built OS and instance types, media companies can achieve up to 95% better price-performance. The techniques covered in this post showcase how AWS infrastructure can be tailored to deliver high density video encoding at scale. With thoughtful architecture decisions, organizations can future-proof their workflows and provide exceptional viewing experiences as video continues evolving to higher resolutions and bitrates.

To start optimizing your workloads, experiment with different instance types and OS options such as Bottlerocket. Monitor performance and cost savings as you scale out encoding capacity. Use AWS’s flexibility and purpose-built tools to future-proof your video pipeline today.

Josh Hart

Josh Hart is a Senior Solutions Architect at Amazon Web Services. He works with ISV customers in the UK to help them build and modernize their SaaS applications on AWS.

Evan Statton

Evan Statton is an Enterprise Technologist on the Media & Entertainment, Games, and Sports team at AWS. He has spent the last 20 years inventing the future of media technology with some of the largest media companies in the world. Evan works closely with Solutions Architects and service teams all over the world whose work impacts customers in all facets of media and entertainment.