
Enhancing Kubernetes workload isolation and security using Kata Containers

Containers have become the dominant method for deploying and managing applications in recent years. Their widespread adoption is attributed to numerous advantages, such as isolation, efficient hardware use, scalability, and portability. In situations where resource isolation is critical for system security, many users are forced to rely on virtual machines (VMs) to mitigate the impact of a compromised container on the host or other containers sharing the host.

In a recent user engagement, we encountered a use case where the team needed to guarantee the tamper-proof nature of their containers. Specifically, they needed to compile their code and cryptographically sign it using a highly secure key. It was imperative to prevent unauthorized access to this key during the build process, making sure that other containers running on the same node could not compromise or extract it. This stringent security requirement prevented them from using containers to perform the build tasks in Kubernetes.

Kata Containers

Kata Containers is an open-source project that provides a secure container runtime, combining the lightweight nature of containers with the security benefits of VMs. It offers stronger workload isolation by using hardware virtualization technology as a second layer of defense. In Kata Containers, each container is effectively booted with its own guest operating system, as opposed to traditional containers, where the Linux kernel is shared among workloads and container isolation is achieved using namespaces and control groups (cgroups). Although traditional containers are a good fit for many workloads, they fall short when stronger isolation and security are needed.

Kata Containers runs OCI-compliant containers inside a stripped-down VM, providing strict isolation between containers sharing a host machine.

Kata Containers supports major architectures such as AMD64 and ARM. It supports multiple hypervisors, including Cloud Hypervisor and Firecracker – an AWS-built hypervisor used by AWS Lambda – and integrates with the containerd project, among others.

Kata Containers abstracts away the complexity of orchestrating workloads by using the Kubernetes orchestration system to provide a well-known interface to end users, while providing a custom runtime that runs hypervisor software on top of the Linux Kernel-based Virtual Machine (KVM) to deliver strong workload isolation and security.

Kata Containers allows you to run containers while integrating with industry-standard tools such as the OCI container format and the Kubernetes CRI interface. It deploys your containers using a hypervisor of choice, which creates a VM to host the Kata Containers agent (kata-agent) and your workload inside the container environment. Each VM hosts a single kata-agent that acts as the supervisor for managing the containers and the workload running within them. Each VM runs a separate guest kernel, based on a recent Linux Long Term Support (LTS) release, that is highly optimized for boot time and minimal memory footprint and provides only those services required by a container workload. You can find detailed information about the Kata Containers architecture in the project's documentation pages.

By adopting Kata Containers, users can orchestrate their build jobs using Kubernetes with minimal configuration changes to their Continuous Integration (CI) System. The implementation made sure that their Pods were isolated, providing robust protection against container breakouts while maintaining the agility and efficiency of containerized workloads.

Running Kata Containers on AWS

In the next section of this post, we demonstrate how to set up and run Kata Containers on AWS using Amazon Elastic Kubernetes Service (Amazon EKS). Before starting, note that we advise running the following instructions from a Bastion Host deployed in the same Amazon Virtual Private Cloud (Amazon VPC) as your EKS cluster.

We use Amazon EKS to run a fully functional Kubernetes cluster that uses Amazon Elastic Compute Cloud (Amazon EC2) bare metal instances as worker nodes, which allows KVM virtual machines to be spawned. Standard EC2 instances don't support nested virtualization, hence the need for bare metal.
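Later, once a bare metal node has joined the cluster, you can sanity-check from a Systems Manager session on the node that KVM is available (the exact list of kvm modules depends on the CPU vendor):

lsmod | grep kvm
ls -l /dev/kvm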

Prerequisites

The following prerequisites are required to continue with this post:

- An AWS account with permissions to create the EKS cluster, EC2 instances, and the related networking resources
- The AWS CLI, eksctl, and kubectl installed and configured
- A Bastion Host deployed in the same VPC as the cluster, accessible through AWS Systems Manager

Configure EKS cluster

Once the environment is ready and available to use, connect to your Bastion Host (we recommend using AWS Systems Manager to start a new session) and run the following commands:

eksctl create cluster \
  --name EKS-Kata \
  --region ap-southeast-1 \
  --version 1.29 \
  --vpc-private-subnets subnet-a,subnet-b \
  --without-nodegroup

Update this command to change the subnets or the AWS Region in which you'd like to deploy the cluster. For the sake of this post we are using Kubernetes version 1.29; you can update that to match the version you'd like to deploy.

This command creates an AWS CloudFormation stack and deploys a new EKS cluster. It takes a few minutes before the control plane is active and can be used for additional configuration.
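Once eksctl returns, a quick way to confirm that the control plane is active and that your kubeconfig was updated is to query the cluster (both commands only read state):

eksctl get cluster --name EKS-Kata --region ap-southeast-1
kubectl get svc kubernetes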

Configure EKS nodes

After the cluster has been created, we can proceed by adding a new node group that contains the instances needed to run our workloads. For this exercise we use an i3.metal instance, but you can use any metal instance that fits your use case.
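If you'd like to check which bare metal instance types are offered in your Region before choosing one, the EC2 API exposes a bare-metal filter (shown here with the AWS CLI):

aws ec2 describe-instance-types \
  --region ap-southeast-1 \
  --filters "Name=bare-metal,Values=true" \
  --query "InstanceTypes[].InstanceType" \
  --output text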

Create the metal-node-group.yaml with the following content:

cat <<EOF > metal-node-group.yaml
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: EKS-Kata
  region: ap-southeast-1

managedNodeGroups:
  - name: metal-instances
    instanceType: i3.metal
    amiFamily: Ubuntu2004
    desiredCapacity: 1
    minSize: 1
    maxSize: 1
    volumeSize: 150
    volumeType: gp3
    volumeEncrypted: true
    privateNetworking: true
    ssh:
      enableSsm: true
    subnets: ["subnet-a", "subnet-b"]
    iam:
      withAddonPolicies:
        cloudWatch: true
EOF

Note that you must update the subnets to match the ones in your VPC before creating this file. Then use the following command to create the node group:

eksctl create nodegroup -f metal-node-group.yaml

Similar to the create cluster command, the create nodegroup command creates a new CloudFormation stack that deploys your node group with the desired capacity in a few minutes.
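When the node group is ready, the bare metal instance should appear as a Ready node in your cluster:

kubectl get nodes -o wide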

Deploy Kata Containers

Kata Deploy is the fastest way to deploy Kata Containers in your Kubernetes cluster. Although it's the suggested deployment method for most cases, note that it ships a container image that contains the binaries and artifacts needed to run Kata Containers (including the hypervisor binaries). If you need custom versions of your hypervisor of choice or of the guest kernel image, we suggest following the developer guide to build your own binaries and base images.

From the Bastion Host, we can now deploy Kata Containers into our cluster using Kata Deploy:

kubectl apply -f https://raw.githubusercontent.com/kata-containers/kata-containers/main/tools/packaging/kata-deploy/kata-rbac/base/kata-rbac.yaml
kubectl apply -f https://raw.githubusercontent.com/kata-containers/kata-containers/main/tools/packaging/kata-deploy/kata-deploy/base/kata-deploy.yaml

Then wait for the deployment to complete:

kubectl -n kube-system wait --timeout=10m --for=condition=Ready -l name=kata-deploy pod
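When kata-deploy completes, it labels the nodes on which the Kata artifacts were installed. You can verify that your bare metal node was picked up by listing nodes with that label (the same label is used by the runtime classes for scheduling, as shown in the following example):

kubectl get nodes -l katacontainers.io/kata-runtime=true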

Run the following kubectl command to apply the Kata Runtime classes:

kubectl apply -f https://raw.githubusercontent.com/kata-containers/kata-containers/main/tools/packaging/kata-deploy/runtimeclasses/kata-runtimeClasses.yaml

The runtime classes allow you to quickly create pods that run using a specific hypervisor. These preconfigured classes come with scheduling constraints that make sure our pods are deployed on the cluster nodes that support the Kata Containers runtime. Kata Containers provides multiple runtime classes to support the hypervisors deployed by Kata Deploy.

The following is an example of a RuntimeClass defined for Firecracker:

kind: RuntimeClass
apiVersion: node.k8s.io/v1
metadata:
  name: kata-fc
handler: kata-fc
overhead:
  podFixed:
    memory: "130Mi"
    cpu: "250m"
scheduling:
  nodeSelector:
    katacontainers.io/kata-runtime: "true"

The deployment process also automatically updates the containerd configuration, adding the runtime classes provided by Kata, each configured to run with a custom runtime shim.
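As an illustration, the entries added for the Firecracker class look similar to the following excerpt of /etc/containerd/config.toml (exact paths and option names may vary between Kata releases):

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.kata-fc]
  runtime_type = "io.containerd.kata-fc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.kata-fc.options]
    ConfigPath = "/opt/kata/share/defaults/kata-containers/configuration-fc.toml"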

Configure Firecracker

Firecracker is an open source virtualization technology that is purpose-built for creating and managing secure, multi-tenant container and function-based services. Firecracker enables you to deploy workloads in lightweight VMs, called microVMs, which provide enhanced security and workload isolation over traditional VMs, while enabling the speed and resource efficiency of containers. Firecracker was developed at AWS to improve the user experience of services such as AWS Lambda.

Since Firecracker's Virtual Machine Monitor (VMM) does not enable filesystem-level sharing between the microVM and the host, you must configure a snapshotter that creates snapshots as filesystem images, which can be exposed to Firecracker microVMs as devices. containerd uses the snapshotter for storing image and container data.

In this section, we show how to configure a devmapper snapshotter for Firecracker. You need to log in to the node that was provisioned in the previous steps (Systems Manager is the recommended method).

Verify that the devmapper plugin isn’t configured yet:

sudo ctr plugins ls | grep devmapper

If the output of this command is

io.containerd.snapshotter.v1           devmapper                linux/amd64    error

then you have to create your devmapper snapshotter. Copy the following content into a create.sh script file:

#!/bin/bash
set -ex

DATA_DIR=/var/lib/containerd/io.containerd.snapshotter.v1.devmapper
POOL_NAME=devpool

sudo mkdir -p "${DATA_DIR}"
# Create data file
sudo touch "${DATA_DIR}/data"
sudo truncate -s 100G "${DATA_DIR}/data"

# Create metadata file
sudo touch "${DATA_DIR}/meta"
sudo truncate -s 40G "${DATA_DIR}/meta"

# Allocate loop devices
DATA_DEV=$(sudo losetup --find --show "${DATA_DIR}/data")
META_DEV=$(sudo losetup --find --show "${DATA_DIR}/meta")

# Define thin-pool parameters.
# See https://www.kernel.org/doc/Documentation/device-mapper/thin-provisioning.txt for details.
SECTOR_SIZE=512
DATA_SIZE="$(sudo blockdev --getsize64 -q ${DATA_DEV})"
LENGTH_IN_SECTORS=$(bc <<< "${DATA_SIZE}/${SECTOR_SIZE}")
DATA_BLOCK_SIZE=128
LOW_WATER_MARK=32768

# Create a thin-pool device
sudo dmsetup create "${POOL_NAME}" \
    --table "0 ${LENGTH_IN_SECTORS} thin-pool ${META_DEV} ${DATA_DEV} ${DATA_BLOCK_SIZE} ${LOW_WATER_MARK}"

Make it executable and run the script:

sudo chmod +x create.sh && sudo ./create.sh

Verify that it has been created successfully:

sudo dmsetup ls
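The output should include the new thin-pool, similar to the following (device numbers vary):

devpool (252:0)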

Next, update the containerd configuration in /etc/containerd/config.toml with your preferred editor to add the following section at the end of the file:

[plugins."io.containerd.snapshotter.v1.devmapper"]
pool_name = "devpool"
root_path = "/var/lib/containerd/io.containerd.snapshotter.v1.devmapper"
base_image_size = "40GB"

Also, you should update the kata-fc runtime section to add the devmapper snapshotter to the configuration file:

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.kata-fc]
  snapshotter = "devmapper" # line to add to your configuration file
  runtime_type = "io.containerd.kata-fc.v2"

Once you complete these two actions, restart the daemon using sudo systemctl restart containerd.

Now you can verify that the plugin is running correctly:

sudo ctr plugins ls | grep devmapper
...
io.containerd.snapshotter.v1           devmapper                linux/amd64    ok

The preceding script needs to be run only once, when setting up the devmapper snapshotter for containerd for the first time. Subsequently, make sure that on each reboot the thin-pool is initialized from the same data directory. The following simple script (reload.sh) can be used for that purpose:

#!/bin/bash
set -ex

DATA_DIR=/var/lib/containerd/io.containerd.snapshotter.v1.devmapper
POOL_NAME=devpool

# Allocate loop devices
DATA_DEV=$(sudo losetup --find --show "${DATA_DIR}/data")
META_DEV=$(sudo losetup --find --show "${DATA_DIR}/meta")

# Define thin-pool parameters.
# See https://www.kernel.org/doc/Documentation/device-mapper/thin-provisioning.txt for details.
SECTOR_SIZE=512
DATA_SIZE="$(sudo blockdev --getsize64 -q ${DATA_DEV})"
LENGTH_IN_SECTORS=$(bc <<< "${DATA_SIZE}/${SECTOR_SIZE}")
DATA_BLOCK_SIZE=128
LOW_WATER_MARK=32768

# Create a thin-pool device
sudo dmsetup create "${POOL_NAME}" \
    --table "0 ${LENGTH_IN_SECTORS} thin-pool ${META_DEV} ${DATA_DEV} ${DATA_BLOCK_SIZE} ${LOW_WATER_MARK}"

Then make the file executable and create a service that re-initializes the devpool after each reboot:

sudo chmod +x reload.sh && sudo nano /lib/systemd/system/devmapper_reload.service

Add the following content:

[Unit]
Description=Devmapper reload script

[Service]
ExecStart=/path/to/script/reload.sh

[Install]
WantedBy=multi-user.target

Remember to change the absolute path of the reload.sh script, and then enable the service:

sudo systemctl daemon-reload
sudo systemctl enable devmapper_reload.service
sudo systemctl start devmapper_reload.service
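You can check that the unit is enabled and that its last run succeeded. Because the script exits after creating the pool, the unit showing as inactive (dead) after a successful run is expected:

sudo systemctl status devmapper_reload.service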

Before terminating the session on the EKS node, note the kernel version that you're using. It is useful in the next section for comparison against the guest operating system version:

$ uname -a
Linux ip-192-168-118-78 5.15.0-1051-aws

Deploy workloads

Now that the configuration is completed, we can run workloads using the Kata Containers runtime classes that have been created. The following instructions should be executed from your Bastion Host.

You can verify the available runtime classes by running kubectl get runtimeclass. In this post we only show examples running on Firecracker and Cloud Hypervisor. However, if you'd like to use a different supported hypervisor, you just need to update the runtime class accordingly.
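The output should look similar to the following; the exact set of classes depends on the Kata release that was deployed:

$ kubectl get runtimeclass
NAME        HANDLER     AGE
kata-clh    kata-clh    10m
kata-fc     kata-fc     10m
kata-qemu   kata-qemu   10m
...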

Create the following redis-pod.yaml file:

cat <<EOF > redis-pod.yaml
---
apiVersion: v1
kind: Pod
metadata:
   name: redis-pod
spec:
   runtimeClassName: kata-fc
   containers:
   - name: redis-container
     image: public.ecr.aws/docker/library/redis:latest
     imagePullPolicy: IfNotPresent
     ports:
     - containerPort: 6379
EOF

Then deploy it on your Kubernetes cluster with kubectl create -f redis-pod.yaml.
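Before inspecting the guest kernel, confirm that the pod reached the Running state and that it was admitted with the expected runtime class:

kubectl get pod redis-pod -o wide
kubectl get pod redis-pod -o jsonpath='{.spec.runtimeClassName}'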

To verify that the deployment is running on its own microVM, you can run the following command to check the operating system version:

$ kubectl exec -it redis-pod -- bash -c "uname -a"
Linux redis-pod 6.1.38

The microVM should run a different kernel version from the one in use on the host operating system. In fact, the cluster's node machine is using kernel version 5.15.0-1051-aws, while the guest operating system is using 6.1.38.

You can also update the runtimeClassName property to run the container under a different hypervisor. For example, the following configuration describes how to update it to use Cloud Hypervisor:

cat <<EOF > redis-pod.yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: redis-pod
spec:
  runtimeClassName: kata-clh
  containers:
  - name: redis-container
    image: public.ecr.aws/docker/library/redis:latest
    imagePullPolicy: IfNotPresent
    ports:
    - containerPort: 6379
EOF

From the node machine, you can also confirm that your pods are actually running on the right hypervisor by inspecting the running processes with the ps command.

For example, for Firecracker:

$ ps -aux | grep firecracker
root      195793  4.7  0.0 2104568 131816 ?      Sl   12:51   0:00 /firecracker --id 9dab6e65aa7be2e23b2b999b5694d4e1 --start-time-us 35935709841 --start-time-cpu-us 0 --parent-cpu-time-us 6709 --config-file /fcConfig.json

Or, for Cloud Hypervisor:

$ ps -aux | grep cloud-hypervisor
root      193537  3.0  0.0 2385448 154092 ?      Sl   12:46   0:00 /opt/kata/bin/cloud-hypervisor --api-socket /run/vc/vm/f01ac1ae85101532c1cd93025ee25d408303ba9f74d39b119f2b327adbe16dab/clh-api.sock

Each hypervisor also has its own configuration file located under the /opt/kata/share/defaults/kata-containers/ folder, which can be tweaked to meet your requirements. The containerd configuration (/etc/containerd/config.toml) refers to these Kata Containers configuration files under the specific runtime class sections. For most cases the default configuration should suffice, but if a specific configuration is needed (such as enabling debugging options or updating the maximum number of CPUs to be used), it can be updated in these configuration files and applied by restarting the containerd service.
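For example, a minimal illustrative tweak to the Firecracker defaults might look like the following; the key names follow the Kata configuration format, so verify them against the file shipped with your release before editing:

# Excerpt from /opt/kata/share/defaults/kata-containers/configuration-fc.toml
[hypervisor.firecracker]
# Number of vCPUs each microVM boots with
default_vcpus = 1
# Maximum number of vCPUs the microVM can grow to
default_maxvcpus = 4
# Default memory assigned to the microVM, in MiB
default_memory = 2048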

Cleaning up

After you're done with your experiments, you can clean up the Kubernetes cluster and the EKS nodes by deleting the CloudFormation stacks created by the eksctl commands described previously. Alternatively, you can run a delete cluster command:

eksctl delete cluster \
 --name EKS-Kata \
 --region ap-southeast-1 \
 --disable-nodegroup-eviction \
 --wait

This automatically deletes the CloudFormation stacks representing the node group and the EKS cluster from your AWS account.

Conclusion

In this post, we detailed the process of setting up a self-managed microVM infrastructure on Amazon EKS by using Amazon EC2 bare metal instances and Kata Containers. This approach combines the flexibility of container orchestration systems with the enhanced security and isolation provided by VMs. While this setup may be suitable for small tests or proofs of concept, it is crucial to conduct a thorough assessment, determine whether this setup is ideal for a production use case, and perform a cost-benefit analysis before deploying these workloads in production. Operating hypervisors at scale requires a deep understanding of their implications on the entire stack, from applications to the Linux kernel. Tasks typically handled transparently, such as health checks and host patching in Amazon EC2 or operating system and Kubernetes patching in Amazon EKS, become user responsibilities when using self-managed node groups on bare metal EC2 instances. For most users, we recommend using managed solutions such as Amazon EC2, Amazon Elastic Container Service (Amazon ECS), and Amazon EKS for smoother operations and reduced complexity.

If you'd like to dive deeper into Kata Containers, we suggest reading their documentation.

Mirash Gjolaj

Mirash is a Prototyping Architect at AWS based in Singapore. He helps customers in the ASEAN region build prototypes using emerging technologies in domains such as AI/ML, IoT, serverless, and containers. When not working, he likes to spend time traveling, reading, and experimenting with new technologies.

Re Alvarez-Parmar

In his role as Containers Specialist Solutions Architect at Amazon Web Services, Re advises engineering teams on modernizing and building distributed services in the cloud. Prior to joining AWS, he spent more than 15 years as an Enterprise and Software Architect. He is based out of Seattle. Connect with him on LinkedIn at linkedin.com/in/realvarez/.