Rippling’s journey migrating to the new VPC CNI Network Policy Engine

This post was coauthored by Venkatesh Nannan, Sr. Engineering Manager at Rippling

Introduction

Rippling is a workforce management system that eliminates the friction of running a business, combining HR, IT, and Finance apps on a unified data platform. Rippling’s mission is to free up intelligent people to work on hard problems.

Existing Stack

Rippling uses a modular monolith architecture with different Docker entrypoints for multiple services and background jobs. These components are managed within a single, large, multi-tenant production cluster on Amazon Elastic Kubernetes Service (Amazon EKS per region), on a scale of over 1000 nodes. Rippling’s infra stack consists of:

Karpenter for cluster autoscaling – a flexible, high-performance Kubernetes cluster autoscaler making sure of optimal compute capacity.
Horizontal Pod Autoscaler for scaling Kubernetes pods based on demand.
KEDA, an event-driven autoscaler for scaling background job processing containers based on event volume.
IAM Roles for Service Accounts (IRSA) provide temporary AWS Identity and Access Management (IAM) credentials to the Kubernetes pod, enabling access to AWS resources such as Amazon Simple Storage Service (Amazon S3) buckets, etc.
Argo CD, an open-source, GitOps continuous delivery tool, deploys applications and add-on software to the Kubernetes cluster.
AWS Load Balancer Controller exposes Kubernetes services to end-users.
TargetGroupBinding Custom Resource binds pods to Application Load Balancer (ALB) target groups.
Amazon EKS managed node groups spanning across multiple Availability Zones (AZs).

In addition to these technologies, we were using Cilium CNI for controlling network traffic between pods. However, we were running into challenges with this part of our stack, so we decided to look for the following alternatives.

Figure 1: High level architecture of Rippling

Figure 1: High level architecture of Rippling

Challenges

As Amazon EKS version 1.23 approached end-of-life, upgrading to v1.27 became imperative. However, during our initial attempts at upgrading to v1.24 in our non-production environment, we encountered a significant hurdle. New nodes running Cilium failed to join the cluster, increasing our downtime and requiring operational work on the CNI plugin.

As a company, we prioritize using managed services to streamline operations and focus on adding value to our business. This Kubernetes upgrade task gave us an opportunity to look at alternatives that would be easier to maintain. We saw that AWS had just announced the VPC CNI support for k8s network policies using eBPF. We realized that migrating to this solution would enable us to replace our third-party networking add-on and solely rely on VPC CNI for both cluster networking and network policy implementation. This change would help reduce the overhead of managing operational software needed for cluster networking.

Introduction of Amazon VPC CNI support for network policies

When AWS announced VPC CNI support for k8s Network Policies using eBPF, we wanted to use the Amazon VPC CNI to secure the traffic in our Kubernetes clusters and simplify our EKS cluster management and operations. As network policy agents are bundled in existing VPC CNI pods, we would no longer need to run additional daemon pods and network policy controllers on the worker nodes. We followed the blue-green cluster upgrade strategy and were able to safely migrate the traffic from the old cluster to the new cluster with minimal risk of breaking existing workloads.

Planning the migration

We did an inventory of the applied network policies in our existing cluster and the various ingress/egress features used. This helped us identify deviations from upstream K8s Network Policies. This is necessary for migrating, as Amazon VPC CNI supports only the upstream k8s network policies as of this writing. Rippling was not using advanced features from our third-party network policy engines such as Global Network Policies, DNS based policy rules, or rule priority. Therefore, we did not need Custom Resource Definition (CRD) transformations going into the migration process. AWS recommends converting third-party NetworkPolicy CRDs to Kubernetes Network Policy resources and testing the converted policies in a separate test cluster before migrating from third-party to VPC CNI Network Policy engine in production.

To assist in the migration process, AWS has developed a tool called K8s Network Policy Migrator that converts existing supported Calico/Cilium network policy CRDs to Kubernetes native network policies. After conversion you can directly test the converted network policies on your new clusters running VPC CNI network policy controller. The tool is designed to help streamline the migration process and make sure of a smooth transition.

Picking migration strategy

There are broadly two strategies to migrate the CNI plugin in the EKS cluster: (1) In-place and (2) Blue-Green. The in-Place strategy replaces an existing third-party CNI plugin with the VPC CNI plugin with network policy support in an existing EKS cluster. This would entail the following steps:

Creating a new label “cni-plugin=3p” on the existing Amazon EKS managed node groups and Karpenter NodePool resources.
Updating the existing third-party CNI DaemonSet to schedule CNI pods on those labeled nodes.
Deploy the Amazon EKS Add-on version of Amazon VPC CNI and schedule them to nodes without the “cni-plugin” label. At this point the existing nodes have third-party CNI plugin pods and not the VPC CNI pods.
Launch new Amazon EKS Managed node groups, Karpenter NodePool resources without the “cni-plugin=3p” label so that VPC CNI pods can be scheduled to those nodes.
Drain and delete the existing Amazon EKS managed node groups and Karpenter NodePool resources to move the workloads to the new worker nodes with VPC CNI.
Finally, delete the third-party CNI and associated network policy controllers from the cluster.

As you can see, this process is involved, needs careful orchestration, and is more prone to errors that impact the application availability.

The second approach is to use the Blue-Green strategy, in which a new EKS cluster is launched with the VPC CNI plugin and then the workloads are migrated to it. This approach is safer since it can be rolled back and provides the ability to test the setup in isolation before routing the live production traffic. Therefore, we chose the Blue-Green strategy for our migration.

Migration

As part of the blue-green strategy, we created a new EKS cluster with the Amazon VPC CNI and enabled Network Policy support by customizing the VPC CNI Amazon EKS add-on configuration. We also deployed the Argo CD agent on the cluster and bootstrapped it using Argo CD’s App of apps pattern to deploy the applications into the cluster. Network policies were also deployed to the cluster using the Argo CD. This was tested in a non-production environment to migrate from the third-party CNI to VPC CNI to make sure that applications and services passed functional tests. Then we could safely migrate the traffic from the old cluster to the new cluster without risks by leveraging the same strategy in the production environment.

Lessons learned

Amazon VPC CNI uses the VPC IP space to assign IP addresses to k8s pods. This led us to realize our existing VPCs were not properly sized to meet the growing number of k8s pods. We added a permitted secondary CIDR block 100.64.0.0/10 to the VPC and configured VPC CNI Custom Networking feature to assign those IP addresses to the k8s pods. This proactive measure makes sure of scalability as our infrastructure expands, mitigating concerns about IP address exhaustion.
Leveraging automation and Infrastructure-as-Code (IaC) is recommended, especially as we are replicating existing clusters and migrating the workloads to them.

Conclusion

In this post, we discussed how Rippling migrated from third-party CNI to Amazon VPC CNI in their Amazon EKS clusters and enabled network policy support to secure pod-to-pod communications. Rippling used the blue-green strategy for the migration to minimize the application impact, and safely cut over the traffic to the new cluster. This migration helped Rippling to use the native features offered by AWS and reduced the burden of managing the operational software in our EKS clusters.

Venkatesh Nannan, Sr. Engineering Manager – Infrastructure at Rippling

Venkatesh Nannan is a seasoned Engineering leader with expertise in building scalable cloud-native applications, specializing in backend development and infrastructure architecture.

Containers