Harnessing Karpenter: Transitioning Kafka to Amazon EKS with AWS solutions

AppsFlyer is a global leader in mobile attribution and marketing analytics. It helps businesses understand the impact of their marketing efforts across every channel and device through a comprehensive measurement platform and privacy cloud that fosters ecosystem collaboration while preserving customers’ privacy.

At AppsFlyer, data is at the core: it lets us expose detailed analytics and enables customers to make the right decisions about where to focus their campaign efforts. The data backbone is managed with Apache Kafka, which handles the data flow of more than 1,000 microservices, is spread across more than 50 clusters, and holds up to 800 TB of data at any given time.

Originally, our legacy Kafka infrastructure followed a traditional setup, with each Kafka broker deployed on its own dedicated Amazon Elastic Compute Cloud (Amazon EC2) node. This system was managed with a variety of tools, such as Chef and Terraform, as well as third-party services, all of which lived in different Git projects.

This made management complicated, as every change to the infrastructure brought along a host of complex dependencies. Each component required careful consideration, testing, and approval, even for seemingly minor tasks such as Kafka upgrades. These complications resulted in a much bigger workload for our team. The outcome was fewer resources invested in new technology solutions for AppsFlyer’s internal research and development (R&D).

Therefore, when my team and I at the AppsFlyer Platform group were tasked with redesigning our legacy Kafka infrastructure to run on Kubernetes, we saw it as an opportunity to grow and improve our Kafka system on multiple fronts. Our main goal was to migrate our clusters to a more advanced, automated, high-performing, and easy-to-manage infrastructure, benefiting both our clients and the team’s daily operations.

In this post, we share the key benefits that our organization realized in moving our Kafka applications to Kubernetes, as well as the challenges we faced and the AWS solutions we adopted to overcome those challenges.

Unlocking efficiencies: Amazon Elastic Kubernetes Service

Redesigning a legacy infrastructure always comes with the opportunity to implement a solution that matches the capabilities of the past infrastructure while providing a performance boost, easier management, and a maintainable automated system. An added bonus is the potential for cost savings.

Luckily, when migrating to Kubernetes, AWS Cloud offers multiple solutions.

Kubernetes is a versatile tool to manage containerized workloads, offering a variety of features such as service discovery, orchestration, storage, secret management, self-healing capabilities, and many more.

Amazon Elastic Kubernetes Service (Amazon EKS) is a fully managed Kubernetes service that simplifies building and maintaining a Kubernetes cluster on AWS. It integrates with core AWS services, making it easier to connect and interact with the AWS Cloud components that your stateful application depends on.

Our deployment uses the Strimzi Kafka Operator, a project that simplifies the process of running Apache Kafka within a Kubernetes cluster. We chose it for its container images and operators, which are purpose-built to manage Kafka effectively on Kubernetes.

The implementations and tools provided by AWS (such as External DNS and the AWS Load Balancer Controller) and open source tools allowed us to build a deployment that meets our needs. We used open source tools such as Cruise Control, which helps run Kafka at a large scale and provides an easy cluster rebalance mechanism. In addition, we used Redpanda Console, a user interface (UI) tool to observe topics’ messages in real time. This toolset gave us the ability to manage many layers of infrastructure, from storage to network to application, under a single Kubernetes service.

Unlike the previous deployment in our legacy infrastructure, which required us to manage distinct components from different git project locations, we can now manage everything in one place, under a centralized Git project. This helped us improve collaboration, and increase visibility and tracking of the resources involved.

Optimized performance – Let’s go Graviton!

Once we decided on our orchestrator, we needed to choose the type of AWS instance that we would use to hold our Kafka Pods, which Amazon EKS would manage.

Initially, we planned to migrate from our previous i3-type instances to the improved i3en instances due to the cost-benefit and enhanced local SSD storage. However, during benchmarking, AWS introduced another type of instance that could offer us an advantage: AWS Graviton.

Graviton instances present a step up in performance and capabilities, with an array of types tailored to different system use cases. They feature ARM-based processors and, most importantly for our use case, local NVMe storage with improved IOPS and throughput.

In our legacy infrastructure, we had decided to use instance local storage for our Kafka brokers to maximize performance factors, such as input/output operations per second (IOPS) and fetch times. External storage options could not meet our requirements of millions of IOPS and single-digit millisecond latency.

Using local storage in Kubernetes is a relatively new concept. In this kind of system, losing a node means losing the data on it, which lengthens recovery because the missing data must be replicated again from other brokers.

Because of this, in both our legacy infrastructure and our new Kubernetes infrastructure, we use On-Demand Instances. This reduces node terminations and interruptions, which decreases the number of times Kafka must replicate missing data from other brokers and lowers the risk that multiple simultaneous node terminations result in complete data loss.
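To make that risk concrete, here is a minimal Python sketch. It is not AppsFlyer tooling: the partition names, broker IDs, and the replication factor of 3 are made up for illustration. It models which partitions only need re-replication after a failure and which would lose every copy if several brokers, and their local disks, disappeared at once.

```python
from typing import Dict, List, Set

# Hypothetical replica assignment: partition -> broker IDs holding a copy.
# A replication factor of 3 is assumed purely for illustration.
ASSIGNMENT: Dict[str, List[int]] = {
    "events-0": [1, 2, 3],
    "events-1": [2, 3, 4],
    "events-2": [3, 4, 1],
    "events-3": [4, 1, 2],
}


def lost_partitions(assignment: Dict[str, List[int]], failed: Set[int]) -> List[str]:
    """Partitions whose replicas ALL lived on failed brokers (data is gone)."""
    return sorted(p for p, replicas in assignment.items() if set(replicas) <= failed)


def under_replicated(assignment: Dict[str, List[int]], failed: Set[int]) -> List[str]:
    """Partitions that lost some, but not all, replicas and must re-replicate."""
    result = []
    for partition, replicas in assignment.items():
        lost = len(set(replicas) & failed)
        if 0 < lost < len(set(replicas)):
            result.append(partition)
    return sorted(result)


if __name__ == "__main__":
    # One broker down: nothing is lost, but three partitions need re-replication.
    print(under_replicated(ASSIGNMENT, {1}), lost_partitions(ASSIGNMENT, {1}))
    # Three brokers down at once: one partition has no surviving copy.
    print(lost_partitions(ASSIGNMENT, {1, 2, 3}))
```

With local storage, every entry returned by `under_replicated` translates into data that a replacement broker has to copy back, which is why fewer interruptions matter so much.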

However, when it comes to fast data recovery, Graviton instances offer local NVMe SSD storage, which greatly benefits Kafka, our stateful, real-time data bus.

Comparing a Kafka cluster running on our original i3.2xlarge instances with one running on the newer im4gn.2xlarge Graviton instances, we saw a 75% increase in throughput, 10% lower CPU consumption, and, most notably, a 58% increase in write I/O performance and a 92% increase in read I/O performance.

Fast-forward to now. Looking at our Kafka clusters running on the new Kubernetes architecture with im4gn Graviton instances, we can confidently say that we’ve halved our CPU core count, reached a 50% reduction in costs, and gained the additional advantage of lowering our carbon footprint.

Kubernetes, Kafka, and the power of local storage

Having chosen Graviton2 instances, we wanted to take advantage of their local SSD storage. In the benchmark, we observed a 58% increase in write I/O performance and an even bigger 92% increase in read I/O performance. This was an advantage that we simply could not ignore.

Although local storage does present the challenge of slow recovery time, since each node failure triggers the need for a new broker to come up and replicate back the data it has lost, it’s a challenge we were willing to take on. By having the storage located on the same node where a single Kafka pod resides (as opposed to using external storage), we can offer our clients – both producers and consumers – the benefit of boosted performance.

First, we needed to pick a provisioner that manages the binding of our Kafka Persistent Volume Claims (PVCs) to their corresponding Persistent Volumes (PVs), so that these PVs can represent the local SSD storage provided by the Graviton instances.

We ultimately chose Rancher’s open source Local Path Provisioner. The provisioner watches for Persistent Volume Claims that request local storage and automatically creates Persistent Volumes on the nodes themselves, while also handling the binding operation between them.
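As a rough illustration of how these pieces fit together, the following Python sketch builds the two Kubernetes objects involved as plain dictionaries: a StorageClass backed by the Local Path Provisioner and the storage stanza of a Strimzi Kafka resource that requests it. The provisioner name `rancher.io/local-path` and the Strimzi `persistent-claim` fields come from those projects; the class name, volume size, and helper name `kafka_storage` are assumptions made for the example, not our production values.

```python
import json
from typing import Dict

# StorageClass served by Rancher's Local Path Provisioner.
# The class name and reclaim policy are illustrative choices.
STORAGE_CLASS: Dict = {
    "apiVersion": "storage.k8s.io/v1",
    "kind": "StorageClass",
    "metadata": {"name": "local-path"},
    "provisioner": "rancher.io/local-path",
    # Delay binding until the pod is scheduled, so the volume is created
    # on the same node the Kafka pod lands on.
    "volumeBindingMode": "WaitForFirstConsumer",
    "reclaimPolicy": "Delete",
}


def kafka_storage(size: str, storage_class_name: str) -> Dict:
    """Storage stanza for a Strimzi Kafka resource (hypothetical helper).

    Strimzi turns this into one PVC per broker; the provisioner then
    creates the matching PV on the broker's node.
    """
    return {
        "type": "persistent-claim",
        "size": size,
        "class": storage_class_name,
        "deleteClaim": False,
    }


if __name__ == "__main__":
    # The size below is a placeholder, not a recommendation.
    stanza = kafka_storage("1000Gi", STORAGE_CLASS["metadata"]["name"])
    print(json.dumps({"storageClass": STORAGE_CLASS, "kafkaStorage": stanza}, indent=2))
```

Dumping the dictionaries as JSON (or YAML) gives manifests that can be reviewed before applying them to a test cluster.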

Second, we needed to address the scenario of a node failure. When the underlying node fails, the local storage it hosted is gone with it. The replacement Kafka pod then stays in a Pending state, because its PVC is still bound to a PV on a node that no longer exists and the pod keeps waiting for that storage to become available again.

To address this scenario, we developed our own in-house open source controller called Local PVC Releaser. The controller listens to Kubernetes events regarding both planned and unplanned node terminations. Once it spots one, the controller automatically deletes the local storage PVCs that were bound to the terminated node. This allows the Strimzi operator to recreate the claim and safely attach the pod to a new node, where it can start replicating the data.
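The core decision such a controller makes can be sketched in a few lines of Python. This is not the Local PVC Releaser source code; it is a simplified model that works on PV objects represented as dictionaries, and it assumes that each local PV is pinned to its node through a required node affinity on the `kubernetes.io/hostname` label. The helper names and the example PV and PVC names are hypothetical.

```python
from typing import Dict, List, Tuple

HOSTNAME_KEY = "kubernetes.io/hostname"


def local_pv(name: str, node: str, namespace: str, claim: str) -> Dict:
    """Build a minimal PV dictionary pinned to one node (test fixture)."""
    return {
        "metadata": {"name": name},
        "spec": {
            "claimRef": {"namespace": namespace, "name": claim},
            "nodeAffinity": {"required": {"nodeSelectorTerms": [
                {"matchExpressions": [
                    {"key": HOSTNAME_KEY, "operator": "In", "values": [node]}
                ]}
            ]}},
        },
    }


def pv_node_names(pv: Dict) -> List[str]:
    """Collect the node names a PV is pinned to via required node affinity."""
    terms = (pv.get("spec", {}).get("nodeAffinity", {})
               .get("required", {}).get("nodeSelectorTerms", []))
    names: List[str] = []
    for term in terms:
        for expr in term.get("matchExpressions", []):
            if expr.get("key") == HOSTNAME_KEY and expr.get("operator") == "In":
                names.extend(expr.get("values", []))
    return names


def claims_to_release(pvs: List[Dict], terminated_node: str) -> List[Tuple[str, str]]:
    """Return (namespace, name) of every PVC bound to a PV on the dead node."""
    released = []
    for pv in pvs:
        claim = pv.get("spec", {}).get("claimRef")
        if claim and terminated_node in pv_node_names(pv):
            released.append((claim["namespace"], claim["name"]))
    return released


EXAMPLE_PVS = [
    local_pv("pv-a", "node-a", "kafka", "data-demo-kafka-0"),
    local_pv("pv-b", "node-b", "kafka", "data-demo-kafka-1"),
]

if __name__ == "__main__":
    print(claims_to_release(EXAMPLE_PVS, "node-a"))
```

A real controller would obtain the node termination from the Kubernetes API and then delete each returned claim through it; here the function only computes which claims qualify, which keeps the logic easy to test.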

Persistent Volumes Diagram

Running each Kafka pod on its own dedicated node ensures that a node failure impacts only a single broker, avoids noisy neighbors, and increases the overall availability of the Kafka cluster.
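One common way to express "one broker per node" in Kubernetes is a required pod anti-affinity rule keyed on the node hostname. The Python sketch below builds such a rule as a dictionary and adds a small checker for a given placement. It is an illustration under assumptions: the label `strimzi.io/name` with the value `<cluster>-kafka` is assumed to identify the broker pods, and the cluster, pod, and node names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def one_broker_per_node_affinity(cluster_name: str) -> Dict:
    """Pod anti-affinity that forbids two broker pods on the same node."""
    return {
        "podAntiAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    # Assumed label for the broker pods of this Kafka cluster.
                    "labelSelector": {
                        "matchLabels": {"strimzi.io/name": f"{cluster_name}-kafka"}
                    },
                    # One topology domain per node.
                    "topologyKey": "kubernetes.io/hostname",
                }
            ]
        }
    }


def crowded_nodes(placements: Dict[str, str]) -> List[str]:
    """Given pod -> node placements, return nodes hosting more than one broker."""
    counts = Counter(placements.values())
    return sorted(node for node, count in counts.items() if count > 1)


if __name__ == "__main__":
    print(one_broker_per_node_affinity("demo"))
    # node-a hosts two brokers here, so it would violate the rule.
    print(crowded_nodes({"demo-kafka-0": "node-a",
                         "demo-kafka-1": "node-b",
                         "demo-kafka-2": "node-a"}))
```

Because the rule is required rather than preferred, a broker that cannot get its own node stays unscheduled until the autoscaler provides one, instead of sharing a node.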

Optimizing Amazon EKS automation – Karpenter for stateful application

As we finalized our architecture, our next step was implementing an autoscaler to automate this solution to its maximum potential. Originally, we tried the Kubernetes Cluster Autoscaler to handle autoscaling events in our EKS clusters.

However, running our Kafka pods on instance local storage and distributing them across different Availability Zones (AZs) to protect against large-scale failures added complexity, and our production environment needed an immediate response to scaling events and failures.

That is when AWS introduced us to Karpenter. Karpenter is an open source project, originated and backed by AWS, that serves as a Kubernetes node autoscaling solution. It is designed to optimize cost and resource efficiency by taking Kubernetes pods’ requirements and constraints into account, and it aims to be more responsive and flexible.

Karpenter helped us on three main fronts:

1. Speed

Although Kafka and Kubernetes provide many self-healing capabilities, the rate at which a new AWS instance can be initiated and added to the EKS cluster is of great importance to us and our users.

Karpenter responds quickly: it automatically recognizes when a pod requests new resources and deploys an instance according to the configuration defined by my team, while also honoring the pod’s requirements about where that instance must be placed. With this logic, we reduced instance deployment and recovery time from the original nine minutes with the Kubernetes Cluster Autoscaler to less than a minute with Karpenter.

2. Local storage awareness

This significant reduction comes from Karpenter’s local storage awareness capabilities. Karpenter features automatic detection of storage scheduling requirements, which it then integrates into node launch decisions alongside topology spread and hardware specifications.

Local storage awareness is something that the Kubernetes Cluster Autoscaler doesn’t offer, as it doesn’t have direct knowledge of where the pod’s local storage should be placed. In contrast, Karpenter automatically understands this and deploys an instance in the correct AZ.

3. Significant cost savings

Karpenter’s optimization results in significant cost savings, as Karpenter can calculate not only the number of instances required to support the applications deployed on the EKS cluster, but also the specific instance family, type, and size that optimize cost efficiency.

This feature is useful for our third-party tool pods, which work alongside our Kafka cluster. Because these pods are stateless, Karpenter has free rein to determine how many of them an instance of a given type and size can support. It strategically places them on the same instance to avoid deploying another, underused instance.
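The cost reasoning in the third point can be illustrated with a small Python sketch. It does not reproduce Karpenter’s actual scheduling logic; it swaps in a simple first-fit decreasing bin-packing heuristic to count how many nodes of each candidate type a set of stateless pods needs, then picks the cheapest option. The instance names, sizes, and hourly prices are invented for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class InstanceType:
    name: str
    cpu_m: int          # allocatable CPU in millicores
    memory_mib: int     # allocatable memory in MiB
    hourly_usd: float   # made-up price, for illustration only


@dataclass(frozen=True)
class Pod:
    name: str
    cpu_m: int
    memory_mib: int


def nodes_needed(pods: List[Pod], itype: InstanceType) -> Optional[int]:
    """First-fit decreasing: nodes of one type needed to host all pods.

    Returns None when a pod is too large for the instance type.
    """
    free: List[List[int]] = []  # remaining [cpu_m, memory_mib] per node
    for pod in sorted(pods, key=lambda p: (p.cpu_m, p.memory_mib), reverse=True):
        if pod.cpu_m > itype.cpu_m or pod.memory_mib > itype.memory_mib:
            return None
        for node in free:
            if node[0] >= pod.cpu_m and node[1] >= pod.memory_mib:
                node[0] -= pod.cpu_m
                node[1] -= pod.memory_mib
                break
        else:
            # No existing node has room, so open a new one.
            free.append([itype.cpu_m - pod.cpu_m, itype.memory_mib - pod.memory_mib])
    return len(free)


def cheapest_option(pods: List[Pod], catalog: List[InstanceType]) -> Tuple[InstanceType, int]:
    """Pick the instance type (and node count) with the lowest hourly cost."""
    options = []
    for itype in catalog:
        count = nodes_needed(pods, itype)
        if count is not None:
            options.append((count * itype.hourly_usd, itype, count))
    if not options:
        raise ValueError("no instance type can host these pods")
    _, best, count = min(options, key=lambda option: option[0])
    return best, count


# Hypothetical catalog and workload.
CATALOG = [
    InstanceType("small", cpu_m=2000, memory_mib=8192, hourly_usd=0.10),
    InstanceType("large", cpu_m=8000, memory_mib=32768, hourly_usd=0.35),
]
PODS = [Pod(f"tool-{i}", cpu_m=1500, memory_mib=2048) for i in range(4)]

if __name__ == "__main__":
    best, count = cheapest_option(PODS, CATALOG)
    print(best.name, count)
```

In this made-up case, four "small" nodes would each be mostly idle after hosting one pod, so a single "large" node is cheaper, which mirrors the idea of avoiding extra, underused instances.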

All of this resulted in our decision to fully use Karpenter as our main autoscaler solution, as it met our needs when running our stateful application on Amazon EKS.

Final words – Kafka leap to Kubernetes

With our new architecture established and with the chosen AWS tools I described in this post, as well as many more I didn’t cover, my team here at AppsFlyer successfully migrated our legacy Kafka Clusters to our new and improved Kafka Over Kubernetes infrastructure in under two months.

This migration delivered a significant performance boost for our users in terms of “Produce” and “Consume” request times, enhanced disk I/O write performance, and fast, versatile recovery from node failures. Additionally, these elements have been wrapped within Terraform for effortless management, so my team can keep the clusters up to date and healthy.

This brings us to today, where we manage over 50 production Kafka clusters using this infrastructure. It cuts the number of CPU cores in half as compared to our previous setup, resulting in an astounding 30% cost reduction from our previous infrastructure.

Armed with our newfound knowledge and already established building blocks, we can now bring more types of stateful applications to Kubernetes in a much shorter time.

Kafka on EKS with Local NVMe Diagram
