Containers

Deploying and scaling Apache Kafka on Amazon EKS

Introduction

Apache Kafka, a distributed streaming platform, has become a popular choice for building real-time data pipelines, streaming applications, and event-driven architectures. It is horizontally scalable, fault-tolerant, and performant. However, managing and scaling Kafka clusters can be challenging and often time-consuming. This is where Kubernetes, an open-source platform for automating deployment, scaling, and management of containerized applications, comes into play.

Kubernetes provides a robust platform for running Kafka. It offers features like rolling updates, service discovery, and load balancing which are beneficial for running stateful applications like Kafka. Moreover, Kubernetes’ ability to automatically heal applications ensures that Kafka brokers remain operational and highly available.

Scaling Kafka on Kubernetes involves increasing or decreasing the number of Kafka brokers or changing the resources (i.e., CPU, memory, and disk) available to them. Kubernetes provides mechanisms for auto-scaling applications based on resource utilization and workload, which can be applied to keep resource allocation optimal at all times. This lets you respond to variable workloads while also optimizing cost, elastically scaling up and down as needed to meet performance demands.

You can install Kafka on a Kubernetes cluster using Helm or a Kubernetes Operator. Helm is a package manager for Kubernetes applications. It allows you to define, install, and manage complex Kubernetes applications using pre-configured packages called charts. A Kubernetes Operator is a custom controller that extends the Kubernetes API to manage and automate the deployment and management of complex, stateful applications. Operators are typically used for applications that require domain-specific knowledge or complex lifecycle management beyond what is provided by native Kubernetes resources.
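As a concrete sketch of the Helm route, the Strimzi operator can be installed from its upstream chart repository roughly as follows (this post deploys Strimzi through the DoEKS blueprint instead, so this is illustrative only):

```shell
# Illustrative only: install the Strimzi operator from the upstream Helm chart.
# The blueprint in this post installs Strimzi for you, so you do not need to run this.
helm repo add strimzi https://strimzi.io/charts/
helm install strimzi-kafka-operator strimzi/strimzi-kafka-operator \
  --namespace strimzi-kafka-operator --create-namespace
```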

In this post, we demonstrate how to build and deploy a Kafka cluster with Amazon Elastic Kubernetes Service (Amazon EKS) using Data on EKS (DoEKS). DoEKS is an open-source project that helps customers expedite their data platform development on Kubernetes. DoEKS provides templates that incorporate best practices around security, performance, and cost optimization, among others, that can be used as is or fine-tuned to meet more unique requirements.

DoEKS makes it easy to start using these templates through the DoEKS blueprints project. This project provides blueprints and practical examples tailored to various data frameworks, written as infrastructure as code in Terraform or the AWS Cloud Development Kit (AWS CDK). You can use them to set up infrastructure designed around best practices and performance benchmarks in minutes, without writing any infrastructure as code yourself.

For deploying the Kafka cluster, we use the Strimzi Operator. Strimzi simplifies running Apache Kafka in a Kubernetes cluster by providing container images and operators for Kafka on Kubernetes. The operators simplify deploying and running Kafka clusters, configuring and securing access, upgrading and managing Kafka, and even managing topics and users within Kafka itself.
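To give a sense of the operator model, a Strimzi-managed cluster is declared as a Kafka custom resource. The following is a hedged sketch, not the blueprint's exact manifest (field values such as the cluster name, storage size, and listener settings are illustrative; judging from the broker and controller pod names later in this post, the blueprint actually runs Kafka in KRaft mode with separate node pools):

```yaml
# Illustrative sketch of a Strimzi Kafka custom resource.
# Values are examples, not the exact manifest the DoEKS blueprint applies.
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: cluster
  namespace: kafka
spec:
  kafka:
    replicas: 3                      # three brokers for high availability
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
    config:
      default.replication.factor: 3
      min.insync.replicas: 2
    storage:
      type: jbod
      volumes:
        - id: 0
          type: persistent-claim     # backed by an EBS volume via the EBS CSI driver
          size: 100Gi
          deleteClaim: false
```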

Solution overview

The following detailed diagram shows the design we will implement in this post.

Diagram illustrating the architecture of Apache Kafka running on an Amazon EKS cluster, showing the interaction between various components and services.

In this post, you will learn how to provision the resources depicted in the diagram using Terraform modules that leverage DoEKS blueprints. These resources are:

  • A sample Virtual Private Cloud (VPC) with three private subnets and three public subnets.
  • An internet gateway for the public subnets, a Network Address Translation (NAT) gateway for the private subnets, and VPC endpoints for Amazon Elastic Container Registry (Amazon ECR), Amazon Elastic Compute Cloud (Amazon EC2), AWS Security Token Service (AWS STS), and other services.
  • An Amazon Elastic Kubernetes Service (Amazon EKS) cluster control plane with a public endpoint (for demonstration purposes only) and two managed node groups.
  • Amazon EKS Managed Add-ons: VPC_CNI, CoreDNS, Kube_Proxy, EBS_CSI_Driver.
  • A metrics server, Karpenter, and the Kube-Prometheus-Stack with a local Prometheus instance and, optionally, an Amazon Managed Prometheus instance.
  • Strimzi Operator for Apache Kafka, deployed to the strimzi-kafka-operator namespace. By default, the operator watches and manages Kafka resources in all namespaces.

Walkthrough

Prerequisites

You will need an AWS account, a profile with administrator permissions, and the following tools installed locally:

Deploy infrastructure

Clone the code repository to your local machine.

git clone https://github.com/awslabs/data-on-eks.git
cd data-on-eks/streaming/kafka
export AWS_REGION="us-west-2" # Select your own region

You need to set up your AWS credentials profile locally before running Terraform or AWS CLI commands. Run the following commands to deploy the blueprint. The deployment should take approximately 20-30 minutes.

chmod +x install.sh
./install.sh

Verify the deployment

List Amazon EKS nodes

The following command will update the kubeconfig on your local machine and allow you to interact with your Amazon EKS Cluster using kubectl to validate the deployment.

chmod +x helper.sh 
./helper.sh update-kubeconfig 

# Output should look like the following:
# Added new context arn:aws:eks:ca-central-1:xxxxx:cluster/kafka-on-eks to /Users/username/.kube/config

Deploy the Kafka cluster manifests

The Kafka cluster is deployed with a multi-Availability Zone (AZ) configuration for high availability. This offers stronger fault tolerance by distributing brokers across multiple AZs within an AWS Region, ensuring that a failed AZ does not cause Kafka downtime. To achieve high availability, a minimum of three brokers is required. Begin by running the following command to apply the cluster manifests:
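Strimzi supports this AZ-aware placement through its rack configuration, which maps brokers to zones via the node's topology label so that partition replicas are spread across zones. The following fragment uses the documented Strimzi API field; whether the blueprint sets this exact field is an assumption:

```yaml
# Fragment of a Strimzi Kafka resource: make Kafka rack-aware so that
# replicas of a partition are placed in different Availability Zones.
spec:
  kafka:
    rack:
      topologyKey: topology.kubernetes.io/zone
```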

./helper.sh apply-kafka-cluster-manifests

Check that the deployment has created six nodes: three for the core node group and three for the Kafka brokers, across three AZs.

./helper.sh get-nodes-core
# Output should look like below
NAME                                       STATUS   ROLES          AGE   VERSION
ip-10-1-0-100.us-west-2.compute.internal   Ready    <none>   62m   v1.27.7-eks-e71965b
ip-10-1-1-163.us-west-2.compute.internal   Ready    <none>   27m   v1.27.7-eks-e71965b
ip-10-1-2-209.us-west-2.compute.internal   Ready    <none>   63m   v1.27.7-eks-e71965b

./helper.sh get-nodes-kafka
# Output should look like below
NAME                                          STATUS   ROLES    AGE
ip-10-1-2-103.us-west-2.compute.internal      Ready    <none>   2m45s
ip-10-1-2-160.us-west-2.compute.internal      Ready    <none>   2m7s
ip-10-1-2-189.us-west-2.compute.internal      Ready    <none>   3m5s
ip-100-64-107-67.us-west-2.compute.internal   Ready    <none>   2m55s
ip-100-64-164-142.us-west-2.compute.internal  Ready    <none>   3m3s
ip-100-64-198-105.us-west-2.compute.internal  Ready    <none>   2m56s
ip-100-64-93-225.us-west-2.compute.internal   Ready    <none>   2m36s

Verify Kafka Broker pods

Verify the Kafka broker pods created by the Strimzi Operator and the desired number of Kafka replicas. Because Strimzi uses custom resources to extend the Kubernetes API, you can also use kubectl with Strimzi to work with Kafka resources.

./helper.sh get-strimzi-pod-sets
# Output should look like below
NAME               PODS  READY PODS  CURRENT PODS  AGE
cluster-broker     3     3           3             14h
cluster-controller 3     3           3             14h

./helper.sh get-kafka-pods
# Output should look like below
NAME                                     READY STATUS  RESTARTS AGE
cluster-broker-0                         1/1   Running 0        14h
cluster-broker-1                         1/1   Running 0        14h
cluster-broker-2                         1/1   Running 0        14h
cluster-controller-3                     1/1   Running 0        14h
cluster-controller-4                     1/1   Running 0        14h
cluster-controller-5                     1/1   Running 0        14h
cluster-cruise-control-7fb6c986b9-8lxpl  1/1   Running 0        4m56s
cluster-entity-operator-78cd88c89c-dnhx2 2/2   Running 0        5m17s
cluster-kafka-exporter-7b4785f7b9-9sqvs  1/1   Running 0        4m26s

./helper.sh get-all-kafka-namespace
# Output should look like below
NAME                                         READY  STATUS   RESTARTS      AGE
pod/cluster-broker-0                         1/1    Running  1 (6m16s ago) 73m
pod/cluster-broker-1                         1/1    Running  0             73m
pod/cluster-broker-2                         1/1    Running  0             73m
pod/cluster-controller-3                     1/1    Running  0             4m37s
pod/cluster-controller-4                     1/1    Running  0             73m
pod/cluster-controller-5                     1/1    Running  0             73m
pod/cluster-cruise-control-556cd84667-vdcr8  1/1    Running  0             4m34s
pod/cluster-entity-operator-5875f7b885-tjp27 2/2    Running  0             4m56s
pod/cluster-kafka-exporter-58c5b95f7c-b7zp5  1/1    Running  0             4m3s

NAME                              TYPE       CLUSTER-IP     EXTERNAL-IP   PORT(S)                                        AGE
service/cluster-cruise-control    ClusterIP  172.20.33.159  <none>        9090/TCP                                       4m48s
service/cluster-kafka-bootstrap   ClusterIP  172.20.70.127  <none>        9091/TCP,9092/TCP,9093/TCP                     73m
service/cluster-kafka-brokers     ClusterIP  None           <none>        9090/TCP,9091/TCP,8443/TCP,9092/TCP,9093/TCP   73m

NAME                                     READY  UP-TO-DATE  AVAILABLE  AGE
deployment.apps/cluster-cruise-control   1/1    1           1          4m48s
deployment.apps/cluster-entity-operator  1/1    1           1          5m10s
deployment.apps/cluster-kafka-exporter   1/1    1           1          4m17s

NAME                                               DESIRED  CURRENT  READY  AGE
replicaset.apps/cluster-cruise-control-556cd84667  1        1        1      4m48s
replicaset.apps/cluster-entity-operator-5875f7b885 1        1        1      5m10s
replicaset.apps/cluster-kafka-exporter-58c5b95f7c  1        1        1      4m17s

Create Kafka topics and run sample test

In Apache Kafka, a topic is a category used to organize messages. Each topic has a unique name across the entire Kafka cluster. Messages are sent to and read from specific topics. Topics are partitioned and replicated across brokers in our implementation.

Producers and consumers play a crucial role in handling data. Producers are source systems that write data to Kafka topics; they send streams of records to the Kafka brokers. Consumers are target systems that read data from Kafka brokers; they can subscribe to one or more topics in the Kafka cluster and process the data.

In Kafka, producers and consumers are fully decoupled and agnostic of each other, which is a key design element to achieve high scalability.

You will create a Kafka topic my-topic and run a Kafka producer to publish messages to it. A Kafka Streams application will consume messages from the Kafka topic my-topic and send them to another topic, my-topic-reversed. A Kafka consumer will then read messages from the my-topic-reversed topic.

To make it easy to interact with the Kafka cluster, you create a pod that contains the Kafka CLI tools.

./helper.sh create-kafka-cli-pod

Create the Kafka topics

./helper.sh apply-kafka-topic
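The manifests applied by this helper are likely similar in spirit to the following KafkaTopic sketch (names and values here are illustrative, chosen to match the partition count and replication factor shown later in this post):

```yaml
# Illustrative KafkaTopic custom resource managed by the Strimzi Topic Operator.
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: my-topic
  namespace: kafka
  labels:
    strimzi.io/cluster: cluster   # binds the topic to the Kafka cluster named "cluster"
spec:
  partitions: 12
  replicas: 3
  config:
    min.insync.replicas: 2
```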

Verify the status of the Kafka topics:

./helper.sh get-kafka-topic
# Output should look like below
NAME        CLUSTER  PARTITIONS  REPLICATION FACTOR   READY
test-topic  cluster  12          3                    True

The cluster has a minimum of three brokers, each in a different AZ, with topics having a replication factor of three and a minimum In-Sync Replica (ISR) of two. This will allow a single broker to be down without affecting producers with acks=all. Fewer brokers will compromise either availability or durability.

Verify that the topic partitions are replicated across all three brokers.

./helper.sh describe-kafka-topic
# Output should look like below
Topic: test-topic TopicId: Sr59VXmzRNGfjcqH_BWGtg PartitionCount: 12 ReplicationFactor: 3 Configs: min.insync.replicas=2
Topic: test-topic Partition: 0 Leader: 2  Replicas: 2,1,0 Isr: 2,1,0
Topic: test-topic Partition: 1 Leader: 1  Replicas: 1,0,2 Isr: 1,0,2
Topic: test-topic Partition: 2 Leader: 0  Replicas: 0,2,1 Isr: 0,2,1
Topic: test-topic Partition: 3 Leader: 2  Replicas: 2,0,1 Isr: 2,0,1
Topic: test-topic Partition: 4 Leader: 0  Replicas: 0,1,2 Isr: 0,1,2
Topic: test-topic Partition: 5 Leader: 1  Replicas: 1,2,0 Isr: 1,2,0 
Topic: test-topic Partition: 6 Leader: 0  Replicas: 0,2,1 Isr: 0,2,1
Topic: test-topic Partition: 7 Leader: 2  Replicas: 2,1,0 Isr: 2,1,0
Topic: test-topic Partition: 8 Leader: 1  Replicas: 1,0,2 Isr: 1,0,2
Topic: test-topic Partition: 9 Leader: 1  Replicas: 1,0,2 Isr: 1,0,2
Topic: test-topic Partition: 10 Leader: 0 Replicas: 0,2,1 Isr: 0,2,1
Topic: test-topic Partition: 11 Leader: 2 Replicas: 2,1,0 Isr: 2,1,0

The Kafka topic my-topic has 12 partitions, each with a different leader and set of replicas. All replicas of a partition exist on separate brokers for better scalability and fault tolerance. The replication factor of three ensures that each partition has three replicas, so data is still available even if one or two brokers fail. The same applies to the Kafka topic my-topic-reversed.

Reading and writing messages to the cluster

Deploy a simple Kafka producer, a Kafka Streams application, and a Kafka consumer.

./helper.sh deploy-kafka-consumer
# Output should look like below 
deployment.apps/java-kafka-producer created 
deployment.apps/java-kafka-streams created 
deployment.apps/java-kafka-consumer created

./helper.sh get-kafka-consumer-producer-steams-pods

# Output should look like below
NAME                                  READY  STATUS   RESTARTS  AGE
java-kafka-consumer-fb9fdd694-2dxh5   1/1    Running  0         4s
java-kafka-producer-6459bd5c47-69rjx  1/1    Running  0         5s
java-kafka-streams-7967f9c87f-lsckv   1/1    Running  0         5s

Verify the messages moving through your Kafka brokers. Split your terminal into three panes and run the following commands, one in each:

./helper.sh verify-kafka-producer
# Output should look like below
... INFO KafkaProducerExample:58 - Sending messages "Hello world - 117"

./helper.sh verify-kafka-streams

# Output should look like below 
... INFO StreamThread:1074 - stream-thread [java-kafka-streams-0c063b2b-f15f-4f6c-a60f-7b6329f08d37-StreamThread-1] Processed 122 total records, ran 0 punctuators, and committed 4 total tasks since the last update 
... INFO ConsumerCoordinator:1508 - [Consumer clientId=java-kafka-streams-0a206cf3-4098-4d88-880c-a0a992acf0be-StreamThread-1-consumer, groupId=java-kafka-streams] Found no committed offset for partition my-topic-0 
... INFO SubscriptionState:407 - [Consumer clientId=java-kafka-streams-0a206cf3-4098-4d88-880c-a0a992acf0be-StreamThread-1-consumer, groupId=java-kafka-streams] Resetting offset for partition my-topic-0 to position FetchPosition{offset=0, offsetEpoch=Optional.empty, currentLeader=LeaderAndEpoch{leader=Optional[cluster-broker-2.cluster-kafka-brokers.kafka.svc:9092 (id: 2 rack: us-east-1a)], epoch=0}}.

./helper.sh verify-kafka-consumer

# Output should look like below 
... INFO KafkaConsumerExample:52 - Received message: 
... INFO KafkaConsumerExample:53 - partition: 0 
... INFO KafkaConsumerExample:54 - offset: 141 
... INFO KafkaConsumerExample:55 - value: "141 - dlrow olleH" 
... INFO KafkaConsumerExample:57 - headers:

Output of a Kafka producer, a Kafka Streams application, and a Kafka consumer interacting with the Kafka topic my-topic

Scaling your Kafka cluster

Scaling Kafka brokers is essential for a few key reasons. Firstly, scaling up the cluster by adding more brokers allows for the distribution of data and replication across multiple nodes. This reduces the impact of failures and ensures the system’s resilience. Secondly, with more broker nodes in the cluster, Kafka can distribute the workload and balance the data processing across multiple nodes, which leads to reduced latency. This means that data can be processed faster, improving the overall performance of the system.

Furthermore, Kafka is designed to be scalable. As your needs change, you can scale out by adding brokers to a cluster or scale in by removing brokers. In either case, data should be redistributed across the brokers to maintain balance. This flexibility allows you to adapt your system to meet changing demands and ensure optimal performance.

To scale the Kafka cluster, you need to modify the Kafka.spec.kafka.replicas configuration to add or remove brokers.

./helper.sh update-kafka-replicas
kafka.kafka.strimzi.io/cluster patched
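Conceptually, the helper patches the broker replica count on the Kafka custom resource. With kubectl directly, a merge patch along these lines would do the same (assuming the Kafka resource is named cluster in the kafka namespace, as the patched-resource output above suggests; if the cluster uses KafkaNodePool resources, the replicas field lives on the node pool instead):

```shell
# Illustrative: scale from 3 to 4 brokers by patching the Kafka custom resource.
kubectl patch kafka cluster -n kafka --type merge \
  -p '{"spec":{"kafka":{"replicas":4}}}'
```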

When checking the status of the Kafka broker and controller pods, note that after configuration changes or node failures it may take approximately two to five minutes for the pods to restart and reach the Running status, for Kafka brokers to rejoin the cluster and synchronize data, and for controllers to re-elect leadership and stabilize the system.

You should now have four Kafka brokers.

./helper.sh verify-kafka-brokers
# Output should look like below
NAME                  READY   STATUS    RESTARTS  AGE
cluster-broker-0      1/1     Running   0         39m 
cluster-broker-1      1/1     Running   0         39m 
cluster-broker-2      1/1     Running   0         39m 
cluster-controller-3  1/1     Running   0         83s 
cluster-controller-4  1/1     Running   0         2m27s 
cluster-controller-5  1/1     Running   0         115s

However, the existing partitions from the Kafka topics are still concentrated on the initial brokers.

./helper.sh describe-kafka-topic-partitions 

# Output should look like below 
Topic: my-topic TopicId: ihXz88M9RaqL3YbXqaKAmA PartitionCount: 1 ReplicationFactor: 3 Configs: min.insync.replicas=4 
Topic: my-topic Partition: 0 Leader: 1 Replicas: 1,0,2 Isr: 1,0,2

You need to redistribute these partitions to balance the cluster.

Cluster balancing

To balance partitions across your brokers, you can use Kafka Cruise Control. Cruise Control is a tool designed to fully automate dynamic workload rebalancing and self-healing of a Kafka cluster, which greatly simplifies the operation of Kafka clusters.

Cruise Control’s optimization approach allows it to perform several different operations linked to balancing cluster load. As well as rebalancing a whole cluster, Cruise Control also has the ability to do this balancing automatically when adding and removing brokers or changing topic replica values.

Kafka Cruise Control is already running in the kafka namespace and should appear as follows:

./helper.sh get-cruise-control-pods
# Output should look like below
NAME                                      READY   STATUS    RESTARTS   AGE
cluster-cruise-control-6689dddcd9-rxgm7   1/1     Running   0          9m25s

Strimzi provides a way of interacting with the Cruise Control API from the Kubernetes CLI through KafkaRebalance resources. A KafkaRebalance resource is deployed on the cluster as part of the Kafka installation; it is used to create optimization proposals and execute partition rebalances based on those proposals.

The Cluster Operator will subsequently update the KafkaRebalance resource with the details of the optimization proposal for review.
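A minimal KafkaRebalance resource looks like the following; an empty spec requests a full-cluster rebalance using Cruise Control's default goals (the resource name is illustrative, chosen to match the annotation output later in this section):

```yaml
# Illustrative KafkaRebalance resource; an empty spec asks Cruise Control
# for a full-cluster rebalance proposal using its default goals.
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: my-rebalance
  namespace: kafka
  labels:
    strimzi.io/cluster: cluster
spec: {}
```

Approving a ready proposal is done by annotating the resource with strimzi.io/rebalance=approve, which is presumably what the helper's annotate command below wraps.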

./helper.sh describe-kafka-rebalance | grep "Proposal"
# Output should look like below
# Note: It takes ~2 minutes for Proposal Status to be ProposalReady.
Type:ProposalReady

When the proposal is ready and looks good, you can execute the rebalance based on that proposal by annotating the KafkaRebalance resource like this:

./helper.sh annotate-kafka-rebalance-pod

# Output should look like below
kafkarebalance.kafka.strimzi.io/my-rebalance annotated

Wait for Kafka Rebalance to be Ready:

./helper.sh describe-kafka-rebalance | grep "Type"
# Output should look like below
# Note: It takes ~3 minutes for Type to change from "Rebalancing" to "Ready".
Type:Ready

Check that the partitions are now spread across all four Kafka brokers:

./helper.sh describe-kafka-partitions

# Output should look like below
Topic: my-topic TopicId: qOyjJRSuQTCrOAldQZB0yw PartitionCount: 12      ReplicationFactor: 3    Configs: min.insync.replicas=2,message.format.version=3.0-IV1
        Topic: my-topic Partition: 0    Leader: 3      Replicas: 3,0,1 Isr: 1,0,3
        Topic: my-topic Partition: 1    Leader: 0      Replicas: 0,2,3 Isr: 0,2,3
        Topic: my-topic Partition: 2    Leader: 2      Replicas: 2,0,3 Isr: 2,0,3
        Topic: my-topic Partition: 3    Leader: 3      Replicas: 3,0,2 Isr: 0,2,3
        Topic: my-topic Partition: 4    Leader: 0      Replicas: 0,2,3 Isr: 0,2,3
        Topic: my-topic Partition: 5    Leader: 2      Replicas: 2,0,3 Isr: 2,0,3
        Topic: my-topic Partition: 6    Leader: 3      Replicas: 3,0,1 Isr: 1,0,3
        Topic: my-topic Partition: 7    Leader: 3      Replicas: 3,1,0 Isr: 0,1,3
        Topic: my-topic Partition: 8    Leader: 1      Replicas: 1,0,3 Isr: 0,1,3
        Topic: my-topic Partition: 9    Leader: 3      Replicas: 3,0,1 Isr: 1,0,3
        Topic: my-topic Partition: 10   Leader: 0      Replicas: 0,2,3 Isr: 0,2,3
        Topic: my-topic Partition: 11   Leader: 2      Replicas: 2,0,3 Isr: 2,0,3

Benchmark your Kafka cluster

Kafka includes a set of tools, namely kafka-producer-perf-test.sh and kafka-consumer-perf-test.sh, to run performance tests and simulate the expected load on the cluster. When benchmarking your Kafka cluster on Amazon EKS, choosing the right compute is key to achieving cost-efficiency and high performance. AWS Graviton instances offer up to 30% better price-performance for Kafka workloads compared to x86-based instances, with improvements in throughput, latency, and energy efficiency. Real-world case studies show that migration to Graviton is straightforward and delivers significant cost savings, often with little to no code changes. Kafka and Strimzi already provide official arm64-based container images, so adopting Graviton means simply selecting Graviton instance types for your EKS node groups.

Create a topic for performance test

Earlier, you created topics using Kubernetes constructs and Strimzi Custom Resource Definitions (CRDs). Now you create a topic using the kafka-topics.sh script. Either option can be used.

./load-test.sh create-perf-test-topic

# Output should look like below
Created topic test-topic-perf.
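Under the hood, the helper presumably runs something like the standard kafka-topics.sh invocation below from the CLI pod (the bootstrap service name comes from the earlier service listing; the partition and replication settings are assumptions):

```shell
# Illustrative: create the performance-test topic with the Kafka CLI.
kafka-topics.sh --create \
  --topic test-topic-perf \
  --partitions 12 \
  --replication-factor 3 \
  --bootstrap-server cluster-kafka-bootstrap.kafka.svc:9092
```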

Test the producer performance

Then run the producer performance test script with different configuration settings. The following example uses the topic created above to store 10 million messages with a size of 100 bytes each. The -1 value for --throughput means that messages are produced as quickly as possible, with no throttling limit. Kafka producer configuration properties such as acks and bootstrap.servers are passed via the --producer-props argument.
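The run-producer-perf-test helper likely wraps an invocation similar to the following (the bootstrap address is an assumption based on the earlier service listing):

```shell
# Illustrative: the standard Kafka producer perf-test invocation
# matching the parameters described above.
kafka-producer-perf-test.sh \
  --topic test-topic-perf \
  --num-records 10000000 \
  --record-size 100 \
  --throughput -1 \
  --producer-props acks=all bootstrap.servers=cluster-kafka-bootstrap.kafka.svc:9092
```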

./load-test.sh run-producer-perf-test

# Output should look similar to the following:
...
1825876 records sent, 365175.2 records/sec (34.83 MB/sec), 829.4ms avg latency, 930.0 ms max latency.
1804564 records sent, 360912.8 records/sec (34.42 MB/sec), 839.0ms avg latency, 1039.0 ms max latency.

In this example, approximately 360,000 messages were produced per second, with a maximum latency of about one second.

Next, you test the consumer performance.

Test the consumer performance

The consumer performance test script will read 10 million messages from the test-topic-perf topic.
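The run-consumer-perf-test helper likely wraps the standard consumer perf-test tool along these lines (again, the bootstrap address is an assumption):

```shell
# Illustrative: the standard Kafka consumer perf-test invocation.
kafka-consumer-perf-test.sh \
  --topic test-topic-perf \
  --messages 10000000 \
  --bootstrap-server cluster-kafka-bootstrap.kafka.svc:9092
```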

./load-test.sh run-consumer-perf-test

# Output should look like the following:
WARNING: Exiting before consuming the expected number of messages: timeout (10000 ms) exceeded. You can use the --timeout option to increase the timeout.
start.time: 2025-08-19 17:48:11:366
end.time: 2025-08-19 17:48:30:578
data.consumed.in.MB: 1639.7957
MB.sec: 85.3527
data.consumed.in.nMsg: 17194504
nMsg.sec: 894987.7160
rebalance.time.ms: 3359
fetch.time.ms: 15853
fetch.MB.sec: 103.4376
fetch.nMsg.sec: 1084621.4597

The output shows that 1639.7957 MB of data (17,194,504 messages) was consumed, with a throughput of 85.3527 MB/sec (894,987 messages/sec).

View Kafka cluster in Grafana

The blueprint installs a local instance of Prometheus and Grafana with predefined dashboards, and it has the option to create an instance of Amazon Managed Prometheus. The secret for logging in to the Grafana dashboard is stored in AWS Secrets Manager.

Log in to the Grafana dashboard by running the following command:

./helper.sh view-and-login-to-grafana-dashboard

Open a browser to the local Grafana web UI.

Enter admin as the username; the password can be extracted from the output of the previous command.

Open Strimzi Kafka dashboard

Go to the Dashboards page at http://localhost:8080/dashboards, choose General, and then choose Strimzi Kafka.
You should see the built-in Kafka dashboards that were created during the deployment:

Grafana Dashboard showing Strimzi Kafka parameters


Test Kafka disruption

When you deploy Kafka on Amazon Elastic Compute Cloud (Amazon EC2) instances, you can configure storage in two primary ways: Amazon Elastic Block Store (Amazon EBS) and instance storage. With Amazon EBS, a volume is attached to an instance over the network, whereas instance storage is directly attached to the instance.

Amazon EBS volumes offer consistent I/O performance and flexibility in deployment. This is crucial when considering Kafka’s built-in fault tolerance of replicating partitions across a configurable number of brokers. In the event of a broker failure, a new replacement broker fetches all data the original broker previously stored from other brokers in the cluster that hosts the other replicas. Depending on your application, this could involve copying tens of gigabytes or terabytes of data, which takes time and increases network traffic, potentially impacting the performance of the Kafka cluster. A properly designed Kafka cluster based on Amazon EBS storage can virtually eliminate re-replication overhead that would be triggered by an instance failure, as the Amazon EBS volumes can be reassigned to a new instance quickly and easily.

For instance, if an underlying Amazon EC2 instance fails or is terminated, the broker’s on-disk partition replicas remain intact and can be mounted by a new Amazon EC2 instance. By using Amazon EBS, most of the replica data for the replacement broker will already be in the Amazon EBS volume and hence won’t need to be transferred over the network.

When replacing a Kafka broker, it is recommended to reuse the failed broker's ID in the replacement broker. This approach is operationally simple, as the replacement broker automatically resumes the work that the failed broker was doing.

You will now simulate a node failure for an Amazon EKS node from the Kafka node group that hosts one of the brokers.

Create a topic and send some data to it:

./helper.sh create-kafka-failover-topic
# Output should look like below
Created topic test-topic-failover

./helper.sh describe-kafka-failover-topic
# Output should look like below
Topic: test-topic-failover 
TopicId: zpk0hKUkSM61sK-haPpaNA 
PartitionCount: 1        
ReplicationFactor: 3          
Configs: min.insync.replicas=2,message.format.version=3.0-IV1
Topic: test-topic-failover  Partition: 0  Leader: 0  Replicas: 0,2,3  Isr: 0,2,3

# The following Send and Read messages commands should be executed from two different shells:
# Execute from shell 1 to send the message
./helper.sh send-messages-to-kafka-failover-topic-from-producer
# Output should look like below
> This is a message for testing failover

# Open a new shell and execute the following command from the new shell to consume the message
./helper.sh read-messages-from-kafka-failover-topic-consumer
# Output should look like below
> This is a message for testing failover

Leader: 0 in the output above means the leader for topic test-topic-failover is broker 0, which corresponds to the pod cluster-kafka-0.

Find which node is running pod cluster-kafka-0:

./helper.sh get-kafka-cluster-pod
# Output should look like below
NAME              READY   STATUS    RESTARTS   AGE   IP           NODE                                      NOMINATED NODE   READINESS GATES
cluster-kafka-0   1/1     Running   0          62m   10.1.2.168   ip-10-1-2-35.us-west-2.compute.internal   <none>           <none>

Note: Your nominated node will be different.

Drain the node where the pod cluster-kafka-0 is running, then get the instance ID and terminate the EC2 instance as well:

./helper.sh create-node-failure # this includes ec2 instance termination

# Output should look like below
node/ip-10-1-2-160.us-west-2.compute.internal cordoned
Warning: ignoring DaemonSet-managed Pods: kube-prometheus-stack/kube-prometheus-stack-prometheus-node-exporter-mb4xp, kube-system/aws-node-kmf4f, kube-system/ebs-csi-node-7xxg2, kube-system/kube-proxy-szb98
evicting pod kafka/cluster-entity-operator-f4f6f8bd-x6ngb
evicting pod kafka/cluster-broker-0
pod/cluster-entity-operator-f4f6f8bd-x6ngb evicted
pod/cluster-broker-0 evicted
node/ip-10-1-2-160.us-west-2.compute.internal drained
EC2 Instance ID: i-0094f7b4d0e62d1e5

The cluster-kafka-0 pod will be recreated on a new node.

./helper.sh validate-kafka-cluster-pod

# Output should look like below
# Note: It takes ~5 minutes for cluster-kafka-0 pod to reach Running status.
NAME              READY   STATUS    RESTARTS   AGE     IP          NODE                                       NOMINATED NODE   READINESS GATES
cluster-kafka-0   1/1     Running   0          4m19s   10.1.2.11   ip-10-1-2-100.us-west-2.compute.internal   <none>           <none>

Verify the message is still available under the test-topic-failover.

./helper.sh read-messages-from-kafka-failover-topic-consumer

# Output should look like below
This is a message for testing failover

Cleaning up

Delete the resources created in this post to avoid incurring ongoing costs. You can run the following script:

./cleanup.sh

Conclusion

In this post, I showed you how to deploy Apache Kafka on Amazon EKS while ensuring high availability across multiple nodes with automatic failover. Additionally, I explored the benefits of using Data on EKS blueprints to quickly deploy infrastructure using Terraform. If you would like to learn more, then you can visit the Data on EKS GitHub repository to collaborate with others and access additional resources.