AWS Partner Network (APN) Blog
How to Unlock Real-Time Data Streams with CockroachDB and Amazon MSK
By David Joy, Sr. Engineer – Cockroach Labs
By Abbey Russell, Sr. Engineer – Cockroach Labs
By Pranav Deshmukh, Sr. Partner Solutions Architect – AWS
Customers across sectors are accustomed to immediate service and personalized experiences. To deliver on those expectations, businesses need to monitor, respond to, and capture data in real time.
Apache Kafka, an open-source platform developed by the Apache Software Foundation, has become a popular choice because it can handle real-time data feeds with low latency.
Whether it’s clickstream data from a website, telemetry from Internet of Things (IoT) devices, or transaction data from a point of sale (POS) system, Kafka processes data as it arrives, ensuring insights can be gathered in real time.
However, managing a Kafka deployment can be complex and resource intensive, often requiring additional support.
Integrating Amazon Managed Streaming for Apache Kafka (Amazon MSK) and CockroachDB in your Kafka deployment enables a wide range of use cases, including real-time analytics, event-driven microservices such as inventory management, and the ability to archive data for audit logging.
In this post, Cockroach Labs offers a step-by-step guide to integrating Amazon MSK with CockroachDB. The result is a robust, scalable, and fault-tolerant pipeline that moves data from CockroachDB to Amazon MSK. We assume readers have basic knowledge of Unix and CockroachDB commands, and already have a virtual private cloud (VPC), subnets, and a key pair created; where these don’t exist yet, we link to the documentation for creating them.
Cockroach Labs is an AWS Specialization Partner and AWS Marketplace Seller with the Data and Analytics Competency. It is the creator of CockroachDB, a cloud-native, fully managed, distributed SQL database that’s been architected and built for scale.
Amazon MSK and CockroachDB Background
Amazon MSK is a fully managed service that allows you to use the power and capabilities of Apache Kafka to process and analyze streaming data—without the overhead of managing the infrastructure.
CockroachDB is an open-source, distributed SQL database designed for building, scaling, and managing cloud services in diverse environments with data stored around the world. It provides several key features:
- Distributed SQL: Scale data horizontally across servers, regions, and continents. Automatically replicate and redistribute data to optimize performance and resilience without manual sharding.
- Transactional consistency: Data is correct and up to date across all nodes in the cluster, regardless of physical location.
- High availability: Automatically replicates data across multiple nodes so that if one node goes down CockroachDB will continue to function.
- Compatibility: Using standard SQL syntax that is wire-compatible with PostgreSQL, the platform works with existing PostgreSQL client libraries and tools for ease of use (see the example after this list).
- Cloud-native: Designed to work with public, private, or hybrid clouds, it can also be deployed on premises.
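Because the platform is wire-compatible with PostgreSQL, a stock psql client can connect directly. A minimal sketch, assuming a secure cluster whose CA certificate is at certs/ca.crt (the host and user are examples; 26257 is CockroachDB’s default SQL port):

psql "postgresql://myuser@cockroach-host:26257/defaultdb?sslmode=verify-full&sslrootcert=certs/ca.crt"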
Now, let’s dive into how to integrate these two technologies.
Phase 1: Set Up Amazon MSK
First, you’ll need to set up an Amazon Virtual Private Cloud (VPC) to provide a secure environment for your MSK cluster. Note that the example in the linked documentation shows two AWS Availability Zones (AZs), two public subnets, and two private subnets; we recommend creating three of each when creating the VPC.
Go to the VPC dashboard in the AWS Management Console and follow the prompts to create a new VPC.
Once your VPC is ready, create your MSK cluster by navigating to the Amazon MSK service in the AWS console.
Go to Cluster Settings and click on Create cluster. Specify the cluster properties including name, Kafka version, number of broker nodes, and broker instance type, according to your requirements. For example:
Figure 1 – Cluster settings.
These are the selections shown in the diagram above:
- Cluster creation method: Select “Custom create”
- Cluster name: Add your cluster name
- Cluster type: Select “Provisioned”
- Apache Kafka version: 2.8.1 (recommended)
- Brokers:
- Broker type: kafka.m5.large, or whichever applies to your use case
- Number of zones: 3
- Brokers per zone: 1
- Amazon EBS storage per broker: 1,000 GiB
- Cluster storage mode: EBS storage only
- Cluster configuration: Amazon MSK default configuration
These steps describe the selections for the networking, security, and monitoring screens:
- For networking, select the VPC you created earlier and its Availability Zones.
- For subnets, choose the three private subnets in three different AZs. Enable traffic between the security groups of CockroachDB and your MSK cluster.
- For security, select SASL/SCRAM authentication, as CockroachDB only supports connecting to MSK through SASL/SCRAM. Keep the defaults for everything else.
- For monitoring, select Basic monitoring and then review and create your cluster.
Phase 2: Set Up AWS Secrets Manager
AWS Secrets Manager is designed to protect access to your applications, services, and IT resources.
First, navigate to the AWS Secrets Manager page on the AWS console and click on Store a new secret.
Figure 2 – Creating secret.
Choose Other types of secrets and input your MSK cluster’s credentials. Name the secret and configure the encryption and permissions. To do this, go to the Plaintext tab and add {"username":"djoy","password":"secret-password"}, substituting your own username and password.
Next, add a new symmetric encryption KMS key. Follow the steps to create a new encryption key, add the secret name and description, and then review and store the secret. Note that a secret used for MSK SASL/SCRAM authentication must have a name beginning with the AmazonMSK_ prefix and must be encrypted with a customer-managed KMS key rather than the default Secrets Manager key.
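For reference, a sketch of the same step with the AWS CLI; the secret name, credentials, and key alias below are examples:

# Create the secret, encrypted with a customer-managed KMS key.
# The alias msk-secret-key is an example; create or reuse your own key.
aws secretsmanager create-secret \
  --name AmazonMSK_djoy \
  --secret-string '{"username":"djoy","password":"secret-password"}' \
  --kms-key-id alias/msk-secret-key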
Phase 3: Authorization
Navigate to the Amazon MSK console and click on your MSK cluster. Go to the Properties tab and scroll to the Security settings section. Find and click the Associate secrets button, and then click Choose secrets on the next screen.
Select both secrets you created in previous steps and click Associate secret.
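The same association can be scripted with the AWS CLI; both ARNs below are placeholders:

aws kafka batch-associate-scram-secret \
  --cluster-arn arn:aws:kafka:us-east-1:111122223333:cluster/your-cluster/your-cluster-uuid \
  --secret-arn-list arn:aws:secretsmanager:us-east-1:111122223333:secret:AmazonMSK_djoy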
Phase 4: Create Client Machine
Follow the steps to create a client machine using Amazon Elastic Compute Cloud (Amazon EC2).
Note that “Number 7” in the linked list prompts you to select an IAM role; however, we will not select one because we are using SASL/SCRAM rather than IAM authentication.
Install Kafka on Client Machine
Run the following commands on the client machine to install Kafka. You can use whichever version suits you, but it’s best to match the client’s Kafka version to the version running on the cluster.
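A minimal sketch for Amazon Linux 2, assuming Kafka 2.8.1 with Scala 2.12 to match the cluster created earlier:

# Install a Java runtime, then download and extract the Kafka distribution.
sudo yum -y install java-11
wget https://archive.apache.org/dist/kafka/2.8.1/kafka_2.12-2.8.1.tgz
tar -xzf kafka_2.12-2.8.1.tgz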
Set Up Authorization SCRAM/SASL
Set up client properties files for users in the Kafka home directory. The username and password are the same you gave while creating the secret.
This is how we established the example username, djoy:
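A minimal client properties file, assuming the example credentials stored in Secrets Manager (save it as client.properties_djoy):

# client.properties_djoy — SASL/SCRAM settings for the example user djoy.
security.protocol=SASL_SSL
sasl.mechanism=SCRAM-SHA-512
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required \
  username="djoy" \
  password="secret-password";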
Next, go to the Kafka installation directory and grant the user permissions to create and delete topics via Access Control Lists (ACLs):
cd /home/ec2-user/kafka_2.12-2.8.1
Now, set up environment variables (you can choose your own names).
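For example, a principal variable for the ACL command below; kafka-acls.sh expects the User:<name> format, where djoy is the example user from the secret:

export dn='User:djoy'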
Then, set up the brokerssasl variable to point to your Kafka cluster broker endpoint:
export brokerssasl='your-kafka-broker-endpoints'
Broker endpoints can be found in the MSK console: click on your cluster name, and under the cluster summary choose View client information. Provide all three endpoints, separated by commas, each ending in port 9096. For example:
[ec2-user@ip-10-0-18-219 bin]$ export brokerssasl=b-1.pdcrdbmsk.952j3c.c6.kafka.us-east-1.amazonaws.com:9096,b-3.pdcrdbmsk.952j3c.c6.kafka.us-east-1.amazonaws.com:9096,b-2.pdcrdbmsk.952j3c.c6.kafka.us-east-1.amazonaws.com:9096
Next, run the following command to grant permissions:
./kafka-acls.sh --bootstrap-server $brokerssasl --add --allow-principal $dn --operation All --cluster --command-config ./client.properties_djoy
Note that changing the secret key without dissociating it first can cause errors. Reference this document if you run into this issue.
Next, set up the zkconn variable to point to your Kafka cluster ZooKeeper connect string:
export zkconn='your-zookeeper-connect-string'
The ZooKeeper connect string can be found in the same place: click on your cluster name, and under the cluster summary choose View client information.
Export the string as below:
export zkconn=z-1.pdcrdbmsk.952j3c.c6.kafka.us-east-1.amazonaws.com:2181,z-2.pdcrdbmsk.952j3c.c6.kafka.us-east-1.amazonaws.com:2181,z-3.pdcrdbmsk.952j3c.c6.kafka.us-east-1.amazonaws.com:2181
Run the following command to grant permissions:
./kafka-acls.sh --authorizer-properties zookeeper.connect=$zkconn --add --allow-principal User:ANONYMOUS --operation All --cluster
The output will confirm the ACLs that were added.
Next, create a topic and test it out. First, create the topic users, as sketched below.
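A minimal creation command might look like this (three partitions and a replication factor of three are example values; adjust both to your needs):

./kafka-topics.sh --create --bootstrap-server $brokerssasl --command-config ./client.properties_djoy --topic users --partitions 3 --replication-factor 3

Then, create a consumer for Amazon MSK: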
./kafka-console-consumer.sh --bootstrap-server $brokerssasl --consumer.config client.properties_djoy --topic users --from-beginning
Keep this window open; the consumer will read data from the Kafka topic and print it to standard output.
Phase 5: Set Up CockroachDB
Setting up CockroachDB on AWS involves multiple steps. We recommend following the documentation to install CockroachDB, and running CockroachDB in secure mode for production use.
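For orientation, a single node of a secure cluster is started with a command along these lines (the certificate directory and addresses are examples; see the documentation for full cluster setup):

cockroach start --certs-dir=certs --advertise-addr=10.0.1.10 --join=10.0.1.10,10.0.1.11,10.0.1.12 --background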
Phase 6: Set Up CockroachDB CDC
Change data capture (CDC) provides efficient, distributed, row-level changefeeds into a configurable sink for downstream processing such as reporting, caching, or full-text indexing. CockroachDB’s CDC feature can stream updates from your database to a variety of sinks, including Kafka.
First, connect to your CockroachDB cluster and create your table if it doesn’t already exist.
Note that changefeeds require an enterprise license, and rangefeeds must be enabled for a changefeed to work.
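A minimal end-to-end sketch, assuming a simple users table and the broker endpoints from earlier (substitute your own schema, hosts, and the credentials stored in Secrets Manager):

-- Enable rangefeeds, which changefeeds depend on.
SET CLUSTER SETTING kv.rangefeed.enabled = true;

-- Example table; the changefeed topic name defaults to the table name.
CREATE TABLE users (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name STRING,
    city STRING
);

-- Stream changes to MSK over SASL/SCRAM (example credentials from the secret).
CREATE CHANGEFEED FOR TABLE users INTO 'kafka://b-1.pdcrdbmsk.952j3c.c6.kafka.us-east-1.amazonaws.com:9096?tls_enabled=true&sasl_enabled=true&sasl_user=djoy&sasl_password=secret-password&sasl_mechanism=SCRAM-SHA-512';

-- Insert a row; it should appear in the consumer window shortly.
INSERT INTO users (name, city) VALUES ('Test User', 'New York');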
Data will now start showing up in the consumer window on the client machine, as shown below.
Figure 3 – Data can be seen moving.
Conclusion
In this post, you learned how to successfully integrate Amazon MSK with CockroachDB and set up a fault-tolerant, scalable, real-time data pipeline. Remember to monitor your pipeline regularly and adjust configurations as needed to optimize performance.
This integration unlocks several use cases, including real-time analytics, event-driven microservices for inventory management, and the ability to archive data for audit logging. These capabilities help businesses deliver personalized attention and immediate service, enhancing the overall customer experience.
You can also learn more about CockroachDB in AWS Marketplace.
Cockroach Labs – AWS Partner Spotlight
Cockroach Labs is an AWS Partner and the creator of CockroachDB, a cloud-native distributed SQL database in use at some of the world’s largest enterprises in banking, retail, and media.