Analyze Streaming Data from Amazon Managed Streaming for Apache Kafka Using Snowflake
By Srinivas Kesanapally, Partner Solutions Architect at AWS
By Issac Kunen, Product Manager at Snowflake Computing
By Bosco Albuquerque, Sr. Partner Solutions Architect, Data and Analytics – AWS
By Frank Dallezotte, Sr. Solutions Architect – AWS
In modern data workflows, businesses are interested in learning about end users’ experiences and feedback in near real-time, and then acting on it expeditiously for better business outcomes and user experiences.
When streaming data comes in from a variety of sources—such as smartphones, Internet of Things (IoT) devices, or ad-click data—organizations should have the capability to ingest this data quickly and join it with other relevant business data to derive insights and provide positive experiences to customers.
In this post, we explain how organizations can build and run a fully managed Apache Kafka-compatible Amazon MSK to ingest streaming data. We’ll also explore how to (1) use a Kafka connect application or (2) MSK Connect to persist this data to Snowflake, enabling businesses to derive near real-time insights into end users’ experiences and feedback.
Amazon MSK is a fully-managed service that makes it easy for you to build and run applications that use Apache Kafka to process streaming data. Apache Kafka is an open-source platform for building real-time streaming data pipelines and applications.
Kafka Connect is a tool for streaming data between Apache Kafka and other data systems in a scalable and reliable manner. It can be used to quickly and easily define connectors that move large volumes of data in and out of Kafka, such as for stream processing with low latency. MSK Connect is the Amazon-managed version of Apache Kafka on AWS.
With Amazon MSK, you can use native Apache Kafka APIs to populate data lakes, stream changes to and from databases, and power machine learning (ML) and analytics applications.
The data flow from an Amazon MSK cluster to Snowflake data warehouse using Snowflake connector is as follows. Ingest flow for Kafka with Snowflake connector is shown in Figure 1.
- One or more applications publish JSON or Avro records to an Amazon MSK cluster. The records are split into one or more topic partitions.
If you use Avro records, read the documentation on Installing and Configuring the Kafka Connector to install and configure Avro parser and schema registry.
- The Snowflake connector for Kafka buffers messages from the Kafka topics. In this post, the connector is configured to run on a separate Amazon Elastic Compute Cloud (Amazon EC2) instance in a different AWS Availability Zone.
When a threshold (time or memory or number of messages) is reached, the connector writes the messages to a temporary file in the internal stage. It triggers Snowpipe to ingest the temporary file; Snowpipe copies a pointer to the data file into a queue.
- A Snowflake-managed virtual warehouse loads data from the staged file into the target table (for example, the table specified in the configuration file for the topic) via the pipe created for the Kafka topic partition.
- Not shown in Figure 1: The connector monitors Snowpipe and deletes each file in the internal stage after confirming the file data was loaded into the table. If a failure prevented the data from loading, the connector moves the file into the table stage and produces an error message.
- The connector repeats steps 2-4 above.
Figure 1 – Ingest flow for Kafka with Snowflake connector.
This post explains how to ingest real-time streaming data into Amazon MSK and persists it to Snowflake using the following architecture.
Figure 2 – Amazon MSK architecture with Snowflake.
You will build the Amazon MSK cluster using the provided AWS CloudFormation template. For simplicity, we are running a single cluster in one region across three AWS Availability Zones in three different subnets.
In the same region and same virtual private cloud (VPC), but in a different AWS Availability Zone, you will install a standalone Kafka Connect instance. This instance will be used to:
- Create a topic in the MSK cluster.
- Write test data to the Kafka topic from a producer.
- Push test data to Snowflake.
The following steps walk you through configuring the cluster with Snowflake connectivity using either (1a) MSK with Apache Connect or (1b) MSK with MSK Connect. Following (1a) or (1b) setup, we will generate test data using the Kafka Producer and show that this data is ingested into Snowflake..
(1a) Deploy the Amazon MSK Cluster and Apache Connect
To get started, launch the AWS CloudFormation template:
Be sure to choose the US East (N. Virginia) Region (us-east-1). Then, enter the appropriate stack and key names to log in to the Kafka Connect instance and SSH location.
Choose the defaults on the next two pages, acknowledge that CloudFormation is going to create resources in your account and click Next.
The CloudFormation template creates the following resources in your AWS account:
- Amazon MSK cluster in private subnet.
- Kafka Connect instance in public subnet.
- One VPC.
- Three private subnets and a private subnet.
Ensure the CloudFormation template ran successfully. Note the output values for the KafkaClientEc2Instance and for the MSKClusterARN, which we’ll use in the following steps to configure the Snowflake connector and Kafka topic.
The rest of the steps in the post will enable you to:
- Configure a Kafka connect instance with connectivity to MSK cluster.
- Configure connection to Snowflake database.
- Install Snowflake client tool, Snowsql, to check the connectivity to Snowflake from Kafka connect instance and configure secure private key based authentication to Snowflake.
Login to Kafka Connect Instance
Login to the Kafka Connect instance using the Kafka Connect instance value, KafkaClientEC2InstancePublicDNS, from the CloudFormation template’s output value.
Next, we’ll create a Kafka topic from Kafka Connect instance.
Get Zookeeper information using the MSKClusterArn from the CloudFormation template. A Kafka topic is created via Zookeeper.
Using the Zookeeper connection, create a Kafka topic called MSKSnowflakeTestTopic.
Using this Zookeeper connection string, create a Kafka topic called MSKSnowflakeTestTopic.
The following will be the output upon successful completion:
Created topic MSKSnowflakeTestTopic
Download required Jars and create a zip file.
The CloudFormation template will download the required jars and create a zip file for you containing these jars. The name of the zip file is snowflake-mskcon-1.zip and is located in the /home/ec2-user/jar directory along with the jars downloaded.
Create an S3 bucket and upload this zip file to it.
Generate public and private keys to connect to Snowflake.
The public and private keys needed to connect to Snowflake are created by the CloudFormation template. The names for the private and public key is rsa_key.p8 and rsa_key.pub, respectively. These files contain keys that may contain spaces and new line characters which need to be removed before they are copy/pasted into the MSK Connect config text-box in the MSK Console or in the setting up the public key in Snowflake.
Here are the steps on how to do this:
You can now copy/paste or add the contents of the rsa_key_p8.out file into the .snowflake-connections.properties file in the ec2-user home directory.
We will remove the spaces and newline characters in the public key, and set up the public key for your Snowflake user ID in the following steps in section (2).
As part of the CloudFormation template the cacerts file from your Java version will be copied to the kafka.client.truststore.jks file in the ec2-user home directory.
The CloudFormation template will also add the two parameters below to the client.properties file located in the Kafka bin directory (/home/ec2-user/kafka/kafka_2.12-2.2.1/bin/client.properties)
The CloudFormation template downloads the snowsql client and installs it under the ec2-user home directory in a directory called ‘snowflake’. You can setup your PATH environment variable so that it can find snowsql.
We can now set up and test key pair authentication with Snowflake using the snowsql client.
Log in to Snowflake using snowsql and set up the public key on the user’s account.
Test Snowflake access using the private key.
(2) Testing the KafKa Producer to Send a Message to Snowflake
Start generating test data from a producer running on the Kafka Client EC2 Instance. Open another terminal window connection to the Kafka Client EC2 Instance, and run the following commands:
Finally, verify that data is arriving in Snowflake by logging into your Snowflake account and verifying if a table with the same name as the Kafka topic MSKSnowflakeTestTopic exists in the Snowflake database you used in your setup.
You can also log into Snowflake using snowsql client from the Kafka Connect instance by opening another terminal window connection to the Kafka connect instance as follows:
Snowflake Computing – APN Partner Spotlight
Snowflake is an AWS Competency Partner. A modern cloud data warehouse, Snowflake has built an enterprise-class SQL data warehouse designed for the cloud and today’s data.
*Already worked with Snowflake? Rate this Partner
*To review an APN Partner, you must be an AWS customer that has worked with them directly on a project.