AWS Big Data Blog

Stream Twitter data into Amazon Redshift using Amazon MSK and AWS Glue streaming ETL

Real-time analytics provide a point-in-time view for a variety of use cases. At the heart of any real-time solution is streaming data processing, especially when new content is generated continuously. Organizations might start using streaming data for simple analytics, such as log monitoring or basic dashboards, but eventually develop applications that perform more sophisticated analysis, including machine learning, to extract deeper insights.

AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load your data for analytics. In addition to capabilities such as a Data Catalog, automated schema discovery, automated code generation, and data deduplication, AWS Glue is serverless: you don’t have to provision resources, and you pay only for what you use. AWS Glue recently released serverless streaming ETL, which makes it easy to set up continuous ingestion pipelines that stream data from sources such as Amazon Kinesis and Apache Kafka into data lakes, data warehouses, and other data stores, while cleaning and enriching the data as needed.

This post demonstrates how customers, system integrator (SI) partners, and developers can use the serverless streaming ETL capabilities of AWS Glue with Amazon Managed Streaming for Kafka (Amazon MSK) to stream data to a data warehouse such as Amazon Redshift. We also show you how to view Twitter streaming data on Amazon QuickSight via Amazon Redshift.

Background

Before AWS Glue streaming ETL, you had to stitch multiple components together. For example, to stream real-time data from a social media feed, you needed to use either Amazon MSK or Kinesis to load data, using a combination of AWS Lambda and Amazon Simple Storage Service (Amazon S3) with multiple staging buckets. With AWS Glue streaming ETL, you can now simplify your pipeline with reduced touchpoints that better allow you to focus on business outcomes rather than pipeline management.

For more information about how AWS Glue streaming ETL integrates with Amazon Kinesis, see New – Serverless Streaming ETL with AWS Glue.

The following architecture shows an end-to-end implementation of a streaming solution using Amazon MSK, AWS Glue streaming ETL, Amazon Redshift, and QuickSight.

To illustrate how to set up this architecture, we’ll walk you through the following steps:

  1. Deploying an AWS CloudFormation template to launch a three-node MSK cluster and a Kafka client with an instance of Apache NiFi running on it
  2. Creating a Kafka topic and accessing the Apache NiFi UI
  3. Configuring data streams from Twitter to Kafka using Apache NiFi
  4. Creating an Amazon Redshift cluster and a table to persist streaming data
  5. Creating Amazon MSK and Amazon Redshift tables in the AWS Glue Data Catalog
  6. Authoring an AWS Glue streaming ETL job to load data to Amazon Redshift
  7. Visualizing data from Amazon Redshift in QuickSight
  8. Cleaning up your resources

By default, the CloudFormation template launches the cluster on kafka.m5.large nodes and the client instance on m5.2xlarge, but you can configure it with the instance types appropriate for your use case.

Prerequisites

Make sure to complete the following steps as prerequisites:

  • Have an AWS account. For this post, you configure the required AWS resources using AWS CloudFormation in the us-east-1 Region. If you haven’t signed up for AWS, create an account before you begin.
  • Create and download a valid key pair to SSH into the Kafka client instance; you reference it when you create the stack. For this post, call it KeyPair.pem. For instructions, see Create a key pair using Amazon EC2.
  • Have an access token, access token secret, API key, and API secret key associated with your Twitter developer account. To create a developer account, see Get started with the Twitter developer platform.

Launching your CloudFormation stack

To create your resources for this use case, complete the following steps:

  1. Launch your stack in us-east-1:

  2. On the Quick create stack page, for Stack Name, enter Twitter-MSK-Glue-Redshift-Blog.
  3. For KeyName, choose KeyPair.
  4. For SSH location, enter your IP to log in to the Kafka client instance.

To find your IP, use checkip.amazonaws.com.

  5. Choose Create stack.

The stack creation can take up to 15 minutes to complete.

  6. When the stack creation is complete, on the stack’s Outputs tab, record the values of the following (you can also retrieve them with the boto3 sketch after this list):
    1. KafkaNiFiEC2Instance
    2. MSKSecurityGroupID
    3. PrivateSubnetOne
    4. RedshiftEndpoint
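If you prefer to script this step, the following is a minimal boto3 sketch that waits for stack creation and prints the outputs. It assumes the stack name used above and that your AWS credentials are configured for us-east-1.

import boto3

# Assumes credentials are configured for the us-east-1 Region
cfn = boto3.client("cloudformation", region_name="us-east-1")
stack_name = "Twitter-MSK-Glue-Redshift-Blog"

# Wait until the stack finishes creating (this can take up to 15 minutes)
cfn.get_waiter("stack_create_complete").wait(StackName=stack_name)

# Print the outputs to record: KafkaNiFiEC2Instance, MSKSecurityGroupID,
# PrivateSubnetOne, and RedshiftEndpoint
outputs = cfn.describe_stacks(StackName=stack_name)["Stacks"][0]["Outputs"]
for output in outputs:
    print(f"{output['OutputKey']}: {output['OutputValue']}")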

Creating a Kafka topic and accessing the NiFi UI

With the .pem key, you can now SSH to the NiFi node and set up local port forwarding to access the web interface on your local computer. We use Apache NiFi to configure and pull data from Twitter and publish it to Amazon MSK without writing any code. You may use any tool to poll data from Twitter and publish it to Amazon MSK.

  1. To access the NiFi UI from your local system, use the following command to set up an SSH tunnel into the NiFi instance (replace KafkaNiFiEC2Instance with the information from the AWS CloudFormation output) running on port 8888:
ssh -i ~/Downloads/KeyPair.pem -L 8888:localhost:8888 ec2-user@<KafkaNiFiEC2Instance>

This command allows you to access NiFi via a browser running on your local system. Leave that terminal open to remain connected.

  2. To create a Kafka topic, enter the following code in the terminal (replace ZookeeperConnectString with your Amazon MSK cluster ZooKeeper URL):
/opt/kafka/bin/kafka-topics.sh --create --zookeeper <ZookeeperConnectString> --replication-factor 3 --partitions 1 --topic CovidTweets 

For instructions on finding your ZooKeeper URL, see Getting the Apache ZooKeeper Connection String for an Amazon MSK Cluster.
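You can also look up the connection string programmatically. The following boto3 sketch assumes your credentials are configured for us-east-1 and simply lists every MSK cluster in the account with its ZooKeeper connection string.

import boto3

kafka = boto3.client("kafka", region_name="us-east-1")

# Print each MSK cluster's name and ZooKeeper connection string
for cluster in kafka.list_clusters()["ClusterInfoList"]:
    print(cluster["ClusterName"], cluster["ZookeeperConnectString"])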

The following screenshot shows the resulting output.

  3. While the terminal is connected, launch your browser and use the following URL to access the NiFi UI: http://localhost:8888/nifi/.

You should be able to view the NiFi cluster UI (see the following screenshot), with four processors already added to the canvas.

NiFi supports user authentication via client certificates, username and password, Apache Knox, or OpenID Connect. For this post, we keep it open without a security configuration, but make sure you have robust security in place for your NiFi instances in your own use cases. For more information, see Security Configuration.

Configuring data streams from Twitter to Amazon MSK using Apache NiFi

The following steps take you through connecting to Twitter and pulling data for specific filter terms without writing any code.

  1. Choose the GetTwitter processor.
  2. On the Properties tab, enter the following Twitter credentials:
    1. API key
    2. API secret key
    3. Access token
    4. Access token secret
  3. Choose Apply.

For security purposes, the credentials you enter aren’t readable afterward. In this configuration, Tweets are filtered on the term COVID19; you can provide a comma-separated list of terms if you want to customize these values.

  4. Choose the PublishKafka processor.
  5. On the Properties tab, for Kafka Brokers, enter your comma-separated SSL Amazon MSK bootstrap URL (running on port 9094).

For instructions on finding the broker URL, see Getting the Bootstrap Brokers for an Amazon MSK cluster.
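Alternatively, the following boto3 sketch retrieves the TLS bootstrap broker string (the SSL endpoints on port 9094). The cluster ARN is a placeholder; you can copy it from the Amazon MSK console or from the list_clusters call shown earlier.

import boto3

kafka = boto3.client("kafka", region_name="us-east-1")

# <YourMSKClusterArn> is a placeholder; use your cluster's ARN
brokers = kafka.get_bootstrap_brokers(ClusterArn="<YourMSKClusterArn>")

# BootstrapBrokerStringTls is the comma-separated SSL broker list on port 9094
print(brokers["BootstrapBrokerStringTls"])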

The Topic Name value is already set to CovidTweets, but you can change the topic name if you prefer a different name.

  6. Choose Apply.

You’re now ready to start the flow and stream data from the Twitter API into the MSK cluster. However, before you start streaming Twitter data, create the Data Catalog tables and author the AWS Glue streaming job.

Creating an Amazon Redshift cluster and target table

As part of the AWS CloudFormation deployment, you create a single-node Amazon Redshift cluster. To create the target table for storing relevant fields extracted from Tweets, connect to the cluster and complete the following steps:

  1. On the Amazon Redshift console, connect to the query editor.
  2. Enter the following credentials:
    1. Cluster – Choose the cluster with the endpoint noted earlier
    2. Database name – streaming-data
    3. Database user – awsuser
    4. Database password – Str0ngPas$wd
  3. On the query editor page, enter the following DDL command to create a table named msk_tweets:
create table msk_tweets(
  created_at VARCHAR(max),
  id_str VARCHAR(100),
  text VARCHAR(max),
  source VARCHAR(max),
  user_location VARCHAR(1000),
  hashtags1 VARCHAR(1000),
  hashtags2 VARCHAR(1000),
  lang VARCHAR(max)
);

Modify this table and its fields as needed for your own use case.
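If you prefer to run the DDL from code instead of the query editor, the following sketch uses the Amazon Redshift Data API through boto3. It assumes the Data API is available in your Region; the cluster identifier is a placeholder for the cluster that the CloudFormation stack created.

import boto3

redshift_data = boto3.client("redshift-data", region_name="us-east-1")

ddl = """
create table msk_tweets(
  created_at VARCHAR(max), id_str VARCHAR(100), text VARCHAR(max),
  source VARCHAR(max), user_location VARCHAR(1000),
  hashtags1 VARCHAR(1000), hashtags2 VARCHAR(1000), lang VARCHAR(max)
)
"""

# <YourRedshiftClusterIdentifier> is a placeholder for the cluster from the stack
response = redshift_data.execute_statement(
    ClusterIdentifier="<YourRedshiftClusterIdentifier>",
    Database="streaming-data",
    DbUser="awsuser",
    Sql=ddl,
)
print(response["Id"])  # statement ID; check progress with describe_statement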

Creating the Amazon MSK and Amazon Redshift Data Catalog tables in AWS Glue

This section walks you through creating your connections to Amazon MSK and Amazon Redshift, crawling Amazon Redshift, and creating Data Catalog tables to use as the target for the AWS Glue streaming ETL job.

Creating the Amazon MSK connection

To create your Amazon MSK connection, complete the following steps:

  1. On the AWS Glue console, choose Data catalog.
  2. Choose Connections.
  3. Choose Add connection.
  4. On the Set up your connection’s properties page, for Connection name, enter MSK_Connection.
  5. For Connection type, choose Kafka.
  6. For Kafka bootstrap server URLs, enter your Amazon MSK SSL bootstrap URL running on port 9094.

For instructions on finding your broker URL, see Getting the Bootstrap Brokers for an Amazon MSK Cluster.

  7. Choose Next.
  8. On the Set up access to your data store page, for VPC, choose the VPC containing the name MSK-GLUE-VPC.
  9. For Subnet, choose the subnet containing the name MMPrivateSubnetOne.
  10. For Security groups, select the group with the prefix MSK-Glue-Redshift-MSKSecurityGroup.
  11. Choose Next.
  12. Review the information and choose Finish.

Creating the Amazon Redshift connection

You’re now ready to create the Amazon Redshift connection.

  1. On the AWS Glue Data Catalog Connections page, choose Add connection.
  2. For Connection name, enter Redshift_Connection.
  3. For Connection type, choose Amazon Redshift.
  4. Choose Next.
  5. On the next page, choose the cluster you created as part of the CloudFormation stack.
  6. Enter the following information:
    1. Database name – streaming-data
    2. Username – awsuser
    3. Password – Str0ngPas$wd
  7. Choose Next.
  8. Review the details and choose Finish.

You can test the connection when creation is complete.

Crawling the database

You can now create an AWS Glue Data Catalog table for your Amazon Redshift table by crawling the database using the connection you just created.

  1. On the AWS Glue Data Catalog Crawlers page, choose Add crawler.
  2. For Crawler name, enter Redshift_Crawler.
  3. Choose Next.
  4. Choose Data stores.
  5. Choose Next.
  6. On the Add a data store page, for Choose a data store, choose JDBC.
  7. For Connection, choose Redshift_Connection.
  8. For Include path, enter streaming-data.
  9. Choose Next.

  10. On the page asking if you want to add additional data stores, choose No.
  11. Choose Next.
  12. On the Choose IAM role page, choose Glue_Service_role (you created this role as part of the CloudFormation stack).
  13. Choose Next.
  14. For Frequency, choose Run on demand.
  15. Choose Next.
  16. On the next page, select an AWS Glue database and choose Next.

If you don’t have a database, choose Add database and create one.

  17. Review the information and choose Finish.

When you’re prompted to run the crawler, confirm the prompt. The crawler may take a few minutes to finish, after which you can verify in the Data Catalog Tables section that you have a table called streaming_data_public_msk_tweets with the Amazon Redshift classification.
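If you want to drive the crawler from code, the following boto3 sketch starts it, waits until it returns to the READY state, and lists the tables in your Data Catalog database (the database name is a placeholder for the one you chose above).

import time
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Start the crawler created above and wait for it to finish
glue.start_crawler(Name="Redshift_Crawler")
while glue.get_crawler(Name="Redshift_Crawler")["Crawler"]["State"] != "READY":
    time.sleep(30)

# List the tables the crawler created; <YourGlueDatabase> is a placeholder
for table in glue.get_tables(DatabaseName="<YourGlueDatabase>")["TableList"]:
    print(table["Name"], table.get("Parameters", {}).get("classification"))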

Creating your table

You can now create a table for the Amazon MSK topic.

  1. On the Catalog page, choose Tables.
  2. Choose Add tables.
  3. Choose Add tables manually.
  4. For Table name, enter msk_covidtweets.
  5. Choose the database you created earlier.
  6. Choose Next.
  7. On the Add a data store page, for Select the type of source, select Kafka.
  8. For Topic name, enter CovidTweets.
  9. For Connection, enter MSK_Connection.
  10. Choose Next.

  11. On the next page, for Classification, choose JSON.
  12. Choose Next.
  13. On the Define schema page, choose Add column.
  14. Add the following columns, with Data type as string:
    1. id_str
    2. created_at
    3. source
    4. text
    5. location
    6. hashtags[0].text
    7. hashtags[1].text
    8. lang

  15. Choose Next.
  16. Review and choose Finish.

Authoring an AWS Glue streaming ETL job

In the following steps, you author a streaming ETL job.

  1. On the AWS Glue console, choose Jobs.
  2. Choose Add job.
  3. For Name, enter MSK-Glue-Redshift.
  4. For IAM role, choose Glue_Service_role.
  5. For Type, choose Spark Streaming.
  6. For This job runs, select A proposed script generated by AWS Glue.
  7. Leave other fields at their default.
  8. Choose Next.

  9. On the Choose a data source page, select msk_covidtweets.
  10. Choose Next.
  11. On the Choose a data target page, select streaming_data_public_msk_tweets.
  12. Choose Next.
  13. On the page for mapping source columns to target columns, verify that the columns are mapped correctly.
  14. Choose Save job and edit script.
  15. On the next page, verify that on the last line of the script, windowSize is set to 5 seconds.
  16. Choose Save.
  17. Choose Run job.

The AWS Glue streaming ETL job may take a few minutes to start running, after which the streaming from Amazon MSK starts.
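For reference, the generated script follows the general shape of the sketch below. This is a simplified, hand-written approximation rather than the exact code AWS Glue produces; the database name, temporary S3 paths, and column mappings are assumptions based on the tables created in this walkthrough.

import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue import DynamicFrame
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the Kafka-backed Data Catalog table as a streaming DataFrame
# (<YourGlueDatabase> is a placeholder for the database you created earlier)
streaming_df = glueContext.create_data_frame.from_catalog(
    database="<YourGlueDatabase>",
    table_name="msk_covidtweets",
    transformation_ctx="datasource0",
    additional_options={"startingOffsets": "latest", "inferSchema": "true"},
)

def process_batch(data_frame, batch_id):
    # Convert each micro-batch to a DynamicFrame, map the Tweet fields to the
    # Amazon Redshift column names, and write through the Redshift connection
    if data_frame.count() > 0:
        dynamic_frame = DynamicFrame.fromDF(data_frame, glueContext, "from_data_frame")
        mapped = ApplyMapping.apply(
            frame=dynamic_frame,
            mappings=[
                ("created_at", "string", "created_at", "string"),
                ("id_str", "string", "id_str", "string"),
                ("text", "string", "text", "string"),
                ("source", "string", "source", "string"),
                ("location", "string", "user_location", "string"),
                ("hashtags[0].text", "string", "hashtags1", "string"),
                ("hashtags[1].text", "string", "hashtags2", "string"),
                ("lang", "string", "lang", "string"),
            ],
            transformation_ctx="applymapping0",
        )
        glueContext.write_dynamic_frame.from_catalog(
            frame=mapped,
            database="<YourGlueDatabase>",
            table_name="streaming_data_public_msk_tweets",
            redshift_tmp_dir="s3://<YourTempBucket>/redshift-tmp/",  # placeholder
            transformation_ctx="datasink0",
        )

# Process micro-batches every 5 seconds (the windowSize mentioned above)
glueContext.forEachBatch(
    frame=streaming_df,
    batch_function=process_batch,
    options={
        "windowSize": "5 seconds",
        "checkpointLocation": "s3://<YourTempBucket>/checkpoint/",  # placeholder
    },
)
job.commit()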

While you’re waiting, you can start the NiFi Twitter flow to publish messages to Amazon MSK.

Starting the NiFi flow to stream Twitter data

To start your NiFi flow to stream data from Twitter and publish it to Amazon MSK, complete the following steps:

  1. Navigate back to the NiFi UI running on http://localhost:8888/nifi/.
  2. Choose the canvas (right-click) and choose Start.

After the NiFi flow starts, you can see the Twitter data flowing from the GetTwitter processor and being pushed to the MSK cluster by the PublishKafka processor. When a publish is successful, the data waits briefly in the success queue and expires after 60 seconds.

After the flow begins, data is published to Amazon MSK, and the AWS Glue streaming ETL job loads it into the Amazon Redshift table msk_tweets. You may notice data queuing up in the success connection of Publish_to_MSK, indicating that the data was successfully published to Amazon MSK.
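To spot-check that rows are arriving in Amazon Redshift, you can run a quick count with the Redshift Data API. As before, the cluster identifier is a placeholder, and the sketch assumes the Data API is available in your Region.

import time
import boto3

redshift_data = boto3.client("redshift-data", region_name="us-east-1")

# Run a row count against the target table
stmt = redshift_data.execute_statement(
    ClusterIdentifier="<YourRedshiftClusterIdentifier>",  # placeholder
    Database="streaming-data",
    DbUser="awsuser",
    Sql="select count(*) from msk_tweets",
)

# Poll until the statement finishes, then print the count
while redshift_data.describe_statement(Id=stmt["Id"])["Status"] not in ("FINISHED", "FAILED"):
    time.sleep(2)
result = redshift_data.get_statement_result(Id=stmt["Id"])
print(result["Records"][0][0]["longValue"])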

Visualizing Twitter data from Amazon Redshift using QuickSight

This section reviews the steps to visualize the data from the Twitter feed.

  1. Create a new analysis in QuickSight.
  2. Create a new dataset with Amazon Redshift as the source.
  3. Choose msk_tweets as the Amazon Redshift table.
  4. Choose the Custom SQL option.
  5. Enter the following query:
select cast(created_at as timestamp) as create_timestamp,
       extract(minute from cast(created_at as timestamp)) as extracted_minutes,
       id_str,
       text,
       split_part(split_part(source,'>',2),'<',1) as formatted_source,
       user_location,
       hashtags1,
       hashtags2,
       lang
from msk_tweets;
  6. Choose the Directly query your data option to query real-time data directly from the database.
  7. Choose Visualize.

You can choose from various visual types in QuickSight, such as stacked area line charts, pie charts, word clouds, and bar charts, to build the following dashboard.

For instructions on building a QuickSight dashboard, see Tutorial: Create a Dashboard. For more information about improving dashboard performance, see Speed up your ELT and BI queries with Amazon Redshift materialized views.

Cleaning up

To clean up your resources, delete the AWS Glue database, tables, crawler, job, and service role.

Additionally, be sure to clean up all other AWS resources that you created using AWS CloudFormation. You can delete these resources on the AWS CloudFormation console or via the AWS Command Line Interface (AWS CLI) by deleting the stack named Twitter-MSK-Glue-Redshift-Blog.
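The following boto3 sketch performs the same cleanup from code. The resource names match the ones used in this post; the Data Catalog database name is a placeholder for the one you created.

import boto3

region = "us-east-1"
glue = boto3.client("glue", region_name=region)
cfn = boto3.client("cloudformation", region_name=region)

# Remove the AWS Glue job, crawler, connections, and database created in this post
glue.delete_job(JobName="MSK-Glue-Redshift")
glue.delete_crawler(Name="Redshift_Crawler")
glue.delete_connection(ConnectionName="MSK_Connection")
glue.delete_connection(ConnectionName="Redshift_Connection")
glue.delete_database(Name="<YourGlueDatabase>")  # placeholder; also removes its tables

# Delete the CloudFormation stack and the resources it created
cfn.delete_stack(StackName="Twitter-MSK-Glue-Redshift-Blog")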

Conclusion

In this post, we demonstrated how to build a serverless and cost-effective streaming ETL pipeline that allows you to focus on the outcomes of your analytics. The CloudFormation template gives you an easy way to set up the process, which you can further modify to meet your specific use case needs. You can also modify your serverless AWS Glue ETL code with transformations and mappings to ensure that only valid data gets loaded into your data store. With this solution, you can use AWS Glue streaming as a mechanism to solve your streaming ETL use cases.

Please let us know if you have comments about this post!

About the Authors

Jobin George is a Sr. Partner Solutions Architect at AWS. He has more than a decade of experience with designing and implementing large scale Big Data and Analytics solutions. He provides technical guidance, design advice and thought leadership to some of the key AWS customers and Big Data partners.

Mahesh Goyal is a Data Architect in Big Data at AWS. He works with customers in their journey to the cloud with a focus on big data and data warehouses. In his spare time, Mahesh likes to listen to music and explore new food places with his family.

Dilip Rajan is a Partner Solutions Architect at AWS. His role is to help partners and customers design and build solutions at scale on AWS. Before AWS, he helped Amazon Fulfillment Operations migrate their Oracle Data Warehouse to Redshift while designing the next generation big data analytics platform using AWS technologies.