AWS Big Data Blog
Build an end-to-end serverless streaming pipeline with Apache Kafka on Amazon MSK using Python
The volume of data generated globally continues to surge, from gaming, retail, and finance, to manufacturing, healthcare, and travel. Organizations are looking for more ways to quickly use the constant inflow of data to innovate for their businesses and customers. They have to reliably capture, process, analyze, and load the data into a myriad of data stores, all in real time.
Apache Kafka is a popular choice for these real-time streaming needs. However, it can be challenging to set up a Kafka cluster along with other data processing components that scale automatically depending on your application’s needs. You risk under-provisioning for peak traffic, which can lead to downtime, or over-provisioning for base load, leading to wastage. AWS offers multiple serverless services like Amazon Managed Streaming for Apache Kafka (Amazon MSK), Amazon Data Firehose, Amazon DynamoDB, and AWS Lambda that scale automatically depending on your needs.
In this post, we explain how you can use some of these services, including MSK Serverless, to build a serverless data platform to meet your real-time needs.
Solution overview
Let’s imagine a scenario. You’re responsible for managing thousands of modems for an internet service provider deployed across multiple geographies. You want to monitor the modem connectivity quality that has a significant impact on customer productivity and satisfaction. Your deployment includes different modems that need to be monitored and maintained to ensure minimal downtime. Each device transmits thousands of 1 KB records every second, such as CPU usage, memory usage, alarm, and connection status. You want real-time access to this data so you can monitor performance in real time, and detect and mitigate issues quickly. You also need longer-term access to this data for machine learning (ML) models to run predictive maintenance assessments, find optimization opportunities, and forecast demand.
Your clients that gather the data onsite are written in Python, and they can send all the data as Apache Kafka topics to Amazon MSK. For your application’s low-latency and real-time data access, you can use Lambda and DynamoDB. For longer-term data storage, you can use managed serverless connector service Amazon Data Firehose to send data to your data lake.
The following diagram shows how you can build this end-to-end serverless application.
Let’s follow the steps in the following sections to implement this architecture.
Create a serverless Kafka cluster on Amazon MSK
We use Amazon MSK to ingest real-time telemetry data from modems. Creating a serverless Kafka cluster is straightforward on Amazon MSK. It only takes a few minutes using the AWS Management Console or AWS SDK. To use the console, refer to Getting started using MSK Serverless clusters. You create a serverless cluster, AWS Identity and Access Management (IAM) role, and client machine.
Create a Kafka topic using Python
When your cluster and client machine are ready, SSH to your client machine and install Kafka Python and the MSK IAM library for Python.
- Run the following commands to install Kafka Python and the MSK IAM library:
- Create a new file called
createTopic.py
. - Copy the following code into this file, replacing the
bootstrap_servers
andregion
information with the details for your cluster. For instructions on retrieving thebootstrap_servers
information for your MSK cluster, see Getting the bootstrap brokers for an Amazon MSK cluster.
- Run the
createTopic.py
script to create a new Kafka topic calledmytopic
on your serverless cluster:
Produce records using Python
Let’s generate some sample modem telemetry data.
- Create a new file called
kafkaDataGen.py
. - Copy the following code into this file, updating the
BROKERS
andregion
information with the details for your cluster:
- Run the
kafkaDataGen.py
to continuously generate random data and publish it to the specified Kafka topic:
Store events in Amazon S3
Now you store all the raw event data in an Amazon Simple Storage Service (Amazon S3) data lake for analytics. You can use the same data to train ML models. The integration with Amazon Data Firehose allows Amazon MSK to seamlessly load data from your Apache Kafka clusters into an S3 data lake. Complete the following steps to continuously stream data from Kafka to Amazon S3, eliminating the need to build or manage your own connector applications:
- On the Amazon S3 console, create a new bucket. You can also use an existing bucket.
- Create a new folder in your S3 bucket called
streamingDataLake
. - On the Amazon MSK console, choose your MSK Serverless cluster.
- On the Actions menu, choose Edit cluster policy.
- Select Include Firehose service principal and choose Save changes.
- On the S3 delivery tab, choose Create delivery stream.
- For Source, choose Amazon MSK.
- For Destination, choose Amazon S3.
- For Amazon MSK cluster connectivity, select Private bootstrap brokers.
- For Topic, enter a topic name (for this post,
mytopic
).
- For S3 bucket, choose Browse and choose your S3 bucket.
- Enter
streamingDataLake
as your S3 bucket prefix. - Enter
streamingDataLakeErr
as your S3 bucket error output prefix.
- Choose Create delivery stream.
You can verify that the data was written to your S3 bucket. You should see that the streamingDataLake
directory was created and the files are stored in partitions.
Store events in DynamoDB
For the last step, you store the most recent modem data in DynamoDB. This allows the client application to access the modem status and interact with the modem remotely from anywhere, with low latency and high availability. Lambda seamlessly works with Amazon MSK. Lambda internally polls for new messages from the event source and then synchronously invokes the target Lambda function. Lambda reads the messages in batches and provides these to your function as an event payload.
Lets first create a table in DynamoDB. Refer to DynamoDB API permissions: Actions, resources, and conditions reference to verify that your client machine has the necessary permissions.
- Create a new file called
createTable.py
. - Copy the following code into the file, updating the
region
information:
- Run the
createTable.py
script to create a table calleddevice_status
in DynamoDB:
Now let’s configure the Lambda function.
- On the Lambda console, choose Functions in the navigation pane.
- Choose Create function.
- Select Author from scratch.
- For Function name¸ enter a name (for example,
my-notification-kafka
). - For Runtime, choose Python 3.11.
- For Permissions, select Use an existing role and choose a role with permissions to read from your cluster.
- Create the function.
On the Lambda function configuration page, you can now configure sources, destinations, and your application code.
- Choose Add trigger.
- For Trigger configuration, enter
MSK
to configure Amazon MSK as a trigger for the Lambda source function. - For MSK cluster, enter
myCluster
. - Deselect Activate trigger, because you haven’t configured your Lambda function yet.
- For Batch size, enter
100
. - For Starting position, choose Latest.
- For Topic name¸ enter a name (for example,
mytopic
). - Choose Add.
- On the Lambda function details page, on the Code tab, enter the following code:
- Deploy the Lambda function.
- On the Configuration tab, choose Edit to edit the trigger.
- Select the trigger, then choose Save.
- On the DynamoDB console, choose Explore items in the navigation pane.
- Select the table
device_status
.
You will see Lambda is writing events generated in the Kafka topic to DynamoDB.
Summary
Streaming data pipelines are critical for building real-time applications. However, setting up and managing the infrastructure can be daunting. In this post, we walked through how to build a serverless streaming pipeline on AWS using Amazon MSK, Lambda, DynamoDB, Amazon Data Firehose, and other services. The key benefits are no servers to manage, automatic scalability of the infrastructure, and a pay-as-you-go model using fully managed services.
Ready to build your own real-time pipeline? Get started today with a free AWS account. With the power of serverless, you can focus on your application logic while AWS handles the undifferentiated heavy lifting. Let’s build something awesome on AWS!
About the Authors
Masudur Rahaman Sayem is a Streaming Data Architect at AWS. He works with AWS customers globally to design and build data streaming architectures to solve real-world business problems. He specializes in optimizing solutions that use streaming data services and NoSQL. Sayem is very passionate about distributed computing.
Michael Oguike is a Product Manager for Amazon MSK. He is passionate about using data to uncover insights that drive action. He enjoys helping customers from a wide range of industries improve their businesses using data streaming. Michael also loves learning about behavioral science and psychology from books and podcasts.