AWS Cloud
Get started with Streaming Data

Apache Kafka is an open-source, distributed messaging system that enables you to build real-time applications using streaming data. You can send streaming data such as website clickstreams, financial transactions, and application logs to your Kafka cluster, which buffers the data and serves it to stream processing applications built on frameworks such as Apache Spark Streaming, Apache Storm, and Apache Samza.
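To give a feel for how Kafka spreads incoming records across a topic, here is a simplified sketch of keyed partition assignment. Kafka's real default partitioner hashes keys with murmur2; this sketch substitutes an MD5-based hash purely for illustration, and the key names are hypothetical.

```python
import hashlib

def assign_partition(key: str, num_partitions: int) -> int:
    """Map a record key to a partition deterministically.

    Illustrative only: Kafka's default partitioner uses murmur2,
    not MD5, but the principle is the same.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Records with the same key always land on the same partition,
# which is how Kafka preserves per-key ordering.
p1 = assign_partition("user-42", 6)
p2 = assign_partition("user-42", 6)
```

Because the mapping is deterministic, all events for a given key (for example, one user's clickstream) stay in order within a single partition.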


Running your Kafka deployment on Amazon EC2 provides a high-performance, scalable solution for ingesting streaming data. To deploy Kafka on Amazon EC2, you need to select and provision your EC2 instance types; install and configure the software components, including Kafka and Apache ZooKeeper; and then provision the block storage required to accommodate your streaming data throughput using Amazon Elastic Block Store (EBS). To help your Kafka cluster handle unexpected events, such as spikes in data volume above the stream's capacity, you can configure replication across brokers, coordinated by Apache ZooKeeper, which keeps track of the nodes in your Kafka cluster and the distribution of processes across them. Once Kafka is installed, you still need to deploy TLS/SSL, maintain certificate authorities, and configure the Kafka brokers for SSL to keep your cluster secure.
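The SSL step above is done in each broker's `server.properties`. A minimal sketch of the relevant settings follows; the hostname, keystore paths, and passwords are placeholders you would replace with your own.

```properties
# Accept client connections over SSL (placeholder hostname and port)
listeners=SSL://kafka-broker-1.example.com:9093
security.inter.broker.protocol=SSL

# Keystore holding this broker's certificate (placeholder paths/passwords)
ssl.keystore.location=/var/private/ssl/kafka.server.keystore.jks
ssl.keystore.password=changeit
ssl.key.password=changeit

# Truststore holding the CA certificates you maintain
ssl.truststore.location=/var/private/ssl/kafka.server.truststore.jks
ssl.truststore.password=changeit
```

Every broker in the cluster needs a signed certificate, and clients must be configured with a truststore that trusts your certificate authority, which is part of the ongoing maintenance burden described above.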

Running Kafka clusters on Amazon EC2 provides a reliable and scalable infrastructure platform. However, it requires you to monitor, scale, and manage a fleet of servers, maintain the software stack, and manage the security of the cluster, which can be a significant administrative burden. Amazon Kinesis Streams solves this problem by providing a managed service purpose-built to make it easy to work with streaming data on AWS. It captures and stores streaming data reliably, and makes the data available in real time to stream processing applications. It takes only a few clicks in the Amazon Kinesis Console to provision a managed streaming data ingestion system with Amazon Kinesis Streams. Amazon Kinesis Streams automatically replicates your data across three Availability Zones, providing durability for your data. You can easily scale, secure, and manage your streams using the API and built-in integrations with other AWS services, including AWS IAM, Amazon CloudWatch, and AWS CloudTrail.

You can process the data in your streams with processing applications built on Amazon Kinesis Analytics or other processing frameworks, including Spark Streaming and the Kinesis Client Library (KCL). You can use the processed data to power real-time dashboards, generate alerts, implement dynamic pricing, deliver highly targeted advertising, and more.
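As a concrete illustration, the per-batch work inside a KCL record processor often amounts to decoding payloads and aggregating them for a dashboard. The sketch below assumes a hypothetical JSON clickstream schema with a "page" field; it shows the aggregation step only, not the KCL wiring.

```python
import json
from collections import Counter

def count_page_views(raw_records: list[bytes]) -> Counter:
    """Aggregate page views from a batch of JSON-encoded events.

    The {"page": ...} schema is an assumption for illustration;
    a real KCL processor would receive these bytes in its
    record-processing callback.
    """
    views = Counter()
    for data in raw_records:
        event = json.loads(data)
        views[event["page"]] += 1
    return views

batch = [b'{"page": "/home"}', b'{"page": "/pricing"}', b'{"page": "/home"}']
# count_page_views(batch) -> Counter({'/home': 2, '/pricing': 1})
```

Aggregates like this can be pushed to a dashboard or compared against thresholds to generate alerts.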



This post demonstrates how to set up Apache Kafka on EC2, use Spark Streaming on EMR to process data coming into Apache Kafka topics, and query streaming data using Spark SQL on EMR.

Read the entire post »

In this post, we use Twitter public streams to analyze the performance of both Republican and Democratic candidates in near real time. We show you how to integrate Amazon Kinesis Firehose, AWS Lambda (a Python function), and Amazon Elasticsearch Service to create an end-to-end, near-real-time discovery platform.

Read the entire post »
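Conceptually, the Lambda transformation in a pipeline like that one tags each tweet before it is indexed into Elasticsearch. A minimal sketch follows; the keyword lists and function name are illustrative assumptions, not the post's actual code.

```python
# Illustrative keyword sets; a real classifier would be richer.
REPUBLICAN_TERMS = {"gop", "republican"}
DEMOCRAT_TERMS = {"dnc", "democrat"}

def tag_party(tweet_text: str) -> str:
    """Tag a tweet as republican/democrat/unknown by keyword match.

    A deliberately naive sketch of the enrichment step that would
    run inside the Lambda function before indexing.
    """
    words = set(tweet_text.lower().split())
    if words & REPUBLICAN_TERMS:
        return "republican"
    if words & DEMOCRAT_TERMS:
        return "democrat"
    return "unknown"
```

Once tagged, the documents can be aggregated in Elasticsearch (for example, counts per party per minute) to drive a near-real-time dashboard.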

This blog post walks you through a simple and effective way to persist data to Amazon S3 from Amazon Kinesis Streams using AWS Lambda and Amazon Kinesis Firehose.

Read the entire post »
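In that pattern, the Lambda function receives Kinesis record payloads base64-encoded in its event. The sketch below shows just the decode step; a real function would then forward the decoded payloads to Firehose (for example, with boto3's `put_record_batch`), which is omitted here so the sketch stays self-contained. The helper name is ours.

```python
import base64

def decode_kinesis_event(event: dict) -> list[bytes]:
    """Extract raw payloads from a Lambda/Kinesis event.

    Kinesis delivers record data base64-encoded under
    event["Records"][i]["kinesis"]["data"]. A real handler would
    batch these into a Firehose put_record_batch call.
    """
    return [
        base64.b64decode(record["kinesis"]["data"])
        for record in event["Records"]
    ]

sample_event = {
    "Records": [
        {"kinesis": {"data": base64.b64encode(b'{"clicks": 1}').decode()}}
    ]
}
# decode_kinesis_event(sample_event) -> [b'{"clicks": 1}']
```

Firehose then handles buffering, batching, and delivery to S3 without further code.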

To read more blog posts on streaming data and big data, visit the AWS big data blog »

It's easy to get started with Amazon Kinesis. Just sign in to the AWS Management Console, and launch Amazon Kinesis.


Get Started with Amazon Kinesis