What is Apache Kafka?

Build real-time streaming data pipelines and applications that adapt to data streams

Apache Kafka is a distributed data store optimized for ingesting and processing streaming data in real time. Streaming data is data that is continuously generated by thousands of data sources, which typically send their data records simultaneously. A streaming platform needs to handle this constant influx of data and process it sequentially and incrementally.

Kafka provides three main functions to its users:

  • Publish and subscribe to streams of records
  • Effectively store streams of records in the order in which records were generated
  • Process streams of records in real time

Kafka is primarily used to build real-time streaming data pipelines and applications that adapt to the data streams. It combines messaging, storage, and stream processing to allow storage and analysis of both historical and real-time data.  

Why would you use Kafka?

Kafka is used to build real-time streaming data pipelines and real-time streaming applications. A data pipeline reliably processes and moves data from one system to another, and a streaming application is an application that consumes streams of data. For example, if you want to create a data pipeline that takes in user activity data to track how people use your website in real-time, Kafka would be used to ingest and store streaming data while serving reads for the applications powering the data pipeline. Kafka is also often used as a message broker solution, which is a platform that processes and mediates communication between two applications.

How does Kafka work?

Kafka combines two messaging models, queuing and publish-subscribe, to provide the key benefits of each. Queuing allows data processing to be distributed across many consumer instances, making it highly scalable. However, traditional queues aren’t multi-subscriber. The publish-subscribe approach is multi-subscriber, but because every message goes to every subscriber, it cannot be used to distribute work across multiple worker processes. Kafka uses a partitioned log model to stitch these two solutions together. A log is an ordered sequence of records, and each topic’s log is broken up into partitions that are divided among the subscribers reading together as a group. This means that there can be multiple subscribers to the same topic, with the partitions shared out among them to allow for higher scalability. Finally, Kafka’s model provides replayability, which allows multiple independent applications reading from the same data streams to work independently at their own rate.

[Diagram: Queuing]

[Diagram: Publish-Subscribe]
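
The partitioned log model described above can be illustrated with a small in-memory sketch. This is a toy model and not Kafka's implementation: the `Topic` class, the `assign_partitions` and `read_group` helpers, and the group and consumer names are all hypothetical, and the round-robin assignment is only one possible strategy. It shows that consumers inside one group split the partitions between them (queuing), while a second group independently reads every record (publish-subscribe).

```python
from collections import defaultdict


class Topic:
    """Toy in-memory topic: a fixed number of append-only partition logs."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, record):
        self.partitions[partition].append(record)


def assign_partitions(num_partitions, consumers):
    """Spread partitions round-robin over the consumers of one group."""
    assignment = defaultdict(list)
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return dict(assignment)


def read_group(topic, consumers):
    """Return what each consumer in one group reads, in per-partition order."""
    assignment = assign_partitions(len(topic.partitions), consumers)
    return {
        consumer: [rec for p in parts for rec in topic.partitions[p]]
        for consumer, parts in assignment.items()
    }


topic = Topic(num_partitions=2)
for i in range(6):
    topic.append(i % 2, f"event-{i}")

# Two consumers in one group share the work, one partition each.
billing = read_group(topic, ["billing-1", "billing-2"])
# A separate single-consumer group still sees every record.
audit = read_group(topic, ["audit-1"])

print(billing)
print(audit)
```

In this sketch `billing-1` reads only the even-numbered events and `billing-2` only the odd-numbered ones, while `audit-1` reads all six. Order is preserved within each partition, not across the topic as a whole.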

Benefits of Kafka's approach

Scalable

Kafka’s partitioned log model allows data to be distributed across multiple servers, making it scalable beyond what would fit on a single server. 
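
One common way to spread a log across servers is to choose a partition from a hash of each record's key. The sketch below assumes that approach; `partition_for` and `distribute` are hypothetical helpers, and `zlib.crc32` is a stand-in hash chosen for this example, not the hash Kafka's own partitioner uses.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a record key to a partition with a stable hash.

    zlib.crc32 is a stand-in chosen for this sketch; it is not the hash
    used by Kafka's own partitioner.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def distribute(records, num_partitions):
    """Append (key, value) records to per-partition logs."""
    logs = [[] for _ in range(num_partitions)]
    for key, value in records:
        logs[partition_for(key, num_partitions)].append((key, value))
    return logs


records = [
    ("user-1", "login"),
    ("user-2", "login"),
    ("user-1", "view"),
    ("user-3", "login"),
    ("user-1", "logout"),
]
logs = distribute(records, num_partitions=3)
for index, log in enumerate(logs):
    print(index, log)
```

Because the hash is stable, every record with the same key lands in the same partition, so per-key ordering survives even though the partitions could live on different servers.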

Fast

Kafka decouples the producers of a data stream from its consumers, so records can be delivered with very low latency, making it extremely fast.

Durable

Partitions are distributed and replicated across many servers, and the data is all written to disk. This helps protect against server failure, making the data very fault-tolerant and durable. 
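
The idea that replicated partitions survive a server failure can be sketched as follows. This is a deliberately simplified model: the `ReplicatedPartition` class and the broker names are hypothetical, every replica is updated synchronously inside `append`, and leader promotion simply picks the next surviving replica. Kafka's actual replication protocol is more involved than this.

```python
class ReplicatedPartition:
    """Toy partition with one leader and follower copies on other servers."""

    def __init__(self, servers):
        self.replicas = {server: [] for server in servers}
        self.leader = servers[0]

    def append(self, record):
        # Simplification: every live replica is updated before the write returns.
        for log in self.replicas.values():
            log.append(record)
        return len(self.replicas[self.leader]) - 1  # offset of the new record

    def fail(self, server):
        """Simulate losing a server; promote a surviving replica if needed."""
        del self.replicas[server]
        if not self.replicas:
            raise RuntimeError("all replicas lost")
        if server == self.leader:
            self.leader = next(iter(self.replicas))

    def read(self, offset):
        return self.replicas[self.leader][offset]


partition = ReplicatedPartition(["broker-a", "broker-b", "broker-c"])
partition.append("order-created")
partition.append("order-paid")

partition.fail("broker-a")  # the original leader goes away
print(partition.leader, partition.read(1))
```

After `broker-a` fails, the sketch promotes `broker-b` and the record at offset 1 is still readable, which is the property the paragraph above describes.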

Dive deep into Kafka's architecture

Kafka reconciles the two models by publishing records to topics. Each topic has a partitioned log, which is a structured commit log that keeps track of all records in order and appends new ones in real time. These partitions are distributed and replicated across multiple servers, allowing for high scalability, fault tolerance, and parallelism. Within a group of consumers, each partition in the topic is assigned to one consumer, which allows for multiple subscribers while maintaining the order of the data within each partition. By combining these messaging models, Kafka offers the benefits of both.

Kafka also acts as a very scalable and fault-tolerant storage system by writing and replicating all data to disk. Kafka keeps data on disk for a configurable retention period, seven days by default, whether or not the data has been consumed. The user can change this limit or bound retention by size as well.

Kafka has four APIs:

  • Producer API: used to publish a stream of records to a Kafka topic.
  • Consumer API: used to subscribe to topics and process their streams of records.
  • Streams API: enables applications to behave as stream processors, which take in an input stream from topic(s) and transform it to an output stream which goes into different output topic(s).
  • Connector API: allows users to seamlessly automate the addition of another application or data system to their current Kafka topics.
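
The roles of the producer, consumer, and stream processor listed above can be sketched with plain Python. None of this uses a real Kafka client: `produce`, `consume`, `stream_process`, and the topic names are hypothetical stand-ins that only mirror the shape of the data flow, and the Connector API is not modelled.

```python
from collections import defaultdict

topics = defaultdict(list)  # topic name -> append-only list of records


def produce(topic, record):
    """Producer role: publish one record to a topic."""
    topics[topic].append(record)


def consume(topic, offset=0):
    """Consumer role: read records from a topic starting at an offset."""
    return topics[topic][offset:]


def stream_process(source, sink, transform):
    """Stream-processor role: read an input topic, write transformed records."""
    for record in consume(source):
        produce(sink, transform(record))


for page in ["/home", "/pricing", "/home"]:
    produce("page-views", page)

stream_process("page-views", "page-views-upper", str.upper)
print(consume("page-views-upper"))
```

The input topic is left untouched by the transformation, so other consumers can still read the original records from any offset.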

Apache Kafka vs RabbitMQ

RabbitMQ is an open source message broker that uses a messaging queue approach. Queues are spread across a cluster of nodes and optionally replicated, with each message only being delivered to a single consumer.

Architecture

  • Apache Kafka: Kafka uses a partitioned log model, which combines messaging queue and publish-subscribe approaches.
  • RabbitMQ: RabbitMQ uses a messaging queue.

Scalability

  • Apache Kafka: Kafka provides scalability by allowing partitions to be distributed across different servers.
  • RabbitMQ: Increase the number of consumers on the queue to scale out processing across those competing consumers.

Message retention

  • Apache Kafka: Policy based; for example, messages may be stored for one day. The user can configure this retention window.
  • RabbitMQ: Acknowledgement based, meaning messages are deleted as they are consumed.

Multiple consumers

  • Apache Kafka: Multiple consumers can subscribe to the same topic, because Kafka allows the same message to be replayed for a given window of time.
  • RabbitMQ: Multiple consumers cannot all receive the same message, because messages are removed as they are consumed.

Replication

  • Apache Kafka: Topics are automatically replicated, but the user can manually configure topics to not be replicated.
  • RabbitMQ: Messages are not automatically replicated, but the user can manually configure them to be replicated.

Message ordering

  • Apache Kafka: Each consumer receives records in order within a partition because of the partitioned log architecture.
  • RabbitMQ: Messages are delivered to consumers in the order of their arrival to the queue. If there are competing consumers, each consumer will process a subset of the messages.

Protocols

  • Apache Kafka: Kafka uses a binary protocol over TCP.
  • RabbitMQ: Advanced Message Queuing Protocol (AMQP), with support via plugins for MQTT and STOMP.
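
The difference between policy-based and acknowledgement-based retention in the comparison above can be sketched with two toy classes. `RetainedLog` and `AckQueue` are hypothetical names for this illustration and do not correspond to the API of either Kafka or RabbitMQ. Timestamps are plain numbers of seconds supplied by the caller.

```python
from collections import deque


class RetainedLog:
    """Log-style retention: records stay until they age out of the window."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self.records = []  # list of (timestamp, value)

    def append(self, timestamp, value):
        self.records.append((timestamp, value))

    def expire(self, now):
        """Drop records older than the retention window."""
        self.records = [
            (t, v) for t, v in self.records if now - t < self.retention_seconds
        ]

    def read_all(self):
        return [v for _, v in self.records]


class AckQueue:
    """Queue-style retention: a message is removed once it is consumed."""

    def __init__(self):
        self.messages = deque()

    def publish(self, value):
        self.messages.append(value)

    def consume(self):
        return self.messages.popleft() if self.messages else None


log = RetainedLog(retention_seconds=60)
queue = AckQueue()
for t, value in [(0, "a"), (30, "b")]:
    log.append(t, value)
    queue.publish(value)

first_reader = log.read_all()
second_reader = log.read_all()  # reading does not remove anything
print(first_reader, second_reader)

print(queue.consume(), queue.consume(), queue.consume())

log.expire(now=70)  # "a" is now older than the 60-second window
print(log.read_all())
```

Both readers of the log see `a` and `b`, whereas the queue hands each message out once and then returns `None`. Only the retention policy, not consumption, removes `a` from the log.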

Learn more about Kafka on AWS

Read more on how to manually deploy Kafka on AWS here.

AWS also offers Amazon MSK, the most compatible, available, and secure fully managed service for Apache Kafka, enabling customers to populate data lakes, stream changes to and from databases, and power machine learning and analytics applications. With Amazon MSK, customers are able to spend less time managing infrastructure and more time building applications. Learn more about Amazon MSK.

Get started with Amazon MSK

Get set up for an Amazon MSK cluster

Sign up for AWS and download libraries and tools.

Review the getting-started guide

Learn how to set up your Apache Kafka cluster on Amazon MSK in this step-by-step guide.

Run your Apache Kafka cluster

Start running your Apache Kafka cluster on Amazon MSK. Log in to the Amazon MSK console.