Getting started with Amazon Kinesis Data Streams

Get started

Amazon Kinesis Data Streams is a massively scalable, highly durable data ingestion and processing service optimized for streaming data. You can configure hundreds of thousands of data producers to continuously put data into a Kinesis data stream. Data will be available within milliseconds to your Amazon Kinesis applications, and those applications will receive data records in the order they were generated.

Amazon Kinesis Data Streams is integrated with a number of AWS services, including Amazon Kinesis Data Firehose for near real-time transformation and delivery of streaming data into an AWS data lake like Amazon S3, Amazon Managed Service for Apache Flink for managed stream processing, AWS Lambda for event or record processing, AWS PrivateLink for private connectivity, Amazon CloudWatch for metrics and log processing, and AWS KMS for server-side encryption.

Amazon Kinesis Data Streams is often used as the gateway of a big data solution. Data from various sources is put into an Amazon Kinesis stream, and the data from the stream is then consumed by different Amazon Kinesis applications. For example, one application might run a real-time dashboard against the streaming data. Another might perform simple aggregation and emit processed data into Amazon S3, where it is further processed and stored in Amazon Redshift for complex analytics. A third might emit raw data into Amazon S3, which is then archived to Amazon S3 Glacier for lower-cost long-term storage. All three of these data processing pipelines can run simultaneously and in parallel.

Get started with Amazon Kinesis Data Streams

See What's New with Amazon Kinesis Data Streams

Request support for your proof-of-concept or evaluation

Videos

Use Kinesis Data Streams

After you sign up for Amazon Web Services, you can start using Amazon Kinesis Data Streams by:

  • Creating an Amazon Kinesis data stream through either the Amazon Kinesis Management Console or the CreateStream API.
  • Configuring your data producers to continuously put data into your Amazon Kinesis data stream.
  • Building your Amazon Kinesis applications to read and process data from your Amazon Kinesis data stream.
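The three steps above can be sketched with boto3, the AWS SDK for Python. The stream name `example-stream` and the record contents are illustrative; running the `__main__` section requires boto3 and configured AWS credentials.

```python
import json

def create_stream(client, name, shard_count=2):
    """Step 1: create the data stream and wait until it is ACTIVE."""
    client.create_stream(StreamName=name, ShardCount=shard_count)
    client.get_waiter("stream_exists").wait(StreamName=name)

def put_event(client, name, event, partition_key):
    """Step 2: a data producer puts one JSON record into the stream."""
    return client.put_record(
        StreamName=name,
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=partition_key,
    )

def read_from_start(client, name, shard_id):
    """Step 3: a consumer reads records from the beginning of one shard."""
    it = client.get_shard_iterator(
        StreamName=name, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
    )["ShardIterator"]
    return client.get_records(ShardIterator=it)["Records"]

if __name__ == "__main__":
    import boto3  # assumption: boto3 installed, AWS credentials configured
    kinesis = boto3.client("kinesis")
    create_stream(kinesis, "example-stream")
    put_event(kinesis, "example-stream", {"event": "click"}, "user-42")
```

The helpers take the client as a parameter so the same code works against the real service or a test double.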

Key concepts

A data producer is an application that typically emits data records to a Kinesis data stream as they are generated. Data producers assign partition keys to records, and the partition key ultimately determines which shard of the stream ingests each data record.

A data consumer is a distributed Kinesis application or AWS service that retrieves data from all shards in a stream as it is generated. Most data consumers retrieve the most recent data in a shard, enabling real-time analytics or handling of data.

A data stream is a logical grouping of shards. There is no hard upper bound on the number of shards within a data stream, though default account quotas apply (request a quota increase if you need more). A data stream retains data for 24 hours by default, or optionally for up to 365 days.

A shard is the base throughput unit of an Amazon Kinesis data stream.

  • A shard is an append-only log and a unit of streaming capability. A shard contains a sequence of records ordered by arrival time.
  • One shard can ingest up to 1,000 data records per second, or 1 MB/sec, whichever limit is reached first. Add more shards to increase your ingestion capacity.
  • Add or remove shards from your stream dynamically as your data throughput changes, using the AWS Management Console, the UpdateShardCount API, automatic scaling triggered via AWS Lambda, or an auto scaling utility.
  • When consumers use enhanced fan-out, a shard provides 1 MB/sec of data input and 2 MB/sec of data output for each data consumer registered to use enhanced fan-out.
  • When consumers do not use enhanced fan-out, a shard provides 1 MB/sec of input and 2 MB/sec of data output, and that output is shared among all consumers not using enhanced fan-out.
  • You specify the number of shards when you create a stream and can change the quantity at any time. For example, suppose you create a stream with two shards. If five data consumers use enhanced fan-out, this stream can provide up to 20 MB/sec of total data output (2 shards x 2 MB/sec x 5 consumers). When data consumers are not using enhanced fan-out, this stream has a throughput of 2 MB/sec of data input and 4 MB/sec of data output. In both cases this stream allows up to 2,000 PUT records per second, or 2 MB/sec of ingress, whichever limit is reached first.
  • You can monitor shard-level metrics in Amazon Kinesis Data Streams.
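The shard arithmetic above can be captured in a small helper. This is a plain illustration of the published per-shard figures, not an AWS API:

```python
def stream_capacity(shards, efo_consumers=0):
    """Aggregate stream limits implied by the per-shard figures above."""
    ingest_mb = shards * 1.0          # 1 MB/sec of ingress per shard
    ingest_records = shards * 1000    # 1,000 records/sec per shard
    if efo_consumers:
        # 2 MB/sec per shard, dedicated to EACH enhanced fan-out consumer
        egress_mb = shards * 2.0 * efo_consumers
    else:
        # 2 MB/sec per shard, shared by all classic consumers
        egress_mb = shards * 2.0
    return ingest_mb, ingest_records, egress_mb

# The two-shard example from the text:
print(stream_capacity(2, efo_consumers=5))  # (2.0, 2000, 20.0)
print(stream_capacity(2))                   # (2.0, 2000, 4.0)
```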

A record is the unit of data stored in an Amazon Kinesis stream. A record is composed of a sequence number, partition key, and data blob. A data blob is the data of interest your data producer adds to a stream. The maximum size of a data blob (the data payload after Base64-decoding) is 1 megabyte (MB).

A partition key is typically a meaningful identifier, such as a user ID or timestamp. It is specified by your data producer while putting data into an Amazon Kinesis data stream, and is useful for consumers because they can use the partition key to replay or build a history associated with it. The partition key is also used to segregate and route data records to different shards of a stream. For example, assume you have an Amazon Kinesis data stream with two shards (Shard 1 and Shard 2). You can configure your data producer to use two partition keys (Key A and Key B) so that all data records with Key A are added to Shard 1 and all data records with Key B are added to Shard 2.
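Under the hood, Kinesis maps each partition key to a 128-bit value via MD5 and routes the record to the shard owning that portion of the hash space. The sketch below assumes shards split the hash space evenly; real shards carry explicit `HashKeyRange` values that you can inspect with DescribeStream.

```python
import hashlib

def hash_key(partition_key: str) -> int:
    # Kinesis maps a partition key to a 128-bit hash via MD5.
    return int.from_bytes(hashlib.md5(partition_key.encode("utf-8")).digest(), "big")

def shard_for_key(partition_key: str, shard_count: int) -> int:
    # Illustrative only: assume shards evenly split the 2**128 hash space.
    width = (2 ** 128) // shard_count
    return min(hash_key(partition_key) // width, shard_count - 1)
```

Because the mapping is deterministic, all records with the same partition key land on the same shard, which is what makes per-key ordering and replay possible.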

A sequence number is a unique identifier for each data record. It is assigned by Amazon Kinesis Data Streams when a data producer calls the PutRecord or PutRecords API to add data to an Amazon Kinesis data stream. Sequence numbers for the same partition key generally increase over time; the longer the time period between PutRecord or PutRecords requests, the larger the sequence numbers become.

Put data into streams

Data producers can put data into Amazon Kinesis data streams using the Amazon Kinesis Data Streams APIs, Amazon Kinesis Producer Library (KPL), or Amazon Kinesis Agent.

Put sample data into a Kinesis data stream or Kinesis data firehose using the Amazon Kinesis Data Generator.

Amazon Kinesis Data Streams provides two APIs for putting data into an Amazon Kinesis stream: PutRecord and PutRecords. PutRecord allows a single data record within an API call and PutRecords allows multiple data records within an API call.
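A PutRecords request accepts up to 500 records (5 MB total). The helper below builds the `Records` list for one call; the `user_id` field used as partition key is illustrative.

```python
import json

def build_batch(events, key_field="user_id"):
    """Build the Records list for a single PutRecords call
    (up to 500 records and 5 MB per request)."""
    return [
        {"Data": json.dumps(e).encode("utf-8"), "PartitionKey": str(e[key_field])}
        for e in events
    ]

# With a boto3 Kinesis client:
# response = client.put_records(StreamName="example-stream",
#                               Records=build_batch(events))
# Always check response["FailedRecordCount"]: individual records can fail
# even when the call as a whole succeeds, and should be retried.
```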

Amazon Kinesis Producer Library (KPL) is an easy-to-use, highly configurable library that helps you put data into an Amazon Kinesis data stream. It presents a simple, asynchronous, and reliable interface that enables you to quickly achieve high producer throughput with minimal client resources.

Amazon Kinesis Agent is a pre-built Java application that offers an easy way to collect and send data to your Amazon Kinesis stream. You can install the agent on Linux-based server environments such as web servers, log servers, and database servers. The agent monitors certain files and continuously sends data to your stream.

Run applications or build your own

Run fully managed stream processing applications using AWS services or build your own.

Amazon Kinesis Data Firehose is the easiest way to reliably transform and load streaming data into data stores and analytics tools. You can use a Kinesis data stream as a source for a Kinesis data firehose.

With Amazon Managed Service for Apache Flink, you can easily query streaming data or build streaming applications using Apache Flink so that you can gain actionable insights and promptly respond to your business and customer needs. You can use a Kinesis data stream as a source and a destination for an Amazon Managed Service for Apache Flink application.

You can subscribe Lambda functions to automatically read records off your Kinesis data stream. AWS Lambda is typically used for record-by-record (also known as event-based) stream processing.
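A Lambda consumer receives a batch of Kinesis records per invocation, with each payload base64-encoded under `record["kinesis"]["data"]`. A minimal handler sketch (the JSON payload shape is illustrative):

```python
import base64
import json

def handler(event, context):
    """Record-by-record (event-based) processing of a Kinesis batch."""
    keys = []
    for record in event["Records"]:
        # Kinesis delivers each payload base64-encoded.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        keys.append(record["kinesis"]["partitionKey"])
        # ... process `payload` here, one record at a time ...
    return {"batchSize": len(event["Records"])}
```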

Amazon Kinesis Client Library (KCL) is a pre-built library that helps you easily build Amazon Kinesis applications for reading and processing data from an Amazon Kinesis data stream. KCL handles complex issues such as adapting to changes in stream volume, load-balancing streaming data, coordinating distributed services, and processing data with fault-tolerance. KCL enables you to focus on business logic while building Amazon Kinesis applications. Starting with KCL 2.0, you can utilize a low latency HTTP/2 streaming API and enhanced fan-out to retrieve data from a stream.

Amazon Kinesis Connector Library is a pre-built library that helps you easily integrate Amazon Kinesis with other AWS services and third-party tools. Amazon Kinesis Client Library (KCL) is required for using Amazon Kinesis Connector Library. The current version of this library provides connectors to Amazon DynamoDB, Amazon Redshift, Amazon S3, and Amazon Elasticsearch Service. The library also includes sample connectors of each type, plus Apache Ant build files for running the samples.

Amazon Kinesis Storm Spout is a pre-built library that helps you easily integrate Amazon Kinesis Data Streams with Apache Storm. The current version of Amazon Kinesis Storm Spout fetches data from a Kinesis data stream and emits it as tuples. You add the spout to your Storm topology to leverage Amazon Kinesis Data Streams as a reliable, scalable, stream capture, storage, and replay service.

Manage streams

You can privately access Kinesis Data Streams APIs from your Amazon Virtual Private Cloud (VPC) by creating VPC Endpoints. With VPC Endpoints, the routing between the VPC and Kinesis Data Streams is handled by the AWS network without the need for an Internet gateway, NAT gateway, or VPN connection. The latest generation of VPC Endpoints used by Kinesis Data Streams are powered by AWS PrivateLink, a technology that enables private connectivity between AWS services using Elastic Network Interfaces (ENI) with private IPs in your VPCs. For more information about PrivateLink, see the AWS PrivateLink documentation.

Enhanced fan-out allows customers to scale the number of consumers reading from a stream in parallel while maintaining performance. You can use enhanced fan-out and an HTTP/2 data retrieval API to fan out data to multiple applications, typically within 70 milliseconds of arrival.
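An enhanced fan-out consumer is first registered against the stream, then reads via the HTTP/2 SubscribeToShard API. A hedged boto3 sketch (ARNs and names are illustrative):

```python
def register_efo_consumer(client, stream_arn, name):
    """Register a dedicated-throughput (enhanced fan-out) consumer."""
    resp = client.register_stream_consumer(
        StreamARN=stream_arn, ConsumerName=name
    )
    return resp["Consumer"]["ConsumerARN"]

# Reading then uses the HTTP/2 SubscribeToShard API, e.g. with boto3:
# stream = client.subscribe_to_shard(
#     ConsumerARN=consumer_arn,
#     ShardId="shardId-000000000000",
#     StartingPosition={"Type": "LATEST"},
# )["EventStream"]
# for event in stream:
#     records = event["SubscribeToShardEvent"]["Records"]
```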

You can encrypt the data you put into Kinesis Data Streams using Server-side encryption or client-side encryption. Server-side encryption is a fully managed feature that automatically encrypts and decrypts data as you put and get it from a data stream. Alternatively, you can encrypt your data on the client-side before putting it into your data stream. To learn more, see the Security section of the Kinesis Data Streams FAQs.
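Server-side encryption can be enabled on an existing stream with a single call. In this sketch the default key is the AWS-managed KMS key for Kinesis; you can pass your own customer managed key instead:

```python
def enable_sse(client, stream_name, key_id="alias/aws/kinesis"):
    """Enable server-side encryption on a stream with an AWS KMS key."""
    client.start_stream_encryption(
        StreamName=stream_name,
        EncryptionType="KMS",
        KeyId=key_id,
    )
```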

Use a data stream as a source for a Kinesis Data Firehose to transform your data on the fly while delivering it to S3, Redshift, Elasticsearch, and Splunk. Attach an Amazon Managed Service for Apache Flink application to process streaming data in real time with standard SQL, without having to learn new programming languages or processing frameworks.

Amazon Kinesis Data Streams integrates with Amazon CloudWatch so that you can easily collect, view, and analyze CloudWatch metrics for your Amazon Kinesis data streams and the shards within those data streams. For more information about Amazon Kinesis Data Streams metrics, see Monitoring Amazon Kinesis with Amazon CloudWatch.

Amazon Kinesis Data Streams integrates with AWS Identity and Access Management (IAM), a service that enables you to securely control access to your AWS services and resources for your users. For example, you can create a policy that only allows a specific user or group to put data into your Amazon Kinesis data stream. For more information about access management and control of your Amazon Kinesis data stream, see Controlling Access to Amazon Kinesis Resources using IAM.

Amazon Kinesis Data Streams integrates with AWS CloudTrail, a service that records AWS API calls for your account and delivers log files to you. For more information about API call logging and a list of supported Amazon Kinesis API, see Logging Amazon Kinesis API calls Using AWS CloudTrail.

You can tag your Amazon Kinesis data streams for easier resource and cost management. A tag is a user-defined label expressed as a key-value pair that helps organize AWS resources. For example, you can tag your Amazon Kinesis data streams by cost centers so that you can categorize and track your Amazon Kinesis Data Streams costs based on cost centers. For more information, see Tagging Your Amazon Kinesis Data Streams.
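The cost-center example can be expressed with one boto3 call; the tag key `CostCenter` is illustrative:

```python
def tag_by_cost_center(client, stream_name, cost_center):
    """Tag a stream so its Kinesis costs can be grouped by cost center."""
    client.add_tags_to_stream(
        StreamName=stream_name,
        Tags={"CostCenter": cost_center},
    )
```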

Tutorials

This tutorial walks through the steps of creating an Amazon Kinesis data stream, sending simulated stock trading data into the stream, and writing an application to process the data from the data stream.

Featured presentations