Amazon Kinesis Data Streams is a massively scalable, highly durable data ingestion and processing service optimized for streaming data. You can configure hundreds of thousands of data producers to continuously put data into a Kinesis data stream. Data will be available within milliseconds to your Amazon Kinesis applications, and those applications will receive data records in the order they were generated.
Amazon Kinesis Data Streams is integrated with a number of AWS services, including Amazon Kinesis Data Firehose for near real-time transformation and delivery of streaming data into an AWS data lake like Amazon S3, Amazon Managed Service for Apache Flink for managed stream processing, AWS Lambda for event or record processing, AWS PrivateLink for private connectivity, Amazon Cloudwatch for metrics and log processing, and AWS KMS for server-side encryption.
In the following architectural diagram, Amazon Kinesis Data Streams is used as the gateway of a big data solution. Data from various sources is put into an Amazon Kinesis stream and then the data from the stream is consumed by different Amazon Kinesis applications. In this example, one application (in yellow) is running a real-time dashboard against the streaming data. Another application (in red) performs simple aggregation and emits processed data into Amazon S3. The data in S3 is further processed and stored in Amazon Redshift for complex analytics. The third application (in green) emits raw data into Amazon S3, which is then archived to Amazon Glacier for lower cost long-term storage. Notice all three of these data processing pipelines are happening simultaneously and in parallel.
A data producer is an application that typically emits data records as they are generated to a Kinesis data stream. Data producers assign partition keys to records. Partition keys ultimately determine which shard ingests the data record for a data stream.
A data consumer is a distributed Kinesis application or AWS service retrieving data from all shards in a stream as it is generated. Most data consumers are retrieving the most recent data in a shard, enabling real-time analytics or handling of data.
A data stream is a logical grouping of shards. There are no bounds on the number of shards within a data stream (request a limit increase if you need more). A data stream will retain data for 24 hours by default, or optionally up to 365 days.
A shard is the base throughput unit of an Amazon Kinesis data stream.
- A shard is an append-only log and a unit of streaming capability. A shard contains an ordered sequence of records ordered by arrival time.
- One shard can ingest up to 1000 data records per second, or 1MB/sec. Add more shards to increase your ingestion capability.
- Add or remove shards from your stream dynamically as your data throughput changes using the AWS console, UpdateShardCount API, trigger automatic scaling via AWS Lambda, or using an auto scaling utility.
- When consumers use enhanced fan-out, one shard provides 1MB/sec data input and 2MB/sec data output for each data consumer registered to use enhanced fan-out.
- When consumers do not use enhanced fan-out, a shard provides 1MB/sec of input and 2MB/sec of data output, and this output is shared with any consumer not using enhanced fan-out.
- You will specify the number of shards needed when you create a stream and can change the quantity at any time. For example, you can create a stream with two shards. If you have 5 data consumers using enhanced fan-out, this stream can provide up to 20 MB/sec of total data output (2 shards x 2MB/sec x 5 data consumers). When data consumer are not using enhanced fan-out this stream has a throughput of 2MB/sec data input and 4MB/sec data output. In all cases this stream allows up to 2000 PUT records per second, or 2MB/sec of ingress whichever limit is met first.
- You can monitor shard-level metrics in Amazon Kinesis Data Streams.
A record is the unit of data stored in an Amazon Kinesis stream. A record is composed of a sequence number, partition key, and data blob. A data blob is the data of interest your data producer adds to a stream. The maximum size of a data blob (the data payload after Base64-decoding) is 1 megabyte (MB).
A partition key is typically a meaningful identifier, such as a user ID or timestamp. It is specified by your data producer while putting data into an Amazon Kinesis data stream, and useful for consumers as they can use the partition key to replay or build a history associated with the partition key. The partition key is also used to segregate and route data records to different shards of a stream. For example, assuming you have an Amazon Kinesis data stream with two shards (Shard 1 and Shard 2). You can configure your data producer to use two partition keys (Key A and Key B) so that all data records with Key A are added to Shard 1 and all data records with Key B are added to Shard 2.
A sequence number is a unique identifier for each data record. Sequence number is assigned by Amazon Kinesis Data Streams when a data producer calls PutRecord or PutRecords API to add data to an Amazon Kinesis data stream. Sequence numbers for the same partition key generally increase over time; the longer the time period between PutRecord or PutRecords requests, the larger the sequence numbers become.
Use Kinesis Data Streams
After you sign up for Amazon Web Services, you can start using Amazon Kinesis Data Streams by:
- Creating an Amazon Kinesis data stream through either Amazon Kinesis Management Console or Amazon Kinesis CreateStream API.
- Configuring your data producers to continuously put data into your Amazon Kinesis data stream.
- Building your Amazon Kinesis applications to read and process data from your Amazon Kinesis data stream.
Put data into streams
Amazon Kinesis Data Generator
Put sample data into a Kinesis data stream or Kinesis data firehose using the Amazon Kinesis Data Generator.
Amazon Kinesis Data Streams API
Amazon Kinesis Data Streams provides two APIs for putting data into an Amazon Kinesis stream: PutRecord and PutRecords. PutRecord allows a single data record within an API call and PutRecords allows multiple data records within an API call.
Amazon Kinesis Producer Library (KPL)
Amazon Kinesis Producer Library (KPL) is an easy to use and highly configurable library that helps you put data into an Amazon Kinesis data stream. Amazon Kinesis Producer Library (KPL) presents a simple, asynchronous, and reliable interface that enables you to quickly achieve high producer throughput with minimal client resources.
Amazon Kinesis Agent
Amazon Kinesis Agent is a pre-built Java application that offers an easy way to collect and send data to your Amazon Kinesis stream. You can install the agent on Linux-based server environments such as web servers, log servers, and database servers. The agent monitors certain files and continuously sends data to your stream.
Run fully managed stream processing applications or build your own
Run fully managed stream processing applications using AWS services or build your own.
Amazon Kinesis Data Firehose
Amazon Kinesis Data Firehose is the easiest way to reliably transform and load streaming data into data stores and analytics tools. You can use a Kinesis data stream as a source for a Kinesis data firehose.
Amazon Managed Service for Apache Flink
With Amazon Managed Service for Apache Flink, you can easily query streaming data or build streaming applications using Apache Flink so that you can gain actionable insights and promptly respond to your business and customer needs. You can use a Kinesis data stream as a source and a destination for an Amazon Managed Service for Apache Flink application.
You can subscribe Lambda functions to automatically read records off your Kinesis data stream. AWS Lambda is typically used for record-by-record (also known as event-based) stream processing.
Amazon Kinesis Client Library (KCL)
Amazon Kinesis Client Library (KCL) is a pre-built library that helps you easily build Amazon Kinesis applications for reading and processing data from an Amazon Kinesis data stream. KCL handles complex issues such as adapting to changes in stream volume, load-balancing streaming data, coordinating distributed services, and processing data with fault-tolerance. KCL enables you to focus on business logic while building Amazon Kinesis applications. Starting with KCL 2.0, you can utilize a low latency HTTP/2 streaming API and enhanced fan-out to retrieve data from a stream.
Amazon Kinesis Connector Library
Amazon Kinesis Connector Library is a pre-built library that helps you easily integrate Amazon Kinesis with other AWS services and third-party tools. Amazon Kinesis Client Library (KCL) is required for using Amazon Kinesis Connector Library. The current version of this library provides connectors to Amazon DynamoDB, Amazon Redshift, Amazon S3, and Amazon Elasticsearch Service. The library also includes sample connectors of each type, plus Apache Ant build files for running the samples.
Amazon Kinesis Storm Spout
Amazon Kinesis Storm Spout is a pre-built library that helps you easily integrate Amazon Kinesis Data Streams with Apache Storm. The current version of Amazon Kinesis Storm Spout fetches data from a Kinesis data stream and emits it as tuples. You will add the spout to your Storm topology to leverage Amazon Kinesis Data Streams as a reliable, scalable, stream capture, storage, and replay service.
Accessing Kinesis Data Streams APIs privately from Amazon VPC
You can privately access Kinesis Data Streams APIs from your Amazon Virtual Private Cloud (VPC) by creating VPC Endpoints. With VPC Endpoints, the routing between the VPC and Kinesis Data Streams is handled by the AWS network without the need for an Internet gateway, NAT gateway, or VPN connection. The latest generation of VPC Endpoints used by Kinesis Data Streams are powered by AWS PrivateLink, a technology that enables private connectivity between AWS services using Elastic Network Interfaces (ENI) with private IPs in your VPCs. For more information about PrivatLink, see the AWS PrivateLink documentation.
Fan-out Kinesis Data Streams data without sacrificing performance
Enhanced fan-out provides allows customers to scale the number of consumers reading from a stream in parallel while maintaining performance. You can use enhanced fan-out and an HTTP/2 data retrieval API to fan-out data to multiple applications, typically within 70 milliseconds of arrival.
Encrypting your Kinesis Data Streams data
You can encrypt the data you put into Kinesis Data Streams using Server-side encryption or client-side encryption. Server-side encryption is a fully managed feature that automatically encrypts and decrypts data as you put and get it from a data stream. Alternatively, you can encrypt your data on the client-side before putting it into your data stream. To learn more, see the Security section of the Kinesis Data Streams FAQs.
Amazon Kinesis Data Firehose and Amazon Managed Service for Apache Flink integration
Use a data stream as a source for a Kinesis Data Firehose to transform your data on the fly while delivering it to S3, Redshift, Elasticsearch, and Splunk. Attach a Amazon Managed Service for Apache Flink application to process streaming data in real time with standard SQL without having to learn new programming languages or processing frameworks.
Amazon CloudWatch integration
Amazon Kinesis Data Streams integrates with Amazon CloudWatch so that you can easily collect, view, and analyze CloudWatch metrics for your Amazon Kinesis data streams and the shards within those data streams. For more information about Amazon Kinesis Data Streams metrics, see Monitoring Amazon Kinesis with Amazon CloudWatch.
AWS IAM integration
Amazon Kinesis Data Streams integrates with AWS Identity and Access Management (IAM), a service that enables you to securely control access to your AWS services and resources for your users. For example, you can create a policy that only allows a specific user or group to put data into your Amazon Kinesis data stream. For more information about access management and control of your Amazon Kinesis data stream, see Controlling Access to Amazon Kinesis Resources using IAM.
AWS CloudTrail integration
Amazon Kinesis Data Streams integrates with AWS CloudTrail, a service that records AWS API calls for your account and delivers log files to you. For more information about API call logging and a list of supported Amazon Kinesis API, see Logging Amazon Kinesis API calls Using AWS CloudTrail.
You can tag your Amazon Kinesis data streams for easier resource and cost management. A tag is a user-defined label expressed as a key-value pair that helps organize AWS resources. For example, you can tag your Amazon Kinesis data streams by cost centers so that you can categorize and track your Amazon Kinesis Data Streams costs based on cost centers. For more information about, see Tagging Your Amazon Kinesis Data Streams.
Analyze stock data with Amazon Kinesis Data Streams
This tutorial walks through the steps of creating an Amazon Kinesis data stream, sending simulated stock trading data in to the stream, and writing an application to process the data from the data stream.
Analyzing streaming data in real time with Amazon Kinesis (ABD301)
Amazon Kinesis makes it easy to collect process and analyze real-time streaming data so you can get timely insights and react quickly to new information. In this session we present an end-to-end streaming data solution using Kinesis Streams for data ingestion Kinesis Analytics for real-time processing and Kinesis Firehose for persistence. We review in detail how to write SQL queries using streaming data and discuss best practices to optimize and monitor your Kinesis Analytics applications. Lastly we discuss how to estimate the cost of the entire system.
Workshop: Building your first big data application on AWS (ABD317)
Want to ramp up your knowledge of AWS big data web services and launch your first big data application on the cloud? We walk you through simplifying big data processing as a data bus comprising ingest, store, process, and visualize. You build a big data application using AWS managed services, including Amazon Athena, Amazon Kinesis, Amazon DynamoDB, and Amazon S3. Along the way, we review architecture design patterns for big data applications and give you access to a take-home lab so that you can rebuild and customize the application yourself. You should bring your own laptop and have some familiarity with AWS services to get the most from this session.
Workshop: Don’t wait until tomorrow; How to use streaming data to gain real-time insights into your business (ABD321)
In recent years, there has been an explosive growth in the number of connected devices and real-time data sources. Because of this, data is being produced continuously and its production rate is accelerating. Businesses can no longer wait for hours or days to use this data. To gain the most valuable insights, they must use this data immediately so they can react quickly to new information. In this workshop, you learn how to take advantage of streaming data sources to analyze and react in near real-time. You are presented with several requirements for a real-world streaming data scenario and you're tasked with creating a solution that successfully satisfies the requirements using services such as Amazon Kinesis, AWS Lambda and Amazon SNS.
How Amazon Flex uses real-time analytics to deliver packages on time (ABD217)
Reducing the time to get actionable insights from data is important to all businesses and customers who employ batch data analytics tools are exploring the benefits of streaming analytics. Learn best practices to extend your architecture from data warehouses and databases to real-time solutions. Learn how to use Amazon Kinesis to get real-time data insights and integrate them with Amazon Aurora Amazon RDS Amazon Redshift and Amazon S3. The Amazon Flex team describes how they used streaming analytics in their Amazon Flex mobile app used by Amazon delivery drivers to deliver millions of packages each month on time. They discuss the architecture that enabled the move from a batch processing system to a real-time system overcoming the challenges of migrating existing batch data to streaming data and how to benefit from real-time analytics.
Real-time streaming applications on AWS: Use cases and patterns (ABD203)
To win in the marketplace and provide differentiated customer experiences, businesses need to be able to use live data in real time to facilitate fast decision making. In this session, you learn common streaming data processing use cases and architectures. First, we give an overview of streaming data and AWS streaming data capabilities. Next, we look at a few customer examples and their real-time streaming applications. Finally, we walk through common architectures and design patterns of top streaming data use cases.
The AWS Streaming Data Solution for Amazon Kinesis provides AWS CloudFormation templates where data flows through producers, streaming storage, consumers, and destinations. To support multiple use cases and business needs, this solution offers four AWS CloudFormation templates. The templates are configured to apply best practices to monitor functionality using dashboards and alarms, and to secure data.