Q: What is Amazon Kinesis?
Amazon Kinesis is a fully managed service for real-time processing of streaming data at massive scale. Amazon Kinesis can collect and store, in real time, hundreds of terabytes of data per hour from hundreds of thousands of sources, such as website clickstreams, operational logs, and digital marketing data, enabling you to easily write applications that process that information in real time.
Amazon Kinesis also comes with a pre-built client library (Kinesis Client Library) that simplifies load balancing, coordination and fault tolerance, enabling you to focus on building core business application logic that you can deploy on high-performance Amazon Elastic Compute Cloud (Amazon EC2) processing instances. Additionally, you can use AWS Auto Scaling to create elastic processing clusters.
Q: What does Amazon Kinesis manage on my behalf?
Amazon Kinesis automatically manages the infrastructure, storage, networking, and configuration needed to collect and process your data at the level of throughput your streaming applications need. You do not have to worry about provisioning, deployment, or ongoing maintenance of hardware, software, or other services to enable real-time capture and storage of large-scale data. Amazon Kinesis also synchronously replicates data across three facilities in an AWS Region, providing high availability and data durability. Additionally, you can increase or decrease the capacity of the stream at any time according to your business or operational needs, without any interruption to ongoing stream processing. Amazon Kinesis also provides the Kinesis Client Library, which lets you focus on creating business logic for the processing applications by simplifying reading data from the data stream and enabling distributed, fault-tolerant, at-least-once processing.
Q: What is the Amazon Kinesis Client Library?
Amazon Kinesis provides you with client libraries to build and operate real-time streaming data processing applications. The Amazon Kinesis Client Library enables you to focus on business logic, letting the client library automatically handle complex issues like adapting to changes in stream volume, load-balancing streaming data, coordinating distributed services, and processing data with fault-tolerance.
Q: What are some example use cases for Amazon Kinesis?
Amazon Kinesis is useful wherever there is a need to move data rapidly off producers or data sources as it is produced and then process it continuously, whether to transform it before emitting it into another data store, drive real-time metrics and analytics, or derive more complex streams of data for downstream processing. The following are typical scenarios for using Amazon Kinesis.
- Accelerated log and data feed intake and processing: With Amazon Kinesis you can have producers push data directly into an Amazon Kinesis stream. For example, system and application logs can be submitted to Amazon Kinesis and be available for processing in seconds. This prevents the log data from being lost if the front end or application server fails. Amazon Kinesis provides accelerated data feed intake because you are not batching up the data on the servers before you submit them for intake.
- Real-time metrics and reporting: You can use data ingested into Amazon Kinesis for extracting metrics, and generating KPIs to power reports, and dashboards at real time speeds. For example, metrics and reporting for system and application logs ingested into the Amazon Kinesis stream are available in real time. This enables data-processing application logic to work on data as it is streaming in continuously, rather than wait for data batches to be sent to the data-processing applications.
- Real-time data analytics: Amazon Kinesis enables real-time data analytics on streaming data, such as analyzing website clickstream data and customer engagement in real time.
- Complex stream processing: Amazon Kinesis enables you to create Directed Acyclic Graphs (DAGs) of Kinesis applications and data streams. In this scenario one or more Kinesis applications can put data into another Amazon Kinesis stream for processing by other different Kinesis applications. This enables successive stages of stream processing, with specific filtering
Q: Is Amazon Kinesis only for high-scale applications?
No. Amazon Kinesis offers seamless scaling so you can start small and scale up and down in line with your requirements. If you need fast performance at any scale then Amazon Kinesis may be the right choice for you.
Q: How do I get started with Amazon Kinesis?
Once you are signed up for Amazon Kinesis, you can begin interacting with Amazon Kinesis using either the AWS Management Console, or Amazon Kinesis APIs. To get started using the Amazon Kinesis service:
- Create an Amazon Kinesis stream with an appropriate number of shards.
- Configure the data sources / producers to continually push data into the Amazon Kinesis stream.
- Build a Kinesis application (optionally using the Kinesis Client Library) to consume data from the Amazon Kinesis stream and connect the application to the Amazon Kinesis stream.
- Operate the Kinesis application. The Amazon Kinesis Client Library will help with scale-out and fault-tolerant processing.
If you are using the AWS Management Console, you can create a stream and begin exploring with just a few clicks. The Amazon Kinesis Developer Guide describes these steps in more detail.
Q: How does Amazon Kinesis differ from Amazon SQS? Which one should I use?
Amazon Kinesis is a service for real-time processing of streaming big data. You can push data from many data producers into Amazon Kinesis rapidly and continuously as it is generated, and the service offers a reliable, highly scalable way to capture and store the data. This large-scale data is delivered in real time to multiple processing applications that you can build with the help of the Amazon Kinesis Client Library. Furthermore, all data is stored in the stream for 24 hours, so you can replay old data back into your applications should you need to for any reason. In this manner Amazon Kinesis enables you to perform managed real-time streaming data processing.
Amazon Simple Queue Service (Amazon SQS) offers a reliable, highly scalable hosted queue for storing messages as they travel between computers. By using Amazon SQS, you can simply move data between distributed application components performing different tasks, without losing messages or requiring each component to be always available. Amazon SQS makes it easy to build an automated workflow, working in conjunction with the Amazon Elastic Compute Cloud (Amazon EC2) and the other AWS infrastructure web services.
If your use cases are similar to those mentioned in the ‘General’ section above, then Amazon Kinesis might be a good fit.
Q: What is the provisioned throughput model for Amazon Kinesis?
Amazon Kinesis enables you to specify, and change at any time the effective throughput capacity of your stream. When creating a stream, simply specify how much throughput capacity you require in terms of shards. With each shard in Amazon Kinesis, you can capture up to 1 megabyte per second of data at 1,000 transactions per second. Your Amazon Kinesis applications can read data from each shard at up to 2 megabytes per second. If your throughput requirements change, simply update your stream's capacity by changing the number of shards you need using the Amazon Kinesis APIs. Your stream keeps flowing while scaling is underway, with new capacity available within seconds.
Q: How do I size an Amazon Kinesis stream?
Data is stored in an Amazon Kinesis stream, and a stream is composed of multiple shards. You must determine the size of the stream you need before you create the Amazon Kinesis stream. To determine the size of an Amazon Kinesis stream, you must determine how many shards you will need for the stream. You can dynamically resize your Amazon Kinesis stream or add and remove shards after the stream has been created and while it has a Kinesis application running that is consuming data from the stream.
To determine the initial size of an Amazon Kinesis stream, you need an estimate of the following input values.
- The average size of the data records written to the stream, in kilobytes (KB), rounded up to the nearest 1 KB (average_data_size_in_KB).
- The number of data records written to and read from the stream per second, that is, transactions per second (number_of_transactions_per_second).
- The number of Kinesis applications that consume data concurrently and independently from the Amazon Kinesis stream, that is, the consumers (number_of_consumers).
- The incoming write bandwidth in KB (incoming_write_bandwidth_in_KB), which is equal to the average_data_size_in_KB multiplied by the number_of_transactions_per_second.
- The outgoing read bandwidth in KB (outgoing_read_bandwidth_in_KB), which is equal to the incoming_write_bandwidth_in_KB multiplied by the number_of_consumers.
You can calculate the initial number of shards (number_of_shards) that your Amazon Kinesis stream will need by using the input values you determine in the following formula:
number_of_shards = max (incoming_write_bandwidth_in_KB/1000, outgoing_read_bandwidth_in_KB/2000)
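As a rough illustration, the formula above can be written as a small Python function. This is a minimal sketch rather than an official sizing tool: the function name is hypothetical, and rounding the result up to a whole shard (with a floor of one shard) is an assumption, since the formula itself can yield a fraction.

```python
import math


def estimate_shards(average_data_size_in_kb,
                    number_of_transactions_per_second,
                    number_of_consumers):
    """Estimate the initial number of shards using the sizing formula above."""
    # The record size input is rounded up to the nearest 1 KB.
    record_kb = math.ceil(average_data_size_in_kb)
    incoming_write_bandwidth_in_kb = record_kb * number_of_transactions_per_second
    outgoing_read_bandwidth_in_kb = incoming_write_bandwidth_in_kb * number_of_consumers
    shards = max(incoming_write_bandwidth_in_kb / 1000,
                 outgoing_read_bandwidth_in_kb / 2000)
    # Assumption: round up to a whole shard and never return fewer than one.
    return max(1, math.ceil(shards))
```

For example, `estimate_shards(5, 1000, 2)` describes 5 KB records written 1,000 times per second and read by two applications.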
Q: How do I put data into a Kinesis stream?
Producers submit data records to the Amazon Kinesis stream. After you create the Amazon Kinesis stream, configure the producers to put data into the stream.
To put data into the stream, call the PutRecord operation for the Amazon Kinesis service on your Amazon Kinesis stream. Each of the PutRecord calls will need the name of the Amazon Kinesis stream, a partition key, and the data blob to be added to the Amazon Kinesis stream. The partition key is used to determine which shard in the stream the data record will be added to. For more information about PutRecord, see the Amazon Kinesis API Reference.
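A producer call might look like the following sketch, which assumes the AWS SDK for Python (boto3) and its `put_record` method. The stream name, event fields, partition key, and the `put_event` helper are hypothetical and only illustrate the three inputs described above.

```python
import json


def put_event(client, stream_name, event, partition_key):
    """Send one event to the stream with a single PutRecord call."""
    # The data blob is opaque to the service, so any byte encoding works;
    # JSON is used here purely as an example.
    data = json.dumps(event).encode("utf-8")
    return client.put_record(
        StreamName=stream_name,      # name of the Amazon Kinesis stream
        Data=data,                   # the data blob
        PartitionKey=partition_key,  # determines the target shard
    )


# Usage (requires boto3 and AWS credentials, so it is shown as a comment):
#   import boto3
#   client = boto3.client("kinesis", region_name="us-east-1")
#   put_event(client, "example-stream", {"page": "/home"}, "customer-42")
```

Passing the client in as a parameter keeps the helper easy to exercise with a stand-in object before pointing it at a real stream.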
Q: What is the minimum throughput I can request for a single Kinesis stream?
The smallest throughput that can be requested for a Kinesis stream is a single shard. With each shard in Amazon Kinesis, you can capture, and store up to 1 megabyte per second of data at 1,000 transactions per second. Each shard can also deliver up to 2 megabytes per second of data for applications reading from it.
Q: What is the maximum throughput I can request for a single Kinesis stream?
Amazon Kinesis is designed to scale without limits. By contacting Amazon Web Services through the service limit request form online, you can request more shards for your subscriber account which can then be allocated to your streams.
Q: Will I always be able to achieve my level of provisioned throughput stream capacity?
For the core Amazon Kinesis service, barring externalities, and keeping in line with the definition of shards, you should always be able to achieve the level of provisioned stream capacity. Putting data into Amazon Kinesis requires you to provide a partition key as part of the PutRecord call. The partition key determines which shard in the stream the records get mapped into by performing an MD5 hash on the provided key. To use your Amazon Kinesis stream capacity most efficiently, pick partition keys that are as evenly distributed as possible. On the other hand, a streaming application developed with the help of the Amazon Kinesis Client Library may want to exploit the fact that records for the same partition key get placed in the same shard. If your use case doesn't require this, then we recommend using random partition keys. If your use case does require that similar records go to the same shard, you should pick a definition of partition key that reasonably reflects the business logic, such as customer ID, name of application, or type of log file, and use the partition key accordingly.
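The effect of the MD5 hash can be sketched as follows. This is an illustration only: it assumes each shard owns an equal-width slice of the 128-bit hash space, which may not hold for a stream that has been resharded unevenly, and the function names are hypothetical.

```python
import hashlib

HASH_SPACE = 2 ** 128  # an MD5 digest is 128 bits wide


def hash_key(partition_key):
    """Return the MD5 hash of a partition key as an unsigned 128-bit integer."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def shard_for_key(partition_key, number_of_shards):
    """Map a key to a shard index, assuming equal-width hash ranges per shard."""
    return hash_key(partition_key) * number_of_shards // HASH_SPACE
```

The same key always maps to the same index, which is the property an application can exploit, while many distinct or random keys spread across all shards.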
Q: How do I change the provisioned throughput for an existing Kinesis stream?
The Amazon Kinesis service supports resharding, which enables you to adjust the number of shards in your stream in order to adapt to changes in the rate of data flow through the stream. There are two types of resharding operations: shard split and shard merge. In a shard split, you divide a single shard into two shards. In a shard merge, you combine two shards into a single shard. Splitting increases the number of shards in your stream and therefore increases the data capacity of the stream.
Resharding is always "pairwise" in the sense that you cannot split into more than two shards in a single operation, and you cannot merge more than two shards in a single operation. The shard or pair of shards that the resharding operation acts on are referred to as parent shards. The shard or pair of shards that result from the resharding operation are referred to as child shards. The SplitShard API action splits a shard into two new shards in the stream, to increase the stream's capacity to ingest and transport data. The MergeShards API action merges two adjacent shards in a stream and combines them into a single shard to reduce the stream's capacity to ingest and transport data.
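One way to picture pairwise resharding is to model each shard as an inclusive range of hash keys. That representation, and the choice to split at the midpoint, are assumptions made for this sketch; the helper names are hypothetical and no service call is made.

```python
def split_range(start, end):
    """Split one parent range into two child ranges of near-equal width."""
    if end <= start:
        raise ValueError("range must contain at least two hash keys")
    new_start = (start + end) // 2 + 1  # first hash key of the second child
    return (start, new_start - 1), (new_start, end)


def merge_ranges(left, right):
    """Merge two adjacent parent ranges into a single child range."""
    if left[1] + 1 != right[0]:
        raise ValueError("only adjacent ranges can be merged")
    return (left[0], right[1])
```

Splitting the full 128-bit range and then merging the two children returns the original range, mirroring how a split followed by a merge leaves the stream's capacity unchanged.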
Q: How am I charged for the shards in a stream?
Amazon Kinesis has simple pay as you go pricing, which is easy to estimate. There are no up-front costs, no minimum fees and you’ll only pay for the resources you consume. Amazon Kinesis lets you specify the throughput requirements of your data stream using shards of throughput capacity. Because you are charged on a per-shard basis, splitting increases the cost of your stream. Similarly, merging reduces the number of shards in your stream and therefore decreases the data capacity—and cost—of the stream. For a complete list of pricing dimensions, with examples, please see the Amazon Kinesis Pricing Page.
Q: How often can I change my provisioned throughput?
For a given stream, each change operation, such as splitting or merging a shard, takes a few seconds, and you can only have 1 change occurring at a time. Thus, for a stream with just 1 shard, it takes a few seconds to double its capacity by splitting 1 shard. For a stream with 1K shards, the 1K splits run one after another, so at roughly 30 seconds per split it takes about 30K seconds (8.3 hours) to double its capacity.
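The arithmetic behind that estimate is simple enough to sketch. The 30-second figure per split is an assumption inferred from the 1K-shard example, not a documented guarantee, and the function name is hypothetical.

```python
def seconds_to_double(number_of_shards, seconds_per_split=30):
    """Doubling capacity needs one split per existing shard, run one at a time."""
    return number_of_shards * seconds_per_split


def seconds_to_hours(seconds):
    """Convert seconds to hours, rounded to one decimal place."""
    return round(seconds / 3600, 1)
```

Because operations cannot overlap, the time grows linearly with the number of shards, which is why large streams are best resized well ahead of need.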
Q: Does the item size affect the throughput rate?
Yes. With each shard in Amazon Kinesis, you can capture, and store up to 1 megabyte per second of data at 1,000 transactions per second. Therefore, if each individual data record put is less than 1 KB in size, then the per shard, 1000 write transactions per second limit, will not fully utilize the 1 megabyte per second throughput capacity of the shard.
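The interaction between the two per-shard write limits can be sketched as below. The sketch treats 1 megabyte as 1,000 KB, which is an assumption chosen for round numbers, and the function names are hypothetical.

```python
SHARD_WRITE_KB_PER_SEC = 1000  # 1 MB/sec per shard (assuming 1 MB = 1,000 KB)
SHARD_WRITES_PER_SEC = 1000    # 1,000 write transactions per second per shard


def max_write_kb_per_sec(record_size_kb):
    """Highest data rate one shard can ingest for a given record size."""
    return min(SHARD_WRITE_KB_PER_SEC, SHARD_WRITES_PER_SEC * record_size_kb)


def max_records_per_sec(record_size_kb):
    """Highest record rate one shard can ingest for a given record size."""
    return min(SHARD_WRITES_PER_SEC, SHARD_WRITE_KB_PER_SEC / record_size_kb)
```

Records smaller than 1 KB are limited by the transaction rate, while records larger than 1 KB are limited by the data rate.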
Q: What happens if my application performs more writes than my provisioned stream capacity?
For writes, PutRecord() operations by a client above the provisioned stream capacity will be rejected with a ProvisionedThroughputExceeded exception. If this is due to a temporary rise in the stream’s incoming data rate, retry by the client will lead to the requests eventually completing. If this is due to a sustained rise in the stream’s incoming data rate, you should increase the number of shards in your stream to provide enough capacity for the client to consistently succeed. In both cases, CloudWatch metrics allow you to learn about the change in the stream’s incoming data rate and the occurrence of ProvisionedThroughputExceeded exceptions.
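A client-side retry for temporary spikes could look like this sketch. Exponential backoff is one common choice rather than something the service prescribes, and `ThroughputExceeded` is a hypothetical stand-in for whatever exception type your SDK raises when provisioned throughput is exceeded.

```python
import time


class ThroughputExceeded(Exception):
    """Hypothetical stand-in for the SDK's throughput-exceeded error."""


def call_with_retry(operation, retry_on, max_attempts=5, base_delay=0.1,
                    sleep=time.sleep):
    """Call operation(), retrying with exponential backoff on retry_on errors."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # still failing: likely a sustained rise, add shards
            sleep(base_delay * 2 ** attempt)  # wait 0.1s, 0.2s, 0.4s, ...
```

If the final attempt still fails, the error is re-raised so the caller can treat it as a sign that the stream needs more shards.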
Q: What happens if my application performs more reads than my provisioned stream capacity?
For reads, GetRecords() operations by a client above the provisioned stream capacity will be rejected with a ProvisionedThroughputExceeded exception. If this is due to a temporary rise in the stream’s outgoing data rate, then retries will lead to the client eventually catching up to the data most recently written to the stream. If this is due to a sustained rise in the stream’s outgoing data rate, you should increase the number of shards in your stream to provide enough capacity for the client to consistently succeed. In both cases, CloudWatch metrics allow you to learn about the change in the stream’s outgoing data rate and the occurrence of ProvisionedThroughputExceeded exceptions.
Q: How do I know if I am exceeding my provisioned throughput capacity?
Amazon Kinesis displays key operational metrics for your streams in the AWS Management Console. The service also integrates with Amazon CloudWatch so you can see your throughput, utilization, and latency for each Amazon Kinesis stream, and easily track your resource consumption.
Q: How long does it take to change the number of shards in the stream?
In general, increasing or decreasing overall stream throughput by changing the number of shards in the stream takes only a few seconds. We recommend that you do not schedule increases in throughput to occur at almost the same time as the extra throughput is needed.
Q: What is the data model for Amazon Kinesis?
The data model for Amazon Kinesis is as follows:
Data Blob: The data of interest that is put into the Kinesis stream as part of the PutRecord operation is a blob that is both opaque and immutable to the Amazon Kinesis service, which does not inspect, interpret, or change the data in the blob in any way. The maximum size of an individual data blob as part of a single PutRecord operation is 50 kilobytes (KB).
Shard: A uniquely identified group of data records in an Amazon Kinesis stream. A single shard can deliver 1 megabyte per second of ingest capacity at 1,000 writes per second, and up to 2 megabytes per second on egress.
Stream: A stream captures and transports data records that are continuously emitted from different data sources or producers. Scale-out within an Amazon Kinesis stream is explicitly supported by means of shards,
Q: Is there a limit on the size of an individual data record?
The total size of a single data record cannot exceed 50KB.
Q: What are the Amazon Kinesis APIs?
- CreateStream: Adds a new Amazon Kinesis stream to your AWS account. A stream captures and transports data records that are continuously emitted from different data sources or producers.
- DeleteStream: Deletes a stream and all of its shards and data.
- DescribeStream: Returns the useful information about the stream, enabling other actions. It returns information such as: the current status of the stream, the stream Amazon Resource Name (ARN), and an array of shard objects that comprise the stream.
- GetRecords: Returns one or more data records from a shard. A GetRecords operation request can retrieve up to 10 MB of data.
- GetShardIterator: Returns a shard iterator that specifies the position in the shard from which you want to start reading data records sequentially.
- ListStreams: Returns an array of the names of all the streams that are associated with the AWS account making the request.
- MergeShards: Merges two adjacent shards in a stream and combines them into a single shard to reduce the stream's capacity to ingest and transport data.
- PutRecord: Puts a data record into an Amazon Kinesis stream from a producer. This operation must be called to send data from the producer into the Amazon Kinesis stream for real-time ingestion and subsequent processing.
- SplitShard: Splits a shard into two new shards in the stream, to increase the stream's capacity to ingest and transport data.
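To show how GetShardIterator and GetRecords fit together, here is a minimal read loop. It assumes the AWS SDK for Python (boto3) method names `get_shard_iterator` and `get_records`; the `read_shard` helper, the batch cap, and the record limit are hypothetical choices for the sketch.

```python
def read_shard(client, stream_name, shard_id, max_batches=10, limit=100):
    """Read records sequentially from one shard, starting at the oldest."""
    iterator = client.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",  # oldest record still in the shard
    )["ShardIterator"]
    records = []
    for _ in range(max_batches):
        if iterator is None:  # no further iterator: the shard has been closed
            break
        response = client.get_records(ShardIterator=iterator, Limit=limit)
        records.extend(response["Records"])
        # Each response carries the iterator to use for the next call.
        iterator = response.get("NextShardIterator")
    return records
```

A long-running consumer would keep polling instead of stopping after a fixed number of batches, which is part of the work the Amazon Kinesis Client Library takes on for you.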
Q: Is there a limit to how much data I can capture, store, and transport in Amazon Kinesis?
No. You can send any amount of data into an Amazon Kinesis stream. As the size of your overall data grows, Amazon Kinesis will respond to your request to increase its effective throughput capacity, and automatically spread your data over sufficient machine resources to meet your streaming data requirements.
Q: Does Amazon Kinesis remain available when I ask it to scale up or down by changing the number of shards?
Yes. Amazon Kinesis is designed to scale its throughput up or down while still remaining available. If your throughput requirements change, simply update your stream's capacity by changing the number of shards you need using the Amazon Kinesis APIs, while continuing to write and read from the stream.
Q: How highly available is Amazon Kinesis?
The service runs across Amazon’s proven, high-availability data centers. The service replicates data across three facilities in an AWS Region to provide fault tolerance in the event of a server failure or Availability Zone outage.
Q: How does Amazon Kinesis achieve high uptime and durability?
To achieve high uptime and durability, Amazon Kinesis synchronously replicates data across three facilities within an AWS Region.
Q: What is the Amazon Kinesis Client Library?
Amazon Kinesis provides you with client libraries to build and operate real-time streaming data processing applications. The Amazon Kinesis client library enables you to focus on business logic, letting the client library automatically handle complex issues like adapting to changes in stream volume, load-balancing streaming data, coordinating distributed services, and processing data with fault-tolerance.
The Amazon Kinesis Client Library acts as an intermediary between your business application which contains the specific logic needed to process Amazon Kinesis stream data, and the Amazon Kinesis service itself. The Amazon Kinesis Client Library uses the IRecordProcessor interface to communicate with your application. Your application implements this interface, and the Amazon Kinesis Client Library calls into your application code using the methods in this interface. For example, the Amazon Kinesis Client Library uses the Amazon Kinesis service API to get data from an Amazon Kinesis stream and then passes this data to the record processors for your application using the processRecords() method of IRecordProcessor.
Q: How does my Amazon Kinesis application work with the Amazon Kinesis Client Library?
The Amazon Kinesis Client Library acts as an intermediary between your business application which contains the specific logic needed to process Amazon Kinesis stream data, and the Amazon Kinesis service itself. The Amazon Kinesis Client Library uses the IRecordProcessor interface to communicate with your application. Your application implements this interface, and the Amazon Kinesis Client Library calls into your application code using the methods in this interface.
Each Kinesis application has a unique name and operates on one specific stream. At startup, the application calls into the Amazon Kinesis Client Library to instantiate a worker. This call provides the Amazon Kinesis Client Library with configuration information for the application, such as the stream name and AWS credentials. This call also passes a reference to an IRecordProcessorFactory implementation. The Amazon Kinesis Client Library uses this factory to create new record processor instances as needed to process data from the stream. The Amazon Kinesis Client Library communicates with these instances using the IRecordProcessor interface.
Each record processor instance processes exactly one shard from the stream. Since a stream comprises multiple shards, the worker instantiates multiple record processors to process the stream as a whole. To scale out your capacity to handle a stream that has a large data volume, you could create multiple instances of your application. These could run on a single computer or on multiple computers. We recommend that you run your application instances across a set of Amazon EC2 instances that are part of an Auto Scaling group. This enables you to automatically instantiate additional instances if the processing demands of the stream increase.
More information on building applications can be found in the Amazon Kinesis Developer Guide.
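The relationship between the worker, the factory, and the record processors can be sketched in a few lines. Note that this is a Python analogue written only to illustrate the pattern: it is not the Amazon Kinesis Client Library API, and every class and method name here is hypothetical.

```python
class RecordProcessor:
    """Application logic for one shard (plays the role of IRecordProcessor)."""

    def __init__(self, shard_id):
        self.shard_id = shard_id
        self.seen = []

    def process_records(self, records):
        # Business logic goes here; this sketch just remembers the records.
        self.seen.extend(records)


class Worker:
    """Creates one processor per shard through a factory and routes records."""

    def __init__(self, processor_factory):
        self._factory = processor_factory
        self._processors = {}

    def dispatch(self, shard_id, records):
        # Create a processor the first time a shard is seen, then reuse it.
        if shard_id not in self._processors:
            self._processors[shard_id] = self._factory(shard_id)
        processor = self._processors[shard_id]
        processor.process_records(records)
        return processor
```

Because each processor owns exactly one shard, the application code never has to coordinate between shards itself.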
Q: How does the Amazon Kinesis Client Library keep track of shards being processed by the application?
The Amazon Kinesis Client Library creates a DynamoDB table with the Amazon Kinesis application name, and uses it to maintain state information—such as resharding events in the stream, checkpoints etc. for the application. Each application has its own DynamoDB table. Because the application name is used to name the table, you should pick an application name that doesn't conflict with any existing DynamoDB tables in the same account and region. For this reason the application name—specified in the code and passed to the Amazon Kinesis Client Library—is significant. All workers associated with this application name are assumed to be working together on the same stream. If you run an additional instance of the same application code, but with a different application name, the Amazon Kinesis Client Library treats the second instance as an entirely separate application also operating on the same stream. Please note that your account will be charged for the costs associated with this Amazon DynamoDB table in addition to the costs associated with Amazon Kinesis itself.
For more information about how the Amazon Kinesis Client Library uses Amazon DynamoDB see Application State is Managed in Amazon DynamoDB.
Q: What languages is the Amazon Kinesis Client Library available in?
The Amazon Kinesis client library is currently available in Java. We will be looking to add support for other languages.
Q: Do I have to use the Amazon Kinesis Client Library to read data from an Amazon Kinesis stream?
No. You can use the Amazon Kinesis service API to get data from an Amazon Kinesis stream. We recommend using the Amazon Kinesis Client Library where applicable because the design patterns and other benefits make it more productive to develop business applications, while having the client library perform other heavy-lifting associated with distributed stream processing.
Q: How do I monitor stream size and performance?
Amazon Kinesis displays key operational metrics for your streams in the AWS Management Console. The service also integrates with Amazon CloudWatch so you can see your throughput, utilization, and latency for each Amazon Kinesis stream, and easily track your resource consumption.
Q: Does Amazon Kinesis support AWS Identity and Access Management (IAM) permissions?
Amazon Kinesis integrates with AWS Identity and Access Management (IAM), a service that enables capabilities such as creating users and groups under your AWS account, easily sharing your AWS resources between the users in your AWS account, assigning unique security credentials to each user, and more.
For example, with IAM permissions, you can create a policy that allows a specific user or group to use the PutRecord operation with any of the streams associated with the account, or apply policies to a specific set of users who should be able to get data from the stream for processing.
More details can be found in the Amazon Kinesis API reference guide on controlling access to Amazon Kinesis resources with IAM.
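A policy of the kind described above might be built as follows. This is a sketch: the account ID, region, and stream name in the ARN are placeholders, and the helper name is hypothetical. The policy document is generated as JSON so it can be pasted into IAM.

```python
import json


def put_record_policy(stream_arn):
    """Build an IAM policy document that allows PutRecord on the given stream."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["kinesis:PutRecord"],
                "Resource": [stream_arn],
            }
        ],
    }


# Placeholder ARN: the trailing "*" covers every stream in the account and region.
policy_json = json.dumps(
    put_record_policy("arn:aws:kinesis:us-east-1:123456789012:stream/*"),
    indent=2,
)
```

Replacing the wildcard with a single stream name narrows the permission to that stream only.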
Q: How will I be charged for my use of Amazon Kinesis?
- Amazon Kinesis has simple pay as you go pricing, which is easy to estimate. There are no up-front costs, no minimum fees and you’ll only pay for the resources you consume.
- Amazon Kinesis lets you specify the throughput requirements of your data stream using shards of throughput capacity. Behind the scenes, the service handles the provisioning of storage resources to achieve the requested throughput rate.
- A single shard of throughput will allow you to capture 1MB per second of data, at up to 1,000 PUT transactions per second (the ingest rate) at up to 50KB per PUT, and enable your processing applications to read data at up to 2 MB per second (the egress rate).
| Hourly Shard Rate (1 MB/second ingest rate, 2 MB/second egress rate) | $0.015 |
| Per 1,000,000 PUT transactions | $0.028 |
Amazon Kinesis is available in the US East (Northern Virginia) region.
Inbound data transfer is free, and you don’t pay for transfer from the Amazon Kinesis Stream to your Amazon EC2-based Kinesis applications. EC2 instance charges for your Amazon Kinesis processing applications apply.
Q: What are some pricing examples?
Let’s assume that the front end servers - i.e. the data producers - perform 1,000 PUTs per second in aggregate, each of 5KB size. For simplicity, we assume that the workload is relatively constant throughout the day and that data records are more or less uniform in size (i.e. 5 KB). Please note that you can easily scale up, and down to manage variable workloads and adjust for larger items, at any time. In this example, the processing applications run in the same region, i.e. US East.
First, we calculate the number of shards needed in the stream to ingest all of the incoming data. The stream creation wizard in the Amazon Kinesis console helps you make this estimate. The calculation proceeds as follows: 1,000 (PUTs per second) * 5 KB (per PUT) = 5,000 KB per second, or approximately 4.9 megabytes per second (MB/sec). Each shard in Amazon Kinesis can capture up to 1 MB/sec of data at up to 1,000 PUTs/sec in terms of ingest, and can support 2 MB/sec for egress to drive applications. Also, the data in every shard in the stream is stored durably for a maximum of 24 hours. In order to accommodate the workload described, you need 5 shards for a total stream ingress capacity of 5 MB/sec that supports up to 5,000 PUTs/sec in aggregate.
Using Kinesis pricing in the US-East Region,
- Price for shards: A shard costs $0.015 per hour. On a daily basis a single shard costs $ 0.36. So the total cost of provisioning 5 shards in your stream for an entire day is = $1.80. For the month of October with 31 days, the total cost incurred for shards is US $55.80. Please note that this includes the storage cost associated with the 24 hour retention period for each shard.
- Price for Requests: The cost for 1 million PUT operations is $0.028. In the example, the infrastructure servers PUT data at the rate of 1,000 PUTs per second in aggregate into the Amazon Kinesis stream. On a daily basis that is 86,400,000 PUTs, and on a monthly basis that amounts to 2,678,400,000 PUTs. The total cost for PUTs for the entire month of October is therefore approximately $75.00 (2,678.4 million PUTs * $0.028).
- Network Data Transfer
Network Data Transfer In: Data Transfer is $0.00 per GB for streams in the US Standard regions.
Network Data Transfer Out: In this example data is processed by EC2 instances running within the same region as the Amazon Kinesis stream, so no data transfer charge applies.
Adding the two pricing dimensions above, for $4.22 per day ($130.80 per month) you have a managed, real-time infrastructure with an ingest throughput of 5 MB/sec and an egress throughput of 10 MB/sec, continuously ingesting over 400 gigabytes of data per day in a durable, elastic manner while simultaneously feeding 2 real-time streaming data processing applications. Please note that Amazon Kinesis applications run on your own EC2 instances and are billed at standard EC2 rates.
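The example's arithmetic can be reproduced with a short function. This is a sketch using the prices quoted in this example ($0.015 per shard hour and $0.028 per million PUTs); the function name is hypothetical, and data transfer and EC2 charges are left out.

```python
def kinesis_cost(shards, puts_per_second, days,
                 shard_hour_price=0.015, price_per_million_puts=0.028):
    """Return (shard cost, PUT cost, total) in US dollars for the given period."""
    shard_cost = shards * shard_hour_price * 24 * days
    total_puts = puts_per_second * 86_400 * days  # 86,400 seconds per day
    put_cost = total_puts / 1_000_000 * price_per_million_puts
    return shard_cost, put_cost, shard_cost + put_cost
```

Calling `kinesis_cost(5, 1000, 31)` covers the 31-day month in the example, and `kinesis_cost(5, 1000, 1)` gives the daily figure.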