How do I detect and troubleshoot ReadProvisionedThroughputExceeded exceptions in Kinesis Data Streams?
Last updated: 2020-06-16
I'm encountering a ReadProvisionedThroughputExceeded error in Amazon Kinesis Data Streams. Why is this happening and how do I troubleshoot this?
The ReadProvisionedThroughputExceeded error occurs when GetRecords calls are throttled by Kinesis Data Streams over a duration of time.
Your Amazon Kinesis data stream can throttle if the following limits are breached:
- Each shard can support up to five read transactions per second (or 5 GetRecords calls/second for each shard).
- Each shard can support up to a maximum read rate of 2 MiB/second.
- GetRecords can retrieve up to 10 MiB of data per call from a single shard, and up to 10,000 records per call. If a call to GetRecords returns 10 MiB of data, subsequent calls made within the next five seconds result in an error.
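The limits above can be expressed as a small check. This is a minimal sketch: the function names are hypothetical, and reading the five-second rule as "a 10 MiB response uses up five seconds of the 2 MiB/second allowance" is an interpretation of the limits listed here, not a documented formula.

```python
MIB = 1024 * 1024

MAX_CALLS_PER_SECOND = 5        # GetRecords calls per second, per shard
MAX_BYTES_PER_SECOND = 2 * MIB  # read throughput per shard


def breached_read_limits(calls_per_second, bytes_per_second):
    """Return the names of the per-shard read limits that a load exceeds."""
    breached = []
    if calls_per_second > MAX_CALLS_PER_SECOND:
        breached.append("GetRecords calls per second")
    if bytes_per_second > MAX_BYTES_PER_SECOND:
        breached.append("read throughput")
    return breached


def seconds_of_budget_used(bytes_returned):
    """Seconds of the 2 MiB/second allowance that one response represents."""
    return bytes_returned / MAX_BYTES_PER_SECOND


print(breached_read_limits(6, 3 * MIB))
print(seconds_of_budget_used(10 * MIB))
```

A load of four calls and 1 MiB per second on one shard breaches nothing, while a single 10 MiB response accounts for five seconds of read allowance.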
If you encounter a ReadProvisionedThroughputExceeded error, consider the following approaches:
- Identify the root cause of your issue.
- Identify a possible microburst.
- Follow Data Streams best practices.
Identify the root cause of your issue
To identify the root cause of the ReadProvisionedThroughputExceeded error in Data Streams, monitor the Amazon Kinesis Data Streams Service with Amazon CloudWatch. Pay attention to the following metrics in CloudWatch:
- GetRecords.Bytes: The number of bytes retrieved from the Kinesis data stream, measured over a specified time period.
- GetRecords.Records: The number of records retrieved from the Kinesis data stream over a specified time period.
- ReadProvisionedThroughputExceeded: The number of GetRecords calls throttled for the Kinesis data stream over a specified time period.
Set up your CloudWatch dashboard to display your statistics as a Sum with the time period set to one minute. Then, divide Sum by 60 seconds to get an average value.
For example, if you use the GetRecords.Records metric, divide Sum by 60 seconds to calculate the average number of records retrieved per second. Then, check whether that average is below the per-second read limit for your Kinesis data stream. For more information about shard limits, see Kinesis Data Streams quotas and limits.
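The Sum-to-average arithmetic can be sketched as follows, using GetRecords.Bytes and the 2 MiB/second shard limit as the example. The function names and the shard_count parameter are hypothetical; multiplying the limit by the number of shards assumes that the metric is read at stream level, where it aggregates all shards.

```python
MIB = 1024 * 1024
BYTES_PER_SECOND_PER_SHARD = 2 * MIB


def per_second_average(minute_sum, period_seconds=60):
    """Convert a Sum statistic over one period into a per-second average."""
    return minute_sum / period_seconds


def average_within_limit(minute_sum, shard_count=1):
    """Compare the per-second average with the combined shard read limit."""
    limit = BYTES_PER_SECOND_PER_SHARD * shard_count
    return per_second_average(minute_sum) <= limit


# Example: GetRecords.Bytes Sum of 180 MiB in one minute on a 2-shard stream.
average = per_second_average(180 * MIB)
print(average / MIB)                       # average MiB read per second
print(average_within_limit(180 * MIB, 2))  # compared against 2 shards x 2 MiB/s
```

An average below the limit does not rule out throttling, because a one-minute Sum hides short spikes within the minute.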
Note: You can enable the enhanced (shard-level) monitoring feature to check whether the load is evenly distributed across all of your shards.
You can also use the GetRecords.Records metric with the statistic viewed as a SampleCount and the time period set to one minute. Divide the SampleCount value by 60 seconds to calculate the average number of GetRecords calls made per second for each shard. If the average value is around five GetRecords calls per second and you are getting a ReadProvisionedThroughputExceeded error, verify that your consumers aren't exceeding the shard's read throughput limit. If they aren't, then the ReadProvisionedThroughputExceeded error could be a result of your consumers together making more than five GetRecords calls/second against the same shard.
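As a rough illustration of the call-rate check, the sketch below converts a SampleCount into calls per second and also estimates the combined rate of several consumers polling the same shard. The function names and the polling intervals are hypothetical values chosen for the example.

```python
MAX_CALLS_PER_SECOND = 5  # per-shard GetRecords limit


def calls_per_second(sample_count, period_seconds=60):
    """Average GetRecords calls per second from a SampleCount statistic."""
    return sample_count / period_seconds


def combined_polling_rate(poll_intervals_ms):
    """Total calls per second when each consumer polls at a fixed interval."""
    return sum(1000 / interval for interval in poll_intervals_ms)


# SampleCount of 330 in one minute on a single shard.
print(calls_per_second(330) > MAX_CALLS_PER_SECOND)

# Six consumers that each poll the same shard once per second.
print(combined_polling_rate([1000] * 6) > MAX_CALLS_PER_SECOND)
```

Both checks print True: 330 calls in a minute averages 5.5 calls/second, and six once-per-second pollers add up to six calls/second on the shard.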
Finally, check whether the ReadProvisionedThroughputExceeded value differs between your shards. If records are distributed unevenly, so that one shard receives noticeably more or less data than the others, then there is a distribution imbalance. To resolve this imbalance and to avoid hot shards, use a UUID as the partition key in your PutRecords API calls.
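The effect of the partition key on shard distribution can be simulated locally. Kinesis Data Streams maps each partition key to a 128-bit value with MD5 and assigns the record to the shard whose hash key range contains that value. The sketch below assumes, for illustration only, that the shards split the hash space into equal ranges and that the digest is read as a big-endian integer; shard_for_key and distribution are hypothetical helper names.

```python
import hashlib
import uuid
from collections import Counter

HASH_SPACE = 2 ** 128  # size of the MD5 hash key space


def shard_for_key(partition_key, shard_count):
    """Map a partition key to a shard index, assuming equal hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    return hash_value * shard_count // HASH_SPACE


def distribution(partition_keys, shard_count):
    """Count how many records land on each shard."""
    return Counter(shard_for_key(key, shard_count) for key in partition_keys)


# One fixed key sends every record to the same shard (a hot shard).
hot = distribution(["device-42"] * 1000, shard_count=4)

# A fresh UUID per record spreads records across all shards.
spread = distribution([str(uuid.uuid4()) for _ in range(1000)], shard_count=4)

print(sorted(hot.items()))
print(sorted(spread.items()))
```

Random keys give up per-key ordering within a shard, so this approach fits workloads that don't depend on records with the same key arriving in order.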
Identify a possible microburst
Although rare, a Kinesis data stream can throttle reads even when the metric values are below the shard limits.
For example, consider a scenario where GetRecords.Bytes Sum:1min shows 10 MiB of data read in one minute. In the first second, a GetRecords call reads 2 MiB of data without any throttling. Then, in the second second, a GetRecords call reads 8 MiB of data. From the third second onward, there might not be any read operations or any throttling. Although the shard limit for the minute hasn't been reached (2 MiB × 60 = 120 MiB of data), you might receive a ReadProvisionedThroughputExceeded error because the 8 MiB read exceeds the 2 MiB/second limit. If you notice a sudden spike in the metric values, look for the microburst that is causing the ReadProvisionedThroughputExceeded exception.
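The scenario above can be reproduced with a few lines of arithmetic. This sketch assumes that per-second byte counts are available from somewhere other than the one-minute CloudWatch Sum, for example from the consumer's own logs; find_microbursts is a hypothetical helper.

```python
MIB = 1024 * 1024
BYTES_PER_SECOND_LIMIT = 2 * MIB


def find_microbursts(bytes_per_second, limit=BYTES_PER_SECOND_LIMIT):
    """Return (second, bytes) pairs where a single second exceeds the limit."""
    return [
        (second, read_bytes)
        for second, read_bytes in enumerate(bytes_per_second, start=1)
        if read_bytes > limit
    ]


# Bytes read in each second of one minute: 2 MiB, then 8 MiB, then nothing.
reads = [2 * MIB, 8 * MIB] + [0] * 58

minute_sum = sum(reads)                      # what Sum:1min would report
minute_budget = BYTES_PER_SECOND_LIMIT * 60  # 120 MiB

print(minute_sum <= minute_budget)  # the minute as a whole looks fine
print(find_microbursts(reads))      # but one second is over the limit
```

The minute total is well under 120 MiB, yet second 2 is flagged, which is the pattern to look for when throttling appears without an obvious breach in the one-minute metrics.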
Follow Data Streams best practices
To mitigate ReadProvisionedThroughputExceeded exceptions, apply these best practices:
- Reshard your stream to increase the number of shards in the stream.
- Reduce the size or the frequency of the GetRecords requests. To reduce the size, set the Limit parameter to cap the number of records returned per call.
Note: If the consumer is Amazon Kinesis Data Firehose, then the Kinesis data stream adjusts to the frequency of the GetRecords calls that are being made. If the consumer is an AWS Lambda function with event source mapping, then the stream is polled once every second. The polling frequency cannot be modified. If the consumer is an Amazon Kinesis Client Library (KCL) application, then adjust the polling frequency by modifying the value of the DEFAULT_IDLETIME_BETWEEN_READS_MILLIS parameter. For more information about how to modify this value in the KCL, see Amazon Web Services - Labs on the GitHub website.
- Distribute read and write operations as evenly as possible across all of the shards in Data Streams.
- Use consumers with enhanced fan-out. For more information about enhanced fan-out, see Developing custom consumers with dedicated throughput (enhanced fan-out).
Note: If your Kinesis data stream uses more than five consumers, it's a best practice to use consumers with enhanced fan-out.
- Use an error retry and exponential backoff mechanism in the consumer logic if ReadProvisionedThroughputExceeded exceptions are encountered. For consumer applications that use an AWS SDK, the requests are retried by default.
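For consumers that call GetRecords directly, the Limit and backoff practices above might be combined as in the following sketch. It is an illustration under stated assumptions, not the SDK's built-in retry logic: ThrottledError is a hypothetical stand-in for the client's throttling exception, the delay values are arbitrary, and client is any object that exposes get_records(ShardIterator=..., Limit=...).

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for the client's throttling exception."""


def get_records_with_backoff(client, shard_iterator, limit=1000,
                             max_attempts=5, base_delay=0.2, max_delay=5.0,
                             sleep=time.sleep):
    """Call get_records, retrying throttled calls with jittered backoff."""
    for attempt in range(max_attempts):
        try:
            # Limit caps the number of records returned by this call.
            return client.get_records(ShardIterator=shard_iterator, Limit=limit)
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            # Delay ceiling doubles each attempt: 0.2 s, 0.4 s, 0.8 s, ...
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter keeps several consumers from retrying in lockstep.
            sleep(random.uniform(0, ceiling))
```

With boto3, client would be boto3.client("kinesis"), and the exception to catch would be client.exceptions.ProvisionedThroughputExceededException instead of the stand-in class. The sleep parameter is injectable so that the retry schedule can be exercised in tests without waiting.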