How do I troubleshoot throttling errors in Kinesis Data Streams?

Last updated: 2020-03-30

My Amazon Kinesis data stream is being throttled, even though the stream didn't exceed its data limits. How do I detect and troubleshoot "Rate Exceeded" or "WriteProvisionedThroughputExceeded" errors?

Short Description

You can detect and troubleshoot throttling errors in your Kinesis data stream by doing the following:

  • Enable enhanced monitoring and compare IncomingBytes values.
  • Log full records to perform stream count and size checks.
  • Use random partition keys.
  • Check for obscure metrics or micro spikes in Amazon CloudWatch metrics.

Resolution

To prevent "Rate Exceeded" or "WriteProvisionedThroughputExceeded" errors in your Kinesis data stream, try the following:

Enable enhanced monitoring and compare IncomingBytes values

To verify whether you have hot shards, enable enhanced monitoring on your Kinesis data stream. When you enable shard-level monitoring on a Kinesis data stream, you can investigate each shard individually. Examine the stream on a per-shard basis to identify which shards receive more traffic than others or breach any service limits.

Note: Hot shards are often hidden in the Kinesis data stream metrics when enhanced monitoring is disabled, because the default metrics are aggregated across all shards in the stream. For more information about hot shards, see Strategies for Resharding.
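Enhanced shard-level monitoring can be turned on with a single API call. A minimal sketch using boto3 (the stream name is a placeholder; the metric list shown is a reasonable starting set, not the full catalog):

```python
# Shard-level metrics that help identify hot shards.
SHARD_LEVEL_METRICS = [
    "IncomingBytes",
    "IncomingRecords",
    "WriteProvisionedThroughputExceeded",
]

def build_monitoring_request(stream_name: str) -> dict:
    """Build the parameters for the EnableEnhancedMonitoring API call."""
    return {
        "StreamName": stream_name,
        "ShardLevelMetrics": SHARD_LEVEL_METRICS,
    }

print(build_monitoring_request("my-stream"))

# To apply (requires AWS credentials and an existing stream):
#   import boto3
#   boto3.client("kinesis").enable_enhanced_monitoring(
#       **build_monitoring_request("my-stream"))
```

After the call takes effect, each shard emits its own CloudWatch datapoints under the ShardId dimension, which is what makes the per-shard comparison below possible.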

You can also compare the IncomingBytes average and maximum values to verify whether there are hot shards in your stream. If enhanced monitoring is enabled, you can also see which specific shards deviate from the average.
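With shard-level metrics enabled, a shard whose average IncomingBytes runs well above the stream-wide mean stands out. A hypothetical sketch of that comparison (the factor-of-two threshold and the sample values are illustrative assumptions):

```python
def find_hot_shards(avg_bytes_per_shard: dict, factor: float = 2.0) -> list:
    """Return shard IDs whose average IncomingBytes exceed the
    stream-wide mean by the given factor."""
    if not avg_bytes_per_shard:
        return []
    mean = sum(avg_bytes_per_shard.values()) / len(avg_bytes_per_shard)
    return [sid for sid, v in avg_bytes_per_shard.items() if v > factor * mean]

# Example: shardId-2 receives far more traffic than its peers.
shards = {"shardId-0": 1.0e6, "shardId-1": 1.1e6, "shardId-2": 5.0e6}
print(find_hot_shards(shards))  # ['shardId-2']
```

In practice, the per-shard averages would come from CloudWatch (for example, via GetMetricStatistics on IncomingBytes with the ShardId dimension).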

Log full records to perform stream count and size checks

To identify micro spikes or obscure metrics that breach stream limits, log the full records or add custom code that performs count and size checks on the stream. Then, evaluate the number and size of records that are sent to the Kinesis data stream. This can help you identify any spikes that breach data limits.

You can also take the value from a one-minute data point and divide it by 60. This gives you an average value per second to help determine whether throttling is present within the specified time period. If the successful count doesn't breach the limits, then add the IncomingRecords metric to the WriteProvisionedThroughputExceeded metric, and retry the calculation. The IncomingRecords metric signals successful or accepted records, whereas the WriteProvisionedThroughputExceeded metric indicates how many records were throttled.
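The per-second check above can be sketched as simple arithmetic on the one-minute Sum datapoints. The sample values below are illustrative, and the 1,000 records/second figure is the per-shard write limit; the stream also has a per-shard byte limit that this sketch does not check:

```python
SHARD_RECORD_LIMIT_PER_SEC = 1_000  # Kinesis per-shard write limit (records/s)

def per_second_average(one_minute_sum: float) -> float:
    """Convert a one-minute Sum datapoint to an average per-second rate."""
    return one_minute_sum / 60.0

def throttling_explained(incoming_records: float,
                         throttled_records: float,
                         shard_count: int) -> bool:
    """True if accepted + throttled records together breach the stream's
    record-count limit (per-shard limit times number of shards)."""
    total_per_sec = per_second_average(incoming_records + throttled_records)
    return total_per_sec > shard_count * SHARD_RECORD_LIMIT_PER_SEC

# One-minute datapoints: 50,000 accepted + 16,000 throttled on a 1-shard stream.
print(throttling_explained(50_000, 16_000, 1))  # True: 1,100 rec/s > 1,000
```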

Note: Check the sizes and number of records that are sent from the producer. If the combined total of the incoming and throttled records is greater than the stream limits, then consider reducing the number or size of the records that the producer sends, or resharding the stream.

The PutRecord.Success metric is also a good indicator of failing operations. When there is a dip in the success metric, investigate the data producer logs to find the root causes of the failures. If throttling occurs, establish logging on the data producer side to determine the total number and size of submitted records. If the total number of submitted records breaches the stream limits, then Kinesis throttles the excess records. For more information about Kinesis stream limits, see Kinesis Data Streams Quotas.
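A dip in PutRecord.Success can also be flagged programmatically. A hypothetical sketch over a series of one-minute Average datapoints (the 0.99 threshold is an assumption, not an AWS recommendation):

```python
def success_dips(datapoints: list, threshold: float = 0.99) -> list:
    """Return the indices of datapoints where the PutRecord.Success
    average drops below the threshold, i.e. where puts started failing."""
    return [i for i, v in enumerate(datapoints) if v < threshold]

# One-minute averages; the third minute shows throttled or failed puts.
print(success_dips([1.0, 0.995, 0.92, 1.0]))  # [2]
```

Each flagged index points at a time window worth cross-checking against the producer logs and the WriteProvisionedThroughputExceeded metric.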

Use random partition keys

If there are hot shards in your Kinesis data stream, use a random partition key to ingest your records. If the operations already use a random partition key, then adjust the key to correct the distribution. Then, monitor metrics such as IncomingBytes and IncomingRecords for changes. If the maximum and average patterns are close together, then there are no hot shards.
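One common way to randomize the partition key is to generate a UUID per record; Kinesis hashes the key to choose a shard, so high-entropy keys spread records evenly. A sketch (the stream name is a placeholder, and the PutRecord call requires AWS credentials):

```python
import uuid

def random_partition_key() -> str:
    """Generate a high-entropy partition key so records spread across shards."""
    return uuid.uuid4().hex  # 32 hex characters

print(random_partition_key())

# To send a record with it (requires AWS credentials and an existing stream):
#   import boto3
#   boto3.client("kinesis").put_record(
#       StreamName="my-stream",          # placeholder stream name
#       Data=b'{"event": "example"}',
#       PartitionKey=random_partition_key(),
#   )
```

Note that a fully random key gives up per-key ordering; use it only where records don't need to be ordered by a business key.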

Check for obscure metrics or micro spikes in Amazon CloudWatch metrics

If the CloudWatch metrics don't clearly indicate any breaches or micro spikes, use the record logging and per-second calculations described in the previous sections to surface them.

For more information about stream quotas, see Kinesis Data Streams Quotas.