Why does the IteratorAgeMilliseconds value in Kinesis Data Streams keep increasing?

Last updated: 2022-06-22

The IteratorAgeMilliseconds metric keeps increasing in Amazon Kinesis Data Streams. Why is this happening?

Short description

The IteratorAgeMilliseconds metric in Kinesis Data Streams can increase for the following reasons:

  • Slow record processing
  • Read throttles
  • AWS Lambda function errors
  • Intermittent connection timeouts
  • Uneven data distribution among shards

Resolution

Slow record processing

An overload of consumer processing logic can contribute to slow record processing. If the consumer is built using the Amazon Kinesis Client Library (KCL), then check for the following root causes:

  • Insufficient physical resources: Check whether your instance has enough physical resources, such as memory and CPU capacity, during peak demand.
  • Failure to scale: Consumer record processing logic can fail to scale with the increased load on the Amazon Kinesis data stream. You can verify this by monitoring the custom Amazon CloudWatch metrics that KCL emits for the processTask, RecordProcessor.processRecords.Time, Success, and RecordsProcessed operations. You can also check the overall throughput of the Kinesis data stream by monitoring the IncomingBytes and IncomingRecords CloudWatch metrics. For more information about KCL and custom CloudWatch metrics, see Monitoring the Kinesis Client Library with Amazon CloudWatch. If you can't reduce the processing time, then consider scaling up the Kinesis data stream by increasing the number of shards (see the sketch after this list).
  • Overlapping processing increases: If you see an increase in the processRecords.Time value that doesn't correlate with the increased traffic load, then check your consumer's record processing logic. It might be making synchronous blocking calls that delay consumer record processing. Another way to mitigate this issue is to increase the number of shards in your Kinesis data stream. For more information about the number of shards needed, see Resharding, Scaling, and Parallel Processing.
  • Insufficient GetRecords requests: If the consumer doesn't send GetRecords requests frequently enough, then the consumer application can fall behind. To verify, check the KCL configurations withMaxRecords and withIdleTimeBetweenReadsInMillis.
  • Lambda trigger configuration: If the consumer is a Lambda function, then slow processing can be caused by the trigger configuration (for example, a low batch size), by blocked calls, or by insufficient Lambda memory. For more information and troubleshooting steps, see Configuring Lambda function options.
  • Insufficient Throughput or High MillisBehindLatest: If you're using Amazon Kinesis Data Analytics for SQL, see Insufficient Throughput or High MillisBehindLatest for troubleshooting steps.
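If you decide to scale up the stream, one way is to increase the open shard count with the UpdateShardCount API. The following is a minimal sketch using the AWS SDK for Python (Boto3); the stream name is a placeholder, and doubling the shard count is only an illustration, not a sizing recommendation:

import boto3

kinesis = boto3.client("kinesis")

STREAM_NAME = "my-data-stream"  # placeholder: your stream name

# Check the current number of open shards before scaling.
summary = kinesis.describe_stream_summary(StreamName=STREAM_NAME)
current_shards = summary["StreamDescriptionSummary"]["OpenShardCount"]
print(f"Current open shards: {current_shards}")

# Double the shard count so that consumers can read from more shards in parallel.
# UNIFORM_SCALING splits the hash key space evenly across the new shards.
kinesis.update_shard_count(
    StreamName=STREAM_NAME,
    TargetShardCount=current_shards * 2,
    ScalingType="UNIFORM_SCALING",
)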

If the consumers fall behind and there is a risk of data expiration, then increase the retention period of the stream. By default, the retention period is 24 hours and it can be configured for up to one year. For more information about data retention periods, see Changing the data retention period.
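For example, the following minimal Boto3 sketch extends the retention period to seven days so that records aren't lost while consumers catch up (the stream name and retention value are placeholders):

import boto3

kinesis = boto3.client("kinesis")

# Extend retention from the 24-hour default to 7 days (168 hours).
# Valid values are 24 to 8760 hours (365 days); longer retention incurs additional cost.
kinesis.increase_stream_retention_period(
    StreamName="my-data-stream",   # placeholder: your stream name
    RetentionPeriodHours=168,
)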

Read throttles

Check the ReadProvisionedThroughputExceeded metric to see if there are read throttles on the stream.

Read throttles can be caused by:

  • one or more consumers breaching the limit of five GetRecords calls per second, per shard
  • one or more consumers breaching the limit of 2 MiB of read throughput per second, per shard

For more information about read throttles on Kinesis streams, see How do I detect and troubleshoot ReadProvisionedThroughputExceeded exceptions in Kinesis Data Streams?
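To quantify the throttling, one option is to query the ReadProvisionedThroughputExceeded metric programmatically. The following is a minimal Boto3 sketch that sums throttled read requests over the last hour; the stream name is a placeholder:

import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

now = datetime.now(timezone.utc)

# Sum of throttled read requests per 5-minute period over the last hour.
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/Kinesis",
    MetricName="ReadProvisionedThroughputExceeded",
    Dimensions=[{"Name": "StreamName", "Value": "my-data-stream"}],  # placeholder
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,
    Statistics=["Sum"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])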

Lambda function error

In Amazon CloudWatch, review the Lambda functions that consume the stream where the IteratorAgeMilliseconds count keeps increasing. You can identify the errors that cause the IteratorAgeMilliseconds value to increase by reviewing the Errors summary in CloudWatch. Check whether the timestamps of the Lambda function errors match the times when the IteratorAgeMilliseconds metric of your Kinesis data stream increases. Matching timestamps confirm that the errors are the cause of the increase.
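One way to compare the two timelines is to pull both metrics over the same window with a single GetMetricData call. The following Boto3 sketch is a minimal illustration; the function name and stream name are placeholders:

import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

# Pull the Lambda Errors metric and the stream's iterator age in one call
# so that the timestamps can be compared side by side.
response = cloudwatch.get_metric_data(
    MetricDataQueries=[
        {
            "Id": "lambda_errors",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Lambda",
                    "MetricName": "Errors",
                    "Dimensions": [{"Name": "FunctionName", "Value": "my-consumer-function"}],  # placeholder
                },
                "Period": 300,
                "Stat": "Sum",
            },
        },
        {
            "Id": "iterator_age",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Kinesis",
                    "MetricName": "GetRecords.IteratorAgeMilliseconds",
                    "Dimensions": [{"Name": "StreamName", "Value": "my-data-stream"}],  # placeholder
                },
                "Period": 300,
                "Stat": "Maximum",
            },
        },
    ],
    StartTime=now - timedelta(hours=3),
    EndTime=now,
)

for result in response["MetricDataResults"]:
    print(result["Id"], list(zip(result["Timestamps"], result["Values"])))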

Note: A Lambda function can throw the same error repeatedly because the failing batch is retried. As a Kinesis consumer, Lambda doesn't skip records, so it keeps retrying the batch until processing succeeds or the records expire. While these records are retried, processing delays grow, your consumer falls behind the stream, and the IteratorAgeMilliseconds metric increases.
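If retries of a failing batch are holding back a shard, one mitigation is to tune the Kinesis event source mapping so that Lambda splits failing batches and eventually gives up on records that it can't process. The following Boto3 sketch is illustrative; the event source mapping UUID and the retry values are placeholders:

import boto3

lambda_client = boto3.client("lambda")

# UUID of the Kinesis event source mapping (placeholder); find it with
# lambda_client.list_event_source_mappings(FunctionName="my-consumer-function").
MAPPING_UUID = "11111111-2222-3333-4444-555555555555"

lambda_client.update_event_source_mapping(
    UUID=MAPPING_UUID,
    BisectBatchOnFunctionError=True,   # split a failing batch to isolate bad records
    MaximumRetryAttempts=3,            # stop retrying a batch after 3 attempts
    MaximumRecordAgeInSeconds=3600,    # skip records older than 1 hour
)

Lowering MaximumRetryAttempts or MaximumRecordAgeInSeconds trades completeness for liveness: records that still fail are skipped (or sent to an on-failure destination, if one is configured).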

Intermittent connection timeout

Your consumer application can experience a connection timeout issue while pulling records from the Kinesis data stream. Intermittent connection timeout errors can cause a significant increase in the IteratorAgeMilliseconds count.

To verify whether the increase is related to a connection timeout, check the GetRecords.Latency and GetRecords.Success metrics. If both metrics are impacted during the same period, then the timeouts are the likely cause, and the IteratorAgeMilliseconds count stops increasing after the connection is restored.
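If the timeouts originate on the consumer side, one mitigation is to configure the Kinesis client with explicit timeouts and automatic retries so that a stalled connection fails fast and is retried. The following Boto3 sketch is illustrative; the timeout and retry values are assumptions, not recommendations:

import boto3
from botocore.config import Config

# Fail fast on stalled connections and retry transient errors automatically.
client_config = Config(
    connect_timeout=5,                 # seconds to establish a connection
    read_timeout=10,                   # seconds to wait for a response
    retries={"max_attempts": 5, "mode": "standard"},
)

kinesis = boto3.client("kinesis", config=client_config)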

Uneven data distribution among shards

Some shards in your Kinesis data stream might receive more records than others because the partition key used in Put operations doesn't distribute the data evenly across the shards. This uneven data distribution results in fewer parallel GetRecords calls to the Kinesis data stream shards, causing an increase in the IteratorAgeMilliseconds count.
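To confirm that the distribution is uneven, you can turn on enhanced (shard-level) monitoring and compare IncomingBytes and IncomingRecords across shards. The following minimal Boto3 sketch assumes a placeholder stream name; note that enhanced monitoring incurs additional CloudWatch charges:

import boto3

kinesis = boto3.client("kinesis")

# Emit per-shard metrics so that hot shards can be identified in CloudWatch.
kinesis.enable_enhanced_monitoring(
    StreamName="my-data-stream",  # placeholder: your stream name
    ShardLevelMetrics=["IncomingBytes", "IncomingRecords"],
)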

If data is unevenly distributed across the shards in your Kinesis stream, then use random partition keys. Random partition keys spread records evenly across the shards, which allows consumer applications to read records from the shards in parallel and catch up faster.
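For example, the following minimal Boto3 sketch writes each record with a random UUID as the partition key so that records spread evenly across the shards (the stream name and payload are placeholders). Keep in mind that a random key gives up per-key ordering, so use it only when record ordering by business key doesn't matter:

import json
import uuid
import boto3

kinesis = boto3.client("kinesis")

def put_event(event: dict) -> None:
    # A random partition key spreads records evenly across all shards.
    kinesis.put_record(
        StreamName="my-data-stream",          # placeholder: your stream name
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(uuid.uuid4()),
    )

put_event({"example": "payload"})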