Why does the IteratorAgeMilliseconds value in Kinesis Data Streams keep increasing?
Last updated: 2020-05-06
The IteratorAgeMilliseconds metric keeps increasing in Amazon Kinesis Data Streams. Why is this happening?
The IteratorAgeMilliseconds metric in Kinesis Data Streams can increase for the following reasons:
- Slow record processing
- Read throttles
- AWS Lambda function error
- Connection timeout
Slow record processing
An overload of consumer processing logic can contribute to slow record processing. If the consumer is built using the Amazon Kinesis Client Library (KCL), then check for the following root causes:
- Insufficient physical resources: Check whether your instance has adequate physical resources, such as memory and CPU, during peak demand.
- Failure to scale: Consumer record processing logic can fail to scale with the increased load of the Amazon Kinesis data stream. You can verify this by monitoring the custom Amazon CloudWatch metrics emitted by KCL. These metrics are associated with the following operations: processTask, RecordProcessor.processRecords.Time, Success, and RecordsProcessed. You can also check the overall throughput of the Kinesis data stream by monitoring the CloudWatch metrics IncomingBytes and IncomingRecords. For more information about KCL and custom CloudWatch metrics, see Monitoring the Kinesis Client Library with Amazon CloudWatch. If the processing time can't be reduced, then consider scaling the stream up by increasing the number of shards.
- Data expiration: If the consumers are falling behind and there is a risk of data expiration, then increase the retention period of the stream. By default, the retention period is 24 hours and it can be configured for up to 7 days. For more information about data retention periods, see Changing the Data Retention Period.
- Overlapping processing increases: If you see an increase in the processRecords.Time value that doesn't correlate with increased traffic, then check your record processing logic. It might be making synchronous blocking calls that delay consumer record processing. You can also mitigate this issue by increasing the number of shards in your Kinesis data stream. For more information about the number of shards needed, see Resharding, Scaling, and Parallel Processing.
- Insufficient GetRecords requests: If the consumer isn't sending the GetRecords requests frequently enough, then the consumer application can fall behind.
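If you decide to add shards, a single UpdateShardCount call with UNIFORM_SCALING can scale up to at most double the current open shard count. The sketch below clamps a desired count to that limit; the stream name and target are placeholders, and the boto3 calls in the usage comment assume configured AWS credentials.

```python
def target_shard_count(current_shards: int, desired_shards: int) -> int:
    """Clamp a desired shard count to what one UpdateShardCount call
    allows (at most double the current open shard count)."""
    return min(desired_shards, current_shards * 2)

# usage (requires AWS credentials; "my-stream" is a placeholder):
# import boto3
# kinesis = boto3.client("kinesis")
# current = kinesis.describe_stream_summary(StreamName="my-stream")[
#     "StreamDescriptionSummary"]["OpenShardCount"]
# kinesis.update_shard_count(
#     StreamName="my-stream",
#     TargetShardCount=target_shard_count(current, 8),
#     ScalingType="UNIFORM_SCALING",
# )
```

If you need more than double the current count, repeat the call after the stream returns to ACTIVE status.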
Read throttles
Check the ReadProvisionedThroughputExceeded metric to see if there are read throttles on the stream. For more information about read throttles on Kinesis streams, see Monitoring the Amazon Kinesis Data Streams Service with Amazon CloudWatch.
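One way to check for read throttles is to pull the ReadProvisionedThroughputExceeded metric from CloudWatch. The helper below is a hypothetical sketch that builds the parameters for a get_metric_statistics call; the stream name is a placeholder.

```python
from datetime import datetime, timedelta, timezone

def read_throttle_query(stream_name: str, hours: int = 3) -> dict:
    """Build get_metric_statistics parameters to check for read
    throttles over the last few hours (hypothetical helper)."""
    now = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/Kinesis",
        "MetricName": "ReadProvisionedThroughputExceeded",
        "Dimensions": [{"Name": "StreamName", "Value": stream_name}],
        "StartTime": now - timedelta(hours=hours),
        "EndTime": now,
        "Period": 300,  # 5-minute buckets
        "Statistics": ["Sum"],
    }

# usage (requires AWS credentials):
# import boto3
# cloudwatch = boto3.client("cloudwatch")
# resp = cloudwatch.get_metric_statistics(**read_throttle_query("my-stream"))
# # any datapoint with a nonzero Sum indicates throttled GetRecords calls
```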
Lambda function error
In Amazon CloudWatch, review the Lambda functions for the stream where the IteratorAgeMilliseconds count keeps increasing. You can identify the errors causing the increase by reviewing the Errors summary in CloudWatch. Check whether the timestamp of the Lambda function error matches the time when the IteratorAgeMilliseconds metric of your Kinesis data stream increased. A matching timestamp confirms that the error is the cause of the increase.
Note: When a Lambda function throws an error, Lambda retries the same batch of records because, as a Kinesis consumer, it doesn't skip records. As these records are retried, the delays in processing also increase. Your consumer then falls behind the stream, causing the IteratorAgeMilliseconds metric to increase.
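To keep a persistently failing record from blocking its shard through endless retries, you can tune the Kinesis event source mapping. The sketch below builds parameters for update_event_source_mapping; the UUID is a placeholder, and the specific values (3 retries, 1-hour maximum record age) are illustrative assumptions, not recommendations from this article.

```python
def retry_mitigation_config(mapping_uuid: str) -> dict:
    """Parameters for lambda.update_event_source_mapping that bound
    retries on a Kinesis event source (values are illustrative)."""
    return {
        "UUID": mapping_uuid,                # placeholder mapping UUID
        "MaximumRetryAttempts": 3,           # default -1 retries until the record expires
        "BisectBatchOnFunctionError": True,  # split failing batches to isolate bad records
        "MaximumRecordAgeInSeconds": 3600,   # give up on records older than 1 hour
    }

# usage (requires AWS credentials):
# import boto3
# boto3.client("lambda").update_event_source_mapping(
#     **retry_mitigation_config("<mapping-uuid>"))
```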
Connection timeout
Your Lambda function can experience a connection timeout issue while pulling records from the Kinesis data stream. A connection timeout can cause a significant increase in the IteratorAgeMilliseconds count.
To verify whether the increase is related to a connection timeout, check the GetRecords.Latency and GetRecords.Success metrics. If both metrics are also impacted, then a connection timeout is the likely cause, and the IteratorAgeMilliseconds count stops increasing after the connection is restored.
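Both metrics can be fetched in one CloudWatch get_metric_data call. The helper below is a hypothetical sketch that builds the MetricDataQueries entries for GetRecords.Latency and GetRecords.Success; the stream name is a placeholder.

```python
def get_records_health_queries(stream_name: str) -> list:
    """Build MetricDataQueries covering GetRecords.Latency and
    GetRecords.Success for one stream (hypothetical helper)."""
    return [
        {
            "Id": query_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Kinesis",
                    "MetricName": metric_name,
                    "Dimensions": [
                        {"Name": "StreamName", "Value": stream_name}
                    ],
                },
                "Period": 300,
                "Stat": "Average",
            },
        }
        for query_id, metric_name in [
            ("latency", "GetRecords.Latency"),
            ("success", "GetRecords.Success"),
        ]
    ]

# usage (requires AWS credentials; pass StartTime/EndTime as well):
# import boto3
# cloudwatch = boto3.client("cloudwatch")
# resp = cloudwatch.get_metric_data(
#     MetricDataQueries=get_records_health_queries("my-stream"),
#     StartTime=start, EndTime=end)
# # rising latency with falling success suggests connection trouble
```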