Why does the IteratorAgeMilliseconds value in Kinesis Data Streams keep increasing?
Last updated: 2020-05-06
The IteratorAgeMilliseconds metric keeps increasing in Amazon Kinesis Data Streams. Why is this happening?
The IteratorAgeMilliseconds metric in Kinesis Data Streams can increase for the following reasons:
- Slow record processing
- Read throttles
- AWS Lambda function error
- Connection timeout
Slow record processing
An overload of consumer processing logic can contribute to slow record processing. If the consumer is built using the Amazon Kinesis Client Library (KCL), then check for the following root causes:
- Insufficient physical resources: Check whether your instance has adequate physical resources, such as memory and CPU, during peak demand.
- Failure to scale: Consumer record processing logic can fail to scale with the increased load of the Amazon Kinesis data stream. You can verify this by monitoring the custom Amazon CloudWatch metrics emitted by KCL. These metrics are associated with the following operations: processTask, RecordProcessor.processRecords.Time, Success, and RecordsProcessed. You can also check the overall throughput of the Kinesis data stream by monitoring the CloudWatch metrics IncomingBytes and IncomingRecords. For more information about KCL and custom CloudWatch metrics, see Monitoring the Kinesis Client Library with Amazon CloudWatch. If the processing time can't be reduced, then consider scaling the stream up by increasing the number of shards.
- Data expiration: If the consumers are falling behind and there is a risk of data expiration, then increase the retention period of the stream. By default, the retention period is 24 hours and it can be configured for up to 7 days. For more information about data retention periods, see Changing the Data Retention Period.
- Overlapping processing increases: If you see an increase in the processRecords.Time value that doesn't correlate with increased traffic, then check your record processing logic. It might be making synchronous blocking calls that delay consumer record processing. You can also mitigate this issue by increasing the number of shards in your Kinesis data stream. For more information about the number of shards needed, see Resharding, Scaling, and Parallel Processing.
- Insufficient GetRecords requests: If the consumer isn't sending the GetRecords requests frequently enough, then the consumer application can fall behind.
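If you decide to add shards, a single UpdateShardCount call with UNIFORM_SCALING can scale up to at most double the current open shard count. The sketch below clamps a desired count to that limit; the stream name and target are placeholders, and the boto3 calls in the usage comment assume configured AWS credentials.

```python
def target_shard_count(current_shards: int, desired_shards: int) -> int:
    """Clamp a desired shard count to what one UpdateShardCount call
    allows (at most double the current open shard count)."""
    return min(desired_shards, current_shards * 2)

# usage (requires AWS credentials; "my-stream" is a placeholder):
# import boto3
# kinesis = boto3.client("kinesis")
# current = kinesis.describe_stream_summary(StreamName="my-stream")[
#     "StreamDescriptionSummary"]["OpenShardCount"]
# kinesis.update_shard_count(
#     StreamName="my-stream",
#     TargetShardCount=target_shard_count(current, 8),
#     ScalingType="UNIFORM_SCALING",
# )
```

If you need more than double the current count, repeat the call after the stream returns to ACTIVE status.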
Read throttles
Check the ReadProvisionedThroughputExceeded metric to see if there are read throttles on the stream. For more information about read throttles on Kinesis streams, see Monitoring the Amazon Kinesis Data Streams Service with Amazon CloudWatch.
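One way to check for read throttles is to pull the ReadProvisionedThroughputExceeded metric from CloudWatch. The helper below is a hypothetical sketch that builds the parameters for a get_metric_statistics call; the stream name is a placeholder.

```python
from datetime import datetime, timedelta, timezone

def read_throttle_query(stream_name: str, hours: int = 3) -> dict:
    """Build get_metric_statistics parameters to check for read
    throttles over the last few hours (hypothetical helper)."""
    now = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/Kinesis",
        "MetricName": "ReadProvisionedThroughputExceeded",
        "Dimensions": [{"Name": "StreamName", "Value": stream_name}],
        "StartTime": now - timedelta(hours=hours),
        "EndTime": now,
        "Period": 300,  # 5-minute buckets
        "Statistics": ["Sum"],
    }

# usage (requires AWS credentials):
# import boto3
# cloudwatch = boto3.client("cloudwatch")
# resp = cloudwatch.get_metric_statistics(**read_throttle_query("my-stream"))
# # any datapoint with a nonzero Sum indicates throttled GetRecords calls
```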
Lambda function error
In Amazon CloudWatch, review the Lambda functions for the stream where the IteratorAgeMilliseconds count keeps increasing. You can identify the errors causing the increase by reviewing the Errors summary in CloudWatch. Check whether the timestamp of the Lambda function error matches the time when the IteratorAgeMilliseconds metric of your Kinesis data stream increased. A matching timestamp confirms that the error is the cause of the increase.
Note: When a Lambda function throws an error, Lambda retries the same batch of records because, as a Kinesis consumer, it doesn't skip records. As these records are retried, the delays in processing also increase. Your consumer then falls behind the stream, causing the IteratorAgeMilliseconds metric to increase.
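To keep a persistently failing record from blocking its shard through endless retries, you can tune the Kinesis event source mapping. The sketch below builds parameters for update_event_source_mapping; the UUID is a placeholder, and the specific values (3 retries, 1-hour maximum record age) are illustrative assumptions, not recommendations from this article.

```python
def retry_mitigation_config(mapping_uuid: str) -> dict:
    """Parameters for lambda.update_event_source_mapping that bound
    retries on a Kinesis event source (values are illustrative)."""
    return {
        "UUID": mapping_uuid,                # placeholder mapping UUID
        "MaximumRetryAttempts": 3,           # default -1 retries until the record expires
        "BisectBatchOnFunctionError": True,  # split failing batches to isolate bad records
        "MaximumRecordAgeInSeconds": 3600,   # give up on records older than 1 hour
    }

# usage (requires AWS credentials):
# import boto3
# boto3.client("lambda").update_event_source_mapping(
#     **retry_mitigation_config("<mapping-uuid>"))
```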
Connection timeout
Your Lambda function can experience a connection timeout issue while pulling records from the Kinesis data stream. A connection timeout can cause a significant increase in the IteratorAgeMilliseconds count.
To verify whether the increase is related to a connection timeout, check the GetRecords.Latency and GetRecords.Success metrics. If both metrics are also impacted, then a connection timeout is the likely cause, and the IteratorAgeMilliseconds count stops increasing after the connection is restored.
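Both metrics can be fetched in one CloudWatch get_metric_data call. The helper below is a hypothetical sketch that builds the MetricDataQueries entries for GetRecords.Latency and GetRecords.Success; the stream name is a placeholder.

```python
def get_records_health_queries(stream_name: str) -> list:
    """Build MetricDataQueries covering GetRecords.Latency and
    GetRecords.Success for one stream (hypothetical helper)."""
    return [
        {
            "Id": query_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Kinesis",
                    "MetricName": metric_name,
                    "Dimensions": [
                        {"Name": "StreamName", "Value": stream_name}
                    ],
                },
                "Period": 300,
                "Stat": "Average",
            },
        }
        for query_id, metric_name in [
            ("latency", "GetRecords.Latency"),
            ("success", "GetRecords.Success"),
        ]
    ]

# usage (requires AWS credentials; pass StartTime/EndTime as well):
# import boto3
# cloudwatch = boto3.client("cloudwatch")
# resp = cloudwatch.get_metric_data(
#     MetricDataQueries=get_records_health_queries("my-stream"),
#     StartTime=start, EndTime=end)
# # rising latency with falling success suggests connection trouble
```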