Why does the IteratorAgeMilliseconds value in Kinesis Data Streams continue to increase?

5 minute read

The IteratorAgeMilliseconds metric continues to increase in Amazon Kinesis Data Streams.

Short description

The IteratorAgeMilliseconds metric in Kinesis Data Streams can increase for the following reasons:

Slow record processing
Read throttles
AWS Lambda function error
Intermittent connection timeout
Uneven data distribution among shards
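Before you work through the causes, it can help to confirm that the metric is rising steadily rather than spiking briefly. The following is a minimal sketch, not an official tool. It assumes a boto3-style CloudWatch client is passed in, and the helper names fetch_iterator_age and is_steadily_increasing are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def fetch_iterator_age(cloudwatch, stream_name, minutes=60, period=60):
    """Return (timestamp, max iterator age in ms) pairs, oldest first.

    `cloudwatch` is assumed to be a boto3 CloudWatch client (or a compatible stub).
    """
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/Kinesis",
        MetricName="GetRecords.IteratorAgeMilliseconds",
        Dimensions=[{"Name": "StreamName", "Value": stream_name}],
        StartTime=end - timedelta(minutes=minutes),
        EndTime=end,
        Period=period,
        Statistics=["Maximum"],
    )
    # CloudWatch does not guarantee datapoint order, so sort by timestamp.
    points = sorted(response["Datapoints"], key=lambda p: p["Timestamp"])
    return [(p["Timestamp"], p["Maximum"]) for p in points]


def is_steadily_increasing(values, min_points=5):
    """True if the last `min_points` values never decrease and end higher than they start."""
    recent = values[-min_points:]
    if len(recent) < min_points:
        return False
    non_decreasing = all(a <= b for a, b in zip(recent, recent[1:]))
    return non_decreasing and recent[-1] > recent[0]
```

With boto3 installed and credentials configured, you might call fetch_iterator_age(boto3.client("cloudwatch"), "my-stream") and pass the second element of each pair to is_steadily_increasing. The stream name here is a placeholder.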

Resolution

Slow record processing

An overload of consumer processing logic can contribute to slow record processing. If the Amazon Kinesis Client Library (KCL) is used to build the consumer, then check for the following root causes:

Insufficient physical resources: During peak demand, check whether your instance has adequate physical resources, such as memory or CPU capacity.
Failure to scale: Consumer record processing logic can fail to scale with the increased load of the Amazon Kinesis data stream. To verify scale failures, monitor the custom Amazon CloudWatch metrics that the KCL publishes for the processTask operation, including RecordProcessor.processRecords.Time, Success, and RecordsProcessed. You can also monitor the IncomingBytes and IncomingRecords CloudWatch metrics to check the overall throughput of the Kinesis data stream. For more information about the KCL and custom CloudWatch metrics, see Monitoring the Kinesis Client Library with Amazon CloudWatch. If you can't reduce the processing time, then increase the number of shards to scale up the Kinesis data stream.
Overlapping processing increases: If you see an increase in the processRecords.Time value that doesn't correlate with the increased traffic load, then check your consumer record processing logic. Your record processing logic might make synchronous blocking calls that delay consumer record processing. You can also increase the number of shards in your Kinesis data stream to resolve this issue. For more information about the number of shards that you need, see Resharding, scaling, and parallel processing.
Insufficient GetRecords requests: If the consumer doesn't send the GetRecords requests frequently enough, then the consumer application can fall behind. Check the withMaxRecords and withIdleTimeBetweenReadsInMillis KCL configurations.
Insufficient Throughput or High MillisBehindLatest: If you're using Amazon Managed Service for Apache Flink for SQL, then see Insufficient throughput or High MillisBehindLatest or Consumer record processing falling behind.
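To estimate how many shards a consumer needs to keep up, you can compare the incoming record rate with the rate that one record processor can sustain. The following is a rough sketch under stated assumptions: one record processor per shard, batches processed one at a time, and each loop costing the processRecords time plus the idle time between reads. The function name shards_to_keep_up and its parameters are hypothetical, and real throughput depends on your workload.

```python
import math

# Per-shard write quota in provisioned mode: 1,000 records per second.
WRITE_RECORDS_PER_SEC_PER_SHARD = 1000


def shards_to_keep_up(incoming_records_per_sec, batch_size,
                      process_batch_seconds, idle_seconds=1.0):
    """Estimate the shard count at which consumers stop falling behind.

    Assumes each shard is read by one processor that handles `batch_size`
    records every (process_batch_seconds + idle_seconds) seconds.
    """
    per_shard_consume_rate = batch_size / (process_batch_seconds + idle_seconds)
    for_consumers = math.ceil(incoming_records_per_sec / per_shard_consume_rate)
    for_writes = math.ceil(incoming_records_per_sec / WRITE_RECORDS_PER_SEC_PER_SHARD)
    return max(1, for_consumers, for_writes)
```

For example, at 2,000 incoming records per second, with batches of 500 records that take 1 second to process and 1 second of idle time, each shard drains 250 records per second, so the estimate is 8 shards. Reducing the processing time or the idle time lowers that number.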

If the consumers fall behind and there's a risk of data expiration, then increase the retention period of the stream. By default, the retention period is 24 hours, and you can configure the period for up to one year. For more information about data retention periods, see Changing the data retention period.
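The following sketch shows one way to judge how close unread data is to expiring and to extend the retention period. It assumes a boto3-style Kinesis client is passed in. The helper names hours_until_expiry and extend_retention are hypothetical, and the bounds reflect the 24-hour default and one-year maximum described above.

```python
MIN_RETENTION_HOURS = 24
MAX_RETENTION_HOURS = 8760  # one year

MS_PER_HOUR = 3_600_000


def hours_until_expiry(retention_hours, iterator_age_ms):
    """Hours left before the oldest unread record ages out of the stream."""
    age_hours = iterator_age_ms / MS_PER_HOUR
    return max(0.0, retention_hours - age_hours)


def extend_retention(kinesis, stream_name, retention_hours):
    """Raise the stream's retention period.

    `kinesis` is assumed to be a boto3 Kinesis client (or a compatible stub).
    """
    if not MIN_RETENTION_HOURS <= retention_hours <= MAX_RETENTION_HOURS:
        raise ValueError(
            f"retention_hours must be between {MIN_RETENTION_HOURS} "
            f"and {MAX_RETENTION_HOURS}"
        )
    kinesis.increase_stream_retention_period(
        StreamName=stream_name,
        RetentionPeriodHours=retention_hours,
    )
```

If the iterator age is 18 hours on a stream with the default 24-hour retention, then hours_until_expiry returns 6.0, which suggests extending retention before you tune the consumer.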

Read throttles

Check the ReadProvisionedThroughputExceeded metric to see if there are read throttles on the stream.

Each shard supports up to five GetRecords calls per second. Consumers that together exceed this per-shard quota cause read throttles. For more information about read throttles on Kinesis streams, see How do I detect and troubleshoot ReadProvisionedThroughputExceeded exceptions in Kinesis Data Streams?
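As a quick arithmetic check, you can estimate the combined polling rate on one shard from the number of consumers and their idle time between reads. This is a simplified sketch: it treats the idle time as the full polling interval, so it gives an upper bound. The function names are hypothetical.

```python
GET_RECORDS_CALLS_PER_SEC_PER_SHARD = 5


def polls_per_second(consumer_count, idle_between_reads_ms):
    """Upper bound on GetRecords calls per second that hit a single shard."""
    if idle_between_reads_ms <= 0:
        raise ValueError("idle_between_reads_ms must be positive")
    return consumer_count * 1000 / idle_between_reads_ms


def exceeds_read_quota(consumer_count, idle_between_reads_ms):
    """True if the estimated polling rate is above the per-shard quota."""
    rate = polls_per_second(consumer_count, idle_between_reads_ms)
    return rate > GET_RECORDS_CALLS_PER_SEC_PER_SHARD
```

Three consumer applications that each poll the same shard every 500 ms make up to 6 calls per second, which is over the quota. Two applications that poll every 1,000 ms make 2 calls per second, which is within it.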

Lambda function error

In CloudWatch, review the Lambda functions for the stream where the IteratorAgeMilliseconds count continues to increase. To identify the errors that cause an increase in the IteratorAgeMilliseconds value, review the Errors summary in CloudWatch. The Lambda trigger configuration, blocking calls, or insufficient Lambda memory allocation can slow the Lambda function. Check whether the timestamp of the Lambda function error matches the time when the IteratorAgeMilliseconds metric of your Kinesis data stream started to increase. Matching timestamps confirm the cause of the increase. For more information, see Configuring Lambda function options.

Note: When a Lambda function returns an error, Lambda retries the same batch of records, because as a consumer of Kinesis the function doesn't skip records. As these records are retried, the processing delays also increase. Your consumer then falls behind the stream, and the IteratorAgeMilliseconds metric increases.
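The timestamp comparison can be sketched as follows. The code assumes that you already exported two series of (timestamp, value) pairs from CloudWatch, sorted oldest first: the Lambda Errors metric and the stream's iterator age. The function names and the five-minute tolerance are illustrative assumptions.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

Series = List[Tuple[datetime, float]]


def increase_start(age_points: Series) -> Optional[datetime]:
    """Timestamp of the first datapoint whose iterator age exceeds the previous one."""
    for (_, previous), (timestamp, current) in zip(age_points, age_points[1:]):
        if current > previous:
            return timestamp
    return None


def errors_near(error_points: Series, moment: datetime,
                tolerance: timedelta = timedelta(minutes=5)) -> List[datetime]:
    """Timestamps with at least one Lambda error within `tolerance` of `moment`."""
    return [
        timestamp
        for timestamp, count in error_points
        if count > 0 and abs(timestamp - moment) <= tolerance
    ]
```

If errors_near returns timestamps close to the value from increase_start, then the Lambda errors are a likely cause of the increase. An empty list points to one of the other causes.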

Intermittent connection timeout

Your consumer application can experience a connection timeout issue when the application pulls records from the Kinesis data stream. Intermittent connection timeout errors can cause a significant increase in the IteratorAgeMilliseconds count.

To verify whether the increase is related to a connection timeout, check the GetRecords.Latency and GetRecords.Success metrics. If both metrics are also affected during the same period, then a connection timeout is the likely cause, and the IteratorAgeMilliseconds count stops increasing after the connection is restored.

Uneven data distribution among shards

Some shards in your Kinesis data stream might receive more records than others. This happens when the partition key that's used in put operations doesn't distribute the data equally across the shards. Because each shard is read separately, uneven data distribution results in fewer effective parallel GetRecords calls to the Kinesis data stream shards. The consumers of the busiest shards fall behind, and the IteratorAgeMilliseconds count increases.

You can use random partition keys to distribute data evenly over the shards of the stream. Random partition keys can help the consumer application to read records faster.
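To see why random keys help, you can simulate how partition keys map to shards. Kinesis Data Streams uses an MD5 hash to map each partition key to a 128-bit integer, which falls into one shard's hash key range. The sketch below assumes that the shards split the hash key space evenly, which might not hold after resharding. The function names are hypothetical.

```python
import hashlib
import uuid
from collections import Counter

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def shard_for_key(partition_key, shard_count):
    """Index of the shard that a key lands on, assuming evenly split hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hashed = int.from_bytes(digest, "big")
    return hashed * shard_count // HASH_SPACE


def distribution(keys, shard_count):
    """Count how many records each shard receives for the given keys."""
    return Counter(shard_for_key(key, shard_count) for key in keys)


# A single hot key sends every record to the same shard.
hot = distribution(["customer-1"] * 1000, 4)

# Random keys, such as UUIDs, spread the records across all shards.
spread = distribution([str(uuid.uuid4()) for _ in range(4000)], 4)
```

In this simulation, hot contains one shard with all 1,000 records, and spread contains roughly 1,000 records for each of the four shards.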