Why is the Lambda IteratorAge metric increasing for my Amazon DynamoDB streams?

Last updated: 2022-09-12

When consuming records from my Amazon DynamoDB stream, I see a spike in the AWS Lambda IteratorAge metric. Why is my function's iterator age increasing, and how can I resolve this?

Short description

The Lambda IteratorAge metric measures the latency between when a record is added to a DynamoDB stream, and when the function processes that record. When IteratorAge increases, this means that Lambda isn't efficiently processing records that are written to the DynamoDB stream.

These are the main reasons that IteratorAge increases:

  • Invocation errors
  • Throttle occurrences
  • Low Lambda throughput

Resolution

Invocation errors

Lambda is designed to process batches of records in sequence and retry on errors. So, if a function returns an error every time it's invoked, then Lambda keeps retrying. It does this until the records expire or exceed the maximum age that you configure on the event source mapping. The DynamoDB stream retention period is 24 hours. So, if you don't configure a maximum age, Lambda keeps retrying for up to one day, until the record expires, and then it moves on to the next batch of records.

Check the Lambda errors metric to confirm if an invocation error is the root cause of the IteratorAge spike. If so, check the Lambda logs to debug the error, and then modify your code. Make sure to include a try-catch statement in your code when you handle the error.
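The error handling described above might look like the following minimal Python sketch. The process_record function and its validation rule are hypothetical placeholders for your own logic. The sketch also assumes that a ValueError marks a record that can never succeed and is safe to skip, while any other exception is left to propagate so that Lambda retries the batch.

```python
import logging

logger = logging.getLogger(__name__)


def process_record(record):
    """Hypothetical business logic; replace with your own processing."""
    stream_data = record["dynamodb"]
    if record["eventName"] != "REMOVE" and "NewImage" not in stream_data:
        # Assumption: a record in this state can never be processed successfully.
        raise ValueError("record has no NewImage")
    return stream_data["SequenceNumber"]


def lambda_handler(event, context):
    processed, skipped = [], []
    for record in event.get("Records", []):
        try:
            processed.append(process_record(record))
        except ValueError:
            # Log the bad record and skip it so that it doesn't block the shard.
            logger.exception("Skipping record %s", record.get("eventID"))
            skipped.append(record.get("eventID"))
        # Any other exception propagates, so Lambda retries the whole batch.
    return {"processed": processed, "skipped": skipped}
```

Whether to skip a failing record or let the invocation fail depends on your use case. Skipping keeps the stream moving but drops that record unless you store it elsewhere.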

There are three parameters in the event source mapping configuration that can help you prevent IteratorAge spikes:

  • Retry attempts – The maximum number of times that Lambda retries when the function returns an error.
  • Maximum age of record – The maximum age of a record that Lambda sends to your function. This helps you discard records that are too old.
  • Split batch on error – When the function returns an error, split the batch into two before retrying. Retrying with smaller batches isolates bad records and works around timeout issues. Note: Splitting a batch doesn't count towards the retry quota.

To retain discarded events, configure the event source mapping to send details about failed batches to an Amazon Simple Queue Service (Amazon SQS) queue. Or, configure the event source mapping to send details to an Amazon Simple Notification Service (Amazon SNS) topic. To do this, use the On-failure destination parameter.
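As one way to apply these settings, the following Python sketch builds the arguments for the boto3 update_event_source_mapping call. The helper names, the mapping UUID, the queue ARN, and the default values (3 retries, a one-hour maximum age) are illustrative assumptions, not recommendations. Choose values that fit your workload.

```python
def build_retry_settings(mapping_uuid, on_failure_arn,
                         max_retries=3, max_record_age_seconds=3600):
    """Build keyword arguments for update_event_source_mapping.

    The defaults are placeholder values chosen for illustration only.
    """
    return {
        "UUID": mapping_uuid,
        # Retry attempts
        "MaximumRetryAttempts": max_retries,
        # Maximum age of record
        "MaximumRecordAgeInSeconds": max_record_age_seconds,
        # Split batch on error
        "BisectBatchOnFunctionError": True,
        # On-failure destination (an SQS queue or SNS topic ARN)
        "DestinationConfig": {"OnFailure": {"Destination": on_failure_arn}},
    }


def apply_retry_settings(settings):
    """Send the settings to Lambda. Requires boto3 and AWS credentials."""
    import boto3  # imported here so the builder above runs without the AWS SDK

    return boto3.client("lambda").update_event_source_mapping(**settings)
```

For example, apply_retry_settings(build_retry_settings("<mapping-uuid>", "<queue-arn>")) updates an existing mapping, where both arguments are placeholders for your own values.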

Throttle occurrences

Because event records are read sequentially, Lambda functions can't progress to the next record if the current invocation is throttled.

When you use DynamoDB streams, don't configure more than two consumers on the same stream shard. If you have more than two readers per shard, this can cause throttling. If you need more than two readers on a single stream shard, then use a fan-out pattern. Configure the Lambda function to consume records from the stream, and then forward them to other downstream Lambda functions or Amazon Kinesis streams.

On the Lambda end, use a concurrency limit to prevent throttling.

Lambda throughput

Runtime duration

If the Duration metric of a Lambda function is high, then this decreases the function's throughput and increases the IteratorAge.

To decrease your function's runtime duration, use one or both of these methods:

1.    Increase the amount of memory allocated to the function.

2.    Optimize your function code so that less time is needed to process records.

Concurrent Lambda runs

The maximum number of concurrent Lambda runs is calculated as follows:

Concurrent Runs = Number of Shards x Concurrent batches per shard (Parallelization Factor)

  • Number of Shards - In a DynamoDB stream, there is a one-to-one mapping between the number of partitions of the table and the number of stream shards. The number of partitions is determined by the size of the table and its throughput. Each partition on the table can serve up to 3,000 read request units or 1,000 write request units, or a linear combination of both. So, to increase concurrency, increase the number of shards by increasing the table's provisioned capacity.
  • Concurrent batches per shard (Parallelization Factor) - You can configure the number of concurrent batches per shard in the event source mapping. The default is 1, and you can increase it up to 10.

For example, if the table has 10 partitions, and the Concurrent batches per shard is set to 5, then you can have up to 50 concurrent runs.
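The same calculation, as a short Python sketch. The function name is illustrative, and the 1 to 10 range check reflects the parallelization factor limits stated above.

```python
def max_concurrent_runs(shards, parallelization_factor=1):
    """Return the upper bound on concurrent Lambda runs for a stream."""
    if shards < 0:
        raise ValueError("shards must not be negative")
    if not 1 <= parallelization_factor <= 10:
        raise ValueError("parallelization_factor must be between 1 and 10")
    return shards * parallelization_factor


# 10 partitions (shards) with 5 concurrent batches per shard
print(max_concurrent_runs(10, 5))  # prints 50
```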

Note: To process item-level modifications in the right order at any given time, items with the same partition key go to the same batch. So, make sure that your table partition key has high cardinality, and that your traffic doesn't generate hot keys. For example, if you set the Concurrent batches per shard value to 10, and your write traffic targets a single partition key, then you can have only one concurrent run per shard.

Batch size

Tuning the batch size value helps you increase Lambda throughput. If you process a low number of records per batch, then this slows down the processing of the stream.

On the other hand, if you have a high number of records per batch, this might increase the duration of the function run. So, test with multiple values to find the best value for your use case.

If your function's runtime duration is independent from the number of records in an event, then increasing your function's batch size decreases the function's iterator age.
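To see why, consider a deliberately simplified throughput model, sketched in Python below. It assumes that every invocation receives a full batch and takes the same fixed duration, and it ignores polling and invocation overhead. The function name and the numbers are illustrative only.

```python
def records_per_second(batch_size, duration_seconds, concurrent_runs=1):
    """Estimate throughput under a simplified, fixed-duration model."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return batch_size * concurrent_runs / duration_seconds


# With a constant 0.5-second duration, a larger batch raises throughput.
print(records_per_second(100, 0.5))  # prints 200.0
print(records_per_second(500, 0.5))  # prints 1000.0
```

If the duration grows with the batch size instead, then the gain is smaller, which is why testing multiple values is worthwhile.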