Which metrics should I use to monitor and troubleshoot Kinesis Data Streams issues?

Last updated: 2020-06-30

I want to monitor incoming and outgoing data for Amazon Kinesis Data Streams. Which metrics should I use?

Resolution

Using stream-level metrics

You can use Amazon CloudWatch metrics to continuously monitor the performance of your Amazon Kinesis data stream and its throughput. The following metrics can help you monitor producer and consumer issues:

  • GetRecords.IteratorAgeMilliseconds: Measures the age in milliseconds of the last record returned by all GetRecords requests made against the stream. A value of zero indicates that the records being read are completely caught up with the stream. A lower value is preferred. A rising value means that consumers are falling behind. To reduce the delay in processing records, optimize your application code or increase the number of consumers for your stream so that the data is processed more quickly.
  • ReadProvisionedThroughputExceeded: Measures the number of GetRecords calls that were throttled during a given time period because they exceeded the service or shard limits for Kinesis Data Streams. A value of zero indicates that the data consumers aren't exceeding service quotas. Any other value indicates that the throughput limit is exceeded and that additional shards are required. Use this metric to confirm that consumers stay within five reads/second/shard and 2 MB/second/shard. You can enable enhanced monitoring to validate that there are no hot shards in the stream.
  • WriteProvisionedThroughputExceeded: Measures throttling of PUT requests from data producers during a given time period. Like ReadProvisionedThroughputExceeded, it helps determine whether the stream is being throttled, in this case because writes into a shard exceed the service quotas for Data Streams. Be sure that the PUT requests don't exceed 1 MB/second/shard or 1,000 records/second/shard. Be sure that the partition key is evenly distributed, and enable enhanced monitoring to check for hot shards in the stream. Depending on shard saturation, consider increasing the shard count of the stream to allow for more throughput.
  • PutRecord.Success and PutRecords.Success: Measure the number of successful PutRecord and PutRecords operations performed by data producers writing to the stream over a given time period. Use these metrics to confirm that your retry logic for failed records is effective.
  • GetRecords.Success: Measures the number of successful GetRecords requests against the stream over a given time period. Use this metric to confirm that your retry logic for failed records is effective.
  • GetRecords.Latency: Measures the time taken for each GetRecords operation on the stream over a specified time period. Use this metric to confirm that you have sufficient physical resources and efficient record processing logic to handle increased stream throughput. To reduce network and other downstream latencies in your application, process larger batches of data. For the Kinesis Client Library (KCL), investigate the ProcessTask.Time metric to monitor the processing time of an application that is falling behind. Also use GetRecords.Latency to confirm that the IDLE_TIME_BETWEEN_READS_IN_MILLIS setting allows the application to keep up with stream processing.
  • PutRecords.Latency: Measures the time taken for each PutRecords operation on the stream over a specified time period. If the PutRecords.Latency value is high, aggregate records into larger batches before writing them to the Kinesis data stream. You can also use multiple threads to write data. Throttling and retry logic on the PutRecords API can also increase the time taken for each PutRecords operation on the stream.
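
The per-shard limits quoted in the list above can be turned into a rough sizing estimate. The following Python sketch is illustrative only: the function names are hypothetical, and it assumes that partition keys are evenly distributed so that no single shard is hotter than the others.

```python
import math

# Per-shard limits quoted in the metric descriptions above.
WRITE_MB_PER_SEC_PER_SHARD = 1.0
WRITE_RECORDS_PER_SEC_PER_SHARD = 1000
READ_MB_PER_SEC_PER_SHARD = 2.0
READ_CALLS_PER_SEC_PER_SHARD = 5


def min_shards_for_writes(mb_per_sec, records_per_sec):
    """Smallest shard count that satisfies both write limits (hypothetical helper)."""
    by_bytes = math.ceil(mb_per_sec / WRITE_MB_PER_SEC_PER_SHARD)
    by_records = math.ceil(records_per_sec / WRITE_RECORDS_PER_SEC_PER_SHARD)
    return max(1, by_bytes, by_records)


def min_shards_for_reads(mb_per_sec, get_records_calls_per_sec):
    """Smallest shard count that satisfies both read limits (hypothetical helper)."""
    by_bytes = math.ceil(mb_per_sec / READ_MB_PER_SEC_PER_SHARD)
    by_calls = math.ceil(get_records_calls_per_sec / READ_CALLS_PER_SEC_PER_SHARD)
    return max(1, by_bytes, by_calls)


if __name__ == "__main__":
    # Example workload: 3.5 MB/s and 2,500 records/s written,
    # 3 MB/s read with 12 GetRecords calls/s across all consumers.
    print(min_shards_for_writes(3.5, 2500))
    print(min_shards_for_reads(3.0, 12))
```

Treat the result as a lower bound. A skewed partition key can still throttle individual shards even when the total shard count looks sufficient.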

Use the Average statistic for the listed metrics to monitor the performance and throughput of the stream.
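
As a minimal sketch of retrieving the Average statistic, the following Python code builds a CloudWatch GetMetricStatistics request for a stream-level metric and, optionally, sends it with boto3. The helper names, the one-hour window, and the 60-second period are assumptions chosen for illustration. The network call requires boto3 and valid AWS credentials.

```python
from datetime import datetime, timedelta, timezone


def build_metric_query(stream_name, metric_name, statistic="Average",
                       minutes=60, period_seconds=60, now=None):
    """Build keyword arguments for CloudWatch GetMetricStatistics (hypothetical helper)."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/Kinesis",
        "MetricName": metric_name,
        # Stream-level metrics are keyed by the StreamName dimension.
        "Dimensions": [{"Name": "StreamName", "Value": stream_name}],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": period_seconds,
        "Statistics": [statistic],
    }


def fetch_datapoints(query):
    """Send the query to CloudWatch and return data points sorted by time."""
    import boto3  # imported lazily so the builder works without boto3 installed

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_statistics(**query)
    return sorted(response["Datapoints"], key=lambda point: point["Timestamp"])


if __name__ == "__main__":
    # "my-stream" is a placeholder stream name.
    query = build_metric_query("my-stream", "GetRecords.Latency")
    for point in fetch_datapoints(query):
        print(point["Timestamp"], point["Average"])
```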

Note: For GetRecords.IteratorAgeMilliseconds, use the Maximum statistic instead, to reduce the risk of data loss for consumers that are lagging behind. Configure a CloudWatch alarm that evaluates the metric's data points against a threshold and notifies you when the threshold is breached. For more information about CloudWatch alarms, see Using Amazon CloudWatch alarms.
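
One way to set up such an alarm is sketched below in Python. The helper names are hypothetical, and the threshold is an assumption: it is derived as a fraction of the stream's retention period (24 hours and 50 percent here), which you should adjust to your own stream and tolerance. Creating the alarm requires boto3 and valid AWS credentials.

```python
def build_iterator_age_alarm(stream_name, retention_hours=24, fraction=0.5,
                             evaluation_periods=5):
    """Build keyword arguments for CloudWatch PutMetricAlarm (hypothetical helper).

    The alarm fires when the maximum iterator age exceeds the given fraction
    of the retention period. Both defaults are illustrative assumptions.
    """
    threshold_ms = retention_hours * 3600 * 1000 * fraction
    return {
        "AlarmName": f"{stream_name}-iterator-age-high",
        "Namespace": "AWS/Kinesis",
        "MetricName": "GetRecords.IteratorAgeMilliseconds",
        "Dimensions": [{"Name": "StreamName", "Value": stream_name}],
        # Maximum, not Average, so the most delayed consumer is not masked.
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": evaluation_periods,
        "Threshold": float(threshold_ms),
        "ComparisonOperator": "GreaterThanThreshold",
    }


def create_alarm(params):
    """Create or update the alarm in CloudWatch."""
    import boto3  # imported lazily so the builder works without boto3 installed

    boto3.client("cloudwatch").put_metric_alarm(**params)


if __name__ == "__main__":
    # "my-stream" is a placeholder stream name.
    create_alarm(build_iterator_age_alarm("my-stream"))
```

The sketch sets no alarm actions. To receive notifications, add an AlarmActions entry, such as an Amazon SNS topic ARN, to the parameters.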

If you are using the enhanced fan-out feature, use the following metrics to monitor Kinesis Data Streams:

  • SubscribeToShard.RateExceeded: Measures how often the number of calls per second allowed for the operation is exceeded, or how often a subscription attempt fails because an active subscription already exists.
  • SubscribeToShard.Success: Verifies whether the SubscribeToShard operation succeeds.
  • SubscribeToShardEvent.Success: Verifies the successful publication of an event for active subscription.
  • SubscribeToShardEvent.Bytes: Measures the number of bytes received in the shards over the specified time period.
  • SubscribeToShardEvent.Records: Measures the number of records received in the shards over the specified time period.
  • SubscribeToShardEvent.MillisBehindLatest: Measures the difference between the current time and the time when the last record of the SubscribeToShard event was written to the stream.

Enabling enhanced shard-level metrics

Enable shard-level metrics in CloudWatch to monitor specific tasks and to troubleshoot data producers and consumers. For example, enabling shard-level metrics can help you identify issues like uneven workload distributions.

To enable enhanced monitoring, perform the following steps:

Note: You can also use the EnableEnhancedMonitoring API operation or the enable-enhanced-monitoring AWS CLI command.

1.    Open the Kinesis console.

2.    Choose a specific Region.

3.    From the navigation pane, choose Data Streams.

4.    Under Data Stream Name, select your Kinesis data stream.

5.    Choose Configuration.

6.    Choose Edit under Enhanced (shard-level) metrics.

7.    From the dropdown menu, select your metrics for enhanced monitoring.

8.    Choose Save Changes to apply your configuration settings.
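
As an alternative to the console steps, the EnableEnhancedMonitoring API can be called from code. The following Python sketch uses the boto3 Kinesis client. The helper names are hypothetical, and the choice of metrics in the example is an assumption. The call requires boto3 and valid AWS credentials.

```python
# Shard-level metric names accepted by EnableEnhancedMonitoring.
VALID_SHARD_LEVEL_METRICS = {
    "IncomingBytes",
    "IncomingRecords",
    "OutgoingBytes",
    "OutgoingRecords",
    "WriteProvisionedThroughputExceeded",
    "ReadProvisionedThroughputExceeded",
    "IteratorAgeMilliseconds",
    "ALL",
}


def build_enhanced_monitoring_request(stream_name, metrics):
    """Validate metric names and build the request (hypothetical helper)."""
    unknown = set(metrics) - VALID_SHARD_LEVEL_METRICS
    if unknown:
        raise ValueError(f"Unknown shard-level metrics: {sorted(unknown)}")
    return {"StreamName": stream_name, "ShardLevelMetrics": list(metrics)}


def enable_enhanced_monitoring(request):
    """Enable the requested shard-level metrics on the stream."""
    import boto3  # imported lazily so the builder works without boto3 installed

    return boto3.client("kinesis").enable_enhanced_monitoring(**request)


if __name__ == "__main__":
    # "my-stream" is a placeholder stream name. These two metrics help
    # spot hot shards on the write side.
    request = build_enhanced_monitoring_request(
        "my-stream", ["IncomingBytes", "WriteProvisionedThroughputExceeded"]
    )
    print(enable_enhanced_monitoring(request))
```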

Additional troubleshooting with API calls

You can also use the following Kinesis Data Streams API calls when you work with your streams. Each call is subject to its own request-rate limit:

  • CreateStream: Limit of five transactions per second per account.
  • DeleteStream: Limit of five transactions per second per account.
  • ListStreams: Limit of five transactions per second per account.
  • GetShardIterator: Limit of five transactions per second per account per open shard.
  • MergeShards: Limit of five transactions per second per account.
  • DescribeStream: Limit of ten transactions per second per account.
  • DescribeStreamSummary: Limit of twenty transactions per second per account.

When you use these API calls, you can monitor any throttling in the AWS CloudTrail logs. For more information about Kinesis Data Streams API calls and CloudTrail, see Logging Amazon Kinesis Data Streams API calls with AWS CloudTrail.
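
As a rough illustration of looking for throttling in CloudTrail data, the following Python sketch counts throttled Kinesis Data Streams calls in a list of already parsed CloudTrail records. The function name is hypothetical, and the set of error codes treated as throttling is an assumption. Check the errorCode values that appear in your own logs.

```python
from collections import Counter

# Assumption: error codes treated as throttling for this sketch.
THROTTLE_ERROR_CODES = {"LimitExceededException", "ThrottlingException"}


def count_throttled_calls(records):
    """Count throttled Kinesis API calls per event name (hypothetical helper).

    `records` is a list of dicts, such as the entries under the "Records"
    key of a parsed CloudTrail log file.
    """
    counts = Counter()
    for record in records:
        if record.get("eventSource") != "kinesis.amazonaws.com":
            continue  # skip events from other services
        if record.get("errorCode") in THROTTLE_ERROR_CODES:
            counts[record.get("eventName", "unknown")] += 1
    return counts


if __name__ == "__main__":
    # Hand-written sample records, not real log output.
    sample = [
        {"eventSource": "kinesis.amazonaws.com", "eventName": "DescribeStream",
         "errorCode": "LimitExceededException"},
        {"eventSource": "kinesis.amazonaws.com", "eventName": "DescribeStream"},
        {"eventSource": "s3.amazonaws.com", "eventName": "GetObject",
         "errorCode": "ThrottlingException"},
    ]
    print(dict(count_throttled_calls(sample)))
```

A high count for a single call, such as DescribeStream, suggests that the caller is polling more often than the per-account limit listed above allows.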