How do I troubleshoot a blocked or stuck KCL application for Kinesis Data Streams?

Last updated: 2020-05-06

My Amazon Kinesis Client Library (KCL) application is stuck and is unable to process any Amazon Kinesis Data Streams records. How do I troubleshoot this issue?

Short Description

The KCL application can get stuck or blocked for the following reasons:

  • The record processor (a user-implemented method) performs a blocking operation or takes longer than normal.
  • No data records are being put to the shard.
  • The KCL gets stuck while retrieving a record.
  • The KCL is unable to schedule processing or fails to checkpoint.

You can detect and troubleshoot the KCL issues by doing the following:

  • Analyze KCL metrics.
  • Analyze the Amazon DynamoDB table for the KCL application.
  • Check the KCL configurations.
  • Enable the KCL warning logs.
  • Enable the KCL debug logs.

Resolution

Analyze KCL metrics

Monitor the RecordProcessor.processRecords.Time metric and check whether the time taken by the record processor's processRecords method exceeds 60 seconds. While your processRecords method is blocked, the KCL must wait for it to return before it can deliver more records from that shard. If the method does complete but takes too long, optimize your processRecords logic.
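One way to keep a single slow call from blocking processRecords indefinitely is to put a time limit on it. The following is a minimal sketch in plain Java, not part of the KCL API: the class name BoundedCall, the method callWithTimeout, and the fallback value are all hypothetical names chosen for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper (not a KCL class): bounds how long a blocking call may run.
public class BoundedCall {

    // Runs the task on a helper thread and returns its result, or returns
    // the fallback if the task does not finish within timeoutMillis.
    public static <T> T callWithTimeout(Callable<T> task, long timeoutMillis, T fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the blocked call
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            throw new IllegalStateException("Task failed", e.getCause());
        } finally {
            executor.shutdownNow(); // release the helper thread
        }
    }
}
```

Inside a record processor, a call to a slow downstream service could be wrapped this way so that the method returns within a known time. How to treat a record whose call timed out (retry, skip, or send to a dead-letter store) depends on your application.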

Check KCL configurations

Check the number of workers in your KCL fleet and note the number of shards in the Kinesis data stream. If the number of shards has increased, then increase the maxLeasesPerWorker parameter in the KCL configuration so that the fleet can hold a lease for every shard.
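The relationship between shards, workers, and maxLeasesPerWorker can be checked with simple arithmetic. The following sketch assumes that leases are spread evenly across workers, which is a simplification of how the KCL actually balances leases. The class and method names are hypothetical.

```java
// Hypothetical helper (not a KCL class): rough capacity check for lease settings.
public class LeaseSizing {

    // Smallest per-worker lease limit that lets workerCount workers
    // hold one lease for each of shardCount shards (ceiling division).
    public static int minLeasesPerWorker(int shardCount, int workerCount) {
        if (shardCount < 0 || workerCount <= 0) {
            throw new IllegalArgumentException("shardCount must be >= 0 and workerCount > 0");
        }
        return (shardCount + workerCount - 1) / workerCount;
    }

    // True if the fleet as a whole can hold a lease for every shard.
    public static boolean canCoverAllShards(int shardCount, int workerCount, int maxLeasesPerWorker) {
        return (long) workerCount * maxLeasesPerWorker >= shardCount;
    }
}
```

For example, with 10 shards and 4 workers, a limit of 1 lease per worker leaves 6 shards without an owner, while a limit of 3 or more covers every shard.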

Analyze DynamoDB table for KCL application

Every KCL application creates a DynamoDB table with the same name as the KCL application to track the application's state. To troubleshoot the KCL application, analyze the columns in the DynamoDB table.

If the checkpoint column in the table isn't updated, then the processRecords method logic is stuck. If neither the checkpoint column nor the leaseCounter column is updated, then a setting of maxLeasesPerWorker=1 is preventing other workers from taking up the lease. To unblock processing, increase the parameter value.
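The two checks above can be expressed as a comparison between two readings of the same lease row taken some time apart. The following sketch covers only the decision logic and assumes that you have already read the checkpoint and leaseCounter values from the table. The class, enum, and method names are hypothetical.

```java
// Hypothetical helper (not a KCL class): interprets two readings of one lease row.
public class LeaseSnapshotCheck {

    public enum Diagnosis {
        PROGRESSING,            // checkpoint advanced between the readings
        PROCESS_RECORDS_STUCK,  // checkpoint unchanged, leaseCounter still changing
        LEASE_NOT_TAKEN         // neither column changed: review maxLeasesPerWorker
    }

    public static Diagnosis diagnose(String checkpointBefore, String checkpointAfter,
                                     long leaseCounterBefore, long leaseCounterAfter) {
        boolean checkpointMoved = !checkpointBefore.equals(checkpointAfter);
        boolean counterMoved = leaseCounterBefore != leaseCounterAfter;
        if (checkpointMoved) {
            return Diagnosis.PROGRESSING;
        }
        return counterMoved ? Diagnosis.PROCESS_RECORDS_STUCK : Diagnosis.LEASE_NOT_TAKEN;
    }
}
```

Take the two readings far enough apart that a healthy application would have checkpointed at least once in between. The right interval depends on how often your record processor checkpoints.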

Enable the KCL warning logs

To verify whether the record processor is blocked, set logWarningForTaskAfterMillis in the KCL configuration to a value in milliseconds. The KCL then waits that long for a record processor task to complete before it writes a warning message about the processing time to the log. If warning messages are logged, then capturing successive stack dumps from the JVM can help you discover what is blocked. You can use the jstack command to capture the stack traces.

For more information about the logWarningForTaskAfterMillis value, see Amazon Web Services - Labs in GitHub.
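If running jstack against the process isn't convenient, a similar view is available from inside the JVM through the standard Thread.getAllStackTraces method. The following sketch is an alternative to jstack rather than a KCL feature. The class and method names are hypothetical, and the output format is simpler than a real jstack dump.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not a KCL class): captures in-process thread dumps.
public class ThreadDumps {

    // Returns the name, state, and stack trace of every live thread as text.
    public static String capture() {
        StringBuilder dump = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            dump.append('"').append(thread.getName()).append("\" state=")
                .append(thread.getState()).append('\n');
            for (StackTraceElement frame : entry.getValue()) {
                dump.append("    at ").append(frame).append('\n');
            }
        }
        return dump.toString();
    }

    // Captures several dumps, pausing intervalMillis between them, so that
    // a thread stuck at the same frame in every dump stands out.
    public static List<String> captureSuccessive(int count, long intervalMillis)
            throws InterruptedException {
        List<String> dumps = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            dumps.add(capture());
            if (i < count - 1) {
                Thread.sleep(intervalMillis);
            }
        }
        return dumps;
    }
}
```

A thread that shows the same frames at the top of its stack in every dump, especially in the BLOCKED or WAITING state, is a likely candidate for the blocked record processor.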

Enable the KCL debug logs

You can enable the KCL debug logs to identify issues that caused the KCL to stop consuming data from Kinesis Data Streams. It's also a best practice to restart the KCL application to clear any other application issues.

If you restarted the KCL and it's still stuck, then the issue might be caused by the transfer of shard ownership. In this case, the KCL also might not have logs for the issue that you are trying to reproduce. You can resolve this by enabling logging on the KCL fleet.

To enable logs, perform the following steps:

1.    Choose a logger.

2.    Create a log4j.properties file in the src/main/resources folder to redirect log messages to the console:

log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.Target=System.out
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n
log4j.logger.httpclient.wire=DEBUG

Note: This example uses log4j to write debug logs in Java.

3.    Redirect the log messages to a log file:

log4j.appender.file=org.apache.log4j.RollingFileAppender
# Replace the value below with the log file location that you want to use (the file name is only an example)
log4j.appender.file.File=/Users/harshdev/Desktop/logfolder/kcl-application.log
log4j.appender.file.MaxFileSize=5MB
log4j.appender.file.MaxBackupIndex=10
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n
log4j.rootLogger=DEBUG, stdout, file

4.    Include the log4j dependency in your POM file:

<dependency>
        <groupId>log4j</groupId>
        <artifactId>log4j</artifactId>
        <version>1.2.17</version>
</dependency>
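Before you restart the KCL fleet, it can help to confirm that every appender named in log4j.rootLogger is actually defined in the properties file. A missing definition is a common reason for an empty log. The following sketch uses only java.util.Properties and doesn't depend on log4j itself. The class and method names are hypothetical, and the check covers only the rootLogger line.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper (not part of log4j): sanity check for a log4j.properties file.
public class Log4jConfigCheck {

    // Returns the appender names listed in log4j.rootLogger that have
    // no matching log4j.appender.<name> entry in the given properties text.
    public static List<String> missingAppenders(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> missing = new ArrayList<>();
        // The rootLogger value has the form "LEVEL, appender1, appender2".
        String[] parts = props.getProperty("log4j.rootLogger", "").split(",");
        for (int i = 1; i < parts.length; i++) {
            String name = parts[i].trim();
            if (!name.isEmpty() && !props.containsKey("log4j.appender." + name)) {
                missing.add(name);
            }
        }
        return missing;
    }
}
```

To use it, pass the contents of your log4j.properties file as a string. An empty list means that each appender referenced by the root logger has a definition.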