How do I troubleshoot Amazon Kinesis Agent issues on a Linux machine?

Last updated: 2022-05-05

I'm trying to use Amazon Kinesis Agent on a Linux machine. However, I'm encountering an issue. How do I resolve this?

Short description

This article covers the following issues:

  • Kinesis Agent is sending duplicate events.
  • Kinesis Agent is causing write throttles and failed records on my Amazon Kinesis stream.
  • Kinesis Agent is unable to read or stream log files.
  • My Amazon Elastic Compute Cloud (Amazon EC2) server keeps failing because of insufficient Java heap size.
  • My Amazon EC2 CPU utilization is very high.

Resolution

Kinesis Agent is sending duplicate events

If you receive duplicates whenever you send logs from Kinesis Agent, there's likely a file rotation in place whose match pattern isn't correctly qualified. Whenever you send a log, Kinesis Agent checks the latestUpdateTimestamp of each file that matches the file pattern. By default, Kinesis Agent chooses the most recently updated file as the active file that matches the rotation pattern. If more than one file is updated at the same time, Kinesis Agent can't determine which file to track. Therefore, Kinesis Agent begins to tail the updated files from the beginning, producing duplicates.

To avoid this issue, create a separate file flow for each individual file, and make sure that your file pattern tracks the rotations.

Note: If you're tracking a rotation, it's a best practice to use either the create or rename log rotate settings, instead of copytruncate.
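For instance, a logrotate stanza that uses the create setting might look like this (the path, schedule, and ownership are illustrative):

```
/var/log/app1/app1.log {
    daily
    rotate 7
    create 0644 root root
}
```

With create, the rotated file is renamed and a fresh file is created, so the agent's checkpoint stays valid; copytruncate rewrites the tracked file in place, which can confuse the agent's file tracking.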

For example, you can use a file flow that's similar to this one:

"flows": [
        {
            "filePattern": "/tmp/app1.log*",
            "kinesisStream": "yourkinesisstream1"
        },
        {
            "filePattern": "/tmp/app2.log*",
            "kinesisStream": "yourkinesisstream2"
        }
    ]

Kinesis Agent also retries any records that it fails to send when there are intermittent network issues. If Kinesis Agent doesn't receive server-side acknowledgement, it retries, creating duplicates. In this case, the downstream application must de-duplicate the records.
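As a minimal sketch of downstream de-duplication, a consumer can keep only the first occurrence of each record. This example de-duplicates on full record content using the awk seen-array idiom (the sample file and records are made up; a real consumer would typically key on a unique record ID instead):

```shell
# Sample batch written by the agent; "event-a" arrives twice
printf 'event-a\nevent-b\nevent-a\n' > /tmp/records.txt

# Keep only the first occurrence of each record line
awk '!seen[$0]++' /tmp/records.txt > /tmp/records.deduped
cat /tmp/records.deduped
```

The `!seen[$0]++` pattern prints a line only the first time its content is seen, which is enough to drop exact-duplicate records within a batch.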

Duplicates can also occur when the checkpoint file is tampered with or removed. If the checkpoint file is stored in /var/run/aws-kinesis-agent, then the file might get cleaned up during a reinstallation or instance reboot. When you run Kinesis Agent again, the agent can't restore its checkpoints and tails the files from the beginning, causing duplicates. Therefore, keep the checkpoint file in the main agent directory and update the Kinesis Agent configuration with the new location.

For example:

"checkpointFile": "/aws-kinesis-agent-checkpoints/checkpoints"

Kinesis Agent is causing write throttles and failed records on my Amazon Kinesis data stream

By default, Kinesis Agent tries to send log files as quickly as possible, which can exceed the stream's throughput limits. However, failed records are re-queued and continuously retried to prevent data loss. When the queue is full, Kinesis Agent stops tailing the file, which can cause the application to lag.

For example, if the queue is full, your log looks similar to this:

com.amazon.kinesis.streaming.agent.Agent [WARN] Agent: Tailing is 745.005859 MB (781195567 bytes) behind.

Note: The queue size is determined by the publishQueueCapacity parameter (with the default value set to "100").

To investigate any failed records or performance issues on your Kinesis data stream, try the following:

  • Monitor the RecordSendErrors metric in Amazon CloudWatch.
  • Review your Kinesis Agent logs to check if any lags occurred. The ProvisionedThroughputExceededException entry is visible only under the DEBUG log level. During this time, Kinesis Agent's record sending speed can be slower if most of the CPU is used to parse and transform data.
  • If you see that Kinesis Agent is falling behind, then consider scaling up your Kinesis data stream.
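To size the scale-up, each shard accepts writes of up to 1 MB/second and 1,000 records/second. A back-of-the-envelope calculation of the minimum shard count (the traffic numbers are made up for illustration):

```shell
# Minimum shard count for a given write rate.
# Per-shard write limits: 1 MB/s and 1,000 records/s.
shards_needed() {
  awk -v mb="$1" -v rec="$2" 'BEGIN {
    by_mb  = int(mb) + (mb > int(mb) ? 1 : 0)            # ceil(mb / 1 MB)
    by_rec = int(rec / 1000) + (rec % 1000 > 0 ? 1 : 0)  # ceil(rec / 1000)
    print (by_mb > by_rec ? by_mb : by_rec)              # larger limit wins
  }'
}

# Example: the agent pushes ~4.5 MB/s across ~12,000 records/s
shards_needed 4.5 12000   # prints 12 (the record rate dominates here)
```

Whichever limit you hit first (bytes or record count) determines the shard count you need.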

Kinesis Agent is unable to read or stream log files

Make sure that the Amazon EC2 instance that Kinesis Agent runs on has the permissions required to access your destination Kinesis stream. If Kinesis Agent can't read a log file, then check whether aws-kinesis-agent-user has read permission on every file that matches the file pattern, and read and execute permissions on the directory that contains those files. Otherwise, you get an Access Denied error or a Java Runtime Exception.
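As a quick sketch of the required permission bits, using a throwaway directory (on a real host, point the commands at the directory that your filePattern matches):

```shell
# Stand-in for your real log directory (illustrative path)
LOGDIR=/tmp/app1-logs
mkdir -p "$LOGDIR"
touch "$LOGDIR/app1.log"

# Log files: world-readable so aws-kinesis-agent-user can read them
chmod 644 "$LOGDIR"/app1.log*
# Directory: read + execute so the agent can list and traverse it
chmod 755 "$LOGDIR"

stat -c '%a %n' "$LOGDIR" "$LOGDIR/app1.log"
```

World-readable bits are the simplest option; if your files must stay private, an ACL that grants aws-kinesis-agent-user read access works as well.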

My Amazon EC2 server keeps failing because of insufficient Java heap size

If your Amazon EC2 server keeps failing because of insufficient Java heap size, then increase the heap size allotted to Kinesis Agent. To configure the amount of memory available to Kinesis Agent, update the start-aws-kinesis-agent file and increase the values of the following parameters:

  • JAVA_START_HEAP
  • JAVA_MAX_HEAP

Note: On Linux, the file path for start-aws-kinesis-agent is /usr/bin/start-aws-kinesis-agent.
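One way to script that edit is with sed. The sketch below runs against a stand-in copy in /tmp so that it's safe to try; on a real host, back up and edit /usr/bin/start-aws-kinesis-agent, and treat the 256m/1024m values as illustrative, not recommendations:

```shell
# Stand-in copy of start-aws-kinesis-agent with example default settings
target=/tmp/start-aws-kinesis-agent
printf 'JAVA_START_HEAP=32m\nJAVA_MAX_HEAP=512m\n' > "$target"

# Raise the initial and maximum JVM heap sizes (values are examples)
sed -i \
  -e 's/^JAVA_START_HEAP=.*/JAVA_START_HEAP=256m/' \
  -e 's/^JAVA_MAX_HEAP=.*/JAVA_MAX_HEAP=1024m/' \
  "$target"

cat "$target"
```

Restart the agent after the change so that the new heap settings take effect.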

My Amazon EC2 CPU utilization is very high

CPU utilization can spike if Kinesis Agent performs inefficient regular expression (regex) pattern matching and log transformation. If you already configured Kinesis Agent, try removing all regex pattern matches and transformations. Then, check whether you're still experiencing CPU issues.

If you still experience CPU issues, then consider tuning the number of threads and the records that are buffered in memory by lowering some of the default parameters in the /etc/aws-kinesis/agent.json configuration file.

Here are the general configuration parameters that you can try lowering:

  • sendingThreadsMaxQueueSize: The workQueue size of the threadPool that sends data to the destination. The default value is 100.
  • maxSendingThreads: The number of threads for sending data to the destination. The minimum value is 2. The default value is 12 times the number of cores on your machine.
  • maxSendingThreadsPerCore: The number of threads per core for sending data to the destination. The default value is 12.

Here are the flow configuration parameters that you can try tuning:

  • publishQueueCapacity: The maximum number of buffers of records that can be queued before they're sent to the destination. The default value is 100.
  • minTimeBetweenFilePollsMillis: The interval between polls of the tracked file for new data. The default value is 100. A higher value polls less often and reduces CPU usage.
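Putting these parameters together, an /etc/aws-kinesis/agent.json that dials back the defaults might look like this (the stream name, file pattern, and values are illustrative starting points, not recommendations; note that minTimeBetweenFilePollsMillis is raised so that the agent polls less often):

```json
{
  "sendingThreadsMaxQueueSize": 50,
  "maxSendingThreads": 4,
  "flows": [
    {
      "filePattern": "/tmp/app1.log*",
      "kinesisStream": "yourkinesisstream1",
      "publishQueueCapacity": 50,
      "minTimeBetweenFilePollsMillis": 500
    }
  ]
}
```

Restart the agent after editing the file, then watch CPU utilization and the "Tailing is ... behind" log messages to confirm that throughput is still acceptable.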
