Why is my Kinesis data stream returning a 500 Internal Server Error?

Last updated: 2020-06-04

My Amazon Kinesis data stream is returning a 500 Internal Server Error or a 503 Service Unavailable Error. How do I detect and troubleshoot these errors within Amazon Kinesis Data Streams?

Short Description

If you are producing to or consuming from a Kinesis data stream, one of the following internal errors can occur:

  • PutRecord or PutRecords returns an AmazonKinesisException 500 or AmazonKinesisException 503 error with a rate above 1% for several minutes.
  • SubscribeToShard or GetRecords returns an AmazonKinesisException 500 or AmazonKinesisException 503 error with a rate above 1% for several minutes.

You can troubleshoot these internal errors by doing the following:

  • Calculate your error rate.
  • Implement a retry mechanism.

Resolution

Calculate your error rate

Under the Monitoring tab, look for significant drops in either the PutRecord.Success or GetRecords.Success metric during the time window that you're investigating. If you notice significant drops, calculate the error rate to determine the severity of your Kinesis data stream issue.

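To calculate your error rate, compute the average value of PutRecord.Success or GetRecords.Success over the time window. The average is the fraction of requests that succeeded, so the error rate is 1 minus that average.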
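As a rough illustration of that calculation, the following Python sketch derives an error rate from per-period datapoints. It assumes each datapoint is a dictionary with "Average" and "SampleCount" keys, which is the shape of the datapoints that CloudWatch returns for a metric. The function name error_rate and the sample values are hypothetical. Each period is weighted by its request count so that quiet periods don't skew the result.

```python
def error_rate(datapoints):
    """Return the error rate (0.0 to 1.0) from CloudWatch-style datapoints.

    Each datapoint is a dict with "Average" (the success fraction for the
    period) and "SampleCount" (the number of requests in the period).
    """
    total = sum(dp["SampleCount"] for dp in datapoints)
    if total == 0:
        raise ValueError("no requests in the selected time window")
    # Weight each period's success fraction by its request count.
    succeeded = sum(dp["Average"] * dp["SampleCount"] for dp in datapoints)
    return 1.0 - succeeded / total


# Hypothetical one-minute datapoints for PutRecord.Success.
sample = [
    {"Average": 1.0, "SampleCount": 1000},
    {"Average": 0.98, "SampleCount": 500},
    {"Average": 0.995, "SampleCount": 1000},
]
print(f"error rate: {error_rate(sample):.2%}")  # error rate: 0.60%
```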

Implement a retry mechanism

After you've calculated your error rate, confirm whether the error rate falls below 0.1%. Kinesis Data Streams supports high-throughput writes with a low error rate. Average error rates are typically below 0.01%.

If you wrote your own consumer or producer, implement a retry mechanism in your application code. For more information about retry mechanism implementations, see the Retries section in Implementing efficient and reliable producers with the Amazon Kinesis Producer Library.
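The following Python sketch shows one possible shape for such a retry mechanism: exponential backoff with random jitter and a capped number of attempts. It is an illustration under stated assumptions, not the implementation described in the linked post. The names call_with_retries, is_retryable, and InternalFailure are hypothetical, and the delay values are placeholders to tune for your workload.

```python
import random
import time


def call_with_retries(operation, is_retryable, max_attempts=5,
                      base_delay=0.1, max_delay=5.0, sleep=time.sleep):
    """Call operation(), retrying retryable failures with backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            # Give up on non-retryable errors or when attempts are exhausted.
            if attempt == max_attempts or not is_retryable(exc):
                raise
            # Wait a random time up to the capped exponential delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))


class InternalFailure(Exception):
    """Hypothetical stand-in for a 500 or 503 response from the service."""
```

In a real producer or consumer, operation would wrap the PutRecord, PutRecords, or GetRecords call, and is_retryable would return True only for 500 and 503 responses. Note that a PutRecords call can succeed overall while individual records fail, so the records that failed must be resent separately.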

If your error rate exceeds 1% for several minutes, contact AWS Support. Provide the following information:

  • Applications used to read data from or write data to Kinesis Data Streams
  • Number of shards in your Kinesis data stream
  • Server-side encryption settings
  • Specific shard IDs that are impacted
  • Time frame in which drops in success rates were observed
  • Request IDs that are reporting internal failures
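Two of the items above, the time frame and the request IDs, can be gathered programmatically. The following Python sketch is one way to do that under stated assumptions. It treats "several minutes" as three or more consecutive one-minute periods, which is an assumption and not a documented threshold. It also assumes that errors arrive as a dictionary with a "ResponseMetadata" entry containing "RequestId", which is the shape that the AWS SDK for Python attaches to client errors. Both function names are hypothetical.

```python
def sustained_error_windows(points, threshold=0.01, min_minutes=3):
    """Return (start, end) pairs where the error rate stayed above threshold.

    points is a time-ordered list of (minute_label, success_average) pairs,
    one per minute. min_minutes is an assumed reading of "several minutes".
    """
    windows, run = [], []
    for label, success in points:
        if 1.0 - success > threshold:
            run.append(label)
            continue
        # The run of bad minutes ended; keep it only if it was long enough.
        if len(run) >= min_minutes:
            windows.append((run[0], run[-1]))
        run = []
    if len(run) >= min_minutes:
        windows.append((run[0], run[-1]))
    return windows


def request_id_from_error(error_response):
    """Return the request ID from an SDK-style error response, or None."""
    return error_response.get("ResponseMetadata", {}).get("RequestId")


# Hypothetical per-minute averages of a Success metric.
points = [("12:00", 1.0), ("12:01", 0.97), ("12:02", 0.96),
          ("12:03", 0.98), ("12:04", 1.0)]
print(sustained_error_windows(points))  # [('12:01', '12:03')]
```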
