How do I troubleshoot timeout errors when writing from Flink to Kinesis Data Streams?

Last updated: 2020-05-06

I'm trying to write data from Flink to Amazon Kinesis Data Streams, but I receive a timeout or exception error. Why is this happening and how do I troubleshoot these errors?

Short Description

Flink applications that use FlinkKinesisProducer can produce one of the following error messages:

Caused by: org.apache.flink.kinesis.shaded.org.apache.http.conn.ConnectTimeoutException: Connect to kinesis.us-east-1.amazonaws.com:443 [kinesis.us-east-1.amazonaws.com/xxx.xxx.xxxx.xxx] failed: connect timed out
[AWS Log: ERROR](CurlHttpClient)Curl returned error code 28

These timeout errors are typically caused by network problems or by a lack of system resources in the environment where the Flink application is running.

Resolution

Unable to connect to Kinesis Data Streams service endpoint

The following error occurs when the Flink application is unable to connect to the Data Streams service endpoint:

Caused by: org.apache.flink.kinesis.shaded.org.apache.http.conn.ConnectTimeoutException: Connect to kinesis.us-east-1.amazonaws.com:443 [kinesis.us-east-1.amazonaws.com/xxx.xxxx.xxx] failed: connect timed out

If this error repeatedly occurs, then there could be a problem with your network configuration.

To resolve this issue, perform the following steps:

1.    Verify that the Flink application can connect to the internet.

2.    If your Flink application is running on AWS resources in a virtual private cloud (VPC), verify that the following VPC features are configured correctly:
       Route Table
       Security Groups
       Network Access Control Lists (ACL)

3.    (Optional) You can also use a Kinesis Data Streams VPC endpoint to communicate with Data Streams from within your VPC.
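
As a quick way to check step 1 from the host where the Flink application runs, the following is a minimal sketch that attempts a plain TCP connection to the endpoint named in the error message. The class name EndpointCheck and the 5-second timeout are assumptions chosen for illustration and are not part of Flink or the AWS SDK. A successful connection only shows that routing, security groups, and network ACLs allow outbound traffic on port 443. It doesn't validate credentials or permissions.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper for illustration; not part of Flink or the AWS SDK.
class EndpointCheck {

    /** Returns true if a TCP connection to host:port succeeds within timeoutMs. */
    static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Covers connect timeouts, refused connections, and DNS failures.
            System.err.println("Cannot reach " + host + ":" + port + " -> " + e);
            return false;
        }
    }

    public static void main(String[] args) {
        // Endpoint taken from the error message; pass another host as the first argument.
        String host = args.length > 0 ? args[0] : "kinesis.us-east-1.amazonaws.com";
        System.out.println(host + ":443 reachable = " + canConnect(host, 443, 5000));
    }
}
```

If the check fails only from inside the VPC, review the route table, security groups, and network ACLs listed in step 2.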

Response for the submitted request wasn't returned within the configured timeout period

The following Curl 28 error indicates that the response for the submitted request was not returned within the configured timeout period. Therefore, a timeout occurred:

[AWS Log: ERROR](CurlHttpClient)Curl returned error code 28

The timeout can occur because of a temporary network issue. It can also be caused by too many pending requests to Data Streams, which build up when records are sent to the Kinesis Producer Library (KPL) daemon faster than they can be delivered. Records pass through the KPL because FlinkKinesisProducer uses the KPL to send data from a Flink stream into an Amazon Kinesis stream.

To resolve this issue, change the following configuration parameters of the FlinkKinesisProducer object:

  • Request timeout period: producerConfig.put("RequestTimeout", "****");
  • Internal queue size: FlinkKinesisProducer#setQueueLimit(queueLimit)

It's also a best practice to update the following parameters to avoid data loss:

  • Internal queue size: FlinkKinesisProducer#setQueueLimit(queueLimit)
  • Time-to-live on records: producerConfig.put("RecordTtl", "*****");

For more information about calculating the value of setQueueLimit, see Backpressure on the Apache website.
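
The settings above can be sketched as follows. This is a minimal example under stated assumptions, not a recommended configuration: the class name KinesisProducerSettings, the 60,000 ms values, and the stream name in the comments are placeholders chosen for illustration. The queue-limit helper follows the rule of thumb given in the Flink Kinesis connector documentation (number of shards × queue size per shard ÷ record size), and the 100 kB per shard figure is that documentation's suggested starting point rather than a fixed requirement. The FlinkKinesisProducer wiring is shown in comments so that the block runs without Flink on the classpath.

```java
import java.util.Properties;

// Hypothetical helper for illustration; values are placeholders, not recommendations.
class KinesisProducerSettings {

    /** Builds the Properties object that is passed to FlinkKinesisProducer. */
    static Properties buildProducerConfig(String region, long requestTimeoutMs, long recordTtlMs) {
        Properties producerConfig = new Properties();
        producerConfig.put("aws.region", region);
        // KPL settings are passed through by name; both values are in milliseconds.
        producerConfig.put("RequestTimeout", Long.toString(requestTimeoutMs));
        producerConfig.put("RecordTtl", Long.toString(recordTtlMs));
        return producerConfig;
    }

    /** Rule of thumb: (shards * queue bytes per shard) / record bytes, at least 1. */
    static int queueLimit(int shards, int queueBytesPerShard, int recordBytes) {
        return (int) Math.max(1L, (long) shards * queueBytesPerShard / recordBytes);
    }

    public static void main(String[] args) {
        Properties producerConfig = buildProducerConfig("us-east-1", 60_000L, 60_000L);
        int limit = queueLimit(8, 100_000, 200); // 8 shards, 100 kB per shard, 200-byte records

        System.out.println("producerConfig = " + producerConfig);
        System.out.println("queueLimit = " + limit);

        // In the Flink job, assuming the Kinesis connector is on the classpath:
        //   FlinkKinesisProducer<String> producer =
        //       new FlinkKinesisProducer<>(new SimpleStringSchema(), producerConfig);
        //   producer.setDefaultStream("your-stream-name"); // placeholder name
        //   producer.setDefaultPartition("0");
        //   producer.setQueueLimit(limit);
    }
}
```

Setting a queue limit makes the producer apply backpressure when the queue is full. Keep the limit low enough that queued records can be sent before RecordTtl expires.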

