How do I troubleshoot issues when connecting to my Amazon MSK cluster?

Last updated: 2022-03-01

I'm experiencing issues when I try to connect to my Amazon Managed Streaming for Apache Kafka (Amazon MSK) cluster.

Resolution

When you try to connect to an Amazon MSK cluster, you might get the following types of errors:

  • Errors that are not specific to the authentication type of the cluster
  • Errors that are specific to TLS client authentication

Errors that are not specific to the authentication type of the cluster

When you try to connect to your Amazon MSK cluster, you might get one of the following errors regardless of the authentication type that's enabled for your cluster:

Timed out waiting for connection while in state: CONNECTING

You might get this error when the client tries to connect to the Amazon MSK cluster through the Apache ZooKeeper string but can't establish a connection. This error might also occur when the Apache ZooKeeper string is incorrect.

You get the following error when you use an incorrect Apache ZooKeeper string to connect to the cluster:

./kafka-topics.sh --zookeeper z-1.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:2181,z-2.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:2181,z-3.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:2181 --list
[2020-04-10 23:58:47,963] WARN Client session timed out, have not heard from server in 10756ms for sessionid 0x0 (org.apache.zookeeper.ClientCnxn)
[2020-04-10 23:58:58,581] WARN Client session timed out, have not heard from server in 10508ms for sessionid 0x0 (org.apache.zookeeper.ClientCnxn)
[2020-04-10 23:59:08,689] WARN Client session timed out, have not heard from server in 10004ms for sessionid 0x0 (org.apache.zookeeper.ClientCnxn)
Exception in thread "main" kafka.zookeeper.ZooKeeperClientTimeoutException: Timed out waiting for connection while in state: CONNECTING
at kafka.zookeeper.ZooKeeperClient.$anonfun$waitUntilConnected$3(ZooKeeperClient.scala:259)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:253)
at kafka.zookeeper.ZooKeeperClient.waitUntilConnected(ZooKeeperClient.scala:255)
at kafka.zookeeper.ZooKeeperClient.<init>(ZooKeeperClient.scala:113)
at kafka.zk.KafkaZkClient$.apply(KafkaZkClient.scala:1858)
at kafka.admin.TopicCommand$ZookeeperTopicService$.apply(TopicCommand.scala:321)
at kafka.admin.TopicCommand$.main(TopicCommand.scala:54)
at kafka.admin.TopicCommand.main(TopicCommand.scala)

To resolve this error, do the following:

  • Verify that the Apache ZooKeeper string used is correct.
  • Be sure that the security group for your Amazon MSK cluster allows inbound traffic from the client's security group on the Apache ZooKeeper ports.
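Both checks can be partly scripted from the client machine. The following is a minimal sketch, assuming bash (for its /dev/tcp redirection) and the coreutils timeout command; split_endpoints, check_endpoint, and check_endpoints are hypothetical helper names, not part of Kafka or Amazon MSK. It tries to open a TCP connection to each host:port pair in a comma-separated connection string. It can't tell you whether the string belongs to your cluster, only whether each endpoint is reachable from the client.

```bash
#!/usr/bin/env bash
# Sketch: test TCP reachability of each host:port in a comma-separated
# connection string. Requires bash (for /dev/tcp) and coreutils `timeout`.

# Print one endpoint per line from a comma-separated string.
split_endpoints() {
  tr ',' '\n' <<< "$1"
}

# Succeed if a TCP connection to HOST:PORT opens within SECONDS (default 5).
check_endpoint() {
  local host="${1%:*}" port="${1##*:}" seconds="${2:-5}"
  # The inner bash opens fd 3 to /dev/tcp/HOST/PORT and exits; timeout stops
  # it if the connection hangs (for example, when traffic is silently dropped).
  timeout "$seconds" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Check every endpoint; return non-zero if any endpoint is unreachable.
check_endpoints() {
  local endpoint failed=0
  while IFS= read -r endpoint; do
    [ -z "$endpoint" ] && continue
    if check_endpoint "$endpoint" </dev/null; then
      echo "reachable:   $endpoint"
    else
      echo "unreachable: $endpoint"
      failed=1
    fi
  done < <(split_endpoints "$1")
  return "$failed"
}
```

For example, run check_endpoints "$ZOOKEEPER_STRING" on the client, where ZOOKEEPER_STRING is a placeholder for your cluster's Apache ZooKeeper string. An endpoint that times out often points to a security group or routing issue. The same function also accepts a broker string (for example, with port 9092 or 9094).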

Topic 'topicName' not present in metadata after 60000 ms.

-or-

Connection to node - ( / : ) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)

You might get this error under either of the following conditions:

  • The producer or consumer is unable to connect to the broker host and port.
  • The broker string is not valid.

If you get this error even though the client could initially connect to the broker, then the broker might be down.

You get the following error when you use the broker string to produce data from outside the virtual private cloud (VPC) of the cluster:

./kafka-console-producer.sh --broker-list b-2.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:9092,b-1.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:9092 --topic test
>a[2020-04-10 23:51:57,668] ERROR Error when sending message to topic test with key: null, value: 1 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TimeoutException: Topic test not present in metadata after 60000 ms.

You get the following error when you use the broker string to consume data from outside the VPC of the cluster:

./kafka-console-consumer.sh --bootstrap-server b-2.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:9092,b-1.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:9092 --topic test
[2020-04-11 00:03:21,157] WARN [Consumer clientId=consumer-console-consumer-88994-1, groupId=console-consumer-88994] Connection to node -1 (b-2.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com/172.31.6.19:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2020-04-11 00:04:36,818] WARN [Consumer clientId=consumer-console-consumer-88994-1, groupId=console-consumer-88994] Connection to node -2 (b-1.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com/172.31.44.252:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2020-04-11 00:05:53,228] WARN [Consumer clientId=consumer-console-consumer-88994-1, groupId=console-consumer-88994] Connection to node -1 (b-2.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com/172.31.6.19:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)

To troubleshoot these errors, do the following:

  • Be sure that the correct broker string and port are used.
  • If the error is caused by a broker that's down, check the Amazon CloudWatch metric ActiveControllerCount to verify that the controller was active throughout the period. The value of this metric must be 1. Any other value might indicate that one of the brokers in the cluster is unavailable. Also, check the metric ZooKeeperSessionState to confirm that the brokers were constantly communicating with the Apache ZooKeeper nodes. To understand why the broker failed, view the KafkaDataLogsDiskUsed metric and check whether the broker ran out of storage space. For more information on Amazon MSK metrics and their expected values, see Amazon MSK metrics for monitoring with CloudWatch.
  • Be sure that the error isn't caused by the network configuration. Amazon MSK resources are provisioned within the VPC, so by default, clients are expected to connect to the cluster, and to produce to and consume from it, over a private network in the same VPC. If you access the cluster from outside the VPC, then you might get these errors. For information on troubleshooting errors when the client is in the same VPC as the cluster, see Unable to access cluster from within AWS: networking issues. For information on accessing the cluster from outside the VPC, see How do I connect to my Amazon MSK cluster outside of the VPC?
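The ActiveControllerCount check can be run from the AWS CLI. The following is a sketch, assuming the AWS CLI is installed and configured with permission to read CloudWatch metrics, and that GNU date is available; fetch_active_controller_count and all_values_are_one are hypothetical helper names, and CLUSTER_NAME is a placeholder. It reads the per-minute Maximum of the metric in the AWS/Kafka namespace for the last hour and reports any data point other than 1.

```bash
#!/usr/bin/env bash
# Sketch: confirm that ActiveControllerCount stayed at 1 for the last hour.
# Assumes a configured AWS CLI and GNU `date` (for -d '1 hour ago').

# Print the per-minute Maximum of ActiveControllerCount for the last hour.
fetch_active_controller_count() {
  local cluster_name="$1"
  aws cloudwatch get-metric-statistics \
    --namespace AWS/Kafka \
    --metric-name ActiveControllerCount \
    --dimensions Name="Cluster Name",Value="$cluster_name" \
    --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --period 60 \
    --statistics Maximum \
    --query 'Datapoints[].Maximum' \
    --output text
}

# Read whitespace-separated values from stdin; succeed only if there is at
# least one value and every value is 1.
all_values_are_one() {
  local values v
  values=$(cat)
  [ -n "${values//[[:space:]]/}" ] || { echo "no data points" >&2; return 1; }
  for v in $values; do
    case "$v" in
      1|1.0) ;;
      *) echo "unexpected value: $v" >&2; return 1 ;;
    esac
  done
}
```

For example: fetch_active_controller_count "CLUSTER_NAME" | all_values_are_one && echo "Controller was active for the whole hour". The same get-metric-statistics pattern can be adapted for ZooKeeperSessionState and KafkaDataLogsDiskUsed, which are reported per broker and therefore also take a Broker ID dimension.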

Errors that are specific to TLS client authentication

You might get the following errors when you try to connect to a cluster that has TLS client authentication enabled. These errors might be caused by issues with the SSL-related configuration.

Bootstrap broker :9094 (id: - rack: null) disconnected

You might get this error when the producer or consumer tries to connect to a TLS-encrypted cluster over TLS port 9094 without passing the SSL configuration.

You might get the following error when the producer tries to connect to the cluster:

./kafka-console-producer.sh --broker-list b-2.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:9094,b-1.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:9094 --topic test
>[2020-04-10 18:57:58,019] WARN [Producer clientId=console-producer] Bootstrap broker b-1.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:9094 (id: -2 rack: null) disconnected (org.apache.kafka.clients.NetworkClient)
[2020-04-10 18:57:58,342] WARN [Producer clientId=console-producer] Bootstrap broker b-1.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:9094 (id: -2 rack: null) disconnected (org.apache.kafka.clients.NetworkClient)
[2020-04-10 18:57:58,666] WARN [Producer clientId=console-producer] Bootstrap broker b-2.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:9094 (id: -1 rack: null) disconnected (org.apache.kafka.clients.NetworkClient)

You might get the following error when the consumer tries to connect to the cluster:

./kafka-console-consumer.sh --bootstrap-server b-2.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:9094,b-1.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:9094 --topic test
[2020-04-10 19:09:03,277] WARN [Consumer clientId=consumer-console-consumer-79102-1, groupId=console-consumer-79102] Bootstrap broker b-2.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:9094 (id: -1 rack: null) disconnected (org.apache.kafka.clients.NetworkClient)
[2020-04-10 19:09:03,596] WARN [Consumer clientId=consumer-console-consumer-79102-1, groupId=console-consumer-79102] Bootstrap broker b-2.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:9094 (id: -1 rack: null) disconnected (org.apache.kafka.clients.NetworkClient)
[2020-04-10 19:09:03,918] WARN [Consumer clientId=consumer-console-consumer-79102-1, groupId=console-consumer-79102] Bootstrap broker b-1.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:9094 (id: -2 rack: null) disconnected (org.apache.kafka.clients.NetworkClient)

To resolve this error, set up the SSL configuration. For more information, see How do I get started with encryption?

If client authentication is enabled for your cluster, then you must also add parameters related to your ACM Private CA certificate. For more information, see Mutual TLS authentication.
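As an illustration, the following minimal sketch writes a client properties file for a TLS-encrypted cluster. write_ssl_config is a hypothetical helper, and every path and password passed to it is a placeholder. It always sets security.protocol=SSL and ssl.truststore.location; when a key store and its passwords are also given, it adds the standard Apache Kafka client properties ssl.keystore.location, ssl.keystore.password, and ssl.key.password that TLS client authentication relies on. It doesn't create the truststore or key store files.

```bash
#!/usr/bin/env bash
# Sketch: write a Kafka client SSL properties file.
# Usage: write_ssl_config OUT_FILE TRUSTSTORE [KEYSTORE KEYSTORE_PASS KEY_PASS]
# All paths and passwords are placeholders supplied by the caller.
write_ssl_config() {
  local out="$1" truststore="$2"
  {
    echo "security.protocol=SSL"
    echo "ssl.truststore.location=${truststore}"
    if [ "$#" -ge 5 ]; then
      # Mutual TLS: the client presents the certificate in its key store.
      echo "ssl.keystore.location=$3"
      echo "ssl.keystore.password=$4"
      echo "ssl.key.password=$5"
    fi
  } > "$out"
}
```

For example, write_ssl_config /home/ec2-user/ssl.config /home/ec2-user/certs/kafka.client.truststore.jks creates an encryption-only configuration that you can pass with --producer.config or --consumer.config. If your truststore is password protected, you can also add an ssl.truststore.password line.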

ERROR Modification time of key store could not be obtained:

-or-

Failed to load keystore

This error can occur when the producer or consumer loads its truststore file and there's an issue with the truststore configuration. You might see information similar to the following in the logs:

./kafka-console-consumer --bootstrap-server b-2.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:9094,b-1.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:9094 --topic test --consumer.config /home/ec2-user/ssl.config
[2020-04-11 10:39:12,194] ERROR Modification time of key store could not be obtained: /home/ec2-ser/certs/kafka.client.truststore.jks (org.apache.kafka.common.security.ssl.SslEngineBuilder)
java.nio.file.NoSuchFileException: /home/ec2-ser/certs/kafka.client.truststore.jks
[2020-04-11 10:39:12,253] ERROR Unknown error when running consumer: (kafka.tools.ConsoleConsumer$)
Caused by: org.apache.kafka.common.KafkaException: org.apache.kafka.common.KafkaException: org.apache.kafka.common.KafkaException: Failed to load SSL keystore /home/ec2-ser/certs/kafka.client.truststore.jks of type JKS

In this case, the logs show that the truststore file couldn't be loaded because its path is configured incorrectly in the SSL configuration: the path in the log begins with /home/ec2-ser instead of /home/ec2-user. To resolve this error, provide the correct path for the truststore file in the SSL configuration.

This error might also occur under the following conditions:

  • Your truststore or key store file is corrupted.
  • The password of the truststore file is incorrect.
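To rule out a wrong path before you start the client, you can check each configured store file. The following is a minimal sketch, assuming the SSL configuration is a properties file written as key=value with no spaces around the equals sign; get_property and check_ssl_files are hypothetical helper names. It only confirms that the files exist and are readable; it can't detect a corrupted file or an incorrect password.

```bash
#!/usr/bin/env bash
# Sketch: verify that the store paths in a Kafka client SSL config exist.

# Print the value of KEY from a key=value properties FILE (first match wins).
get_property() {
  local file="$1" key="$2"
  awk -F= -v k="$key" '$1 == k { sub(/^[^=]*=/, ""); print; exit }' "$file"
}

# Report each configured store path that is missing or unreadable.
check_ssl_files() {
  local cfg="$1" key path failed=0
  for key in ssl.truststore.location ssl.keystore.location; do
    path=$(get_property "$cfg" "$key")
    [ -z "$path" ] && continue   # property not set in this configuration
    if [ -r "$path" ]; then
      echo "ok:      $key=$path"
    else
      echo "missing: $key=$path"
      failed=1
    fi
  done
  return "$failed"
}
```

For example, run check_ssl_files /home/ec2-user/ssl.config. With a configuration like the one behind the preceding log, it would likely report the /home/ec2-ser/certs/kafka.client.truststore.jks path as missing.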

Error when sending message to topic test with key: null, value: 0 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)

org.apache.kafka.common.errors.SslAuthenticationException: SSL handshake failed

-or-

Connection to node - ( / :9094) failed authentication due to: SSL handshake failed (org.apache.kafka.clients.NetworkClient)

You might get the following error when an issue with the producer's key store configuration causes authentication to fail:

./kafka-console-producer --broker-list b-2.tlscluster.5818ll.c7.kafka.us-east-1.amazonaws.com:9094,b-1.tlscluster.5818ll.c7.kafka.us-east-1.amazonaws.com:9094,b-4.tlscluster.5818ll.c7.kafka.us-east-1.amazonaws.com:9094 --topic example --producer.config /home/ec2-user/ssl.config
>[2020-04-11 11:13:19,286] ERROR [Producer clientId=console-producer] Connection to node -3 (b-4.tlscluster.5818ll.c7.kafka.us-east-1.amazonaws.com/172.31.6.195:9094) failed authentication due to: SSL handshake failed (org.apache.kafka.clients.NetworkClient)

You might get the following error when an issue with the consumer's key store configuration causes authentication to fail:

./kafka-console-consumer --bootstrap-server b-2.tlscluster.5818ll.c7.kafka.us-east-1.amazonaws.com:9094,b-1.tlscluster.5818ll.c7.kafka.us-east-1.amazonaws.com:9094,b-4.tlscluster.5818ll.c7.kafka.us-east-1.amazonaws.com:9094 --topic example --consumer.config /home/ec2-user/ssl.config
[2020-04-11 11:14:46,958] ERROR [Consumer clientId=consumer-1, groupId=console-consumer-46876] Connection to node -1 (b-2.tlscluster.5818ll.c7.kafka.us-east-1.amazonaws.com/172.31.15.140:9094) failed authentication due to: SSL handshake failed (org.apache.kafka.clients.NetworkClient)
[2020-04-11 11:14:46,961] ERROR Error processing message, terminating consumer process: (kafka.tools.ConsoleConsumer$)
org.apache.kafka.common.errors.SslAuthenticationException: SSL handshake failed

To resolve this error, make sure that the key store settings in your SSL configuration are correct.
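A first step that can be automated is confirming that the key store properties are set at all. The following sketch assumes a key=value properties file; has_property, check_mtls_keys, and REQUIRED_MTLS_KEYS are hypothetical names, and the list holds the standard Apache Kafka client properties for TLS client authentication. It only confirms that each property has a non-empty value, not that the value is correct.

```bash
#!/usr/bin/env bash
# Sketch: confirm that a Kafka client SSL config sets the key store properties
# used for TLS client authentication. Values themselves are not validated.

REQUIRED_MTLS_KEYS=(
  security.protocol
  ssl.keystore.location
  ssl.keystore.password
  ssl.key.password
)

# Succeed if KEY appears in FILE with a non-empty value.
has_property() {
  awk -F= -v k="$2" '$1 == k && $2 != "" { found = 1 } END { exit !found }' "$1"
}

# Print each required property that is missing or empty; fail if any are.
check_mtls_keys() {
  local cfg="$1" key failed=0
  for key in "${REQUIRED_MTLS_KEYS[@]}"; do
    if ! has_property "$cfg" "$key"; then
      echo "not set: $key"
      failed=1
    fi
  done
  return "$failed"
}
```

For example, check_mtls_keys /home/ec2-user/ssl.config prints each property that's missing or empty. If all are set but the handshake still fails, check the store and key passwords with keytool.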

java.io.IOException: keystore password was incorrect

You might get this error when the password for the key store or truststore is incorrect.

To troubleshoot this error, do the following:

Check whether the keystore or truststore password is correct by running the following command:

keytool -list -keystore kafka.client.keystore.jks
Enter keystore password:
Keystore type: PKCS12
Keystore provider: SUN
Your keystore contains 1 entry
schema-reg, Jan 15, 2020, PrivateKeyEntry,
Certificate fingerprint (SHA1): 4A:F3:2C:6A:5D:50:87:3A:37:6C:94:5E:05:22:5A:1A:D5:8B:95:ED

If the password for the key store or truststore is incorrect, then you might see the following error:

keytool -list -keystore kafka.client.keystore.jks
Enter keystore password:
keytool error: java.io.IOException: keystore password was incorrect

To view verbose output, add the -v flag to the command:

keytool -list -v -keystore kafka.client.keystore.jks

You can also use these commands to check if the key store is corrupted.
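The keytool commands above prompt for the password. To check a password non-interactively, for example in a script, you can pass it with the -storepass option; keytool exits with a non-zero status when the password is wrong or the file can't be read. The following is a sketch; check_store_password is a hypothetical helper, and it assumes a JDK keytool on the PATH. Passing a password on the command line can expose it to other users of the host through the process list.

```bash
#!/usr/bin/env bash
# Sketch: check a key store or truststore password without a prompt.
# Assumes a JDK `keytool` on PATH.

# Succeed if STORE opens with PASSWORD; print keytool's error otherwise.
check_store_password() {
  local store="$1" password="$2" output
  if output=$(keytool -list -keystore "$store" -storepass "$password" 2>&1); then
    return 0
  fi
  echo "$output" >&2
  return 1
}
```

For example, check_store_password kafka.client.keystore.jks "$STORE_PASSWORD", where STORE_PASSWORD is a placeholder, prints the same "keystore password was incorrect" error shown earlier when the password is wrong.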

You might also get this error when the password for the private key of the alias is configured incorrectly in the SSL configuration of the producer or consumer. To verify this root cause, run the following command:

keytool -keypasswd -alias schema-reg -keystore kafka.client.keystore.jks
Enter keystore password:
Enter key password for <schema-reg>
New key password for <schema-reg>:
Re-enter new key password for <schema-reg>:

If the key password for the alias (in this example, schema-reg) is correct, then the command prompts you to enter a new key password. Otherwise, the command fails with the following message:

keytool -keypasswd -alias schema-reg -keystore kafka.client.keystore.jks
Enter keystore password:
Enter key password for <schema-reg>
keytool error: java.security.UnrecoverableKeyException: Get Key failed: Given final block not properly padded. Such issues can arise if a bad key is used during decryption.
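Because keytool -keypasswd changes the key password if you continue at the prompt, a non-destructive check can be useful. The following sketch asks keytool -certreq to generate a certificate signing request for the alias and discards it: generating the request requires unlocking the private key, so the command fails when -keypass is wrong. check_key_password is a hypothetical helper, and it assumes a JDK keytool on the PATH. With PKCS12 key stores, the key password is normally the same as the store password, so this check matters mainly for JKS key stores.

```bash
#!/usr/bin/env bash
# Sketch: verify the key password for ALIAS without changing it.
# `keytool -certreq` must unlock the private key to sign the request;
# the request itself is sent to /dev/null. Assumes a JDK `keytool`.
check_key_password() {
  local store="$1" alias="$2" store_pass="$3" key_pass="$4" output
  # 2>&1 >/dev/null: capture keytool's errors, discard the generated CSR.
  if output=$(keytool -certreq -alias "$alias" -keystore "$store" \
      -storepass "$store_pass" -keypass "$key_pass" 2>&1 >/dev/null); then
    return 0
  fi
  echo "$output" >&2
  return 1
}
```

For example, check_key_password kafka.client.keystore.jks schema-reg "$STORE_PASSWORD" "$KEY_PASSWORD", where both passwords are placeholders, succeeds only when the key password for schema-reg is correct.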

You can also verify if a particular alias is part of the key store by running the following command:

keytool -list -keystore kafka.client.keystore.jks -alias schema-reg
Enter keystore password:
schema-reg, Jan 15, 2020, PrivateKeyEntry,
Certificate fingerprint (SHA1): 4A:F3:2C:6A:5D:50:87:3A:37:6C:94:5E:05:22:5A:1A:D5:8B:95:ED