How do I troubleshoot issues when connecting to my Amazon MSK cluster?

Last updated: March 1, 2022

I'm experiencing issues when I try to connect to my Amazon Managed Streaming for Apache Kafka (Amazon MSK) cluster.

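Resolution

When you try to connect to an Amazon MSK cluster, you might get the following types of errors:

  • Errors that are not specific to the authentication type of the cluster
  • Errors that are specific to TLS client authentication

When you try to connect to your Amazon MSK cluster, you might get one of the following errors regardless of the authentication type that's enabled for the cluster: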

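Timed out waiting for connection while in state: CONNECTING

You might get this error when the client tries to connect to the Amazon MSK cluster through the Apache ZooKeeper string and the connection can't be established. An incorrect Apache ZooKeeper string can also cause this error.

When you use an incorrect Apache ZooKeeper string to connect to the cluster, you get the following error: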

./kafka-topics.sh --zookeeper z-1.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:2181,z-2.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:2181,z-3.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:2181 --list
[2020-04-10 23:58:47,963] WARN Client session timed out, have not heard from server in 10756ms for sessionid 0x0 (org.apache.zookeeper.ClientCnxn)
[2020-04-10 23:58:58,581] WARN Client session timed out, have not heard from server in 10508ms for sessionid 0x0 (org.apache.zookeeper.ClientCnxn)
[2020-04-10 23:59:08,689] WARN Client session timed out, have not heard from server in 10004ms for sessionid 0x0 (org.apache.zookeeper.ClientCnxn)
Exception in thread "main" kafka.zookeeper.ZooKeeperClientTimeoutException: Timed out waiting for connection while in state: CONNECTING
at kafka.zookeeper.ZooKeeperClient.$anonfun$waitUntilConnected$3(ZooKeeperClient.scala:259)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:253)
at kafka.zookeeper.ZooKeeperClient.waitUntilConnected(ZooKeeperClient.scala:255)
at kafka.zookeeper.ZooKeeperClient.<init>(ZooKeeperClient.scala:113)
at kafka.zk.KafkaZkClient$.apply(KafkaZkClient.scala:1858)
at kafka.admin.TopicCommand$ZookeeperTopicService$.apply(TopicCommand.scala:321)
at kafka.admin.TopicCommand$.main(TopicCommand.scala:54)
at kafka.admin.TopicCommand.main(TopicCommand.scala)

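To resolve this error, do the following:

  • Verify that the Apache ZooKeeper string that you use is correct.
  • Make sure that the security group of your Amazon MSK cluster allows inbound traffic from the client's security group on the Apache ZooKeeper port (2181 in the preceding example).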
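Before rerunning the Kafka tooling, it can help to confirm that each host in the connection string accepts TCP connections from the client machine. The following is a minimal sketch, not part of the original article: the function names are hypothetical, and it assumes a plain comma-separated host:port string with no chroot suffix. A False result is consistent with a wrong hostname, a blocked security group, or a client outside the VPC, but it does not tell you which.

```python
import socket


def parse_endpoints(connection_string):
    """Split 'host1:port1,host2:port2' into a list of (host, port) tuples."""
    endpoints = []
    for entry in connection_string.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, sep, port = entry.rpartition(":")
        if not sep or not host or not port.isdigit():
            raise ValueError(f"expected host:port, got {entry!r}")
        endpoints.append((host, int(port)))
    return endpoints


def is_reachable(host, port, timeout_seconds=5.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


def check_endpoints(connection_string, timeout_seconds=5.0):
    """Map each 'host:port' entry to True (reachable) or False."""
    return {
        f"{host}:{port}": is_reachable(host, port, timeout_seconds)
        for host, port in parse_endpoints(connection_string)
    }
```

For example, calling check_endpoints with the ZooKeeper string from the command above returns one entry per node. The same helper works for a broker string on port 9092 or 9094.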

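Topic 'topicName' not present in metadata after 60000 ms or Connection to node -<node-id> (<broker-host>/<broker-ip>:<port>) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)

You might get this error under either of the following conditions:

  • The producer or consumer can't connect to the broker host and port.
  • The broker string isn't valid.

If you get this error even though the connection between the client and the broker initially worked, then the broker might be down.

When you try to produce data by using the broker string to access the cluster from outside the virtual private cloud (VPC), you get the following error: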

./kafka-console-producer.sh --broker-list b-2.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:9092,b-1.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:9092 --topic test
>a[2020-04-10 23:51:57,668] ERROR Error when sending message to topic test with key: null, value: 1 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TimeoutException: Topic test not present in metadata after 60000 ms.

When you try to consume data by using the broker string to access the cluster from outside the VPC, you get the following error:

./kafka-console-consumer.sh --bootstrap-server b-2.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:9092,b-1.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:9092 --topic test
[2020-04-11 00:03:21,157] WARN [Consumer clientId=consumer-console-consumer-88994-1, groupId=console-consumer-88994] Connection to node -1 (b-2.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com/172.31.6.19:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2020-04-11 00:04:36,818] WARN [Consumer clientId=consumer-console-consumer-88994-1, groupId=console-consumer-88994] Connection to node -2 (b-1.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com/172.31.44.252:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2020-04-11 00:05:53,228] WARN [Consumer clientId=consumer-console-consumer-88994-1, groupId=console-consumer-88994] Connection to node -1 (b-2.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com/172.31.6.19:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)

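To troubleshoot these errors further, do the following:

  • Make sure that you use the correct broker string and port.
  • If the error is caused by a broker that's down, check the Amazon CloudWatch metric ActiveControllerCount to verify that the controller was active throughout the period. The value of this metric must be 1. Any other value might indicate that one of the brokers in the cluster is unavailable. Also, check the metric ZooKeeperSessionState to confirm that the brokers were constantly communicating with the Apache ZooKeeper nodes. To understand why the broker failed, view the metric KafkaDataLogsDiskUsed and check whether the broker ran out of storage space. For more information about Amazon MSK metrics and their expected values, see Amazon MSK metrics for monitoring with CloudWatch.
  • Make sure that the error isn't caused by the network configuration. Amazon MSK resources are provisioned within a VPC. Therefore, by default, clients are expected to connect to the Amazon MSK cluster, or produce and consume from the cluster, over a private network in the same VPC. If you access the cluster from outside the VPC, then you might get these errors. For information about troubleshooting errors when the client is in the same VPC as the cluster, see Unable to access cluster from within AWS: Networking issues. For information about accessing the cluster from outside the VPC, see How do I connect to my Amazon MSK cluster outside of the VPC?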
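The ActiveControllerCount check described above can be sketched as a small filter over datapoints that you have already retrieved from CloudWatch. This is an illustration only: the function name is hypothetical, and the input shape (a list of dictionaries with 'Timestamp' and 'Maximum' keys) is an assumption modeled on CloudWatch statistics datapoints. Retrieving the datapoints is out of scope here.

```python
def find_controller_gaps(datapoints, expected=1):
    """Return datapoints whose value differs from the expected controller count.

    `datapoints` is assumed to be a list of dicts with 'Timestamp' and
    'Maximum' keys, aggregated at the cluster level.
    """
    gaps = [point for point in datapoints if point["Maximum"] != expected]
    # Sort so the earliest anomaly is listed first.
    return sorted(gaps, key=lambda point: point["Timestamp"])


if __name__ == "__main__":
    sample = [
        {"Timestamp": "2020-04-10T23:50:00Z", "Maximum": 1},
        {"Timestamp": "2020-04-10T23:55:00Z", "Maximum": 0},
        {"Timestamp": "2020-04-11T00:00:00Z", "Maximum": 1},
    ]
    for point in find_controller_gaps(sample):
        print(point["Timestamp"], point["Maximum"])
```

An empty result means that every sampled period reported exactly one active controller. Any returned datapoint marks a period worth correlating with ZooKeeperSessionState and KafkaDataLogsDiskUsed.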

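Errors that are specific to TLS client authentication

You might get the following errors when you try to connect to a cluster that has TLS client authentication enabled. These errors might be caused by issues with the SSL-related configuration.

Bootstrap broker <broker-host>:9094 (id: -<broker-id> rack: null) disconnected

You might get this error when the producer or consumer tries to connect to a TLS-encrypted cluster over TLS port 9094 without passing the SSL configuration.

You might get the following error when the producer tries to connect to the cluster: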

./kafka-console-producer.sh --broker-list b-2.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:9094,b-1.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:9094 --topic test
>[2020-04-10 18:57:58,019] WARN [Producer clientId=console-producer] Bootstrap broker b-1.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:9094 (id: -2 rack: null) disconnected (org.apache.kafka.clients.NetworkClient)
[2020-04-10 18:57:58,342] WARN [Producer clientId=console-producer] Bootstrap broker b-1.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:9094 (id: -2 rack: null) disconnected (org.apache.kafka.clients.NetworkClient)
[2020-04-10 18:57:58,666] WARN [Producer clientId=console-producer] Bootstrap broker b-2.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:9094 (id: -1 rack: null) disconnected (org.apache.kafka.clients.NetworkClient)

You might get the following error when the consumer tries to connect to the cluster:

./kafka-console-consumer.sh --bootstrap-server b-2.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:9094,b-1.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:9094 --topic test
[2020-04-10 19:09:03,277] WARN [Consumer clientId=consumer-console-consumer-79102-1, groupId=console-consumer-79102] Bootstrap broker b-2.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:9094 (id: -1 rack: null) disconnected (org.apache.kafka.clients.NetworkClient)
[2020-04-10 19:09:03,596] WARN [Consumer clientId=consumer-console-consumer-79102-1, groupId=console-consumer-79102] Bootstrap broker b-2.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:9094 (id: -1 rack: null) disconnected (org.apache.kafka.clients.NetworkClient)
[2020-04-10 19:09:03,918] WARN [Consumer clientId=consumer-console-consumer-79102-1, groupId=console-consumer-79102] Bootstrap broker b-1.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:9094 (id: -2 rack: null) disconnected (org.apache.kafka.clients.NetworkClient)

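To resolve this error, set up the SSL configuration. For more information, see How do I get started with encryption?

If client authentication is enabled for your cluster, then you must add additional parameters related to your ACM Private CA certificate. For more information, see Mutual TLS authentication.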
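As a rough illustration of what such an SSL configuration file contains, the following sketch generates client properties text. The property names are standard Kafka client settings. The function name, file paths, and passwords are hypothetical placeholders, and the keystore lines apply only when the cluster requires client (mutual TLS) authentication.

```python
def build_ssl_client_properties(truststore_path, keystore_path=None,
                                keystore_password=None, key_password=None):
    """Return Kafka client properties text for a TLS connection.

    Keystore settings are added only when a keystore path is given,
    which is the case for clusters with TLS client authentication.
    """
    lines = [
        "security.protocol=SSL",
        f"ssl.truststore.location={truststore_path}",
    ]
    if keystore_path is not None:
        lines.append(f"ssl.keystore.location={keystore_path}")
        if keystore_password is not None:
            lines.append(f"ssl.keystore.password={keystore_password}")
        if key_password is not None:
            lines.append(f"ssl.key.password={key_password}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical paths and passwords, for illustration only.
    print(build_ssl_client_properties(
        "/home/ec2-user/certs/kafka.client.truststore.jks",
        keystore_path="/home/ec2-user/certs/kafka.client.keystore.jks",
        keystore_password="example-store-password",
        key_password="example-key-password",
    ))
```

You can save the generated text to a file and pass that file to the console tools with --producer.config or --consumer.config, as the commands later in this article do with /home/ec2-user/ssl.config.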

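ERROR Modification time of key store could not be obtained:

-or-

Failed to load keystore

If there's an issue with the truststore configuration, then this error can occur when the truststore files are loaded for the producer and consumer. You might see information similar to the following in the logs: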

./kafka-console-consumer --bootstrap-server b-2.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:9094,b-1.encryption.3a3zuy.c7.kafka.us-east-1.amazonaws.com:9094 --topic test --consumer.config /home/ec2-user/ssl.config
[2020-04-11 10:39:12,194] ERROR Modification time of key store could not be obtained: /home/ec2-ser/certs/kafka.client.truststore.jks (org.apache.kafka.common.security.ssl.SslEngineBuilder)
java.nio.file.NoSuchFileException: /home/ec2-ser/certs/kafka.client.truststore.jks
[2020-04-11 10:39:12,253] ERROR Unknown error when running consumer: (kafka.tools.ConsoleConsumer$)
Caused by: org.apache.kafka.common.KafkaException: org.apache.kafka.common.KafkaException: org.apache.kafka.common.KafkaException: Failed to load SSL keystore /home/ec2-ser/certs/kafka.client.truststore.jks of type JKS

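In this case, the logs indicate a problem with loading the truststore file: the path to the truststore file is misconfigured in the SSL configuration. You can resolve this error by providing the correct path to the truststore file in the SSL configuration.

This error can also occur under the following conditions:

  • Your truststore or keystore file is corrupted.
  • The password of the truststore file is incorrect.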
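A misconfigured path such as the one in the log above can be caught before starting the client. The following sketch reads a properties-style SSL configuration file and reports any truststore or keystore location that doesn't point to a readable file. The function names are hypothetical, and the parser is deliberately simple: it assumes one key=value pair per line and doesn't handle escapes or line continuations.

```python
import os

STORE_KEYS = ("ssl.truststore.location", "ssl.keystore.location")


def read_properties(path):
    """Parse simple key=value lines, skipping blanks and '#' comments."""
    props = {}
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            props[key.strip()] = value.strip()
    return props


def find_missing_stores(config_path):
    """Return {property: path} for store files that are missing or unreadable."""
    props = read_properties(config_path)
    problems = {}
    for key in STORE_KEYS:
        location = props.get(key)
        if location is None:
            continue  # Property not set, so there is nothing to check.
        if not (os.path.isfile(location) and os.access(location, os.R_OK)):
            problems[key] = location
    return problems
```

An empty dictionary means that every configured store path resolves to a readable file. This check doesn't detect a corrupted file or a wrong password.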

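Error when sending message to topic test with key: null, value: 0 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)

org.apache.kafka.common.errors.SslAuthenticationException: SSL handshake failed

-or-

Connection to node -<broker-id> (<broker-host>/<broker-ip>:9094) failed authentication due to: SSL handshake failed (org.apache.kafka.clients.NetworkClient)

You might get the following error when an issue with the producer's keystore configuration causes the authentication to fail: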

./kafka-console-producer --broker-list b-2.tlscluster.5818ll.c7.kafka.us-east-1.amazonaws.com:9094,b-1.tlscluster.5818ll.c7.kafka.us-east-1.amazonaws.com:9094,b-4.tlscluster.5818ll.c7.kafka.us-east-1.amazonaws.com:9094 --topic example --producer.config /home/ec2-user/ssl.config
>[2020-04-11 11:13:19,286] ERROR [Producer clientId=console-producer] Connection to node -3 (b-4.tlscluster.5818ll.c7.kafka.us-east-1.amazonaws.com/172.31.6.195:9094) failed authentication due to: SSL handshake failed (org.apache.kafka.clients.NetworkClient)

You might get the following error when an issue with the consumer's keystore configuration causes the authentication to fail:

./kafka-console-consumer --bootstrap-server b-2.tlscluster.5818ll.c7.kafka.us-east-1.amazonaws.com:9094,b-1.tlscluster.5818ll.c7.kafka.us-east-1.amazonaws.com:9094,b-4.tlscluster.5818ll.c7.kafka.us-east-1.amazonaws.com:9094 --topic example --consumer.config /home/ec2-user/ssl.config
[2020-04-11 11:14:46,958] ERROR [Consumer clientId=consumer-1, groupId=console-consumer-46876] Connection to node -1 (b-2.tlscluster.5818ll.c7.kafka.us-east-1.amazonaws.com/172.31.15.140:9094) failed authentication due to: SSL handshake failed (org.apache.kafka.clients.NetworkClient)
[2020-04-11 11:14:46,961] ERROR Error processing message, terminating consumer process: (kafka.tools.ConsoleConsumer$)
org.apache.kafka.common.errors.SslAuthenticationException: SSL handshake failed

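To resolve this error, make sure that the keystore-related configuration is set up correctly.

java.io.IOException: keystore password was incorrect

You might get this error when the password for the keystore or truststore is incorrect.

To troubleshoot this error, do the following:

Check whether the keystore or truststore password is correct by running the following command: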

keytool -list -keystore kafka.client.keystore.jks
Enter keystore password:
Keystore type: PKCS12
Keystore provider: SUN
Your keystore contains 1 entry
schema-reg, Jan 15, 2020, PrivateKeyEntry,
Certificate fingerprint (SHA1): 4A:F3:2C:6A:5D:50:87:3A:37:6C:94:5E:05:22:5A:1A:D5:8B:95:ED

If the password for the keystore or truststore is incorrect, then you might see the following error:

keytool -list -keystore kafka.client.keystore.jks
Enter keystore password:
keytool error: java.io.IOException: keystore password was incorrect

You can view verbose output of the preceding command by adding the -v flag:

keytool -list -v -keystore kafka.client.keystore.jks

You can also use these commands to check whether the keystore is corrupted.
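As a quick supplementary check (a different technique from the keytool commands, and not a replacement for them), you can inspect the first bytes of the store file. JKS files start with the magic bytes FE ED FE ED, JCEKS files start with CE CE CE CE, and PKCS12 files are DER-encoded and start with byte 0x30. The function name below is hypothetical. The sketch only guesses the container format and doesn't verify integrity or passwords.

```python
JKS_MAGIC = bytes.fromhex("feedfeed")
JCEKS_MAGIC = bytes.fromhex("cececece")
DER_SEQUENCE_TAG = b"\x30"


def guess_keystore_format(path):
    """Guess the keystore container format from the file's leading bytes."""
    with open(path, "rb") as handle:
        header = handle.read(4)
    if header == JKS_MAGIC:
        return "JKS"
    if header == JCEKS_MAGIC:
        return "JCEKS"
    if header[:1] == DER_SEQUENCE_TAG:
        # Any DER-encoded structure starts this way, so this is only a hint.
        return "PKCS12 (likely)"
    return "unknown"
```

A result of "unknown" for a file that is meant to be a keystore suggests that the file is truncated or isn't a keystore at all. A file with a .jks extension can legitimately be reported as PKCS12, which matches the "Keystore type: PKCS12" line in the keytool output above.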

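You might also get this error when the password for the key that's associated with the alias isn't configured correctly in the SSL configuration for the producer and consumer. To verify this root cause, run the following command: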

keytool -keypasswd -alias schema-reg -keystore kafka.client.keystore.jks
Enter keystore password:
Enter key password for <schema-reg>
New key password for <schema-reg>:
Re-enter new key password for <schema-reg>:

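If the key password for your alias (for example, schema-reg) is correct, then the command asks you to enter a new password for the key. Otherwise, the command fails with the following message: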

keytool -keypasswd -alias schema-reg -keystore kafka.client.keystore.jks
Enter keystore password:
Enter key password for <schema-reg>
keytool error: java.security.UnrecoverableKeyException: Get Key failed: Given final block not properly padded. Such issues can arise if a bad key is used during decryption.

You can also verify that a particular alias is part of the keystore by running the following command:

keytool -list -keystore kafka.client.keystore.jks -alias schema-reg
Enter keystore password:
schema-reg, Jan 15, 2020, PrivateKeyEntry,
Certificate fingerprint (SHA1): 4A:F3:2C:6A:5D:50:87:3A:37:6C:94:5E:05:22:5A:1A:D5:8B:95:ED