How do I troubleshoot Lambda triggers that poll from MSK and self-managed Kafka clusters?
Last updated: 2023-01-18
My AWS Lambda function is designed to process records from my Amazon Managed Streaming for Apache Kafka (Amazon MSK) cluster or self-managed Kafka cluster. However, the Lambda trigger is displaying an error message.
An event source mapping (ESM) is an AWS Lambda resource that reads from an event source and invokes a Lambda function. To invoke a Lambda function, a Lambda-Kafka ESM must be able to perform the following actions:
- Communicate with the cluster.
- Poll records from the topic.
- Communicate with the Lambda Invoke API.
- Communicate with the AWS Security Token Service (AWS STS) API.
If an ESM's networking, authentication, or authorization settings prevent communication with the cluster, then setup fails before the ESM can invoke the function. The trigger then displays an error message that helps you troubleshoot the root cause.
When a Lambda function is configured with an Amazon MSK trigger or a self-managed Kafka trigger, an ESM resource is automatically created. An ESM is separate from the Lambda function, and it continuously polls records from the topic in the Kafka cluster. The ESM bundles those records into a payload. Then, it calls the Lambda Invoke API to deliver the payload to your Lambda function for processing.
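To make the payload concrete, the following is a minimal sketch of a Python handler that unpacks a batch of Kafka records. It assumes the event shape that Lambda documents for Kafka triggers, where records are grouped under "records" by a topic-partition key and each value is base64 encoded. The handler name, the sample event, and the UTF-8 decoding are illustrative assumptions.

```python
import base64


def handler(event, context):
    """Decode every Kafka record in the batch that the ESM delivered."""
    decoded = []
    # "records" maps "<topic>-<partition>" to a list of record dicts.
    for topic_partition, records in event.get("records", {}).items():
        for record in records:
            # Assumption: values are UTF-8 text; binary payloads need other handling.
            value = base64.b64decode(record["value"]).decode("utf-8")
            decoded.append(
                (record["topic"], record["partition"], record["offset"], value)
            )
    return decoded


# Illustrative event; real events carry more fields, such as timestamps and headers.
sample_event = {
    "eventSource": "aws:kafka",
    "records": {
        "mytopic-0": [
            {
                "topic": "mytopic",
                "partition": 0,
                "offset": 15,
                "value": base64.b64encode(b"hello").decode("ascii"),
            }
        ]
    },
}
```

Calling handler(sample_event, None) locally is a quick way to check the decoding logic before you attach a trigger.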
Important: Lambda-Kafka ESMs don’t inherit the VPC network settings of the Lambda function. This is true for both MSK triggers and self-managed Kafka triggers. An MSK ESM uses the subnet and security group settings that are configured on the target MSK cluster. A self-managed Kafka trigger has WAN access by default, but can be configured with network access to a VPC in the same account and AWS Region. Because the network configuration is separated, a Lambda function can run code within a network that doesn’t have a route to the Kafka cluster.
Understand the ESM setup process
Before the ESM can invoke the associated Lambda function, the ESM automatically completes the following steps:
1. The ESM calls AWS STS APIs to get a security token.
2. If SourceAccessConfiguration contains a secret, then the ESM retrieves that secret through the AWS Secrets Manager API.
3. For self-managed Kafka ESM: Lambda resolves the IP address of the cluster endpoints from the hostname that's configured in the ESM, under selfManagedEventSourceEndPoints.
For MSK ESM: Get the MSK cluster's subnet and security group configuration.
4. For self-managed Kafka ESM: Establish a network connection to the broker endpoint.
For MSK ESM: Create a hyperplane elastic network interface with the MSK cluster's security group in each of the MSK cluster's subnets.
5. Authentication:
- If TLS auth is activated, then check the SSL certificate that's presented by the broker endpoint.
- Log in to the broker.
6. Authorization:
- Make sure that the topic exists in the cluster. Ask the cluster brokers if the topic that's configured in the ESM's Topics parameter exists in the cluster.
- Create a consumer group in the cluster, using the ESM's UUID as the consumer group ID.
7. Poll records from the topic.
8. Bundle the records into a payload that's smaller than 6 MB. This is the limit for Lambda invocation payloads.
9. The ESM invokes the associated Lambda function with the payload of records. To do this, the ESM makes a synchronous call to the Lambda Invoke API.
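The bundling step in the list above can be pictured with a small sketch. This is not the ESM's actual implementation, which isn't public. It only illustrates the idea of grouping records so that each batch stays under a size limit. The function name, the byte value used for 6 MB, and the choice to ignore envelope and base64 overhead are all assumptions.

```python
# Assumption: 6 MB is treated as 6 * 1024 * 1024 bytes for this illustration.
MAX_PAYLOAD_BYTES = 6 * 1024 * 1024


def bundle_records(records, max_bytes=MAX_PAYLOAD_BYTES):
    """Group serialized records (bytes) into batches smaller than max_bytes."""
    batches, current, current_size = [], [], 0
    for record in records:
        size = len(record)
        if size >= max_bytes:
            raise ValueError("a single record exceeds the payload limit")
        # Start a new batch when adding this record would reach the limit.
        if current and current_size + size >= max_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(record)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

For example, three 4-byte records with max_bytes=10 produce two batches: the first two records together, then the third on its own.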
Troubleshoot network security issues
When the ESM sends a request to the broker endpoints and doesn't receive a response, the ESM considers the request to have timed out. When a request to the broker endpoint times out, the trigger displays the following error message:
"PROBLEM: Connection error. Please check your event source connection configuration. If your event source lives in a VPC, try setting up a new Lambda function or EC2 instance with the same VPC, Subnet, and Security Group settings. Connect the new device to the Kafka cluster and consume messages to ensure that the issue is not related to VPC or Endpoint configuration. If the new device is able to consume messages, please contact Lambda customer support for further investigation."
To troubleshoot the issue, follow the steps that are noted in the preceding error message. Also, note the networking configurations in the following sections to be sure that your ESM is properly configured.
Note: Timed-out requests from the ESM might also occur when the cluster doesn't have enough system resources to handle the request, or when the wrong security settings are configured on the ESM or cluster. If you receive this error and there aren't any issues with the network configuration, then check the cluster broker's access logs for additional information.
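The error message suggests that you connect to the cluster from a test function or instance that uses the same VPC, subnet, and security group settings. As a first step before you consume messages, a plain TCP probe can show whether the broker port is reachable at all. The following is a minimal sketch that uses only the Python standard library. The function name and the default timeout are assumptions, and a successful TCP connection doesn't prove that authentication or authorization will succeed.

```python
import socket


def can_connect(host, port, timeout_seconds=5.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers timeouts, refused connections, and DNS resolution failures.
        return False
```

Run the probe from inside the same subnets and security groups that the ESM uses, and pass one of your broker hostnames and its listener port.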
Networking configuration that a self-managed Kafka ESM uses
A self-managed Kafka ESM's network configuration is similar to that of a Lambda function. By default, the ESM has access to the WAN but isn’t configured for access within a VPC. It can be manually configured with specific subnets and security groups to access a Kafka cluster. However, it can only access a cluster that’s reachable from a VPC in the account that contains the Lambda function. As a result, you can create a self-managed Kafka ESM for a Kafka cluster that’s in the following locations:
- An on-premises data center
- Another cloud provider
- The MSK brokers of a Kafka cluster that’s located in the VPC of a different account
Note: It’s possible to create a self-managed Kafka trigger that consumes from an MSK cluster in another account. However, there are some downsides. Unlike an MSK trigger, AWS Identity and Access Management (IAM) authentication isn’t available for self-managed Kafka triggers. Also, connecting to the MSK cluster over a VPC-peered connection requires specific VPC workarounds. For more information, see How Goldman Sachs builds cross-account connectivity to their Amazon MSK clusters with AWS PrivateLink.
Networking configuration of a Lambda-MSK ESM
To communicate with the MSK cluster, an MSK ESM creates a hyperplane elastic network interface inside each subnet that’s used by the cluster. This is similar to how a Lambda function operates within a VPC.
An MSK ESM doesn’t use the Lambda function's VPC settings. Instead, the ESM automatically uses the subnet and security group settings that are configured on the target MSK cluster. The MSK ESM then creates a network interface inside each of the subnets that are used by the MSK cluster. These network interfaces use the same security group that’s used by the MSK cluster. To find the security groups and the ingress or egress rules that the MSK ESM uses, run the following AWS CLI commands:
1. Use the AWS CLI MSK command describe-cluster to list the security groups and subnets used by the MSK cluster.
2. Use the describe-security-groups command on the security groups that are listed in the output of describe-cluster.
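The same two lookups can be scripted. The following is a sketch that assumes a provisioned MSK cluster, where the describe-cluster response lists subnets and security groups under ClusterInfo.BrokerNodeGroupInfo. The helper names are hypothetical. The second function needs the boto3 package, AWS credentials, and network access, so it's imported lazily and kept separate from the parsing helper.

```python
def msk_network_settings(describe_cluster_response):
    """Extract subnets and security groups from a describe-cluster response."""
    broker_info = describe_cluster_response["ClusterInfo"]["BrokerNodeGroupInfo"]
    return {
        "subnets": broker_info.get("ClientSubnets", []),
        "security_groups": broker_info.get("SecurityGroups", []),
    }


def fetch_msk_network_settings(cluster_arn):
    """Call the MSK and EC2 APIs and return the settings plus the group rules."""
    import boto3  # lazy import so the parsing helper works without boto3

    response = boto3.client("kafka").describe_cluster(ClusterArn=cluster_arn)
    settings = msk_network_settings(response)
    groups = boto3.client("ec2").describe_security_groups(
        GroupIds=settings["security_groups"]
    )["SecurityGroups"]
    return settings, groups
```

The returned security group entries contain the IpPermissions and IpPermissionsEgress lists that describe the ingress and egress rules.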
Grant access to traffic
The MSK cluster's security group must include a rule that grants ingress traffic from itself and egress traffic to itself. The traffic must also be granted over one of the following open authentication ports that’s used by the broker:
- 9092 for plaintext
- 9094 for TLS
- 9096 for SASL/SCRAM
- 9098 for IAM
- 443 for all configurations
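To check the self-referencing rule programmatically, the following sketch inspects one security group entry in the shape that the EC2 describe-security-groups response uses. The function name is hypothetical. The check only looks for rules that reference the group itself, so it reports False for a group that allows the traffic through a CIDR range instead.

```python
def allows_self_on_port(security_group, port, direction="ingress"):
    """Check whether a security group allows TCP traffic from or to itself on a port.

    security_group is one entry of the "SecurityGroups" list in a
    describe-security-groups response.
    """
    key = "IpPermissions" if direction == "ingress" else "IpPermissionsEgress"
    group_id = security_group["GroupId"]
    for rule in security_group.get(key, []):
        protocol = rule.get("IpProtocol")
        if protocol == "-1":
            port_matches = True  # "-1" means all protocols and all ports
        elif protocol == "tcp":
            port_matches = rule.get("FromPort", 0) <= port <= rule.get("ToPort", 65535)
        else:
            port_matches = False
        references_self = any(
            pair.get("GroupId") == group_id
            for pair in rule.get("UserIdGroupPairs", [])
        )
        if port_matches and references_self:
            return True
    return False
```

Run the check once with direction="ingress" and once with direction="egress" for the port that your brokers use.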
Troubleshoot issues that can occur during initialization, polling, and invocation
"PROBLEM: Connection error. Your VPC must be able to connect to Lambda and STS, as well as Secrets Manager if authentication is required. You can provide access by configuring PrivateLink or a NAT Gateway."
The preceding error occurs for any of the following reasons:
- The ESM is configured in a VPC, and calls to the STS API fail or time out.
- The ESM is configured in a VPC, and the Secrets Manager API connection attempts fail or time out.
- The trigger can access your Kafka cluster, but it times out when invoking your function over the Lambda API.
These issues might be due to incorrect VPC settings that prevent your ESM from reaching other services like AWS STS and AWS Secrets Manager. Follow the steps in Setting up AWS Lambda with an Apache Kafka cluster within a VPC to properly configure your VPC settings.
If calls to the STS API fail or time out, then your VPC settings prevent your ESM from reaching the Regional AWS STS endpoint on port 443. To resolve this issue, see Setting up AWS Lambda with an Apache Kafka cluster within a VPC.
If SourceAccessConfiguration contains a secret, then be sure that the ESM can retrieve that secret from Secrets Manager.
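To test whether a VPC has a route to these services, you can probe the Regional endpoints on port 443 from a test function or instance in the same subnets. The following sketch assumes the standard endpoint naming pattern for commercial AWS Regions, which doesn't apply to every partition. The function names are hypothetical.

```python
import socket


def regional_endpoints(region, uses_secret=False):
    """Build the Regional hostnames that the ESM needs to reach."""
    # Assumption: standard "<service>.<region>.amazonaws.com" naming.
    hosts = [f"sts.{region}.amazonaws.com", f"lambda.{region}.amazonaws.com"]
    if uses_secret:
        hosts.append(f"secretsmanager.{region}.amazonaws.com")
    return hosts


def probe_endpoints(hosts, port=443, timeout_seconds=5.0):
    """Return a dict that maps each host to True if a TCP connection opens."""
    results = {}
    for host in hosts:
        try:
            with socket.create_connection((host, port), timeout=timeout_seconds):
                results[host] = True
        except OSError:
            results[host] = False
    return results
```

A False result for a host suggests that the subnet has no NAT gateway route or VPC endpoint for that service.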
"PROBLEM: Certificate and/or private key must be in PEM format."
The preceding error occurs if you have a secret that's not in a format that can be deciphered by the ESM.
To troubleshoot this, check the format of your secret. Note that Secrets Manager only supports X.509 certificate files in .pem format. For more information, see Provided certificate or private key is not valid (Amazon MSK) or Provided certificate or private key is not valid (Kafka).
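A quick local check can catch the most common formatting mistakes before you update the secret. The following sketch assumes that the secret is a JSON document with "certificate" and "privateKey" fields, as described in the Lambda documentation for mutual TLS. It only checks for well-formed PEM delimiters and a base64 body. It doesn't validate the certificate or key contents, and the function names are hypothetical.

```python
import json
import re

# Matches "-----BEGIN <LABEL>-----", a base64 body, and the matching END line.
PEM_BLOCK = re.compile(
    r"-----BEGIN (?P<label>[A-Z0-9 ]+)-----\s+[A-Za-z0-9+/=\s]+-----END (?P=label)-----"
)


def pem_labels(text):
    """Return the labels of well-formed PEM blocks found in text."""
    return [match.group("label") for match in PEM_BLOCK.finditer(text)]


def check_tls_secret(secret_string):
    """Report problems with a JSON secret that holds a certificate and private key."""
    secret = json.loads(secret_string)
    problems = []
    if "CERTIFICATE" not in pem_labels(secret.get("certificate", "")):
        problems.append("certificate is not a PEM CERTIFICATE block")
    key_labels = pem_labels(secret.get("privateKey", ""))
    if not any(label.endswith("PRIVATE KEY") for label in key_labels):
        problems.append("privateKey is not a PEM private key block")
    return problems
```

An empty list means that both fields look like PEM. Any other result names the field to fix.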
"PROBLEM: The provided Kafka broker endpoints cannot be resolved."
The preceding error occurs when your ESM is unable to translate the hostname into an IP address.
To resolve this error, make sure that the ESM is able to reach a DNS server that can translate the hostname. If the endpoint's hostname is within a private network, then configure the ESM to use a VPC with DNS settings that can resolve the hostname.
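To confirm that a broker hostname resolves from a given network, you can run a short lookup from a test function or instance that uses the same VPC and DNS settings. The following is a minimal sketch that uses the Python standard library. The function name and the default port are assumptions.

```python
import socket


def resolve_host(hostname, port=9092):
    """Return the sorted IP addresses for hostname, or an empty list on failure."""
    try:
        infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # The resolver could not translate the hostname.
        return []
    # Each entry ends with a socket address whose first element is the IP address.
    return sorted({info[4][0] for info in infos})
```

An empty result for one of your broker hostnames points to the DNS configuration of the network where you ran the lookup.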
"PROBLEM: Server failed to authenticate Lambda or Lambda failed to authenticate server."
The preceding error occurs when the server that your ESM is connected to isn’t the server that you configured in the ESM settings.
To troubleshoot this, verify that you configured your ESM’s settings for the server that you’re connecting to.
"PROBLEM: SASL authentication failed."
The preceding error occurs when your server login attempt fails.
AWS Lambda functions that are triggered from an Amazon MSK topic can access user names and passwords that are secured by AWS Secrets Manager using SASL/SCRAM. You receive an error when your user name and password aren’t recognized as valid.
To resolve this error, log in to the broker and check the access logs.
"PROBLEM: Cluster failed to authorize Lambda."
The preceding error occurs when the ESM logs in to the broker, but the ESM user doesn't have permission to poll records from the topic. To troubleshoot this issue, see Cluster failed to authorize Lambda (Amazon MSK) or Cluster failed to authorize Lambda (Kafka).
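If you collect trigger error messages in a script, a small lookup table can map each message to the cause that this article describes. The following sketch is only a convenience. The table, the matching fragments, and the function name are illustrative, and the causes are summaries of the preceding sections.

```python
# Each entry pairs a distinctive fragment of the error text with a likely cause.
KNOWN_PROBLEMS = [
    ("Please check your event source connection configuration",
     "Requests to the broker endpoints timed out."),
    ("Your VPC must be able to connect to Lambda and STS",
     "The ESM can't reach AWS STS, Secrets Manager, or the Lambda API from its VPC."),
    ("Certificate and/or private key must be in PEM format",
     "The secret isn't in a format that the ESM can read."),
    ("broker endpoints cannot be resolved",
     "The ESM can't translate the broker hostname into an IP address."),
    ("Server failed to authenticate Lambda or Lambda failed to authenticate server",
     "The server that the ESM connected to isn't the server in the ESM settings."),
    ("SASL authentication failed",
     "The broker rejected the user name and password."),
    ("Cluster failed to authorize Lambda",
     "The ESM user doesn't have permission to poll records from the topic."),
]


def diagnose(message):
    """Return the likely cause for a trigger error message, or None if unknown."""
    for fragment, cause in KNOWN_PROBLEMS:
        if fragment in message:
            return cause
    return None
```

For example, diagnose("PROBLEM: SASL authentication failed.") returns the entry about rejected credentials.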