AWS Compute Blog

Improved failure recovery for Amazon EventBridge

Today we’re announcing two new capabilities for Amazon EventBridgedead letter queues and custom retry policies. Both of these give you greater flexibility in how to handle any failures in the processing of events with EventBridge. You can easily enable them on a per target basis and configure them uniquely for each.

Dead letter queues (DLQs) are a common capability in queuing and messaging systems that allow you to handle failures in event or message receiving systems. They provide a way for failed events or messages to be captured and sent to another system, which can store them for future processing. With DLQs, you can have greater resiliency and improved recovery from any failure that happens.

You can also now configure a custom retry policy that can be set on your event bus targets. Today, there are two attributes that can control how events are retried: maximum number of retries and maximum event age. With these two settings, you could send events to a DLQ sooner and reduce the retries attempted.

For example, this could allow you to recover more quickly if an event bus target is overwhelmed by the number of events received, causing throttling to occur. The events are placed in a DLQ and then processed later.

Failures in event processing

Currently, EventBridge can fail to deliver an event to a target in certain scenarios. Events that fail to be delivered to a target due to client-side errors are dropped immediately. Examples of this are when EventBridge does not have permission to a target AWS service or if the target no longer exists. This can happen if the target resource is misconfigured or is deleted by the resource owner.

For service-side issues, EventBridge retries delivery of events for up to 24 hours. This can happen if the target service is unavailable or the target resource is not provisioned to handle the incoming event traffic and the target service is throttling the requests.

EventBridge failures

EventBridge failures

Previously, when all attempts to deliver an event to the target were exhausted, EventBridge published a CloudWatch metric indicating a failed target invocation. However, this provides no visibility into which events failed to be delivered and there was no way to recover the event that failed.

Dead letter queues

EventBridge’s DLQs are made possible today with Amazon Simple Queue Service (SQS) standard queues. With SQS, you get all of the benefits of a fully serverless queuing service: no servers to manage, automatic scalability, pay for what you consume, and high availability and security built in. You can configure the DLQs for your EventBridge bus and pay nothing until it is used, if and when a target experiences an issue. This makes it a great practice to follow and standardize on, and provides you with a safety net that’s active only when needed.

Optionally, you could later configure an AWS Lambda function to consume from that DLQ. The function is only invoked when messages exist in the queue, allowing you to maintain a serverless stack to recover from a potential failure.

DLQ configured per target

DLQ configured per target

With DLQ configured, the queue receives the event that failed in the message with important metadata that you can use to troubleshoot the issue. This can include: Error Code, Error Message, Exhausted Retry Condition, Retry Attempts, Rule ARN, and the Target ARN.

You can use this data to more easily troubleshoot what went wrong with the original delivery attempt and take action to resolve or prevent such failures in the future. You could also use the information such as Exhausted Retry Condition and Retry Attempts to further tweak your custom retry policy.

You can configure a DLQ when creating or updating rules via the AWS Management Console and AWS Command Line Interface (AWS CLI). You can also use infrastructure as code (IaC) tools such as AWS CloudFormation.

In the console, select the queue to be used for your DLQ configuration from the drop-down as shown here:

DLQ configuration

DLQ configuration

When configured via API, AWS CLI, or IaC tools, you must specify the ARN of the queue:

arn:aws:sqs:us-east-1:123456789012:orders-bus-shipping-service-dlq

When you configure a DLQ, the target SQS queue requires a resource-based policy that grants EventBridge access. One is created and applied automatically via the console when you create or update an EventBridge rule with a DLQ that exists in your own account.

For any queues created in other accounts, or via API, AWS CLI, or IaC tools, you must add a policy that allows SQS’s SendMessage permission to the EventBridge rule ARN, as shown below:

{
  "Sid": "Dead-letter queue permissions",
  "Effect": "Allow",
  "Principal": {
     "Service": "events.amazonaws.com"
  },
  "Action": "sqs:SendMessage",
  "Resource": "arn:aws:sqs:us-east-1:123456789012:orders-bus-shipping-service-dlq",
  "Condition": {
    "ArnEquals": {
      "aws:SourceArn": "arn:aws:events:us-east-1:123456789012:rule/MyTestRule"
    }
  }
}

You can read more about setting permissions for DLQ the documentation for “Granting permissions to the dead-letter queue”.

Once configured, you can monitor CloudWatch metrics for the DLQ queue. This shows both the successful delivery of messages via the InvocationsSentToDLQ metric, in addition to any failures via the InvocationsFailedToBeSentToDLQ. Note that these metrics do not exist if your queue is not considered “active”.

Retry policies

By default, EventBridge retries delivery of an event to a target so long as it does not receive a client-side error as described earlier. Retries occur with a back-off, for up to 185 attempts or for up to 24 hours, after which the event is dropped or sent to a DLQ, if configured. Due to the jitter of the back-off and retry process you may reach the 24-hour limit before reaching 185 retries.

For many workloads, this provides an acceptable way to handle momentary service issues or throttling that might occur. For some however, this model of back-off and retry can cause increased and on-going traffic to an already overloaded target system.

For example, consider an Amazon API Gateway target that has a resource constrained backend service behind it.

Constrained target service

Constrained target service

Under a consistently high load, the bus could end up generating too many API requests, tripping the API Gateway’s throttling configuration. This would cause API Gateway to respond with throttling errors back to EventBridge.

Throttled API reply

Throttled API reply

You may decide that allowing the failed events to retry for 24 hours puts too much load into this system and it may not properly recover from the load. This could lead to potential data loss unless a DLQ was configured.

Added DLQ

Added DLQ

With a DLQ, you could choose to process these events later, once the overwhelmed target service has recovered.

DLQ drained back to API

DLQ drained back to API

Or the events in question may no longer have the same value as they did previously. This can occur in systems where data loss is tolerated but the timeliness of data processing matters. In these situations, the DLQ would have less value and dropping the message is acceptable.

For either of these situations, configuring the maximum number of retries or the maximum age of the event could be useful.

Now with retry policies, you can configure per target the following two attributes:

  • MaximumEventAgeInSeconds: between 60 and 86400 seconds (86400, or 24 hours the default)
  • MaximumRetryAttempts: between 0 and 185 (185 is the default)

When either condition is met, the event fails. It’s then either dropped, which generates an increase to the FailedInvocations CloudWatch metric, or sent to a configured DLQ.

You can configure retry policy attributes when creating or updating rules via the AWS Management Console and AWS Command Line Interface (AWS CLI). You can also use infrastructure as code (IaC) tools such as AWS CloudFormation.

Retry policy

Retry policy

There is no additional cost for configuring either of these new capabilities. You only pay for the usage of the SQS standard queue configured as the dead letter queue during a failure and any application that handles the failed events. SQS pricing can be found here.

Conclusion

With dead letter queues and custom retry policies, you have improved handling and control over failure in distributed systems built with EventBridge. With DLQs you can capture failed events and then process them later, potentially saving yourself from data loss. With custom retry policies, you gain the improved ability to control the number of retries and for how long they can be retried.

I encourage you to explore how both of these new capabilities can help make your applications more resilient to failures, and to standardize on using them both in your infrastructure.

For more serverless learning resources, visit https://serverlessland.com.