AWS Compute Blog

Enhanced Amazon CloudWatch metrics for Amazon EventBridge

This post is written by Vaibhav Shah, Sr. Solutions Architect.

Customers use event-driven architectures to orchestrate and automate their event flows from producers to consumers. Amazon EventBridge acts as a serverless event router for various targets based on event rules. It decouples the producers and consumers, allowing customers to build asynchronous architectures.

EventBridge provides metrics that let you monitor your events. These include the number of partner events ingested, the number of invocations that failed permanently, the number of times a target is invoked by a rule in response to an event, and the number of events that matched any rule.

In response to customer requests, EventBridge has added metrics that give customers more visibility into their events. This blog post explains these new capabilities.

What’s new?

The new EventBridge metrics fall mainly into three groups: API, events, and invocations. They give you insight into the total number of events published, the number published successfully, the number that failed, the number that matched any rule or a specific rule, the number rejected because of throttling, latency, and invocation counts.

This allows you to track the entire span of event flow within EventBridge and quickly identify and resolve issues as they arise.

EventBridge now has the following metrics:

Metric | Description | Dimensions | Units
--- | --- | --- | ---
PutEventsLatency | The time taken per PutEvents API operation. | None | Milliseconds
PutEventsRequestSize | The size of the PutEvents API request in bytes. | None | Bytes
MatchedEvents | Number of events that matched with any rule, or a specific rule. | None, RuleName, EventBusName, EventSourceName | Count
ThrottledRules | The number of times rule execution was throttled. | None, RuleName | Count
PutEventsApproximateCallCount | Approximate total number of PutEvents API calls. | None | Count
PutEventsApproximateThrottledCount | Approximate number of throttled requests in PutEvents API calls. | None | Count
PutEventsApproximateFailedCount | Approximate number of failed PutEvents API calls. | None | Count
PutEventsApproximateSuccessCount | Approximate number of successful PutEvents API calls. | None | Count
PutEventsEntriesCount | The number of event entries contained in a PutEvents request. | None | Count
PutEventsFailedEntriesCount | The number of event entries contained in a PutEvents request that failed to be ingested. | None | Count
PutPartnerEventsApproximateCallCount | Approximate total number of PutPartnerEvents API calls. (Visible in the partner's account.) | None | Count
PutPartnerEventsApproximateThrottledCount | Approximate number of throttled requests in PutPartnerEvents API calls. (Visible in the partner's account.) | None | Count
PutPartnerEventsApproximateFailedCount | Approximate number of failed PutPartnerEvents API calls. (Visible in the partner's account.) | None | Count
PutPartnerEventsApproximateSuccessCount | Approximate number of successful PutPartnerEvents API calls. (Visible in the partner's account.) | None | Count
PutPartnerEventsEntriesCount | The number of event entries contained in a PutPartnerEvents request. | None | Count
PutPartnerEventsFailedEntriesCount | The number of event entries contained in a PutPartnerEvents request that failed to be ingested. | None | Count
PutPartnerEventsLatency | The time taken per PutPartnerEvents API operation. (Visible in the partner's account.) | None | Milliseconds
InvocationsCreated | Number of times a target is invoked by a rule in response to an event. One invocation attempt represents a single count for this metric. | None | Count
InvocationAttempts | Number of times EventBridge attempted invoking a target. | None | Count
SuccessfulInvocationAttempts | Number of times a target was successfully invoked. | None | Count
RetryInvocationAttempts | The number of times a target invocation has been retried. | None | Count
IngestiontoInvocationStartLatency | The time to process events, measured from when an event is ingested by EventBridge to the first invocation of a target. | None, RuleName, EventBusName | Milliseconds
IngestiontoInvocationCompleteLatency | The time taken from event ingestion to completion of the first successful invocation attempt. | None, RuleName, EventBusName | Milliseconds

Use-cases for these metrics

These new metrics help you improve observability and monitoring of your event-driven applications. You can proactively monitor metrics that help you understand the event flow, invocations, latency, and service utilization. You can also set up alerts on specific metrics and take necessary actions, which help improve your application performance, proactively manage quotas, and improve resiliency.

Monitor service usage based on Service Quotas

The PutEventsApproximateCallCount metric in the events family helps you identify the approximate number of PutEvents API calls made to publish events on the event bus. The PutEventsApproximateSuccessCount metric shows the approximate number of those calls that succeeded.

Similarly, you can monitor throttled and failed calls with PutEventsApproximateThrottledCount and PutEventsApproximateFailedCount respectively. These metrics let you see whether you are approaching your quota for PutEvents. You can create a CloudWatch alarm with a threshold close to your account quota. When the alarm is triggered, send a notification to your operations team using Amazon SNS. The team can then request a Service Quotas increase.
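As a rough illustration of the alarm described above, the following sketch builds the parameters for the CloudWatch PutMetricAlarm API. The alarm name, SNS topic ARN, threshold, and period are hypothetical placeholders, and the choice of the Sum statistic is an assumption rather than a recommendation from this post. The dictionary is built separately from the API call so that it can be inspected without AWS credentials.

```python
from typing import Any, Dict


def build_throttle_alarm(
    sns_topic_arn: str,
    threshold: float,
    period_seconds: int = 60,
) -> Dict[str, Any]:
    """Return keyword arguments for the CloudWatch PutMetricAlarm API."""
    return {
        "AlarmName": "PutEventsThrottledCountAlarm",  # hypothetical name
        "Namespace": "AWS/Events",  # namespace for EventBridge metrics
        "MetricName": "PutEventsApproximateThrottledCount",
        "Statistic": "Sum",  # assumption: total throttled requests per period
        "Period": period_seconds,
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",  # no data means nothing was throttled
        "AlarmActions": [sns_topic_arn],  # SNS topic notified when in ALARM
    }


# Hypothetical topic ARN and threshold, for illustration only.
params = build_throttle_alarm(
    "arn:aws:sns:us-east-1:123456789012:ops-alerts", threshold=1
)

# With boto3 installed and credentials configured, the alarm could be created with:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

The same helper could be pointed at PutEventsApproximateFailedCount by changing the metric name.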

You can also set an alarm on the PutEvents throttle limit in transactions per second service quota.

  1. Navigate to the Service Quotas console. On the left pane, choose AWS services, search for EventBridge, and select Amazon EventBridge (CloudWatch Events).
  2. In the Monitoring section, you can monitor the percentage utilization of the PutEvents throttle limit in transactions per second.
  3. Go to the Alarms tab, and choose Create alarm. In Alarm threshold, choose 80% of the applied quota value from the dropdown. Set the Alarm name to PutEventsThrottleAlarm, and choose Create.
  4. To be notified if this threshold is breached, navigate to the Amazon CloudWatch Alarms console and choose PutEventsThrottleAlarm.
  5. Select the Actions dropdown from the top right corner, and choose Edit.
  6. On the Specify metric and conditions page, under Conditions, make sure that the Threshold type is selected as Static and the % Utilization selected as Greater/Equal than 80. Choose Next.
  7. Configure actions to send notifications to an Amazon SNS topic and choose Next.
  8. The Alarm name should be already set to PutEventsThrottleAlarm. Choose Next, then choose Update alarm.

This notifies you when the percentage utilization of the PutEvents throttle limit in transactions per second reaches the threshold you set. You can then request a Service Quotas increase if required.
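To make the 80% threshold concrete, here is a small worked calculation. It is only an approximation: it averages the call count over a period, and the post does not state how the Service Quotas console computes utilization. The quota of 600 transactions per second is a made-up value for illustration.

```python
def quota_utilization_percent(
    call_count: float, period_seconds: int, quota_tps: float
) -> float:
    """Average transactions per second over the period, as a percentage of the quota."""
    if period_seconds <= 0 or quota_tps <= 0:
        raise ValueError("period_seconds and quota_tps must be positive")
    average_tps = call_count / period_seconds
    return 100.0 * average_tps / quota_tps


def breaches_threshold(utilization_percent: float, threshold_percent: float = 80.0) -> bool:
    """Mirror the Greater/Equal comparison used by the alarm."""
    return utilization_percent >= threshold_percent


# 30,000 PutEvents calls in 60 seconds is 500 TPS on average.
# Against a hypothetical quota of 600 TPS, that is about 83% utilization.
utilization = quota_utilization_percent(30_000, 60, 600)
alarm_would_fire = breaches_threshold(utilization)
```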

Similarly, you can also create CloudWatch alarms on percentage utilization of Invocations throttle limit in transactions per second against the service quota.


Enhanced observability

The PutEventsLatency metric shows the time taken per PutEvents API operation. There are two additional latency metrics, IngestiontoInvocationStartLatency and IngestiontoInvocationCompleteLatency. The first shows the time to process events, measured from when an event is ingested by EventBridge to the first invocation of a target. The second shows the time taken from event ingestion to completion of the first successful invocation attempt.

Together, these metrics help you identify latency issues anywhere between ingestion of an event and its delivery to the target, and you can break them down by RuleName. If latency is high, they show you where the delay occurs so that you can take appropriate action.
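One way to pull these three latency metrics together is the CloudWatch GetMetricData API. The sketch below only builds the query list. The Average statistic, the 5-minute period, and the rule name are illustrative assumptions. It also assumes that RuleName on its own is a published dimension set for the two ingestion latency metrics, which you should confirm in the CloudWatch console for your account.

```python
from typing import Any, Dict, List, Optional

LATENCY_METRICS = (
    "PutEventsLatency",
    "IngestiontoInvocationStartLatency",
    "IngestiontoInvocationCompleteLatency",
)


def build_latency_queries(
    rule_name: Optional[str] = None,
    period_seconds: int = 300,
    stat: str = "Average",
) -> List[Dict[str, Any]]:
    """Return MetricDataQueries for the three EventBridge latency metrics."""
    queries = []
    for index, metric_name in enumerate(LATENCY_METRICS):
        dimensions = []
        # PutEventsLatency is listed without dimensions, so RuleName is only
        # applied to the two ingestion-to-invocation metrics.
        if rule_name and metric_name != "PutEventsLatency":
            dimensions.append({"Name": "RuleName", "Value": rule_name})
        queries.append(
            {
                "Id": f"latency{index}",  # must start with a lowercase letter
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/Events",
                        "MetricName": metric_name,
                        "Dimensions": dimensions,
                    },
                    "Period": period_seconds,
                    "Stat": stat,
                },
                "ReturnData": True,
            }
        )
    return queries


# "order-created-rule" is a hypothetical rule name.
queries = build_latency_queries(rule_name="order-created-rule")

# With boto3 installed and credentials configured:
#   import boto3, datetime
#   end = datetime.datetime.now(datetime.timezone.utc)
#   boto3.client("cloudwatch").get_metric_data(
#       MetricDataQueries=queries,
#       StartTime=end - datetime.timedelta(hours=1),
#       EndTime=end,
#   )
```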


You can set thresholds on these metrics, and when a threshold is breached, the actions you define can help recover from potential failures. One such action is to send subsequent events to EventBridge in the secondary Region using EventBridge global endpoints.

Sometimes, events are not delivered to the target specified in the rule. This can happen because the target resource is unavailable, you don’t have permission to invoke the target, or there are network issues. In such scenarios, EventBridge retries sending these events to the target for 24 hours or up to 185 times, both of which are configurable.

The new RetryInvocationAttempts metric shows the number of times EventBridge has retried invoking the target. Retries occur when requests are throttled, when the target service has availability issues, and when there are network issues or service failures. This gives customers additional observability, and you can use the metric to trigger a CloudWatch alarm that notifies teams when a chosen threshold is crossed. To keep events whose retries are exhausted, configure an Amazon SQS dead-letter queue so that the failed events are stored and can be processed later.
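The retry limits and the dead-letter queue are both set per target. As a sketch, the helper below builds one target entry for the EventBridge PutTargets API, using the RetryPolicy and DeadLetterConfig fields. All ARNs and names are hypothetical, and the range checks reflect the limits mentioned above (up to 185 attempts and up to 24 hours, which is 86,400 seconds) plus the API's 60-second minimum event age.

```python
from typing import Any, Dict


def build_target(
    target_id: str,
    target_arn: str,
    dlq_arn: str,
    max_retry_attempts: int = 185,
    max_event_age_seconds: int = 86_400,
) -> Dict[str, Any]:
    """Return one entry for the Targets list of the PutTargets API."""
    if not 0 <= max_retry_attempts <= 185:
        raise ValueError("max_retry_attempts must be between 0 and 185")
    if not 60 <= max_event_age_seconds <= 86_400:
        raise ValueError("max_event_age_seconds must be between 60 and 86400")
    return {
        "Id": target_id,
        "Arn": target_arn,
        "RetryPolicy": {
            "MaximumRetryAttempts": max_retry_attempts,
            "MaximumEventAgeInSeconds": max_event_age_seconds,
        },
        # Events that exhaust their retries are sent to this SQS queue.
        "DeadLetterConfig": {"Arn": dlq_arn},
    }


# Hypothetical ARNs: retry for at most one hour or 10 attempts.
target = build_target(
    target_id="order-processor",
    target_arn="arn:aws:lambda:us-east-1:123456789012:function:order-processor",
    dlq_arn="arn:aws:sqs:us-east-1:123456789012:order-events-dlq",
    max_retry_attempts=10,
    max_event_age_seconds=3_600,
)

# With boto3 installed and credentials configured:
#   import boto3
#   boto3.client("events").put_targets(
#       Rule="order-created-rule", EventBusName="default", Targets=[target]
#   )
```

The SQS queue also needs a resource policy that allows EventBridge to send messages to it, which this sketch does not cover.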

In addition, EventBridge adds dimensions such as DetailType, Source, and RuleName to the MatchedEvents metric. This helps you monitor the number of matched events coming from different sources.

  1. Navigate to the Amazon CloudWatch console. On the left pane, choose Metrics, and then All metrics.
  2. In the Browse section, select Events, and Source.
  3. From the Graphed metrics tab, you can monitor matched events coming from different sources.

Failover events to secondary Region

The PutEventsFailedEntriesCount metric shows the number of events that failed ingestion. Monitor this metric and set a CloudWatch alarm. If it crosses a defined threshold, you can then take appropriate action.

Also, set an alarm on the PutEventsApproximateThrottledCount metric, which shows the approximate number of PutEvents requests rejected because of throttling. For these ingestion failures, the client must resend the failed events to the event bus, so that every event critical to your application is processed.
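On the client side, resending starts with working out which entries failed. A PutEvents response lists one result per request entry, in the same order, and a failed entry carries an ErrorCode instead of an EventId. The sketch below relies on that ordering to select the entries to resend. The sample events and response are made up. It assumes that a request rejected outright, for example by API throttling, surfaces as an SDK exception instead, which is not handled here.

```python
from typing import Any, Dict, List


def failed_entries(
    request_entries: List[Dict[str, Any]],
    response: Dict[str, Any],
) -> List[Dict[str, Any]]:
    """Return the request entries whose matching response entry reports an error."""
    results = response.get("Entries", [])
    return [
        entry
        for entry, result in zip(request_entries, results)
        if "ErrorCode" in result
    ]


# Hypothetical request entries and a hand-written response for illustration.
entries = [
    {"Source": "app.orders", "DetailType": "OrderPlaced", "Detail": "{}"},
    {"Source": "app.orders", "DetailType": "OrderShipped", "Detail": "{}"},
]
sample_response = {
    "FailedEntryCount": 1,
    "Entries": [
        {"EventId": "11111111-2222-3333-4444-555555555555"},
        {"ErrorCode": "InternalFailure", "ErrorMessage": "example failure"},
    ],
}

to_retry = failed_entries(entries, sample_response)

# With boto3, the response would come from:
#   response = boto3.client("events").put_events(Entries=entries)
# and to_retry could be resent with another put_events call, ideally with
# a delay between attempts and a cap on the number of attempts.
```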

Alternatively, send events to the EventBridge service in the secondary Region using Amazon EventBridge global endpoints to improve the resiliency of your event-driven applications.

Conclusion

This blog post shows how to use these new metrics to improve the visibility of event flows in your event-driven applications. The metrics help you monitor events more effectively, from ingestion until delivery to the target, and let you improve observability by proactively alerting on key metrics.

For more serverless learning resources, visit Serverless Land.