Enhanced Amazon CloudWatch metrics for Amazon EventBridge

This post is written by Vaibhav Shah, Sr. Solutions Architect.

Customers use event-driven architectures to orchestrate and automate their event flows from producers to consumers. Amazon EventBridge acts as a serverless event router for various targets based on event rules. It decouples the producers and consumers, allowing customers to build asynchronous architectures.

EventBridge provides metrics that let you monitor your events. Examples include the number of partner events ingested, the number of invocations that failed permanently, the number of times a rule invoked a target in response to an event, and the number of events that matched any rule.

In response to customer requests, EventBridge has added metrics that give customers more visibility into their event flows. This blog post explains these new capabilities.

What’s new?

The new EventBridge metrics cover three areas: API calls, events, and invocations. They give you insight into the total number of events published, events published successfully, failed events, events matched with any rule or a specific rule, events rejected because of throttling, latency, and invocations.

This allows you to track the entire span of event flow within EventBridge and quickly identify and resolve issues as they arise.

EventBridge now has the following metrics:

Metric	Description	Dimensions and Units
PutEventsLatency	The time taken per PutEvents API operation	None Units: Milliseconds
PutEventsRequestSize	The size of the PutEvents API request in bytes	None Units: Bytes
MatchedEvents	Number of events that matched with any rule, or a specific rule	None, RuleName, EventBusName, EventSourceName Units: Count
ThrottledRules	The number of times rule execution was throttled.	None, RuleName Units: Count
PutEventsApproximateCallCount	Approximate total number of PutEvents API calls.	None Units: Count
PutEventsApproximateThrottledCount	Approximate number of throttled requests in PutEvents API calls.	None Units: Count
PutEventsApproximateFailedCount	Approximate number of failed PutEvents API calls.	None Units: Count
PutEventsApproximateSuccessCount	Approximate number of successful PutEvents API calls.	None Units: Count
PutEventsEntriesCount	The number of event entries contained in a PutEvents request.	None Units: Count
PutEventsFailedEntriesCount	The number of event entries contained in a PutEvents request that failed to be ingested.	None Units: Count
PutPartnerEventsApproximateCallCount	Approximate total number of PutPartnerEvents API calls. (visible in Partner’s account)	None Units: Count
PutPartnerEventsApproximateThrottledCount	Approximate number of throttled requests in PutPartnerEvents API calls. (visible in Partner’s account)	None Units: Count
PutPartnerEventsApproximateFailedCount	Approximate number of failed PutPartnerEvents API calls. (visible in Partner’s account)	None Units: Count
PutPartnerEventsApproximateSuccessCount	Approximate number of successful PutPartnerEvents API calls. (visible in Partner’s account)	None Units: Count
PutPartnerEventsEntriesCount	The number of event entries contained in a PutPartnerEvents request.	None Units: Count
PutPartnerEventsFailedEntriesCount	The number of event entries contained in a PutPartnerEvents request that failed to be ingested.	None Units: Count
PutPartnerEventsLatency	The time taken per PutPartnerEvents API operation (visible in Partner’s account)	None Units: Milliseconds
InvocationsCreated	Number of times a target is invoked by a rule in response to an event. One invocation attempt represents a single count for this metric.	None Units: Count
InvocationAttempts	Number of times EventBridge attempted invoking a target.	None Units: Count
SuccessfulInvocationAttempts	Number of times target was successfully invoked.	None Units: Count
RetryInvocationAttempts	The number of times a target invocation has been retried.	None Units: Count
IngestionToInvocationStartLatency	The time to process events, measured from when an event is ingested by EventBridge to the first invocation of a target.	None, RuleName, EventBusName Units: Milliseconds
IngestionToInvocationCompleteLatency	The time taken from event ingestion to completion of the first successful invocation attempt.	None, RuleName, EventBusName Units: Milliseconds

Use-cases for these metrics

These new metrics help you improve observability and monitoring of your event-driven applications. You can proactively monitor metrics that help you understand the event flow, invocations, latency, and service utilization. You can also set up alerts on specific metrics and take necessary actions, which help improve your application performance, proactively manage quotas, and improve resiliency.

Monitor service usage based on Service Quotas

The PutEventsApproximateCallCount metric in the events family helps you identify the approximate number of events published on the event bus using the PutEvents API action. The PutEventsApproximateSuccessCount metric shows the approximate number of successful events published on the event bus.

Similarly, you can monitor throttled and failed event counts with PutEventsApproximateThrottledCount and PutEventsApproximateFailedCount, respectively. These metrics show whether you are approaching your quota for PutEvents. You can create a CloudWatch alarm with a threshold close to your account quota. When the alarm fires, send a notification through Amazon SNS to your operations team, who can then request a Service Quotas increase.
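The alarm described above can be sketched with the AWS SDK for Python (boto3). This is a minimal illustration, not a definitive setup: the alarm name, threshold, period, and SNS topic ARN are hypothetical placeholders, and the sketch assumes the metric is published in the AWS/Events namespace without dimensions, as listed in the table.

```python
def build_put_events_alarm(metric_name, threshold, sns_topic_arn, period_seconds=60):
    """Return keyword arguments for the CloudWatch PutMetricAlarm API."""
    return {
        "AlarmName": f"{metric_name}Alarm",   # hypothetical naming scheme
        "Namespace": "AWS/Events",
        "MetricName": metric_name,
        "Statistic": "Sum",                   # total count per period
        "Period": period_seconds,
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",   # no data means no traffic, not a breach
        "AlarmActions": [sns_topic_arn],      # SNS topic that notifies the operations team
    }


def create_alarm(params):
    """Create or update the alarm. Requires boto3 and AWS credentials."""
    import boto3  # imported here so the builder above has no AWS dependency

    boto3.client("cloudwatch").put_metric_alarm(**params)
```

For example, `create_alarm(build_put_events_alarm("PutEventsApproximateThrottledCount", 1, topic_arn))` would notify the topic as soon as any PutEvents call is throttled within a one-minute period. To alarm before throttling starts, pass PutEventsApproximateCallCount with a threshold just under your quota for the chosen period.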

You can also set an alarm on the PutEvents throttle limit in transactions per second service quota.

Navigate to the Service Quotas console. On the left pane, choose AWS services, search for EventBridge, and select Amazon EventBridge (CloudWatch Events).
In the Monitoring section, you can monitor the percentage utilization of the PutEvents throttle limit in transactions per second.
Go to the Alarms tab, and choose Create alarm. In Alarm threshold, choose 80% of the applied quota value from the dropdown. Set the Alarm name to PutEventsThrottleAlarm, and choose Create.
To be notified if this threshold is breached, navigate to the Amazon CloudWatch Alarms console and choose PutEventsThrottleAlarm.
Select the Actions dropdown from the top right corner, and choose Edit.
On the Specify metric and conditions page, under Conditions, make sure that the Threshold type is selected as Static and the % Utilization selected as Greater/Equal than 80. Choose Next.
Configure actions to send notifications to an Amazon SNS topic and choose Next.
The Alarm name should already be set to PutEventsThrottleAlarm. Choose Next, then choose Update alarm.

This notifies you when the percentage utilization of the PutEvents throttle limit in transactions per second reaches the alarm threshold. You can then request a Service Quotas increase if required.

Similarly, you can also create CloudWatch alarms on percentage utilization of Invocations throttle limit in transactions per second against the service quota.

Enhanced observability

The PutEventsLatency metric shows the time taken per PutEvents API operation. There are two additional latency metrics, IngestionToInvocationStartLatency and IngestionToInvocationCompleteLatency. The first shows the time to process events, measured from when an event is ingested by EventBridge to the first invocation of a target. The second shows the time taken from event ingestion to completion of the first successful invocation attempt.

These metrics help you identify latency issues between the time an event is ingested and the time it reaches the target, broken down by RuleName. When latency is high, they show you where the delay occurs so that you can take appropriate action.
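As a rough sketch of how you might inspect these latency metrics for a single rule, the following uses the CloudWatch GetMetricStatistics API through boto3. The function names are hypothetical, the metric name is passed in as a parameter (use the exact spelling shown in your CloudWatch console), and the sketch assumes the metric is available with RuleName as its only dimension. A rule on a custom event bus may need the EventBusName dimension as well.

```python
from datetime import datetime, timedelta, timezone


def build_latency_query(metric_name, rule_name=None, minutes=60, now=None):
    """Return keyword arguments for the CloudWatch GetMetricStatistics API."""
    end = now or datetime.now(timezone.utc)
    query = {
        "Namespace": "AWS/Events",
        "MetricName": metric_name,
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": 300,                         # five-minute buckets
        "Statistics": ["Average", "Maximum"],  # values are in milliseconds
    }
    if rule_name is not None:
        # Assumption: RuleName alone is a valid dimension set for this metric.
        query["Dimensions"] = [{"Name": "RuleName", "Value": rule_name}]
    return query


def slowest_datapoint(datapoints):
    """Return the datapoint with the highest Maximum, or None if there are none."""
    return max(datapoints, key=lambda point: point["Maximum"], default=None)


def fetch_datapoints(query):
    """Run the query. Requires boto3 and AWS credentials."""
    import boto3  # imported here so the helpers above have no AWS dependency

    return boto3.client("cloudwatch").get_metric_statistics(**query)["Datapoints"]
```

Calling `slowest_datapoint(fetch_datapoints(build_latency_query(name, rule_name="my-rule")))` returns the five-minute bucket with the worst latency in the last hour, where `my-rule` is a placeholder rule name.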

You can set alarm thresholds on these metrics and, when a threshold is breached, trigger actions that help you recover from potential failures. One such action is to route subsequent events to EventBridge in a secondary Region using EventBridge global endpoints.

Sometimes, events are not delivered to the target specified in the rule. This can be because the target resource is unavailable, you don’t have permission to invoke the target, or there are network issues. In such scenarios, EventBridge by default retries sending these events to the target for 24 hours and up to 185 times. Both limits are configurable.

The new RetryInvocationAttempts metric shows the number of times EventBridge has retried invoking the target. Retries occur when requests are throttled, when the target service has availability issues, and when there are network issues or service failures. This gives you additional observability, and you can use the metric to trigger a CloudWatch alarm that notifies your teams when a threshold is crossed. If the retries are exhausted, store the failed events in an Amazon SQS dead-letter queue so that you can process them later.
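The retry limits and the dead-letter queue are both set on the rule's target. The following is a minimal sketch using the EventBridge PutTargets API through boto3. The function names, target ID, and ARNs are hypothetical placeholders, and the SQS queue is assumed to already exist with a resource policy that allows EventBridge to send messages to it.

```python
def build_target(target_id, target_arn, dlq_arn, max_retries=185, max_age_seconds=86400):
    """Return one target entry for PutTargets with a retry policy and a DLQ."""
    # The API accepts 0 to 185 retry attempts and an event age of 60 to 86400 seconds.
    if not 0 <= max_retries <= 185:
        raise ValueError("max_retries must be between 0 and 185")
    if not 60 <= max_age_seconds <= 86400:
        raise ValueError("max_age_seconds must be between 60 and 86400")
    return {
        "Id": target_id,
        "Arn": target_arn,
        "RetryPolicy": {
            "MaximumRetryAttempts": max_retries,
            "MaximumEventAgeInSeconds": max_age_seconds,
        },
        # Events that exhaust their retries are sent to this SQS queue.
        "DeadLetterConfig": {"Arn": dlq_arn},
    }


def attach_target(rule_name, target, event_bus_name="default"):
    """Attach the target to a rule. Requires boto3 and AWS credentials."""
    import boto3  # imported here so the builder above has no AWS dependency

    response = boto3.client("events").put_targets(
        Rule=rule_name, EventBusName=event_bus_name, Targets=[target]
    )
    return response["FailedEntryCount"]  # 0 means the target was attached
```

Lowering `max_retries` or `max_age_seconds` moves undeliverable events to the dead-letter queue sooner, which can be useful when stale events have no value to the consumer.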

In addition, EventBridge supports additional dimensions such as DetailType, Source, and RuleName for the MatchedEvents metric. This helps you monitor the number of matched events coming from different sources.

Navigate to the Amazon CloudWatch console. On the left pane, choose Metrics, and All metrics.
In the Browse section, select Events, and Source.
From the Graphed metrics tab, you can monitor matched events coming from different sources.

Failover events to secondary Region

The PutEventsFailedEntriesCount metric shows the number of events that failed ingestion. Monitor this metric and set a CloudWatch alarm. If it crosses a defined threshold, you can then take appropriate action.

Also, set an alarm on the PutEventsApproximateThrottledCount metric, which shows the number of events that are rejected because of throttling constraints. For these ingestion failures, the client must resend the failed events to the event bus so that every event critical to your application is processed.
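A client-side resend can be sketched as follows. The PutEvents response contains one result per request entry, in the same order, and a failed entry carries an ErrorCode. The function names and backoff values here are hypothetical, and the sketch assumes the caller already splits events into batches that fit within a single PutEvents request. It does not handle exceptions raised when an entire request is rejected.

```python
import time


def failed_entries(entries, response):
    """Return the request entries whose matching result carries an ErrorCode."""
    results = response.get("Entries", [])
    return [entry for entry, result in zip(entries, results) if "ErrorCode" in result]


def put_with_resend(put_events, entries, max_attempts=3, base_delay=0.5):
    """Send entries and resend only the failed ones, backing off between attempts.

    put_events is any callable with the PutEvents keyword signature, for example
    boto3.client("events").put_events. Returns the entries that still failed.
    """
    pending = list(entries)
    for attempt in range(max_attempts):
        response = put_events(Entries=pending)
        pending = failed_entries(pending, response)
        if not pending:
            return []
        if attempt < max_attempts - 1:
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
    return pending
```

Any entries returned by `put_with_resend` have failed every attempt, so the caller can log them or store them for later processing.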

Alternatively, send events to the EventBridge service in a secondary Region using Amazon EventBridge global endpoints to improve the resiliency of your event-driven applications.

Conclusion

This blog post shows how to use these new metrics to improve the visibility of event flows in your event-driven applications. They help you monitor events more effectively, from ingestion until delivery to the target, and improve observability by letting you alert proactively on key metrics.

For more serverless learning resources, visit Serverless Land.

AWS Compute Blog