Monitoring Amazon EBS volume failures using Amazon EventBridge

A top priority for customers is to make sure that their applications are highly resilient and durable to meet their business goals. To make sure that applications are working and performing as expected, you can use monitoring tools such as Amazon CloudWatch metrics and events to track the health of your applications. Monitoring and alarming can also enable customers to quickly respond to anomalies. For example, one anomaly could be an infrastructure failure, from which customers can use snapshots or local backups to restore and recover. However, if there are delays in noticing infrastructure failures, then they can lead to a longer downtime and cause significant business loss. Therefore, you must set up all of the proper and necessary alarms that can notify you to react to failures in a timely manner.

Amazon Elastic Block Store (Amazon EBS) provides easy-to-use, highly durable, resilient, scalable, high-performance block-level storage for use with Amazon Elastic Compute Cloud (Amazon EC2), as well as many other AWS managed services. EBS replicates volume data across multiple servers within the same Availability Zone (AZ) to prevent the loss of data from any single component failure. However, durability risks still exist when multiple component failures occur before redundancy is restored. In the rare event of a multiple component volume failure, the data associated with the volume is unrecoverable and the volume moves to an “error” state. Although you can’t recover a volume in an error state, you can restore the lost data from a local backup or an Amazon EBS Snapshot.

In this post, we share a solution that helps you receive the notifications and alarms needed for faster recovery from a volume failure. Alerts can be delivered to endpoints such as Slack, Microsoft Teams, Amazon Chime, and email through Amazon Simple Notification Service (SNS) or Amazon Simple Queue Service (SQS). Additionally, these alerts let you build automation for handling “error” state events with an AWS Lambda function, so that you can recover more quickly from an infrastructure event.

Solution overview

When a volume goes into an “error” state, AWS sends a notification to your account’s health dashboard and an email notification to the primary contact of your AWS account. You can also add secondary contact details to receive an operational email by following the instructions here. However, this notification mechanism doesn’t scale, because notifications are sent only to the primary contact and, if one is configured, to a secondary contact. Many customers miss these notifications and fail to take timely actions, such as recovering their resources from a backup, thus causing unexpected business losses. Moreover, as customers continue to scale their mission critical workloads on Amazon EBS, many have asked for a scalable mechanism to get notified in the rare event that a volume moves into the “error” state.

Amazon EventBridge is a serverless event bus that can be used to ingest and process events from various sources, such as AWS services and SaaS applications. EventBridge lets developers build loosely coupled and independently scalable event-driven applications. This solution uses EventBridge to detect and react to AWS Health events. Once the solution is successfully implemented, you can get notified when a volume goes into an “error” state. Then, based on rules that you create, you can programmatically create custom automations based on “error” state events. For example, when your volume moves to the “error” state, you can route the event to the correct team, person, or system by using Amazon SNS or Amazon SQS. Additionally, you can use a Lambda function to automatically recover from a snapshot. This improves the time to resolution and helps customers gain efficiencies.

Walkthrough

To be notified when a volume goes into the “error” state, you must implement the following steps:

Log in to the AWS Management Console and go to the EventBridge service.
Create a new EventBridge Rule by navigating to the “Rules” section from the left panel.
Specify the rule details and event pattern. This step determines the Event Source, Event Type, and source AWS service.
Build the event pattern.
Specify the target to invoke when an event matches your event pattern, which in this case is when the volume goes into the “error” state.
Review your rule to make sure that it meets your event monitoring requirements, and select “Create Rule”. Doing this lets you get notified when a volume is in the “error” state.
Finally, view your newly created rule from the EventBridge console page.

Prerequisites

You must set up an AWS account with sufficient permissions to access EventBridge and create rules. Additionally, when using the EventBridge console, EventBridge will automatically configure the proper permissions for the selected targets.

Step 1: Log in to the Console and go to the EventBridge service

Log in to the Console and select the appropriate region.
Navigate to EventBridge console here.
To change the AWS Region, use the Region selector in the upper-right corner of the page. Choose the Region in which you want to track AWS Health events.

Step 2: Create a new EventBridge Rule

1. In the navigation pane, choose Rules.

2. Choose Create rule.

Step 3: Define rule details

1. On the Define rule detail page, enter a name and description for your rule.

Give your EventBridge rule a name, an optional description, an event bus and a rule type that will run based on the matching event pattern.

2. Keep the default values for Event bus and Rule type, and then choose Next.

Step 4: Build event pattern

1. On the Build event pattern page, for Event source, choose AWS events and EventBridge partner events.
2. Scroll down to Event pattern, for AWS service, choose Health.
3. For Event type, choose Specific Health events. This option creates a rule for events from a specific AWS service, such as Amazon EC2.
4. You can choose Any service or Specific service(s). In this case, choose Specific service(s), and then choose EBS.
5. Choose Specific event type category(s) and then choose issue from the list.
6. Choose Specific event type code(s) and then select AWS_EBS_VOLUME_LOST.
7. Choose one of the following options for affected resources.

- Choose Any resource to create a rule that applies to all resources.
- Choose Specific resource(s) and enter the Volume IDs. For example, you might specify an Amazon EBS volume ID, such as vol-EXAMPLEa1b2c3de4, to monitor for events that only affect this resource.

Your event pattern is now built. Choose Next to continue. The steps to set this up can also be found here.

Define the parameters to set alarms for. For this blog, we select EBS as the service and AWS_EBS_VOLUME_LOST as the event type code.
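The console selections above produce a JSON event pattern. The following is a minimal sketch of what that pattern might look like when built in Python. The helper name build_volume_lost_pattern is hypothetical, and the field names (source, detail-type, detail.service, detail.eventTypeCategory, detail.eventTypeCode, resources) are assumed to follow the usual AWS Health event format, so compare the output with the pattern that the console displays before relying on it.

```python
import json


def build_volume_lost_pattern(volume_ids=None):
    """Return an event pattern dict for AWS_EBS_VOLUME_LOST Health events.

    volume_ids is optional; when given, the pattern only matches events
    whose top-level "resources" list contains one of those volume IDs.
    """
    pattern = {
        "source": ["aws.health"],
        "detail-type": ["AWS Health Event"],
        "detail": {
            "service": ["EBS"],
            "eventTypeCategory": ["issue"],
            "eventTypeCode": ["AWS_EBS_VOLUME_LOST"],
        },
    }
    if volume_ids:
        # Corresponds to choosing "Specific resource(s)" in the console.
        pattern["resources"] = list(volume_ids)
    return pattern


# Print the pattern as JSON, for example to paste into the console's
# custom pattern editor.
print(json.dumps(build_volume_lost_pattern(), indent=2))
```

Calling build_volume_lost_pattern(["vol-EXAMPLEa1b2c3de4"]) narrows the pattern to that single volume, which mirrors the Specific resource(s) option.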

Step 5: Select Targets

1. On the Select target(s) page, choose the target type for this rule, and then configure any additional options that are required for that type. For example, you can send the event to an Amazon SQS queue or an Amazon SNS topic. You can also use a Lambda function to pass a notification to a Slack channel, or use Lambda and EventBridge to send custom text or SMS notifications with Amazon SNS when an AWS Health event occurs.
2. Choose Next.

This image shows how an SNS topic can be set up as a target for the rule.

3. Optionally, you can configure tags to search and filter your resources. On the Configure tags page, add any tags, and then choose Next.

This is where you can create an optional tag for the EventBridge rule
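If you prefer to script Steps 2 through 5 rather than use the console, the rule and its SNS target can be created with the EventBridge put_rule and put_targets API operations. The following is a rough sketch, not a definitive implementation. The function name create_rule_with_sns_target, the target ID volume-lost-sns, and the rule description are hypothetical, and the event pattern is assumed to match the one the console generates. The function takes an EventBridge client as a parameter (for example, boto3.client("events")). Unlike the console, the API does not configure target permissions for you, so the SNS topic's access policy must already allow events.amazonaws.com to publish.

```python
import json

# Assumed pattern for EBS volume-lost Health events; verify against the
# pattern shown in the EventBridge console.
VOLUME_LOST_PATTERN = {
    "source": ["aws.health"],
    "detail-type": ["AWS Health Event"],
    "detail": {
        "service": ["EBS"],
        "eventTypeCategory": ["issue"],
        "eventTypeCode": ["AWS_EBS_VOLUME_LOST"],
    },
}


def create_rule_with_sns_target(events_client, rule_name, topic_arn):
    """Create (or update) the rule on the default bus and attach an SNS topic.

    events_client is expected to behave like boto3.client("events").
    Returns the ARN of the rule.
    """
    rule = events_client.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(VOLUME_LOST_PATTERN),
        State="ENABLED",
        Description="Notify when an EBS volume enters the error state",
    )
    result = events_client.put_targets(
        Rule=rule_name,
        # "volume-lost-sns" is an arbitrary, hypothetical target ID.
        Targets=[{"Id": "volume-lost-sns", "Arn": topic_arn}],
    )
    if result.get("FailedEntryCount", 0) > 0:
        raise RuntimeError(f"Could not add target: {result['FailedEntries']}")
    return rule["RuleArn"]


# Example usage (requires boto3 and AWS credentials):
#   import boto3
#   create_rule_with_sns_target(
#       boto3.client("events"), "ebs-volume-lost", "<your SNS topic ARN>")
```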

Step 6: Review and create rule

1. On the Review and create page, review your rule setup and make sure that it meets your event monitoring requirements.

Review all your EventBridge configuration for the last time before creating the rule.

2. Choose Create rule.

Step 7: View the newly created rule

Once the rule creation is complete, you can navigate to the EventBridge console and select Rules under Buses in the left navigation panel.
Search for your newly created rule and make sure that it’s enabled.

Once you create the rule, you can see it in your Amazon EventBridge console.
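The same check can be done programmatically. This is a small sketch that assumes an EventBridge client such as boto3.client("events") is passed in; the helper name rule_is_enabled is hypothetical. It relies on the describe_rule operation, whose response includes a State field.

```python
def rule_is_enabled(events_client, rule_name):
    """Return True if the named rule on the default event bus is enabled.

    events_client is expected to behave like boto3.client("events").
    """
    rule = events_client.describe_rule(Name=rule_name)
    return rule.get("State") == "ENABLED"


# Example usage (requires boto3 and AWS credentials):
#   import boto3
#   print(rule_is_enabled(boto3.client("events"), "ebs-volume-lost"))
```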

Reactive measures to take when you notice the alarm

Once the alarm comes in, you must restore the volume that is in the error state from its latest snapshot. This allows for minimal business impact and downtime. To restore a volume from snapshot, follow the steps listed in the documentation for restoring an EBS volume from an Amazon EBS Snapshot.
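If you route the event to a Lambda function, the restore can be automated. The sketch below shows one possible approach and is not a complete recovery procedure. The helper names (affected_volume_ids, latest_completed_snapshot, restore_volume) are hypothetical. It assumes that the Health event lists affected volumes in its top-level resources field or under detail.affectedEntities[].entityValue, and that you already know the Availability Zone to restore into. The EC2 client (for example, boto3.client("ec2")) is passed in as a parameter. Pagination of describe_snapshots, volume type selection, and attaching the new volume to an instance are deliberately left out.

```python
def affected_volume_ids(event):
    """Collect volume IDs from a Health event (assumed event layout)."""
    ids = [r for r in event.get("resources", []) if r.startswith("vol-")]
    for entity in event.get("detail", {}).get("affectedEntities", []):
        value = entity.get("entityValue", "")
        if value.startswith("vol-") and value not in ids:
            ids.append(value)
    return ids


def latest_completed_snapshot(ec2_client, volume_id):
    """Return the most recent completed snapshot of the volume, or None.

    Only the first page of results is inspected in this sketch.
    """
    response = ec2_client.describe_snapshots(
        OwnerIds=["self"],
        Filters=[
            {"Name": "volume-id", "Values": [volume_id]},
            {"Name": "status", "Values": ["completed"]},
        ],
    )
    snapshots = response.get("Snapshots", [])
    if not snapshots:
        return None
    return max(snapshots, key=lambda s: s["StartTime"])


def restore_volume(ec2_client, volume_id, availability_zone):
    """Create a new volume from the latest snapshot and return its ID."""
    snapshot = latest_completed_snapshot(ec2_client, volume_id)
    if snapshot is None:
        raise LookupError(f"No completed snapshot found for {volume_id}")
    response = ec2_client.create_volume(
        SnapshotId=snapshot["SnapshotId"],
        AvailabilityZone=availability_zone,
    )
    return response["VolumeId"]
```

How recent the restored data is depends entirely on how often snapshots are taken, so this kind of automation is only as good as the snapshot schedule behind it.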

Cleaning up

If you no longer need this solution, delete the Amazon EventBridge rule. In addition, review any Amazon SNS topics and subscriptions that you created or reused while following this post, and delete those that you no longer need to avoid incurring charges.
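When cleaning up through the API rather than the console, note that EventBridge requires a rule's targets to be removed before the rule itself can be deleted. A minimal sketch, assuming an EventBridge client such as boto3.client("events") is passed in and that the rule has few enough targets to fit in one list_targets_by_rule response; the function name delete_rule_and_targets is hypothetical.

```python
def delete_rule_and_targets(events_client, rule_name):
    """Remove all targets from the rule, then delete the rule.

    events_client is expected to behave like boto3.client("events").
    Returns the list of target IDs that were removed.
    """
    response = events_client.list_targets_by_rule(Rule=rule_name)
    target_ids = [target["Id"] for target in response.get("Targets", [])]
    if target_ids:
        events_client.remove_targets(Rule=rule_name, Ids=target_ids)
    events_client.delete_rule(Name=rule_name)
    return target_ids


# Example usage (requires boto3 and AWS credentials):
#   import boto3
#   delete_rule_and_targets(boto3.client("events"), "ebs-volume-lost")
```

This does not delete the SNS topic or its subscriptions, which need to be reviewed separately.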

Conclusion

In this post, we presented a solution for monitoring Amazon EBS volume failures via EventBridge, and setting up alarms that can be sent via Amazon SNS. With this solution, you are far less likely to miss an alarm related to a volume failure, because notifications reach all of your designated endpoints rather than only the account contacts. You can then react to an alarm by quickly recovering to a new volume that you create from either a snapshot or a local backup. The ability to monitor for these alarms and react to them helps you maintain high resiliency for your mission critical applications.

Thank you for reading this blog. If you have any comments or questions, don’t hesitate to leave them in the comments section.