AWS Compute Blog

Automating Amazon EBS Snapshot Management with AWS Step Functions and Amazon CloudWatch Events

Brittany Doncaster, Solutions Architect

Business continuity is important for building mission-critical workloads on AWS. As an AWS customer, you might define recovery point objectives (RPO) and recovery time objectives (RTO) for different tier applications in your business. After the RPO and RTO requirements are defined, it is up to your architects to determine how to meet those requirements.

You probably store persistent data in Amazon EBS volumes, which live within a single Availability Zone. And, following best practices, you take snapshots of your EBS volumes to back up the data on Amazon S3, which provides 11 9's of durability. If you are following these best practices, then you've probably recognized the need to manage the number of snapshots you keep for a particular EBS volume and delete older, unneeded snapshots. Doing this cleanup helps save on storage costs.

Some customers also have policies stating that backups need to be stored a certain number of miles away as part of a disaster recovery (DR) plan. To meet these requirements, customers copy their EBS snapshots to the DR region. Then, the same snapshot management and cleanup has to also be done in the DR region.

All of this snapshot management logic consists of different components. You would first tag your snapshots so you could manage them. Then, determine how many snapshots you currently have for a particular EBS volume and assess that value against a retention rule. If the number of snapshots was greater than your retention value, then you would clean up old snapshots. And finally, you might copy the latest snapshot to your DR region. All these steps are just an example of a simple snapshot management workflow. But how do you automate something like this in AWS? How do you do it without servers?

One of the most powerful AWS services released in 2016 was Amazon CloudWatch Events. It enables you to build event-driven IT automation, based on events happening within your AWS infrastructure. CloudWatch Events integrates with AWS Lambda to let you execute your custom code when one of those events occurs. However, the actions to take based on those events aren't always composed of a single Lambda function. Instead, your business logic may consist of multiple steps (like in the case of the example snapshot management flow described earlier). And you may want to run those steps in sequence or in parallel. You may also want to have retry logic or exception handling for each step.

AWS Step Functions serves just this purpose―to help you coordinate your functions and microservices. Step Functions enables you to simplify your effort and pull the error handling, retry logic, and workflow logic out of your Lambda code. Step Functions integrates with Lambda to provide a mechanism for building complex serverless applications. Now, you can kick off a Step Functions state machine based on a CloudWatch event.

In this post, I discuss how you can target Step Functions in a CloudWatch Events rule. This allows you to have event-driven snapshot management based on snapshot completion events firing in CloudWatch Event rules.

As an example of what you could do with Step Functions and CloudWatch Events, we've developed a reference architecture that performs management of your EBS snapshots.

Automating EBS Snapshot Management with Step Functions

This architecture assumes that you have already set up CloudWatch Events to create the snapshots on a schedule or that you are using some other means of creating snapshots according to your needs.

This architecture covers the pieces of the workflow that need to happen after a snapshot has been created.

  • It creates a CloudWatch Events rule to invoke a Step Functions state machine execution when an EBS snapshot is created.
  • The state machine then tags the snapshot, cleans up the oldest snapshots if the number of snapshots is greater than the defined number to retain, and copies the snapshot to a DR region.
  • When the DR region snapshot copy is completed, another state machine kicks off in the DR region. The new state machine has a similar flow and uses some of the same Lambda code to clean up the oldest snapshots that are greater than the defined number to retain.
  • Also, both state machines demonstrate how you can use Step Functions to handle errors within your workflow. Any errors that are caught during execution result in the execution of a Lambda function that writes a message to an SNS topic. Therefore, if any errors occur, you can subscribe to the SNS topic and get notified.

The following is an architecture diagram of the reference architecture:

Creating the Lambda functions and Step Functions state machines

First, pull the code from GitHub and use the AWS CLI to create S3 buckets for the Lambda code in the primary and DR regions. For this example, assume that the primary region is us-west-2 and the DR region is us-east-2. Run the following commands, replacing the italicized text in <> with your own unique bucket names.

git clone https://github.com/awslabs/aws-step-functions-ebs-snapshot-mgmt.git

cd aws-step-functions-ebs-snapshot-mgmt/

aws s3 mb s3://<primary region bucket name> --region us-west-2

aws s3 mb s3://<DR region bucket name> --region us-east-2

Next, use the Serverless Application Model (SAM), which uses AWS CloudFormation to deploy the Lambda functions and Step Functions state machines in the primary and DR regions. Replace the italicized text in <> with the S3 bucket names that you created earlier.

aws cloudformation package --template-file PrimaryRegionTemplate.yaml --s3-bucket <primary region bucket name>  --output-template-file tempPrimary.yaml --region us-west-2

aws cloudformation deploy --template-file tempPrimary.yaml --stack-name ebsSnapshotMgmtPrimary --capabilities CAPABILITY_IAM --region us-west-2

aws cloudformation package --template-file DR_RegionTemplate.yaml --s3-bucket <DR region bucket name> --output-template-file tempDR.yaml  --region us-east-2

aws cloudformation deploy --template-file tempDR.yaml --stack-name ebsSnapshotMgmtDR --capabilities CAPABILITY_IAM --region us-east-2

CloudWatch event rule verification

The CloudFormation templates deploy the following resources:

  • The Lambda functions that are coordinated by Step Functions
  • The Step Functions state machine
  • The SNS topic
  • The CloudWatch Events rules that trigger the state machine execution

So, all of the CloudWatch event rules have been created for you by performing the preceding commands. The next section demonstrates how you could create the CloudWatch event rule manually. To jump straight to testing the workflow, see the “Testing in your Account” section. Otherwise, you begin by setting up the CloudWatch event rule in the primary region for the createSnapshot event and also the CloudWatch event rule in the DR region for the copySnapshot command.

First, open the CloudWatch console in the primary region.

Choose Create Rule and create a rule for the createSnapshot command, with your newly created Step Function state machine as the target.

For Event Source, choose Event Pattern and specify the following values:

  • Service Name: EC2
  • Event Type: EBS Snapshot Notification
  • Specific Event: createSnapshot

For Target, choose Step Functions state machine, then choose the state machine created by the CloudFormation commands. Choose Create a new role for this specific resource. Your completed rule should look like the following:

Choose Configure Details and give the rule a name and description.

Choose Create Rule. You now have a CloudWatch Events rule that triggers a Step Functions state machine execution when the EBS snapshot creation is complete.

Now, set up the CloudWatch Events rule in the DR region as well. This looks almost same, but is based off the copySnapshot event instead of createSnapshot.

In the upper right corner in the console, switch to your DR region. Choose CloudWatch, Create Rule.

For Event Source, choose Event Pattern and specify the following values:

  • Service Name: EC2
  • Event Type: EBS Snapshot Notification
  • Specific Event: copySnapshot

For Target, choose Step Functions state machine, then select the state machine created by the CloudFormation commands. Choose Create a new role for this specific resource. Your completed rule should look like in the following:

As in the primary region, choose Configure Details and then give this rule a name and description. Complete the creation of the rule.

Testing in your account

To test this setup, open the EC2 console and choose Volumes. Select a volume to snapshot. Choose Actions, Create Snapshot, and then create a snapshot.

This results in a new execution of your state machine in the primary and DR regions. You can view these executions by going to the Step Functions console and selecting your state machine.

From there, you can see the execution of the state machine.

Primary region state machine:

DR region state machine:

I've also provided CloudFormation templates that perform all the earlier setup without using git clone and running the CloudFormation commands. Choose the Launch Stack buttons below to launch the primary and DR region stacks in Dublin and Ohio, respectively. From there, you can pick up at the Testing in Your Account section above to finish the example. All of the code for this example architecture is located in the aws-step-functions-ebs-snapshot-mgmt AWSLabs repo.

Launch EBS Snapshot Management into Ireland with CloudFormation
Primary Region eu-west-1 (Ireland)

Launch EBS Snapshot Management into Ohio with CloudFormation
DR Region us-east-2 (Ohio)

Summary

This reference architecture is just an example of how you can use Step Functions and CloudWatch Events to build event-driven IT automation. The possibilities are endless:

  • Use this pattern to perform other common cleanup type jobs such as managing Amazon RDS snapshots, old versions of Lambda functions, or old Amazon ECR images—all triggered by scheduled events.
  • Use Trusted Advisor events to identify unused EC2 instances or EBS volumes, then coordinate actions on them, such as alerting owners, stopping, or snapshotting.

Happy coding and please let me know what useful state machines you build!