AWS Storage Blog
How DXC creates application consistent EBS Snapshots for Linux
Large enterprises run mission critical applications on AWS. These applications can have hundreds of Amazon Elastic Compute Cloud (Amazon EC2) instances. Backing up Amazon Elastic Block Store (Amazon EBS) volumes is a requirement for these applications for use cases such as disaster recovery, application recovery, rollback, compliance and audit. Stringent data consistency and integrity are required for these mission critical backups.
There are two types of backups for mission critical applications – application consistent and crash consistent. Application consistent backups make sure that the backed up data is in a consistent state and is devoid of incomplete data. This is done by making sure that transactions are temporarily paused and data is flushed from memory to disk before taking a backup. Crash consistent backups are a snapshot of the data at the time of taking the backup, regardless of the state of in-flight transactions.
AWS Backup provides a native way to create application consistent backups for Windows EC2 instances through Microsoft’s Volume Shadow Copy Service (VSS). For Linux EC2 instances, however, there is no native way to create application consistent snapshots. For many customers, this has been a blocker.
In this post, we share how DXC and AWS collaborated to implement a serverless solution to create application consistent backups for Linux instances. We cover the aspects of the solution such as automating the triggering of backup jobs, identifying Linux instances that need application consistent backups, and orchestrating the process of taking application consistent backups of the volumes attached to those instances by leveraging serverless services from AWS.
Solution overview
An earlier AWS Storage Blog post (Automating AWS Backup pre- and post-script execution with AWS Step Functions) discusses a generic approach to orchestrating pre-script and post-script execution on Linux or Windows EC2 instances by using the AWS Systems Manager Run Command feature. AWS has also recently announced support for pre-scripts and post-scripts in Amazon Data Lifecycle Manager. Although both are viable for this use case, neither was the best fit for DXC’s requirements:
- DXC provides backup services for various AWS storage services, such as Amazon Simple Storage Service (Amazon S3), Amazon Elastic File System (Amazon EFS), Amazon FSx, and Amazon Elastic Block Store (Amazon EBS). DXC needs a standardized approach and tool for all storage types, whereas Data Lifecycle Manager provides basic lifecycle management only for EBS Snapshots and EBS-backed Amazon Machine Images (AMIs).
- AWS Backup provides rich capabilities for managing and restoring backups of many data stores, including Amazon EBS, Amazon EFS, Amazon FSx, Amazon S3, Amazon DynamoDB, and many others. The AWS Backup documentation has the full list of supported services.
- Because of its rich features and support for a broad range of services, many managed service providers have standardized their backup offering on AWS Backup.
Solution requirements
- Execute multiple backup jobs in parallel. DXC has thousands of instances under management, so it is important to use the scale of AWS serverless services to run jobs in parallel and complete them on time.
- Take application consistent backups for Linux instances without stopping the instances.
- Handle long-running backup jobs without timing out. First-time snapshots of large volumes, or incremental snapshots of volumes for highly transactional systems, can take a long time if the amount of new data is large. An AWS Lambda function has a maximum timeout of 15 minutes, so running the whole job in Lambda would require extra handling that adds complexity to the solution. AWS Step Functions Standard Workflow executions, on the other hand, can run for up to one year. Therefore, instead of using Lambda to run the tasks on each instance, this post describes an approach in which Lambda only identifies the EC2 instances that need application consistent backups and starts a separate Step Functions state machine execution for each instance in parallel. The tasks for each instance are executed by the state machine.
- ITSM integration – Enterprises use standard tools for centralized reporting and ticketing for IT operations. In this post we show how job status is reported to ServiceNow by leveraging Amazon Simple Notification Service (Amazon SNS).
- Ability to extend and standardize the solution for other data stores such as Amazon EFS and Amazon FSx.
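To make the Lambda fan-out in the requirements above more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not DXC’s actual code: the tag key and value (`AppConsistentBackup` / `true`) and both function names are hypothetical, since the post only says that instances are identified by tagging. The Step Functions client is passed in as a parameter so the logic can be exercised without AWS credentials.

```python
import json
from typing import Any, Dict, Iterable, List

# Hypothetical tag: the post only says instances are selected "based on tagging".
BACKUP_TAG_KEY = "AppConsistentBackup"
BACKUP_TAG_VALUE = "true"


def select_instance_ids(reservations: Iterable[Dict[str, Any]]) -> List[str]:
    """Return the IDs of instances that carry the (assumed) backup tag.

    `reservations` has the shape of the "Reservations" list returned by
    the EC2 DescribeInstances API.
    """
    selected: List[str] = []
    for reservation in reservations:
        for instance in reservation.get("Instances", []):
            # Untagged instances have no "Tags" key at all.
            tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
            if tags.get(BACKUP_TAG_KEY) == BACKUP_TAG_VALUE:
                selected.append(instance["InstanceId"])
    return selected


def start_backup_executions(
    sfn_client: Any, state_machine_arn: str, instance_ids: Iterable[str]
) -> List[str]:
    """Start one state machine execution per instance; return the execution ARNs."""
    execution_arns: List[str] = []
    for instance_id in instance_ids:
        # StartExecution returns as soon as the execution is accepted, so the
        # Lambda function does not wait for any backup to finish.
        response = sfn_client.start_execution(
            stateMachineArn=state_machine_arn,
            input=json.dumps({"InstanceId": instance_id}),
        )
        execution_arns.append(response["executionArn"])
    return execution_arns
```

In a real Lambda handler, `reservations` would typically come from the boto3 EC2 `describe_instances` paginator and `sfn_client` from `boto3.client("stepfunctions")`. Because each call only starts an execution, the function stays well inside the 15-minute Lambda limit even when the backups themselves run for hours.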
Figure 1: DXC and AWS Backup high-level architecture
Solution workflow
The solution uses an automated serverless approach. Amazon EventBridge Scheduler triggers a Lambda function on a predefined schedule. The Lambda function discovers the Linux instances that need application consistent backups, identifying them by their tags. It then starts a Step Functions state machine execution for each instance, and the state machine orchestrates a set of tasks on that instance. The state machine executes the following sequence of tasks:
1. The State Machine first executes an AWS Systems Manager document using the Run Command. This document runs a script on the instance that freezes I/O on the file system and flushes buffers to disk using standard Linux commands, where <mountpoint> stands for the mount point of the file system being backed up:
fsfreeze -f <mountpoint>
sync
2. Next, the State Machine calls the AWS Backup API to take snapshots and back up the attached EBS volumes to an AWS Backup Vault. If the backup job fails, Amazon SNS is used to send a notification to DXC’s ServiceNow Connector, which logs a ticket in ServiceNow.
3. Then, the State Machine executes another Systems Manager document using the Run Command to unfreeze the file system on the instance using the following Linux command, where <mountpoint> is the mount point that was frozen:
fsfreeze -u <mountpoint>
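The three steps above could be expressed as an Amazon States Language definition. The following Python sketch builds one as a dictionary. It is a simplified illustration rather than DXC’s actual state machine: the state names, the function name, and all parameters are hypothetical, and it assumes the execution input carries an `InstanceId` field and that the EC2 instance ARN is passed to AWS Backup as the resource to protect. `SendCommand` and `StartBackupJob` both return before the work finishes, so a production definition would also poll `GetCommandInvocation` and `DescribeBackupJob` before moving on; those wait loops are omitted here, which means the `Catch` only covers failures to start each step.

```python
import json
from typing import Any, Dict, List, Optional


def build_definition(
    freeze_document: str,
    unfreeze_document: str,
    vault_name: str,
    backup_role_arn: str,
    failure_topic_arn: str,
    instance_arn_prefix: str,
) -> Dict[str, Any]:
    """Build a freeze -> backup -> unfreeze state machine definition (sketch)."""

    def catch_all() -> List[Dict[str, Any]]:
        # Route any error to the notification state, keeping the error details.
        return [{"ErrorEquals": ["States.ALL"], "ResultPath": "$.Error",
                 "Next": "NotifyFailure"}]

    def run_document(document_name: str, next_state: Optional[str]) -> Dict[str, Any]:
        # Run a Systems Manager document on the instance from the execution input.
        state: Dict[str, Any] = {
            "Type": "Task",
            "Resource": "arn:aws:states:::aws-sdk:ssm:sendCommand",
            "Parameters": {
                "DocumentName": document_name,
                "InstanceIds.$": "States.Array($.InstanceId)",
            },
            "ResultPath": None,  # serialized as null: discard result, keep input
        }
        if next_state is None:
            state["End"] = True
        else:
            state["Next"] = next_state
        return state

    freeze = run_document(freeze_document, "StartBackup")
    freeze["Catch"] = catch_all()

    start_backup = {
        "Type": "Task",
        "Resource": "arn:aws:states:::aws-sdk:backup:startBackupJob",
        "Parameters": {
            "BackupVaultName": vault_name,
            "IamRoleArn": backup_role_arn,
            "ResourceArn.$": f"States.Format('{instance_arn_prefix}{{}}', $.InstanceId)",
        },
        "ResultPath": None,
        "Catch": catch_all(),
        "Next": "UnfreezeFilesystems",
    }

    notify_failure = {
        "Type": "Task",
        "Resource": "arn:aws:states:::sns:publish",
        "Parameters": {
            "TopicArn": failure_topic_arn,
            "Message.$": "States.Format('Backup failed for instance {}', $.InstanceId)",
        },
        "ResultPath": None,
        # Always unfreeze, even after a failure, so I/O is not left blocked.
        "Next": "UnfreezeFilesystems",
    }

    return {
        "StartAt": "FreezeFilesystems",
        "States": {
            "FreezeFilesystems": freeze,
            "StartBackup": start_backup,
            "NotifyFailure": notify_failure,
            "UnfreezeFilesystems": run_document(unfreeze_document, None),
        },
    }
```

Passing the result through `json.dumps` yields a string that could be supplied as the definition when creating a state machine. The design point worth noting is that both the success path and the failure path end in the unfreeze step, so a failed backup does not leave the file system frozen.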
Benefits
First, this is a completely automated solution. Once the EC2 instances are tagged correctly, the solution creates application consistent backups on a predefined schedule without manual intervention. It also runs the backup jobs for the instances in an account in parallel, handles Linux instances with a single EBS volume as well as multiple EBS volumes, and creates tickets and notifies operations teams when a job fails. Second, the solution is built entirely on AWS serverless services, which keeps cost low and adds no operational overhead. Cost is incurred only when the system is executing a backup job, and there are no instances or infrastructure to manage.
Cleaning up
If you spin up resources in your AWS account to set up and test this solution, delete those resources afterward to avoid incurring charges. EBS volumes and EC2 instances continue to incur costs even when not in use. The rest of the services are serverless and incur charges only when used, but we recommend deleting them as well.
Conclusion
In this post, we showed how DXC uses AWS serverless capabilities to take application consistent backups for Linux EC2 instances, automating the process and running multiple jobs in parallel. Application consistent backups help ensure that your data is consistent and not missing transactional data. You can learn more about AWS Backup by visiting the technical documentation. Thank you for reading this post. If you have comments or questions, we encourage you to share them in the comments section.
DXC is an AWS Premier Consulting Partner, Managed Service Provider and Global Systems Integrator (GSI) that provides a wide array of solutions and services to its users. DXC’s AWS Managed Service offers management, governance, and operations of customers’ AWS environments at scale. Learn more about DXC’s AWS capabilities here.