AWS Storage Blog

Archiving relational databases to Amazon S3 Glacier storage classes for cost optimization

Many customers are growing their data footprints rapidly, with significantly more data stored in their relational database management systems (RDBMS) than ever before. Additionally, organizations subject to data compliance regulations, including the Health Insurance Portability and Accountability Act (HIPAA), the Payment Card Industry Data Security Standard (PCI DSS), and the General Data Protection Regulation (GDPR), are often required to retain records for extended periods, further increasing their data footprint.

As a result, your organization may need to archive RDBMS data as traditional database backups (“dumps”) over a long period of time, ready to be used by a database engine. To address these requirements while keeping costs under control, you can leverage the Amazon S3 Glacier storage classes, which are purpose-built for data archiving and provide high performance, retrieval flexibility, and the lowest-cost archive storage in the cloud.

In this post, we will demonstrate how to build an automated backup and archiving solution on Amazon S3 Glacier storage classes using AWS Batch, AWS Lambda, AWS Secrets Manager, and Amazon DynamoDB. This solution leverages tools such as mysqldump and pg_dump to take database backups from Amazon RDS and Amazon Aurora instances on a recurring schedule using Amazon EventBridge. The backups are archived via automatic transition to S3 Glacier storage classes in an S3 bucket, providing a cost-effective, long-term storage solution.

Solution overview

This solution leverages database dumps and S3 Glacier storage classes to provide a flexible, cost-effective approach for long-term data archival. The proposed architecture automates backup processes using a combination of EventBridge for scheduling, Lambda for code execution, and Fargate-based container images through AWS Batch to execute the archiving, enabling efficient data transfer to S3 Glacier storage classes.

While this method offers an innovative archival strategy, it is crucial to understand its limitations. Unlike native Amazon RDS snapshots, which enable quick point-in-time database restoration, this approach requires manually reconstructing the database. Restoring from database dumps involves reading entire dump files and executing SQL statements to rebuild database structures and reinsert data—a process that can be significantly time-consuming, potentially taking hours or even days for large databases.

This solution is best suited for long-term archival and should complement, not replace, standard backup and recovery mechanisms provided by Amazon RDS.

Figure 1: End-to-end database archival process with the associated AWS services

This solution works as follows:

  1. Amazon EventBridge triggers an AWS Lambda function, which scans Amazon RDS databases for specific tag key/value pairs indicating a database we want to archive (a discovery sketch follows this list).
  2. AWS Batch launches an AWS Fargate container from an image stored in Amazon ECR. This image contains database backup tools such as "pg_dump" and "mysqldump".
  3. Database credentials are pulled from AWS Secrets Manager. We then create a backup dump, which gets archived to the S3 Glacier storage classes. A log entry is created in Amazon DynamoDB for reference and governance purposes.
  4. In case of problems with the process, AWS Batch forwards the details of the issue through Amazon SNS to an email address of your choice.
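
To illustrate step 1, here is a hedged sketch of how databases carrying the archiving tag could be discovered using the AWS CLI. The ARN and Region are placeholders, and the deployed Lambda function implements this logic in code rather than through the CLI:

# List the candidate DB instances, then inspect each one's tags for the archiving marker.
aws rds describe-db-instances --query "DBInstances[].DBInstanceArn" --output text

# Placeholder ARN: check whether this instance carries AutomatedArchiving=Active.
aws rds list-tags-for-resource \
  --resource-name arn:aws:rds:us-east-1:111122223333:db:example-db \
  --query "TagList[?Key=='AutomatedArchiving' && Value=='Active']"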

In the following sections, we walk you through the steps in detail to create your resources and deploy the solution.

Prerequisites

The following are prerequisites to implementing this solution.

1. An S3 bucket

This solution stores newly created archives in the Amazon S3 Standard storage class. You should configure an Amazon S3 Lifecycle policy to transition objects to the destination Amazon S3 storage class that corresponds to your organization's needs for your archives. S3 Lifecycle helps you store objects cost effectively throughout their lifecycle by transitioning them to lower-cost storage classes automatically as they age.
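
For reference, here is a minimal sketch of such a Lifecycle configuration applied with the AWS CLI, assuming a placeholder bucket name and prefix and a transition to S3 Glacier Deep Archive after 30 days; adjust the storage class and timing to your requirements:

aws s3api put-bucket-lifecycle-configuration --bucket your-archive-bucket \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "ArchiveDatabaseDumps",
      "Status": "Enabled",
      "Filter": {"Prefix": "archives/"},
      "Transitions": [{"Days": 30, "StorageClass": "DEEP_ARCHIVE"}]
    }]
  }'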

When selecting an Amazon S3 Glacier storage class, consider the following key characteristics:

Figure 2: S3 Glacier storage classes characteristics

If in doubt, you can use the S3 Glacier Instant Retrieval storage class to facilitate retrieval of your objects. In contrast to objects in S3 Glacier Instant Retrieval, Amazon S3 objects that are stored in the S3 Glacier Flexible Retrieval or S3 Glacier Deep Archive storage classes are not immediately accessible. To access an object in these storage classes, you must first restore a temporary copy of the object to its S3 bucket. For additional information on restoring objects from the S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive storage classes, visit the Amazon S3 User Guide.
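
As an illustration, the following hedged AWS CLI example restores a temporary copy of an archived object for 7 days using the Standard retrieval tier; the bucket and key names are placeholders:

aws s3api restore-object --bucket your-archive-bucket \
  --key archives/mydb-dump.sql.gz \
  --restore-request '{"Days": 7, "GlacierJobParameters": {"Tier": "Standard"}}'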

2. Networking and security components

To facilitate the archival process and communication with AWS services, you will need to establish a destination subnet. We suggest simply reusing the subnets your databases use. Make sure communication to AWS services leverages VPC endpoints, which have the advantage of keeping your traffic within the AWS infrastructure.

3. A development environment

To build a container image, you will need a development environment with the AWS CLI and Docker binaries installed.
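
You can quickly confirm that both tools are available from your terminal:

aws --version
docker --version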

Solution deployment

To test this solution in your environment, complete the following steps:

1. Create an Elastic Container Registry (ECR) repository and push your container image.

2. Prepare the AWS Lambda function and deploy the CloudFormation template.

3. Create credentials and secrets in Secrets Manager for database access.

4. Create a schedule with EventBridge and apply tags to your databases that require archiving.

1. Create an ECR repository and push your container image

To utilize AWS Batch for backing up and archiving your data, you must have an Amazon Elastic Container Registry (Amazon ECR) repository configured to store the container image.

Create an ECR repository:

1. Go to the Amazon ECR console (this link takes you to the us-east-1 Region; make sure this reflects the Region you are building this solution in), and choose Create a repository.

2. Select Private from the Visibility setting.

3. Enter a Repository name.

4. Select Create Repository.

Copy the solution files to your development environment

To download a copy of this solution from GitHub that contains the AWS CloudFormation templates, the AWS Lambda function code, and the Dockerfile used to build the container, complete the following steps:

1. Log in to your development environment.

2. Enter the following command:

git clone https://github.com/aws-samples/Archiving-RDBMs-for-cost-reduction.git .

Build the Docker image

To build and push your Docker image to Amazon ECR, complete the following steps:

1. Go to Amazon ECR in AWS Console and choose your repository name.

2. Select View push commands.

Figure 3: Location of the View push command button in the console

3. Select and copy each command listed in the View push commands window (steps 1 to 4) into your development environment terminal, from the directory of the GitHub project you have just cloned. This authenticates you and allows the container image to be pushed to your Amazon ECR private repository.

Figure 4: Output of the View push commands window in the console

Important: When building Docker images on a macOS computer with an ARM-based processor, ensure compatibility with x86-based systems (such as Amazon Elastic Container Service) by adding the "--platform linux/amd64" flag to your Docker build command. This modification allows the image to run correctly on different server architectures. For detailed guidance, consult the official Docker documentation. An example build-and-push sequence is shown after this list.

4. Finally, go back to the ECR console and record the URI of the image you just built and uploaded.
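
For reference, the build-and-push sequence typically looks like the following hedged example, using placeholder account ID, Region, repository, and tag values; the authoritative commands are the ones shown in the View push commands window:

# Authenticate Docker to your private registry.
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 111122223333.dkr.ecr.us-east-1.amazonaws.com

# Build for x86_64 (required when building on an ARM-based machine), tag, and push.
docker build --platform linux/amd64 -t cold-archiving .
docker tag cold-archiving:latest 111122223333.dkr.ecr.us-east-1.amazonaws.com/cold-archiving:latest
docker push 111122223333.dkr.ecr.us-east-1.amazonaws.com/cold-archiving:latest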

2. Prepare the AWS Lambda function and deploy the CloudFormation template

The CloudFormation template deployed in the next step expects the Lambda function code to be zipped and stored in an Amazon S3 bucket. To store the Lambda function code, do the following:

1. In the S3 console, choose the Amazon S3 bucket that will contain your archives (see “Prerequisites” above).

2. Zip and upload the Lambda function (.py) file located in the root directory of your cloned copy of this solution repository to your S3 bucket. Make sure the file you are uploading is named "cold_archiving_lambda_func.zip" precisely.
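
For example, from your development environment the zip and upload could look like the following hedged commands, where the .py file name and bucket name are placeholders:

zip cold_archiving_lambda_func.zip your_lambda_function.py
aws s3 cp cold_archiving_lambda_func.zip s3://your-archive-bucket/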

Deploy the CloudFormation stack

To deploy the CloudFormation stack, complete the following steps:

1. Go to the CloudFormation section in the AWS Management Console and choose Create stack with new resources.

2. Select Choose an existing template, then under Specify a template select Upload a template file, and finally select Choose file.

3. Browse to the location of the CloudFormation template you downloaded locally and select the "cold_archiving_cfn.yaml" file.

Figure 5: View of the CloudFormation deployment section

CloudFormation deployment parameters

To deploy the CloudFormation stack, you will need to provide some deployment parameters.

To enter the required CloudFormation deployment parameters, complete the following steps:

1. Enter a Stack name.

2. Enter the Docker image URI you previously built in the EcrImage field.

3. Enter an email address in the EmailAlerts field. This is used as a destination for alerts.

4. Enter a Subnet for the Fargate instance; this is the network discussed in the Prerequisites section.

5. Enter the destination VPC.

6. Leave everything else at their default values and choose Submit.

3. Create secrets in Secrets Manager for database access.

We are going to create credentials that have only the minimum set of permissions required to back up RDS instances, and store those credentials securely.

Creating the credentials

The backup process outlined in this blog post utilizes native database management tools, such as “mysqldump,” which require security credentials. By default, an Amazon RDS DB instance has a single administrative account with full privileges. However, it is recommended to create a dedicated user account that only has the necessary privileges to perform backups. For more information on granting least privilege access, you can refer to this link.

To create a backup user, complete the following steps:

1. Connect to your Amazon RDS database endpoint from a location with the mysql client installed using the following command:

mysql -u admin -h <rds_Endpoint> -p

2. Enter the following command, substituting your backup username and desired password, to create a user with the minimum set of privileges required to back up your data:

CREATE USER 'backup_user_name'@'%' IDENTIFIED BY 'your_password';
GRANT LOCK TABLES, SELECT ON DATABASE_NAME.* TO 'backup_user_name'@'%';
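
For illustration, this is the kind of dump command the container image runs with these credentials; the endpoint and database name are placeholders, and the exact invocation used by the solution lives in the repository:

mysqldump -h <rds_endpoint> -u backup_user_name -p --single-transaction DATABASE_NAME > DATABASE_NAME.sql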

Storing a secret in Secrets Manager

Now that you have created a backup user, you need to store its credentials in Secrets Manager.

To store a secret in Secrets Manager, complete the following steps:

1. Open the AWS Management Console and choose Secrets Manager.

2. Choose Store a new Secret.

3. Under Secret type, select Credentials for Amazon RDS database.

Figure 6: Selecting the type of credentials to be stored in Secrets Manager

4. Enter the credentials for the backup user created in the Creating the credentials step, then choose Next.

5. Leave the default encryption key selected, or select a customer managed key.

6. Go to Database, and select the database to which this secret should be associated.

7. Choose Next, then enter a secret name. It is imperative that you precede its name with the "cold-archiving/" prefix (for example, cold-archiving/secret_name). This ensures the task execution role will be able to access it and create a connection to your database. Take note of the secret name, including its prefix.

Figure 7: The Secret name with its associated prefix

8. Leave everything else at their default values and store your secret by choosing Next.
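
As an alternative to the console steps above, you could store the secret with the AWS CLI. This is a hedged sketch with placeholder values; the key names mirror the structure the console creates for RDS credentials, and the cold-archiving/ prefix remains mandatory:

aws secretsmanager create-secret --name cold-archiving/mydb-backup \
  --secret-string '{"username":"backup_user_name","password":"your_password","engine":"mysql","host":"<rds_endpoint>","port":"3306","dbname":"DATABASE_NAME"}'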

4. Create a schedule with EventBridge and tag your databases

The archival process can (and should) be scheduled to run periodically. You will create a schedule that runs once a month at a time of your choosing using EventBridge.

To create an EventBridge schedule, complete the following steps:

1. Open EventBridge, then under Scheduler, choose Schedules.

2. Choose Create Schedule.

3. Give a name to your schedule, then under Occurrence, select Recurring schedule.

4. Under Schedule type, select Cron-based schedule.

5. Under Cron expression, define a schedule. Here is an example showing a cron-based schedule occurring on the first day of every month at 1:30 AM (this can be any day and time of your choosing); the corresponding expression is also shown after this list. For more information about the cron scheduling syntax, you can refer to the information here and here.

Figure 8: Configuring the Cron based schedule with a cron expression syntax

6. Select fixed time window, your local time zone, and choose Next.
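
For reference, the schedule described in step 5 (1:30 AM on the first day of every month) corresponds to the following cron expression, written here in the cron(minutes hours day-of-month month day-of-week year) form used by EventBridge Scheduler; in the console you enter the six fields individually:

cron(30 1 1 * ? *)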

Configure the Lambda invocation

You must now select the Lambda function as the schedule's target and configure the parameters EventBridge will pass to it.

To select the Lambda function as a target and configure its input parameters, complete the following steps:

1. Continuing from the last step, under Target detail, select Templated targets, then AWS Lambda Invoke.

2. Scroll down to Invoke and select the cold-archiving Lambda function from the dropdown as your Lambda function.

3. Under the Payload section, enter the number of days this archive's Retention field should correspond to in your DynamoDB archiving journal. The valid JSON format is '{"RetentionDays":"number_of_days"}'. Here is an example for 365 days of retention (the exact payload also appears after this list):

Figure 9: Entering the Payload information to configure a retention

4. Choose Next, then scroll down to the Permissions section.

5. Select Use an existing role, then use the role the CloudFormation template created for you (named deployment_name_taskexec-role). If in doubt, you can look at the CloudFormation stack's Outputs section to find the appropriate role that was created for you.

6. Under the Retry policy and dead-letter queue (DLQ) section, disable retry under Retry policy. Then, under Dead-letter queue (DLQ), choose Select an Amazon SQS queue in my AWS account as a DLQ. Finally, under SQS queue, select the cold-archiving-error-queue.

7. Scroll down, choose Next then Create schedule.
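
Following the payload format described in step 3, the exact JSON for 365 days of retention would be:

{"RetentionDays":"365"}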

Tag your resources

The last step is tagging the resources that you want to be archived. These tags are scanned periodically by the Lambda function, which launches an archiving process on the databases that have the "AutomatedArchiving:Active" tag key/value present.

To tag your resources, complete the following steps:

1. Open your Amazon RDS instance. Under the Tags tab, add all of the following key/value pairs:

Figure 10: All the available and mandatory key/value pairs to be applied to your databases
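
As an alternative to the console, the tag mentioned above can also be applied with the AWS CLI, as in this hedged example with a placeholder ARN; any other mandatory key/value pairs listed in Figure 10 must be added the same way:

aws rds add-tags-to-resource \
  --resource-name arn:aws:rds:us-east-1:111122223333:db:example-db \
  --tags Key=AutomatedArchiving,Value=Active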

Handling archives deletion

The solution presented in this blog post does not include a built-in file deletion mechanism. The responsibility of managing the deletion of archived files is left to you to decide, based on your specific requirements and policies.

By recording the archiving activities in Amazon DynamoDB, you have the flexibility to build a file deletion process tailored to your organization’s needs. This allows for a more controlled and adaptable approach to managing the lifecycle of the archived data, without relying on a one-size-fits-all solution.

For more information on how to delete Amazon S3 objects programmatically, you can refer to the link provided here.
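
For instance, once your DynamoDB journal indicates an archive has exceeded its retention period, a single object could be removed with a command along these lines; the bucket and key are placeholders, and versioned buckets require deleting the specific object versions as well:

aws s3api delete-object --bucket your-archive-bucket --key archives/mydb-dump.sql.gz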

Cleaning up

When you’re finished evaluating this solution, you should delete the CloudFormation stack, any secrets you created, all associated EventBridge schedules, and any S3 resources to avoid further charges.

Conclusion

In this blog post, we introduced an innovative solution for managing the exponential growth of relational database volumes through strategic cloud archiving to S3 Glacier storage classes. By harnessing AWS services like Batch, Lambda, and EventBridge, we have developed an automated approach that addresses critical data retention challenges for organizations operating under strict regulatory frameworks.

Our solution offers a transformative framework that enables organizations to:

  • Optimize storage costs
  • Maintain compliance with industry regulations
  • Preserve data accessibility
  • Streamline data lifecycle management

We encourage you to explore this solution by testing it within your own infrastructure. By doing so, you’ll gain firsthand insights into the potential cost savings and operational efficiencies achievable through intelligent, cloud-based data archiving strategies.

Jean-Sebastien Labonte

Jean-Sebastien Labonte is a Solutions Architect specializing in Data and Storage Management services at AWS. Jean-Sebastien has over 20 years of experience in designing and building data and infrastructure solutions, solely focused on customer-centric outcomes. Aside from helping customers, when not at AWS, Jean-Sebastien enjoys travelling and hiking.

Yanko Bolanos

Yanko Bolanos is a Sr. Solution Architect enabling customers to successfully run production workloads on AWS. With over 16 years of experience in media & entertainment, telco, gaming, and data analytics, he is passionate about driving innovation in cloud and technology solutions. Prior to AWS, Yanko applied his cross-disciplinary tech and media expertise while serving as a director leading R&D Engineering and Ad Engineering teams.

Sebastien Perreault

Sebastien Perreault, a Principal Solutions Architect at Amazon Web Services (AWS), is a cloud storage expert who designs scalable, resilient solutions for global customers. Throughout his 25-year career, he has made sure to transform data-related challenges into innovative solutions.