How to archive data from relational databases to Amazon Glacier using AWS DMS
This post was updated on April 2021 to instruct readers to use Python 3.7. On July 15, 2021, AWS Lambda will deprecate Python 2.7 as a supported runtime, formally ending our Python 2.7 support. Check this post for more information.
You can use AWS Database Migration Service (AWS DMS) to migrate data from various sources to most widely used commercial and open-source databases. AWS DMS also supports Amazon S3 as a source and as a target for migrations. When you use Amazon S3 as a target, you can use AWS DMS to extract information from any database that is supported by AWS DMS. You can then write it to Amazon S3 in CSV format, which can be used by almost any application.
You can use this feature in the following two scenarios:
- As part of the migration from commercial engines such as Oracle to Amazon RDS for PostgreSQL, you can migrate infrequently accessed data or historical audit and archive data into AWS storage services like Amazon S3 or Amazon Glacier. Doing this can help you meet governance and compliance needs for long-term retention of data.
- As the data size grows in relational databases, database administrators (DBAs) and developers spend more time monitoring and reorganizing indexes and partitions to maintain optimal performance. This additional maintenance leads to increased operational expenses for storing the data due to redundancy and backup. It’s a best practice to archive infrequently accessed data in Amazon S3 or Amazon Glacier for better cost savings and manageability.
In this post, I talk about the second use case, where infrequently accessed data—for example, sales information that is older than a few months or years (often referred as cold data)—is archived in a secure vault for a long duration. The finance or information security teams should be able to access this data when required, although such requests might be sporadic.
If you compare the cost of per GB storage per month for Amazon RDS PostgreSQL Multi-AZ GP2 with that of Amazon Glacier in the us-east-1 Region, the Amazon Glacier storage cost is about 57 times cheaper than that of Amazon RDS. For example, the cost of 100 GB of storage per month in Amazon Glacier is under a dollar ($0.40), compared to Amazon RDS Multi-AZ storage, which is $23.00 (at the time of this writing).
In this post, I show you how to use AWS DMS to store the data from Amazon RDS PostgreSQL in Amazon S3 as a CSV file, and then archive the CSV file in Amazon Glacier. Amazon Glacier provides multiple retrieval options that allow you to retrieve the data in a few minutes to several hours. I then use the Amazon Glacier Select feature to query a subset of the data from the archive. With Amazon Glacier Select, you can perform analysis on your data without staging it in a hotter storage tier like Amazon S3. This makes it cheaper, faster, and easier to gather insights from your cold data in Amazon Glacier.
To help you manage all the infrastructure for this post, we created an AWS CloudFormation template that provisions all the required resources and the AWS Lambda code used in the post. You can find the code in this GitHub repository. The solution that is demonstrated in this post includes the following tasks:
- Download the public data set containing UK property registration information from https://data.gov.uk.
- Create an Amazon Glacier vault for storing the archive.
- Launch an AWS CloudFormation template that creates the necessary resources. These include the virtual private cloud (VPC), Amazon RDS PostgreSQL instance, AWS DMS replication instance, S3 bucket, IAM policies, and other resources that are required for this example.
- Create the folders in Amazon S3.
- Connect to the RDS instance, and load the dataset using the PostgreSQL client.
- Create an AWS Lambda function, and set up an S3 event notification to trigger the Lambda function. This function downloads the file from S3 and uploads it to Amazon Glacier as soon as the CSV file is created by AWS DMS.
- Configure the source and target endpoints for AWS DMS.
- Set up an AWS DMS task to extract the data from the RDS instance to Amazon S3.
- Create a Lambda function to retrieve the subset of data from the archive using Amazon Glacier Select.
- View the archive results in the S3 bucket.
- Clean up the AWS resources.
Note: To illustrate this architecture, the example uses serverless computing. I use an AWS Lambda function to download CSV files that are created by AWS DMS and upload it to Amazon Glacier. Lambda has limited local disk capacity of 512 MB, and hence this solution assumes that the archives are under this limit. For archives that are larger than 512 MB, consider replacing Lambda with Amazon EC2 to implement similar functionality.
Prerequisites and assumptions
To complete this solution, you should have the following:
- An AWS account that provides access to the services shown in the preceding diagram. The steps are performed in the us-west-2 (Oregon) Region. Before you start, ensure that the services that are used in this post are available in your AWS Region.
- A tool that connects to a PostgreSQL database, such as psql or pgAdmin.
- Working knowledge of Amazon RDS, AWS DMS, Amazon VPC, Amazon S3, and AWS Lambda.
The following sections walk you through the steps that are required to create and configure the solution.
1. Download the sample dataset
- Download this UK land registry file, which contains a decade of property registration information. Note: This file contains HM Land Registry data © Crown copyright and database copyright 2018. This data is licensed under the Open Government Licence v3.0. The data provided in this post includes address data that has usage restrictions, including personal and non-commercial use. For more information about the permitted use of this dataset, see Price Paid Data on the Gov.uk site.
- You can save the file to any Amazon EC2 instance (preferred) or to your local computer that has the PostgreSQL client installed.
At the time of this writing, this dataset contains UK property registration information from 1995 to 2017, with approximately 22 million records. We assume that you need to extract old data on a yearly basis (that is, partitions by year) and archive it in Amazon Glacier. After successfully storing it in Amazon Glacier, you can delete those records from Amazon RDS.
2. Create the Amazon Glacier vault
- Sign in to the AWS Management Console, and open the Amazon Glacier console. Create a vault with the name rdsarchive in the us-west-2 Region. Choose Do not enable notifications, and then choose Submit.
- Note the name and Amazon Resource Name (ARN) of the vault.
3. Launch the AWS CloudFormation template
- Copy or download the AWS CloudFormation JSON file from this GitHub repository.
- Go to the AWS CloudFormation console, and choose Create Stack.
- Launch a template by uploading the JSON file from Step 1.
- Specify the mandatory parameter values:
- Type a stack name, for example, dms_archive. Provide a unique S3 bucket name and an Amazon Glacier vault name (it should be the same as the one in Step 2).
- For ExistsDMSRole, specify Y if dms-vpc-role is already created in your AWS account. Otherwise, specify N so that it can be created with this CloudFormation stack. For more information about setting up this role, see Creating the IAM Roles to Use with the AWS CLI and AWS DMS API.
- The rest of the parameter values are optional and can be left as the default settings.
- Choose Next, and then choose Create.
The AWS CloudFormation stack creation takes about 10–15 minutes and creates the following resources. Some resources, such as the S3 bucket name, must be unique across all existing bucket names in Amazon S3. (In this post, I use the bucket name
rdstoglacier.) We recommend that you restrict the access to the RDS instance for your specific IP range (by default, it allows access to
- An Amazon S3 bucket with the name as provided.
- An AWS DMS replication instance named
- An IAM role for Lambda execution named
rdsarchive-lambda-execution. Add the managed policy
AWSLambdaBasicExecutionRoleto the role, so that the Lambda function can create and access Amazon CloudWatch Logs.
- An (IAM) policynamed
lambda-to-archive-policyto provide access to the S3 bucket and the Amazon Glacier vault that is created in Step 2, and add it to the
- An IAM role named
dms-vpc-rolefor managing VPC settings for AWS managed customer configurations. Add the managed policy
AmazonDMSVPCManagementRoleto the role if it is not created already.
- An IAM policy
DELETEobjects from the bucket created in Step 1). Add it to the
- An RDS instance with identifier
archivedb, database name
ukdb, user name
dbadmin, and default password
- A VPC for the Amazon RDS PostgreSQL and AWS DMS instances, and a security group to allow access to the RDS PostgreSQL instance from your IP range.
The AWS resources provisioned are assigned with default names in the template. If you provide different names while creating the stack, use those names in the rest of the steps. You can note the resource names, RDS endpoint names, S3 bucket name, and so on from the Outputs section of your stack execution.
4. Create the Amazon S3 folders
After creating the stack, go to the Amazon S3 console and locate the bucket
rdstoglacier. Create the following two folders under the bucket:
ukregistryfolder, which acts as a target for AWS DMS.
outputfolder, which acts as a target for Amazon Glacier Select to retrieve the data from the archive.
5. Load the data into the Amazon RDS DB instance
- Using psql or pgAdmin, connect to the RDS DB instance as follows (assuming that you are using the default database name
ukdband the user name
- Create the database table, and load the UK dataset, as shown following:
Note: Another option is to download the dataset to an EC2 instance and load it from there via psql installed on that EC2 instance.
After a successful load, you will see an output like
COPY 22898552, or you can run the query
select count(*) from uk_property_registry; to verify that the load is successful.
6. Create a Lambda function to download the file from S3 and upload it to the Amazon Glacier vault
- On the AWS Lambda console, create the
archivetoglacierLambda function as follows:
- Use the blueprint s3-get-object-python with the Python 3.7 runtime.
- For the function name, type archivetoglacier.
- Choose the existing rdsarchive-lambda-execution role for this function.
- Type the bucket name.
- For the event type, choose Object Created (All).
- Specify ukregistry as the prefix.
- Select the Enable triggercheck box.
- Create the function.
- After it is created, edit the function.
- Replace all the code in the window with the Python code from the GitHub repository.
- In the Environment variables section, specify the vaultname as rdsarchive.
- Increase the Lambda timeout to 5 minutes because the upload of the archive might not complete within the default timeout settings.
Save the changes.
The preceding code fetches the S3 object, downloads it to a Lambda instance, and uploads it to the Amazon Glacier vault. This function takes the Amazon Glacier vault name through the environment variables.
7. Set up the source and target endpoints for AWS DMS
- Open the AWS DMS console, and choose Endpoints in the navigation pane. Create a source endpoint for the PostgreSQL database using the RDS instance that you created in Step3, as shown following:Make sure that the database name, port, and user information are correct.
- Choose Run testand test the endpoint. If the test succeeds, be sure to save the endpoint. If the test fails, verify the endpoint and security group configurations.
- Choose Refresh schemas after successful testing.
- Create a target endpoint for the S3 connection, as shown following:
- Specify the role ARN of dms-vpc-role as arn:aws:iam::<AWS Account>:role/dms-vpc-role.
- Specify the Bucket name and Bucket folder from Step 4.
- Choose Run testand test the endpoint. If the test succeeds, be sure to save the endpoint. If the test fails, verify the role configuration and the S3 bucket name.
8. Set up a DMS task to extract the data from the RDS instance to S3
- Open the AWS DMS console. Choose Tasks in the navigation pane, and then choose Create task.
- For the task name, use archivetos3. Use the replication instance that was created via AWS CloudFormation and the endpoints that you created in Step 7.
- For Migration type, choose Migrate existing data.
- Accept the default values in the Task Settings.
- In the Table mappingssection, choose public for the schema field. For Table name, choose uk_property_registry.
- AWS DMS provides an option to add column filters, which enable you to extract the data that meets your archive criteria. In this example, you can specify the transfer_date column to extract the data for a particular year—for example, transfer_date between 1995-01-01 and 1996-01-01, as shown following. Choose Add selection rule.
- Choose Create task. The migration task then runs because the Start task on create box is selected. The task status changes from Creating to Starting, then to Running, and finally to Load Complete.
- You can verify that AWS DMS extracts the subset of information from the table as a CSV file and stores it in the S3 path that you specified.
- Examine the contents of the S3 bucket:
- Go to the Amazon S3 console, and choose the rdstoglacierChoose the key (folder) named ukregistry. Choose the key public, which is the schema name. Choose uk_property_registry, which is the table name.
- You will see the object
LOAD00000001.csvfile, which contains the data <= 1996-01-01. You can download this file and examine it with a text editor if you want.
- Check whether the Lambda function was invoked for the upload process.
- Go to the AWS Lambda console and choose the archivetoglacier Choose the Monitoring tab, and check to see whether the function was invoked as the S3 trigger is configured on the bucket.
- Choose Jump to Logs to go to CloudWatch Logs for the Lambda function. Note the
u'archiveId') of the Amazon Glacier upload operation from the log. You need this for the next step.
9. Create a Lambda function to retrieve the subset of the data
Now, let’s assume that you need to identify the registrations that are done with price > 500,000 in year 1995 from the archive. With Amazon Glacier Select, you can extract the subset of data from the archive using a SQL query. Doing this not only reduces the total time involved in querying, but it also simplifies the way that you can access the archive data. It is important to understand the structure of the archived CSV file, and the SQL operations and data types that are supported in the Amazon Glacier Select feature.
For this example, you will use the following SQL expression to retrieve the data. The second column (
s._2) is the price attribute in the table and the CSV file that was uploaded as the archive. You use the
CAST function to convert this to decimal because Amazon Glacier, by default, assumes these attributes as a string. Although Amazon Glacier supports several retrieval options, you use the Standard option for this post. Amazon Glacier Select stores the output in an S3 bucket specified in the retrieval job.
You use a Lambda function to achieve this.
- Create a Lambda function as follows:
- In the AWS Lambda console, choose Create function. Choose the Author from scratch
- Type glacierselect for the function name.
- Choose the Python 3.7 runtime.
- Choose the existing role rdsarchive-lambda-execution for this function.
- Create the function.
- After it is created, edit the function.
- Replace all the code in the window with the Python code from the GitHub repository. In the Environment variables section, add the following variables. It is important that you provide the correct values, such as bucketname, the prefix from Step 4, and the archiveid from Step 8.
- Set the Lambda Timeout value to 5 minutes.
- Save the function.
- Invoke the Lambda function
glacierselectmanually using a command line, or test it using a dummy “Hello World” test event, as shown following.
- Test the Lambda function
glacierselectusing the test event that you created in Step 2. You can see the output in the Lambda console, as shown following. Alternatively, you can also view the information in the Lambda logs.
10. View the archive results in the S3 bucket
After successful invocation, it may take a few hours to get the output in the S3 folder (
output/UcPixxxx/results) because you used the Standard retrieval option for Amazon Glacier. If you need to view the results faster, you might consider the Expedited option. Also, you might consider the Bulk retrieval option for downloading the large volume of data in a cost-effective manner. You can download the result file and analyze it using text editor. Alternatively, you can re-upload the data into an Amazon RDS table using AWS DMS with S3 as a source.The following is sample data with a property registration value > 500000 retrieved from Amazon Glacier using Amazon Glacier Select:
11. Clean up all the resources
- Remove the UK public data set from your EC2/ local desktop because it contains data with usage restrictions.
- Delete the folders created in the S3 bucket
- Delete the AWS Lambda functions created
- Delete the AWS CloudFormation stack that you created in Step 3. Doing this step removes all the resources used in this post.
- Delete the Amazon Glacier vault.
- You can extend this architecture to include Amazon DynamoDB and build a complete archiving solution. You can store the Amazon Glacier
archiveidin DynamoDB with a meaningful description (for example:
partition_key= archive_criteria_id(uk_registry_1996), Attribute1 (archiveid=xxxx), Attribute2 (date_created=xxxx)etc.) for indexing. You can then programmatically retrieve it as needed.
- If you need to query the archive data more frequently, you can use AWS DMS to archive it to S3 and query via Amazon S3 Select or Amazon Athena.
- You can automate the archival process by scheduling AWS DMS tasks to run periodically.
Managing ever-growing data is a major challenge for many enterprises. This post explained how you can use AWS DMS, Amazon Glacier, AWS Lambda, and Amazon S3 to handle infrequently accessed data that is stored in relational databases. The Amazon Glacier Select feature helps you save costs and improve performance by reducing the targeted dataset that you use while querying the data archives. This combination of services helps you manage your data growth while balancing cost and performance.
About the Authors
Gowri Balasubramanian is a senior solutions architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance on both relational and NoSQL database services, helping them improve the value of their solutions when using AWS.
Ganesh Shankaran is a senior solutions architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance to help them improve the value of their solutions when using AWS.