AWS Database Blog

Building data lakes and implementing data retention policies with Amazon RDS snapshot export to Amazon S3

Amazon Relational Database Service (Amazon RDS) helps you easily create, operate, and scale a relational database in the cloud. In January 2020, AWS announced the ability to export snapshots from Amazon RDS for MySQL, Amazon RDS for PostgreSQL, Amazon RDS for MariaDB, Amazon Aurora PostgreSQL, and Amazon Aurora MySQL to Amazon Simple Storage Service (Amazon S3) in Apache Parquet format. You can now make data from production databases available to downstream reporting and analytical applications without impacting your main transactional applications.

Database administrators and data owners constantly receive requests to provide access to database tables from builders of business intelligence reports, machine learning developers, or enterprise dashboard creators. The challenge here is to provide access to data in a way that does not impact the business-critical applications that operate on the data. Creating read replicas is one solution, but you still have to maintain provisioned compute capacity to serve the downstream application requests.

To make matters more complex, some reporting systems require a copy of data as of a previous timestamp. The only way to accomplish this previously was to create new RDS instances by restoring from a snapshot and making the new instance accessible to the reporting system.

With the launch of Amazon RDS snapshot export to Amazon S3, you can simply create a process to export the requested tables from an appropriate snapshot to an S3 bucket and provide access to the Parquet files to the downstream applications.

This post shows how to create an export in Amazon S3 from an Amazon RDS snapshot, filtered to specific tables, and how to query the resulting dataset in Amazon Athena.

Solution overview

The following diagram shows how you can set up delivery streams for data to downstream applications without impacting the business-critical applications that perform read/write operations on the primary RDS instance.

In this architecture, data is replicated to a read-only target (read replica) that you set up through the Amazon RDS console. The replica maintains a duplicate copy of the entire database and continuously runs provisioned database compute resources to serve data to analytics applications and the crawlers that build the data lake. The advantage of maintaining the extra resources this way is that it isolates the production database from any load other than the critical applications themselves.

With Amazon Aurora, the readers use the same clustered storage as the primary instance, but you still provision additional compute capacity to enable the read replica to serve data to the analytics applications.

There are multiple factors to consider when you use read replicas in this way:

  • The entire database is duplicated in the read replica. You must take special care to make sure that downstream applications are authorized only for the part of the data they need to access.
  • Applications have access only to live data because it is continuously replicated. Read replicas can’t serve data to readers that need it as of a specific point in time.
  • Analytical queries may require index structures that aren’t available in the read replica, resulting in poor query performance and heavy resource usage.

The following diagram shows an alternative architecture in which you create a process to export either the whole database or parts of it by using snapshots that already exist. This method doesn’t impact the production database or require additional resources for a read replica.

In the preceding use case, as an alternative to maintaining a read replica, the BI analytics application uses Athena to query the Apache Parquet dataset in an S3 bucket created through the snapshot export feature. The Parquet format is up to two times faster to export and consumes up to six times less storage in Amazon S3 compared to text formats. The AWS Glue crawlers that feed the data lake access the files directly, so they don’t need to query the database, download the results, and reformat them before loading data into the lake.

To optimize the performance of your snapshot export, create your tables with auto-incrementing primary keys, especially for larger tables. Avoiding composite primary keys and clustered indexes is generally good design practice, but it matters even more here because the export process runs much faster when tables have auto-incrementing primary keys.
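
As a sketch of that recommendation, the following statement creates a table like the one used later in this post with a single auto-incrementing surrogate key instead of a composite primary key. The endpoint, credentials, database, and column names are illustrative placeholders, not values from this post:

    mysql --host=your-db-endpoint.rds.amazonaws.com --user=admin --password --database=mlb --execute "
    CREATE TABLE players (
        player_id  BIGINT NOT NULL AUTO_INCREMENT,  -- single auto-incrementing surrogate key
        first_name VARCHAR(50),
        last_name  VARCHAR(50),
        team       VARCHAR(50),
        throws     CHAR(1),
        PRIMARY KEY (player_id)                     -- avoids a composite primary key
    );"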

Creating a snapshot export

In the following use case, you use an existing snapshot of an Amazon RDS database and extract data from it to an S3 bucket by using the export feature. To accomplish this, you filter the export to a specific table. You then create and run an AWS Glue crawler, which extracts the schema from the export and creates a table. The advantage of this approach over manually defining the table in Athena is that crawlers can infer schema definitions, which eliminates the effort of identifying the columns and their data types. Lastly, you query the exported Parquet file using Athena.

To create a snapshot export, complete the following steps:

  1. On the Amazon RDS console, choose Snapshots.
    You can run the export from automated backups, a DB snapshot you create manually, or snapshots you create in AWS Backup.
  2. Select the snapshot to export.
  3. From the Actions drop-down menu, choose Export to Amazon S3.
  4. For Export identifier, enter an identifier for the export.
  5. For Identifiers, specify the filtering conditions for the data to export; for example, a table in a schema.
    You always export the data in Apache Parquet format. For this post, you export a table that contains information about major league baseball players.
  6. For S3 bucket, choose the S3 bucket to which you want to export the data.
  7. As an optional step, for S3 prefix, specify a prefix to use as a subfolder in the bucket.
  8. For IAM role, choose an IAM role that has access to create the export and write to the S3 bucket you specified earlier.
    Optionally, you can create a new role with the required access. To protect the exported data, the export is encrypted with an AWS KMS key that you specify.
  9. For Encryption, enter your key ARN.
  10. Choose Export to S3.
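
If you prefer to script the export, you can start the same task with the AWS CLI. In the following sketch, the identifiers, ARNs, bucket, prefix, and table name are placeholders for your own values; the --export-only option limits the export to specific databases, schemas, or tables:

    aws rds start-export-task \
        --export-task-identifier mlb-players-export \
        --source-arn arn:aws:rds:us-east-1:123456789012:snapshot:your-snapshot-name \
        --s3-bucket-name your-s3-bucket \
        --s3-prefix rds-exports \
        --iam-role-arn arn:aws:iam::123456789012:role/S3rdsexportrole \
        --kms-key-id your-kms-key-arn \
        --export-only mlb.players

You can track progress with aws rds describe-export-tasks --export-task-identifier mlb-players-export.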

If you are pre-creating the role and granting the access, complete the following steps:

  1. Create the role with the following CLI command:
    aws iam create-role  --role-name S3rdsexportrole  --assume-role-policy-document '{
         "Version": "2012-10-17",
         "Statement": [
           {
             "Effect": "Allow",
             "Principal": {
                "Service": "export.rds.amazonaws.com"
              },
             "Action": "sts:AssumeRole"
           }
         ] 
       }'
  2. Create a policy with the following code:
    aws iam create-policy  --policy-name ExportPolicy --policy-document '{
         "Version": "2012-10-17",
         "Statement": [
             {
                 "Effect": "Allow",
                 "Action": [
                     "s3:ListBucket",
                     "s3:GetBucketLocation"
                 ],
                 "Resource": [
                     "arn:aws:s3:::*"
                 ]
             },
             {
                 "Effect": "Allow",
                 "Action": [
                     "s3:PutObject*",
                     "s3:GetObject*",
                     "s3:DeleteObject*"
                 ],
                 "Resource": [
                     "arn:aws:s3:::your-s3-bucket",
                     "arn:aws:s3:::your-s3-bucket/*"
                 ]
             }
         ]
    }'
  3. Attach the policy to the role you created earlier. Use the policy-arn from the policy you created. See the following code:
    aws iam attach-role-policy  --policy-arn your-policy-arn  --role-name S3rdsexportrole
  4. Add the role as a key user for the KMS key you are using to encrypt the export.
    The export dashboard shows the new export process with the status Starting.
  5. Wait for the status to change to Available.
    After the export process is complete, create an AWS Glue crawler to crawl the exported data, extract the schema from the export, and create a table.
  6. On the AWS Glue console, choose Crawlers.
  7. Choose Add crawler.
  8. The crawler requires the following parameters (at a minimum):
  • Name
  • Source type – Choose data stores because, for this use case, you need AWS Glue to scan the data and create a table.
  • Data store – Choose Amazon S3 and provide the bucket name you chose for the Amazon RDS snapshot export.
  • IAM role – The role the crawler uses to run (this is a different role from the one you used for the snapshot export). It needs permissions to run the crawler and to access the S3 bucket that contains the export and the KMS key you used to encrypt it. Optionally, you can create a new role for the crawler to use. For information about tailoring policies to attach to a role, see AWS Glue Access Control Policy Examples.
  • A schedule for the crawler – Configure the crawler to run on demand or at a specified frequency.
  • Output – Define a database to host the tables (or create a new database) and any prefixes to use for table names.
  9. Run the crawler and wait for it to create the table.
    After the crawler is complete, it updates its status to Ready and displays a message with the number of tables it created or updated. You can also choose Logs in the crawler’s dashboard to view the Amazon CloudWatch Logs it wrote and make sure that all the steps in the process are complete.
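
You can also create and run the crawler from the AWS CLI. In this sketch, the crawler name, role, database, table prefix, and S3 path are placeholders, and the role must already exist with the permissions described above:

    # Create a crawler that scans the export prefix and writes table definitions to the rdsexports database
    aws glue create-crawler \
        --name rds-export-crawler \
        --role AWSGlueServiceRole-rds-export \
        --database-name rdsexports \
        --table-prefix export_ \
        --targets '{"S3Targets": [{"Path": "s3://your-s3-bucket/rds-exports/"}]}'

    # Run the crawler on demand and check its state until the run finishes
    aws glue start-crawler --name rds-export-crawler
    aws glue get-crawler --name rds-export-crawler --query 'Crawler.State'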

Querying the table

To query the table, on the Athena console, choose the database you created earlier.

The following screenshot shows the query output from the MLB player data.

You can also use Athena to run analytical queries against data from your Amazon RDS tables by using the Amazon RDS snapshot export to Amazon S3 feature.

For example, you can run a query to examine the distribution of left-handed and right-handed pitchers across different teams.
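
The table and column names in the following sketch (export_players, team, and throws) are placeholders for the names the crawler created in your AWS Glue Data Catalog; add a filter on a position column if your table has one. You can paste the SELECT statement into the Athena query editor, or run it with the AWS CLI:

    aws athena start-query-execution \
        --query-string "SELECT team, throws, COUNT(*) AS player_count
                        FROM export_players
                        GROUP BY team, throws
                        ORDER BY team, throws" \
        --query-execution-context Database=rdsexports \
        --result-configuration OutputLocation=s3://your-s3-bucket/athena-results/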

The following screenshot shows the query output.

Pricing

The Amazon RDS snapshot export to Amazon S3 feature incurs charges per GB of the snapshot size. For more information, see Amazon RDS for MySQL Pricing and choose Snapshot Export. Pricing is based on the full size of the snapshot, even if you run only a partial export (for a schema or a table). The other components of the process, AWS Glue and Athena, are serverless, so you pay only for the compute capacity you actually use. Storage of the exported data in Amazon S3 is billed at standard S3 rates; for more information, see Amazon S3 pricing. You can optimize this cost by moving infrequently accessed data to Amazon S3 Glacier.

Conclusion

The ability to export data directly from snapshots is an excellent tool for data owners, administrators, and analysts to provide access to data without impacting the performance and availability of production databases. You can make data available in a secure, reliable, and performant way without maintaining read replicas that require continuously provisioned compute and storage. More importantly, you can access partial data, or data as of a previous point in time, from snapshots you already created, without restoring them to a new database instance.

For more information about creating, querying, and scheduling snapshot exports, see Amazon RDS Snapshot Export to S3 on YouTube.

About the Author

Prasanth Kollarath is a Sr. Technical Account Manager with Amazon Web Services.