AWS Database Blog

Amazon Aurora PostgreSQL backups and long-term data retention methods

When operating relational databases, the need for backups and for options to retain data long term is ever-present. This is especially true for many users of Amazon Aurora PostgreSQL-Compatible Edition. Although AWS-native, PostgreSQL-native, and hybrid solutions are available, the main challenge lies in choosing the correct backup and retention strategy for a given use case. In this post, we explore the primary tools for backup, backup storage, and restore, and discuss some best practices for applying these tools to a given use case.

Aurora-native backup and AWS Backup

Aurora provides automated backups of your Aurora cluster volume for the length of the backup retention period, which can be set from 1–35 days. Aurora backups are continuous and incremental in nature, without causing a performance impact to your application or any interruption to the database service. If you need to restore data from a backup, you can restore to any point within the backup retention period.

Many businesses have requirements to retain Aurora backups beyond the backup retention period. This is where taking a manual snapshot of the data in your cluster volumes comes in handy. Because Aurora retains incremental restore data for the entire backup retention period, you only need to create a snapshot for backups that you want to retain beyond the automated backup retention period. You can create a new DB cluster from the snapshot.
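As a rough illustration of this workflow with the AWS CLI (the cluster and snapshot identifiers below are placeholders), you can create a manual cluster snapshot and later restore a new cluster from it:

# Create a manual snapshot of the cluster volume
aws rds create-db-cluster-snapshot \
    --db-cluster-identifier my-aurora-cluster \
    --db-cluster-snapshot-identifier my-aurora-cluster-manual-snap

# Restore a new cluster from that snapshot
aws rds restore-db-cluster-from-snapshot \
    --db-cluster-identifier my-aurora-cluster-restored \
    --snapshot-identifier my-aurora-cluster-manual-snap \
    --engine aurora-postgresql

Note that the restored cluster still needs at least one DB instance added to it (for example, with aws rds create-db-instance) before you can connect to it.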

Aurora publishes a set of Amazon CloudWatch metrics for each Aurora DB cluster to help you track your backup storage consumption. Backup storage consumption is expressed through three CloudWatch metrics:

  • BackupRetentionPeriodStorageUsed – Represents the amount of backup storage used, in gigabytes, for storing continuous backups at the current time. This value depends on the size of the cluster volume and the amount of changes you make during the retention period.
  • SnapshotStorageUsed – Represents the amount of backup storage used, in gigabytes, for storing manual snapshots beyond the backup retention period. Manual snapshots taken within the retention period don’t count against your backup storage, and automated snapshots don’t count against it either.
  • TotalBackupStorageBilled – Represents the sum, in gigabytes, of BackupRetentionPeriodStorageUsed and SnapshotStorageUsed, minus an amount of free backup storage that equals the size of the cluster volume for one day.

You can use these metrics to track the backup storage consumed by database clusters over time, and determine the effects of data changes, backup retention period changes, and different manual snapshot retention policies over time.
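For example, you can pull one of these metrics with the AWS CLI. The following is a minimal sketch, assuming the metrics are published in the AWS/RDS namespace under the DBClusterIdentifier dimension, and using a placeholder cluster name and date range:

# Daily average of total billed backup storage for January 2024
aws cloudwatch get-metric-statistics \
    --namespace AWS/RDS \
    --metric-name TotalBackupStorageBilled \
    --dimensions Name=DBClusterIdentifier,Value=my-aurora-cluster \
    --start-time 2024-01-01T00:00:00Z \
    --end-time 2024-01-31T00:00:00Z \
    --period 86400 \
    --statistics Average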

Nevertheless, when backups are stored beyond 35 days, each copy is treated as full storage for billing purposes. There are several ways to retain these backups beyond 35 days while optimizing cost, which we discuss in later sections.

You can also use AWS Backup to manage manual Aurora DB cluster snapshots. AWS Backup provides central management and monitoring of Aurora backup activity. It also provides the capability to copy Aurora snapshots within and across AWS Regions, except for China Regions. Backups managed by AWS Backup don’t count towards the DB cluster snapshot quota for Aurora. With AWS Backup Vault Lock, you get a write-once-read-many (WORM) capability that adds a layer of defense for Aurora and Amazon Relational Database Service (Amazon RDS) backups against malicious activity such as ransomware attacks, protecting backups from deletion and retention periods from being changed.
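Outside of AWS Backup, you can also copy a manual cluster snapshot across Regions directly with the RDS API. The following AWS CLI sketch uses placeholder identifiers, account number, and Regions, and runs against the destination Region:

# Copy a cluster snapshot from us-west-2 into us-east-1
aws rds copy-db-cluster-snapshot \
    --region us-east-1 \
    --source-region us-west-2 \
    --source-db-cluster-snapshot-identifier arn:aws:rds:us-west-2:111122223333:cluster-snapshot:my-aurora-cluster-manual-snap \
    --target-db-cluster-snapshot-identifier my-aurora-cluster-manual-snap-copy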

Export to Amazon S3

Import and export with Amazon Simple Storage Service (Amazon S3) is a selective backup option available for Amazon RDS and Aurora: you can import CSV and other flat files from a configured S3 bucket into these database engines, and export data back out to one. This backup method allows for a great deal of flexibility in selecting the data to be backed up, because it lets you export the result of a database query to a delimited flat file (such as a CSV).

This flexibility comes with some disadvantages: only the data itself is backed up, not the underlying DDL and table structure. Additionally, users and other global permissions can’t be preserved with this backup method. Assuming the prerequisites have been met, you can enable this feature in PostgreSQL-based engines with the following psql command:

psql=> CREATE EXTENSION IF NOT EXISTS aws_s3 CASCADE;

For more information on setting up the required resources to enable this feature, refer to Exporting data from an RDS for PostgreSQL DB instance to Amazon S3.
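At a high level, the prerequisites involve creating an IAM role that can write to the target S3 bucket and associating that role with your cluster for the S3 export feature. The following AWS CLI sketch assumes a placeholder cluster name, account number, and a role that already has the appropriate S3 permissions attached:

# Associate the IAM role with the Aurora cluster for S3 export
aws rds add-role-to-db-cluster \
    --db-cluster-identifier my-aurora-cluster \
    --role-arn arn:aws:iam::111122223333:role/aurora-s3-export-role \
    --feature-name s3Export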

After the PostgreSQL extension has been enabled, you can save query results as CSV and other delimited flat file types. If we want to export a specific range of data from the table employees, for example, the process generally consists of two steps. The first step creates a URI structure identifying the target S3 bucket, file path, and Region for a given Aurora PostgreSQL database:

psql=> SELECT aws_commons.create_s3_uri( 'sample-bucket', 'sample-filepath', 'us-west-2') AS s3_uri_x \gset

The second step selects a subset of data and writes it to a delimited file in the mapped S3 bucket:

psql=> SELECT * from aws_s3.query_export_to_s3('SELECT id, name_last, name_first FROM employees WHERE id <= 56', :'s3_uri_x', options :='format csv, delimiter $$,$$');

If you want to dump the entire contents of the employees table, only the query text passed to the function needs to change:

psql=> SELECT * from aws_s3.query_export_to_s3('SELECT * FROM employees', :'s3_uri_x', options :='format csv, delimiter $$,$$');

Note that aws_s3.query_export_to_s3 exports the result of a single query, and there is no wildcard for selecting from every table at once. If you want to export the entire working database to Amazon S3, you need to run one export per table, writing each table to its own file path.
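One way to automate this is sketched below, assuming the same placeholder bucket, file path prefix, and Region as before; it uses psql’s \gexec to generate and immediately run one aws_s3.query_export_to_s3 call for each table in the public schema:

psql=> SELECT format(
           'SELECT * FROM aws_s3.query_export_to_s3(%L, aws_commons.create_s3_uri(%L, %L, %L), options := %L)',
           'SELECT * FROM ' || quote_ident(schemaname) || '.' || quote_ident(tablename),
           'sample-bucket',
           'sample-filepath/' || tablename || '.csv',
           'us-west-2',
           'format csv'
       )
       FROM pg_tables
       WHERE schemaname = 'public' \gexec

Each generated statement exports one table to its own CSV object; adjust the schema filter, file naming, and options to match how the data will be consumed downstream.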

pg_dump

pg_dump is a PostgreSQL-native tool that you can use to take several types of logical backups from PostgreSQL-compatible databases. The utility can take full or partial backups, and data-only or structure-only (schema and DDL) backups, in a variety of output formats ranging from a single editable SQL text file to a compressed, directory-organized backup that supports multi-threaded backup and restore. It can run on Amazon Elastic Compute Cloud (Amazon EC2) or on premises, in GNU/Linux, macOS, and Windows environments (depending on the operating system the binaries were compiled for). For the utility to connect to your Aurora PostgreSQL instance, you need to configure networking so the resource that pg_dump runs on can communicate with the source Aurora PostgreSQL instance being backed up.

The construction of a pg_dump command in the most commonly used environment (GNU/Linux) is as follows:

pg_dump [connection-options] [options] [dbname]

pg_dump’s connection options are identical to those used for psql. Among the pg_dump options that are vital to understand, the format flag (-F) is a great place to start. The format flag sets the format of the backup to be taken, and there are four: plain (p), custom (c), directory (d), and tar (t). The plain format is rarely chosen deliberately in practice, but because it’s the default, many beginner database administrators end up using it frequently.

The plain format writes the specified backup to an uncompressed text (SQL) file. This allows for easy copying, editing, and rearranging of database backups after they’ve been taken, which can be especially useful if you want to pull specific pieces of data out of a larger backup. However, because this method produces a single text (SQL) file, the result can be difficult to edit and modify with traditional tools (such as Notepad or Vim) once backup files reach the gigabyte range. Although this backup format is flexible, it’s not recommended for large production backups of an entire dataset. Additionally, this is the only pg_dump format that can’t be used with pg_restore, and it can’t be parallelized (though it can be restored using psql with the -f flag).
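For instance, a plain-format dump and its psql-based restore might look like the following minimal sketch, where the endpoints, user, and database names are placeholders:

# Dump the database to an uncompressed SQL text file
pg_dump -h <source-cluster-endpoint> -U postgres -Fp -f mydb.sql mydb

# Restore by replaying the SQL file with psql (pg_restore can't read plain dumps)
psql -h <target-cluster-endpoint> -U postgres -d mydb_restored -f mydb.sql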

The next pg_dump backup format we discuss is the directory (d) format. This format writes the backup as a directory of files, with each file representing a single table or partition. The backup also includes a table of contents (TOC) file that describes each of the dumped files in a format pg_restore can read. You can manipulate this backup type using standard Linux tools, such as tar and gzip for compressing the resulting backup. Importantly, this is the only format that supports parallel dumps using the -j option flag.
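A parallel directory-format dump and restore could look like the following sketch, with placeholder endpoints, paths, and job counts:

# Directory-format dump using 4 parallel worker jobs
pg_dump -h <source-cluster-endpoint> -U postgres -Fd -j 4 -f /backups/mydb_dir mydb

# Parallel restore of the directory archive into an empty target database
pg_restore -h <target-cluster-endpoint> -U postgres -j 4 -d mydb_restored /backups/mydb_dir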

Complementing the directory backup format is the custom (c) pg_dump backup type. This format creates a custom-formatted archive suitable for use with pg_restore. Along with the directory (d) format, it offers the most flexible production-ready backups pg_dump can create, because it allows for manual selection and reordering of archived items during the restore process. This format is also compressed by default, saving disk space.
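To illustrate the selective restore workflow this enables, the following sketch (with placeholder names) takes a custom-format dump, writes out its table of contents, and restores only the items left in the edited list:

# Custom-format (compressed) dump
pg_dump -h <source-cluster-endpoint> -U postgres -Fc -f mydb.dump mydb

# Extract the table of contents, edit it to comment out or reorder items,
# then restore only what remains in the list
pg_restore -l mydb.dump > mydb.toc
pg_restore -h <target-cluster-endpoint> -U postgres -L mydb.toc -d mydb_restored mydb.dump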

Restoring the database is almost as easy as backing it up, but keep in mind that pg_dump doesn’t generate a CREATE DATABASE statement by default. This isn’t of consequence, because it’s a best practice to create globals and empty databases to restore into before restoring any data or DDL into them.
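In practice, that means creating the empty target database (and any required roles) before running the restore; a minimal sketch with placeholder names follows:

# Create an empty database on the target cluster, then restore into it
psql -h <target-cluster-endpoint> -U postgres -d postgres -c 'CREATE DATABASE mydb_restored'
pg_restore -h <target-cluster-endpoint> -U postgres -d mydb_restored mydb.dump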

When to apply each backup option

In this section, we discuss the appropriate use cases for each backup option.

Use cases for Aurora PostgreSQL snapshots and automated backups

Use this option whenever possible if the backup is going to be restored in full to another Aurora PostgreSQL cluster. Automated backups are best used for point-in-time recovery (PITR) and recovering specific data from a point in time when needed, and not as a primary mechanism for longer-term production backups. You can easily copy full snapshots into different AWS accounts and Regions using the AWS Command Line Interface (AWS CLI), the AWS Management Console, or AWS Backup, and they’re the best choice for full production backups of existing Aurora PostgreSQL clusters.

Use cases for exporting to Amazon S3

Use this option when you need to extract and back up portions of a dataset for usage in services outside of Aurora PostgreSQL or when moving data selectively into other clusters. One example might be exporting table views at specific points in time into a format that can then be inserted into a separate cluster for analytics or long-term storage for later analysis.

Use cases for pg_dump

pg_dump is appropriate when you need to export data in such a way that you can restore it into other types of PostgreSQL databases (such as self-managed or AWS managed platforms) or if you need to export a much larger subset of data from an Aurora PostgreSQL cluster (but not the cluster in its entirety). Examples of this might be reorganizing multi-tenant environments and needing to move individual databases into new clusters (but not wanting to incur the additional overhead of setting up more complicated services to migrate and rebalance the data).

Best practices

In this section, we share some best practices when considering these backup options.

Right-size automated Aurora PostgreSQL backup retention intervals

You might enable automated backups to make sure that appropriate PITR goals are achievable at implementation time, without revisiting how long backups are retained as your workloads develop and grow. As Aurora PostgreSQL is used and backup and restore patterns are tested, the backup retention period should be right-sized based on actual usage and changing business requirements. The retention period is often set to 7 days, but depending on the size of the dataset and the velocity of change within a cluster, this number might be reduced to realize savings on billed backup storage. Do this carefully, so that PITR goals can still be met while minimizing storage costs.
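Changing the retention window itself is a single modification call; for example, with the AWS CLI and a placeholder cluster name:

# Reduce the automated backup retention period to 7 days
aws rds modify-db-cluster \
    --db-cluster-identifier my-aurora-cluster \
    --backup-retention-period 7 \
    --apply-immediately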

Plan how frequently manual snapshots are taken, and retain enough of them

It’s common to take manual snapshots, use them, and then retain them indefinitely (or until the monthly storage cost grows high enough to prompt a closer look). Especially when taking manual snapshots isn’t automated, the lack of a rotation mechanism for these backups can result in a large number of retained snapshots and higher than necessary billed backup storage charges.

If manual snapshots are taken regularly, automate their creation and rotation (elimination of older backups, replacing them with newer backups for cost savings) as you’re able. If this isn’t possible, periodically rotate and delete unneeded backups in order to save on the related storage costs.
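If you script the rotation yourself rather than relying on AWS Backup lifecycle rules, a rough sketch with the AWS CLI might look like the following; the cluster name and 90-day cutoff are placeholders, the date command assumes GNU date, and you should review what would be deleted before running it for real:

#!/bin/bash
# Delete manual cluster snapshots older than 90 days (placeholder cutoff)
CUTOFF=$(date -u -d '90 days ago' +%Y-%m-%dT%H:%M:%S)

aws rds describe-db-cluster-snapshots \
    --db-cluster-identifier my-aurora-cluster \
    --snapshot-type manual \
    --query 'DBClusterSnapshots[].[DBClusterSnapshotIdentifier,SnapshotCreateTime]' \
    --output text |
while read -r snapshot_id create_time; do
    # ISO 8601 timestamps sort lexicographically, so a string compare works here
    if [[ "$create_time" < "$CUTOFF" ]]; then
        echo "Deleting snapshot $snapshot_id (created $create_time)"
        aws rds delete-db-cluster-snapshot --db-cluster-snapshot-identifier "$snapshot_id"
    fi
done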

Understand how generated data will be consumed when exporting to Amazon S3

Before you use the integration to export to Amazon S3, it’s important to understand how data is going to be consumed once it’s generated. Will the data be reloaded into another cluster? Is it going into Amazon Redshift? What delimiters will the destination application be looking for when processing the data later? You should ask these and related questions before generating data using the Amazon S3 integration, because how the data will be consumed plays a big role in the options and flags to use when generating data using this integration.

Understand how consumers of database backup data will use the generated data

Regardless of the format, understanding the downstream needs of users of your data is vital to said data being consumed effectively. Whether the data is to be loaded into a different type of relational (or nonrelational) database, or used for analytics purposes, understanding how downstream users of your data will consume it helps determine how the backups should be taken.

Choose the correct pg_dump format flags before taking backups

When using pg_dump, make sure that the intended usage of the target backups is understood before taking them. After you take a backup with pg_dump, you can’t change its format without restoring it into a new cluster and re-dumping the data, or taking a new replacement dump from the source cluster. When schema-only dumps (-s) are needed, the plain output format (-Fp) is usually desired, because it allows for easy copying and pasting of the needed objects. For larger clusters (over 500 GB), which may require parallelized dump and restore, the directory and custom formats should be considered.
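For example, a schema-only dump in plain format, suitable for copying DDL out of, could be taken as follows (the endpoint and names are placeholders):

# Schema-only (-s), plain-format (-Fp) dump of the DDL
pg_dump -h <source-cluster-endpoint> -U postgres -s -Fp -f mydb_schema.sql mydb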

Conclusion

Many backup and long-term archival options are available for users of Aurora PostgreSQL. Between AWS-native and PostgreSQL-native tooling, there are options to meet a wide variety of customer use cases. Knowing when and where to apply each strategy is vital to deploying any of the strategies discussed successfully.

We welcome your feedback on this post, and encourage you to share your experience and any questions in the comments.


About the Authors

Peter Celentano is a Senior Specialist Solutions Architect at AWS, focusing on managed PostgreSQL. He works with AWS customers to design scalable, secure, performant, and robust database architectures on the cloud.

Sundar Raghavan is a Principal Database Specialist Solutions Architect at AWS and specializes in relational databases. Among his customers are clients across multiple industries in support of Oracle, PostgreSQL, and migration from Oracle to PostgreSQL on AWS. Previously, Sundar served as a database and data platform architect at Oracle and Cloudera/Hortonworks. He enjoys reading, watching movies, playing chess, and being outside when he is not working.

Kapil Shardha is a Principal Solutions Architect at AWS. He supports enterprise customers with their AWS adoption, and his background is in infrastructure automation and DevOps.