Implementing a disaster recovery strategy with Amazon RDS
This post was updated 2/1/2021 to fix a statement about how to share automated snapshots between AWS accounts.
Amazon RDS (Relational Database Service) is a managed service that makes it easier to set up, operate, and scale a relational database. Built on AWS high-performance compute and storage, Amazon RDS supports the MySQL, SQL Server, PostgreSQL, MariaDB, and Oracle database engines. It offers a complete set of solutions for provisioning, patching, monitoring, and disaster recovery (DR). This post presents three features in Amazon RDS that support DR: automated backups, manual backups, and Read Replicas.
Why do you need a DR plan?
For a production environment, it is important to take precautions so that you can recover if there's an unexpected event. While Amazon RDS provides a highly available Multi-AZ configuration, it can't protect against every scenario, such as a natural disaster, a malicious actor, or logical corruption of a database. To maintain business continuity, it is important to design and test a DR plan.
Understanding RTO and RPO
Recovery time objective (RTO) and recovery point objective (RPO) are two key metrics to consider when developing a DR plan. RTO is the maximum acceptable time it takes you to return to a working state after a disaster. RPO is the maximum acceptable amount of data loss, measured as the time between the last point you can recover to and the moment the disaster happens. Both are usually expressed in hours or minutes. For example, an RPO of 1 hour means that you could lose up to 1 hour's worth of data when a disaster occurs.
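To make the two metrics concrete, here is a small Python sketch that evaluates one hypothetical recovery against RTO and RPO targets. The function name `meets_objectives` and the timestamps are illustrative assumptions, not part of any AWS API.

```python
from datetime import datetime, timedelta


def meets_objectives(last_recoverable: datetime,
                     disaster_at: datetime,
                     service_restored_at: datetime,
                     rto: timedelta,
                     rpo: timedelta) -> dict:
    """Evaluate one (hypothetical) recovery against RTO and RPO targets."""
    # Everything written after the last recoverable point is lost.
    data_loss = disaster_at - last_recoverable
    # Downtime runs from the disaster until the service is usable again.
    downtime = service_restored_at - disaster_at
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= rpo,
        "rto_met": downtime <= rto,
    }


disaster = datetime(2021, 2, 1, 12, 0)
result = meets_objectives(
    last_recoverable=disaster - timedelta(minutes=40),
    disaster_at=disaster,
    service_restored_at=disaster + timedelta(hours=3),
    rto=timedelta(hours=2),
    rpo=timedelta(hours=1),
)
print(result["rpo_met"], result["rto_met"])  # True False
```

In this example, 40 minutes of lost data fits a 1-hour RPO, but 3 hours of downtime misses a 2-hour RTO, so the strategy would need a faster recovery mechanism rather than more frequent backups.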
Different features of Amazon RDS support different RTOs and RPOs at different cost points:
| Feature | RTO | RPO | Cost | Scope |
|---|---|---|---|---|
| Automated backups | Good | Better | Low | Single Region |
As you can see, automated backups are limited to a single AWS Region while manual snapshots and Read Replicas are supported across multiple Regions.
This post does not explicitly cover Amazon Aurora, because Amazon Aurora has slightly different DR features. However, many of the techniques presented are applicable to Aurora DB clusters. For more information, see the Amazon Aurora documentation.
Amazon RDS backups
Backups are a key component of a DR plan for your database. Amazon RDS supports two different types of backups: automated backups, and manual snapshots.
Amazon RDS backups follow these rules:
- Your DB instance must be in the available state for automated backups to occur.
- Automated backups and automated snapshots do not occur while a snapshot copy is running in the same Region for the same DB instance.
- The first snapshot of a DB instance contains the data of the full DB instance.
- Snapshots taken after the first snapshot are incremental. This means that only the data that changed since the previous snapshot is captured and saved.
- In a Multi-AZ configuration, backups are taken from the standby to reduce the impact on the primary.
Note: Automated backups and manual snapshots are stored in an S3 bucket that is owned and managed by the Amazon RDS service. As a result, you can't see them in your Amazon S3 console.
For detailed information on backup mechanisms and backup storage, see Working with Backups in the Amazon RDS User Guide.
Automated backups
The automated backup feature of Amazon RDS is turned on by default. Amazon RDS creates a storage volume snapshot of your DB instance, backing up the entire DB instance and not just individual databases. The first backup consists of a full instance backup. Subsequent backups are incremental: each snapshot stores only the blocks that changed since the previous backup. Each snapshot contains pointers to all of the data blocks that are required to reconstruct it, so deleting an earlier snapshot does not cause data loss as long as the data is still referenced by at least one remaining snapshot.
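The following toy model illustrates why deleting an earlier snapshot doesn't lose data: each snapshot records a pointer for every block it needs, and a stored block is freed only when no remaining snapshot references it. It's a conceptual sketch under simplifying assumptions (a volume is modeled as a dict of block IDs to contents, and `SnapshotStore` is a hypothetical class), not a description of how Amazon RDS stores snapshots internally.

```python
from collections import Counter


class SnapshotStore:
    """Toy model of incremental, block-level snapshots with reference counting."""

    def __init__(self):
        self.blocks = {}            # storage key -> block contents
        self.refcount = Counter()   # storage key -> number of snapshots using it
        self.snapshots = {}         # snapshot name -> {block_id: storage key}
        self._last = {}             # block_id -> (contents, key) in the latest snapshot

    def take(self, name, volume):
        """Snapshot `volume`, storing only changed blocks; return how many were stored."""
        pointers, new_blocks = {}, 0
        for block_id, data in volume.items():
            prev = self._last.get(block_id)
            if prev is not None and prev[0] == data and prev[1] in self.blocks:
                key = prev[1]                 # unchanged: point at the existing block
            else:
                key = (name, block_id)        # changed or new: store a fresh copy
                self.blocks[key] = data
                new_blocks += 1
            pointers[block_id] = key
            self.refcount[key] += 1
        self.snapshots[name] = pointers
        self._last = {b: (volume[b], k) for b, k in pointers.items()}
        return new_blocks

    def delete(self, name):
        """Drop a snapshot; free only blocks that no other snapshot references."""
        for key in self.snapshots.pop(name).values():
            self.refcount[key] -= 1
            if self.refcount[key] == 0:
                del self.blocks[key]
                del self.refcount[key]

    def restore(self, name):
        """Rebuild the full volume from a snapshot's pointers."""
        return {b: self.blocks[k] for b, k in self.snapshots[name].items()}


store = SnapshotStore()
print(store.take("snap1", {0: "a", 1: "b", 2: "c"}))  # 3 (full copy)
print(store.take("snap2", {0: "a", 1: "B", 2: "c"}))  # 1 (only block 1 changed)
store.delete("snap1")
print(store.restore("snap2"))  # {0: 'a', 1: 'B', 2: 'c'}
```

After `snap1` is deleted, blocks 0 and 2 survive because `snap2` still points at them; only the superseded version of block 1 is freed.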
Why do we need automated backups?
There are several benefits of having automated backups:
- Data is stored in an S3 bucket that is owned and managed by the Amazon RDS service.
- You avoid the pressure of having to set aside time to do a manual backup and transfer it to a safe location.
- You can choose a daily backup window and a retention period (up to 35 days) that work for you.
- You are protected against both human-caused and natural incidents (for example, viruses, software malfunctions, or power outages).
- Most importantly, you avoid the adverse effects of losing valuable data.
Automated backup window
The automated backup window is the daily time range during which Amazon RDS creates automated backups. If you don't specify a window, Amazon RDS assigns a default 30-minute window selected at random from an 8-hour block of time for each AWS Region. However, I strongly suggest that you set the backup window during off-peak hours to avoid putting undue load on the server. For a list of the time blocks for each Region, see Backup Window in the Amazon RDS User Guide.
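When you set the window through the API or AWS CLI, it is expressed in UTC in the format `hh24:mi-hh24:mi` and must be at least 30 minutes long. It also must not overlap the weekly maintenance window, which this sketch does not check. The helper below, `backup_window_minutes`, is a hypothetical pre-flight check you might run before submitting a window; it assumes that a window crossing midnight (such as `23:30-00:30`) wraps around.

```python
import re

# hh24:mi-hh24:mi, for example "03:00-03:30" (times are UTC)
_WINDOW = re.compile(r"^([01]\d|2[0-3]):([0-5]\d)-([01]\d|2[0-3]):([0-5]\d)$")


def backup_window_minutes(window: str) -> int:
    """Return the length in minutes of a 'hh24:mi-hh24:mi' backup window.

    Raises ValueError if the format is wrong or the window is shorter
    than 30 minutes. Windows that cross midnight wrap around.
    """
    match = _WINDOW.match(window)
    if not match:
        raise ValueError(f"not in hh24:mi-hh24:mi format: {window!r}")
    sh, sm, eh, em = (int(g) for g in match.groups())
    length = ((eh * 60 + em) - (sh * 60 + sm)) % (24 * 60)
    if length < 30:
        raise ValueError(f"window must be at least 30 minutes, got {length}")
    return length


print(backup_window_minutes("03:00-03:30"))  # 30
print(backup_window_minutes("23:30-00:30"))  # 60
```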
Automated backup retention period
The backup retention period is the number of days that Amazon RDS keeps automated backups, and therefore the time window during which you can perform a point-in-time restore (PITR). You can set a retention period of up to 35 days when you create a DB instance, and you can modify it at any time. Setting the retention period to 0 turns off automated backups.
For more detailed information, see Backup Retention Period.
There are a few differences between manual snapshots and automated backups:
- Manual snapshot limits (by default, 100 manual snapshots per Region) do not apply to automated backups.
- The backup retention period does not apply to manual snapshots.
- Manual snapshots are not automatically deleted; they must be explicitly deleted.
Configuring automated backups during instance creation
- On the Backup tab, in the Configure advanced settings section, set the Backup retention period and Backup window.
- Set a Start time for the backup and the Duration in hours that your database requires to finish the backup. It’s a good practice to set the backup window during non-peak hours.
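The same settings can be supplied programmatically. The sketch below assumes the AWS SDK for Python (Boto3) and its `create_db_instance` call, passing `BackupRetentionPeriod` and `PreferredBackupWindow`. The instance class, engine, storage size, and user name are placeholder assumptions, and the hypothetical helper takes an already-constructed RDS client so that it can be exercised without AWS credentials.

```python
def create_instance_with_backups(rds, identifier, password,
                                 retention_days=7, window="03:00-03:30"):
    """Create a DB instance with automated backups configured.

    `rds` is expected to be a Boto3 RDS client, e.g. boto3.client("rds").
    The engine, class, storage, and user name are illustrative placeholders.
    """
    if not 1 <= retention_days <= 35:
        # 0 would turn automated backups off; 35 days is the maximum.
        raise ValueError("retention must be between 1 and 35 days")
    return rds.create_db_instance(
        DBInstanceIdentifier=identifier,
        DBInstanceClass="db.t3.medium",        # placeholder instance class
        Engine="postgres",                     # placeholder engine
        AllocatedStorage=100,                  # GiB, placeholder
        MasterUsername="dbadmin",              # placeholder user name
        MasterUserPassword=password,
        BackupRetentionPeriod=retention_days,  # days of PITR history
        PreferredBackupWindow=window,          # daily window, in UTC
    )
```

A call might look like `create_instance_with_backups(boto3.client("rds"), "prod-db", password=...)`, with the password read from a secrets store rather than hard-coded.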
Modifying automated backup settings
To change your automated backup settings, follow these steps:
- In the Amazon RDS console, choose the DB instance that you want to change the automated backup settings for, and then choose Modify.
- Scroll down to Backup and change the Backup retention period and Backup window settings.
- Under When to apply modifications, select the option you want, and then choose Modify DB instance.
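Programmatically, the equivalent Boto3 call is `modify_db_instance`. Its `ApplyImmediately` flag mirrors the When to apply modifications choice; check the `ModifyDBInstance` API reference for how each backup setting interacts with it. The hypothetical helper below sends only the settings you pass, so anything you omit stays unchanged; the identifier is a placeholder.

```python
def update_backup_settings(rds, identifier, retention_days=None,
                           window=None, apply_immediately=False):
    """Change automated backup settings on an existing DB instance.

    `rds` is expected to be a Boto3 RDS client. Only the settings that are
    passed are included in the request.
    """
    params = {"DBInstanceIdentifier": identifier,
              "ApplyImmediately": apply_immediately}
    if retention_days is not None:
        params["BackupRetentionPeriod"] = retention_days
    if window is not None:
        params["PreferredBackupWindow"] = window
    if len(params) == 2:
        # Only the identifier and the flag: nothing would actually change.
        raise ValueError("nothing to change")
    return rds.modify_db_instance(**params)
```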
Restoring to a specified point in time
Point-in-time recovery (PITR) is the process of restoring a database to the state it was in at a specified date and time.
When automated backups are turned on for your DB instance, Amazon RDS automatically takes a daily snapshot of your data during your preferred backup window. It also uploads transaction logs to Amazon S3 every 5 minutes as updates are made to your DB instance. Archiving the transaction logs is an important part of your DR process and of PITR. When you initiate a point-in-time recovery, Amazon RDS restores the most recent daily backup taken before the requested time and then applies the transaction logs to bring the new DB instance to that specific time.
For instructions, see Restoring a DB Instance to a Specified Time in the Amazon RDS User Guide.
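A point-in-time restore always creates a new DB instance. The sketch below, assuming Boto3, reads `LatestRestorableTime` and `BackupRetentionPeriod` from `describe_db_instances`, checks that the requested time falls inside the restorable window, and then calls `restore_db_instance_to_point_in_time`. Approximating the earliest restorable time as "now minus the retention period" is a simplifying assumption. `restore_time` must be a timezone-aware UTC `datetime`, because Boto3 returns aware timestamps. The identifiers are placeholders.

```python
from datetime import datetime, timedelta, timezone


def restore_to_time(rds, source_id, target_id, restore_time):
    """Restore `source_id` to `restore_time` as a new instance `target_id`.

    `rds` is expected to be a Boto3 RDS client. The earliest restorable
    time is approximated as now minus the retention period (an assumption).
    """
    db = rds.describe_db_instances(DBInstanceIdentifier=source_id)["DBInstances"][0]
    latest = db["LatestRestorableTime"]
    earliest = datetime.now(timezone.utc) - timedelta(days=db["BackupRetentionPeriod"])
    if not earliest <= restore_time <= latest:
        raise ValueError(f"{restore_time} is outside {earliest} .. {latest}")
    return rds.restore_db_instance_to_point_in_time(
        SourceDBInstanceIdentifier=source_id,
        TargetDBInstanceIdentifier=target_id,  # must not already exist
        RestoreTime=restore_time,
    )
```

To restore to the most recent possible point instead, the same API accepts `UseLatestRestorableTime=True` in place of `RestoreTime`.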
Retaining automated backups
When you delete a DB instance, you can choose to retain its automated backups. This can be useful if you later decide to restore the DB instance. Retained automated backups contain automated snapshots and transaction logs from a DB instance. They also include your DB instance properties (such as allocated storage and DB instance class), which are required to restore it to an active instance. You can restore or remove retained automated backups using the AWS Management Console, the Amazon RDS API, and the AWS CLI. See Retaining Automated Backups in the Amazon RDS User Guide for more information on limitations and recommendations for retaining automated backups.
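With Boto3, `delete_db_instance` exposes this choice through its `DeleteAutomatedBackups` parameter, and `describe_db_instance_automated_backups` lists the automated backups that remain. The sketch below also takes a final manual snapshot, which is a separate safeguard. The helper names and identifiers are hypothetical, and pagination of the list call is ignored for brevity.

```python
def delete_keeping_backups(rds, identifier, final_snapshot_id):
    """Delete a DB instance but keep its automated backups and a final snapshot.

    `rds` is expected to be a Boto3 RDS client.
    """
    return rds.delete_db_instance(
        DBInstanceIdentifier=identifier,
        SkipFinalSnapshot=False,                      # take a final manual snapshot
        FinalDBSnapshotIdentifier=final_snapshot_id,  # ...with this name
        DeleteAutomatedBackups=False,                 # retain automated backups
    )


def list_retained_backups(rds):
    """Return (instance identifier, status) pairs for automated backups.

    Only the first page of results is read in this sketch.
    """
    resp = rds.describe_db_instance_automated_backups()
    return [(b["DBInstanceIdentifier"], b["Status"])
            for b in resp["DBInstanceAutomatedBackups"]]
```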
Manual snapshots
Database snapshots are manual (user-initiated) backups of your complete DB instance that serve as full backups. They're stored in Amazon S3 and are retained until you explicitly delete them. These snapshots can be copied to other Regions and shared with other accounts. Because DB snapshots include the entire DB instance, including data files and temporary files, the size of the instance affects the amount of time it takes to create the snapshot.
Creating a DB snapshot on a Single-AZ DB instance leads to a brief I/O suspension. The I/O suspension can last a few seconds or minutes depending on the instance size and class of your DB instance. Multi-AZ DB instances are not affected by the I/O suspension because the backup is taken from the standby.
See the Amazon RDS User Guide for instructions on Creating a DB Snapshot.
You can view snapshots in the Amazon RDS console. Click Snapshots, and then choose Manual Snapshots. Then choose a snapshot from the list.
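A manual snapshot can also be created and awaited through Boto3. The sketch below uses `create_db_snapshot` and the SDK's built-in `db_snapshot_available` waiter. The `<instance>-manual-<UTC timestamp>` naming scheme is an assumption, and the identifiers are placeholders.

```python
from datetime import datetime, timezone


def take_manual_snapshot(rds, instance_id, wait=True):
    """Create a manual DB snapshot and optionally wait until it is available.

    `rds` is expected to be a Boto3 RDS client.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    snapshot_id = f"{instance_id}-manual-{stamp}"  # hypothetical naming scheme
    rds.create_db_snapshot(DBSnapshotIdentifier=snapshot_id,
                           DBInstanceIdentifier=instance_id)
    if wait:
        # Polls describe_db_snapshots until the snapshot status is 'available'.
        rds.get_waiter("db_snapshot_available").wait(
            DBSnapshotIdentifier=snapshot_id)
    return snapshot_id
```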
Copying and sharing snapshots
In Amazon RDS, you can copy automated or manual DB snapshots. When you create a copy of a snapshot, that copy becomes a manual snapshot. You can copy a snapshot within the same AWS Region or across AWS Regions, and you can even copy a snapshot across AWS accounts. If you copy a DB snapshot to another AWS Region, you create a manual DB snapshot that is retained in that AWS Region. Copying a DB snapshot out of the source AWS Region incurs Amazon RDS data transfer charges.
You can't copy a snapshot directly into another AWS account. Instead, it is a two-step process: you first share the snapshot with the other account, and then copy it in that account. Because automated snapshots can't be shared, to share an automated DB snapshot you first create a manual DB snapshot by copying the automated snapshot, and then share that copy. This process also applies to AWS Backup-generated resources.
For detailed information on copying snapshots, including limitations, see Copying a Snapshot in the Amazon RDS User Guide.
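For a cross-Region copy with Boto3, you call `copy_db_snapshot` with a client created in the destination Region, identify the source snapshot by its ARN, and pass `SourceRegion` so that the SDK can generate the presigned URL the copy requires. For an encrypted snapshot, `KmsKeyId` must name a key in the destination Region. The helper name, ARN, Regions, and key ID are placeholder assumptions.

```python
def copy_snapshot_to_region(rds_dest, source_snapshot_arn, source_region,
                            target_snapshot_id, kms_key_id=None):
    """Copy a DB snapshot into the Region of `rds_dest` (a Boto3 RDS client).

    Copying an automated snapshot this way produces a manual snapshot.
    Returns the ARN of the new copy.
    """
    if not source_snapshot_arn.startswith("arn:"):
        raise ValueError("cross-Region copies need the source snapshot ARN")
    params = {
        "SourceDBSnapshotIdentifier": source_snapshot_arn,
        "TargetDBSnapshotIdentifier": target_snapshot_id,
        "SourceRegion": source_region,  # lets the SDK build the presigned URL
        "CopyTags": True,
    }
    if kms_key_id:
        params["KmsKeyId"] = kms_key_id  # key in the destination Region
    return rds_dest.copy_db_snapshot(**params)["DBSnapshot"]["DBSnapshotArn"]
```

For example, `copy_snapshot_to_region(boto3.client("rds", region_name="us-west-2"), source_arn, "us-east-1", "prod-db-dr-copy")` would place the copy in us-west-2.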
Amazon RDS cross-account snapshot sharing
Amazon RDS enables you to share DB snapshots or cluster snapshots with other AWS accounts. Sharing snapshots with other highly secure accounts can be helpful if you are concerned about a “bad actor” disrupting operations in your production accounts. You can share manual DB snapshots with up to 20 AWS accounts.
- Automated Amazon RDS snapshots cannot be shared directly with other AWS accounts. To share an automated snapshot, you first make a copy of the snapshot, which turns it into a manual version. Then you share the copy with the other account.
- Manual snapshots of DB instances that use custom option groups with persistent or permanent options, such as Transparent Data Encryption (TDE) and time zone, cannot be shared.
- Snapshots that use the default Amazon RDS encryption key (aws/rds) cannot be shared directly. Instead, you first copy the snapshot and choose a custom encryption key for the copy, and then you share the copied snapshot and grant the other account access to that key.
For detailed instructions on sharing snapshots across accounts, see Sharing a DB Snapshot in the Amazon RDS User Guide.
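Sharing is done by adding account IDs to the snapshot's `restore` attribute. The sketch below, assuming Boto3, reads the current value with `describe_db_snapshot_attributes`, enforces the 20-account limit mentioned above, and then calls `modify_db_snapshot_attribute`. The helper name and account IDs are placeholders. The snapshot is assumed to already be a manual snapshot and, if encrypted, to use a custom key whose key policy grants the target accounts access.

```python
import re

MAX_SHARED_ACCOUNTS = 20  # per-snapshot limit described in this post


def share_snapshot(rds, snapshot_id, account_ids):
    """Allow other AWS accounts to copy or restore a manual DB snapshot.

    `rds` is expected to be a Boto3 RDS client.
    """
    for acct in account_ids:
        if not re.fullmatch(r"\d{12}", acct):
            raise ValueError(f"not a 12-digit AWS account ID: {acct!r}")
    attrs = rds.describe_db_snapshot_attributes(DBSnapshotIdentifier=snapshot_id)
    current = set()
    for attr in attrs["DBSnapshotAttributesResult"]["DBSnapshotAttributes"]:
        if attr["AttributeName"] == "restore":
            current.update(attr["AttributeValues"])
    if len(current | set(account_ids)) > MAX_SHARED_ACCOUNTS:
        raise ValueError("a snapshot can be shared with at most 20 accounts")
    return rds.modify_db_snapshot_attribute(
        DBSnapshotIdentifier=snapshot_id,
        AttributeName="restore",
        ValuesToAdd=list(account_ids),
    )
```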
Restoring from a DB snapshot
If a disaster occurs, you can create a new DB instance by restoring from a DB snapshot. When you restore the DB instance, you choose the name of the DB snapshot from which you want to restore. Then, you provide a name for the new DB instance that is created. Here are a few things to note about the restoration process:
- You cannot restore from a DB snapshot to an existing DB instance. Instead, you create a new DB instance when you restore. If you want to use the same name as the existing DB instance, you must first delete or rename the existing one.
- While it’s possible to restore a DB snapshot to a DB instance with a different storage type than the source DB instance, the restoration process is slower. There is additional work required to migrate the data to a new storage type.
- You can’t restore a DB instance from a shared DB snapshot that is encrypted. Instead, you make a copy of the DB snapshot and then restore the DB instance from the copy.
- It’s a good practice to retain the parameter group of any DB snapshots that you create. This enables you to restore the DB instance with the correct parameter group.
- When you restore from a DB snapshot, by default the option group that is associated with the DB snapshot is associated with the restored DB instance. You can associate a different option group with a restored DB instance. However, the new option group must contain any persistent or permanent options that were included in the original option group.
For detailed instructions, see Restoring from a DB Snapshot in the Amazon RDS User Guide.
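With Boto3, `restore_db_instance_from_db_snapshot` creates the new instance, and the `db_instance_available` waiter blocks until it is ready. Passing `DBParameterGroupName` and `OptionGroupName` explicitly follows the parameter group and option group advice above. The helper name and all resource names are placeholder assumptions.

```python
def restore_from_snapshot(rds, snapshot_id, new_instance_id,
                          parameter_group, option_group=None,
                          instance_class=None):
    """Create a new DB instance from a DB snapshot and wait until it is available.

    `rds` is expected to be a Boto3 RDS client. `new_instance_id` must not
    name an existing instance, because a restore never overwrites one.
    """
    params = {
        "DBInstanceIdentifier": new_instance_id,
        "DBSnapshotIdentifier": snapshot_id,
        "DBParameterGroupName": parameter_group,
    }
    if option_group:
        # Must still contain any persistent or permanent options of the original.
        params["OptionGroupName"] = option_group
    if instance_class:
        params["DBInstanceClass"] = instance_class
    resp = rds.restore_db_instance_from_db_snapshot(**params)
    rds.get_waiter("db_instance_available").wait(
        DBInstanceIdentifier=new_instance_id)
    return resp["DBInstance"]["DBInstanceIdentifier"]
```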
As discussed, when you perform a cross-Region restore of a DB snapshot, first you copy the snapshot to the desired Region. Then, you can restore the DB snapshot to a new DB instance.
Integrating with AWS Backup
Amazon RDS DB snapshots can be integrated with AWS Backup. AWS Backup is a fully managed backup service that you can use to centralize and automate the backup of data across AWS services in the cloud and on premises. Using AWS Backup, you can centrally configure backup policies and monitor backup activity for your AWS resources. For more information, see AWS Backup.
Read Replicas
Amazon RDS for MariaDB, MySQL, PostgreSQL, and Oracle support the ability to create Read Replicas of a source database. When you create a Read Replica, Amazon RDS first takes a snapshot of the source DB instance, and then creates a read-only instance. Amazon RDS then uses the asynchronous replication method of the DB engine to update the Read Replica whenever there is a change made on the source DB instance. The Read Replica operates as a DB instance that allows only read-only connections. Applications can connect to a Read Replica the same way they do to any DB instance. Amazon RDS replicates all objects in the source DB instance. By default, a Read Replica is created with the same instance and storage type as the source DB instance. However, you can create a Read Replica that has a different storage type from the source DB instance. You can create up to five Read Replicas per source DB instance.
In addition to using Read Replicas to reduce the load on your source DB instance, you can also use Read Replicas to implement a DR solution for your production DB environment. If the source DB instance fails, you can promote your Read Replica to a standalone DB instance. Read Replicas can also be created in a different Region than the source database. Using a cross-Region Read Replica can help ensure that you get back up and running if you experience a regional availability issue.
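Creating a Read Replica through Boto3 uses `create_db_instance_read_replica`. For a cross-Region replica, the sketch assumes a client created in the destination Region, identifies the source by its ARN, and passes `SourceRegion` so that the SDK can generate the presigned URL. For a same-Region replica, the source instance identifier is enough. The helper name, identifiers, and instance class are placeholders.

```python
def create_replica(rds_dest, source, replica_id, source_region=None,
                   instance_class=None):
    """Create a Read Replica of `source` in the Region of `rds_dest`.

    `rds_dest` is expected to be a Boto3 RDS client. For a cross-Region
    replica, pass `source_region` and give `source` as an ARN.
    """
    params = {"DBInstanceIdentifier": replica_id,
              "SourceDBInstanceIdentifier": source}
    if source_region:
        if not source.startswith("arn:"):
            raise ValueError("cross-Region replicas need the source instance ARN")
        params["SourceRegion"] = source_region
    if instance_class:
        # By default the replica uses the source instance's class.
        params["DBInstanceClass"] = instance_class
    return rds_dest.create_db_instance_read_replica(**params)["DBInstance"]
```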
An important metric to monitor with a Read Replica is replica lag, which is the amount of time that the replica is behind the source database. Replica lag directly affects your recovery: if the source fails and you promote the replica, changes that the replica hasn't yet received are lost, so sustained lag effectively raises your RPO. Replica lag can vary based on the network latency between the source and destination Regions, and it can also be affected by the amount of traffic that is being replicated. Because a Read Replica is an already running DB instance, the time required to recover after a disaster is lower. However, using Read Replicas in this way is generally a more expensive option than using automated backups or database snapshots.
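Amazon RDS publishes replica lag to Amazon CloudWatch as the `ReplicaLag` metric (in seconds) in the `AWS/RDS` namespace. The sketch below, assuming a Boto3 CloudWatch client, reads the maximum lag over a recent period with `get_metric_statistics` and compares it with an RPO. The one-hour lookback, the five-minute period, and both helper names are assumptions.

```python
from datetime import datetime, timedelta, timezone


def max_replica_lag(cloudwatch, replica_id, lookback=timedelta(hours=1)):
    """Return the highest ReplicaLag (seconds) seen for `replica_id`, or None.

    `cloudwatch` is expected to be a Boto3 CloudWatch client.
    """
    end = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="ReplicaLag",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": replica_id}],
        StartTime=end - lookback,
        EndTime=end,
        Period=300,              # 5-minute buckets
        Statistics=["Maximum"],
    )
    points = resp["Datapoints"]
    return max(p["Maximum"] for p in points) if points else None


def lag_within_rpo(lag_seconds, rpo):
    """True if the observed lag keeps potential data loss within the RPO."""
    return lag_seconds is not None and lag_seconds <= rpo.total_seconds()
```

Treating missing data as a failure (`None` returns False) is a deliberate choice here: a replica that reports no lag metric may not be replicating at all.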
For instructions on creating a Read Replica or cross-Region Read Replica, see Working with Read Replicas in the Amazon RDS User Guide.
Promoting a Read Replica
Unlike an Amazon RDS Multi-AZ configuration, failover to a Read Replica is not an automated process. If you are using cross-Region Read Replicas, you should be certain that you want to switch your AWS resources between Regions. Cross-Region traffic can experience latency, and reconfiguring applications can be complicated.
For instructions, see Promoting a Read Replica in the Amazon RDS User Guide.
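Promotion itself is a single API call, `promote_read_replica`, which in the sketch below also sets a backup retention period on the new standalone instance (7 days is an assumed value). Because the replica may still report `available` immediately after the call, the sketch polls `describe_db_instances` and treats promotion as finished once the instance no longer reports a `ReadReplicaSourceDBInstanceIdentifier` and its status is `available`. That completion check is an assumption, and redirecting applications to the returned endpoint is left out.

```python
import time


def promote_replica(rds, replica_id, retention_days=7,
                    poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Promote a Read Replica to a standalone DB instance and wait for it.

    `rds` is expected to be a Boto3 RDS client in the replica's Region.
    Returns the endpoint address that applications should switch to.
    """
    rds.promote_read_replica(DBInstanceIdentifier=replica_id,
                             BackupRetentionPeriod=retention_days)
    deadline = time.monotonic() + timeout_seconds
    while True:
        db = rds.describe_db_instances(
            DBInstanceIdentifier=replica_id)["DBInstances"][0]
        # Assumed completion check: no longer linked to a source, and available.
        standalone = not db.get("ReadReplicaSourceDBInstanceIdentifier")
        if standalone and db["DBInstanceStatus"] == "available":
            return db["Endpoint"]["Address"]
        if time.monotonic() > deadline:
            raise TimeoutError(f"{replica_id} was not promoted in time")
        sleep(poll_seconds)
```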
After you promote a cross-Region Read Replica to be a standalone instance, if you want to later switch back to the original Region, you must create a new Read Replica. Unlike an Amazon RDS Multi-AZ configuration, this is not done for you automatically.
Testing your DR plan
A DR plan is helpful only if it's periodically tested and validated. Testing your DR plan helps you to identify potential issues or gaps so you can take corrective action. A full DR plan includes not only your database resources, but all of your application infrastructure. While a full DR plan test can take a significant amount of time and resources, it gives you confidence that the plan will work when needed.
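One way to make the database part of a DR test routine is to time a restore drill and compare it with your RTO. The sketch below, assuming Boto3, restores the source instance to its latest restorable time into a temporary instance with `UseLatestRestorableTime=True`, waits for it (the waiter is allowed roughly 2 hours, an assumed limit), records the elapsed time, and then deletes the temporary instance. The `-dr-drill` suffix and skipping the final snapshot are assumptions suited to a throwaway copy. The result covers only database recovery, not the rest of your application infrastructure.

```python
import time
from datetime import timedelta


def run_restore_drill(rds, source_id, rto, clock=time.monotonic):
    """Restore `source_id` into a throwaway instance and check it against `rto`.

    `rds` is expected to be a Boto3 RDS client. Returns (elapsed, rto_met).
    """
    drill_id = f"{source_id}-dr-drill"  # hypothetical name for the test copy
    start = clock()
    rds.restore_db_instance_to_point_in_time(
        SourceDBInstanceIdentifier=source_id,
        TargetDBInstanceIdentifier=drill_id,
        UseLatestRestorableTime=True,
    )
    rds.get_waiter("db_instance_available").wait(
        DBInstanceIdentifier=drill_id,
        WaiterConfig={"Delay": 30, "MaxAttempts": 240},  # about 2 hours at most
    )
    elapsed = timedelta(seconds=clock() - start)
    # Application-level checks (row counts, smoke queries) would go here.
    rds.delete_db_instance(DBInstanceIdentifier=drill_id,
                           SkipFinalSnapshot=True,
                           DeleteAutomatedBackups=True)
    return elapsed, elapsed <= rto
```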
Conclusion
In this post, I have shared some best practices for implementing a DR strategy using Amazon RDS. This post provides a basic framework that you can implement on Amazon RDS for DR using automated backups, manual backups, and Read Replicas. I have also highlighted some key concepts, including point-in-time recovery, Amazon RDS cross-account snapshot sharing, and creating cross-Region Read Replicas.
If you have questions or comments about this blog post, use the comments section to post your thoughts.
About the author
Anuraag Deekonda is an Associate Consultant with the AWS Professional Services team. He works with customers to build scalable, highly available, and secure solutions in the AWS Cloud. His focus area is homogeneous and heterogeneous migrations of on-premises databases to Amazon RDS and Aurora PostgreSQL.