AWS Big Data Blog

Implement disaster recovery with Amazon Redshift

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. This enables you to use your data to acquire new insights for your business and customers.

The objective of a disaster recovery plan is to reduce disruption by enabling quick recovery in the event of a disaster that leads to system failure. Disaster recovery plans also help organizations meet regulatory compliance requirements and provide a clear roadmap to recovery.

This post outlines proactive steps you can take to mitigate the risks associated with unexpected disruptions and make sure your organization is better prepared to respond and recover Amazon Redshift in the event of a disaster. With built-in features such as automated snapshots and cross-Region replication, you can enhance your disaster resilience with Amazon Redshift.

Disaster recovery planning

Any kind of disaster recovery planning has two key components:

  • Recovery Point Objective (RPO) – RPO is the maximum acceptable amount of time since the last data recovery point. This determines what is considered an acceptable loss of data between the last recovery point and the interruption of service.
  • Recovery Time Objective (RTO) – RTO is the maximum acceptable delay between the interruption of service and restoration of service. This determines what is considered an acceptable time window when service is unavailable.

To develop your disaster recovery plan, you should complete the following tasks:

  • Define your recovery objectives for downtime and data loss (RTO and RPO) for data and metadata. Make sure your business stakeholders are engaged in deciding appropriate goals.
  • Identify recovery strategies to meet the recovery objectives.
  • Define a fallback plan to return production to the original setup.
  • Test out the disaster recovery plan by simulating a failover event in a non-production environment.
  • Develop a communication plan to notify stakeholders of downtime and its impact on the business.
  • Develop a communication plan for progress updates and for announcing recovery and restored availability.
  • Document the entire disaster recovery process.

Disaster recovery strategies

Amazon Redshift is a cloud-based data warehouse that supports many recovery capabilities out of the box to address unforeseen outages and minimize downtime.

Amazon Redshift RA3 instance types and Redshift Serverless store their data in Redshift Managed Storage (RMS), which is backed by Amazon Simple Storage Service (Amazon S3) and is therefore highly available and durable by default.

In the following sections, we discuss the various failure modes and associated recovery strategies.

Using backups

Backing up data is an important part of data management. Backups protect against human error, hardware failure, virus attacks, power outages, and natural disasters.

Amazon Redshift supports two kinds of snapshots, automated and manual, both of which can be used to recover data. Snapshots are point-in-time backups of the Redshift data warehouse. Amazon Redshift stores these snapshots internally in RMS by using an encrypted Secure Sockets Layer (SSL) connection.

Redshift provisioned clusters offer automated snapshots, which are taken automatically with a default retention of 1 day that can be extended to up to 35 days. A snapshot is taken approximately every 8 hours or after every 5 GB per node of data changes (that is, once the total data change exceeds 5 GB times the number of nodes in the cluster), whichever comes first, and the minimum time interval between two snapshots is 15 minutes. You can also set a custom snapshot schedule with frequencies between 1–24 hours. You can use the AWS Management Console or the ModifyCluster API to manage how long your automated backups are retained by modifying the retention period. If you want to turn off automated backups altogether, you can set the retention period to 0 (not recommended). For additional details, refer to Automated snapshots.
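
The following boto3 sketch shows how you might adjust this setting, assuming a hypothetical provisioned cluster named my-redshift-cluster; in the ModifyCluster API the retention period corresponds to the AutomatedSnapshotRetentionPeriod parameter.

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-2")

# Extend the automated snapshot retention period from the 1-day default
# to the 35-day maximum. Setting it to 0 turns off automated snapshots,
# which is not recommended.
redshift.modify_cluster(
    ClusterIdentifier="my-redshift-cluster",  # hypothetical cluster name
    AutomatedSnapshotRetentionPeriod=35,
)
```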

Amazon Redshift Serverless automatically creates recovery points approximately every 30 minutes. These recovery points have a default retention of 24 hours, after which they get automatically deleted. You do have the option to convert a recovery point into a snapshot if you want to retain it longer than 24 hours.
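
If you want to keep a recovery point beyond 24 hours, you can convert it into a snapshot. The following boto3 sketch illustrates one way to do this, assuming a hypothetical namespace named my-serverless-namespace and the redshift-serverless API field names shown in the comments.

```python
import boto3

serverless = boto3.client("redshift-serverless", region_name="us-east-2")

# Find the most recent recovery point for the namespace (field names assume
# the redshift-serverless API's lowerCamelCase response keys).
recovery_points = serverless.list_recovery_points(
    namespaceName="my-serverless-namespace"
)["recoveryPoints"]
latest = max(recovery_points, key=lambda rp: rp["recoveryPointCreateTime"])

# Convert it into a snapshot retained for 30 days.
serverless.convert_recovery_point_to_snapshot(
    recoveryPointId=latest["recoveryPointId"],
    snapshotName="my-serverless-snapshot",
    retentionPeriod=30,
)
```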

Both Amazon Redshift provisioned and serverless data warehouses offer manual snapshots, which can be taken on demand and retained indefinitely. Manual snapshots let you retain snapshots longer than automated snapshots to meet your compliance needs. Manual snapshots accrue storage charges, so it’s important that you delete them when you no longer need them. For additional details, refer to Manual snapshots.
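
As a minimal sketch of taking a manual snapshot with boto3 (the cluster and snapshot names are hypothetical), you can also set a manual retention period so the snapshot is cleaned up automatically once your compliance window has passed.

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-2")

# Take an on-demand manual snapshot and keep it for 365 days.
# Omit ManualSnapshotRetentionPeriod (or pass -1) to retain it indefinitely.
redshift.create_cluster_snapshot(
    SnapshotIdentifier="my-redshift-cluster-manual-2024-01-01",
    ClusterIdentifier="my-redshift-cluster",
    ManualSnapshotRetentionPeriod=365,
)
```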

Amazon Redshift integrates with AWS Backup to help you centralize and automate data protection across all your AWS services, in the cloud, and on premises. With AWS Backup for Amazon Redshift, you can configure data protection policies and monitor activity for different Redshift provisioned clusters in one place. You can create and store manual snapshots for Redshift provisioned clusters. This lets you automate and consolidate backup tasks that you had to do separately before, without any manual processes. To learn more about setting up AWS Backup for Amazon Redshift, refer to Amazon Redshift backups. As of this writing, AWS Backup does not integrate with Redshift Serverless.

Node failure

A Redshift data warehouse is a collection of computing resources called nodes.
Amazon Redshift will automatically detect and replace a failed node in your data warehouse cluster. Amazon Redshift makes your replacement node available immediately and loads your most frequently accessed data from Amazon S3 first to allow you to resume querying your data as quickly as possible.

If this is a single-node cluster (which is not recommended for production use), there is only one copy of the data in the cluster. If that node goes down, Amazon Redshift needs to restore the cluster from the most recent snapshot on Amazon S3, and that snapshot becomes your RPO.

We recommend using at least two nodes for production.

Cluster failure

Each cluster has a leader node and one or more compute nodes. In the event of a cluster failure, you must restore the cluster from a snapshot. Snapshots are point-in-time backups of a cluster. A snapshot contains data from all databases that are running on your cluster. It also contains information about your cluster, including the number of nodes, node type, and admin user name. If you restore your cluster from a snapshot, Amazon Redshift uses the cluster information to create a new cluster. Then it restores all the databases from the snapshot data. Note that the new cluster is available before all of the data is loaded, so you can begin querying the new cluster in minutes. The cluster is restored in the same AWS Region and a random, system-chosen Availability Zone, unless you specify another Availability Zone in your request.
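
The following boto3 sketch restores a new cluster from the most recent snapshot of a failed cluster and waits for it to become available; the cluster identifiers are hypothetical.

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-2")

# Find the most recent snapshot of the failed cluster.
snapshots = redshift.describe_cluster_snapshots(
    ClusterIdentifier="my-redshift-cluster"
)["Snapshots"]
latest = max(snapshots, key=lambda s: s["SnapshotCreateTime"])

# Restore a new cluster from that snapshot. Node type and node count are
# taken from the snapshot unless you override them.
redshift.restore_from_cluster_snapshot(
    ClusterIdentifier="my-redshift-cluster-restored",
    SnapshotIdentifier=latest["SnapshotIdentifier"],
)

# Wait until the new cluster is available before repointing clients.
redshift.get_waiter("cluster_available").wait(
    ClusterIdentifier="my-redshift-cluster-restored"
)
```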

Availability Zone failure

A Region is a physical location around the world where data centers are located. An Availability Zone is one or more discrete data centers with redundant power, networking, and connectivity in a Region. Availability Zones enable you to operate production applications and databases that are more highly available, fault tolerant, and scalable than would be possible from a single data center. All Availability Zones in a Region are interconnected with high-bandwidth, low-latency networking over fully redundant, dedicated metro fiber.

To recover from Availability Zone failures, you can use one of the following approaches:

  • Relocation capabilities (active-passive) – If your Redshift data warehouse is a single-AZ deployment and the cluster’s Availability Zone becomes unavailable, Amazon Redshift will automatically move your cluster to another Availability Zone without any data loss or application changes. To use this capability, you must enable cluster relocation for your provisioned cluster in its configuration settings; it is enabled automatically for Redshift Serverless (see the sketch following this list). Cluster relocation is free of cost, but it is a best-effort approach subject to resource availability in the target Availability Zone, and RTO can be affected by other issues related to starting up a new cluster, resulting in recovery times between 10–60 minutes. To learn more about configuring Amazon Redshift relocation capabilities, refer to Build a resilient Amazon Redshift architecture with automatic recovery enabled.
  • Amazon Redshift Multi-AZ (active-active) – A Multi-AZ deployment allows you to run your data warehouse in multiple Availability Zones simultaneously and continue operating in unforeseen failure scenarios. No application changes are required to maintain business continuity because the Multi-AZ deployment is managed as a single data warehouse with one endpoint. Multi-AZ deployments reduce recovery time by guaranteeing capacity to automatically recover and are intended for customers with mission-critical analytics applications that require the highest levels of availability and resiliency to Availability Zone failures. This also allows you to implement a solution that is more compliant with the recommendations of the Reliability Pillar of the AWS Well-Architected Framework. Our pre-launch tests found that the RTO with Amazon Redshift Multi-AZ deployments is 60 seconds or less in the unlikely case of an Availability Zone failure. To learn more about configuring Multi-AZ, refer to Enable Multi-AZ deployments for your Amazon Redshift data warehouse. As of this writing, Redshift Serverless does not support Multi-AZ.
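
As a minimal sketch of the relocation setting (the cluster identifier is hypothetical), the following boto3 call enables cluster relocation on a provisioned cluster; Redshift Serverless has it enabled automatically.

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-2")

# Enable cluster relocation (active-passive) so Amazon Redshift can move
# the cluster to another Availability Zone if its current one fails.
redshift.modify_cluster(
    ClusterIdentifier="my-redshift-cluster",  # hypothetical cluster name
    AvailabilityZoneRelocation=True,
)
```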

Region failure

Amazon Redshift currently supports single-Region deployments for clusters. However, you have several options to help with disaster recovery or accessing data across multi-Region scenarios.

Use a cross-Region snapshot

You can configure Amazon Redshift to copy snapshots for a cluster to another Region. To configure cross-Region snapshot copy, you need to enable this copy feature for each data warehouse (serverless and provisioned) and configure where to copy snapshots and how long to keep copied automated or manual snapshots in the destination Region. When cross-Region copy is enabled for a data warehouse, all new manual and automated snapshots are copied to the specified Region. In the event of a Region failure, you can restore your Redshift data warehouse in a new Region using the latest cross-Region snapshot.
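
A minimal boto3 sketch for enabling cross-Region snapshot copy on a provisioned cluster might look like the following; the cluster name, Regions, and retention values are hypothetical, and encrypted clusters additionally require a snapshot copy grant.

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-2")

# Copy all new automated and manual snapshots to a secondary Region,
# keeping automated copies for 7 days and manual copies for 30 days there.
redshift.enable_snapshot_copy(
    ClusterIdentifier="my-redshift-cluster",
    DestinationRegion="us-west-2",
    RetentionPeriod=7,                 # automated snapshot copies
    ManualSnapshotRetentionPeriod=30,  # manual snapshot copies
)
```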


For more information about how to enable cross-Region snapshots, refer to the Amazon Redshift documentation.

Use a custom domain name

A custom domain name is easier to remember and use than the default endpoint URL provided by Amazon Redshift. With a CNAME record, you can quickly route traffic to a new cluster or workgroup created from a snapshot in a failover situation. When a disaster happens, connections can be rerouted centrally with minimal disruption, without clients having to change their configuration.

For high availability, you should have a warm-standby cluster or workgroup available that regularly receives restored data from the primary cluster. This backup data warehouse could be in another Availability Zone or in a separate Region. You can redirect clients to the secondary Redshift cluster by setting up a custom domain name in the unlikely scenario of an entire Region failure.

In the following sections, we discuss how to use a custom domain name to handle Region failure in Amazon Redshift. Make sure the following prerequisites are met:

  • You need a registered domain name. You can use Amazon Route 53 or a third-party domain registrar to register a domain.
  • You need to configure cross-Region snapshots for your Redshift cluster or workgroup.
  • Turn on cluster relocation for your Redshift cluster. Use the AWS Command Line Interface (AWS CLI) to turn on relocation for a Redshift provisioned cluster. For Redshift Serverless, this is automatically enabled. For more information, see Relocating your cluster.
  • Take note of your Redshift endpoint. You can locate the endpoint by navigating to your Redshift workgroup or provisioned cluster name on the Amazon Redshift console.

Set up a custom domain with Amazon Redshift in the primary Region

In the hosted zone that Route 53 created when you registered the domain, create records to tell Route 53 how you want to route traffic to the Redshift endpoint by completing the following steps:

  1. On the Route 53 console, choose Hosted zones in the navigation pane.
  2. Choose your hosted zone.
  3. On the Records tab, choose Create record.
  4. For Record name, enter your preferred subdomain name.
  5. For Record type, choose CNAME.
  6. For Value, enter the Redshift endpoint name, removing the colon (:), port, and database. For example, redshift-provisioned.eabc123.us-east-2.redshift.amazonaws.com.
  7. Choose Create records.

  8. Use the CNAME record name to create a custom domain in Amazon Redshift, as shown in the sketch below. For instructions, see Use custom domain names with Amazon Redshift.
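
The following boto3 sketch automates both parts: creating the CNAME record in Route 53 and associating the custom domain with the cluster. The hosted zone ID, domain name, cluster identifier, and ACM certificate ARN are hypothetical, and the association assumes the CreateCustomDomainAssociation API.

```python
import boto3

HOSTED_ZONE_ID = "Z0123456789EXAMPLE"    # hypothetical hosted zone
CUSTOM_DOMAIN = "analytics.example.com"  # hypothetical custom domain
REDSHIFT_ENDPOINT = (
    "redshift-provisioned.eabc123.us-east-2.redshift.amazonaws.com"
)

# Create the CNAME record that points the custom domain at the Redshift
# endpoint (colon, port, and database removed from the value).
route53 = boto3.client("route53")
route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Changes": [{
            "Action": "CREATE",
            "ResourceRecordSet": {
                "Name": CUSTOM_DOMAIN,
                "Type": "CNAME",
                "TTL": 300,
                "ResourceRecords": [{"Value": REDSHIFT_ENDPOINT}],
            },
        }]
    },
)

# Associate the custom domain with the cluster. This requires an ACM
# certificate that covers the custom domain name.
redshift = boto3.client("redshift", region_name="us-east-2")
redshift.create_custom_domain_association(
    ClusterIdentifier="my-redshift-cluster",
    CustomDomainName=CUSTOM_DOMAIN,
    CustomDomainCertificateArn=(
        "arn:aws:acm:us-east-2:123456789012:certificate/abcd1234-example"
    ),
)
```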

You can now connect to your cluster using the custom domain name. The JDBC URL will be similar to jdbc:redshift://prefix.rootdomain.com:5439/dev?sslmode=verify-full, where prefix.rootdomain.com is your custom domain name and dev is the default database. Use your preferred SQL client to connect to this URL with your user name and password.

Steps to handle a Regional failure

In the unlikely situation of a Regional failure, complete the following steps:

  1. Use a cross-Region snapshot to restore a Redshift cluster or workgroup in your secondary Region (steps 1 and 2 are sketched in code after this list).
  2. Turn on cluster relocation for your Redshift cluster in the secondary Region. Use the AWS CLI to turn on relocation for a Redshift provisioned cluster.
  3. Use the CNAME record name from the Route 53 hosted zone setup to create a custom domain in the newly created Redshift cluster or workgroup.
  4. Take note of the endpoint of the newly created Redshift cluster or workgroup.
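
A sketch of steps 1 and 2 with boto3 might look like the following; the cluster identifiers and Regions are hypothetical, and it assumes the copied snapshots in the secondary Region retain the source cluster identifier.

```python
import boto3

dr_redshift = boto3.client("redshift", region_name="us-west-2")

# Step 1: restore a new cluster in the secondary Region from the latest
# cross-Region snapshot copy of the primary cluster.
snapshots = dr_redshift.describe_cluster_snapshots()["Snapshots"]
candidates = [
    s for s in snapshots if s["ClusterIdentifier"] == "my-redshift-cluster"
]
latest = max(candidates, key=lambda s: s["SnapshotCreateTime"])

dr_redshift.restore_from_cluster_snapshot(
    ClusterIdentifier="my-redshift-cluster-dr",
    SnapshotIdentifier=latest["SnapshotIdentifier"],
)
dr_redshift.get_waiter("cluster_available").wait(
    ClusterIdentifier="my-redshift-cluster-dr"
)

# Step 2: turn on cluster relocation for the newly restored cluster.
dr_redshift.modify_cluster(
    ClusterIdentifier="my-redshift-cluster-dr",
    AvailabilityZoneRelocation=True,
)
```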

Next, you need to update the Redshift endpoint in Route 53 to achieve seamless connectivity.

  1. On the Route 53 console, choose Hosted zones in the navigation pane.
  2. Choose your hosted zone.
  3. On the Records tab, select the CNAME record you created.
  4. Under Record details, choose Edit record.
  5. Change the value to the newly created Redshift endpoint, removing the colon (:), port, and database, as shown in the sketch following this list. For example, redshift-provisioned.eabc567.us-west-2.redshift.amazonaws.com.
  6. Choose Save.
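
The record update can also be scripted, for example with the following boto3 sketch (the hosted zone ID, domain name, and endpoint are hypothetical); UPSERT overwrites the existing CNAME value in place.

```python
import boto3

route53 = boto3.client("route53")

# Repoint the existing CNAME at the endpoint of the cluster restored in the
# secondary Region (colon, port, and database removed from the value).
route53.change_resource_record_sets(
    HostedZoneId="Z0123456789EXAMPLE",
    ChangeBatch={
        "Comment": "Fail over Redshift custom domain to the secondary Region",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "analytics.example.com",
                "Type": "CNAME",
                "TTL": 300,
                "ResourceRecords": [{
                    "Value": "redshift-provisioned.eabc567.us-west-2.redshift.amazonaws.com"
                }],
            },
        }],
    },
)
```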

Now when you connect to your custom domain name using the same JDBC URL from your application, you should be connected to your new cluster in your secondary Region.

Use active-active configuration

For business-critical applications that require high availability, you can set up an active-active configuration at the Region level. There are many ways to make sure all writes occur to all clusters; one way is to keep the data in sync between the two clusters by ingesting data concurrently into the primary and secondary cluster. You can also use Amazon Kinesis to sync the data between two clusters. For more details, see Building Multi-AZ or Multi-Region Amazon Redshift Clusters.
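
As a minimal sketch of the dual-ingestion approach, the same COPY statement can be submitted to both data warehouses through the Redshift Data API; the cluster identifiers, Regions, table, S3 path, and IAM role below are hypothetical.

```python
import boto3

# Active-active targets (cluster identifiers and Regions are hypothetical).
targets = {
    "my-redshift-primary": "us-east-2",
    "my-redshift-secondary": "us-west-2",
}

# The same COPY statement is submitted to both clusters so they stay in sync.
# If the S3 bucket is in a different Region than a cluster, add the COPY
# REGION option.
copy_sql = (
    "COPY sales.orders "
    "FROM 's3://my-ingest-bucket/orders/2024-01-01/' "
    "IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole' "
    "FORMAT AS PARQUET"
)

for cluster_id, region in targets.items():
    data_api = boto3.client("redshift-data", region_name=region)
    data_api.execute_statement(
        ClusterIdentifier=cluster_id,
        Database="dev",
        DbUser="awsuser",
        Sql=copy_sql,
    )
```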

Additional considerations

In this section, we discuss additional considerations for your disaster recovery strategy.

Amazon Redshift Spectrum

Amazon Redshift Spectrum is a feature of Amazon Redshift that allows you to run SQL queries against exabytes of data stored in Amazon S3. With Redshift Spectrum, you don’t have to load or extract the data from Amazon S3 into Amazon Redshift before querying.

If you’re using external tables with Redshift Spectrum, you need to make sure the external schemas and underlying data are configured and accessible from your secondary (failover) cluster.

You can set this up with the following steps:

  1. Replicate existing S3 objects between the primary and secondary Regions (a sketch of this step follows the list).
  2. Replicate data catalog objects between the primary and secondary Regions.
  3. Set up AWS Identity and Access Management (IAM) policies for accessing the S3 bucket residing in the secondary Region.
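
For step 1, one option is S3 cross-Region replication. The following boto3 sketch applies a replication rule to a hypothetical primary bucket; both buckets must have versioning enabled, and the replication role must have the required S3 replication permissions.

```python
import boto3

s3 = boto3.client("s3")

# Replicate Redshift Spectrum data from the primary bucket to a bucket in
# the secondary Region (bucket names and role ARN are hypothetical).
s3.put_bucket_replication(
    Bucket="spectrum-data-primary",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/S3ReplicationRole",
        "Rules": [{
            "ID": "replicate-spectrum-data",
            "Status": "Enabled",
            "Priority": 1,
            "Filter": {},
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": {"Bucket": "arn:aws:s3:::spectrum-data-secondary"},
        }],
    },
)
```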

Cross-Region data sharing

With Amazon Redshift data sharing, you can securely share read access to live data across Redshift clusters, workgroups, AWS accounts, and Regions without manually moving or copying the data.

If you’re using cross-Region data sharing and one of the Regions has an outage, you need to have a business continuity plan to fail over your producer and consumer clusters to minimize the disruption.

In the event of an outage affecting the Region where the producer cluster is deployed, you can take the following steps to create a new producer cluster in another Region from a cross-Region snapshot and reconfigure data sharing so that your system can continue operating:

  1. Create a new Redshift cluster using the cross-Region snapshot. Make sure it has the correct node type, node count, and security settings.
  2. Identify the Redshift data shares that were previously configured for the original producer cluster.
  3. Recreate these data shares on the new producer cluster in the target Region (see the sketch following this list).
  4. Update the data share configurations in the consumer cluster to point to the newly created producer cluster.
  5. Confirm that the necessary permissions and access controls are in place for the data shares in the consumer cluster.
  6. Verify that the new producer cluster is operational and the consumer cluster is able to access the shared data.
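
Steps 3 and 4 can be scripted through the Redshift Data API. The following sketch recreates a hypothetical datashare named sales_share on the new producer cluster and grants it to the consumer namespace; all identifiers and the namespace GUID are placeholders.

```python
import time
import boto3

data_api = boto3.client("redshift-data", region_name="us-west-2")

statements = [
    "CREATE DATASHARE sales_share",
    "ALTER DATASHARE sales_share ADD SCHEMA sales",
    "ALTER DATASHARE sales_share ADD ALL TABLES IN SCHEMA sales",
    # Grant the datashare to the consumer namespace (placeholder GUID).
    "GRANT USAGE ON DATASHARE sales_share "
    "TO NAMESPACE 'b1c2d3e4-5678-90ab-cdef-EXAMPLE11111'",
]

for sql in statements:
    stmt = data_api.execute_statement(
        ClusterIdentifier="my-redshift-cluster-dr",  # new producer cluster
        Database="dev",
        DbUser="awsuser",
        Sql=sql,
    )
    # Run the statements in order by waiting for each one to finish.
    while data_api.describe_statement(Id=stmt["Id"])["Status"] not in (
        "FINISHED", "FAILED", "ABORTED",
    ):
        time.sleep(1)
```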

In the event of an outage in the Region where the consumer cluster is deployed, you will need to create a new consumer cluster in a different Region. This makes sure all applications that are connecting to the consumer cluster continue to function as expected, with proper access.

The steps to accomplish this are as follows:

  1. Identify an alternate Region that is not affected by the outage.
  2. Provision a new consumer cluster in the alternate Region.
  3. Provide the necessary access to data sharing objects (see the sketch following this list).
  4. Update the application configurations to point to the new consumer cluster.
  5. Validate that all the applications are able to connect to the new consumer cluster and are functioning as expected.
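
For step 3, the new consumer cluster needs a local database created from the shared datashare. The following Data API sketch assumes the same hypothetical datashare and a placeholder producer namespace GUID.

```python
import time
import boto3

data_api = boto3.client("redshift-data", region_name="eu-west-1")

# Create a local database from the producer's datashare on the new consumer
# cluster (identifiers and namespace GUID are placeholders).
stmt = data_api.execute_statement(
    ClusterIdentifier="my-consumer-cluster-dr",
    Database="dev",
    DbUser="awsuser",
    Sql=(
        "CREATE DATABASE sales_db FROM DATASHARE sales_share "
        "OF NAMESPACE 'a1b2c3d4-5678-90ab-cdef-EXAMPLE22222'"
    ),
)
while data_api.describe_statement(Id=stmt["Id"])["Status"] not in (
    "FINISHED", "FAILED", "ABORTED",
):
    time.sleep(1)
```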

For additional information on how to configure data sharing, refer to Sharing datashares.

Federated queries

With federated queries in Amazon Redshift, you can query and analyze data across operational databases, data warehouses, and data lakes. If you’re using federated queries, you need to set up federated queries from the failover cluster as well to prevent any application failure.
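
As a sketch, the federated external schema can be recreated on the failover cluster with the Data API; the operational database endpoint, IAM role, and Secrets Manager ARN below are hypothetical.

```python
import time
import boto3

data_api = boto3.client("redshift-data", region_name="us-west-2")

# Recreate the federated external schema on the failover cluster so
# applications can keep querying the operational database.
stmt = data_api.execute_statement(
    ClusterIdentifier="my-redshift-cluster-dr",
    Database="dev",
    DbUser="awsuser",
    Sql="""
        CREATE EXTERNAL SCHEMA IF NOT EXISTS federated_pg
        FROM POSTGRES
        DATABASE 'appdb' SCHEMA 'public'
        URI 'app-db.abc123.us-west-2.rds.amazonaws.com' PORT 5432
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftFederatedRole'
        SECRET_ARN 'arn:aws:secretsmanager:us-west-2:123456789012:secret:appdb-creds'
    """,
)
while data_api.describe_statement(Id=stmt["Id"])["Status"] not in (
    "FINISHED", "FAILED", "ABORTED",
):
    time.sleep(1)
```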

Summary

In this post, we discussed various failure scenarios and recovery strategies associated with Amazon Redshift. Disaster recovery solutions make restoring your data and workloads seamless so you can get business operations back online quickly after a catastrophic event.

As an administrator, you can now work on defining your Amazon Redshift disaster recovery strategy and implement it to minimize business disruptions. You should develop a comprehensive plan that includes:

  • Identifying critical Redshift resources and data
  • Establishing backup and recovery procedures
  • Defining failover and failback processes
  • Enforcing data integrity and consistency
  • Implementing disaster recovery testing and drills

Try out these strategies for yourself, and leave any questions and feedback in the comments section.


About the authors

Nita Shah is a Senior Analytics Specialist Solutions Architect at AWS based out of New York. She has been building data warehouse solutions for over 20 years and specializes in Amazon Redshift. She is focused on helping customers design and build enterprise-scale well-architected analytics and decision support platforms.

Poulomi Dasgupta is a Senior Analytics Solutions Architect with AWS. She is passionate about helping customers build cloud-based analytics solutions to solve their business problems. Outside of work, she likes travelling and spending time with her family.

Ranjan Burman is an Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and helps customers build scalable analytical solutions. He has more than 16 years of experience in different database and data warehousing technologies. He is passionate about automating and solving customer problems with cloud solutions.

Jason Pedreza is a Senior Redshift Specialist Solutions Architect at AWS with data warehousing experience handling petabytes of data. Prior to AWS, he built data warehouse solutions at Amazon.com and Amazon Devices. He specializes in Amazon Redshift and helps customers build scalable analytic solutions.

Agasthi Kothurkar is an AWS Solutions Architect, and is based in Boston. Agasthi works with enterprise customers as they transform their business by adopting the Cloud. Prior to joining AWS, he worked with leading IT consulting organizations on customer engagements spanning Cloud Architecture, Enterprise Architecture, IT Strategy, and Transformation. He is passionate about applying Cloud technologies to resolve complex real-world business problems.