Disaster Recovery (DR) Architecture on AWS, Part III: Pilot Light and Warm Standby

In this blog post, you will learn about two more active/passive strategies that enable your workload to recover from disaster events such as natural disasters, technical failures, or human actions. Previously, I introduced you to four strategies for disaster recovery (DR) on AWS. Then we explored the backup and restore strategy. Now let’s learn about the pilot light and warm standby strategies.

DR strategies: Pilot light or warm standby

When selecting your DR strategy, you must weigh the benefits of a lower recovery time objective (RTO) and recovery point objective (RPO) against the cost of implementing and operating that strategy. The pilot light and warm standby strategies both offer a good balance of benefits and cost, as shown in Figure 1.

Figure 1. DR strategies

Implementing pilot light or warm standby

Figures 2 and 3 show how to implement the pilot light and warm standby strategies, respectively. These are both active/passive strategies (see the “Active/passive and active/active DR strategies” section in my previous post). The left AWS Region is the primary Region that is active, and the right Region is the recovery Region that is passive before failover.

Figure 2. Pilot light DR strategy

Figure 3. Warm standby DR strategy

Similarities between these two DR strategies

Both strategies replicate data from the primary Region to data resources in the recovery Region, such as Amazon Relational Database Service (Amazon RDS) DB instances or Amazon DynamoDB tables. These data resources in the recovery Region are then ready to serve requests. In addition to replication, both strategies require you to create continuous backups in the recovery Region. This is because disasters caused by human action can delete or corrupt data, and replication will copy the bad data to the recovery Region. Backups enable you to get back to the last known good state.

Resources used for the workload infrastructure are deployed in the recovery Region for both strategies. This includes support infrastructure such as Amazon Virtual Private Cloud (Amazon VPC) with subnets and routing configured, Elastic Load Balancing, and Amazon EC2 Auto Scaling groups. For both strategies, the deployed infrastructure will require additional actions to become production ready. However, the extent of workload infrastructure readiness differs between the two strategies, as detailed in the next section.

As with all active/passive strategies, both require a means to route traffic to the primary Region, and then to fail over to the recovery Region when recovering from a disaster.

RPO for these strategies is similar, since they share a common data strategy.

Differences between these two DR strategies

The primary difference between the two strategies is infrastructure deployment and readiness. The warm standby strategy deploys a functional stack, but at reduced capacity. The recovery Region endpoint can handle requests, but cannot handle production levels of traffic. This is shown as one Amazon Elastic Compute Cloud (Amazon EC2) instance per tier in Figure 3. There may be more instances, but to save cost there are always fewer than in the full production deployment. If the passive stack is instead deployed to the recovery Region at full capacity, the strategy is known as “hot standby.” Because warm standby deploys a functional stack to the recovery Region, it is easier to test recovery readiness using synthetic transactions.

A pilot light in a home furnace does not provide heat to the home. It provides a quick way to light the furnace burners that then provide heat. Similarly, the recovery Region in a pilot light strategy (unlike warm standby) cannot serve requests until additional steps are taken. Figure 2 shows an EC2 Auto Scaling group that is configured, but it has no deployed EC2 instances.

RTO for these strategies is different. Warm standby can handle traffic at reduced levels immediately, and then requires you only to scale out the existing deployment. This gives it a lower RTO than pilot light, which requires you to first deploy infrastructure and then scale out resources before the workload can handle requests.

Recovery with pilot light or warm standby

When a disaster occurs, successful recovery depends on detection of the disaster event, restoration of the workload in the recovery Region, and failover to send traffic to the recovery Region.

1. Detect

In a previous blog post, I showed how quick detection is essential for low RTO, and I shared a serverless architecture to achieve this. It relies in part on Amazon CloudWatch alarms that enable you to determine your workload health based on metrics such as:

Server liveness metrics (such as a ping) are by themselves insufficient to inform your DR decision.
Service API metrics such as error rates and response latencies are a good way to understand your workload health.
Service validation tests provide metrics on the function and correctness of your API operations. Amazon CloudWatch Synthetics allows you to create scripts that call your service and validate responses. This gives excellent insight into your workload health.
Workload key performance indicators (KPIs) are among the best metrics you can use to understand workload health. KPIs indicate whether the workload is performing as intended and meeting customer needs. For example, an ecommerce workload would look at order rates. Because order rates constantly change with time, CloudWatch anomaly detection can be used to detect when a drop in orders is not typical.
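
As a rough illustration of the KPI approach, the following sketch creates a CloudWatch alarm that fires when an order metric falls below its anomaly detection band. The namespace SampleWebApp, the metric name OrdersPlaced, the alarm name, the five-minute period, and the band width of 2 are all assumptions made for this example, not values from the workload described here. The helper prints the command by default so you can review it first.

```shell
#!/usr/bin/env bash
# Sketch: alarm when an order-rate KPI falls below its anomaly detection band.
# The namespace, metric name, and alarm name are hypothetical placeholders.

# Print each command instead of running it unless DRY_RUN=0 is set.
# The printed form is for review only; shell quoting is not preserved.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi
}

# m1 is the raw KPI; ad1 is the expected band (2 standard deviations).
order_rate_alarm_metrics() {
  cat <<'EOF'
[{"Id": "m1", "ReturnData": true,
  "MetricStat": {"Metric": {"Namespace": "SampleWebApp", "MetricName": "OrdersPlaced"},
                 "Period": 300, "Stat": "Sum"}},
 {"Id": "ad1", "ReturnData": true, "Expression": "ANOMALY_DETECTION_BAND(m1, 2)"}]
EOF
}

# Alarm only on the lower edge of the band: a drop in orders, not a spike.
create_order_rate_alarm() {
  run aws cloudwatch put-metric-alarm \
    --alarm-name orders-below-expected-band \
    --comparison-operator LessThanLowerThreshold \
    --evaluation-periods 3 \
    --threshold-metric-id ad1 \
    --metrics "$(order_rate_alarm_metrics)"
}

create_order_rate_alarm
```

Running the script with DRY_RUN=0 would issue the call for real. Alarming only below the band is a design choice for this sketch, since an unusually high order rate is rarely a reason to fail over.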

2. Restore

In the cloud, you can easily create or delete resources. Using the AWS Command Line Interface (AWS CLI) or an AWS SDK, you can script scaling up resources in the recovery Region, such as the concurrency of AWS Lambda functions, the number of Amazon Elastic Container Service (Amazon ECS) tasks, or the desired EC2 capacity in your EC2 Auto Scaling groups.
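
A minimal sketch of such a script is shown here, assuming one Auto Scaling group, one ECS service, and one Lambda alias in the recovery Region. Every resource name (sample-web-asg, sample-cluster, sample-service, sample-function, the alias live) and every target count is a hypothetical placeholder. Commands are printed rather than executed unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Sketch: scale up recovery Region resources during the restore step.
# All resource names and counts below are hypothetical placeholders.

# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi
}

scale_up_recovery_region() {
  region="$1"

  # The desired capacity must fall within the group's MinSize and MaxSize.
  run aws autoscaling set-desired-capacity --region "$region" \
    --auto-scaling-group-name sample-web-asg --desired-capacity 3

  # Raise the number of running tasks for the ECS service.
  run aws ecs update-service --region "$region" \
    --cluster sample-cluster --service sample-service --desired-count 3

  # Provisioned concurrency is set on a version or alias, not on $LATEST.
  run aws lambda put-provisioned-concurrency-config --region "$region" \
    --function-name sample-function --qualifier live \
    --provisioned-concurrent-executions 10
}

scale_up_recovery_region us-west-2
```

Keeping the three calls in one function makes it easier to rehearse the restore step regularly, which matters because a script that is only run during a real disaster is a script that has never been tested.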

AWS CloudFormation is also a powerful tool for making these updates. Using CloudFormation parameters and conditional logic, you can write a single template that creates either an active stack (primary Region) or a passive stack (recovery Region). The following is an excerpt from a CloudFormation template. It lets you specify “active” or “passive” for the parameter ActiveOrPassive, which determines whether EC2 instances are deployed (active) or not (passive). You can download the entire template here. In this example, to choose between two options, we use the !If function to set the DesiredCapacity value. For more than two options, the !FindInMap function would also be a good choice.

Parameters:
  Web1AutoScaleDesired:
    Default: '3'
    Description: The desired number of web instances in auto scaling group
    # other properties omitted
  Web1AutoScaleMax:
    Default: '6'
    Description: The maximum number of web instances in auto scaling group
    # other properties omitted
  ActiveOrPassive:
    Default: 'active'
    Description: Is this the active (primary) deployment or the passive (recovery) deployment
    Type: String
    AllowedValues:
      - active
      - passive
    ConstraintDescription: enter active or passive, all lowercase

# Determine whether this is an active stack or passive stack
Conditions:
  IsActive: !Equals [!Ref "ActiveOrPassive", "active"]

Resources:
  #https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-as-group.html
  WebAppAutoScalingGroup:
    Type: 'AWS::AutoScaling::AutoScalingGroup'
    Properties:
      MinSize: !If [IsActive, !Ref Web1AutoScaleDesired, 0]
      MaxSize: !Ref Web1AutoScaleMax
      DesiredCapacity: !If [IsActive, !Ref Web1AutoScaleDesired, 0]
      # other properties omitted

The parameter value can be set via the AWS Management Console as shown in Figure 4. Here it is set to “passive,” so no EC2 instances will be deployed.

Figure 4. Setting ActiveOrPassive to “passive” for the CloudFormation stack using parameters

Or, to automate the process, you can use the AWS CLI to update the stack and change the ActiveOrPassive value. The following command updates the EC2 Auto Scaling group, which currently has no EC2 instances, so that it launches three EC2 instances (the value of Web1AutoScaleDesired). No new template is supplied; this command only updates the parameter value to active.

aws cloudformation update-stack \
    --stack-name SampleWebApp --use-previous-template \
    --capabilities CAPABILITY_NAMED_IAM \
    --parameters ParameterKey=ActiveOrPassive,ParameterValue=active
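
In a recovery script you would typically also wait for the update to finish before moving on to failover. The sketch below is one way to do that, under a few assumptions: the recovery Region is us-west-2, and the two capacity parameters may have been overridden at stack creation. On a stack update, a parameter left out of --parameters takes its template default, so the sketch passes UsePreviousValue=true to keep the values the stack is already using.

```shell
#!/usr/bin/env bash
# Sketch: switch the stack to active, then wait for the update to complete.
# The Region is a hypothetical placeholder.

# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi
}

activate_stack() {
  stack="$1"
  region="$2"

  # Change only ActiveOrPassive; keep the stack's current capacity values.
  run aws cloudformation update-stack --region "$region" \
    --stack-name "$stack" --use-previous-template \
    --capabilities CAPABILITY_NAMED_IAM \
    --parameters \
      ParameterKey=ActiveOrPassive,ParameterValue=active \
      ParameterKey=Web1AutoScaleDesired,UsePreviousValue=true \
      ParameterKey=Web1AutoScaleMax,UsePreviousValue=true || return 1

  # Block until the update finishes; exits non-zero on failure or rollback.
  run aws cloudformation wait stack-update-complete \
    --region "$region" --stack-name "$stack" || return 1

  # Report the final stack status for the recovery log.
  run aws cloudformation describe-stacks --region "$region" \
    --stack-name "$stack" --query 'Stacks[0].StackStatus' --output text
}

activate_stack SampleWebApp us-west-2
```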

3. Fail over

Failover redirects production traffic from the primary Region (where you have determined the workload can no longer run) to the recovery Region. If you are using Amazon Route 53 for DNS, you can set up both your primary Region and recovery Region endpoints under one domain name. Then choose a routing policy that determines which endpoint receives traffic for that domain name.

Failover routing will automatically send traffic to the recovery Region if the primary is unhealthy based on health checks you configure. Fully automatic failover such as this should be used with caution. Even using the best practices discussed here, recovery time and recovery point will be greater than zero, incurring some loss of availability and data. If you fail over when you don’t need to (a false alarm), then you incur those losses unnecessarily. If you later need to fall back to the original location, you will incur similar losses again.

To implement manually initiated failover, you can use Amazon Route 53 Application Recovery Controller (Route 53 ARC). With Route 53 ARC, you can create Route 53 health checks that do not actually check health, but instead act as on/off switches that you fully control. Using the AWS CLI or an AWS SDK, you can script failover using this highly available data plane API. Your script toggles these switches (the Route 53 health checks), telling Route 53 to send traffic to the recovery Region instead of the primary Region. Another option for manually initiated failover is to use a weighted routing policy and change the weights of the primary and recovery Regions so that all traffic goes to the recovery Region. However, be aware that changing weights is a control plane operation and is therefore not as resilient as the data plane approach using Route 53 ARC.
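
The switch-toggling script might look something like the following sketch. It assumes one routing control per Region and tries each Regional endpoint of the Route 53 ARC cluster in turn until one call succeeds. The endpoint URLs, the routing control ARNs, and the choice to turn the recovery control on before turning the primary control off are all assumptions for illustration; your own safety rules may require a different order.

```shell
#!/usr/bin/env bash
# Sketch: manually initiated failover with Route 53 ARC routing controls.
# Endpoints and ARNs are hypothetical placeholders, written as region=url.
CLUSTER_ENDPOINTS="us-west-2=https://example-1.route53-recovery-cluster.us-west-2.amazonaws.com/v1
us-east-1=https://example-2.route53-recovery-cluster.us-east-1.amazonaws.com/v1"
PRIMARY_CONTROL_ARN="arn:aws:route53-recovery-control::111122223333:controlpanel/EXAMPLE/routingcontrol/PRIMARY"
RECOVERY_CONTROL_ARN="arn:aws:route53-recovery-control::111122223333:controlpanel/EXAMPLE/routingcontrol/RECOVERY"

# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi
}

# Set one routing control On or Off, trying each cluster endpoint in turn.
set_routing_control() {
  arn="$1"
  state="$2"
  for pair in $CLUSTER_ENDPOINTS; do
    region="${pair%%=*}"   # text before the first "="
    url="${pair#*=}"       # text after the first "="
    if run aws route53-recovery-cluster update-routing-control-state \
         --routing-control-arn "$arn" --routing-control-state "$state" \
         --region "$region" --endpoint-url "$url"; then
      return 0
    fi
  done
  return 1   # no endpoint accepted the request
}

# Open the recovery Region first, then close the primary Region.
fail_over() {
  set_routing_control "$RECOVERY_CONTROL_ARN" On &&
    set_routing_control "$PRIMARY_CONTROL_ARN" Off
}

fail_over
```

Looping over the endpoints is what lets the script keep working when one of the cluster's Regions is itself affected by the disaster.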

Instead of using Route 53 and DNS records, you can also use AWS Global Accelerator to implement failover. Customer traffic is onboarded at the closest of over 200 edge locations and travels over the AWS network to the endpoints you configure, which results in lower latencies. Here too, you can use endpoint health checks for automatic routing, or set the percentage of traffic sent to each endpoint using traffic dials. Note, however, that adjusting traffic dials is a control plane operation.
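
For the traffic dial option, a script could look roughly like this sketch. It assumes one endpoint group per Region, with hypothetical ARNs, and sets the recovery group to 100 percent before setting the primary group to 0 percent. The Global Accelerator API is called in us-west-2 regardless of where your endpoints are.

```shell
#!/usr/bin/env bash
# Sketch: shift traffic with AWS Global Accelerator traffic dials.
# Both endpoint group ARNs are hypothetical placeholders.
PRIMARY_GROUP_ARN="arn:aws:globalaccelerator::111122223333:accelerator/EXAMPLE/listener/EXAMPLE/endpoint-group/PRIMARY"
RECOVERY_GROUP_ARN="arn:aws:globalaccelerator::111122223333:accelerator/EXAMPLE/listener/EXAMPLE/endpoint-group/RECOVERY"

# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi
}

# Set the traffic dial (0 to 100) for one endpoint group.
set_traffic_dial() {
  run aws globalaccelerator update-endpoint-group --region us-west-2 \
    --endpoint-group-arn "$1" --traffic-dial-percentage "$2"
}

# Open the recovery Region fully, then stop sending traffic to the primary.
shift_to_recovery() {
  set_traffic_dial "$RECOVERY_GROUP_ARN" 100 &&
    set_traffic_dial "$PRIMARY_GROUP_ARN" 0
}

shift_to_recovery
```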

Conclusion

In case of disaster, both pilot light and warm standby limit data loss (RPO), and both offer RTO performance sufficient to limit downtime. Between these two strategies, you have a choice of optimizing for RTO (warm standby) or for cost (pilot light).

AWS Architecture Blog