AWS Cloud Operations Blog

Disaster Recovery (DR) Failover to the Disconnected Edge

Introduction

Many enterprises rely on AWS to host the entirety of their infrastructure due to the inherent advantages of cloud computing. However, some enterprises operate mission critical workloads from remote areas at increased risk of losing external network connectivity, such as a research facility located in a remote desert, an oil rig in international waters, or a military base in the remote Pacific. In this blog post, we will explore six best practices for designing disaster recovery for these remote sites with limited external connectivity using AWS services.

These enterprises require a continuity of operations strategy that utilizes local resources to mitigate connectivity risk. The Department of Defense defines this as Denied, Disrupted, Intermittent, and/or Limited (DDIL) communications. AWS offers a disconnected edge compute option with AWS Snowball Edge that runs a subset of AWS services, such as Amazon Elastic Compute Cloud (Amazon EC2), Amazon Simple Storage Service (Amazon S3), and AWS Systems Manager.

Snowball Edge devices can constitute a local Disaster Recovery (DR) site when paired with other AWS services for the primary site running in the cloud. Pairing the power of the cloud with a local DR capability enables enterprises to take advantage of the cloud while managing connectivity risk. With this in mind, let’s dive deeper into planning for disaster recovery between an active site running in the cloud and a contingent site running locally on Snowball Edge devices.

1. Identify Requirements

As with any Disaster Recovery Plan, understanding business requirements is essential. First, consider the criticality of the workload – how important is it to the business? To the mission? What would the impact be if this workload went down? Identifying these answers and bucketing workloads by criticality level is an effective way to document and manage requirements. Underestimating criticality could result in business impact during an event, while overestimating the workload criticality will drive unnecessary cost and complexity. Working backwards from business and mission requirements is critical to striking the appropriate balance.

When documenting workload criticality, it is important to scope the workload to ensure all required components are available in case of a failover. Workload scope must account for dependency trees. For instance, if Application A is critical and depends on Application B, Database C, and Message Broker D, those dependencies are also critical.
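One way to make the dependency-tree scoping concrete is to walk the dependency graph and collect everything a critical workload needs. The following is a minimal sketch, not a prescribed tool: the function name `critical_scope`, the dependency map, and the extra entries beyond Application A's three dependencies (such as "Database E") are hypothetical and only illustrate transitive dependencies.

```python
from collections import deque


def critical_scope(root: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return the root workload plus everything it depends on, directly or indirectly."""
    scope = {root}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for dependency in depends_on.get(current, []):
            # The membership check also protects against circular dependencies.
            if dependency not in scope:
                scope.add(dependency)
                queue.append(dependency)
    return scope


# Hypothetical dependency map for illustration only.
DEPENDS_ON = {
    "Application A": ["Application B", "Database C", "Message Broker D"],
    "Application B": ["Database E"],
    "Application F": ["Database G"],
}

if __name__ == "__main__":
    for component in sorted(critical_scope("Application A", DEPENDS_ON)):
        print(component)
```

In this sketch, anything returned for a critical workload inherits that workload's criticality, including indirect dependencies that are easy to miss in a manual review.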

When determining the level of criticality, consider the Recovery Point Objective (RPO) and Recovery Time Objective (RTO). RPO is the maximum tolerable amount of data loss for a failover event, measured in time, and RTO is the maximum duration the system can be unavailable. These requirements will drive replication frequencies and failover processes. Figure 1 below describes four common disaster recovery strategies for meeting them. Refer to Disaster Recovery Options in the Cloud for more information on these strategies.

Graph showing disaster recovery strategies and highlights of each strategy.

Figure 1 – Disaster recovery strategies

All the above elements become even more important in a disconnected scenario, where you are failing over from the primary site in the cloud to a local failover site with limited resources. Which workloads are mission critical enough to fail over? Which should be prioritized in the event of insufficient resources? All of these considerations should be analyzed when planning your failover architecture.
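To illustrate prioritizing workloads when local resources are insufficient, here is a small sketch that places workloads on the local site in order of criticality until capacity runs out. The `Workload` fields, the tier numbering (1 = most critical), and the capacity figures are assumptions for illustration; real planning would also consider storage, GPUs, and licensing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    tier: int          # assumed convention: 1 = most critical
    vcpus: int
    memory_gib: int


def plan_failover(workloads, vcpu_capacity, memory_capacity_gib):
    """Greedily place workloads by criticality tier; return (placed, deferred) names."""
    placed, deferred = [], []
    free_vcpus, free_memory = vcpu_capacity, memory_capacity_gib
    for workload in sorted(workloads, key=lambda w: (w.tier, w.name)):
        if workload.vcpus <= free_vcpus and workload.memory_gib <= free_memory:
            placed.append(workload.name)
            free_vcpus -= workload.vcpus
            free_memory -= workload.memory_gib
        else:
            deferred.append(workload.name)
    return placed, deferred


if __name__ == "__main__":
    # Hypothetical portfolio and local capacity.
    portfolio = [
        Workload("LoginService", tier=1, vcpus=8, memory_gib=32),
        Workload("Reporting", tier=3, vcpus=16, memory_gib=64),
        Workload("Messaging", tier=2, vcpus=8, memory_gib=16),
    ]
    print(plan_failover(portfolio, vcpu_capacity=20, memory_capacity_gib=64))
```

A greedy pass like this is deliberately simple. It makes the trade-off visible early: any workload that lands in the deferred list either needs more local capacity or an explicit decision that it will not fail over.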

2. Architect Critical Workloads for Parity

Once criticality is determined, highly critical workloads must be architected to run in the cloud and locally by utilizing capabilities available on both sites. For example, you might elect to run containers on Amazon Elastic Kubernetes Service (EKS) rather than Amazon Elastic Container Service (ECS) because EKS can run on Snowball Edge with EKS Anywhere.

Leveraging AWS services based on Open Source Software (OSS) or Commercial Off the Shelf (COTS) software that can run locally on EC2 instances can mitigate differences in service availability between the two sites. This allows customers to realize the benefits of managed services in the cloud while maintaining Snowball Edge feature parity with limited overhead. For example, using the Amazon Relational Database Service (RDS) in the cloud will yield the security, reliability, and operational benefits customers are used to from AWS managed services while still allowing you to fail over to Snowball Edge. You can accomplish this by installing and operating the same database engine locally on EC2 instances to support your critical workload. For example, RDS can run PostgreSQL 15.3 in the cloud, and you can run the same version locally. Managing workload dependencies across both sites is critical to an effective failover capability.
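A lightweight way to keep an eye on parity is to compare the component versions recorded for each site. The sketch below assumes you already maintain a simple inventory of component names and versions per site; the function name `parity_gaps` and the sample inventories (other than PostgreSQL 15.3 from the example above) are hypothetical.

```python
from typing import Optional


def parity_gaps(
    cloud: dict[str, str], local: dict[str, str]
) -> dict[str, tuple[Optional[str], Optional[str]]]:
    """Return components whose versions differ, as {name: (cloud_version, local_version)}.

    A value of None means the component is missing from that site's inventory.
    """
    gaps = {}
    for component in sorted(set(cloud) | set(local)):
        cloud_version = cloud.get(component)
        local_version = local.get(component)
        if cloud_version != local_version:
            gaps[component] = (cloud_version, local_version)
    return gaps


if __name__ == "__main__":
    # Hypothetical inventories for illustration only.
    cloud_inventory = {"postgresql": "15.3", "kubernetes": "1.27"}
    local_inventory = {"postgresql": "15.2", "kubernetes": "1.27"}
    print(parity_gaps(cloud_inventory, local_inventory))
```

An empty result suggests the recorded versions match. It does not prove functional parity, so it complements failover testing rather than replacing it.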

3. Account for Foundational Capability Differences Between Cloud and Local Site

The cloud and local sites run workloads with different foundational supporting services. These supporting services include networking capabilities, computing capabilities, and monitoring capabilities. Fortunately, proper planning as part of an enterprise DDIL DR/COOP strategy can easily overcome these differences:

  • Routing – Many customers use Amazon Virtual Private Cloud (Amazon VPC) in the cloud for features such as network subnetting, routing, Dynamic Host Configuration Protocol (DHCP), and Network Time Protocol (NTP). Amazon VPC is not available on Snowball Edge, but you can replicate these features; for example:
    • The local switch and/or router used to connect the Snowball Edge devices and other local equipment can likely subnet and apply traffic routing logic.
    • You can install the dhcp open source package on a Snowball Edge. It contains the Internet Systems Consortium (ISC) DHCP Server.
    • You can install the chrony package on a Snowball Edge to use its implementation of NTP for your local site.
  • DNS – Amazon Route 53 provides DNS in the cloud but is not available on Snowball Edge. You can install and run Berkeley Internet Name Domain (BIND) locally on Snowball Edge to provide DNS functionality for the local site.
  • Load Balancing – Elastic Load Balancing (ELB) provides load balancing services within the cloud. Though ELB is not available on Snowball Edge, you can install open source load balancers like HAProxy or NGINX locally. You may also consider COTS alternatives like F5 BIG-IP.
  • Compute – Amazon EC2 is available in the cloud and on Snowball Edge, but offers different hardware capabilities. Review hardware dependencies against EC2 on both sites to ensure compatibility, especially with respect to processor architecture and GPUs.
  • Monitoring – Amazon CloudWatch is the primary service for application and infrastructure monitoring in the cloud. Amazon CloudWatch is not available on Snowball Edge, but you can use OSS like OpenSearch, Prometheus, Grafana, Fluent Bit and OpenTelemetry to collect, process and visualize the operational state.
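The substitutions in the list above can be captured as a simple lookup that a planning script consults for each workload. This is only a sketch: the mapping restates the options named above, while the function name `local_requirements` and the sample workload are hypothetical.

```python
# Cloud service -> local options named in the list above.
LOCAL_SUBSTITUTES = {
    "Amazon VPC": ["local switch/router", "ISC DHCP Server", "chrony"],
    "Amazon Route 53": ["BIND"],
    "Elastic Load Balancing": ["HAProxy", "NGINX", "F5 BIG-IP"],
    "Amazon EC2": ["Amazon EC2 on Snowball Edge"],
    "Amazon CloudWatch": [
        "OpenSearch", "Prometheus", "Grafana", "Fluent Bit", "OpenTelemetry",
    ],
}


def local_requirements(cloud_services):
    """Split a workload's cloud services into planned substitutes and unmapped services."""
    planned, unmapped = {}, []
    for service in cloud_services:
        if service in LOCAL_SUBSTITUTES:
            planned[service] = LOCAL_SUBSTITUTES[service]
        else:
            # No documented local option yet: needs a design decision.
            unmapped.append(service)
    return planned, unmapped


if __name__ == "__main__":
    # Hypothetical workload that uses three cloud services.
    print(local_requirements(["Amazon Route 53", "Amazon CloudWatch", "Amazon SQS"]))
```

Services that end up in the unmapped list are the ones to resolve during design, before a failover exposes the gap.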

The graphic below shows how these pieces come together.

Overview of Available AWS services in region versus on Snowball Edge, including Cross-Site design considerations.

Figure 2 – Cloud and Local Site Available Services with Design Considerations

4. Define Your Data Replication Strategy

Data must be periodically replicated from the cloud to the local site, so that the passive local site can be activated when needed. Data refers to assets that the staged application will need to satisfy the RPO requirement and could include configuration files, data files, and/or database snapshots. The data replication frequency also depends on RPO; the lower the RPO, the more frequently the data must be refreshed.
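The relationship between RPO and replication frequency can be checked with simple arithmetic. The sketch below assumes replication jobs start at a fixed interval and that each job finishes before the next one starts. Under that simplification, the local copy can be as stale as one interval plus one transfer duration at the moment connectivity is lost. The function names and sample durations are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(interval: timedelta, transfer_time: timedelta) -> timedelta:
    """Oldest the local copy can be if the link drops just before a sync completes."""
    return interval + transfer_time


def meets_rpo(rpo: timedelta, interval: timedelta, transfer_time: timedelta) -> bool:
    """True if the replication schedule keeps worst-case data loss within the RPO."""
    return worst_case_data_loss(interval, transfer_time) <= rpo


if __name__ == "__main__":
    rpo = timedelta(hours=1)
    # Hypothetical schedule: sync every 30 minutes, each sync takes about 10 minutes.
    print(meets_rpo(rpo, timedelta(minutes=30), timedelta(minutes=10)))
```

Measuring actual transfer times over the site's real link is important here, since limited bandwidth can make the transfer duration the dominant term.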

When replicating data at scale for a portfolio of workloads, it’s helpful to stage the data in a single source of truth with a clear taxonomy, defined as part of DR/COOP design. The structure should be standardized and mirror your organization so that it is intuitive to navigate. One simple bucket structure would be portfolio, workload, and component (e.g., EnterpriseSecurity / LoginService / FrontEnd). Processes and guardrails should also enforce the structure to preserve its integrity. For instance, you can create IAM policies limiting which prefixes various roles or users can access, preventing administrators from replicating data to the wrong places. Reference the Amazon S3 documentation for an example of how to apply permissions to prefixes.
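As an illustration of a prefix guardrail, the sketch below generates an IAM policy document that limits a role to a single portfolio/workload prefix in the staging bucket. The bucket name, the function name `prefix_policy`, and the choice of actions are assumptions; the `aws` partition in the ARN is also an assumption and differs in other partitions such as AWS GovCloud (US). Review any generated policy against the Amazon S3 documentation before use.

```python
import json


def prefix_policy(bucket: str, prefix: str) -> dict:
    """Build an IAM policy document scoped to one prefix of the staging bucket."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action, so restrict it with a condition.
                "Sid": "ListOnlyThisPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # Object reads and writes are limited by the resource ARN itself.
                "Sid": "ReadWriteOnlyThisPrefix",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


if __name__ == "__main__":
    # Hypothetical bucket name; the prefix follows the portfolio/workload structure.
    policy = prefix_policy("dr-staging-example", "EnterpriseSecurity/LoginService")
    print(json.dumps(policy, indent=2))
```

Generating policies from the taxonomy, rather than writing them by hand, helps keep permissions aligned with the bucket structure as workloads are added.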

Amazon S3 is a logical place to serve as this source of truth because it is highly available, durable, secure, cost effective, and available on Snowball Edge devices, which simplifies the replication process. In fact, local Snowball Edge devices can be clustered to scale local S3 storage.

Data can be copied into the S3 staging zone using a number of methods depending on the source location. For example:

  • Amazon RDS PostgreSQL can natively copy database snapshots to S3.
  • AWS DataSync can copy data from multiple shared file systems to S3.
  • AWS CLI can copy local files to S3.

Snowball Edge comes with AWS Systems Manager – a secure end-to-end management solution for resources on AWS and hybrid environments. It can continuously run tasks on a local Snowball Edge to synchronize the S3 staging bucket in the cloud with the corresponding local S3 bucket using a tool like s5cmd or the AWS CLI. Once replicated, Systems Manager tasks can apply those artifacts to the passive local environment so that it’s ready when needed.
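The synchronization step might look like the following sketch, which a scheduled task could invoke. It assumes a two-hop approach with the AWS CLI: pull the cloud staging prefix to a scratch directory, then push it to the local S3 endpoint using a separate CLI profile. The bucket names, scratch path, endpoint address, and profile name are all hypothetical, and the two-hop design is one option among several rather than the documented method.

```python
import subprocess


def build_sync_commands(cloud_bucket, local_bucket, prefix, scratch_dir,
                        local_endpoint, local_profile):
    """Return the two `aws s3 sync` commands for a cloud -> scratch -> local copy."""
    prefix = prefix.strip("/")
    scratch_path = f"{scratch_dir}/{prefix}/"
    pull = ["aws", "s3", "sync", f"s3://{cloud_bucket}/{prefix}/", scratch_path]
    push = [
        "aws", "s3", "sync", scratch_path, f"s3://{local_bucket}/{prefix}/",
        "--endpoint-url", local_endpoint,   # assumed address of the local S3 endpoint
        "--profile", local_profile,         # assumed profile holding local credentials
    ]
    return [pull, push]


def replicate(commands, dry_run=True):
    """Print the commands, or run them in order and stop on the first failure."""
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)


if __name__ == "__main__":
    replicate(build_sync_commands(
        cloud_bucket="dr-staging-example",
        local_bucket="dr-staging-local",
        prefix="EnterpriseSecurity/LoginService/FrontEnd",
        scratch_dir="/var/tmp/dr-scratch",
        local_endpoint="https://192.0.2.10",
        local_profile="snowball",
    ))
```

Because `aws s3 sync` only transfers new or changed objects, repeated runs stay relatively cheap on a constrained link. The dry-run default makes it easy to review the commands before scheduling them.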

Sample Data Replication Design between an AWS Region and a Local Site using AWS Snowball Edge devices. Primary site shows EC2 application server and RDS DB instance replicating into a data staging bucket. Local Site shows data flowing from primary site to a data replication Systems Manager Task and redeployed with an Artifact Deployment task to application and DB servers running on a Compute Optimized Snowball Edge. Systems Manager on the Snowball Edge also replicates to a data staging bucket on a Snowball Edge S3 Compatible Storage cluster.

Figure 3 – Sample Primary to Local Site Data Replication Design

5. Design for High Availability within the Local Site

Just as you would replicate applications and data across multiple availability zones in the cloud to protect against AZ failure, consider replicating those assets across multiple Snowball Edge devices to mitigate Snowball Edge device failure risk. Deploy applications across multiple Snowball Edge devices in an active/passive or active/active configuration using a load balancer or DNS health checks for failovers. To ensure that critical data is not stored on a single Snowball Edge device, you can use several tools to manage applications and data across Snowball Edge devices:

  • Systems Manager can trigger, manage, and report on replication jobs.
  • A load balancer can distribute network traffic across devices with health checks.
  • rsync can replicate application files.
  • Database clustering can replicate data across multiple Snowball Edge devices.
  • S3 Compatible Storage Clusters can manage availability across Snowball Edge devices.

Consider the following example, which employs a subset of these tools to provide redundancy:

  • The application and database servers are replicated across two Snowball Edge Devices in an active/passive configuration. The passive instance is running and accessible to the end user, so it is ready to activate as needed. A DNS service with health checks could be used to trigger a failover.
  • The database servers are clustered across Snowball Edge devices. Each transaction is replicated across Snowball Edge devices to prevent data loss in case of a hardware failure.
  • The application leverages a Snowball Edge cluster (multiple Snowball Edge devices) running S3. Said clusters replicate data across multiple devices so that data is not lost when a device fails. Refer to the Clustering Overview documentation for more information.
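The health-check-driven failover in the example above can be sketched as a function that probes endpoints in priority order and returns the first healthy one. This is a simplified illustration, not how a DNS service or load balancer implements it: the TCP connect probe, the function names, and the addresses in the usage example are assumptions.

```python
import socket
from typing import Callable, Optional, Sequence

Endpoint = tuple[str, int]


def tcp_healthy(host: str, port: int, timeout: float = 2.0) -> bool:
    """Treat an endpoint as healthy if a TCP connection can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def select_active(
    endpoints: Sequence[Endpoint],
    probe: Callable[[str, int], bool] = tcp_healthy,
) -> Optional[Endpoint]:
    """Return the first healthy endpoint in priority order, or None if all fail."""
    for host, port in endpoints:
        if probe(host, port):
            return (host, port)
    return None


if __name__ == "__main__":
    # Hypothetical addresses: active instance first, passive instance second.
    candidates = [("192.0.2.11", 5432), ("192.0.2.12", 5432)]
    print(select_active(candidates))
```

A production health check would usually require several consecutive failures before switching, to avoid flapping on a single dropped probe.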

Sample Local Site Redundancy Architecture using Snowball Edges for S3 storage clustering and running application and DB EC2 instances.

Figure 4 – Sample Local Site Redundancy Design with Snowball Edge Devices

Lastly, plan for hardware failures outside of Snowball Edge devices as well. For example, if you are using a local switch to connect Snowball Edge and local workstations, consider what will happen when that switch fails and how your organization will recover. Planning for device failure on the local site will yield dividends when activated during a disaster.

6. Plan for Operations and Maintenance

Maintaining a passive site in heterogeneous environments requires ongoing attention. Configurations applied to the cloud site and not successfully translated to the local site will likely cause issues during failover. However, this risk can be mitigated by applying the following operational practices:

  • Explicitly account for both environments during the change control process: how will this change impact the Secondary Environment as well as the Primary Environment, and what is the test plan for each environment? Environmental changes should be treated as atomic operations: either applied to both environments or to neither.
  • Maintain separate configuration baselines but cross reference them to maintain traceability.
  • Leverage Infrastructure as Code (IaC) tooling, such as Ansible, to consistently apply changes across both sites.
  • Ensure role-based training considers AWS Region capabilities, Snowball Edge capabilities, and key Standard Operating Procedures (SOPs) like how to fail over.
  • Periodically review mission critical functions against workload categorization and execute failover exercises to test the failover procedure as well as Secondary Environment capabilities against mission critical functionality.
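The idea of treating an environmental change as an atomic operation can be sketched as follows: apply the change to the primary environment, then to the secondary, and roll the primary back if the secondary step fails. The function and callback names are hypothetical, and in practice each callback would wrap your IaC tooling rather than plain Python functions.

```python
from typing import Callable


def apply_change_atomically(
    apply_primary: Callable[[], None],
    apply_secondary: Callable[[], None],
    rollback_primary: Callable[[], None],
) -> bool:
    """Apply a change to both environments, or leave neither changed.

    Returns True if both environments were updated, False if the secondary
    step failed and the primary was rolled back. If the primary step itself
    fails, its exception propagates and nothing has been applied.
    """
    apply_primary()
    try:
        apply_secondary()
    except Exception as error:
        rollback_primary()
        print(f"Secondary change failed ({error}); primary change rolled back.")
        return False
    return True


if __name__ == "__main__":
    # Hypothetical change steps that only record what happened.
    history = []
    succeeded = apply_change_atomically(
        apply_primary=lambda: history.append("primary updated"),
        apply_secondary=lambda: history.append("secondary updated"),
        rollback_primary=lambda: history.append("primary rolled back"),
    )
    print(succeeded, history)
```

When the local site is unreachable at change time, a sketch like this would roll the change back. An alternative design is to queue the secondary step and block further changes until it is applied, which is a decision for your change control process.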

Additionally, we offer AWS OpsHub to centrally manage existing Snowball Edge devices locally in an environment. This provides a graphical user interface for all Snowball API operations and a unified view of all AWS services running on local Snowball Edge devices. AWS OpsHub also works with AWS Systems Manager to automate operational tasks on Snowball Edge devices, as discussed earlier. Refer to the AWS OpsHub documentation for more information.

Conclusion

Implementing a DR/COOP strategy with Snowball Edge as the local site enables organizations to leverage the benefits of the cloud while mitigating connectivity risk. To be successful in this endeavor, enterprises must understand which workloads and features are mission critical, the dependency tree for each mission critical workload, and how that workload would operate on both sites. Enterprises must also account for differences between each environment in their ongoing operational processes to ensure the local site is ready for failover. When implemented successfully, a COOP strategy with a local Snowball Edge site enables enterprises to harness the power of the cloud, even for mission critical workloads in remote locations with limited or intermittent connectivity.

If your organization is interested in adopting AWS but unable to do so because of limited connectivity, consider this local COOP strategy as a mitigation. If your organization is already operating in the cloud but is concerned about intermittent connectivity, consider building a local site on Snowball Edge. Connect with your Account Team to learn more about AWS Edge Services.

About the authors:

David Horne

David Horne is a Sr. Solutions Architect supporting Federal System Integrators at AWS. He is based in Washington, DC and has 15 years of experience building, modernizing and integrating systems for Public Sector customers. Outside of work, Dave enjoys playing with his kids, hiking, and watching Penn State football!

Luiz Felipe Breyer Palhano de Jesus

Felipe Palhano is a Sr. Solutions Architect supporting large federal partners as part of the Growth, Partnerships, and Strategy (GPS) organization at AWS. He is based in Arlington, VA and has over 10 years of experience leading the architecture, migration, and optimization of mission critical systems for federal system integrators across the public sector space.