AWS Storage Blog

Well-Architected approach to CloudEndure Disaster Recovery

Disaster recovery (DR) is a critical element of a company’s business continuity plan, enabling the quick resumption of IT operations and minimizing data loss if a disaster were to strike. Organizations often carry a high operational and cost burden to ensure continued operation of their applications and databases in case of disaster. This includes operating a second physical site with duplicate hardware and software, managing multiple hardware- or application-specific replication tools, and ensuring readiness via periodic drills.

AWS’s CloudEndure Disaster Recovery (CEDR) makes it easy to shift your DR strategy to the AWS Cloud from existing physical or virtual data centers, private clouds, or other public clouds. CEDR is available on the AWS Marketplace as a software as a service (SaaS) contract and as a SaaS subscription. In this SaaS delivery model, AWS hosts and operates the CEDR application. As an additional component, CEDR uses an operating system-level agent on each server that is to be replicated and protected in AWS. The agent performs block-level replication to AWS.

The Well-Architected Framework has been developed to help cloud architects build secure, high-performing, resilient, and efficient infrastructure for their applications. It is based on five pillars: operational excellence, security, reliability, performance efficiency, and cost optimization. The Framework provides a consistent approach for customers and partners to evaluate architectures and implement designs that will scale over time.

Before deploying CEDR, it is recommended that you design, apply, and continually review the Well-Architected Framework to ensure that your target architecture aligns with cloud best practices. This blog provides guidance on applying the AWS Well-Architected Framework’s five pillars, along with best practices for aligning your CEDR deployment to each pillar to ensure success.

Operational Excellence pillar

Ensuring operational excellence when implementing CEDR begins with readiness. Educate the team by using the documentation, guides, videos, and playbooks in the CEDR online documentation. Once you are familiar with CEDR, you should:

  1. Map your source environment into logical groups for DR. These groups may be based on location, application criticality, service level agreements (SLAs), recovery point objectives (RPO), recovery time objectives (RTO), etc. CEDR uses these groupings to prioritize and sequence recovery.
  2. Ensure that the requisite staging and target VPCs to be used by CEDR are aligned to the Well-Architected Framework. This also applies to other networking resources. CEDR orchestrates Amazon EC2 instances, Amazon EBS volumes, and VPC components for operation and recovery. Compute, storage, and network service limits should be validated and increased where necessary. Consider leveraging capacity reservations to ensure EC2 capacity when needed. Review account-level resource and API quotas, and limits that may be encountered during operation and recovery. Individual account API limits can be mitigated by using separate accounts for staging, and then recovering into a single production account.
  3. Automate deployment, monitoring, and management of CEDR using the available CEDR API. Use the monitoring and logging data to track the operational status of CEDR, and feed the data into existing centralized monitoring tools (see the sketch after this list).
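
As a starting point for step 3, the following minimal sketch authenticates against the CloudEndure API and pulls a per-machine replication status that can be forwarded to your monitoring tooling. The endpoint paths, the userApiToken login flow, and the machine field names are assumptions based on the CloudEndure API documentation; verify them against the current API reference before use.

```python
import requests

API = "https://console.cloudendure.com/api/latest"

def cloudendure_session(api_token):
    """Authenticate against the CloudEndure User Console and return a session."""
    session = requests.Session()
    resp = session.post(f"{API}/login", json={"userApiToken": api_token})
    resp.raise_for_status()
    # Later calls must echo the XSRF token that the login call sets as a cookie.
    session.headers["X-XSRF-TOKEN"] = session.cookies.get("XSRF-TOKEN", "").strip('"')
    return session

def replication_status(session, project_id):
    """Map each machine's name to its last consistent point in time."""
    machines = session.get(f"{API}/projects/{project_id}/machines").json()["items"]
    return {
        m["sourceProperties"]["name"]:
            m.get("replicationInfo", {}).get("lastConsistencyDateTime")
        for m in machines
    }

if __name__ == "__main__":
    s = cloudendure_session("YOUR-CEDR-API-TOKEN")   # placeholder API token
    print(replication_status(s, "YOUR-PROJECT-ID"))  # feed into centralized monitoring
```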

Security pillar

The Well-Architected Framework’s security principles should be applied at multiple layers when deploying CEDR. Design principles, including strong identity protection, traceability, all-layer network security, and data protection, should be extended to CEDR.

To automate orchestration and recovery, CEDR uses the AWS API via IAM user credentials with programmatic access. The CEDR IAM user uses a policy that permits the requisite actions, filtered by tagging, for resources, where AWS API functionality permits. IAM policies leveraging additional restrictions are available in the CEDR documentation. CEDR adds an extra layer of security by tagging all orchestrated resources with CEDR tags, and only tracking resources with these tags. The CEDR IAM user credential keys should be rotated on a regular basis, at least every 90 days. Key rotation can be automated using the CEDR API and the AWS API, as in this example code.
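
A minimal sketch of the IAM side of that rotation, using boto3, is shown below. The IAM user name is hypothetical, and the step of updating the stored credentials in the CEDR User Console (via the console or the CEDR API) is left as a comment rather than a concrete call.

```python
import boto3

iam = boto3.client("iam")
CEDR_IAM_USER = "cloudendure-service-user"  # hypothetical IAM user name

def rotate_cedr_access_key():
    """Create a fresh access key for the CEDR IAM user and retire the old one."""
    old_keys = iam.list_access_keys(UserName=CEDR_IAM_USER)["AccessKeyMetadata"]

    new_key = iam.create_access_key(UserName=CEDR_IAM_USER)["AccessKey"]
    # At this point, update the AWS credentials stored in the CEDR User Console
    # (via the console UI or the CEDR API) before deactivating the old key.

    for key in old_keys:
        iam.update_access_key(UserName=CEDR_IAM_USER,
                              AccessKeyId=key["AccessKeyId"],
                              Status="Inactive")
        iam.delete_access_key(UserName=CEDR_IAM_USER,
                              AccessKeyId=key["AccessKeyId"])

    return new_key["AccessKeyId"]
```

Note that an IAM user can hold at most two access keys, so the old key must be deactivated and deleted before the next rotation cycle.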

Access to the CEDR User Console is authenticated via user email and validation of a one-way hash of the password. User management permits limiting access to specific projects, and should be implemented where practical. To facilitate and secure user management, consider integration with an existing IdP (for example, AD FS, Ping, Okta) via SAML. SAML ensures that access to CEDR is centrally managed and aligns with corporate authentication standards, including SSO, password requirements, password rotation, MFA, and automated user provisioning.

The CEDR agent uses an HTTPS connection to the User Console, which is used for management and monitoring. The User Console stores metadata about the source servers in an encrypted database. The source data is replicated directly from the source infrastructure to the target AWS account. While the connection can be public or private, it is recommended to use a private connection rather than the public internet. Enable a private connection and disable the allocation of public IPs for CEDR replication servers under replication settings.
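
The authoritative setting lives in the CEDR replication settings, but as a complementary guardrail you can confirm that the staging-area subnet itself does not auto-assign public IPs. A minimal boto3 sketch, with a hypothetical subnet ID, follows.

```python
import boto3

ec2 = boto3.client("ec2")
STAGING_SUBNET_ID = "subnet-0123456789abcdef0"  # hypothetical staging-area subnet

def staging_subnet_is_private(subnet_id):
    """Verify the staging subnet does not auto-assign public IPs to launched instances."""
    subnet = ec2.describe_subnets(SubnetIds=[subnet_id])["Subnets"][0]
    return not subnet["MapPublicIpOnLaunch"]

print(staging_subnet_is_private(STAGING_SUBNET_ID))
```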

To ensure data security, CEDR encrypts all data replication traffic from the source to the staging area in your AWS account using the AES-256 encryption standard. Data at rest should be encrypted using AWS Key Management Service (AWS KMS). CEDR should be configured, under replication settings, to use the appropriate KMS key for EBS encryption.
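
The per-project KMS key is selected in the CEDR replication settings. As an additional account-level safeguard in the staging and target account, you can also enforce EBS encryption by default and point it at the intended key. A minimal boto3 sketch, with a hypothetical key ARN, follows.

```python
import boto3

ec2 = boto3.client("ec2")
CEDR_KMS_KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/example"  # hypothetical CMK

def enforce_ebs_encryption_defaults():
    """Ensure EBS encryption by default is on and points at the intended KMS key."""
    if not ec2.get_ebs_encryption_by_default()["EbsEncryptionByDefault"]:
        ec2.enable_ebs_encryption_by_default()
    ec2.modify_ebs_default_kms_key_id(KmsKeyId=CEDR_KMS_KEY_ARN)
    print(ec2.get_ebs_default_kms_key_id()["KmsKeyId"])

enforce_ebs_encryption_defaults()
```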

Reliability pillar

The best mechanism to ensure reliability in case of a disaster is to regularly validate and test recovery procedures. CEDR enables unlimited test launches, allowing both spot testing and full user acceptance and application testing. It is critical to test launch instances after initial synchronization to confirm availability and operation.
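
Test launches can also be scripted so that drills run on a schedule. The sketch below issues a test launch through the CloudEndure API, reusing the session helper from the earlier sketch; the launchMachines endpoint and the TEST launch type are assumptions to verify against the current API reference.

```python
import requests

API = "https://console.cloudendure.com/api/latest"

def test_launch(session, project_id, machine_ids):
    """Start a test launch (drill, not failover) for a batch of machines."""
    body = {
        "items": [{"machineId": m} for m in machine_ids],
        "launchType": "TEST",
    }
    resp = session.post(f"{API}/projects/{project_id}/launchMachines", json=body)
    resp.raise_for_status()
    return resp.json()  # job object tracking the launch

# session = cloudendure_session("YOUR-CEDR-API-TOKEN")  # helper from the earlier sketch
# test_launch(session, "YOUR-PROJECT-ID", ["machine-id-1", "machine-id-2"])
```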

Automate recovery by monitoring key performance indicators (KPIs), which vary by organization and workload. Once the KPIs are identified, you can integrate CEDR launch triggers into existing monitoring tools. An example of monitoring and automating DR using CEDR is detailed in this AWS blog.
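
As one pattern for tying KPIs to CEDR, the sketch below creates an Amazon CloudWatch alarm on a hypothetical heartbeat metric published by the primary site; the alarm's SNS topic can then invoke an AWS Lambda function that calls the CEDR launch API. The metric namespace, metric name, and topic ARN are placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical KPI: a custom heartbeat metric published by the primary site.
cloudwatch.put_metric_alarm(
    AlarmName="primary-site-heartbeat-missing",
    Namespace="Custom/DR",                # assumed custom namespace
    MetricName="PrimarySiteHeartbeat",    # assumed custom metric
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=5,
    Threshold=1,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",
    # The SNS topic triggers a Lambda function that calls the CEDR launch API.
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:dr-failover-topic"],
)
```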

The CEDR User Console is managed by AWS and uses Well-Architected principles ensuring reliability and scalability. DR and redundancy plans are in place to ensure availability of the CEDR User Console. CEDR is leveraged to replicate its own User Console using a separate and isolated stack. This ensures recoverability of the User Console if the public SaaS version becomes unavailable.

Performance Efficiency pillar

As a SaaS solution, CEDR applies the performance efficiency pillar to maintain performance as demand changes. The replication architecture uses t3.small replication servers that can support the replication of most source servers. At times, source servers handling intensive write operations may require larger or dedicated replication servers. Dedicated or larger replication servers may be selected with minimal interruption to replication and used for limited periods of time, for example to reduce initial sync times. CEDR uses an m5.xlarge instance type as the default dedicated replication server, which provides increased network and EBS bandwidth for write-intensive workloads where a shared t3.small replication server is the ingest bottleneck.

To meet RTOs, typically measured in minutes, CEDR defaults target EBS volumes to Provisioned IOPS SSD (IO1) in the blueprint. During the target launch process, the changed hardware configuration can trigger an intensive I/O re-scan of all hardware and drivers. IO1 volumes reduce the impact of this re-scan on RTO. If IO1 volumes are not required for normal workload performance, we recommend that the volume type be programmatically changed after instance initialization (a sketch follows). Alternatively, Standard or SSD volume types may be selected in the blueprint before launch. Be sure to test launch the various volume types to ensure they meet your RTO requirements.
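
A minimal boto3 sketch of that post-launch change is shown below; it moves an instance's IO1 volumes to GP2 in place using Elastic Volumes. The instance ID is a placeholder, and the target volume type should match your workload's performance needs.

```python
import boto3

ec2 = boto3.client("ec2")

def downgrade_volumes(instance_id, target_type="gp2"):
    """Switch an instance's IO1 volumes to a cheaper type once recovery is complete."""
    volumes = ec2.describe_volumes(
        Filters=[{"Name": "attachment.instance-id", "Values": [instance_id]}]
    )["Volumes"]
    for vol in volumes:
        if vol["VolumeType"] == "io1":
            # Elastic Volumes modifies the type in place, with no detach or downtime.
            ec2.modify_volume(VolumeId=vol["VolumeId"], VolumeType=target_type)

downgrade_volumes("i-0123456789abcdef0")  # hypothetical recovered instance ID
```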

While CEDR encrypts all traffic in transit, it is recommended to use a secure connection from the source infrastructure to AWS via a VPN or AWS Direct Connect. The connection must have enough bandwidth to support the rate of data change for ongoing replication, including spikes and peaks. Network efficiency and saturation may be impacted during the initial synchronization of data. CEDR agents utilize available bandwidth when replicating data. Throttling can be used to reduce impact on shared connections, and can be accomplished using bandwidth shaping tools to limit traffic on TCP port 1500, or using the throttling option within CEDR. Considerations for throttling should include programmatically scheduling limits to avoid peak times, and understanding the impact of throttling on RPOs. It is recommended that all throttling be disabled once the initial sync of data is complete.
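
Scheduling those limits can be scripted against the CloudEndure API; the sketch below sets or clears a project-level bandwidth throttle, reusing the session helper from the earlier sketch. The replicationConfigurations endpoint and the bandwidthThrottling field (in Mbps, with 0 disabling throttling) are assumptions to verify against the current API reference.

```python
import requests

API = "https://console.cloudendure.com/api/latest"

def set_replication_throttle(session, project_id, mbps):
    """Set (or clear, with 0) the replication bandwidth throttle for a project."""
    configs = session.get(
        f"{API}/projects/{project_id}/replicationConfigurations").json()["items"]
    config_id = configs[0]["id"]
    resp = session.patch(
        f"{API}/projects/{project_id}/replicationConfigurations/{config_id}",
        json={"bandwidthThrottling": mbps},  # assumed field name; 0 disables throttling
    )
    resp.raise_for_status()

# Schedule set_replication_throttle(session, PROJECT_ID, 200) during business hours and
# set_replication_throttle(session, PROJECT_ID, 0) off-hours, for example via cron or an
# Amazon EventBridge rule invoking AWS Lambda.
```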

Cost Optimization pillar

CEDR takes advantage of the most effective services and resources to achieve an RPO of seconds, and an RTO measured in minutes, at minimal cost. The type of resources used for replication can be configured to balance cost, RTO, and RPO requirements. For replication, CEDR uses shared t3.small instances in the staging area. Using EC2 Reserved Instances for replication servers is one method to reduce costs. Each shared replication server can mount up to 15 EBS volumes. In the staging area, CEDR uses either magnetic (<500 GB) or GP2 (>500 GB) EBS volumes to keep storage costs low. CEDR provides the ability to decrease storage costs further by using ST1 (>500 GB) EBS volumes. The use of ST1 EBS volumes can be configured in the replication settings; however, this may impact RPOs and RTOs.
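
Because each shared replication server handles up to 15 EBS volumes, a quick back-of-the-envelope calculation (sketched below with a hypothetical volume count) tells you how many t3.small replication servers the staging area will need, which is useful when sizing Reserved Instance purchases.

```python
import math

# Hypothetical source environment: 120 replicated volumes across all source servers.
replicated_volumes = 120
volumes_per_replication_server = 15  # shared t3.small replication server limit

replication_servers = math.ceil(replicated_volumes / volumes_per_replication_server)
print(f"Estimated shared replication servers in the staging area: {replication_servers}")
# Use this count when evaluating Reserved Instance purchases for the staging area.
```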

In the event of a disaster, CEDR triggers an orchestration engine that launches production instances in the target AWS Region within minutes. The production instances’ configuration is based on the defined blueprint. Using the appropriate configuration of resources is key to cost savings. Right sizing the instance type selected in the blueprint ensures the lowest cost resource that meets the needs of the workload. Selecting the appropriate EBS volume type for root and data volumes has a significant impact on consumption costs. TSO Logic, an AWS company, provides right sizing recommendations that can be imported into the CEDR blueprint. Additionally, third-party discovery tools, including Flexera | RISC Networks CloudScape, Cloudamize, and Device42, export formatted data that can directly set CEDR blueprints en masse (a sketch of bulk blueprint updates via the API follows).
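
The sketch below shows one way such exported recommendations could be applied in bulk, reading machineId,instanceType rows from a CSV and patching each machine's blueprint through the CloudEndure API. The blueprints endpoint and field names are assumptions to verify against the current API reference.

```python
import csv
import requests

API = "https://console.cloudendure.com/api/latest"

def apply_rightsizing(session, project_id, csv_path):
    """Apply recommended instance types (machineId,instanceType rows) to CEDR blueprints."""
    blueprints = session.get(f"{API}/projects/{project_id}/blueprints").json()["items"]
    by_machine = {bp["machineId"]: bp for bp in blueprints}  # assumed field names

    with open(csv_path) as f:
        for row in csv.DictReader(f):
            bp = by_machine.get(row["machineId"])
            if bp:
                session.patch(
                    f"{API}/projects/{project_id}/blueprints/{bp['id']}",
                    json={"instanceType": row["instanceType"]},
                ).raise_for_status()

# session = cloudendure_session("YOUR-CEDR-API-TOKEN")  # helper from the earlier sketch
# apply_rightsizing(session, "YOUR-PROJECT-ID", "rightsizing-recommendations.csv")
```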

Conclusion

In this post, I reviewed best practices and considerations for operating CEDR in AWS. Reviewing and applying the AWS Well-Architected Framework is a key step when deploying CloudEndure Disaster Recovery. It lays the foundation for a consistent and successful disaster recovery strategy. If disaster were to strike, implementing the concepts presented in the Operational Excellence, Security, and Reliability pillars supports a successful recovery. CEDR enables recoverability for your most critical workloads while decreasing the total cost of ownership of your DR strategy.

Please visit the AWS CloudEndure Disaster Recovery page to learn how to get started with CEDR and to review case studies of customers that have leveraged CEDR to shift their recovery site to AWS. Additional best practices can be viewed in the CloudEndure documentation.

Thanks for reading this blog post. If you have any comments or questions, feel free to share them in the comments section.

Alex Berkov

Alex is the manager of the CloudEndure Solutions Architecture team. He joined AWS in early 2019 as part of the CloudEndure acquisition. Alex is focused on helping customers shift and operate their disaster recovery strategy in AWS. A native New Englander, Alex spends his time off with his family on the slopes during the winter and at the beach during the summers.