AWS Cloud Operations Blog
How Merck Automated AWS Elastic Disaster Recovery Initialization and Monitoring
This post is guest authored by Nasia Ullas of MSD.
Enhancing the resilience and productivity of manufacturing processes is essential for pharmaceutical companies to meet business continuity objectives and innovate continuously. Merck & Co., Inc., a global bio-pharmaceutical company known as MSD outside of the United States and Canada, mitigated resilience challenges by adopting AWS Elastic Disaster Recovery (AWS DRS). The solution minimizes downtime and data loss with fast, reliable recovery of cloud-based applications using affordable storage, minimal compute, and point-in-time recovery. To improve end-user productivity and user experience, and to close the knowledge gap, the company set out to automate the AWS DRS initialization steps while addressing specific organizational requirements. In addition, the team needed a robust monitoring and alerting system to address any replication failure scenario.
Prior to this automation, initializing AWS DRS required the end user to have an in-depth understanding of the company’s network architecture. Users also had to choose the appropriate VPC and subnet details for the staging and target environments in both the production and disaster recovery (DR) Regions. As an added complexity, Merck has mandatory tagging requirements for Amazon Elastic Compute Cloud (Amazon EC2) instances. These additional steps required manual effort, which affected the application’s recovery time objective (RTO). To automate them, the company built on another post that demonstrates how to automate the initialization and setup of AWS Application Migration Service and AWS DRS using AWS Service Catalog based automation.
This post demonstrates how the company combined the automated AWS DRS initialization steps, built with AWS CloudFormation, and a custom monitoring and alerting solution in a single infrastructure as code (IaC) template. The solution sends email alerts and creates tickets for any replication failure event. Deploying this easy-to-use CloudFormation template improved employee productivity, reduced manual errors, and helped application teams achieve their RTO targets for individual applications.
Overview of solution
This solution automates the initialization of AWS DRS to meet the company’s mandatory requirement of specific subnet and security group selection. The automation uses AWS CloudFormation to meet all the organizational standards for an accelerated AWS DRS implementation, eliminating the risk of errors caused by manual steps or knowledge gaps.
Furthermore, this automation leverages Amazon CloudWatch to create a dashboard and an automated alerting system, which integrates with the company’s internal alerting software. The monitoring mechanism is necessary for application teams to gather insights into their servers’ replication status. The target account hosts the application that uses AWS DRS, and the central account aggregates and processes alerts from target accounts before forwarding them to Merck’s internal ticketing system.
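To make the split between target and central accounts concrete, the following is a minimal Python sketch of how a central event-processing function might route incoming Amazon EventBridge events. It is not the company's implementation: the `REGISTRATION_DETAIL_TYPE` value, the `route_event` name, the in-memory `registry` (standing in for persistent storage), and the `forward_alert` callback (standing in for the internal ticketing integration) are all assumptions made for illustration.

```python
# Hypothetical detail-type for the "monitoring stack deployed" custom event.
# The post does not name the real value.
REGISTRATION_DETAIL_TYPE = "DRS Monitoring Stack Deployed"

# Event sources treated as alerts in this sketch (assumption).
ALERT_SOURCES = {"aws.drs", "aws.cloudwatch"}


def route_event(event, registry, forward_alert):
    """Route one EventBridge event received on the central event bus.

    registry      -- a set of (account_id, region) tuples; stands in for a table
    forward_alert -- a callable that sends the event to the ticketing system
    """
    if event.get("detail-type") == REGISTRATION_DETAIL_TYPE:
        # A target account announced that its monitoring stack is deployed.
        registry.add((event["account"], event["region"]))
        return "registered"
    if event.get("source") in ALERT_SOURCES:
        # An AWS DRS event or a CloudWatch alarm state change.
        forward_alert(event)
        return "forwarded"
    return "ignored"
```

Keeping the routing decision in a plain function like this makes it easy to unit test without any AWS resources; the Lambda handler would only need to wire in the real storage and ticketing clients.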
The following diagram illustrates the automation of AWS DRS initialization steps.
The automation steps are below:
- CloudFormation creates and triggers an AWS Lambda function as a custom resource.
- Lambda updates the Amazon EC2 instance role with three IAM policies (AWSElasticDisasterRecoveryAgentInstallationPolicy, AWSElasticDisasterRecoveryEc2InstancePolicy, AWSElasticDisasterRecoveryRecoveryInstancePolicy) required to fail over and fail back using AWS DRS.
- Replication servers are instances created by AWS DRS to replicate data from the source server and store it in the staging area. Lambda updates the default replication settings, including the subnet and security group. These settings ensure that AWS DRS replication servers are deployed in the appropriate staging subnet and are associated with the security groups that replication requires.
- Lambda then updates the default launch settings, including the subnet and security group. These settings ensure that DR servers are deployed within the right subnet with the appropriate security groups, so that the application can function as intended in the DR Region. Setting this up beforehand reduces RTO during a disaster.
- Lambda updates the default AWS DRS Amazon EC2 launch template. This applies the mandatory Merck standard tags to all DR instances when they are launched. It also ensures that Instance Metadata Service Version 2 (IMDSv2) is enabled on DR instances, which is a company mandate. Missing either of these leads to a failed DR.
- CloudFormation deploys the monitoring stack to create the monitoring solution.
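The replication and launch template updates in the steps above can be sketched as request payloads. The following Python example is illustrative only and assumes the Lambda function uses boto3: the function names, tag keys and values, and resource IDs are hypothetical placeholders rather than Merck's actual standards. The parameter names follow the AWS DRS `UpdateReplicationConfigurationTemplate` and Amazon EC2 `CreateLaunchTemplateVersion` APIs as we understand them, so verify them against the current API reference before use.

```python
# Hypothetical mandatory tags; the real organizational tag set is not
# given in this post.
REQUIRED_TAGS = {"Application": "example-app", "CostCenter": "0000"}


def build_replication_settings(template_id, staging_subnet_id, security_group_ids):
    """Keyword arguments for drs.update_replication_configuration_template."""
    return {
        "replicationConfigurationTemplateID": template_id,
        "stagingAreaSubnetId": staging_subnet_id,
        "replicationServersSecurityGroupsIDs": list(security_group_ids),
    }


def build_launch_template_data(target_subnet_id, security_group_ids, tags):
    """LaunchTemplateData for ec2.create_launch_template_version."""
    tag_list = [{"Key": key, "Value": value} for key, value in sorted(tags.items())]
    return {
        # Place DR instances in the chosen subnet and security groups.
        "NetworkInterfaces": [
            {
                "DeviceIndex": 0,
                "SubnetId": target_subnet_id,
                "Groups": list(security_group_ids),
            }
        ],
        # HttpTokens="required" enforces IMDSv2 on launched instances.
        "MetadataOptions": {"HttpEndpoint": "enabled", "HttpTokens": "required"},
        # Apply the mandatory tags to every launched instance.
        "TagSpecifications": [{"ResourceType": "instance", "Tags": tag_list}],
    }


# In a Lambda handler, the payloads would be passed to boto3, for example:
#   boto3.client("drs").update_replication_configuration_template(**settings)
#   boto3.client("ec2").create_launch_template_version(
#       LaunchTemplateId=template_id, LaunchTemplateData=data)
```

Building the payloads separately from the API calls keeps the organizational rules (subnet, security groups, tags, IMDSv2) in one reviewable place.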
The monitoring StackSets deployment in step 6 creates the infrastructure shown in the target accounts in the following diagram.
Monitoring Solution
- AWS CloudFormation creates and triggers a Lambda function as a custom resource. The Lambda function sends a custom event to the central account’s Amazon EventBridge event bus, notifying the central account that the target account has deployed the monitoring stack. Additionally, the function scans the DR Region for any AWS DRS source servers, then creates CloudWatch alarms to monitor the LagDuration and Backlog metrics for each of the source servers. Finally, the Lambda function updates a CloudWatch dashboard with the metrics of all AWS DRS source servers in the target account’s DR Region.
- When a source server is created or destroyed, Amazon EventBridge triggers the Lambda function to update the account’s Amazon CloudWatch alarms and dashboard accordingly.
- When AWS DRS publishes adverse events (such as the failure of a DR failover process) or an Amazon CloudWatch alarm is triggered, an EventBridge rule receives the event and triggers a Lambda function. The function processes the event, queries AWS DRS and Amazon EC2 for more information about the server that triggered the event, and sends an alert message to MSD’s DR team via Amazon Simple Notification Service (Amazon SNS). The EventBridge rule also forwards the event to the central account for further processing.
- The central account’s EventBridge bus receives events from all target accounts and triggers the central account’s event processing function to handle them. If the event originated from the deployment of a target account’s monitoring template, the function stores the target account’s account ID and Region in an Amazon DynamoDB table. If the event is related to an AWS DRS alert, the function sends that data via API to MSD’s internal monitoring tool.
- Every 24 hours, a Lambda function in the central account assumes a cross-account IAM role in all target accounts and records the status of each AWS DRS source server. The function then creates a CSV report (Figure 4) showing the status of each source server and stores the report in Amazon S3. Finally, the function sends an email to MSD’s DR team via Amazon SNS, notifying them that the daily report is ready. This daily report gives the DR team a comprehensive view of their disaster recovery readiness across all accounts.
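Two of the monitoring steps above lend themselves to a short sketch: creating per-server alarms on the LagDuration and Backlog metrics, and rendering the daily CSV report. The Python below is a minimal illustration under stated assumptions, not the deployed code. It assumes the metrics are published in the `AWS/DRS` CloudWatch namespace with a `SourceServerID` dimension, and the thresholds, alarm names, and report columns are hypothetical.

```python
import csv
import io

# Illustrative thresholds only; appropriate values depend on each
# application's RPO.
ALARM_THRESHOLDS = {"LagDuration": 3600, "Backlog": 10 * 1024**3}

# Hypothetical report columns; the real report layout is not given here.
REPORT_COLUMNS = ["account_id", "region", "source_server_id", "replication_state"]


def build_alarm_definitions(source_server_ids):
    """Keyword arguments for cloudwatch.put_metric_alarm, one per server and metric."""
    alarms = []
    for server_id in source_server_ids:
        for metric, threshold in ALARM_THRESHOLDS.items():
            alarms.append(
                {
                    "AlarmName": f"drs-{server_id}-{metric}",
                    "Namespace": "AWS/DRS",  # assumed namespace
                    "MetricName": metric,
                    "Dimensions": [{"Name": "SourceServerID", "Value": server_id}],
                    "Statistic": "Maximum",
                    "Period": 300,
                    "EvaluationPeriods": 3,
                    "Threshold": threshold,
                    "ComparisonOperator": "GreaterThanThreshold",
                }
            )
    return alarms


def render_daily_report(rows):
    """Render source server status rows (dicts) as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=REPORT_COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

In a Lambda function, each alarm definition would be passed to `boto3.client("cloudwatch").put_metric_alarm(**definition)`, and the report text would be uploaded to Amazon S3 before the notification is published.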
Conclusion
In this post, we showed how automating the AWS DRS initialization steps using AWS CloudFormation improved productivity and user experience. The automation streamlined the process and removed the need for manual work and its associated risk of human error, saving approximately 2 hours of debugging time and ensuring that the application’s recovery time objective was met during the DR test. Furthermore, in the six months since this automation was implemented, all applications using AWS DRS have consistently met their defined recovery time objectives (RTOs) and recovery point objectives (RPOs). Additionally, we outlined how the monitoring and alerting system helps Merck maintain a robust resiliency posture.
About the Authors: