How Merck Automated AWS Elastic Disaster Recovery Initialization and Monitoring

This post is guest authored by Nasia Ullas of MSD.

Enhancing the resilience and productivity of manufacturing processes is essential for pharmaceutical companies to meet business continuity objectives and innovate continuously. Merck & Co., Inc., a global biopharmaceutical company known as MSD outside of the United States and Canada, mitigated resilience challenges by adopting AWS Elastic Disaster Recovery (AWS DRS). The service minimizes downtime and data loss with fast, reliable recovery of cloud-based applications using affordable storage, minimal compute, and point-in-time recovery. To improve end-user productivity and user experience and to close knowledge gaps, the company set out to automate the AWS DRS initialization steps while addressing specific organizational requirements. In addition, the team needed a robust monitoring and alerting system to address any replication failure scenario.

Prior to this automation, initializing AWS DRS required the end user to have an in-depth understanding of the company’s network architecture. Users also had to choose the appropriate VPC and subnet details for the staging and target environments in both the production and disaster recovery (DR) regions. As an added complexity, Merck has mandatory tagging requirements for Amazon Elastic Compute Cloud (Amazon EC2) instances. These additional steps required manual effort, which impacted the application’s recovery time objective (RTO). To automate them, the company built on another post that demonstrates how to automate the initialization and setup of AWS Application Migration Service and AWS DRS using AWS Service Catalog based automation.

This post demonstrates how Merck automated the AWS DRS initialization steps using AWS CloudFormation and integrated a custom monitoring and alerting solution into a single infrastructure as code (IaC) template. The solution sends email alerts and creates tickets for any replication failure event. Deploying this easy-to-use CloudFormation template improved employee productivity, reduced manual errors, and helped application teams achieve their RTO targets for individual applications.

Overview of solution

This solution automates the initialization of AWS DRS to meet the company’s mandatory requirement of specific subnet and security group selection. The automation uses AWS CloudFormation to meet all the organizational standards and accelerate AWS DRS implementation, eliminating the risk of human error and of mistakes caused by knowledge gaps.

Furthermore, this automation leverages Amazon CloudWatch to create a dashboard and an automated alerting system, which integrates with the company’s internal alerting software. The monitoring mechanism is necessary for application teams to gather insights into their servers’ replication status. The target account hosts the application that uses AWS DRS, and the central account aggregates and processes alerts from target accounts before forwarding them to Merck’s internal ticketing system.
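
As a rough illustration of the alerting described above, the following Python sketch builds one CloudWatch alarm definition per source server for the LagDuration and Backlog metrics that the monitoring solution tracks. It is not Merck’s implementation. The `AWS/DRS` namespace and the `SourceServerID` dimension are assumptions about how AWS DRS publishes its metrics and should be checked against the service documentation. The alarm names, thresholds, and evaluation settings are hypothetical placeholders.

```python
from typing import Any, Dict, List

# Assumption: AWS DRS publishes metrics in this namespace, keyed by this dimension.
DRS_NAMESPACE = "AWS/DRS"
DRS_DIMENSION = "SourceServerID"

# Hypothetical thresholds; tune them to each application's RPO.
HYPOTHETICAL_THRESHOLDS = {
    "LagDuration": 3600.0,             # assumed to be seconds of replication lag
    "Backlog": float(10 * 1024 ** 3),  # assumed to be bytes still waiting to replicate
}


def build_alarm_definitions(source_server_ids: List[str]) -> List[Dict[str, Any]]:
    """Return keyword arguments for CloudWatch PutMetricAlarm, one per server and metric."""
    alarms = []
    for server_id in source_server_ids:
        for metric_name, threshold in HYPOTHETICAL_THRESHOLDS.items():
            alarms.append({
                "AlarmName": f"drs-{metric_name}-{server_id}",  # hypothetical naming scheme
                "Namespace": DRS_NAMESPACE,
                "MetricName": metric_name,
                "Dimensions": [{"Name": DRS_DIMENSION, "Value": server_id}],
                "Statistic": "Maximum",
                "Period": 300,
                "EvaluationPeriods": 3,
                "Threshold": threshold,
                "ComparisonOperator": "GreaterThanThreshold",
                # Treat a silent server as a problem rather than as healthy.
                "TreatMissingData": "breaching",
            })
    return alarms


def create_alarms(cloudwatch_client: Any, alarms: List[Dict[str, Any]]) -> int:
    """Create each alarm with a boto3 CloudWatch client; returns the number created."""
    for alarm in alarms:
        cloudwatch_client.put_metric_alarm(**alarm)
    return len(alarms)
```

In a Lambda function, `create_alarms(boto3.client("cloudwatch"), build_alarm_definitions(server_ids))` would create the alarms. No alarm action is set here because, in the architecture described in this post, alarm state changes are picked up by an EventBridge rule.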

The following diagram illustrates the automation of AWS DRS initialization steps.

The image illustrates how the initialization stack in AWS CloudFormation deploys both the initialization infrastructure and the monitoring infrastructure to the target account.

Figure 1: CloudFormation initialization stack deploying initialization and monitoring infrastructures to the target account.

The automation steps are as follows:

1. CloudFormation creates and triggers an AWS Lambda function as a custom resource.
2. Lambda updates the Amazon EC2 instance role with the three IAM policies (AWSElasticDisasterRecoveryAgentInstallationPolicy, AWSElasticDisasterRecoveryEc2InstancePolicy, AWSElasticDisasterRecoveryRecoveryInstancePolicy) required to fail over and fail back using AWS DRS.
3. Replication servers are instances that AWS DRS creates to replicate data from the source server and store it in the staging area. Lambda updates the default replication settings, including the subnet and security group. These settings ensure that AWS DRS replication servers are deployed in the appropriate staging subnet and are associated with the security groups that replication requires.
4. Lambda then updates the default launch settings, including the subnet and security group. These settings place DR servers in the right subnet with the appropriate security groups, so that the application can function as intended in the DR region. Setting this up beforehand reduces RTO during a disaster.
5. Lambda updates the default AWS DRS Amazon EC2 launch template. This applies the mandatory Merck standard tags to all DR instances when they are launched and ensures that Instance Metadata Service Version 2 (IMDSv2) is enabled on them, which is a company mandate. Missing either of these leads to a failed DR.
6. CloudFormation deploys the monitoring stack to create the monitoring solution.
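
The replication and launch settings from the steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code Merck deployed. The function names, the example IDs, and the idea of passing in a single `launch_template_id` are hypothetical. The keyword arguments follow the boto3 `drs` and `ec2` clients as we understand them and should be verified against the current API reference. Attaching the IAM policies from step 2 is omitted.

```python
from typing import Any, Dict, List


def build_replication_settings(template_id: str, staging_subnet_id: str,
                               security_group_ids: List[str]) -> Dict[str, Any]:
    """Keyword arguments for the DRS UpdateReplicationConfigurationTemplate call."""
    return {
        "replicationConfigurationTemplateID": template_id,
        "stagingAreaSubnetId": staging_subnet_id,
        "replicationServersSecurityGroupsIDs": list(security_group_ids),
        # Use only the listed groups instead of the DRS default security group.
        "associateDefaultSecurityGroup": False,
    }


def build_launch_template_data(target_subnet_id: str, security_group_ids: List[str],
                               mandatory_tags: Dict[str, str]) -> Dict[str, Any]:
    """LaunchTemplateData for a new EC2 launch template version."""
    tags = [{"Key": key, "Value": value} for key, value in sorted(mandatory_tags.items())]
    return {
        "NetworkInterfaces": [{
            "DeviceIndex": 0,
            "SubnetId": target_subnet_id,
            "Groups": list(security_group_ids),
        }],
        "TagSpecifications": [{"ResourceType": "instance", "Tags": tags}],
        # HttpTokens="required" enforces IMDSv2 on the launched instance.
        "MetadataOptions": {"HttpEndpoint": "enabled", "HttpTokens": "required"},
    }


def apply_settings(drs_client: Any, ec2_client: Any, launch_template_id: str,
                   replication_settings: Dict[str, Any],
                   launch_template_data: Dict[str, Any]) -> int:
    """Apply both payloads and make the new launch template version the default."""
    drs_client.update_replication_configuration_template(**replication_settings)

    # Base the new version on the latest existing one so other settings carry over.
    described = ec2_client.describe_launch_templates(LaunchTemplateIds=[launch_template_id])
    latest = described["LaunchTemplates"][0]["LatestVersionNumber"]
    created = ec2_client.create_launch_template_version(
        LaunchTemplateId=launch_template_id,
        SourceVersion=str(latest),
        LaunchTemplateData=launch_template_data,
    )
    new_version = created["LaunchTemplateVersion"]["VersionNumber"]
    ec2_client.modify_launch_template(
        LaunchTemplateId=launch_template_id, DefaultVersion=str(new_version)
    )
    return new_version
```

Keeping the payload builders separate from the API calls makes the organizational standards (subnets, security groups, tags, IMDSv2) easy to review and unit test without AWS access.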

The monitoring stack deployed in step 6 creates the infrastructure shown in the target accounts in the following diagram.

The image illustrates how the AWS DRS monitoring infrastructure monitors multiple AWS accounts for DR-related events, and forwards the events to a central monitoring account.

Figure 2: Multi-account AWS DRS monitoring infrastructure forwarding DR events to central monitoring account.

Monitoring Solution

1. AWS CloudFormation creates and triggers a Lambda function as a custom resource. The Lambda function sends a custom event to the central account’s Amazon EventBridge event bus, notifying the central account that the target account has deployed the monitoring stack. Additionally, the function scans the DR region for any AWS DRS source servers, then creates CloudWatch alarms to monitor the LagDuration and Backlog metrics for each of the source servers. Last, the Lambda function updates a CloudWatch dashboard with the metrics of all AWS DRS source servers in the target account’s DR region.
  
  Figure 3: CloudWatch dashboard showing LagDuration and Backlog metrics for AWS DRS source servers in DR region.
2. When a source server is created or destroyed, Amazon EventBridge triggers the Lambda function to update the account’s Amazon CloudWatch alarms and dashboard accordingly.
3. When AWS DRS publishes adverse events (such as the failure of a DR failover process) or an Amazon CloudWatch alarm is triggered, an EventBridge rule receives the event and triggers a Lambda function. The function processes the event, queries AWS DRS and Amazon EC2 for more information about the server that triggered the event, and sends an alert message to MSD’s DR team via Amazon Simple Notification Service (Amazon SNS). The EventBridge rule also forwards the event to the central account for further processing.
4. The central account’s EventBridge event bus receives events from all target accounts and triggers the central account’s event processing function. If the event came from the deployment of a target account’s monitoring template, the function stores the target account’s account ID and region in an Amazon DynamoDB table. If the event is related to an AWS DRS alert, the function sends that data via API to MSD’s internal monitoring tool.
5. Every 24 hours, the central account’s Lambda function assumes a cross-account IAM role in each target account and records the status of each AWS DRS source server. The function then creates a CSV report (Figure 4) showing the status of each source server and stores the report in Amazon Simple Storage Service (Amazon S3). Finally, the function sends an email to MSD’s DR team via Amazon SNS, notifying them that the daily report is ready. This daily report gives the DR team a comprehensive view of their disaster recovery readiness across all accounts.
  
  Figure 4: CSV file representing the daily report for the AWS DRS monitoring system.
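
The central account’s event handling and daily report can be sketched in Python as follows. This is an illustration under stated assumptions rather than the production code. The registration `detail-type` string, the function names, and the report columns are hypothetical, and the sketch assumes that the standard EventBridge `account` and `region` fields identify the target account that sent the event.

```python
import csv
import io
from typing import Any, Dict, List

# Hypothetical detail-type that a target account's monitoring stack sends on deployment.
REGISTRATION_DETAIL_TYPE = "DRS Monitoring Stack Deployed"

# Hypothetical report columns; the real report layout is not specified in this post.
REPORT_COLUMNS = ["account_id", "region", "source_server_id", "replication_state"]


def route_event(event: Dict[str, Any]) -> str:
    """Decide how the central function should handle an incoming EventBridge event."""
    if event.get("detail-type") == REGISTRATION_DETAIL_TYPE:
        return "register_account"  # store the account ID and region
    return "forward_alert"         # send the alert to the internal monitoring tool


def registration_item(event: Dict[str, Any]) -> Dict[str, str]:
    """Build the record to store for a newly registered target account."""
    return {"account_id": event["account"], "region": event["region"]}


def build_daily_report(servers: List[Dict[str, str]]) -> str:
    """Render one CSV row per source server, sorted by account and server ID."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=REPORT_COLUMNS, lineterminator="\n")
    writer.writeheader()
    ordered = sorted(servers, key=lambda s: (s["account_id"], s["source_server_id"]))
    for server in ordered:
        # Missing fields become empty cells instead of raising an error.
        writer.writerow({column: server.get(column, "") for column in REPORT_COLUMNS})
    return buffer.getvalue()
```

The string returned by `build_daily_report` could be uploaded to Amazon S3 as the report object, with the per-server status gathered from each target account after assuming the cross-account role.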

Conclusion

In this post, we showed how automating the AWS DRS initialization steps using AWS CloudFormation improved productivity and user experience. The automation streamlined the process and eliminated manual work and the associated risk of human error, which saved approximately 2 hours of debugging time and ensured that the application’s recovery time objective was met during the DR test. Furthermore, in the six months after implementing this automation, all applications using AWS DRS have consistently met their defined recovery time objectives (RTOs) and recovery point objectives (RPOs). Additionally, we outlined how the monitoring and alerting system helps Merck maintain a robust resiliency posture.

About the Authors:

Nasia Ullas

Nasia is an Associate Director in Cloud and Infrastructure Technologies with MSD and is responsible for Resilience and Disaster Recovery for MSD. She believes in continuous improvement and has successfully led the company in deploying multiple innovative solutions that have accelerated the resiliency of MSD’s enterprise applications. When not working, she enjoys working out, journaling, and spending time with her daughters and husband.

Sushovan Basak

Sushovan is a Senior Technical Account Manager at AWS, helping enterprise customers with their cloud adoption and modernization journey. He is passionate about utilizing his analytical, coding, and automation skills to tackle any problem that comes his way. Outside of work, he enjoys watching sci-fi movies, playing video games, and jamming with friends.

Stephen McCullough

Stephen is a Cloud Infrastructure Architect at AWS Professional Services. He is highly motivated to solve his customers’ problems using the broad toolset offered by AWS. Outside of the office, Stephen spends his time playing sand volleyball, playing pickleball, and reading.

Rohit Jagetia

Rohit is a Senior Solutions Architect at AWS, supporting healthcare and life sciences customers. With two decades of experience leading and managing infrastructure initiatives, he specializes in guiding customers through migrating and modernizing workloads to the cloud. This allows them to refocus efforts on innovation instead of infrastructure management. Outside of work, Rohit enjoys playing sports such as tennis, cricket, and racquetball. He also continues developing his expertise by studying for various certification exams.