AWS Storage Blog

Enhanced disaster recovery monitoring with CloudEndure and Amazon CloudWatch

Update (March 4, 2022): Updated the Amazon CloudWatch events section to use Amazon EventBridge rules instead, allowing you to further customize your serverless event architecture. AWS is also deprecating the vendored requests module in botocore to improve flexibility and performance, so this walkthrough now uses an AWS Lambda Layer to continue using that module. The code has also been slightly reformatted.


Monitoring and troubleshooting disaster recovery (DR) is a critical component of any DR strategy. The ability to receive alerts and status updates is imperative when it comes to limiting downtime and maintaining business as usual. Often, DR tools do not have deep monitoring and notification built in, so administrators must log in to the tool itself to see whether replication is falling into a lag or backlog state.

CloudEndure Disaster Recovery is a block-level replication DR solution that helps accelerate and automate DR failover. Amazon CloudWatch monitors AWS resources as they are being consumed within the account. In this blog, we show how to use existing CloudEndure APIs, Amazon CloudWatch, and Amazon EventBridge to build a customizable and detailed dashboard for CloudEndure Disaster Recovery. You’ll learn how to get a deeper view of your DR health, and receive notifications about critical updates using AWS Lambda.

Overview of solution

The architecture enabling the CloudEndure Disaster Recovery health dashboard is represented in the following diagram:

CloudEndure DR architecture unchanged from standard implementation

The CloudEndure Disaster Recovery architecture itself is unchanged from the standard implementation; only the additions of AWS Lambda, Amazon CloudWatch, and Amazon EventBridge are required.

This example shows how to schedule a Lambda function to run every five minutes, which queries the CloudEndure API using your API credentials. This populates an Amazon CloudWatch metric. From here, you can use the full feature set of AWS and related services to create dashboards, implement alerts, or drive automation.

Walk-through

The following steps show how to store CloudEndure credentials in AWS Secrets Manager, create the necessary IAM roles and permissions for the Lambda function, and implement the Lambda function that pushes metrics to CloudWatch. The function is then scheduled to run on a regular basis to build a CloudEndure DR health dashboard.

Prerequisites

For this walk-through, you should have the following:

  • An AWS account
  • A CloudEndure Disaster Recovery account with agents installed and in a healthy state, as shown here
  • Access to the following AWS services: AWS Secrets Manager, IAM, AWS Lambda, Amazon CloudWatch, and Amazon EventBridge

Securely store your CloudEndure account credentials in AWS Secrets Manager

  1. Open the AWS Secrets Manager console.
  2. On either the service introduction page or the Secrets list page, choose Store a new secret. On the Store a new secret page, choose Other type of secret. You choose this because your secret doesn’t apply to a database.
  3. Under Specify the key/value pairs to be stored in the secret, create a key value pair with a key of userAPIToken. As a value, enter your CloudEndure API Token.
  4. For Select the encryption key, choose DefaultEncryptionKey. Secrets Manager always encrypts the secret when you select this option and provides it at no charge to you. If you choose to use a custom KMS key, then AWS charges you at the standard AWS KMS rate.
  5. Secrets Manager uses a unique encryption key that resides within the account and can only be used with Secrets Manager in the same Region. Choose Next.
  6. Under Secret name, type a name for the secret in the text field. Use only alphanumeric characters and the characters /_+=.@-. For example, you can use a secret name, such as cloud_endure_credentials.
  7. In the Description field, type a description of the secret, for example: CloudEndure <Account_Name> API credentials, generated mm-dd-yyyy.
  8. In the Tags section, add desired tags in the Key and Value – optional text fields. Choose Next.

You can leave tags blank. However, we recommend using tags as a best practice to help identify secrets.

For the purposes of this tutorial, you can leave automatic rotation disabled. However, under Configure automatic rotation you can enable Automatic rotation; it is good practice to rotate credentials, and instructions for implementing credential rotation are available in the Rotating Your AWS Secrets Manager Secrets guide.
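The Lambda function you create later retrieves this secret with the AWS SDK and turns it into the CloudEndure login payload. As a minimal sketch (assuming the userAPIToken key name used above), the parsing step can be isolated and checked without any AWS access:

```python
import json

def build_login_payload(get_secret_value_response):
    # Parse the SecretString returned by Secrets Manager's GetSecretValue
    # call and build the body expected by the CloudEndure login API.
    secret = json.loads(get_secret_value_response['SecretString'])
    return {"userApiToken": secret["userAPIToken"]}

# Example with a dict shaped like boto3's get_secret_value output:
payload = build_login_payload({'SecretString': '{"userAPIToken": "example-token"}'})
```

If the Lambda function later fails with a KeyError here, the secret was most likely stored with a different key name than userAPIToken.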

Create IAM prerequisites

  1. Create the IAM CloudWatch policy for your Lambda function:
    • Navigate to the IAM console, select Policies, then Create Policy.
    • Select the JSON tab and enter the following policy document:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "cloudwatch:PutMetricData"
            ],
            "Resource": "*"
        }
    ]
}
    • Select Next: Review.
    • Name the policy CEDRMetricPolicy, then select Create Policy.
  2. Create the IAM AWS Secrets Manager policy for your Lambda function:
    • Navigate to the IAM console and choose Create Policy.
    • Select the JSON tab and enter the following policy document:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetResourcePolicy",
                "secretsmanager:GetSecretValue",
                "secretsmanager:DescribeSecret",
                "secretsmanager:ListSecretVersionIds",
                "secretsmanager:ListSecrets"
            ],
            "Resource": "*"
        }
    ]
}
    • Note that referencing the ARN of the secret created in the previous section (cloud_endure_credentials) instead of the wildcard would restrict this policy to only that secret.
    • Select Next: Tags.
      • Add any tags to use for identification.
    • Select Next: Review.
    • Name the policy CEDRSecretsPolicy, then choose Create policy.
  3. Navigate back to the IAM console and select Roles, then select Create role:
    • Select AWS Service, and under Common use cases select Lambda.
    • Select Next: Permissions.
    • Select the policies we created in the previous step:
      1. CEDRMetricPolicy
      2. CEDRSecretsPolicy
    • Select Next: Tags.
      • Add appropriate tags.
    • Select Next: Review.
    • Enter role name: “ce-dr-monitoring-role.”
    • Select Create role.
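As noted earlier, you can tighten the Secrets Manager policy to the single secret by replacing the wildcard Resource with the secret’s ARN. A sketch of such a statement follows; the Region and account ID are placeholders, and the trailing -* is needed because Secrets Manager appends a random suffix to secret ARNs. The actions shown cover what the function actually calls (GetSecretValue), plus DescribeSecret for troubleshooting:

```json
{
    "Effect": "Allow",
    "Action": [
        "secretsmanager:GetSecretValue",
        "secretsmanager:DescribeSecret"
    ],
    "Resource": "arn:aws:secretsmanager:us-east-1:111122223333:secret:cloud_endure_credentials-*"
}
```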

Create AWS Lambda function

  1. Go to the AWS Lambda console.
  2. If you are new to Lambda, you see a welcome page. Choose Get Started Now. Otherwise, choose Create function.
  3. Select the Author from scratch option to create your Lambda function.
    • Enter the function name, CE-DR-Monitoring-Function.
    • Runtime: choose Python 3.6.
    • Select Create function.
  4. Select the Layers option.
    • Click Add a layer.
    • Choose Specify an ARN.
    • Enter the layer ARN (available here).
    • Select Verify.
    • Choose Add.
  5. Open the Configuration tab.
    • Select Edit.
    • Under Basic settings, change Timeout to 3 minutes.
    • Choose Save.
  6. Enter the following code into the Code source pane:
    import datetime
    import json
    
    import boto3
    import botocore.exceptions
    from botocore.vendored import requests    # provided by the Lambda Layer added earlier
    
    HOST = 'https://console.cloudendure.com'
    headers = {'Content-Type': 'application/json'}
    
    # Create the Secrets Manager Client
    client = boto3.client('secretsmanager')
    
    def lambda_handler(event, context):
        session = {}
        endpoint = '/api/latest/{}'
    
        try:   
            get_secret_value_response = client.get_secret_value(
                SecretId="cloud_endure_credentials"
            )
        except botocore.exceptions.ClientError as e:
            raise e
        else:
            secret = json.loads(get_secret_value_response['SecretString'])
        
        
        login_data = {
            "userApiToken": secret["userAPIToken"]
        }
    
        cloudwatch = boto3.client('cloudwatch')
        
        r = requests.post(HOST + endpoint.format('login'), data = json.dumps(login_data), headers = headers)
        print(r)
        if r.status_code != 200 and r.status_code != 307:
            return {
            'statusCode': 200,
            'body': json.dumps('Bad login credentials')
            }
        
        # check if need to use a different API entry point
        if r.history:
            endpoint = '/' + '/'.join(r.url.split('/')[3:-1]) + '/{}'
            r = requests.post(HOST + endpoint.format('login'), data = json.dumps(login_data), headers = headers)
        
        session = {'session': r.cookies['session']}
        
        headers['X-XSRF-TOKEN'] = r.cookies['XSRF-TOKEN']
        
      
        r = requests.get(HOST + endpoint.format('projects'), headers = headers, cookies = session)
        if r.status_code != 200:
            return {
            'statusCode': 200,
            'body': json.dumps('Failed to fetch the project')
            }
      
        try:
            projects = json.loads(r.content)['items']
            for project in projects:
                if project['type'] == "DR":     # This ensures we skip migration projects that may be in the account.
                    project_id = project['id']
                    r = requests.get(HOST + endpoint.format('projects/{}/machines').format(project_id), headers = headers, cookies = session)
                    if r.status_code != 200:
                        return {
                            'statusCode': 200,
                            'body': json.dumps('Failed to fetch the machines')
                        }
    
                    machines = json.loads(r.content)['items']
                    if not machines:            # Skip projects with no protected machines
                        continue
    
                    for machine in machines:
                        backlog = 0
    
                        if 'name' not in machine['sourceProperties']:   # If the machine name is missing, then skip it
                            continue
                    
                        if 'backloggedStorageBytes' in machine['replicationInfo']:      # Check for replication backlog
                            backlog = machine['replicationInfo']['backloggedStorageBytes']
                        else:
                            backlog = 0
                        
                        if 'lastConsistencyDateTime' in machine['replicationInfo']:      # confirm element is present
                            last_consistent = machine['replicationInfo']['lastConsistencyDateTime']     #store last consistent backup time
                            last_consistent_dt = datetime.datetime.strptime(last_consistent[:19], '%Y-%m-%dT%H:%M:%S')  #format time for date-time processing
                            diff = datetime.datetime.utcnow()-last_consistent_dt    # Store the age of the last consistent backup - i.e. what is our current Recovery Point Objective actual 
                            lag = diff.total_seconds()/60
                        else:
                            lag = 0                                     # Server has not completed its initial sync
    
                        response1 = cloudwatch.put_metric_data(
                            MetricData = [
                                {
                                    'MetricName': 'MachineData',
                                    'Dimensions': [                                         # Dimensions provide meta data for sorting / organizing the information. 
                                        {
                                            'Name': 'PROJECT_NAME',                         # CE Project Name (note this will be the project name as per the CE Console)
                                            'Value': project['name']
                                        },
                                        {
                                            'Name': 'MACHINE_NAME',                         # Machine identifier as per the CE console
                                            'Value': machine['sourceProperties']['name']
                                        },
                                    ],
                                    'Unit': 'None',
                                    'Value': lag
                                },
                            ],
                            Namespace = 'CE-Replication-Lag'        # This is how you find the stored measures in the CloudWatch console.
                        )
                        #print response1    # uncomment this to troubleshoot put metric response for lag
                    
                        response2 = cloudwatch.put_metric_data(
                            MetricData = [
                            {
                                'MetricName': 'MachineData',
                                'Dimensions': [
                                    {
                                        'Name': 'PROJECT_NAME',
                                        'Value': project['name']
                                    },
                                    {
                                        'Name': 'MACHINE_NAME',
                                        'Value': machine['sourceProperties']['name']
                                    },
                                ],
                                'Unit': 'None',
                                'Value': backlog
                            },
                            ],
                            Namespace = 'CE-Replication-Backlog'        # This is how you will find the stored measures in the CloudWatch console.
                        )
                        #print response2    # uncomment this to troubleshoot put metric response for backlog in bytes
                else:
                    continue
    
        except Exception:   # No DR projects, or an unexpected response shape
            return {
                'statusCode': 200,
                'body': json.dumps('No associated projects')
            }
    
        return {
            'statusCode': 200,
        }
  7. Select Deploy.
  8. Select Test.

If your function reports “bad login credentials,” verify that you entered your API Token as it appears on the CloudEndure portal, but without the dashes.
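Login failures can also stem from the API entry point: the CloudEndure login may answer with an HTTP 307 redirect, which is why the function rebuilds its endpoint template from the redirect URL. That rewrite is pure string handling and can be checked in isolation:

```python
def rewrite_endpoint(redirected_url):
    # Rebuild the endpoint template from a redirect URL's path,
    # mirroring the redirect handling in the Lambda function above.
    return '/' + '/'.join(redirected_url.split('/')[3:-1]) + '/{}'

# A console URL such as https://console.cloudendure.com/api/latest/login
# yields the template '/api/latest/{}'.
```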

If your function is completing with the message “No associated projects,” confirm that your API token is associated with an account that has CloudEndure projects created.
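The core measure the function publishes is replication lag: the age, in minutes, of the last consistent point in time. A standalone version of that calculation (using the same lastConsistencyDateTime field the CloudEndure API returns) is handy for sanity-checking the numbers on your dashboard:

```python
import datetime

def replication_lag_minutes(replication_info, now=None):
    # Age of the last consistent recovery point in minutes; 0 means
    # the machine has not yet completed its initial sync.
    if 'lastConsistencyDateTime' not in replication_info:
        return 0
    last_consistent = replication_info['lastConsistencyDateTime']
    last_consistent_dt = datetime.datetime.strptime(last_consistent[:19], '%Y-%m-%dT%H:%M:%S')
    now = now or datetime.datetime.utcnow()
    return (now - last_consistent_dt).total_seconds() / 60
```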

If your function completed without error, you’re ready to establish the repeatable schedule for the function. This populates your CloudWatch metrics.

Create a rule using Amazon EventBridge

  1. Search for Amazon EventBridge in the services pane.
  2. Choose Create rule.
  3. Provide a name for the rule (we use CE-DR_Event-Rule).
  4. Under Define pattern, select Schedule.
    1. Choose Fixed rate every and set a rate of 5 minutes or less.
  5. Under Select targets, choose Lambda function from the drop down.
    1. Under Function choose CE-DR-Monitoring-Function.
  6. Choose Create.

Create your dashboard

  1. Navigate to the CloudWatch console.
  2. Select Create dashboard and provide a name for your new dashboard (for example, CloudEndure DR health).
  3. In the chart selection window, select a graph type (we recommend line to illustrate a continuous trend).
    1. Choose Metrics, then browse to the CE-Replication-Lag and CE-Replication-Backlog custom namespaces that the Lambda function populates, and select the machines you want to display.
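If you prefer to define the dashboard as code, a CloudWatch dashboard body referencing the custom namespaces published by the function looks roughly like the following sketch. The project and machine names are placeholders; replace them with values from your CloudEndure console:

```json
{
    "widgets": [
        {
            "type": "metric",
            "width": 12,
            "height": 6,
            "properties": {
                "view": "timeSeries",
                "title": "CloudEndure replication lag (minutes)",
                "region": "us-east-1",
                "metrics": [
                    ["CE-Replication-Lag", "MachineData",
                     "PROJECT_NAME", "My-DR-Project",
                     "MACHINE_NAME", "web-server-01"]
                ]
            }
        }
    ]
}
```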

Create your alerts

You can create alerts for your metrics using this guide.

Cleaning up

To avoid incurring future charges, disable or delete the EventBridge rule, the Lambda function, and any supporting resources or alerting you’ve implemented.

Conclusion

CloudEndure Disaster Recovery has a long history of protecting workloads from disasters, but monitoring and providing notification of replication interruptions has proven difficult in the past. CloudEndure’s APIs provide a powerful way to capture detailed instrumentation of the resources under CloudEndure protection. This allows you to draw insights, inform operations, and drive proactive alerting and automated action using CloudWatch and other AWS services. With these in place, administrators can be sure they are continuously up to date on the health of their CloudEndure Disaster Recovery deployment.

Thanks for reading this blog post on disaster recovery monitoring with CloudEndure and Amazon CloudWatch. If you have any comments or questions, don’t hesitate to leave them in the comments section.

Daniel Covey

Daniel Covey is a Solutions Architect with AWS who has spent the last 8 years helping customers protect their workloads during a Disaster. He has worked with CloudEndure before and after the acquisition by AWS, and continues to offer guidance to customers who want to ensure their data is safe from ransomware and disasters.

Dan Pavatich

Dan Pavatich is a Senior Principal and certified AWS Solution Architect working with Slalom in Los Angeles. Dan has been working on technology transformation programs for over a decade, and when he’s not sitting in front of a computer he’s doing stand-up comedy.

Kyle Banks

Kyle Banks is a Solution Architect working with Slalom in Los Angeles, focused on serverless architectures and application modernization. He has been involved in full stack development for 10 years. Kyle is an avid University of Michigan sports fan, and has recently made the switch to PC gaming.