Enhanced disaster recovery monitoring with CloudEndure and Amazon CloudWatch

Update (March 4, 2022): The Amazon CloudWatch Events section now uses Amazon EventBridge rules instead, which lets you further customize your serverless event architecture. AWS is also deprecating the requests module in botocore to improve flexibility and performance, so the walk-through now adds a layer in AWS Lambda that lets the function continue using this module. The code has also been slightly reformatted.

Monitoring and troubleshooting Disaster Recovery (DR) is a critical component of any DR strategy. The ability to receive alerts and status updates is imperative when it comes to limiting downtime and maintaining business as usual. Often, DR tools do not have deep monitoring and notification built in. Administrators need to be logged in to the tool to see if it’s falling into a lag or backlog state.

CloudEndure Disaster Recovery is a block-level replication DR solution that helps accelerate and automate DR failover. Amazon CloudWatch monitors AWS resources as they are being consumed within the account. In this blog post, we show how to use existing CloudEndure APIs, Amazon CloudWatch, and Amazon EventBridge to build a customizable and detailed dashboard for CloudEndure Disaster Recovery. You’ll learn how to get a deeper view of your DR health and receive notifications about critical updates using AWS Lambda.

Overview of solution

The architecture enabling the CloudEndure Disaster Recovery health dashboard is represented in the following diagram:

CloudEndure DR architecture unchanged from standard implementation

The CloudEndure Disaster Recovery architecture is unchanged from the standard implementation. The only additions required are AWS Lambda, Amazon CloudWatch, and Amazon EventBridge.

This example shows how to schedule a Lambda function to run every five minutes, which queries the CloudEndure API using your API credentials. This populates an Amazon CloudWatch metric. From here, you can use the full feature set of AWS and related services to create dashboards, implement alerts, or drive automation.

Walk-through

The following steps show how to store CloudEndure credentials in AWS Secrets Manager and create the necessary IAM roles and security permissions for the Lambda function. The Lambda function pushes metrics to CloudWatch, and it is scheduled to run on a regular basis so that the metrics can feed a CloudEndure DR health dashboard.

Prerequisites

For this walk-through, you should have the following:

An AWS account
AWS resources
A CloudEndure Disaster Recovery account with agents installed in a healthy state, as shown here
Access to the following AWS services:
- Amazon CloudWatch
- AWS Secrets Manager
- AWS IAM
- AWS Lambda

Securely store your CloudEndure account credentials in AWS Secrets Manager

Open the AWS Secrets Manager console.
On either the service introduction page or the Secrets list page, choose Store a new secret. On the Store a new secret page, choose Other type of secret. You choose this because your secret doesn’t apply to a database.
Under Specify the key/value pairs to be stored in the secret, create a key/value pair with a key of userAPIToken. As the value, enter your CloudEndure API token.
For Select the encryption key, choose DefaultEncryptionKey. Secrets Manager always encrypts the secret when you select this option and provides it at no charge to you. If you choose to use a custom KMS key, then AWS charges you at the standard AWS KMS rate.
Secrets Manager uses a unique encryption key that resides within the account and can only be used with Secrets Manager in the same Region. Choose Next.
Under Secret name, type a name for the secret in the text field. Use only alphanumeric characters and the characters /_+=.@-. For example, you can use a secret name, such as cloud_endure_credentials.
In the Description field, type a description of the secret, for example, CloudEndure <Account_Name> API credentials, generated mm-dd-yyyy.
In the Tags section, add desired tags in the Key and Value – optional text fields. Choose Next.

You can leave tags blank. However, we recommend using tags as a best practice to help identify secrets.

For the purposes of this tutorial, you can leave automatic rotation disabled. If you prefer, you can enable it under Configure automatic rotation. Rotating credentials is good practice, and instructions for implementing credential rotation are available in the Rotating Your AWS Secrets Manager Secrets guide.
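As a rough illustration of how the stored secret is consumed later, the sketch below reads the secret and extracts the token. It assumes the secret name cloud_endure_credentials and the key userAPIToken from the steps above; the helper names parse_api_token and fetch_api_token are hypothetical and only used here.

```python
import json


def parse_api_token(secret_string, key="userAPIToken"):
    """Extract the CloudEndure API token from a Secrets Manager SecretString."""
    secret = json.loads(secret_string)
    if key not in secret:
        raise KeyError(f"secret has no key named {key!r}")
    return secret[key]


def fetch_api_token(secret_id="cloud_endure_credentials"):
    """Read the secret from AWS Secrets Manager and return the token.

    Needs AWS credentials that allow secretsmanager:GetSecretValue.
    """
    # Imported lazily so parse_api_token can be tried without AWS access.
    import boto3

    client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_id)
    return parse_api_token(response["SecretString"])
```

Keeping the JSON parsing separate from the AWS call makes it possible to check the key name locally before deploying anything.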

Create IAM prerequisites

Create the IAM CloudWatch policy for your Lambda function:
- Navigate to the IAM console, select Policies, then Create Policy.
- Select the JSON tab and enter the following policy document:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "cloudwatch:PutMetricData"
            ],
            "Resource": "*"
        }
    ]
}

- Select Next: Review.
  - Enter name for the policy: CEDRMetricPolicy.
  - Enter description.
  - Enter tags.
- Select Create Policy.

Create the IAM AWS Secrets Manager policy for your Lambda function:
- Navigate to the IAM console and choose Create Policy.
- Select the JSON tab and enter the following policy document:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetResourcePolicy",
                "secretsmanager:GetSecretValue",
                "secretsmanager:DescribeSecret",
                "secretsmanager:ListSecretVersionIds",
                "secretsmanager:ListSecrets"
            ],
            "Resource": "*"
        }
    ]
}

- Note that replacing the wildcard in Resource with the ARN of the secret created in the previous section (cloud_endure_credentials) restricts this policy to that secret only.
- Select Next: Tags.
  - Add any tags to use for identification.
- Select Next: Review.
  - Enter name for the policy: CEDRSecretsPolicy.
  - Enter description.
  - Enter tags.
- Choose Create policy.
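To make the note about scoping the policy to a single secret concrete, here is a minimal sketch that builds such a policy document. The ARN is a made-up placeholder, the function name scoped_secret_policy is hypothetical, and trimming the action list down to the two read actions is an assumption on our part rather than a requirement of the walk-through.

```python
import json


def scoped_secret_policy(secret_arn):
    """Build a policy document that allows reading one specific secret."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadCloudEndureSecret",
                "Effect": "Allow",
                "Action": [
                    "secretsmanager:GetSecretValue",
                    "secretsmanager:DescribeSecret",
                ],
                # Only this secret, instead of "*"
                "Resource": secret_arn,
            }
        ],
    }


# Placeholder ARN for illustration only; copy the real one from the
# Secrets Manager console.
example_arn = (
    "arn:aws:secretsmanager:us-east-1:111122223333:"
    "secret:cloud_endure_credentials-AbCdEf"
)
print(json.dumps(scoped_secret_policy(example_arn), indent=4))
```

The printed JSON can be pasted into the JSON tab in place of the wildcard version.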

Navigate back to IAM console and select Roles, then select Create role:
- Select AWS Service, and under Common use cases select Lambda.
- Select Next: Permissions.
- Select the policies we created in the previous step:
  1. CEDRMetricPolicy
  2. CEDRSecretsPolicy
- Select Next: Tags.
  - Add appropriate tags.
- Select Next: Review.
- Enter role name: ce-dr-monitoring-role.
- Select Create role.

Create AWS Lambda function

Go to the AWS Lambda console.
If you are new to Lambda, you see a welcome page. Choose Get Started Now. Otherwise, choose Create function.
Select the Author from scratch option to create your Lambda function.
- Enter the function name, CE-DR-Monitoring-Function.
- Runtime: choose Python 3.6.
- Select Create function.
Select the Layers option.
- Click Add a layer.
- Choose Specify an ARN.
- Enter the layer ARN (available here).
- Select Verify.
- Choose Add.
Open the Configuration tab.
- Select Edit.
- Under Basic settings, change Timeout to 3 minutes.
- Choose Save.

Enter the following code into the Code source pane:

import json
import datetime

import boto3
import botocore.exceptions
from botocore.vendored import requests  # requests module, kept available by the layer added above

HOST = 'https://console.cloudendure.com'
headers = {'Content-Type': 'application/json'}

# Create the Secrets Manager Client
client = boto3.client('secretsmanager')

def lambda_handler(event, context):
    session = {}
    endpoint = '/api/latest/{}'

    try:   
        get_secret_value_response = client.get_secret_value(
            SecretId="cloud_endure_credentials"
        )
    except botocore.exceptions.ClientError as e:
        raise e
    else:
        secret = json.loads(get_secret_value_response['SecretString'])
    
    
    login_data = {
        "userApiToken": secret["userAPIToken"]
    }

    cloudwatch = boto3.client('cloudwatch')
    
    r = requests.post(HOST + endpoint.format('login'), data = json.dumps(login_data), headers = headers)
    print(r)
    if r.status_code not in (200, 307):
        return {
            'statusCode': 200,
            'body': json.dumps('Bad login credentials')
        }
    
    # check if need to use a different API entry point
    if r.history:
        endpoint = '/' + '/'.join(r.url.split('/')[3:-1]) + '/{}'
        r = requests.post(HOST + endpoint.format('login'), data = json.dumps(login_data), headers = headers)
    
    session = {'session': r.cookies['session']}
    
    headers['X-XSRF-TOKEN'] = r.cookies['XSRF-TOKEN']
    
  
    r = requests.get(HOST + endpoint.format('projects'), headers = headers, cookies = session)
    if r.status_code != 200:
        return {
            'statusCode': 200,
            'body': json.dumps('Failed to fetch the project')
        }
  
    try:
        projects = json.loads(r.content)['items']
        for project in projects:
            if project['type'] == "DR": # This ensures we skip migration projects that may be in the account.
                project_id = project['id']
                r = requests.get(HOST + endpoint.format('projects/{}/machines').format(project_id), headers = headers, cookies = session)
                if r.status_code != 200:
                    return {
                        'statusCode': 200,
                        'body': json.dumps('Failed to fetch the machines')
                    }

                machines = json.loads(r.content)['items']
                if not machines:    # Nothing to report for a project without machines
                    continue

                for machine in machines:
                    if 'name' not in machine['sourceProperties']:   # If the machine name is missing, then skip it
                        continue

                    # Use the reported backlog in bytes if present; otherwise treat it as zero
                    backlog = machine['replicationInfo'].get('backloggedStorageBytes', 0)
                    
                    if 'lastConsistencyDateTime' in machine['replicationInfo']:      # confirm element is present
                        last_consistent = machine['replicationInfo']['lastConsistencyDateTime']     #store last consistent backup time
                        last_consistent_dt = datetime.datetime.strptime(last_consistent[:19], '%Y-%m-%dT%H:%M:%S')  #format time for date-time processing
                        diff = datetime.datetime.utcnow()-last_consistent_dt    # Store the age of the last consistent backup - i.e. what is our current Recovery Point Objective actual 
                        lag = diff.total_seconds()/60
                    else:
                        lag = 0                                     # Server has not completed its initial sync

                    response1 = cloudwatch.put_metric_data(
                        MetricData = [
                            {
                                'MetricName': 'MachineData',
                                'Dimensions': [                                         # Dimensions provide meta data for sorting / organizing the information. 
                                    {
                                        'Name': 'PROJECT_NAME',                         # CE Project Name (note this will be the project name as per the CE Console)
                                        'Value': project['name']
                                    },
                                    {
                                        'Name': 'MACHINE_NAME',                         # Machine identifier as per the CE console
                                        'Value': machine['sourceProperties']['name']
                                    },
                                ],
                                'Unit': 'None',
                                'Value': lag
                            },
                        ],
                        Namespace = 'CE-Replication-Lag'        # This is how you find the stored measures in the CloudWatch console.
                    )
                    # print(response1)    # uncomment this to troubleshoot put metric response for lag
                
                    response2 = cloudwatch.put_metric_data(
                        MetricData = [
                        {
                            'MetricName': 'MachineData',
                            'Dimensions': [
                                {
                                    'Name': 'PROJECT_NAME',
                                    'Value': project['name']
                                },
                                {
                                    'Name': 'MACHINE_NAME',
                                    'Value': machine['sourceProperties']['name']
                                },
                            ],
                            'Unit': 'None',
                            'Value': backlog
                        },
                        ],
                        Namespace = 'CE-Replication-Backlog'        # This is how you will find the stored measures in the CloudWatch console.
                    )
                    # print(response2)    # uncomment this to troubleshoot put metric response for backlog in bytes
            else:
                continue

    except Exception as e:      # Avoid a bare except, and log the cause for troubleshooting
        print(e)
        return {
            'statusCode': 200,
            'body': json.dumps('No associated projects')
        }

    return {
        'statusCode': 200,
    }

Select Deploy.
Select Test.

If your function reports “bad login credentials,” verify that you entered your API Token as it appears on the CloudEndure portal, but without the dashes.

If your function completes with the message “No associated projects,” confirm that your API token is associated with an account that has CloudEndure projects created.

If your function completed without error, you’re ready to establish the repeatable schedule for the function. This populates your CloudWatch metrics.
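The two values the function publishes, lag in minutes and backlog in bytes, are derived from the replicationInfo fields of each machine record. The sketch below isolates that calculation so it can be checked locally. The function name replication_metrics and the sample record are hypothetical, and the timestamp format is assumed to match what the slicing in the function above expects.

```python
import datetime


def replication_metrics(machine, now=None):
    """Return (lag_minutes, backlog_bytes) for one CloudEndure machine record."""
    if now is None:
        # Naive UTC timestamp, comparable with the parsed value below
        now = datetime.datetime.now(datetime.timezone.utc).replace(tzinfo=None)

    info = machine.get("replicationInfo", {})
    backlog = info.get("backloggedStorageBytes", 0)

    last_consistent = info.get("lastConsistencyDateTime")
    if last_consistent is None:
        # Server has not completed its initial sync
        return 0.0, backlog

    # Keep only "YYYY-MM-DDTHH:MM:SS" and drop fractions and offsets
    last_dt = datetime.datetime.strptime(last_consistent[:19], "%Y-%m-%dT%H:%M:%S")
    lag = (now - last_dt).total_seconds() / 60
    return lag, backlog


# Made-up record for illustration
sample = {
    "replicationInfo": {
        "lastConsistencyDateTime": "2022-03-04T12:00:00.123456+00:00",
        "backloggedStorageBytes": 2048,
    }
}
print(replication_metrics(sample, now=datetime.datetime(2022, 3, 4, 12, 30, 0)))
```

Passing now explicitly, as in the last line, keeps the calculation deterministic when testing.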

Create a rule using Amazon EventBridge

Search for Amazon EventBridge in the services pane.
Choose Create rule.
Provide a name for the rule (we use CE-DR_Event-Rule).
Under Define pattern, select Schedule.
1. Choose Fixed rate every and set a rate of 5 minutes or less.
Under Select targets, choose Lambda function from the drop-down list.
1. Under Function choose CE-DR-Monitoring-Function.
Choose Create.
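If you would rather script the rule than use the console, the steps above could be sketched with boto3 roughly as follows. This is one possible approach under stated assumptions, not the procedure from this post: the function names, the statement ID, and the target ID ce-dr-monitoring are hypothetical, and the clients are passed in so the logic can be exercised without AWS access.

```python
def rate_expression(minutes):
    """Build an EventBridge rate expression; the unit is singular for a value of 1."""
    if minutes < 1:
        raise ValueError("minutes must be at least 1")
    unit = "minute" if minutes == 1 else "minutes"
    return f"rate({minutes} {unit})"


def schedule_function(events_client, lambda_client, function_arn,
                      rule_name="CE-DR_Event-Rule", minutes=5):
    """Create a scheduled rule and point it at the Lambda function."""
    rule = events_client.put_rule(
        Name=rule_name,
        ScheduleExpression=rate_expression(minutes),
        State="ENABLED",
    )
    # Allow EventBridge to invoke the function on behalf of this rule
    lambda_client.add_permission(
        FunctionName=function_arn,
        StatementId=f"{rule_name}-invoke",      # hypothetical statement ID
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule["RuleArn"],
    )
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "ce-dr-monitoring", "Arn": function_arn}],  # hypothetical target ID
    )
    return rule["RuleArn"]
```

A typical call would pass boto3.client("events"), boto3.client("lambda"), and the ARN of CE-DR-Monitoring-Function. The console adds the invoke permission for you, which is why the scripted version needs the explicit add_permission call.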

Create your dashboard

Navigate to the CloudWatch console.
Select Create dashboard and provide a name for your new dashboard (for example, CloudEndure DR health).
In the chart selection window, select a graph type (we recommend line to illustrate a continuous trend).
1. Choose Metrics.
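The same dashboard can also be described as a JSON document. The following sketch builds a dashboard body with one line graph per namespace published by the Lambda function. The function name dashboard_body, the widget layout, the default Region, and the choice of the Maximum statistic are assumptions made for illustration.

```python
import json


def dashboard_body(project_name, machine_names, region="us-east-1"):
    """Return a CloudWatch dashboard body (JSON string) with lag and backlog graphs."""

    def widget(title, namespace, y):
        return {
            "type": "metric",
            "x": 0, "y": y, "width": 24, "height": 6,
            "properties": {
                "title": title,
                "view": "timeSeries",       # line graph
                "region": region,
                "stat": "Maximum",
                "period": 300,              # matches the five-minute schedule
                "metrics": [
                    [namespace, "MachineData",
                     "PROJECT_NAME", project_name,
                     "MACHINE_NAME", name]
                    for name in machine_names
                ],
            },
        }

    return json.dumps({
        "widgets": [
            widget("Replication lag (minutes)", "CE-Replication-Lag", 0),
            widget("Replication backlog (bytes)", "CE-Replication-Backlog", 6),
        ]
    })
```

The result can be passed as DashboardBody to the CloudWatch put_dashboard API call. Note that dashboard names created through the API may contain only letters, digits, hyphens, and underscores, so a name such as CloudEndure-DR-health would be needed there.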

Create your alerts

You can create alerts for your metrics using this guide.
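As an illustration of what such an alert might look like for the lag metric, the sketch below assembles the parameters for a CloudWatch alarm. The function name lag_alarm_params, the alarm name pattern, the 60-minute threshold, and the decision to treat missing data as breaching are all assumptions that you would tune to your own recovery point objective.

```python
def lag_alarm_params(project_name, machine_name,
                     threshold_minutes=60, sns_topic_arn=None):
    """Build keyword arguments for a CloudWatch alarm on replication lag."""
    params = {
        "AlarmName": f"CE-lag-{project_name}-{machine_name}",
        "Namespace": "CE-Replication-Lag",
        "MetricName": "MachineData",
        "Dimensions": [
            {"Name": "PROJECT_NAME", "Value": project_name},
            {"Name": "MACHINE_NAME", "Value": machine_name},
        ],
        "Statistic": "Maximum",
        "Period": 300,                      # one datapoint per scheduled run
        "EvaluationPeriods": 3,             # three consecutive breaches
        "Threshold": float(threshold_minutes),
        "ComparisonOperator": "GreaterThanThreshold",
        # If the Lambda function stops reporting, raise the alarm as well
        "TreatMissingData": "breaching",
    }
    if sns_topic_arn:
        params["AlarmActions"] = [sns_topic_arn]
    return params
```

The dictionary can be passed to the boto3 CloudWatch client as put_metric_alarm(**params).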

Cleaning up

To avoid incurring future charges, disable the EventBridge rule that triggers the Lambda function, and delete any supporting resources or alerting you’ve implemented.
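A scripted version of this cleanup might look like the sketch below, which disables the scheduled rule and optionally removes it. The function name disable_schedule and the default target ID are hypothetical; use the target IDs that are actually attached to your rule.

```python
def disable_schedule(events_client, rule_name="CE-DR_Event-Rule",
                     delete=False, target_ids=("ce-dr-monitoring",)):
    """Stop the scheduled invocations, and optionally delete the rule."""
    events_client.disable_rule(Name=rule_name)
    if delete:
        # A rule can only be deleted after its targets are removed
        events_client.remove_targets(Rule=rule_name, Ids=list(target_ids))
        events_client.delete_rule(Name=rule_name)
```

Calling it with boto3.client("events") and the default arguments only disables the rule, which makes it easy to resume monitoring later.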

Conclusion

CloudEndure Disaster Recovery has a long history of protecting workloads from disasters. Monitoring and providing notification of any replication interruptions has proven difficult in the past. CloudEndure’s APIs provide a powerful way to collect specific instrumentation details about the resources under CloudEndure protection. This allows you to draw insights, support operations, and drive proactive alerting and action using CloudWatch and other AWS services. With these in place, administrators can be sure they are continuously up to date with the health of the CloudEndure Disaster Recovery tool.

Thanks for reading this blog post on disaster recovery monitoring with CloudEndure and Amazon CloudWatch. If you have any comments or questions, don’t hesitate to leave them in the comments section.


AWS Storage Blog