AWS Cloud Operations & Migrations Blog

Build a resilience reporting dashboard with AWS Resilience Hub and Amazon QuickSight

You might have heard the phrase “10,000 foot view” at some point during your career. It typically refers to a broad, high-level understanding of a system or an organization’s technology infrastructure and how all of its components fit together: looking at the big picture without getting bogged down in the details. This perspective helps identify areas of weakness or inefficiency, informs planning and strategy development, and surfaces opportunities for improvement, innovation, and cost savings.

In this blog post, I walk through how you can get a 10,000 foot view of the resilience posture of applications across your entire portfolio of products and services by building a reporting dashboard.

AWS Resilience Hub is a service that helps you understand the resilience of your applications and provides recommendations on how you can achieve your resilience targets. The assessments and recommendations provided by Resilience Hub are tailored for your specific applications based on the services/resources you are using, and the resilience targets you have set. While this data is useful for individual teams, leadership and decision makers need a broader view, a 10,000 foot view of the resilience posture of products and services across their line of business, or even the entire organization. To do this, I will use the Resilience Hub APIs to collect and aggregate resilience data for applications defined within Resilience Hub, and use this information to perform analytics and set up a dashboard using Amazon QuickSight.

Architecture overview

Architecture diagram of the solution

  1. After applications have been defined and assessed within Resilience Hub, a time-based event from Amazon EventBridge invokes an AWS Lambda function.
  2. The Lambda function makes API calls to Resilience Hub, retrieves and aggregates resilience data for all applications defined within Resilience Hub, and generates a CSV file.
  3. The Lambda function uploads this CSV file to an Amazon Simple Storage Service (Amazon S3) bucket.
  4. The data in S3 is then ingested into SPICE, the in-memory calculation engine within QuickSight.
  5. A dashboard is created that contains various visuals to provide an aggregate view of resilience across all applications defined in Resilience Hub.

Prerequisites

Before implementing this solution, you need to complete the following prerequisites:

  1. Define and assess one or more applications using Resilience Hub. Check out this blog post for information on how to do this.
  2. If you do not already have one, sign up for an Amazon QuickSight subscription.
  3. Create or use an existing Amazon S3 bucket where the Lambda function will write resilience data.

Walkthrough

Resilience Hub can be programmatically accessed using its APIs. These APIs can be used to perform a variety of tasks such as defining and onboarding applications, setting recovery targets, running assessments, and in this case, retrieving application and assessment data.
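
For example, here is a minimal sketch of retrieving the list of applications using the AWS SDK for Python (Boto3), the same SDK used by the Lambda function later in this walkthrough (this assumes your credentials and default region are already configured):

import boto3

# Create a Resilience Hub client in the default region
arh = boto3.client('resiliencehub')

# Retrieve all applications defined in Resilience Hub, following pagination
response = arh.list_apps()
app_summaries = response['appSummaries']
while 'nextToken' in response:
    response = arh.list_apps(nextToken=response['nextToken'])
    app_summaries.extend(response['appSummaries'])

for app in app_summaries:
    print(app['name'], app['appArn'])

Each application summary includes details such as the application name and ARN, which the Lambda function below uses to aggregate resilience data into a report.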

Extract, transform, load resilience data

Start by creating a Lambda function that makes API calls to collect resilience data for applications defined and assessed in Resilience Hub. The function does some minor processing of the responses before generating a CSV file and storing it in an Amazon S3 bucket. The main Resilience Hub APIs used by the function are:

  • ListApps – retrieves the list of applications defined in Resilience Hub
  • DescribeApp – retrieves details for an application, such as its compliance status and resiliency score
  • DescribeResiliencyPolicy – retrieves details for the resiliency policy associated with an application
  • ListAppAssessments – retrieves the list of assessments for an application
  • DescribeAppAssessment – retrieves details for an assessment, such as estimated and target RTO and RPO values

Create Lambda function

  1. Navigate to the Lambda console and select Create function.
  2. Select the option to Author from scratch, enter resilience-reporter for the function name, and choose Python 3.9 for the runtime.
  3. Click Create function.

Create function wizard of the AWS Lambda console with the following data – Author from scratch selected, resilience-reporter as the function name, Python 3.9 as the runtime and x86_64 as the architecture

After the function page loads on the console, paste the following code into the code editor under Code source. This code makes API calls to Resilience Hub to retrieve application resilience data and stores it in an S3 bucket. Click Deploy to update the function code.

import boto3
import csv
import json
import os

def lambda_handler(event, context):
    # Print incoming event
    print('Incoming Event:' + json.dumps(event))

    # Get list of regions
    ec2 = boto3.client('ec2')
    regions = []
    region_details = ec2.describe_regions()['Regions']
    for region_detail in region_details:
        regions.append(region_detail['RegionName'])

    # Generate a new CSV file and populate headers
    with open('/tmp/resilience-report.csv', 'w', newline='') as file:
        writer = csv.writer(file)
        writer.writerow(["Application Name", "Assessment Name", "Assessment Compliance Status", "End Time", "Estimated Overall RTO", "Estimated Overall RPO", "Target RTO", "Target RPO", "Application Compliance Status", "Resiliency Score", "Last Assessed", "Application Tier", "Region"])

    # Loop over each region
    for region in regions:
        try:
            arh = boto3.client('resiliencehub', region_name=region)
            apps = arh.list_apps()
        except Exception:
            # Resilience Hub may not be available in this region; skip it
            continue

        # Get list of apps in a region
        app_arns = []
        for app in apps['appSummaries']:
            app_arns.append(app['appArn'])
        while 'nextToken' in apps:
            apps = arh.list_apps(nextToken=apps['nextToken'])
            for app in apps['appSummaries']:
                app_arns.append(app['appArn'])
        
        # Loop over list of apps to retrieve details
        for app in app_arns:
            app_details = arh.describe_app(appArn=app)['app']
            app_name = app_details['name']
            app_compliance = app_details['complianceStatus']
            app_res_score = app_details['resiliencyScore']
            app_tier = 'unknown'
            
            # Check if a resiliency policy is associated with the application
            if 'policyArn' in app_details:
                app_res_policy = app_details['policyArn']
                policy_details = arh.describe_resiliency_policy(
                    policyArn=app_res_policy
                )
                app_tier = policy_details['policy']['tier']
            
            # Check if an application has been assessed
            if app_compliance == 'NotAssessed':
                with open('/tmp/resilience-report.csv', 'a', newline='') as file:
                    writer = csv.writer(file)
                    writer.writerow([app_name, '', 'NotAssessed', '', '', '', '', '', app_compliance, app_res_score, '', app_tier, region])
                continue
            
            app_last_assessed = app_details['lastAppComplianceEvaluationTime']
            
            # Get list of assessments for the application, following pagination
            assessments_response = arh.list_app_assessments(appArn=app)
            assessment_summaries = assessments_response['assessmentSummaries']
            
            while 'nextToken' in assessments_response:
                assessments_response = arh.list_app_assessments(appArn=app, nextToken=assessments_response['nextToken'])
                assessment_summaries.extend(assessments_response['assessmentSummaries'])
            
            # Loop over list of assessments to get details
            for assessment in assessment_summaries:
                assessment_arn = assessment['assessmentArn']
                assessment_status = assessment['assessmentStatus']
                
                # Get assessment details if it is a successful assessment
                if assessment_status == 'Success':
                    assessment_details = arh.describe_app_assessment(assessmentArn=assessment_arn)

                    if assessment_details['assessment']['compliance'] == {}:
                        continue
                        
                    assessment_name = assessment_details['assessment']['assessmentName']
                    assessment_compliance_status = assessment_details['assessment']['complianceStatus']
                    end_time = assessment_details['assessment']['endTime']
                    current_rto_az = assessment_details['assessment']['compliance']['AZ']['currentRtoInSecs']
                    current_rto_hardware = assessment_details['assessment']['compliance']['Hardware']['currentRtoInSecs']
                    current_rto_software = assessment_details['assessment']['compliance']['Software']['currentRtoInSecs']
                    current_rpo_az = assessment_details['assessment']['compliance']['AZ']['currentRpoInSecs']
                    current_rpo_hardware = assessment_details['assessment']['compliance']['Hardware']['currentRpoInSecs']
                    current_rpo_software = assessment_details['assessment']['compliance']['Software']['currentRpoInSecs']
                    target_rto_az = assessment_details['assessment']['policy']['policy']['AZ']['rtoInSecs']
                    target_rto_hardware = assessment_details['assessment']['policy']['policy']['Hardware']['rtoInSecs']
                    target_rto_software = assessment_details['assessment']['policy']['policy']['Software']['rtoInSecs']
                    target_rpo_az = assessment_details['assessment']['policy']['policy']['AZ']['rpoInSecs']
                    target_rpo_hardware = assessment_details['assessment']['policy']['policy']['Hardware']['rpoInSecs']
                    target_rpo_software = assessment_details['assessment']['policy']['policy']['Software']['rpoInSecs']
                    
                    # Aggregate RTO and RPO values for current and target
                    current_rto = max(current_rto_az, current_rto_hardware, current_rto_software)
                    current_rpo = max(current_rpo_az, current_rpo_hardware, current_rpo_software)
                    target_rto = min(target_rto_az, target_rto_hardware, target_rto_software)
                    target_rpo = min(target_rpo_az, target_rpo_hardware, target_rpo_software)
                    
                    # Check if application is multi-region and update aggregates accordingly
                    if 'Region' in assessment_details['assessment']['policy']['policy']:
                        current_rto_region = assessment_details['assessment']['compliance']['Region']['currentRtoInSecs']
                        current_rpo_region = assessment_details['assessment']['compliance']['Region']['currentRpoInSecs']
                        target_rto_region = assessment_details['assessment']['policy']['policy']['Region']['rtoInSecs']
                        target_rpo_region = assessment_details['assessment']['policy']['policy']['Region']['rpoInSecs']
                        
                        if current_rto < current_rto_region:
                            current_rto = current_rto_region
                        if current_rpo < current_rpo_region:
                            current_rpo = current_rpo_region
                        if target_rto > target_rto_region:
                            target_rto = target_rto_region
                        if target_rpo > target_rpo_region:
                            target_rpo = target_rpo_region
                    
                    # Populate data into the CSV file    
                    with open('/tmp/resilience-report.csv', 'a', newline='') as file:
                        writer = csv.writer(file)
                        writer.writerow([app_name, assessment_name, assessment_compliance_status, end_time.strftime("%Y-%m-%d %H:%M:%S"), current_rto, current_rpo, target_rto, target_rpo, app_compliance, app_res_score, app_last_assessed.strftime("%Y-%m-%d %H:%M:%S"), app_tier, region])
                        
    # Write data to S3
    bucketName = os.environ['bucketName']
    s3 = boto3.resource('s3')
    s3.Bucket(bucketName).upload_file('/tmp/resilience-report.csv', 'resilience-report.csv')
    print('Report uploaded to s3://' + bucketName + '/resilience-report.csv')

Environment variables

To upload data to Amazon S3, the function looks for the bucket name in its environment variables.

  1. Select the Configuration tab on the function overview page and then select Environment variables.
  2. Click Edit and then Add environment variable.
  3. Enter bucketName (case-sensitive) for the Key and the name of an existing S3 bucket where you would like to store resilience data as the Value.
  4. Click Save.

Configuration tab for the Lambda function where a new environment variable is added

Function configuration

Navigate to the General configuration section under the Configuration tab and click Edit. Scroll down to the Timeout section and increase the timeout to 2 min. Note that you might have to increase this further based on the number of applications and assessments you have within Resilience Hub.

Under the Existing role section, click View the resilience-reporter-role-xxxxxx role on the IAM console. This opens a new browser tab with the role summary page on the IAM console. Navigate back to the Lambda console tab and click Save to update the function configuration.

Function timeout set to 2 min on the configuration tab
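
Both of the preceding configuration steps (the environment variable and the timeout) can also be scripted. Here is a minimal sketch using Boto3, assuming the function name from this walkthrough and a placeholder bucket name that you would replace with your own:

import boto3

lambda_client = boto3.client('lambda')

# Set the bucketName environment variable and increase the timeout to 2 minutes
lambda_client.update_function_configuration(
    FunctionName='resilience-reporter',
    Timeout=120,  # seconds; increase for large numbers of applications and assessments
    Environment={'Variables': {'bucketName': 'YOUR_BUCKET_NAME'}}
)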

Function permissions

Switch to the tab that has the IAM console opened with the Lambda role summary. Under Permissions policies, you should see a policy already associated with this role. This policy provides permission for the Lambda function to publish execution logs to CloudWatch Logs. The function still needs permissions to read data from Resilience Hub and publish the results to Amazon S3.

  1. Click Add permissions and select Create inline policy.
  2. On the Create policy wizard, select the JSON tab.
  3. In the following policy, replace the string YOUR_BUCKET_NAME with the name of the S3 bucket where the resilience data needs to be stored. This should be the same bucket name as what was specified in the environment variables for the Lambda function.
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "s3:PutObject",
                "Resource": [ "arn:aws:s3:::YOUR_BUCKET_NAME", "arn:aws:s3:::YOUR_BUCKET_NAME/*" ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "resiliencehub:ListApps",
                    "resiliencehub:DescribeResiliencyPolicy",
                    "resiliencehub:ListAppAssessments",
                    "ec2:DescribeRegions",
                    "resiliencehub:DescribeApp",
                    "resiliencehub:DescribeAppAssessment"
                ],
                "Resource": "*"
            }
        ]
    }
    
  4. After replacing the bucket name, paste the updated policy into the JSON tab and click Review policy.
  5. Enter resilience-data-access for the policy name and click Create policy.
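
If you manage IAM with code rather than the console, a rough equivalent of the preceding steps using Boto3 might look like the following sketch. The role name below is a placeholder matching the pattern from the previous section; replace it (and YOUR_BUCKET_NAME) with your actual values:

import boto3
import json

iam = boto3.client('iam')

# The same policy document as above, with YOUR_BUCKET_NAME to be replaced
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:PutObject",
            "Resource": ["arn:aws:s3:::YOUR_BUCKET_NAME", "arn:aws:s3:::YOUR_BUCKET_NAME/*"]
        },
        {
            "Effect": "Allow",
            "Action": [
                "resiliencehub:ListApps",
                "resiliencehub:DescribeResiliencyPolicy",
                "resiliencehub:ListAppAssessments",
                "ec2:DescribeRegions",
                "resiliencehub:DescribeApp",
                "resiliencehub:DescribeAppAssessment"
            ],
            "Resource": "*"
        }
    ]
}

# Attach the policy inline to the Lambda execution role (placeholder role name)
iam.put_role_policy(
    RoleName='resilience-reporter-role-xxxxxx',
    PolicyName='resilience-data-access',
    PolicyDocument=json.dumps(policy_document)
)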

Function trigger

Lastly, the function needs to be configured with an EventBridge trigger to periodically invoke it and generate a fresh report.

  1. Navigate back to the Lambda console and under Function overview, choose Add trigger. Under Trigger configuration, choose EventBridge (CloudWatch Events).
  2. Under Rule, choose Create a new rule and then enter resilience-reporting-schedule for the rule name.
  3. For Rule type, choose Schedule expression, enter rate(12 hours), and then choose Add. To adjust the rate as appropriate for your use case, check Schedule Expressions for Rules.

Add EventBridge (CloudWatch Events) trigger to the Lambda function with create new rule selected and resilience-reporting-schedule as the rule name. Rule type is set to Schedule expression with the value rate(12 hours)
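
As with the earlier configuration steps, the trigger can also be created programmatically. Here is a rough sketch using Boto3 that creates the rule, grants EventBridge permission to invoke the function, and wires the rule to the function (the function and rule names come from this walkthrough):

import boto3

events = boto3.client('events')
lambda_client = boto3.client('lambda')

# Create (or update) the scheduled rule
rule_arn = events.put_rule(
    Name='resilience-reporting-schedule',
    ScheduleExpression='rate(12 hours)'
)['RuleArn']

# Allow EventBridge to invoke the function
lambda_client.add_permission(
    FunctionName='resilience-reporter',
    StatementId='resilience-reporting-schedule',
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com',
    SourceArn=rule_arn
)

# Point the rule at the Lambda function
function_arn = lambda_client.get_function(FunctionName='resilience-reporter')['Configuration']['FunctionArn']
events.put_targets(
    Rule='resilience-reporting-schedule',
    Targets=[{'Id': 'resilience-reporter', 'Arn': function_arn}]
)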

Generate data

Although the Lambda function has been configured to be invoked periodically, we can invoke it manually to generate data for the purpose of this walkthrough.

To do this, navigate to the Test tab on the Lambda console and click Test. This invokes the function with a dummy event. After the function execution is complete, navigate to the S3 bucket you specified as an environment variable and verify that a CSV file called resilience-report.csv has been generated.

Test tab of the Lambda function to perform a test invocation of the function
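
If you prefer to trigger the test invocation from a script instead of the console, a minimal sketch with Boto3 (assuming the function name from this walkthrough) looks like this:

import boto3

lambda_client = boto3.client('lambda')

# Synchronously invoke the function with an empty test event
response = lambda_client.invoke(
    FunctionName='resilience-reporter',
    Payload=b'{}'
)
print(response['StatusCode'])  # 200 indicates a successful synchronous invocation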

Visualize data with QuickSight

Now that the raw data is available in an S3 bucket, it can be visualized to create a reporting dashboard with QuickSight.

Grant access for QuickSight to access S3

  1. Sign in to the QuickSight console.
  2. In the upper right corner of the console, choose Admin/username, and then choose Manage QuickSight.
  3. Choose Security and permissions.
  4. Under QuickSight access to AWS services, choose Manage.
  5. Check the box next to Amazon S3, and click Select S3 buckets.
  6. On the dialog box that pops up, select the S3 bucket where the resilience report is being stored (this will be the same bucket that was specified as the environment variable for the Lambda function).
  7. Click Finish to close the dialog box and finally click Save.

Amazon S3 selected in the “Allow service access” wizard on QuickSight. The resilience-reporter-blog S3 bucket selected in the pop-up

Create a dataset

  1. Create a text file on your local machine with the following contents. Make sure you replace the string YOUR_BUCKET_NAME with the name of the S3 bucket where the resilience report is being stored (this will be the same bucket that was specified as the environment variable for the Lambda function):

{"fileLocations": [{"URIs": ["s3://YOUR_BUCKET_NAME/resilience-report.csv"]}]}

  2. Save the file with the name manifest.json.
  3. On the QuickSight console, select Datasets from the menu on the left and then click New dataset.
  4. Select S3 as the data source.
  5. Enter resilience-data for the data source name.
  6. Select Upload as the choice next to Upload a manifest file and then select the manifest.json file that you created.
  7. Click Connect.
  8. After the connection has been established, click Visualize to finish dataset creation.

S3 selected as the data source, and resilience-data entered as the data source name and the manifest file uploaded from the local machine

On the analysis page, you can select the types of visuals you want to include, and what data needs to be included in those visuals. The following data will be available to you (this is contained in the CSV):

  • Application Name – the name of the application as defined within Resilience Hub
  • Assessment Name – the name of an assessment for an application
  • Assessment Compliance Status – indicates if the associated assessment was evaluated as being compliant with the resiliency policy or not
  • End Time – identifies the end time of an assessment
  • Estimated Overall RTO – indicates the estimated achievable RTO at the time of assessment (the max value across all disruption types is selected)
  • Estimated Overall RPO – indicates the estimated achievable RPO at the time of assessment (the max value across all disruption types is selected)
  • Target RTO – indicates the targeted RTO based on the associated Resiliency policy (the min value across all disruption types is selected)
  • Target RPO – indicates the targeted RPO based on the associated Resiliency policy (the min value across all disruption types is selected)
  • Application Compliance Status – indicates if the application (based on the latest assessment) was evaluated as being compliant with the resiliency policy or not
  • Resiliency Score – the resiliency score of the application
  • Last Assessed – timestamp of the most recent assessment
  • Application Tier – application tier as defined by the resiliency policy associated with the application
  • Region – the region in which the application is defined within Resilience Hub

After creating visuals using the available data, you can publish a dashboard by clicking on the Share option on the top right corner and selecting Publish dashboard. The following figure shows an example of a dashboard.

Sample QuickSight dashboard consisting of various KPIs, tables, and charts

Deploy the entire solution

For your convenience, the solution described in this blog post is also available as an easily deployable, automated solution: the resilience-reporter solution in the aws-resilience-hub-tools repository, provided as a CloudFormation template. The repository also contains other assets and solutions that can enhance your experience with AWS Resilience Hub.

Cleanup

Delete the following resources that were created as part of the walkthrough:

  • QuickSight dashboard
  • QuickSight analysis
  • QuickSight dataset
  • Lambda function
  • Resilience data stored in S3 bucket

Conclusion

Having a “10,000 foot view” of the resilience posture of applications is crucial for identifying areas of weakness and opportunities for improvement. AWS Resilience Hub runs assessments and provides recommendations tailored to individual applications, but to have a comprehensive understanding, data from Resilience Hub can be collected, aggregated, and analyzed. This can be achieved using Resilience Hub APIs and creating a dashboard using Amazon QuickSight, providing leadership and decision makers with a big-picture view of the resilience posture of their entire portfolio of products and services.

About the Author

Mahanth Jayadeva

Mahanth Jayadeva is a Resilience Architect at Amazon Web Services (AWS). As part of the AWS Resilience Hub team, he works with customers to help them improve the resilience of their applications. He spends his free time playing with his pup Cosmo, learning more about astronomy, and is an avid gamer.