AWS Cloud Operations Blog
Build a resilience reporting dashboard with AWS Resilience Hub and Amazon QuickSight
You might have heard the phrase “10,000 foot view” at some point in your career. It typically refers to a broad, high-level understanding of a system or an organization’s technology infrastructure and how all its components fit together: a way of looking at the big picture without getting bogged down in the details. This perspective can help identify areas of weakness or inefficiency, inform planning and strategy development, and surface opportunities for improvement, innovation, and cost savings.
In this blog post, I walk through how you can get a 10,000 foot view of the resilience posture of applications across your entire portfolio of products and services by building a reporting dashboard.
AWS Resilience Hub is a service that helps you understand the resilience of your applications and provides recommendations on how you can achieve your resilience targets. The assessments and recommendations provided by Resilience Hub are tailored to your specific applications based on the services and resources you use and the resilience targets you have set. While this data is useful for individual teams, leadership and decision makers need a broader view: a 10,000 foot view of the resilience posture of products and services across their line of business, or even the entire organization. To provide this, I will use the Resilience Hub APIs to collect and aggregate resilience data for applications defined within Resilience Hub, and then use this information to perform analytics and set up a dashboard in Amazon QuickSight.
Architecture overview
- After applications have been defined and assessed within Resilience Hub, a time-based event from Amazon EventBridge invokes an AWS Lambda function.
- The Lambda function makes API calls to Resilience Hub, retrieves and aggregates resilience data for all applications defined within Resilience Hub, and generates a CSV file.
- The Lambda function uploads this CSV file to an Amazon Simple Storage Service (Amazon S3) bucket.
- This data in S3 is then ingested into SPICE (the Super-fast, Parallel, In-memory Calculation Engine) within QuickSight.
- A dashboard is created that contains various visuals to provide an aggregate view of resilience across all applications defined in Resilience Hub.
Prerequisites
Before implementing this solution, you need to complete the following prerequisites:
- Define and assess one or more applications using Resilience Hub. Check out this blog post for information on how to do this.
- If you do not already have one, sign up for an Amazon QuickSight subscription.
- Create or use an existing Amazon S3 bucket where the Lambda function will write resilience data.
Walkthrough
Resilience Hub can be programmatically accessed using its APIs. These APIs can be used to perform a variety of tasks such as defining and onboarding applications, setting recovery targets, running assessments, and in this case, retrieving application and assessment data.
Extract, transform, load resilience data
Start by creating a Lambda function that makes API calls to collect resilience data for applications defined and assessed in Resilience Hub. The function does some minor processing of the responses before generating a CSV file and storing it in an Amazon S3 bucket. The main Resilience Hub APIs used by the function are:
- ListApps – retrieves all applications defined
- DescribeApp – describes an application
- ListAppAssessments – retrieves list of assessments for an application
- DescribeAppAssessment – describes an assessment for an application
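Taken together, these calls nest naturally. Here is a minimal sketch of how they chain, before diving into the full function below; pagination and error handling are omitted for brevity, and us-east-1 is an assumed example Region:

import boto3

# Assumed example Region; the full function below loops over all Regions
arh = boto3.client('resiliencehub', region_name='us-east-1')

# ListApps: retrieve all applications defined in Resilience Hub
for app in arh.list_apps()['appSummaries']:
    # DescribeApp: retrieve details for a single application
    details = arh.describe_app(appArn=app['appArn'])['app']
    print(details['name'], details['complianceStatus'])
    # ListAppAssessments: retrieve the assessments run for the application
    for summary in arh.list_app_assessments(appArn=app['appArn'])['assessmentSummaries']:
        # DescribeAppAssessment: retrieve details for a single assessment
        assessment = arh.describe_app_assessment(
            assessmentArn=summary['assessmentArn'])['assessment']
        print(' ', assessment.get('assessmentName'), assessment.get('complianceStatus'))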
Create Lambda function
- Navigate to the Lambda console and select Create function.
- Select the option to Author from scratch, enter resilience-reporter for the function name, and choose Python 3.9 for the runtime.
- Click Create function.
After the function page loads on the console, paste the following code into the code editor under Code source. This code makes API calls to Resilience Hub to retrieve application resilience data and stores it in an S3 bucket. Click Deploy to update the function code.
import boto3
import csv
import json
import os

def lambda_handler(event, context):
    # Print incoming event
    print('Incoming Event: ' + json.dumps(event))

    # Get list of regions
    ec2 = boto3.client('ec2')
    regions = []
    region_details = ec2.describe_regions()['Regions']
    for region_detail in region_details:
        regions.append(region_detail['RegionName'])

    # Generate a new CSV file and populate headers
    with open('/tmp/resilience-report.csv', 'w', newline='') as file:
        writer = csv.writer(file)
        writer.writerow(["Application Name", "Assessment Name", "Assessment Compliance Status", "End Time", "Estimated Overall RTO", "Estimated Overall RPO", "Target RTO", "Target RPO", "Application Compliance Status", "Resiliency Score", "Last Assessed", "Application Tier", "Region"])

    # Loop over each region
    for region in regions:
        # Skip regions where Resilience Hub is not available or accessible
        try:
            arh = boto3.client('resiliencehub', region_name=region)
            apps = arh.list_apps()
        except Exception:
            continue

        # Get list of apps in the region, following pagination tokens
        app_arns = []
        for app in apps['appSummaries']:
            app_arns.append(app['appArn'])
        while 'nextToken' in apps:
            apps = arh.list_apps(nextToken=apps['nextToken'])
            for app in apps['appSummaries']:
                app_arns.append(app['appArn'])

        # Loop over list of apps to retrieve details
        for app in app_arns:
            app_details = arh.describe_app(appArn=app)['app']
            app_name = app_details['name']
            app_compliance = app_details['complianceStatus']
            app_res_score = app_details['resiliencyScore']
            app_tier = 'unknown'

            # Check if a resiliency policy is associated with the application
            if 'policyArn' in app_details:
                app_res_policy = app_details['policyArn']
                policy_details = arh.describe_resiliency_policy(
                    policyArn=app_res_policy
                )
                app_tier = policy_details['policy']['tier']

            # Check if an application has been assessed
            if app_compliance == 'NotAssessed':
                with open('/tmp/resilience-report.csv', 'a', newline='') as file:
                    writer = csv.writer(file)
                    writer.writerow([app_name, '', 'NotAssessed', '', '', '', '', '', app_compliance, app_res_score, '', app_tier, region])
                continue

            app_last_assessed = app_details['lastAppComplianceEvaluationTime']

            # Get list of assessments for the application, following pagination tokens
            response = arh.list_app_assessments(appArn=app)
            assessment_summaries = response['assessmentSummaries']
            while 'nextToken' in response:
                response = arh.list_app_assessments(appArn=app, nextToken=response['nextToken'])
                assessment_summaries.extend(response['assessmentSummaries'])

            # Loop over list of assessments to get details
            for assessment in assessment_summaries:
                assessment_arn = assessment['assessmentArn']
                assessment_status = assessment['assessmentStatus']

                # Get assessment details if it is a successful assessment
                if assessment_status == 'Success':
                    assessment_details = arh.describe_app_assessment(assessmentArn=assessment_arn)
                    if assessment_details['assessment']['compliance'] == {}:
                        continue
                    assessment_name = assessment_details['assessment']['assessmentName']
                    assessment_compliance_status = assessment_details['assessment']['complianceStatus']
                    end_time = assessment_details['assessment']['endTime']
                    current_rto_az = assessment_details['assessment']['compliance']['AZ']['currentRtoInSecs']
                    current_rto_hardware = assessment_details['assessment']['compliance']['Hardware']['currentRtoInSecs']
                    current_rto_software = assessment_details['assessment']['compliance']['Software']['currentRtoInSecs']
                    current_rpo_az = assessment_details['assessment']['compliance']['AZ']['currentRpoInSecs']
                    current_rpo_hardware = assessment_details['assessment']['compliance']['Hardware']['currentRpoInSecs']
                    current_rpo_software = assessment_details['assessment']['compliance']['Software']['currentRpoInSecs']
                    target_rto_az = assessment_details['assessment']['policy']['policy']['AZ']['rtoInSecs']
                    target_rto_hardware = assessment_details['assessment']['policy']['policy']['Hardware']['rtoInSecs']
                    target_rto_software = assessment_details['assessment']['policy']['policy']['Software']['rtoInSecs']
                    target_rpo_az = assessment_details['assessment']['policy']['policy']['AZ']['rpoInSecs']
                    target_rpo_hardware = assessment_details['assessment']['policy']['policy']['Hardware']['rpoInSecs']
                    target_rpo_software = assessment_details['assessment']['policy']['policy']['Software']['rpoInSecs']

                    # Aggregate RTO and RPO values for current and target
                    current_rto = max(current_rto_az, current_rto_hardware, current_rto_software)
                    current_rpo = max(current_rpo_az, current_rpo_hardware, current_rpo_software)
                    target_rto = min(target_rto_az, target_rto_hardware, target_rto_software)
                    target_rpo = min(target_rpo_az, target_rpo_hardware, target_rpo_software)

                    # Check if application is multi-Region and update aggregates accordingly
                    if 'Region' in assessment_details['assessment']['policy']['policy']:
                        current_rto_region = assessment_details['assessment']['compliance']['Region']['currentRtoInSecs']
                        current_rpo_region = assessment_details['assessment']['compliance']['Region']['currentRpoInSecs']
                        target_rto_region = assessment_details['assessment']['policy']['policy']['Region']['rtoInSecs']
                        target_rpo_region = assessment_details['assessment']['policy']['policy']['Region']['rpoInSecs']
                        current_rto = max(current_rto, current_rto_region)
                        current_rpo = max(current_rpo, current_rpo_region)
                        target_rto = min(target_rto, target_rto_region)
                        target_rpo = min(target_rpo, target_rpo_region)

                    # Populate data into the CSV file
                    with open('/tmp/resilience-report.csv', 'a', newline='') as file:
                        writer = csv.writer(file)
                        writer.writerow([app_name, assessment_name, assessment_compliance_status, end_time.strftime("%Y-%m-%d %H:%M:%S"), current_rto, current_rpo, target_rto, target_rpo, app_compliance, app_res_score, app_last_assessed.strftime("%Y-%m-%d %H:%M:%S"), app_tier, region])

    # Write data to S3
    bucketName = os.environ['bucketName']
    s3 = boto3.resource('s3')
    s3.Bucket(bucketName).upload_file('/tmp/resilience-report.csv', 'resilience-report.csv')
    print('Report uploaded to s3://' + bucketName + '/resilience-report.csv')
Environment variables
To upload data to Amazon S3, the function looks for the bucket name in its environment variables.
- Select the Configuration tab on the function overview page and then select Environment variables.
- Click Edit and then Add environment variable.
- Enter bucketName (case-sensitive) for the Key and the name of an existing S3 bucket where you would like to store resilience data as the Value.
- Click Save.
Function configuration
Navigate to the General configuration section under the Configuration tab and click Edit. Scroll down to the Timeout section and increase the timeout to 2 minutes. Note that you might have to increase this further based on the number of applications and assessments you have within Resilience Hub.
Under the Existing role section, click View the resilience-reporter-role-xxxxxx role on the IAM console. This opens a new browser tab with the role summary page in the IAM console. Navigate back to the Lambda console tab and click Save to update the function configuration.
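If you prefer to script these settings, the timeout and the bucketName environment variable can both be applied with a single boto3 call; a minimal sketch, assuming the function name used above. Note that update_function_configuration replaces the function's entire environment variable map, so include every variable you want to keep:

import boto3

lambda_client = boto3.client('lambda')
lambda_client.update_function_configuration(
    FunctionName='resilience-reporter',
    Timeout=120,  # 2 minutes; increase for large numbers of apps and assessments
    Environment={'Variables': {'bucketName': 'YOUR_BUCKET_NAME'}}  # placeholder bucket name
)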
Function permissions
Switch to the tab that has the IAM console opened with the Lambda role summary. Under Permissions policies, you should see a policy already associated with this role. This policy provides permission for the Lambda function to publish execution logs to CloudWatch Logs. The function still needs permissions to read data from Resilience Hub and publish the results to Amazon S3.
- Click Add permissions and select Create inline policy.
- On the Create policy wizard, select the JSON tab.
- In the following policy, replace the string YOUR_BUCKET_NAME with the name of the S3 bucket where the resilience data needs to be stored. This should be the same bucket name that was specified in the environment variables for the Lambda function.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:PutObject",
            "Resource": [
                "arn:aws:s3:::YOUR_BUCKET_NAME",
                "arn:aws:s3:::YOUR_BUCKET_NAME/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "resiliencehub:ListApps",
                "resiliencehub:DescribeResiliencyPolicy",
                "resiliencehub:ListAppAssessments",
                "ec2:DescribeRegions",
                "resiliencehub:DescribeApp",
                "resiliencehub:DescribeAppAssessment"
            ],
            "Resource": "*"
        }
    ]
}
- After replacing the bucket name, paste the updated policy into the JSON tab and click Review policy.
- Enter resilience-data-access for the policy name and click Create policy.
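The same inline policy can also be attached from code using the IAM put_role_policy API; a sketch, where the role name is a placeholder for the autogenerated resilience-reporter-role-xxxxxx name in your account:

import boto3
import json

# The policy document shown above, with YOUR_BUCKET_NAME as a placeholder
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:PutObject",
         "Resource": ["arn:aws:s3:::YOUR_BUCKET_NAME", "arn:aws:s3:::YOUR_BUCKET_NAME/*"]},
        {"Effect": "Allow",
         "Action": ["resiliencehub:ListApps", "resiliencehub:DescribeResiliencyPolicy",
                    "resiliencehub:ListAppAssessments", "ec2:DescribeRegions",
                    "resiliencehub:DescribeApp", "resiliencehub:DescribeAppAssessment"],
         "Resource": "*"}
    ]
}

iam = boto3.client('iam')
iam.put_role_policy(
    RoleName='resilience-reporter-role-xxxxxx',  # placeholder: your function's execution role
    PolicyName='resilience-data-access',
    PolicyDocument=json.dumps(policy)
)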
Function trigger
Lastly, the function needs to be configured with an EventBridge trigger to periodically invoke it and generate a fresh report.
- Navigate back to the Lambda console and under Function overview, choose Add trigger. Under Trigger configuration, choose EventBridge (CloudWatch Events).
- Under Rule, choose Create a new rule and then enter resilience-reporting-schedule for the rule name.
- For Rule type, choose Schedule expression, enter rate(12 hours), and then choose Add. To adjust the rate as appropriate for your use case, check Schedule Expressions for Rules.
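For completeness, the same rule, target, and invoke permission can be wired up with boto3; a sketch, assuming the function and rule names used in this walkthrough:

import boto3

events = boto3.client('events')
lambda_client = boto3.client('lambda')

# Look up the function ARN so the rule can target it
function_arn = lambda_client.get_function(
    FunctionName='resilience-reporter')['Configuration']['FunctionArn']

# Create the scheduled rule
rule_arn = events.put_rule(
    Name='resilience-reporting-schedule',
    ScheduleExpression='rate(12 hours)'
)['RuleArn']

# Allow EventBridge to invoke the function
lambda_client.add_permission(
    FunctionName='resilience-reporter',
    StatementId='resilience-reporting-schedule',
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com',
    SourceArn=rule_arn
)

# Point the rule at the function
events.put_targets(
    Rule='resilience-reporting-schedule',
    Targets=[{'Id': 'resilience-reporter', 'Arn': function_arn}]
)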
Generate data
Although the Lambda function has been configured to be invoked periodically, we can invoke it manually to generate data for the purpose of this walkthrough.
To do this, navigate to the Test tab on the Lambda console and click Test. This invokes the function with dummy input. After the function execution is complete, you can navigate to the S3 bucket you specified as an environment variable and see that a CSV file called resilience-report.csv has been generated.
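If you prefer the command line, the same manual invocation and verification can be done with boto3; a minimal sketch, assuming the function and bucket names used in this walkthrough:

import boto3

# Invoke the function synchronously with dummy input
lambda_client = boto3.client('lambda')
response = lambda_client.invoke(FunctionName='resilience-reporter', Payload=b'{}')
print('Invocation status:', response['StatusCode'])  # 200 on success

# Confirm the CSV landed in the bucket (placeholder bucket name)
s3 = boto3.client('s3')
head = s3.head_object(Bucket='YOUR_BUCKET_NAME', Key='resilience-report.csv')
print('Report size (bytes):', head['ContentLength'])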
Visualize data with QuickSight
Now that the raw data is available in an S3 bucket, it can be visualized to create a reporting dashboard with QuickSight.
Grant access for QuickSight to access S3
- Sign in to the QuickSight console.
- In the upper right corner of the console, choose Admin/username, and then choose Manage QuickSight.
- Choose Security and permissions.
- Under QuickSight access to AWS services, choose Manage.
- Check the box next to Amazon S3, and click Select S3 buckets.
- On the dialog box that pops up, select the S3 bucket where the resilience report is being stored (this will be the same bucket that was specified as the environment variable for the Lambda function).
- Click Finish to close the dialog box and finally click Save.
Create a dataset
- Create a text file on your local machine with the following contents, replacing the string YOUR_BUCKET_NAME with the name of the S3 bucket where the resilience report is stored (this will be the same bucket that was specified as the environment variable for the Lambda function). A small helper script for generating this file follows the steps below:
{"fileLocations": [{"URIs": ["s3://YOUR_BUCKET_NAME/resilience-report.csv"]}]}
- Save the file with the name manifest.json.
- On the QuickSight console, select Datasets from the menu on the left and then click New dataset.
- Select S3 as the data source.
- Enter resilience-data for the data source name.
- Select Upload as the choice next to Upload a manifest file and then select the manifest.json file that you created.
- Click Connect.
- After the connection has been established, click Visualize to finish dataset creation.
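If you would rather generate the manifest file from the first step above than hand-edit it, here is the helper mentioned earlier; purely a convenience sketch, with the bucket name as a placeholder:

import json

bucket = 'YOUR_BUCKET_NAME'  # placeholder: your resilience-report bucket
manifest = {"fileLocations": [{"URIs": [f"s3://{bucket}/resilience-report.csv"]}]}

# Write the QuickSight S3 manifest to the local machine
with open('manifest.json', 'w') as f:
    json.dump(manifest, f)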
On the analysis page, you can select the types of visuals you want to include, and what data needs to be included in those visuals. The following data will be available to you (this is contained in the CSV):
- Application Name – the name of the application as defined within Resilience Hub
- Assessment Name – the name of an assessment for an application
- Assessment Compliance Status – indicates if the associated assessment was evaluated as being compliant with the resiliency policy or not
- End Time – identifies the end time of an assessment
- Estimated Overall RTO – indicates the estimated achievable RTO at the time of assessment (the max value across all disruption types is selected)
- Estimated Overall RPO – indicates the estimated achievable RPO at the time of assessment (the max value across all disruption types is selected)
- Target RTO – indicates the targeted RTO based on the associated resiliency policy (the min value across all disruption types is selected)
- Target RPO – indicates the targeted RPO based on the associated resiliency policy (the min value across all disruption types is selected)
- Application Compliance Status – indicates if the application (based on the latest assessment) was evaluated as being compliant with the resiliency policy or not
- Resiliency Score – the resiliency score of the application
- Last Assessed – timestamp of the most recent assessment
- Application Tier – application tier as defined by the resiliency policy associated with the application
- Region – the region in which the application is defined within Resilience Hub
After creating visuals using the available data, you can publish a dashboard by clicking on the Share option on the top right corner and selecting Publish dashboard. The following figure shows an example of a dashboard.
Deploy the entire solution
The solution described in this blog post is available as an easily deployable, automated solution for your convenience. Deploy the resilience-reporter solution, which is provided as CloudFormation templates, from the aws-resilience-hub-tools repository. The repository also contains other assets and solutions that can enhance your experience with AWS Resilience Hub.
Cleanup
Delete the following resources that were created as part of the walkthrough:
- QuickSight dashboard
- QuickSight analysis
- QuickSight dataset
- Lambda function
- EventBridge rule (resilience-reporting-schedule)
- Resilience data stored in the S3 bucket
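Most of this cleanup can be scripted as well; a sketch, assuming the resource names from this walkthrough. The QuickSight IDs are placeholders you can find in the QuickSight console or via the corresponding List APIs:

import boto3

account_id = boto3.client('sts').get_caller_identity()['Account']

# QuickSight resources (placeholder IDs)
qs = boto3.client('quicksight')
qs.delete_dashboard(AwsAccountId=account_id, DashboardId='YOUR_DASHBOARD_ID')
qs.delete_analysis(AwsAccountId=account_id, AnalysisId='YOUR_ANALYSIS_ID')
qs.delete_data_set(AwsAccountId=account_id, DataSetId='YOUR_DATASET_ID')

# EventBridge rule (targets must be removed before the rule can be deleted)
events = boto3.client('events')
events.remove_targets(Rule='resilience-reporting-schedule', Ids=['resilience-reporter'])
events.delete_rule(Name='resilience-reporting-schedule')

# Lambda function and the generated report (placeholder bucket name)
boto3.client('lambda').delete_function(FunctionName='resilience-reporter')
boto3.client('s3').delete_object(Bucket='YOUR_BUCKET_NAME', Key='resilience-report.csv')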
Conclusion
Having a “10,000 foot view” of the resilience posture of applications is crucial for identifying areas of weakness and opportunities for improvement. AWS Resilience Hub runs assessments and provides recommendations tailored to individual applications, but to have a comprehensive understanding, data from Resilience Hub can be collected, aggregated, and analyzed. This can be achieved using Resilience Hub APIs and creating a dashboard using Amazon QuickSight, providing leadership and decision makers with a big-picture view of the resilience posture of their entire portfolio of products and services.