AWS Cloud Operations & Migrations Blog

How to use the AWS Resilience Hub score

Time to read 10 minutes
Time to complete 1 hour
Cost to complete $15 per day (WordPress Multi-AZ application, AWS Resilience Hub application and recommendations)
Learning level 200 – Intermediate
Services used AWS Resilience Hub, AWS CloudFormation, Amazon CloudWatch, AWS Fault Injection Simulator

AWS Resilience Hub provides a central place to define, validate, and track the resiliency of your AWS applications using AWS Well-Architected best practices. Customers can get a comprehensive view of their overall application portfolio resilience status, their associated resilience scores, and actionable recommendations.

This resilience score is designed to assess the readiness of your environment for resilience, not only from a technical perspective with resiliency policies and alarms, but also by validating the successful completion of your recommended Standard Operating Procedures (SOPs) and fault injection experiments with AWS Fault Injection Simulator (FIS).

In this post, I will show you how to leverage the resilience score to improve the resilience of your applications.

You will learn how to reach and maintain a resilience score of 100%, customize the recommendations that Resilience Hub makes in your assessments, and investigate unexpected drops in your score. Note: getting a score of 100% is not mandatory for every customer. For example, if you do not wish to use FIS experiments, then your maximum reachable score will be 80%.

We will use a multi-tier, scalable WordPress solution to illustrate how the resilience score works. The associated AWS CloudFormation templates are available in the AWS CloudFormation documentation. Choose your AWS Region (today I will be using us-west-2, Oregon) and deploy the WordPress scalable and durable template.

Figure 1: A multi-tier, scalable Amazon EC2 and Amazon RDS-based environment

Prerequisites

How the AWS Resilience Hub score works

Before diving into the resilience score itself, let us review the main steps of the Resilience Hub workflow. After you add an application to Resilience Hub, you can run an assessment on the supported application components and receive resiliency and operational recommendations.

Resiliency recommendations help you meet or optimize the RPO and RTO targets defined in your resiliency policy at multiple levels, called disruption types: Application, Infrastructure, Availability Zone (AZ), and Region.

Operational recommendations will provide you with Alarms, SOPs and FIS experiments, all deployable within minutes with AWS CloudFormation.

The resilience score is an Amazon CloudWatch metric ranging from 0 (minimum) to 100 (maximum) that is calculated every time a new assessment is run. Assessments can be run manually or on a daily basis with scheduled assessments.
To get a resilience score of 100%, your application must:

  • Be fully compliant with your configured resiliency policy for all the disruption types. If your application is deployed within a single region, the optional region disruption type will be ignored and will not impact your score.
  • Have its recommended alarms both implemented and in the ‘OK’ or ‘Alarm – not missing data’ state.
  • Have its recommended SOPs both implemented and successfully executed within the past 30 days.
  • Have its recommended FIS experiments both implemented and successfully executed within the past 30 days.

As mentioned earlier, getting a score of 100% may not be a requirement for your organization. If you do not wish to implement all the recommendations provided by Resilience Hub, your actual target will be lower.

The following table shows the weight for each of these recommendations:

Recommendation type Weight
Meeting resiliency policy 40 percent
Alarms 20 percent
SOPs 20 percent
FIS experiments 20 percent

The resiliency policy recommendation, which accounts for 40% of the total resilience score, is calculated based on the disruption types with the following weights:

Disruption type Weight
Region 10 percent
Availability Zone 20 percent
Infrastructure 30 percent
Application 40 percent

Any non-compliant disruption type, triggered alarm or failed SOP/FIS experiment will result in partial points being granted during the assessment.
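The two weight tables above can be turned into a rough score estimate. The sketch below is only an illustration: it assumes that partial points are proportional to the fraction of alarms, SOPs and FIS experiments in a good state, and that when the Region disruption type is ignored the remaining disruption weights are renormalized. Neither assumption is confirmed by this post, and the function names are hypothetical; Resilience Hub's own calculation is the reference.

```python
from __future__ import annotations

# Weights copied from the two tables above (percent).
RECOMMENDATION_WEIGHTS = {"policy": 40, "alarms": 20, "sops": 20, "fis": 20}
DISRUPTION_WEIGHTS = {
    "Region": 10,
    "Availability Zone": 20,
    "Infrastructure": 30,
    "Application": 40,
}


def policy_fraction(compliant: dict[str, bool]) -> float:
    """Fraction of the policy points earned (0.0 to 1.0).

    Disruption types missing from `compliant` are ignored, for example
    Region in a single-Region deployment.
    ASSUMPTION: the weights of the remaining types are renormalized.
    """
    total = sum(DISRUPTION_WEIGHTS[name] for name in compliant)
    earned = sum(DISRUPTION_WEIGHTS[name] for name, ok in compliant.items() if ok)
    return earned / total if total else 0.0


def estimate_score(
    compliant: dict[str, bool], alarms: float, sops: float, fis: float
) -> float:
    """Estimate the score (0 to 100).

    `alarms`, `sops` and `fis` are fractions between 0 and 1.
    ASSUMPTION: partial points scale linearly with these fractions.
    """
    fractions = {
        "policy": policy_fraction(compliant),
        "alarms": alarms,
        "sops": sops,
        "fis": fis,
    }
    return sum(RECOMMENDATION_WEIGHTS[key] * fractions[key] for key in RECOMMENDATION_WEIGHTS)


if __name__ == "__main__":
    single_region = {
        "Availability Zone": True,
        "Infrastructure": True,
        "Application": True,
    }
    # Policy met, nothing else implemented yet.
    print(estimate_score(single_region, alarms=0, sops=0, fis=0))  # 40.0
    # Everything except FIS experiments.
    print(estimate_score(single_region, alarms=1, sops=1, fis=0))  # 80.0
```

The second call reproduces the 80% ceiling mentioned earlier for teams that choose not to run FIS experiments.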

For more information on the resilience score, please refer to the Resilience Hub documentation.

Example: Improving the resilience of a multi-tier application using the AWS Resilience Hub score

In this example I have deployed a multi-tier WordPress application through CloudFormation and added the resulting stack to Resilience Hub. For this scenario I am using a suggested resiliency policy named Critical Application, which is a single-Region policy with a 1h RPO/RTO for the Infrastructure and Availability Zone disruption types, and a 1h RPO / 4h RTO for the Application disruption type.

Refer to Measure and Improve Your Application Resilience with AWS Resilience Hub to get started with Resilience Hub.

Step 1: Run your first assessment

Figure 2: Running our first resiliency assessment from the Resilience Hub console

Our first step is to run our very first assessment. The assessment looks at the application components (in our case, the database instance and the web server group), validates them against our resiliency policy, and produces actionable recommendations.
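This post runs the assessment from the console, but the same step could be scripted. The sketch below assumes the boto3 `resiliencehub` client and its `start_app_assessment` and `describe_app_assessment` operations, and it assumes your application has already been published (version `"release"`). The function name, the polling values and the injected `client` parameter are my own choices for illustration, not something prescribed by Resilience Hub.

```python
import time


def run_assessment(client, app_arn, name, poll_seconds=15, max_polls=80, sleep=time.sleep):
    """Start an assessment and wait until it succeeds or fails.

    `client` is expected to behave like boto3.client("resiliencehub").
    Returns the final assessment dictionary as reported by the service.
    """
    started = client.start_app_assessment(
        appArn=app_arn,
        appVersion="release",  # ASSUMPTION: the app has a published version
        assessmentName=name,
    )
    assessment_arn = started["assessment"]["assessmentArn"]

    for _ in range(max_polls):
        response = client.describe_app_assessment(assessmentArn=assessment_arn)
        assessment = response["assessment"]
        status = assessment["assessmentStatus"]
        if status == "Success":
            return assessment
        if status == "Failed":
            raise RuntimeError(f"Assessment {assessment_arn} failed")
        sleep(poll_seconds)  # still Pending or InProgress

    raise TimeoutError(f"Assessment {assessment_arn} did not finish in time")
```

With boto3 installed and credentials configured, you could call `run_assessment(boto3.client("resiliencehub", region_name="us-west-2"), app_arn, "first-assessment")`, where `app_arn` is the ARN of your own application, and then read the score details from the returned dictionary.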

Since this is our first resiliency assessment, I am not expecting to get any points for the alarms, SOPs and FIS experiments (20% each): the tool is only now about to give me its first recommendations. If your application meets your resiliency policy for all the disruption types (Application, Infrastructure, Availability Zone and, optionally, Region), you can expect to get a 40% score for now.

Figure 3: Resilience score graph after a first resiliency assessment

Figure 4: Meeting our resiliency policy for every disruption type, in a single region deployment.

Note: If any of the disruption types displayed in Figure 4 had not satisfied the requirements of our resiliency policy, we would have received only partial points for the resiliency policy recommendation.

Step 2: Implementing alarms, SOPs and FIS experiment templates

The assessment report includes the operational recommendations that are now deployable with CloudFormation. I recommend that you start with the alarms first, as CloudWatch alarms will be used by FIS experiment templates to validate the tests later.

The recommendations provided by Resilience Hub will be specific to your environment. Here you will notice that Resilience Hub has provided several FIS experiments to test our Amazon Relational Database Service (Amazon RDS) database, Amazon EC2 Auto Scaling group and Multi-AZ design. I have also received 10 recommended alarms and 3 SOPs (not shown here).

Figure 5: Resilience Hub recommended fault injection experiments for a multi-AZ application

Figure 6: An AWS CloudFormation stack with its associated 3 recommendation templates from Resilience Hub

Step 3: Implementing prerequisites for the alarms

Some alarms will require manual configuration to work properly. For example, specific alarms may need operational metrics from your Amazon Elastic Compute Cloud (Amazon EC2) instances, such as memory utilization, which require a specific CloudWatch agent configuration.
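To give an idea of what such an agent configuration can look like, here is a minimal sketch that generates a CloudWatch agent JSON file collecting memory utilization. The `mem` section and the `mem_used_percent` measurement are standard CloudWatch agent settings, but the collection interval and the helper name are assumptions of mine, and the namespace or dimensions that a given recommended alarm expects may differ. Follow the setup instructions that Resilience Hub attaches to the alarm if they disagree with this example.

```python
import json


def memory_agent_config(interval_seconds: int = 60) -> dict:
    """Build a minimal CloudWatch agent configuration for memory metrics."""
    return {
        "metrics": {
            # Tag each data point with the instance that produced it.
            "append_dimensions": {"InstanceId": "${aws:InstanceId}"},
            "metrics_collected": {
                "mem": {
                    "measurement": ["mem_used_percent"],
                    "metrics_collection_interval": interval_seconds,
                }
            },
        }
    }


if __name__ == "__main__":
    # Print the JSON so it can be saved as the agent configuration file.
    print(json.dumps(memory_agent_config(), indent=2))
```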

You can access the setup instructions by clicking on the red “Configuration” warning sign.

Figure 7: Example of a recommended alarm requiring manual configuration.

Step 4: Customizing alarms, SOPs and FIS experiment templates

You may need to customize your recommendation settings to get a proper resilience strategy that fits your environment. Take some time to review and customize your alarms and FIS experiment templates based on your requirements.

For example, you may want to extend the duration of your stress tests, terminate a specific process, or update the expected recovery time in your FIS experiment template.

Step 5: Validating alarms, SOPs and FIS experiments

Now that you have deployed and configured all the recommendations provided by Resilience Hub, you will need to successfully run your SOPs and FIS experiments to increase your score. Note that your CloudWatch alarms must also be in the ‘OK’ or ‘Alarm – not missing data’ state to receive the maximum resilience score for your application.

Your resilience score will update on the main dashboard after your next assessment.

You will need to run your SOPs and FIS experiments at least every 30 days to keep your resilience score from drifting.
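A small reminder script can help you stay inside the 30-day window. The sketch below assumes you keep your own record of the last successful run of each SOP and FIS experiment (the names and timestamps in the example are hypothetical) and simply lists the items that are due.

```python
from datetime import datetime, timedelta, timezone

# Window stated in this post for SOPs and FIS experiments.
MAX_AGE = timedelta(days=30)


def stale_items(last_success, now=None):
    """Return the names whose last successful run is missing or older than 30 days.

    `last_success` maps a name to a timezone-aware datetime, or to None
    when the item has never been run successfully.
    """
    now = now or datetime.now(timezone.utc)
    return sorted(
        name
        for name, finished_at in last_success.items()
        if finished_at is None or now - finished_at > MAX_AGE
    )


if __name__ == "__main__":
    today = datetime(2023, 3, 1, tzinfo=timezone.utc)
    runs = {
        "sop-restore-database": datetime(2023, 2, 20, tzinfo=timezone.utc),
        "fis-az-failure": datetime(2023, 1, 10, tzinfo=timezone.utc),
        "fis-cpu-stress": None,
    }
    print(stale_items(runs, now=today))  # ['fis-az-failure', 'fis-cpu-stress']
```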

Figure 8: Application score resilience dashboard

Troubleshooting a drifting resilience score

Resilience Hub is a service that you can use to frequently assess the resilience of your infrastructure and the status of your SOPs and FIS experiments. Achieving a score of 100% is an important first step, but remember that without proper maintenance your score may decrease over time.

Here are some of the common explanations for a drifting resilience score:

  1. Your application is no longer meeting your resiliency policy: check the resiliency recommendations section of your latest assessment to learn more or verify that your resiliency policy was not updated by another administrator.
  2. One or more of your SOPs or FIS experiments have failed to complete: it is crucial for an application to continue to operate after unexpected events. If your application takes too long to scale out or recover, or stops operating during the test campaign, your experiments will fail and your score will decrease.
  3. You have not run one or more SOPs or FIS experiments in the past 30 days: it is important to periodically test your resiliency strategy to confirm that your resilience mechanisms remain up to date and are able to prevent issues proactively.
  4. One or more of your alarms have been triggered: you will need to investigate your application or potentially customize your alarm settings to make them relevant to your environment.
  5. New recommendations are available in your latest assessment: Resilience Hub may recommend new alarms, SOPs or FIS experiments as your application evolves and grows. Check the operational recommendations section of your latest assessment and confirm that nothing is in the “Not implemented” state.
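The checklist above can be paired with a simple comparison of two assessments. The sketch below assumes you have noted the points earned per recommendation type for the previous and the latest assessment (the dictionaries and key names are hypothetical, not a Resilience Hub data format) and points you to the matching item in the list.

```python
# Hints summarizing the troubleshooting list above.
HINTS = {
    "policy": "Check the resiliency recommendations and whether the policy was changed.",
    "alarms": "Look for triggered alarms or newly recommended alarms not yet implemented.",
    "sops": "Look for failed SOPs or SOPs not run in the past 30 days.",
    "fis": "Look for failed FIS experiments or experiments not run in the past 30 days.",
}


def explain_drop(previous, latest):
    """Return (component, points_lost, hint) tuples, largest loss first."""
    findings = []
    for component, before in previous.items():
        lost = before - latest.get(component, 0.0)
        if lost > 0:
            hint = HINTS.get(component, "No hint available.")
            findings.append((component, lost, hint))
    return sorted(findings, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    previous = {"policy": 40, "alarms": 20, "sops": 20, "fis": 20}
    latest = {"policy": 40, "alarms": 15, "sops": 20, "fis": 0}
    for component, lost, hint in explain_drop(previous, latest):
        print(f"{component}: -{lost} points. {hint}")
```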

Cleanup

If you deployed a test application to discover Resilience Hub, do not forget to delete any existing resources to avoid unnecessary charges.

  1. Remove your application from the Resilience Hub dashboard.
  2. Delete the CloudFormation stacks (alarms, SOPs, FIS experiments) deployed from Resilience Hub.
  3. If you used the multi-tier WordPress infrastructure, delete the CloudFormation stack that deployed your application.
  4. Delete any remaining AWS resources that you created to run the recommendations: Amazon Simple Notification Service (Amazon SNS) topics, Amazon CloudWatch Synthetics canaries, etc.
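The first three cleanup steps could also be scripted. The sketch below assumes the boto3 `resiliencehub` and `cloudformation` clients with their `delete_app` and `delete_stack` operations; the function name, its parameters and the stack names you pass in are hypothetical. Note that `delete_stack` only starts the deletion, so you would still need to confirm in the console that each stack is gone, and step 4 remains manual.

```python
def cleanup(resiliencehub, cloudformation, app_arn, recommendation_stacks, app_stack=None):
    """Remove the Resilience Hub application, then delete the related stacks.

    `resiliencehub` and `cloudformation` are expected to behave like the
    boto3 clients of the same names. Returns the list of actions requested.
    """
    actions = []

    # Step 1: remove the application from Resilience Hub.
    resiliencehub.delete_app(appArn=app_arn)
    actions.append(f"delete_app {app_arn}")

    # Step 2: delete the alarm, SOP and FIS experiment stacks.
    for stack_name in recommendation_stacks:
        cloudformation.delete_stack(StackName=stack_name)
        actions.append(f"delete_stack {stack_name}")

    # Step 3: delete the stack of the test application, if one was deployed.
    if app_stack:
        cloudformation.delete_stack(StackName=app_stack)
        actions.append(f"delete_stack {app_stack}")

    return actions
```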

Conclusion

Having good visibility on your application resilience mechanisms and actionable tools to validate your strategy is critical to keep your services operational over time. Assessing your applications and testing your Standard Operating Procedures (SOPs) periodically will help you keep your resilience posture up-to-date and validated.

In this blog post we saw how the resilience score can help you quickly understand the status of your resilience strategy. We learnt how the score is calculated, how to maximize it and troubleshoot drifting scores.

Let us know your feedback and get started with AWS Resilience Hub today.

About the author:

Pierre Collard

Pierre Collard is a Partner Solutions Architect at Amazon Web Services (AWS) based in Toronto, Canada. With extensive experience in networking and cybersecurity, Pierre helps organizations from the Canadian public sector accelerate their digital transformation through the power of cloud computing.