Identifying resilience drift using AWS Resilience Hub

Most people think of disaster recovery as a mechanism to protect their applications against big events. However, in the fast-paced world of development where new code and infrastructure changes are occurring several times a month, it is important to put mechanisms in place to proactively understand impacts to the resilience posture of your applications.

In this post, we discuss how you can use AWS Resilience Hub’s recently launched capability, resilience drift detection, to identify potential changes to your application’s resilience posture and to remedy the issues that caused the drift. The resilience of an application refers to its ability to maintain availability and recover from software and operational disruptions within a specified target. It is commonly measured in terms of Recovery Time Objective (RTO), the acceptable downtime, and Recovery Point Objective (RPO), the acceptable data loss. We review why enabling resilience drift detection matters, walk through an example of setting it up, and interpret its outcome.

What is AWS Resilience Hub and resilience drift detection?

AWS Resilience Hub (ARH) is a managed service that gives you a central place to define, validate, and track the resilience of your AWS applications. It is integrated with AWS Fault Injection Service (FIS), a resilience testing service, which allows you to purposefully inject faults into your applications to see how they respond. Using AWS Resilience Hub, you can assess your applications to uncover potential architectural resilience enhancements. This allows you to validate your applications’ RTO and RPO, and optimize business continuity while potentially reducing recovery costs. AWS Resilience Hub also provides APIs for its assessment and testing, allowing you to add them to your CI/CD pipelines for ongoing resilience validation.

On August 2, 2023, we released a new capability in AWS Resilience Hub: resilience drift detection. Resilience drift detection allows you to proactively detect and react to changes to the resilience posture of your application on AWS. Once you opt in to resilience drift detection, you are notified when your application no longer meets its resilience policy (that is, the application will potentially not meet its RTO and/or RPO). You are also directed to the latest resilience assessment that identified the drifts, and are provided with remediation actions.

Solution overview

We will use an example to walk through the resilience drift detection solution. In the example, we will do the following: 1. describe an application, 2. define a resilience policy, 3. enable resilience drift detection, 4. publish the application and execute a resilience assessment, 5. review the assessment results, and 6. identify resilience drifts from the last steady state. The example is presented in the AWS Console, with the assessment results and drift detection shown in both console and CLI formats.

The application

We begin by adding the application to AWS Resilience Hub. An application is a collection of AWS resources. Resources can be located in multiple AWS Regions and provisioned under multiple AWS accounts. You can describe your application on AWS using AWS CloudFormation, AWS Resource Groups, HashiCorp Terraform, AWS AppRegistry, or as an Amazon Elastic Kubernetes Service (Amazon EKS) cluster.

The following diagram illustrates the application (an e-commerce website) we use in this blog:

Figure 1 – Diagram of the application running in AWS and used throughout this blog

Step 1 – Describe the application in AWS Resilience Hub

We start by clicking ‘Add application’ in the AWS Resilience Hub console page.

In our example, we use an application that is described as an AWS CloudFormation stack. We name our sample application ‘DriftBlogDemoApplication’ and choose the stacks that describe it.

Figure 2 – AWS Resilience Hub page to describe an application

Step 2 – Choose resilience policy and set up permissions

Next, we choose a resilience policy (RTO and RPO) for our application. To learn more about setting RTO and RPO, please visit our blog Establishing RPO and RTO Targets for Cloud Applications. We also set up the permissions that AWS Resilience Hub needs in order to assess the resilience of the application.

Running AWS Resilience Hub’s resilience assessment (step 4 below) will indicate whether the application meets or breaches its resilience policy. ‘Policy met’ indicates that AWS Resilience Hub estimates that the application is configured to recover within its targeted RTO and RPO, and ‘Policy breached’ indicates the opposite.

AWS Resilience Hub offers estimated workload RTO and RPO for four disruption types: Application (loss of a required software service or process), Infrastructure (loss of hardware, such as Amazon EC2 instances), Availability Zone (one or more Availability Zones are unavailable), and Region (one or more Regions are unavailable). For an application to meet its policy, all disruption types must meet their targeted RTO and RPO (please refer to Managing Resilience Policies).

In our example, the policy is set to an RTO of 1 hour and an RPO of 1 hour for application, infrastructure, and Availability Zone disruptions. We didn’t set a Region disruption policy because our sample application is deployed in a single Region.
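If you prefer to script this step, the same targets could be passed to the AWS SDK for Python (boto3) roughly as shown below. This is a minimal sketch rather than the method used in this post: the policy name and the NonCritical tier are placeholders we chose for illustration, and the client argument is assumed to be boto3.client("resiliencehub"). Note that the API refers to application disruptions as Software and to infrastructure disruptions as Hardware.

```python
ONE_HOUR_IN_SECS = 60 * 60


def build_policy_request(name, rto_secs, rpo_secs):
    """Build keyword arguments for the create_resiliency_policy call."""
    target = {"rtoInSecs": rto_secs, "rpoInSecs": rpo_secs}
    return {
        "policyName": name,
        "tier": "NonCritical",  # assumption: placeholder tier for illustration
        "policy": {
            "Software": dict(target),  # application disruptions
            "Hardware": dict(target),  # infrastructure disruptions
            "AZ": dict(target),        # Availability Zone disruptions
            # No "Region" entry: the sample application runs in a single Region.
        },
    }


def create_policy(client, request):
    """Send the request; `client` is assumed to be boto3.client("resiliencehub")."""
    response = client.create_resiliency_policy(**request)
    return response["policy"]["policyArn"]


# "DriftBlogDemoPolicy" is a hypothetical name, not one used in the console steps.
request = build_policy_request("DriftBlogDemoPolicy", ONE_HOUR_IN_SECS, ONE_HOUR_IN_SECS)
```

With credentials configured, calling create_policy with a real client and this request would return the ARN of the new policy, which can then be attached to the application.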

Note: In addition to indicating whether the application would meet its resilience policy, AWS Resilience Hub generates resilience recommendations (e.g., AWS resource configuration changes) and operational recommendations (e.g., Amazon CloudWatch alarms, AWS Fault Injection Service experiments, and recovery Standard Operating Procedures). The purpose of the recommendations is to improve the resilience of the application and help it move from a ‘Policy breached’ to a ‘Policy met’ state.

To learn more about AWS Resilience Hub’s resilience and operational recommendations, please visit ‘Reviewing assessment report’.

Figure 3 – Setting up a resilience policy in ARH

Step 3 – Enable resilience drift detection

We enable resilience drift detection by selecting both ‘Automatically assess this application daily’ and ‘Get notification of any resilience policy breach’, as shown in Figure 4. The AWS Resilience Hub resilience assessment will execute every 24 hours and notify you if the assessment result moves from ‘Policy met’ to ‘Policy breached’.

Resilience drift detection uses an Amazon Simple Notification Service (Amazon SNS) topic to alert you to potential drifts (for example, by sending an email notification or triggering an AWS Lambda function). Therefore, we provide an Amazon SNS topic and give AWS Resilience Hub permissions to publish to it.

Figure 4 – Setting up resilience drift detection in ARH
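For readers working with the API instead of the console, the two checkboxes correspond roughly to a daily assessment schedule and an event subscription on the application. The sketch below assumes the boto3 create_app call with its assessmentSchedule and eventSubscriptions parameters; the subscription name is hypothetical, the ARNs are placeholders you would supply, and the permission model is left out for brevity.

```python
def build_app_request(app_name, policy_arn, sns_topic_arn):
    """Build keyword arguments for create_app with drift notifications enabled."""
    return {
        "name": app_name,
        "policyArn": policy_arn,
        # Run the resilience assessment automatically every day.
        "assessmentSchedule": "Daily",
        # Publish a message to the SNS topic when a drift is detected.
        "eventSubscriptions": [
            {
                "name": "drift-detected",  # hypothetical subscription name
                "eventType": "DriftDetected",
                "snsTopicArn": sns_topic_arn,
            }
        ],
    }


def create_app(client, request):
    """Send the request; `client` is assumed to be boto3.client("resiliencehub")."""
    return client.create_app(**request)["app"]["appArn"]
```

The topic’s access policy still needs to allow AWS Resilience Hub to publish to it, as described above.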

Step 4 – Publish and assess the application

Once resilience drift detection is enabled, we make sure that we have included all the application resources we want to assess (and excluded the rest), and then publish the application. For the full list of supported AWS resources please visit https://docs.aws.amazon.com/resilience-hub/latest/userguide/supported-resources.html.

Figure 5 – Publish the described application in ARH

We then run the resilience assessment. To view the results, click on the assessment’s name in the ‘Assessment’ tab.
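The same two actions, publishing and assessing, can also be driven from code. The following is a minimal sketch that assumes the boto3 publish_app_version and start_app_assessment calls, and assumes that the published version is addressed as "release"; the assessment name is whatever you choose.

```python
def publish_and_assess(client, app_arn, assessment_name):
    """Publish the draft application version, then start an assessment on it.

    `client` is assumed to be boto3.client("resiliencehub").
    """
    client.publish_app_version(appArn=app_arn)
    response = client.start_app_assessment(
        appArn=app_arn,
        appVersion="release",  # assumption: identifier of the published version
        assessmentName=assessment_name,
    )
    # The assessment runs asynchronously; keep the ARN to look up results later.
    return response["assessment"]["assessmentArn"]
```

Because the assessment runs in the background, the returned ARN is what later calls use to fetch compliance and recommendation details.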

From this point onward, the example we use in this blog will be presented for both AWS Console and CLI users. To learn more about using the AWS Resilience Hub APIs, please visit our API guide at https://docs.aws.amazon.com/resilience-hub/latest/userguide/using-api.html.

For CLI users, executing the list-apps and list-app-assessments APIs will generate the following output, indicating that the resilience drift status is ‘Not drifted’:

Figure 6 – CLI outcome of list-apps API showing that no resilience drift was detected
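The same check can be scripted. The sketch below assumes the boto3 list_apps call and reads the driftStatus field of each application summary; as far as we are aware the API reports values such as NotChecked, NotDetected, and Detected rather than the console wording, so treat the exact strings as something to confirm in the API reference.

```python
def drift_status_by_app(client):
    """Return {application name: drift status} across all pages of list_apps.

    `client` is assumed to be boto3.client("resiliencehub").
    """
    statuses = {}
    kwargs = {}
    while True:
        page = client.list_apps(**kwargs)
        for summary in page.get("appSummaries", []):
            # Treat a missing field as "not checked yet".
            statuses[summary["name"]] = summary.get("driftStatus", "NotChecked")
        token = page.get("nextToken")
        if not token:
            return statuses
        kwargs = {"nextToken": token}
```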

Step 5 – Review assessment results and implement recommendations

As discussed above, AWS Resilience Hub’s resilience assessment will indicate whether or not the application meets or breaches its resilience policy. In our example, the policy is met with an estimated workload RTO of 30 minutes and RPO of 1 hour (vs. policy of RTO of 1 hour and RPO of 1 hour).

Figure 7 – ARH resilience assessment results indicating that the application meets its resilience policy

CLI users can run the list-app-component-compliances and list-app-component-recommendations APIs, which return no indication that a drift was detected.

Figure 8 – CLI outcome of API list-app-component-compliances indicating that resilience policy is met

Step 6 – Detecting drifts and making adjustments

If resilience drift detection is enabled (see step 3 above), a change in the ‘Compliance status’ (i.e., the status moving from ‘Policy met’ to ‘Policy breached’) is flagged in the application’s state and published to the selected SNS topic.

Figure 9 – ARH resilience indicates that application has drifted from its resilience policy
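If the SNS topic is subscribed to an AWS Lambda function, the notification arrives in the standard SNS event envelope. The handler below is only a sketch: it logs the subject and message of each record and makes no assumption about the wording or structure of the message that AWS Resilience Hub publishes. What to do next (open a ticket, page an on-call engineer, and so on) is left to you.

```python
def lambda_handler(event, context):
    """Log each SNS notification delivered to this function."""
    notifications = []
    for record in event.get("Records", []):
        sns = record.get("Sns", {})
        notifications.append(
            {"subject": sns.get("Subject"), "message": sns.get("Message")}
        )
        # Output written with print() ends up in the function's CloudWatch logs.
        print(f"Drift notification received: {sns.get('Subject')}: {sns.get('Message')}")
    return {"received": len(notifications), "notifications": notifications}
```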

From the CLI, the API list-app-assessments will return the following, indicating that a drift is detected:

Figure 10 – CLI output of API list-app-assessments indicates that resilience drift has been detected
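A scripted version of this lookup might filter the assessment summaries for drift. The sketch below assumes the boto3 list_app_assessments call and its driftStatus and complianceStatus summary fields; for brevity it reads only the first page of results.

```python
def drifted_assessments(client, app_arn):
    """Return (assessment name, compliance status) for assessments with drift.

    `client` is assumed to be boto3.client("resiliencehub").
    """
    response = client.list_app_assessments(appArn=app_arn)
    return [
        (summary.get("assessmentName"), summary.get("complianceStatus"))
        for summary in response.get("assessmentSummaries", [])
        if summary.get("driftStatus") == "Detected"
    ]
```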

In our example, the Amazon Elastic Block Store (Amazon EBS) backup schedule changed and caused the estimated RPO to drift from 1 hour to 2 hours.

Figure 11 – Resilience assessment of ARH indicates that EBS backup schedule caused the resilience drift
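The arithmetic behind the breach is simple enough to write out. The sketch below is an illustration only, not AWS Resilience Hub’s estimation logic: it compares the policy targets (RTO and RPO of 1 hour) with the estimates before and after the change, and it assumes the estimated RTO stayed at 30 minutes, which the example does not state explicitly.

```python
HOUR_IN_SECS = 3600


def status(target, estimate):
    """Return the compliance label for one set of RTO/RPO estimates."""
    met = estimate["rto"] <= target["rto"] and estimate["rpo"] <= target["rpo"]
    return "Policy met" if met else "Policy breached"


policy = {"rto": 1 * HOUR_IN_SECS, "rpo": 1 * HOUR_IN_SECS}
# Estimates from the first assessment: RTO of 30 minutes, RPO of 1 hour.
before = {"rto": HOUR_IN_SECS // 2, "rpo": 1 * HOUR_IN_SECS}
# After the backup schedule change the estimated RPO is 2 hours
# (assumption: the estimated RTO is unchanged).
after = {"rto": HOUR_IN_SECS // 2, "rpo": 2 * HOUR_IN_SECS}

print(status(policy, before))  # Policy met
print(status(policy, after))   # Policy breached
```

An estimate equal to its target still counts as met, which is why the original RPO of exactly 1 hour satisfied the policy.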

From the CLI, the list-app-component-compliances API will generate the following outcome, indicating that the resilience policy is breached.

Figure 12 – CLI outcome of API list-app-component-compliances indicating that resilience policy is breached
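To pick out only the components that breach the policy, the response can be filtered on each component’s status. This sketch assumes the boto3 list_app_component_compliances call and its componentCompliances list with appComponentName and status fields, and it reads only the first page of results.

```python
def breached_components(client, assessment_arn):
    """Return the names of application components that do not meet the policy.

    `client` is assumed to be boto3.client("resiliencehub").
    """
    response = client.list_app_component_compliances(assessmentArn=assessment_arn)
    return [
        component["appComponentName"]
        for component in response.get("componentCompliances", [])
        if component.get("status") == "PolicyBreached"
    ]
```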

The list-app-assessment-compliance-drifts API returns the expected vs. current RTO and RPO values for each component (the example refers to the EBS volume).

Figure 13 – CLI outcome of API list-app-assessment-compliance-drifts indicating that EBS Volume is causing the drift
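The expected and current values can be lined up side by side in code. The sketch below assumes the boto3 list_app_assessment_compliance_drifts call, and assumes that each drift carries expectedValue and actualValue maps keyed by disruption type with currentRtoInSecs and currentRpoInSecs entries; please confirm the field names against the API reference before relying on them.

```python
def summarize_drifts(client, assessment_arn):
    """Return one row per drifted entity and disruption type.

    `client` is assumed to be boto3.client("resiliencehub").
    """
    response = client.list_app_assessment_compliance_drifts(assessmentArn=assessment_arn)
    rows = []
    for drift in response.get("complianceDrifts", []):
        expected = drift.get("expectedValue", {})
        actual = drift.get("actualValue", {})
        # Cover disruption types that appear on either side of the comparison.
        for disruption in sorted(set(expected) | set(actual)):
            rows.append(
                {
                    "entity": drift.get("entityId"),
                    "disruption": disruption,
                    "expected_rto_secs": expected.get(disruption, {}).get("currentRtoInSecs"),
                    "actual_rto_secs": actual.get(disruption, {}).get("currentRtoInSecs"),
                    "expected_rpo_secs": expected.get(disruption, {}).get("currentRpoInSecs"),
                    "actual_rpo_secs": actual.get(disruption, {}).get("currentRpoInSecs"),
                }
            )
    return rows
```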

Furthermore, AWS Resilience Hub will generate resilience recommendations that help you update your application in order to return it to the ‘Policy met’ state. In our example, the changes are related to the EBS volume and can be optimized for minimal changes, cost, or Availability Zone RTO/RPO.

Figure 14 – Resilience assessment in ARH generated recommendations to move the application back to ‘policy met’ state

For CLI users, the API call list-app-component-recommendations will return the recommended remedy for the EBS Volume; in our case, “Modify the frequency of AWS Backup plan associated with your EBS volume to comply with your defined RPO target”.
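The recommendations can be collected per component in a few lines. This sketch assumes the boto3 list_app_component_recommendations call and its componentRecommendations list, where each entry holds configRecommendations with optimizationType, description, and suggestedChanges fields; which of these fields carries the remedy text quoted above is not something we rely on here.

```python
def recommendations_by_component(client, assessment_arn):
    """Return {component name: [recommendation details, ...]}.

    `client` is assumed to be boto3.client("resiliencehub").
    """
    response = client.list_app_component_recommendations(assessmentArn=assessment_arn)
    result = {}
    for component in response.get("componentRecommendations", []):
        result[component["appComponentName"]] = [
            {
                "optimization": rec.get("optimizationType"),
                "description": rec.get("description"),
                "suggested_changes": rec.get("suggestedChanges", []),
            }
            for rec in component.get("configRecommendations", [])
        ]
    return result
```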

Summary

In this blog, we reviewed a new capability in AWS Resilience Hub – resilience drift detection. Resilience drift detection allows you to identify and receive alerts for changes to the configuration of your application that might impact its resilience posture and cause it to potentially breach its resilience policy (i.e. not meet the application’s RTO and RPO). We introduced how to enable resilience drift detection, as well as how to use it for both AWS Console and CLI users.

Once enabled, resilience drift detection runs on a daily basis, as well as any time you execute a resilience assessment. You can also integrate the assessment into your CI/CD pipelines. Resilience drift detection identifies changes in the application’s resilience posture. To help remediate drifts, AWS Resilience Hub generates both resilience and operational recommendations.

To get started, please visit https://aws.amazon.com/resilience-hub

AWS Cloud Operations & Migrations Blog
