Track application resiliency in public sector organizations using AWS Resilience Hub

Because public-serving organizations focus primarily on workloads that affect citizens, application resiliency is of paramount importance to them. For many mission-critical workloads, the service provider has to ensure that the entire system and its processes are monitored 24/7/365 and that notifications and support are maintained around the clock. Moreover, there are often stringent legal requirements, service level agreements (SLAs), and contractual obligations between multiple stakeholders. In such circumstances, it becomes imperative to have a robust mechanism in place for managing and improving the resilience posture of your applications.

Amazon Web Services (AWS) Resilience Hub provides you with a single place to define your resilience goals, assess your resilience posture against those goals, and implement recommendations for improvement. In this post, we discuss how you can track the resiliency of software applications and infrastructure using AWS Resilience Hub to provide “always available” services and monitor changes to application availability. We also highlight the need to differentiate between cloud service providers’ SLAs and business application SLAs.

Identifying the recovery time objective (RTO) and recovery point objective (RPO)

Determining how to protect and recover an application can often be easier than determining how quickly your business needs that application recovered. Establishing the correct recovery objective targets at an application level is a critical part of business continuity planning. In the public sector, applications and their supporting infrastructure, whether on premises or in the cloud, should be able to meet the RTO and RPO targets set for those applications.

RTO is a measure of how quickly an application must be available again after an outage. RPO refers to how much data loss your application can tolerate. Another way to think about RPO is how old the data can be when this application is recovered. With both RTO and RPO, the targets are measured in hours, minutes, or seconds, with lower numbers representing less downtime or less data loss. Within the context of a business continuity plan, applications having similar RTO targets are grouped together in tiers, with Tier 0 having the lowest RTO.
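As an illustration of the tiering idea, the following Python sketch groups applications into tiers by their RTO targets. The tier boundaries, the application names, and the `assign_tier` and `group_by_tier` helpers are hypothetical and chosen only for illustration; your business continuity plan defines the real thresholds.

```python
from collections import defaultdict

# Hypothetical tier boundaries: (tier number, maximum RTO in seconds).
# These thresholds are illustrative assumptions, not AWS-defined values.
TIER_BOUNDARIES = [
    (0, 5 * 60),       # Tier 0: RTO of up to 5 minutes
    (1, 60 * 60),      # Tier 1: RTO of up to 1 hour
    (2, 4 * 60 * 60),  # Tier 2: RTO of up to 4 hours
]
LOWEST_TIER = 3        # any application with a longer RTO target


def assign_tier(rto_seconds: int) -> int:
    """Return the first tier whose RTO ceiling covers the given target."""
    for tier, max_rto in TIER_BOUNDARIES:
        if rto_seconds <= max_rto:
            return tier
    return LOWEST_TIER


def group_by_tier(rto_targets: dict) -> dict:
    """Group application names by tier number."""
    tiers = defaultdict(list)
    for app_name, rto_seconds in sorted(rto_targets.items()):
        tiers[assign_tier(rto_seconds)].append(app_name)
    return dict(tiers)


if __name__ == "__main__":
    # Hypothetical applications with RTO targets in seconds.
    targets = {
        "emergency-dispatch": 120,
        "benefits-portal": 3600,
        "archive-search": 86400,
    }
    print(group_by_tier(targets))
```

Keeping the boundaries in one table makes it straightforward to adjust tiers as the business continuity plan evolves.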

A straight blue line represents time. In the middle of the line is a flame that signifies disaster. To the left of the flame is the recovery point (RPO), with data loss spanning the time from the recovery point to the disaster. To the right of the flame is the recovery time (RTO), with downtime spanning the time from the disaster to recovery.

Figure 1. Data loss is measured from the most recent backup to the point of disaster. Downtime is measured from the point of disaster until it is fully recovered and available for service.
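The two measurements in Figure 1 can be expressed as simple timestamp arithmetic. The sketch below is a minimal illustration, not part of AWS Resilience Hub; the function names and the sample timestamps are assumptions made for this example.

```python
from datetime import datetime, timedelta


def measure_outage(last_backup: datetime, disaster: datetime, recovered: datetime):
    """Return (data_loss, downtime) as timedeltas, following Figure 1."""
    if not last_backup <= disaster <= recovered:
        raise ValueError("expected last_backup <= disaster <= recovered")
    data_loss = disaster - last_backup  # changes made after the last backup are lost
    downtime = recovered - disaster     # time the service was unavailable
    return data_loss, downtime


def meets_targets(data_loss: timedelta, downtime: timedelta,
                  rpo: timedelta, rto: timedelta) -> bool:
    """True when the observed data loss and downtime are within the RPO and RTO."""
    return data_loss <= rpo and downtime <= rto


if __name__ == "__main__":
    # Hypothetical incident timeline.
    loss, down = measure_outage(
        last_backup=datetime(2024, 1, 1, 2, 0),
        disaster=datetime(2024, 1, 1, 2, 45),
        recovered=datetime(2024, 1, 1, 4, 15),
    )
    print(loss, down, meets_targets(loss, down, timedelta(hours=1), timedelta(hours=2)))
```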

Most public sector decisions are made with a long-term vision for using and running applications, which makes it a difficult but critical task for application owners to track the resiliency and uptime of applications over time. To learn more about establishing RTO and RPO for your applications, read the post Establishing RTO and RPO Targets for Cloud Applications.

How AWS Resilience Hub helps in tracking application resiliency

AWS Resilience Hub helps customers establish RPO and RTO targets per application and then analyze applications against those targets. RPO and RTO targets are defined in resiliency policies within AWS Resilience Hub. This can be done by selecting from a list of predefined policies, as shown in Figure 2.

screenshot showing suggested resiliency policies in the AWS Resilience Hub with seven different policies that range from non-critical applications up to foundational core services

Figure 2. RPO and RTO objectives can be defined in resiliency policies within AWS Resilience Hub by selecting from a list of predefined policies, as shown in this screenshot.

It can also be done by creating a custom policy based on your business needs, as shown in Figure 3. You can set targets for several disruption types: application disruption, dependency on a single piece of infrastructure, loss of an Availability Zone, and, optionally, loss of a Region.

Figure 3. Screenshot when setting up a custom policy in AWS Resilience Hub.
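A custom policy can also be created programmatically. The sketch below only builds the request; it assumes the boto3 `resiliencehub` client's `create_resiliency_policy` operation with `policyName`, `tier`, and a `policy` map keyed by `Software`, `Hardware`, `AZ`, and `Region`, so verify the parameter names and allowed values against the current API reference before use. The `build_policy_request` helper, the policy name, and the target values are hypothetical.

```python
REQUIRED_DISRUPTIONS = {"Software", "Hardware", "AZ"}
OPTIONAL_DISRUPTIONS = {"Region"}  # Region targets are optional, as in the console


def build_policy_request(name: str, tier: str, targets_in_secs: dict) -> dict:
    """Build keyword arguments for a create_resiliency_policy call.

    targets_in_secs maps a disruption type to an (rto_seconds, rpo_seconds) pair.
    """
    missing = REQUIRED_DISRUPTIONS - targets_in_secs.keys()
    if missing:
        raise ValueError(f"missing targets for: {sorted(missing)}")
    unknown = targets_in_secs.keys() - REQUIRED_DISRUPTIONS - OPTIONAL_DISRUPTIONS
    if unknown:
        raise ValueError(f"unknown disruption types: {sorted(unknown)}")

    policy = {}
    for disruption, (rto, rpo) in targets_in_secs.items():
        if rto < 0 or rpo < 0:
            raise ValueError("RTO and RPO must be non-negative")
        policy[disruption] = {"rtoInSecs": rto, "rpoInSecs": rpo}
    return {"policyName": name, "tier": tier, "policy": policy}


if __name__ == "__main__":
    request = build_policy_request(
        "custom-citizen-portal",  # hypothetical policy name
        "Critical",               # assumed to be a valid tier value
        {"Software": (3600, 900), "Hardware": (3600, 900), "AZ": (3600, 900)},
    )
    print(request)
    # With AWS credentials configured, the request could then be sent:
    #   import boto3
    #   boto3.client("resiliencehub").create_resiliency_policy(**request)
```

Separating request construction from the API call keeps the targets reviewable and testable without AWS access.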

Resiliency policies are assigned to one or more applications, creating a tier. Applications are then assessed against their tier’s targets through a direct request from a user, by a scheduled assessment, or as part of your continuous integration and delivery (CI/CD) pipeline. The resiliency assessment uses best practices from the AWS Well-Architected Framework to analyze the components of the application and uncover potential resiliency weaknesses. Following this assessment, AWS Resilience Hub provides a breakdown of which individual components within your application meet, exceed, or fall short of the targeted objective. AWS Resilience Hub then provides recommendations on how to remediate those components to bring them in line with the policy, along with the estimated resulting RTO, RPO, and costs for each remediation option.

Although you can run assessments from the AWS console, we recommend integrating AWS Resilience Hub into your CI/CD pipeline. By automatically running a resiliency assessment within CI/CD pipelines, development teams can fail fast and understand quickly if a change negatively impacts an application’s resiliency. The pipeline can stop the deployment into further environments, such as quality assurance, user acceptance testing, and production, until the resiliency issues have been resolved.
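A pipeline gate of this kind could look roughly like the following sketch. It assumes a boto3 `resiliencehub` client passed in as `client` and relies on its `start_app_assessment` and `describe_app_assessment` operations and on the `Success`, `Failed`, and `PolicyMet` status values; confirm these against the API reference. The `run_resiliency_gate` helper, the assessment name, and the polling values are hypothetical.

```python
import time


def run_resiliency_gate(client, app_arn, app_version="release",
                        assessment_name="pipeline-assessment",
                        poll_seconds=30, timeout_seconds=1800,
                        sleep=time.sleep):
    """Start an assessment, wait for it to finish, and report policy compliance.

    Returns True if the finished assessment meets the resiliency policy,
    False if it does not. Raises if the assessment fails or times out.
    """
    started = client.start_app_assessment(
        appArn=app_arn, appVersion=app_version, assessmentName=assessment_name
    )
    assessment_arn = started["assessment"]["assessmentArn"]

    waited = 0
    while True:
        assessment = client.describe_app_assessment(
            assessmentArn=assessment_arn
        )["assessment"]
        status = assessment["assessmentStatus"]
        if status == "Success":
            return assessment.get("complianceStatus") == "PolicyMet"
        if status == "Failed":
            raise RuntimeError(f"assessment {assessment_arn} failed")
        if waited >= timeout_seconds:
            raise TimeoutError(f"assessment {assessment_arn} did not finish in time")
        sleep(poll_seconds)
        waited += poll_seconds


# Example pipeline step (requires boto3 and AWS credentials):
#   import sys, boto3
#   ok = run_resiliency_gate(boto3.client("resiliencehub"), "<your app ARN>")
#   sys.exit(0 if ok else 1)
```

Returning a boolean lets the pipeline step translate a policy breach into a non-zero exit code, which most CI/CD systems treat as a signal to stop the deployment.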

AWS Resilience Hub recommendations

AWS Resilience Hub provides recommendations in three areas: standard operating procedures (SOPs), alarms, and fault injection experiments. These help you implement the recommended best practices to improve the resiliency of the application. Resiliency recommendations evaluate application components and recommend how to optimize for estimated workload RTO and RPO, for costs, or for minimal changes (Figure 4).

Figure 4. Screenshot showing AWS Resilience Hub resiliency recommendations. The recommendations are for optimizing by estimated workload RTO/RPO, costs, and for minimal changes.

Amazon CloudWatch alarms monitor your application resiliency and notify you in case of any service disruption. This enables proactive measures and early intervention to remediate issues during a disruption. SOPs enable timely and efficient recovery of your application in the event of an outage or alarm. AWS Fault Injection Service (FIS) experiments allow you to test the resiliency of your AWS resources and the amount of time it takes to recover from application, infrastructure, Availability Zone, and AWS Region incidents. For example, you can test whether automatic recovery processes, such as automatic scaling or load balancing, restore an application after network issues. By injecting disruptions with FIS, you can also test whether application alarms are triggered when resources reach their limits and whether SOPs helped you recover from the outage efficiently.

RTO and RPO drift detection

AWS Resilience Hub also tracks changes to the application using drift detection, which allows you to opt in to notifications when your application no longer meets the recovery objectives set by your business. With application resiliency drift detection, you subscribe to an automatic service that runs an assessment and informs you if the estimated workload recovery objectives no longer meet the application’s resiliency policy set in AWS Resilience Hub. Customers can opt in to daily assessments and can add a notification using Amazon Simple Notification Service (Amazon SNS) (Figure 5). These assessments can be included as part of the contract to require vendors to continuously track resiliency and send alerts if there is a change in resiliency posture.

Figure 5. Screenshot when configuring notifications in AWS Resilience Hub for drift detection. Users can select daily assessments and opt-in for notifications of any resiliency policy breaches.
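The same settings can, in principle, be applied through the API. The sketch below only builds the request and rests on assumptions: that the boto3 `resiliencehub` client's `update_app` operation accepts `assessmentSchedule` and `eventSubscriptions`, and that `Daily`, `DriftDetected`, and `ScheduledAssessmentFailure` are valid values. Check the current API reference before relying on it. The helper name, the subscription names, and the ARNs are hypothetical.

```python
def build_drift_notification_update(app_arn: str, sns_topic_arn: str) -> dict:
    """Build keyword arguments for an update_app call that enables daily
    assessments and sends drift notifications to an Amazon SNS topic."""
    for label, arn in (("app_arn", app_arn), ("sns_topic_arn", sns_topic_arn)):
        if not arn.startswith("arn:"):
            raise ValueError(f"{label} does not look like an ARN: {arn!r}")

    return {
        "appArn": app_arn,
        "assessmentSchedule": "Daily",  # assumed value for daily assessments
        "eventSubscriptions": [
            {
                "name": "drift-detected",  # hypothetical subscription name
                "eventType": "DriftDetected",
                "snsTopicArn": sns_topic_arn,
            },
            {
                "name": "scheduled-assessment-failure",  # hypothetical name
                "eventType": "ScheduledAssessmentFailure",
                "snsTopicArn": sns_topic_arn,
            },
        ],
    }


# With AWS credentials configured, the request could then be sent:
#   import boto3
#   boto3.client("resiliencehub").update_app(**build_drift_notification_update(
#       "<your app ARN>", "<your SNS topic ARN>"))
```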

Monitoring for compliance

Specifically relevant for public sector organizations, AWS Resilience Hub helps meet contractual and regulatory requirements by keeping an audit trail of events during planned and unplanned outages. The assessments and recommendations provided by AWS Resilience Hub are tailored for your specific applications based on the services and resources you are using and the resiliency targets you have set. You can use the AWS Resilience Hub APIs to collect and aggregate resiliency data for applications defined within AWS Resilience Hub, then use this information to perform analytics and set up a dashboard using Amazon QuickSight (Figure 6). These dashboards can then be shared with regulatory authorities, compliance authorities, and auditors to build confidence in your business continuity and help meet compliance and regulatory requirements. To learn more about creating an AWS Resilience Hub dashboard for your applications, visit Build a resilience reporting dashboard with AWS Resilience Hub and Amazon QuickSight.

Figure 6. Screenshot of an Amazon QuickSight dashboard view to collect and aggregate resiliency data for applications defined within AWS Resilience Hub. The dashboard shows the number of applications, average target RTO/RPO in minutes, and an average resiliency score.
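The headline figures on such a dashboard can be derived with a small aggregation step. The following sketch assumes that per-application records have already been collected (for example, through the AWS Resilience Hub APIs) into dictionaries with `name`, `rtoInSecs`, `rpoInSecs`, and `resiliencyScore` keys; these field names, the `summarize_portfolio` helper, and the sample data are assumptions made for illustration.

```python
from statistics import mean


def summarize_portfolio(apps: list) -> dict:
    """Aggregate per-application records into dashboard-style metrics."""
    if not apps:
        return {
            "application_count": 0,
            "avg_target_rto_minutes": None,
            "avg_target_rpo_minutes": None,
            "avg_resiliency_score": None,
        }
    return {
        "application_count": len(apps),
        # Targets are stored in seconds and reported in minutes.
        "avg_target_rto_minutes": mean(a["rtoInSecs"] for a in apps) / 60,
        "avg_target_rpo_minutes": mean(a["rpoInSecs"] for a in apps) / 60,
        "avg_resiliency_score": mean(a["resiliencyScore"] for a in apps),
    }


if __name__ == "__main__":
    # Hypothetical records for two applications.
    sample = [
        {"name": "benefits-portal", "rtoInSecs": 3600, "rpoInSecs": 300, "resiliencyScore": 80.0},
        {"name": "permit-service", "rtoInSecs": 7200, "rpoInSecs": 900, "resiliencyScore": 60.0},
    ]
    print(summarize_portfolio(sample))
```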

If you are working with a number of applications, AWS Resilience Hub generates a resiliency score per application using a scale that indicates the level of implementation for recommended resiliency tests, alarms, and recovery SOPs. This score can be used to measure resiliency improvements over time. You can view the score on the AWS Resilience Hub dashboard.

Conclusion and next steps

In this post, we discussed the importance of application resiliency for public-serving organizations that provide uninterrupted critical services to their citizens, such as emergency response systems and medical facilities, including during times of disputes. Unlike on-premises or data center business continuity and disaster recovery, which relies on semi-annual or annual resiliency testing, AWS Resilience Hub provides continual validation and tracking of resiliency. You can decide the frequency of tests (for example, every release, once per sprint, or monthly) and run them as often as you need. You can implement the SOPs, alarms, and fault injection experiments to continually test the application and enable compliance with SLAs.

As a next step, examine your workloads and determine their RTO and RPO targets, then run an AWS Resilience Hub assessment to find out your resiliency score and how you can improve the resiliency posture of your applications. Look at your existing architecture and determine whether you can remove bottlenecks or other single points of failure. If so, look to remediate them using different AWS services.

Learn more about the engineering patterns AWS uses to build systems in the Amazon Builders’ Library. Read the AWS Fault Isolation Boundaries whitepaper to learn how AWS is built and how you can build a workload on AWS to support your resiliency goals.

AWS Public Sector Blog