AWS Cloud Operations Blog

Validating and Improving the RTO and RPO Using AWS Resilience Hub

“Everything fails, all the time”, a famous quote from Werner Vogels, VP and CTO of Amazon.com. When you design and build an application, a typical goal is to have it working, the next is to keep it running, no matter what disruptions may occur. It is crucial to achieve resiliency, but you need to consider how to define it first and which metric to use to determine your application’s resiliency against. Resiliency can be defined in terms of metrics called RTO (Recovery Time Objective) and RPO (Recovery Point Objective). RTO is a measure of how quickly can your application recover after an outage and RPO is a measure of the maximum amount of data loss that your application can tolerate.

To learn more on how to establish RTO and RPO for your application, please refer “Establishing RTO and RPO Targets for cloud applications

AWS Resilience Hub is a new service launched in November 2021. This service is designed to help you define, validate and track the resilience of your applications on the AWS cloud.

You can define the resilience policies for your applications. These policies include RTO and RPO targets for applications, infrastructure, availability zone, and region disruptions. Resilience Hub’s  assessment uses best practices from the AWS Well-Architected Framework.  It will analyze the components of an application such as compute, storage, database and
network and uncover potential resilience weaknesses.

In this blog we will show you how Resilience Hub can help you validate RTO and RPO at component level for four types of disruptions, which in-turn can help you improve the resiliency of your entire application stack.

  1. Customer Application RTO and RPO
  2. AWS Infrastructure RTO and RPO
  3. Cloud Infrastructure Availability Zone (AZ) disruption
  4. AWS Region disruption

Customer Application RTO and RPO

Application outages occur when the infrastructure stack (hardware) is healthy but the application stack (software) is not. This outage may be caused by configuration changes, bad code deployments, integration failures, etc. Determining RTO and RPO for application stacks depends on the criticality and importance of the application, as well as your compliance requirements. For example, mission critical application could have an RTO and RPO of 5 minutes

Example: Your critical business application is hosted from Amazon Simple Storage Service (Amazon S3) bucket and you set it up without cross region replication and versioning. Figure 1 shows that application RTO and RPO are unrecoverable based on a target of 5 minutes RTO and RPO

Figure 1. Resilience Hub assessment of the Amazon S3 bucket against Application RTO

Figure 1. Resilience Hub assessment of the Amazon S3 bucket against Application RTO

After running the assessment, Resilience Hub provides recommendation to enable versioning on Amazon S3 bucket as shown in Figure 2.

Figure 2. Resilience recommendation for Amazon S3

After enabling the versioning, you can achieve the estimated RTO of 5m and RPO of 0s. Versioning allows you to preserve, retrieve, and restore any version of any object stored in a bucket improving your application resiliency.

Resilience Hub also provides the cost associated with implementing the recommendations. In this case, there is no cost for enabling versioning on Amazon S3 bucket. Normal S3 pricing applies to each version of an object. You can store any number of versions of the same object, so you may want to implement some expiration and deletion logic if you plan to make use of versioning.

Resilience Hub can provide one or more than one recommendation to satisfy the requirements such as cost, high availability and least changes. As shown in Figure 2, adding versioning for S3 bucket satisfies both high availability optimization and best attainable architecture with least changes.

Cloud Infrastructure RTO and RPO

Cloud Infrastructure outage occurs when the underlying components for infrastructure, such as hardware fail. Consider a scenario where a partial outage occurred because of a component failure.

For example, one of the components in your mission critical application is an Amazon Elastic Container Service (ECS) running on Elastic Compute Cloud (EC2) instance, and your targeted infrastructure RTO  and RPO of 1 second .Figure 3 shows that you are unable to meet your targeted infrastructure RTO of 1 second

Figure 3. Resilience Hub assessment of the ECS application against Infrastructure

Figure 4 shows that Resilience Hub’s recommendation to add  AWS Auto scaling Groups and Amazon ECS Capacity Providers in multiple AZs.

Figure 4. Resilience Hub recommendation for ECS cluster - to add Auto Scaling Groups and Capacity providers in multiple AZs.

Figure 4. Resilience Hub recommendation for ECS cluster – to add Auto Scaling Groups and Capacity providers in multiple AZs.

AWS Auto Scaling monitors your applications and automatically adjusts capacity to maintain steady, predictable performance at the lowest possible cost. Amazon ECS capacity providers are used to manage the infrastructure the tasks in your clusters use. It can use AWS Auto Scaling groups to automatically manage the Amazon EC2 instances registered to their clusters. By applying the Resilience Hub recommendation, you will achieve an estimated RTO and RPO of near zero seconds and the estimated cost for the change is $16.98/month.

AWS Infrastructure Availability Zone (AZ) disruption

The AWS global infrastructure is built around AWS Regions and Availability Zone to achieve High Availability (HA). AWS Regions provide multiple physically separated and isolated Availability Zones, which are connected with low-latency, high-throughput, and highly redundant networking.

For example, you have setup a public NAT gateway in Single-AZ to allow instances in a private subnet to send outbound traffic to the internet. You have deployed Amazon Elastic Compute Cloud (Amazon EC2) instances in multiple availability zones.

Figure 5. Resilience Hub assessment of the Single-Az NAT gateway

Figure 5 shows Availability Zone disruptions as unrecoverable and does not meet the Availability Zone RTO goals. NAT gateways are fully managed services, there is no hardware to manage, so they are resilient (0s RTO) for infrastructure failure. However, deploying only one NAT gateway in Single AZ leaves the architecture vulnerable. If the NAT gateway’s Availability Zone is down, resources deployed in other Availability Zones lose internet access.

Figure 6. Resilience Hub recommendation to create Availability Zone-independent NAT Gateway architecture.

Figure 6 shows Resilience Hub’s recommendation to deploy NAT Gateways into each Availability Zone where corresponding EC2 resources are located.

Following Resilience Hub’s recommendation, you can achieve the lowest possible RTO and RPO of 0 seconds in the event of an Availability Zone disruption and create an Availability Zone-independent architecture run on $32.94 per month.

NAT Gateway deployment in multiple Azs can achieve the lowest RTO/RPO for Availability Zone disruption, the lowest cost, and the minimal changes, so the recommendation is the same for all three options.

AWS Region disruption

An AWS Region consists of multiple, isolated, and physically separated AZs within a geographical area. This design achieves the greatest possible fault tolerance and stability. For a disaster event that includes the risk of disruption of multiple data centers or a regional service disruption, it’s a best practice to consider multi-region disaster recovery strategy to mitigate against natural and technical disasters that can affect an entire Region within AWS. If one or more Regions or regional service that your workload uses are unavailable, this type of disruption can be resolved by switching to a secondary Region. It may be necessary to define a regional RTO and RPO if you have a Multi-Region dependent application.

For example, you have a Single-AZ Amazon RDS for MySQL as part of a global mission-critical application and you have configured 30 min RTO and 15 minute RPO for all four disruption types. Each RDS instance runs on an Amazon EC2 instance backed by an Amazon Elastic Block Store (Amazon EBS) volume for storage. RDS takes daily snapshots of the database, which are stored durably in Amazon S3 behind the scenes. It also regularly copies transaction logs to S3—up to 5 min utes intervals—providing point-in time-recovery when needed.

If an underlying EC2 instance suffers a failure, RDS automatically tries to launch a new instance in the same Availability Zone, attach the EBS volume, and recover. In this scenario, RTO can vary from minutes to hours. The duration depends on the size of the database, and failure and recovery approach. RPO is zero in the case of recoverable instance failure because the EBS volume was recovered. If there is an Availability Zone disruption, you can create a new instance in a different Availability Zone using point-in-time recovery. Single-AZ does not give you protections against regional disruption. Figure 7 shows that you are not able to meet regional RTO of 30 min and RPO of 15 mins.

Figure 7. Resilience Hub assessment for the Amazon RDS

Figure 8. Resilience Hub recommendation to achieve region level RTO and RPO

As shown in Figure 8, Resilience Hub provides you three recommendations to optimize in order to handle Availability Zone disruptions, be cost effective and to have minimal changes.

Recommendation 1 “Optimize for Availability Zone RTO/RPO”: The changes recommended under this option will help you achieve the lowest possible RTO and RPO in the event of an Availability Zone disruption. For a Single-AZ RDS, Resilience Hub recommends to change the Database to Aurora and add two read replica same region to achieve targeted RTO and RPO for Availability Zone failure. It also recommends to add a read replica in different region to achieve resiliency for regional disruption. Estimated cost for these changes as shown in Figure 8 is $66.85 per month.

Amazon Aurora read replicas share the same data volume as the original database instance. Aurora handles the Availability Zone disruption by fully automating the failover with no data loss. Aurora creates highly available database cluster with synchronous replication across multiple AZs. This is considered to be the better option for production databases where data backup is a critical consideration.

Recommendation 2 “Optimize for cost”: These changes will optimize your application to reach the lowest cost that will still meet your targeted RTO and RPO. The recommendation here is to keep a Single-AZ Amazon RDS and create the read replica in primary region with additional read replica in the secondary / different region. The estimated cost for these changes is $54.38 per month. You can promote a read replica to a standalone instance as a disaster recovery solution if the primary DB instance fails or unavailable during region disruption.

Recommendation 3 “Optimize for minimal changes”: These changes will help you to meet targeted RTO and RPO while keeping implementation changes to minimal. Resilience Hub recommends to create a Multi-AZ writer and a Multi-AZ read replica in two different regions. Estimate cost for changes is $81.56 per month. When you provision a Multi-AZ Database instance, Amazon RDS automatically creates a primary Database instance and synchronously replicates the data to a standby instance in a different Availability Zone. In case of an infrastructure failure, Amazon RDS performs an automatic failover to the standby Database instance. Since the endpoint for your Database instance remains the same after a failover, your application can resume database operation without the need for manual administrative intervention

Although all three recommendations help you achieve a targeted application RTO and RPO of 30 mins, the estimated costs and efforts may vary.

Conclusion

To build a resilient workload, you need to have right best practices in place. In this post, we showed you how to improve the resiliency of your business application and achieve targeted RTO and RPO for application, infrastructure, Availability Zone, and Region disruptions using recommendations provided by Resilience Hub. To learn more and try the service by yourself, visit AWS Resilience Hub page.

Authors:

Monika Shah

Monika Shah is a Technical Account Manager at AWS where she helps customers navigate through their cloud journey with a focus on financial and operational efficiency. She worked for Fortune 100 companies in the networking and telecommunications fields for over a decade. In addition to her Masters degree in Telecommunications, she holds industry certifications in networking and cloud computing. In her free time, she enjoys watching thriller and comedy TV shows, playing with children, and exploring different cuisines.

Divya Balineni

Divya is a Sr. Technical Account Manager with Amazon Web Services. She has over 10 years of experience managing, architecting, and helping customers with resilient architectures to meet their business continuity needs. Outside of work, she enjoys gardening and travel.