AWS Public Sector Blog

A pragmatic approach to RPO zero

Nobody wants to lose data—and setting a Recovery Point Objective (RPO) to zero makes this intent clear. Customers with government mission-critical systems often need to meet this requirement, since any amount of data loss will cause harm. RPO covers both resilience and disaster recovery—everything from the loss of an individual physical disk to an entire data center.

For example, financial systems need to support RPO zero. Once a customer commits a transaction and gets confirmation of the transfer of funds, they expect that the transaction will never be lost. Existing systems support RPO zero through a combination of architecture patterns (including resilient messaging) and on-premises legacy databases.

RPO zero is frequently interpreted as a database or storage requirement, but meeting it requires thinking about the entire system. To do so, you can use Amazon Web Services (AWS) services and architecture patterns, which provide resilience to failure with clustering, auto scaling, and failover across multiple data centers within one Region.

There are scenarios that will break RPO zero—an asteroid strike could lead to the simultaneous loss of all data centers in a region. Multi-region architectures can address this risk, but data sovereignty requirements may preclude using multiple regions.

RPO complements Recovery Time Objective (RTO), which is how long the system will take to recover from a failure or disaster. You may be able to cope with hours of RTO but still need RPO zero. You should also plan for patching and other maintenance activities.

AWS native resilience

Amazon Aurora provides RPO zero at the storage level by requiring at least four of its six storage nodes to acknowledge receipt before it confirms the transaction. Aurora splits the six storage nodes across Availability Zones (AZs) in an AWS Region. Amazon Relational Database Service (Amazon RDS) Multi-AZ (except SQL Server) provides close to RPO zero at the storage level, independently of the database engine. It writes each block synchronously to two Amazon Elastic Block Store (Amazon EBS) volumes in two different AZs. However, in some degraded conditions, the service may write the block to only one Amazon EBS volume, creating a risk of data loss. Amazon RDS Multi-AZ for SQL Server uses the native Always On Availability Groups.
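The four-of-six rule can be checked with a little arithmetic. The sketch below assumes the six storage nodes sit two per AZ across three AZs (a layout the text above does not spell out). The names NODES, WRITE_QUORUM, and the helper functions are purely illustrative. It models only acknowledgement counting, not how Aurora is actually implemented.

```python
from itertools import combinations

# Assumed layout: six storage nodes, two in each of three Availability Zones.
NODES = [("az1", 0), ("az1", 1), ("az2", 0), ("az2", 1), ("az3", 0), ("az3", 1)]
AZS = ("az1", "az2", "az3")
WRITE_QUORUM = 4  # from the text: four of six nodes must acknowledge


def write_confirmed(acks) -> bool:
    """A write is confirmed only once a write quorum of nodes has acknowledged it."""
    return len(set(acks)) >= WRITE_QUORUM


def min_surviving_copies() -> int:
    """Worst case over every possible quorum and every single-AZ loss."""
    worst = len(NODES)
    for acks in combinations(NODES, WRITE_QUORUM):
        for lost_az in AZS:
            survivors = [node for node in acks if node[0] != lost_az]
            worst = min(worst, len(survivors))
    return worst


def can_still_write_with_az_down(lost_az: str) -> bool:
    """Do enough nodes remain outside the lost AZ to form a write quorum?"""
    remaining = [node for node in NODES if node[0] != lost_az]
    return len(remaining) >= WRITE_QUORUM


if __name__ == "__main__":
    print("copies left after worst-case AZ loss:", min_surviving_copies())
    print("writes possible with az1 down:", can_still_write_with_az_down("az1"))
```

Under that assumed layout, any confirmed write keeps at least two copies after the loss of a whole AZ, and the four remaining nodes are still enough to acknowledge new writes.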

Architecture patterns for RPO zero

To provide higher confidence in RPO zero, there are two processing patterns to consider: batch and online. Batch processing may read from either an external or an internal source. If the source is internal, you can achieve RPO zero by restarting the batch process. If it is external, you can implement the following pattern.

Store the external data in Amazon Simple Storage Service (Amazon S3) immediately, before you confirm to the external party that the data was received. This ordering assures that you have a resilient copy of the data first. Amazon S3 provides read-after-write consistency for new objects and stores the objects across multiple Availability Zones in a Region, providing protection against the loss of an entire data center. The batch process then processes the data from Amazon S3 into the database. This can use any form of compute and any database.
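The ordering is the important part: persist first, acknowledge second. Here is a minimal sketch of that rule. The ObjectStore interface, the InMemoryStore stand-in, the "incoming/" key prefix, and the "ACK" reply format are all assumptions made so the example runs locally. In a real system the put call would be a write to an Amazon S3 bucket (for example through an SDK call such as boto3's put_object).

```python
from typing import Dict, Protocol


class ObjectStore(Protocol):
    """Hypothetical minimal interface for a durable object store."""

    def put(self, key: str, body: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...


class InMemoryStore:
    """Local stand-in for Amazon S3 so the sketch runs; it is not durable."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, body: bytes) -> None:
        self._objects[key] = body

    def get(self, key: str) -> bytes:
        return self._objects[key]


def receive_batch(store: ObjectStore, batch_id: str, payload: bytes) -> str:
    """Persist the raw payload, then (and only then) acknowledge receipt."""
    # Step 1: write the unmodified payload. If this raises, the function
    # exits here and the sender never sees an acknowledgement.
    store.put(f"incoming/{batch_id}", payload)
    # Step 2: the copy exists, so it is now safe to confirm receipt.
    return f"ACK {batch_id}"


if __name__ == "__main__":
    store = InMemoryStore()
    print(receive_batch(store, "2024-01-15-payments", b"t1,100\nt2,250\n"))
```

Because the acknowledgement is the return value of the function that performs the write, there is no code path that confirms receipt without a stored copy.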

If you have a failure at any point after receiving the data, you can replay the batch process from the Amazon S3 bucket. This works even if the batch process completed its commit to the database but the database had not yet made that commit resilient (through either a read replica or a backup). This eliminates the dependence on RPO zero at the database level. Because a replay reprocesses the data from Amazon S3 in its original state, the batch process needs to be idempotent. This means that you can restart the process at any time and from any point without having to worry that it would commit any record twice.
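One common way to make a batch load idempotent is to key each record on a stable identifier and let the database ignore rows it already holds. The sketch below uses SQLite from the Python standard library purely as a stand-in database. The payments table, the "record_id,amount_cents" line format, and the function names are assumptions for illustration, not part of the pattern described above.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with a primary key on the record identifier."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE payments ("
        "record_id TEXT PRIMARY KEY, amount_cents INTEGER NOT NULL)"
    )
    return conn


def process_batch(conn: sqlite3.Connection, raw: str) -> int:
    """Load 'record_id,amount_cents' lines; return the total rows now stored."""
    with conn:  # commits on success, rolls back if an exception is raised
        for line in raw.splitlines():
            if not line.strip():
                continue
            record_id, amount = line.split(",")
            # INSERT OR IGNORE skips rows whose primary key already exists,
            # so replaying the same object cannot commit a record twice.
            conn.execute(
                "INSERT OR IGNORE INTO payments VALUES (?, ?)",
                (record_id.strip(), int(amount)),
            )
    return conn.execute("SELECT COUNT(*) FROM payments").fetchone()[0]


if __name__ == "__main__":
    conn = open_db()
    raw_object = "t1,100\nt2,250\n"  # contents of the stored batch object
    print(process_batch(conn, raw_object))  # first run
    print(process_batch(conn, raw_object))  # replay after a simulated failure
```

Running the load twice against the same stored object leaves the table unchanged after the first pass, which is the property a replay depends on.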

Online processing can follow a similar pattern. When a user submits a transaction, they only have confidence that the transaction is complete when they get a response. Otherwise, they may suspect that something has gone wrong and retry the transaction. This is especially true of modern browser and mobile applications, where the communication is asynchronous.

When the user submits their transaction, you first store the transaction in Amazon S3. You can then process the remainder of the transaction through business logic and commit it to the database. Any compute service or database can be used, for example Amazon Elastic Compute Cloud (Amazon EC2) and Amazon RDS. The user then gets a confirmation of the result. If the user does not get a confirmation, they will assume the transaction failed and try again (and a smart user app could allow this retry without re-entering the data). You do not need to consider RPO zero until after the user gets the confirmation. RPO zero means the user knows the transaction is accepted and expects that data to be retained.
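The sequence of store, process, commit, and confirm can be sketched as a single request handler. In this sketch, plain dictionaries stand in for the Amazon S3 bucket and the database, and the fail_commit flag, the ProcessingError exception, and the "CONFIRMED" reply are hypothetical devices for simulating a failure before the confirmation is sent.

```python
from typing import Dict


class ProcessingError(Exception):
    """Raised when processing fails before a confirmation can be sent."""


def handle_transaction(
    store: Dict[str, bytes],
    database: Dict[str, int],
    txn_id: str,
    amount_cents: int,
    fail_commit: bool = False,
) -> str:
    # Step 1: keep a raw copy of the request (dict stands in for Amazon S3).
    store[f"txn/{txn_id}"] = str(amount_cents).encode()
    # Step 2: business logic and database commit. A failure here means the
    # user receives no confirmation and is free to retry.
    if fail_commit:
        raise ProcessingError("commit failed; no confirmation sent")
    database[txn_id] = amount_cents
    # Step 3: the confirmation. Only from this point on does the user
    # expect the data to be retained.
    return f"CONFIRMED {txn_id}"


if __name__ == "__main__":
    store: Dict[str, bytes] = {}
    database: Dict[str, int] = {}
    try:
        handle_transaction(store, database, "t1", 500, fail_commit=True)
    except ProcessingError as err:
        print("first attempt:", err)
    print("retry:", handle_transaction(store, database, "t1", 500))
```

The failed first attempt leaves nothing the user was promised, so the retry with the same identifier simply completes the transaction.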

Even when considering all the failure scenarios, you can always revert to the transaction stored in Amazon S3 and replay it to maintain the RPO zero requirement. As in the earlier batch processing scenario, transaction processing must be idempotent to avoid processing duplicate transactions. You can create a transaction identifier within the interface application to support this. The Amazon S3 write may introduce additional latency for the user. However, it is likely that the Amazon S3 write will be faster than the business logic and database commit. You can consider running these in parallel to maintain existing latency.
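A rough sketch of the parallel variant follows, using a thread pool from the Python standard library. The dictionaries again stand in for Amazon S3 and the database, and new_transaction_id, store_raw, commit, and submit are hypothetical names. The key assumption is that the interface application generates the identifier once and reuses it on every retry.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor
from typing import Dict


def new_transaction_id() -> str:
    """Generated once by the interface application and reused on retries."""
    return str(uuid.uuid4())


def store_raw(store: Dict[str, bytes], txn_id: str, payload: bytes) -> None:
    store[txn_id] = payload  # stand-in for the Amazon S3 write


def commit(database: Dict[str, int], txn_id: str, amount_cents: int) -> None:
    # Idempotent commit: a repeated identifier does not create a second record.
    database.setdefault(txn_id, amount_cents)


def submit(
    store: Dict[str, bytes],
    database: Dict[str, int],
    txn_id: str,
    amount_cents: int,
) -> str:
    payload = str(amount_cents).encode()
    with ThreadPoolExecutor(max_workers=2) as pool:
        raw_write = pool.submit(store_raw, store, txn_id, payload)
        db_commit = pool.submit(commit, database, txn_id, amount_cents)
        # result() re-raises any exception, so the confirmation below is
        # reached only when both the raw copy and the commit succeeded.
        raw_write.result()
        db_commit.result()
    return f"CONFIRMED {txn_id}"


if __name__ == "__main__":
    store: Dict[str, bytes] = {}
    database: Dict[str, int] = {}
    txn_id = new_transaction_id()
    print(submit(store, database, txn_id, 500))
    print(submit(store, database, txn_id, 500))  # user retry, same identifier
    print(len(database))
```

If either branch fails, no confirmation is returned and the user retries. Because the commit is keyed on the transaction identifier, a retry after a partial success does not duplicate the record.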

Removing the need for RPO zero at the database level allows you to use the standard, high-performance, and consistent databases available in AWS. If you insist on RPO zero at the database level, then you need a two-phase commit across data centers to provide the resiliency. Two-phase commit is an anti-pattern for performance, and the two-phase commit mechanism itself introduces complexity and a risk of resiliency failure.

Start planning for disaster recovery

I reframed the requirement for RPO zero at the overall system architecture level to show how you can use standard and performance-efficient consistent databases while still delivering on the RPO zero requirement. You can take advantage of Amazon S3 read-after-write consistency and resilience with minimal impact to existing processing logic. This pattern allows you to migrate to standard and purpose-built AWS database technologies from legacy cross-data center database products.

Reach out to AWS for help in migrating to the cloud. Learn more by visiting the AWS Organizational Resiliency & Continuity Help Center, and check out more stories on disaster recovery.