AWS for Industries

Architecting Critical Payment Systems for Multi-Region Resiliency

A wide range of critical systems architected to run on AWS Cloud use active-active patterns for high availability (HA) across multiple AWS Availability Zones (AZs) and active-passive patterns for disaster recovery (DR) across multiple AWS Regions. Financial services customers choose the active-active approach across multiple Regions to achieve near-zero Recovery Time Objective (RTO). Because network latency grows with the distance between Regions, maintaining strong data consistency across Regions becomes difficult. Consensus algorithms such as Raft, Paxos, and Two-Phase Commit achieve consistency, but do not satisfy the performance and latency requirements of payment businesses.

In this blog post, we explain the design of a critical payment system taking advantage of the ISO 20022 messaging standard and AWS Serverless capabilities. We demonstrate how to achieve exactly-once processing by employing an active-active multi-Region approach with all steps of the same transaction executed within the same Region. We describe the process of cross-Region failure detection combined with a self-healing mechanism by leveraging the ISO 20022 status codes.

Business requirements

By design, a critical payment system moves currency from one entity to another. The ISO 20022 messaging standard makes it easier to achieve exactly-once processing by providing a well-defined list of status codes for each step of the transactional workflow. At a macro level, we break down the movement of currency between entities into small steps and use the ISO 20022 status codes (see Figure 1) to reflect each transactional step.

Figure 1: ISO 20022 status codes (partial list)

We chose the ISO 20022 solution guidance as an example to describe a critical transactional system for payment processing. This solution implements the following transactional steps:

1. Initiate new transactions;

2. Receive incoming payment message;

3. Check for technical validation and business rules such as Know Your Customer (KYC), Anti-Money Laundering (AML), Foreign Currency Exchange (FX), Fraud, Sanctions, Liquidity, etc.;

4. Release outgoing payment message.

In case of a regional service impairment, a recovery process reroutes the traffic to the healthy Region and cancels in-flight transactions from the unhealthy Region. We utilize ISO 20022 status codes to ensure that each step of the transactional workflow executes exactly-once, without any duplication or omission. These status codes indicate if the system accepts, rejects, or cancels the transactional steps and the overall payment. A timeout process identifies transactions exceeding the end-to-end processing SLA and marks them as rejected.

Design considerations

To achieve exactly-once processing we use transactional metadata that includes ISO 20022 status codes and replicate it across multiple Regions. Each transactional step loads the data points for previously-executed steps of the same transaction and compares them with expected statuses at that stage. The system reconciles any discrepancies utilizing transactional metadata with ISO 20022 status codes.
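The comparison of recorded statuses against expected statuses can be sketched as follows. This is a minimal illustration, not the published code sample: the step names, the `EXPECTED_BEFORE_STEP` table, and the `may_proceed` function are hypothetical, and the expected sets are inferred from the workflow steps described in this post.

```python
from typing import Dict, Iterable, Set

# Hypothetical table: the statuses each step expects to find already recorded
# for a transaction before it runs. Inferred from the workflow in this post.
EXPECTED_BEFORE_STEP: Dict[str, Set[str]] = {
    "incoming": {"ACCP"},
    "consuming": {"ACCP", "ACTC"},
    "releasing": {"ACCP", "ACTC", "ACSP"},
}


def may_proceed(step: str, recorded_statuses: Iterable[str]) -> bool:
    """Return True only when the recorded statuses are exactly the expected ones."""
    return set(recorded_statuses) == EXPECTED_BEFORE_STEP[step]
```

Under these assumptions, a step that finds an unexpected status such as RJCT or CANC does not proceed, and a step that runs before its predecessor has recorded its status does not proceed either.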

In this ISO 20022 solution we use Event-Driven Architecture (EDA) where each transactional step is a small service (also known as a microservice, or MSA) that is decentralized and messaging-enabled, bounded by context, autonomously developed, and independently deployable. To meet certain regulatory, compliance, or business requirements that call for higher availability, we deployed the ISO 20022 solution across multiple Regions (see Figure 2) in the active-active mode described below. The code sample is available on GitHub.

Figure 2: ISO 20022 Messaging Workflows on AWS (multi-Region)

AWS Services

Next, let’s dive deeper into the step-by-step logic of the ISO 20022 messaging workflows and emphasize the AWS services and features used for implementation.

Step 1

Amazon Route 53 manages DNS queries for global and regional API endpoints. Regional endpoints leverage Route 53 failover policy routing with Route 53 health checks. In this step, API Consumers make an initial request to the global endpoint and receive a payload which includes the regional endpoint. For all subsequent calls, API Consumers issue API requests of the same transaction to the same regional endpoint.
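From the API Consumer's side, this pinning behavior might look like the sketch below. The `RegionPinnedClient` class, the `ask_global_endpoint` callable, and the `transaction_ref` handle are hypothetical names used only for illustration; a real client would perform an HTTPS request to the global endpoint instead of calling an injected function.

```python
from typing import Callable, Dict


class RegionPinnedClient:
    """Client-side sketch: pin every call of one transaction to one regional endpoint."""

    def __init__(self, ask_global_endpoint: Callable[[], str]) -> None:
        # ask_global_endpoint is a hypothetical stand-in for the initial request
        # to the global endpoint; it returns a regional endpoint.
        self._ask_global_endpoint = ask_global_endpoint
        self._pinned: Dict[str, str] = {}

    def endpoint_for(self, transaction_ref: str) -> str:
        """Resolve the regional endpoint once, then reuse it for the transaction."""
        if transaction_ref not in self._pinned:
            self._pinned[transaction_ref] = self._ask_global_endpoint()
        return self._pinned[transaction_ref]
```

The global endpoint is consulted only once per transaction, so a later DNS change cannot move the remaining steps of an already-started transaction to another Region.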

Step 2

Amazon Cognito secures API endpoints with a built-in authorization flow based on the OAuth 2.0 security standard. In this step, API Consumers make a request to the Cognito token endpoint and receive a payload with an Authorization Token, which they use for each regional API request.

Step 3

Amazon API Gateway manages and secures communications between API Consumers and the ISO 20022 solution. API Gateway leverages custom domains to provision global and regional endpoints in each Region. AWS Lambda integrates with API Gateway to allow API Consumers to call the Transaction API (or a similar API), which proxies the request to the Transaction MSA (or a similar Lambda function). In this step, API Consumers request and receive a new unique transaction ID.

Step 4

We chose Amazon DynamoDB for two reasons. Within a single Region, it offers strong consistency and single-digit millisecond latency across multiple AZs. Across multiple Regions, global tables offer a 99.999% availability SLA and eventual consistency, with replication latency typically within 1 second. Transactional metadata stored in DynamoDB includes the partition key, Region ID, transaction ID, and transaction status for each transactional step. In this step, the application creates a new metadata item in DynamoDB with a new unique transaction ID and sets the transaction status to ACCP.
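As a rough sketch of what such a metadata item and a create-only write could look like, consider the in-memory stand-in below. The key layout (`pk`, `sk`), the `new_status_item` helper, and the `AppendOnlyStore` class are assumptions made for illustration; they are not taken from the published code sample and do not call DynamoDB.

```python
from typing import Dict, List, Tuple

Item = Dict[str, str]


def new_status_item(origin_region: str, transaction_id: str,
                    status: str, writer_region: str) -> Item:
    """Build one metadata item; the key layout here is an illustrative assumption."""
    return {
        "pk": f"{origin_region}#{transaction_id}",  # unique across Regions
        "sk": f"{status}#{writer_region}",          # one item per status and writer
        "region_id": origin_region,
        "transaction_id": transaction_id,
        "status": status,
    }


class AppendOnlyStore:
    """In-memory stand-in for the metadata table: create and read only."""

    def __init__(self) -> None:
        self._items: Dict[Tuple[str, str], Item] = {}

    def create(self, item: Item) -> None:
        key = (item["pk"], item["sk"])
        if key in self._items:
            # Refuse to overwrite, so recorded history is never updated in place.
            raise KeyError(f"item already exists: {key}")
        self._items[key] = dict(item)

    def statuses(self, pk: str) -> List[str]:
        """Return all statuses recorded for one partition key, sorted by name."""
        return sorted(i["status"] for (p, _), i in self._items.items() if p == pk)
```

With DynamoDB itself, a comparable create-only guard would likely be a conditional put that uses `attribute_not_exists` on the key attribute, so that an existing item is never silently overwritten.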

Step 5

Amazon Simple Queue Service (Amazon SQS) stores event-driven messages and triggers event-driven microservices for each transactional step. Each SQS queue integrates with a corresponding Lambda function, which triggers when the queue receives a new message. In this step, the application sends the ISO 20022 incoming message to the SQS queue for incoming payment messages.

Step 6

In this step, SQS triggers the Incoming MSA (a Lambda function) to store the ISO 20022 incoming message in Amazon Simple Storage Service (Amazon S3) and to retrieve payment details such as message type or unique identifier. It checks the unique identifier and the transaction ID to avoid duplicate processing. If successful, the application sends the ISO 20022 incoming message to the SQS queue for consuming payment messages and creates a new metadata item in DynamoDB with transaction status ACTC, otherwise RJCT.
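The duplicate check matters because a queue with at-least-once delivery can hand the same message to a function more than once. A minimal sketch of the idea follows, assuming a hypothetical `handle_incoming` function and an in-memory `seen` set in place of a lookup against the metadata table; how the real solution treats a duplicate is not stated in this post, so returning `None` here is an assumption.

```python
from typing import Optional, Set, Tuple


def handle_incoming(seen: Set[Tuple[str, str]], transaction_id: str,
                    message_id: str, is_valid: bool) -> Optional[str]:
    """Return the status to record, or None when the message is a duplicate."""
    key = (transaction_id, message_id)
    if key in seen:
        # Already processed: record nothing and forward nothing.
        return None
    seen.add(key)
    # ACTC when the step succeeds, RJCT otherwise, as described in Step 6.
    return "ACTC" if is_valid else "RJCT"
```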

Step 7

In this step, SQS triggers the Consuming MSA (a Lambda function) to run the ISO 20022 incoming message through a set of event-driven microservices collectively labeled as check for technical validation and business rules (e.g., KYC, AML, FX, Fraud, Sanctions, Liquidity, etc.). If successful, the application sends the ISO 20022 outgoing message to the SQS queue for releasing payment messages and creates a new metadata item in DynamoDB with transaction status ACSP, otherwise RJCT.

Step 8

In this step, SQS triggers the Releasing MSA (a Lambda function) to store the ISO 20022 outgoing message in Amazon S3 and to notify the API Consumers using Amazon Simple Notification Service (Amazon SNS) when a specific transaction completes processing. If successful, the application creates a new metadata item in DynamoDB with transaction status ACSC, otherwise RJCT.

Step 9

At the end of the transactional workflows, API Consumers make a request to the regional endpoint for the Outgoing API and receive the ISO 20022 outgoing message as payload. If successful, the application creates a new metadata item in DynamoDB with transaction status RCVD.

Step 10

Amazon EventBridge triggers the Timeout MSA (a Lambda function) on a regular schedule. This independent microservice looks for transactions exceeding the end-to-end processing SLA and creates a new metadata item per transaction in DynamoDB with transaction status RJCT.
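The selection logic of such a timeout sweep could be sketched as below. The `find_timed_out` function, the shape of the `transactions` mapping, and the choice of which statuses count as final are assumptions for illustration; the real function would query DynamoDB rather than an in-memory dictionary.

```python
from typing import Dict, List, Set, Tuple

# Assumption: a transaction carrying any of these statuses no longer needs a timeout.
FINAL_STATUSES: Set[str] = {"ACSC", "RCVD", "RJCT", "CANC"}


def find_timed_out(transactions: Dict[str, Tuple[float, Set[str]]],
                   now: float, sla_seconds: float) -> List[str]:
    """Return IDs of unfinished transactions older than the processing SLA.

    transactions maps a transaction ID to (start time in seconds, recorded statuses).
    """
    timed_out = []
    for tx_id in sorted(transactions):
        started_at, statuses = transactions[tx_id]
        finished = bool(statuses & FINAL_STATUSES)
        if not finished and now - started_at > sla_seconds:
            timed_out.append(tx_id)
    return timed_out
```

Each returned ID would then receive a new RJCT metadata item, leaving the earlier items untouched.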

Step 11

Similar to the previous step, Amazon EventBridge triggers the Recover MSA (a Lambda function) to check the health of the opposite Region. If the application receives multiple consecutive failures, it marks the opposite Region as unhealthy and routes all future traffic to the healthy Region without making any DNS control plane actions. The application finds in DynamoDB all in-flight transactions that originated in the unhealthy Region and creates a new metadata item per transaction with transaction status CANC.
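The two halves of this recovery step, counting consecutive failures and canceling in-flight transactions, can be sketched as follows. The `OppositeRegionMonitor` class, the threshold of three failures, the `cancel_in_flight` function, and the `region#transaction` key format are all hypothetical; the sketch works on an in-memory dictionary rather than DynamoDB.

```python
from typing import Dict, List, Set

# Assumption: transactions with any of these statuses are no longer in flight.
FINAL_STATUSES: Set[str] = {"ACSC", "RCVD", "RJCT", "CANC"}


class OppositeRegionMonitor:
    """Track consecutive failed health probes of the opposite Region."""

    def __init__(self, failure_threshold: int = 3) -> None:
        self.failure_threshold = failure_threshold  # assumed value
        self.consecutive_failures = 0

    def record_probe(self, succeeded: bool) -> bool:
        """Record one probe; return True when the Region should be marked unhealthy."""
        if succeeded:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.failure_threshold


def cancel_in_flight(statuses_by_pk: Dict[str, Set[str]],
                     unhealthy_region: str) -> List[str]:
    """Add CANC to every in-flight transaction that originated in the unhealthy Region."""
    canceled = []
    for pk in sorted(statuses_by_pk):
        statuses = statuses_by_pk[pk]
        originated_there = pk.startswith(unhealthy_region + "#")
        if originated_there and not (statuses & FINAL_STATUSES):
            statuses.add("CANC")  # stands in for creating a new metadata item
            canceled.append(pk)
    return canceled
```

A single successful probe resets the counter, so only an unbroken run of failures triggers the failover. Running the cancellation twice is harmless, because a transaction that already carries CANC is no longer treated as in flight.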

Failure scenarios

Routing

The failover strategy for routing relies on the data plane capabilities of Route 53 health checks, which use Amazon S3 get object requests as health check signals. To mark the opposite Region as unhealthy, an application-level process or a human operator simply removes the Amazon S3 health check object from its predefined path. In this case, Route 53 health checks receive HTTP 404 (Not Found) instead of HTTP 200 (OK) and become unhealthy.

In case of failover, the DNS query for the API endpoint associated with the PRIMARY record from Region A resolves to the SECONDARY record from Region B. All future traffic directed to the unhealthy Region routes to the healthy Region without making any DNS control plane actions (see Figure 3).

Figure 3: Route 53 routing failover

Application and metadata

The failover strategy for application and metadata relies on the ISO 20022 status codes and DynamoDB built-in capabilities such as strongly consistent reads within a single Region and global tables for cross-Region replication. The size of each metadata item is intentionally small and limited only to the data points relevant to the transactional decision-making process. For example, the ISO 20022 status codes are relevant to the decision-making process, while the ISO 20022 payment messages, although critical data for the overall payment system, are not. In this case, the metadata stores the ISO 20022 status code and the path to the Amazon S3 object where the ISO 20022 payment message persists.

In case of failover, if a service impairment affects DynamoDB reads or writes, the entire application within a single Region is quickly marked as unhealthy and routing failover redirects the traffic to the opposite healthy Region. On the other hand, if a service impairment affects DynamoDB replication, the application continues to function properly except for the cross-Region recovery process. When replication is restored, there are no concerns about collisions or data loss, because the partition keys and sort keys in DynamoDB are unique across Regions and the application layer avoids UPDATE and DELETE operations.
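To see why create-only writes with keys that are unique across Regions reconcile cleanly, consider the sketch below. The `merge_replicas` function and the dictionary representation of a regional replica are illustrative assumptions; they model the reasoning and not how DynamoDB global tables replicate internally.

```python
from typing import Dict, Tuple

Key = Tuple[str, str]     # (partition key, sort key)
Replica = Dict[Key, str]  # key -> status code


def merge_replicas(left: Replica, right: Replica) -> Replica:
    """Union two replicas; items are immutable, so equal keys must hold equal values."""
    merged = dict(left)
    for key, status in right.items():
        if key in merged and merged[key] != status:
            # Cannot happen when keys are unique and items are never updated.
            raise ValueError(f"conflicting writes for {key}")
        merged[key] = status
    return merged
```

Because every write creates a new item and no item is ever changed, the merge is a plain union and gives the same result regardless of the order in which the replicas catch up.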

If any other AWS service is partially or fully impaired, the Recover MSA from the opposite Region is able to quickly detect these failures and mark the Region unhealthy. Any transactions stuck in-flight are either rejected by the Timeout MSA in the unhealthy Region (if DynamoDB is not impaired) or canceled by the Recover MSA in the healthy Region. For gray failures or other hard-to-detect failures, a human operator has the ability to remove the Amazon S3 health check object and mark the opposite Region as unhealthy.

For example, let’s say an API Consumer requests a transaction ID and successfully submits the ISO 20022 incoming message to the Incoming queue when both EventBridge and SQS suddenly experience impairment in Region A. This service impairment potentially prevents Lambda functions such as the Incoming MSA and Timeout MSA from being triggered and causes all in-flight transactions to become stuck. Meanwhile, the Recover MSA in Region B receives multiple consecutive failed responses and marks Region A as unhealthy, then searches for in-flight transactions that originated in Region A and cancels them in Region B. The API Consumer receives a notification with the transaction ID and the regional endpoint for Region B, and makes a request to retrieve the ISO 20022 outgoing message with transaction status CANC. When SQS recovers in Region A, it triggers the Incoming MSA, which retrieves from DynamoDB the current transaction statuses: ACCP, CANC, and RCVD. However, the function expects only the ACCP status, so it stops the execution with no further actions (see Figure 4).

Figure 4: Cross-Region application failover

Finally, by leveraging serverless computing on AWS, we are able to make this multi-Region resiliency cost effective.

Conclusion

These key design considerations improved the high availability and disaster recovery objectives while providing the same level of performance across multiple AZs within a single Region and increased reliability across multiple Regions:

1. Routing

a. Enhance your APIs with cross-locations endpoint and location-specific endpoints (e.g., a location is either a Region or an AZ)

b. Instruct your API Consumers to send the first API request to the cross-locations endpoint and receive a location-specific endpoint, to which they send all subsequent API requests for the same transaction

c. Enhance your API endpoints with health checks to detect failures and autoshift traffic from an unhealthy endpoint to a healthy one without making any DNS control plane actions (e.g., DNS failover and/or zonal autoshift)

2. Metadata

a. Design your transactional metadata to include partition key unique across locations, transaction ID unique per location, location-specific endpoint (or location ID), and transaction status code; partition keys unique across locations help avoid data replication collisions

b. Design your transactional metadata to be as small as possible; a reduced payload decreases replication time, regardless of whether replication is synchronous or asynchronous

c. Avoid using UPDATE and DELETE operations on transactional metadata; in other words, act like an accountant who is only allowed to CREATE and RETRIEVE metadata

3. Application

a. Enhance your application code to return a location-specific endpoint for your first API request and instruct your API consumer to route all subsequent API requests to the same location for the same transaction; in other words, do NOT process different steps of the same transaction across locations

b. Enhance your application code to create a new transactional metadata item for each stateless microservice execution from your transactional workflow (e.g., we use ISO 20022 status codes)

c. Enhance your application code to constantly search for transactions exceeding the end-to-end processing SLA and time them out (e.g., we use the RJCT status code)

d. Enhance your application code to constantly check the health of the opposite location and, when multiple consecutive failures are detected, retrieve in-flight transactions that originated in the opposite location and roll them back in the healthy location (e.g., we use the CANC status code)

e. Enhance your application code for each transactional step to retrieve whatever status codes it finds in transactional metadata associated with the same transaction ID and stop or reverse transactional workflow if it detects an unexpected status code (e.g., RJCT or CANC)

The multi-Region active-active pattern described in this blog post is not a zero-sum game, but arguably a reference architecture applicable to a wide range of critical transactional systems running on AWS Cloud. To continue learning, visit Advanced Multi-AZ Resilience Patterns and AWS Multi-Region Fundamentals.

Eugene Istrati

Eugene is a Global Solutions Architect at AWS for Financial Services. Based in New York City, he spends most of his time with Global Financial Services customers to help them achieve their business goals through cloud enabled technology solutions. Outside work, Eugene plays soccer (read: football) and travels the world with his family.

Jack Iu

Jack is a Global Solutions Architect at AWS Financial Services. Jack is based in New York City, where he works with Financial Services customers to help them design, deploy, and scale applications to achieve their business goals. In his spare time, he enjoys badminton and loves to spend time with his wife and Shiba Inu.

Tarik Makota

Tarik Makota is a Principal Solutions Architect with Amazon Web Services. He provides technical guidance, design advice, and thought leadership to AWS’ customers across the US Northeast. He holds an M.S. in Software Development and Management from Rochester Institute of Technology.