Highly Resilient Affirm Checkout Architecture

Introduction

This post explains the current Affirm Checkout architecture and how we are making it highly resilient.

Affirm’s mission is to deliver honest financial products that improve its customers’ lives. Affirm is reinventing credit to make it more honest and friendly, giving consumers the flexibility to buy now and pay over time without any hidden fees or compounding interest.

Affirm’s products have grown in popularity since the company was founded over a decade ago, creating a need to re-architect the technology platform to meet evolving scale, performance, and reliability requirements.

Objective

Affirm is customer focused and wants its service to be available whenever customers look to transact. Therefore, an important goal for the Architecture Group is to provide four 9s of availability, meaning the system is available 99.99% of the time. This translates to roughly 52.6 minutes of allowed downtime per year.
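The downtime budget follows directly from the availability target. Here is a minimal sketch of the arithmetic, assuming a 365-day year; the function name is illustrative and not part of any Affirm tooling.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def downtime_budget_minutes(availability: float) -> float:
    """Return the yearly downtime allowed by an availability target in [0, 1]."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


# Compare the budgets for the availability tiers discussed in this post.
for label, target in [("three 9s", 0.999), ("four 9s", 0.9999), ("five 9s", 0.99999)]:
    print(f"{label}: {downtime_budget_minutes(target):.1f} minutes per year")
```

Each additional 9 cuts the budget by a factor of ten, so four 9s leaves under an hour per year for every outage, planned or unplanned, combined.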

Current Architecture

During Checkout, Affirm makes a real-time credit decision for every transaction, deciding on the financing that allows consumers to make purchases and pay for them over time. Affirm’s Checkout service handles this end-to-end process, including the underwriting decision. The service runs on Amazon EKS and consists of many microservices, which in turn use multiple AWS services such as Application Load Balancer (ALB), Amazon Aurora, and AWS Security Token Service (AWS STS).

Architectural Decisions & Analysis

To achieve four 9s, the general recommendation from AWS is to use a single-region architecture. For Affirm, this recommendation raises two crucial questions:

1. How should Affirm think about large-scale events that affect an entire region?

2. How should Affirm think about building four 9s availability in a single region when regional AWS services have a mix of four 9s and three 9s availability SLAs?
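The second question matters because a request that needs every one of its dependencies can be no more available than the product of their availabilities. The sketch below is a rough illustration under two simplifying assumptions: failures are independent, and SLA figures stand in for real availability. The dependency lists are hypothetical and are not Affirm's measurements.

```python
from math import prod
from typing import Iterable


def serial_availability(dependencies: Iterable[float]) -> float:
    """Availability of a path that needs every dependency to be up.

    Assumes independent failures, which real systems only approximate.
    """
    return prod(dependencies)


# Hypothetical critical paths with three hard dependencies each.
all_four_nines = serial_availability([0.9999, 0.9999, 0.9999])
one_three_nines = serial_availability([0.9999, 0.9999, 0.999])

print(f"all four 9s dependencies: {all_four_nines:.4%}")
print(f"one three 9s dependency:  {one_three_nines:.4%}")
```

Under these assumptions, a single three 9s dependency on the critical path pulls the composite below three 9s, and every extra hard dependency lowers it further. SLAs are contractual commitments rather than measured availability, so treat the output as a pessimistic illustration of why both the number and the SLA of hard dependencies matter.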

To address the first question, Affirm, being a data-driven company, conducted a thorough examination of all major outages on AWS. To address the second question, we looked at the Affirm Checkout service and its dependencies on AWS services. The outage review revealed that, since 2011, there have been no instances of correlated disruptions across AWS regions. While multi-region architectures have advantages for availability and disaster recovery, a single-region architecture can provide a sufficient level of availability for many applications while offering greater simplicity, cost-effectiveness, and optimized resource allocation. We concluded that, by architecting correctly in a single region, we can limit how often we are affected by region-wide disasters to once every 2–4 years. The specific architectural decisions we made are:

1. To achieve four 9s, we rely on a single region with multi-AZ redundancy. This allows us to tolerate disasters in a single Availability Zone (AZ). It is rare for disasters in one AZ to spread to multiple AZs, and the round-trip latency between AZs in the same region is generally in the single-digit milliseconds.

2. The checkout data plane relies on a minimal number of AWS services. Furthermore, the data plane relies only on AWS services with an availability SLA of four 9s or better. We decided that our control plane can rely on services with three 9s, but we worked closely with AWS solution architects and service specialists to configure those services optimally for our applications. For example, AWS Security Token Service, a three 9s service that we use on the control plane, was affected during the recent AWS Lambda outage. However, setting the token refresh interval to a large value minimizes the likelihood that such a control plane disruption turns into a data plane outage that impacts availability.

3. All microservices degrade gracefully. Every microservice involved in the checkout data plane is classified as either optional (for example, the rewards program) or required (for example, identity verification). Optional microservices fail in a safe, non-blocking manner, so that Checkout succeeds even when any optional microservice is down. This further reduces the dependency of Checkout availability on AWS services.

4. We maintain a Disaster Recovery (DR) plan to fail over to another region in case of a large-scale event in a single region. While we expect to fail over no more than once every 2–4 years, the DR plan is tested more frequently to verify correctness.

5. We architect Affirm microservices to isolate the various workloads they serve from one another. For example, we achieve compute isolation by using separate Kubernetes deployments for Checkout and non-Checkout services. An Istio VirtualService routes traffic to the right set of pods, preventing non-Checkout traffic from impacting Checkout traffic.
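The split between optional and required microservices in decision 3 can be sketched as follows. This is a hypothetical illustration, not Affirm's implementation: the `Step` type, the `run_checkout` function, and the two example steps are invented for this sketch, and real services would be remote calls with timeouts rather than local functions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    call: Callable[[dict], dict]  # takes the order, returns fields to merge in
    required: bool


def run_checkout(order: dict, steps: list[Step]) -> dict:
    """Run every step; only a failing required step aborts the checkout."""
    result: dict = {"degraded": []}
    for step in steps:
        try:
            result.update(step.call(order))
        except Exception:
            if step.required:
                raise  # required dependency is down: checkout cannot proceed
            result["degraded"].append(step.name)  # optional: record and move on
    return result


def verify_identity(order: dict) -> dict:
    return {"identity_verified": True}


def fetch_rewards(order: dict) -> dict:
    raise TimeoutError("rewards service unavailable")  # simulated outage


steps = [
    Step("identity_verification", verify_identity, required=True),
    Step("rewards", fetch_rewards, required=False),
]
outcome = run_checkout({"order_id": "demo-1"}, steps)
print(outcome)
```

In this sketch the simulated rewards outage is recorded under `degraded` and the checkout still completes, whereas an exception from `verify_identity` would propagate and fail the request. Recording the skipped steps, rather than silently swallowing them, keeps degraded checkouts visible to monitoring.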

Affirm also evaluated an active-active multi-region architecture. However, such an architecture requires a large investment at the application layer, because using multi-region strongly consistent storage requires significant application changes. These changes include reducing chattiness between the application and the database to minimize the impact of higher write latency. Alternatively, using multi-region eventually consistent storage requires applications to correctly handle the implications of replication delay, for example lost writes.

Conclusion

The strategy of building four 9s in a single region, with design choices like graceful degradation and isolation, allowed Affirm to focus on architecting against application-level failures, the biggest source of unavailability for Affirm. There is always a desire to improve availability, so we consider this a journey. Affirm will invest in an active-active multi-region architecture when we architect for five 9s. Multi-region is an important tool, but we want to be careful not to reach for it too soon and make our system more complex than needed, as we fundamentally believe that simpler architectures result in more reliable systems.

AWS for Industries