2025

Improving Resilience Using AWS with Capital One

Learn how financial institution Capital One is strengthening its technology resilience by using automated solutions on AWS.

Overview

In an industry where technology and banking have traditionally existed in separate spheres, Capital One is breaking the mold. Continually pushing the boundaries of what’s possible, the corporation operates under a philosophy of being a technology company that happens to be in banking. 

Capital One completed an all-in migration to Amazon Web Services (AWS) and has since taken major steps to continue refining and optimizing its technology stack. In its journey to build a technology company, Capital One wanted to transform its approach to resilience. Working closely alongside the AWS team, the company developed innovative solutions to improve reliability and strengthen its systems.

About Capital One

Capital One is on a mission to change banking for good. As one of the leading digital banks in the United States, the company considers technology central to how it delivers value to more than 100 million customers.

Opportunity | Using AWS to transform resilience for Capital One

As one of the largest financial institutions in the United States, Capital One serves about 100 million customers through a broad portfolio of products and services. The company has embarked on an ever-evolving journey to take advantage of cloud architecture deployments. From using AWS serverless technologies to adopting open source frameworks, Capital One is committed to solving complex challenges for its customers.

With millions of transactions and customer interactions daily, Capital One needs to maintain exceptional system reliability. Every moment of downtime can affect customers who are trying to access their accounts, make payments, or complete critical financial transactions. At Capital One, maintaining high levels of availability while continuing to innovate with speed is critically important. Reaching this state is a significant undertaking when managing thousands of applications and components across the company’s technology system.

Capital One needed to evolve how it approached resilience and build systems that could automatically prevent, detect, and recover from potential disruptions. Using AWS, the company saw an opportunity to improve its system resilience.

Solution | Reducing critical-severity events by 80–90 percent

Capital One worked closely alongside the AWS team to enhance its strategy for customer reliability and resilience. The company implemented automated failover mechanisms by using AWS services, established regular training sessions on resilience for its engineering teams, and developed new testing methodologies. A key part of this strategy involved identifying critical applications and requiring them to maintain automated failover capabilities and undergo rigorous testing. This standard now extends to all new applications, which are designed with quality in mind.

The company also engages the AWS team to help drive innovation through strategic implementations, rigorous proofs of concept, and comprehensive evaluations of cutting-edge services and solutions. Using insights from successful proofs of concept in individual lines of business, Capital One systematically scales these learnings across its enterprise. As a result, the company empowers its vast network of engineers, unlocking potential across numerous components and applications.

Building on this foundation, Capital One implemented automated Regional failover capabilities across a number of applications. The company can seamlessly transition services between AWS Regions in response to disruptions, reducing the impact of network issues or system failures on its customers. The automated failover capability has helped Capital One significantly reduce the number of events from the outset.
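The general idea behind automated failover between Regions can be illustrated with a small sketch. The code below is not Capital One's implementation and makes no AWS API calls. It is a hypothetical controller that switches the active Region after a configurable number of consecutive failed health checks. The class name, the threshold of three, and the Region pair are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class FailoverController:
    """Hypothetical controller: swaps Regions after repeated failed checks."""

    active: str
    standby: str
    failure_threshold: int = 3  # assumed value, not taken from the case study
    consecutive_failures: int = 0

    def record_health_check(self, healthy: bool) -> str:
        """Record one health check result and return the active Region."""
        if healthy:
            # Any successful check resets the failure streak.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                # Fail over: the standby Region becomes active.
                self.active, self.standby = self.standby, self.active
                self.consecutive_failures = 0
        return self.active


# Illustrative Region names and an illustrative sequence of check results.
controller = FailoverController(active="us-east-1", standby="us-west-2")
checks = [True, False, False, False, True]
history = [controller.record_health_check(result) for result in checks]
print(history)
```

Requiring several consecutive failures before switching is one common way to avoid failing over on a single transient error. A production system would also need to handle data replication and traffic routing, which this sketch leaves out.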

To further enhance its resilience capabilities, Capital One developed a central hub for recovery engineering. This solution automates the deployment of runbooks, manages failover processes, and coordinates recovery operations across the company’s entire application portfolio. The hub centralizes many of the capabilities that Capital One gets from AWS. By building this tooling on top of AWS, the company improved its time to restore while still using core AWS services. With this centralized approach, Capital One can orchestrate complex failovers across thousands of components in the correct dependency order, reducing its recovery time from hours to minutes.
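Orchestrating failovers "in the correct dependency order" is, in general terms, a topological ordering problem. The sketch below shows one way to compute such an order with Python's standard `graphlib` module. It is an illustration under stated assumptions, not a description of Capital One's recovery hub: the component names and the dependency map are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each component lists what it depends on.
dependencies = {
    "database": set(),
    "auth-service": {"database"},
    "notifications": {"database"},
    "payments-api": {"database", "auth-service"},
    "web-frontend": {"payments-api"},
}


def failover_waves(deps):
    """Group components into waves; each wave depends only on earlier waves."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        # Everything returned here has all of its dependencies recovered.
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = failover_waves(dependencies)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

Components in the same wave have no dependencies on each other, so they could in principle be recovered in parallel, which is one reason an ordered, automated approach can be faster than working through runbooks by hand.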

The company has also embraced a culture of continuous testing and improvement. Capital One conducts a monthly AWS GameDay—which is a fun, gamified, and interactive learning experience—where teams validate their systems’ ability to handle various failure scenarios. These exercises have evolved from quarterly failover tests between Regions to monthly chaos engineering experiments, helping teams build confidence in their automation and identify potential issues before they affect customers. This proactive approach, combined with automated recovery capabilities, has helped Capital One reduce critical-severity events by 80–90 percent.

Outcome | Driving innovation through enhanced resilience

With a strong foundation in place, Capital One is continuing to evolve its resilience strategy. For example, the company is developing new tools to automatically detect when system configurations deviate from their intended state and to test different failure scenarios across thousands of applications simultaneously.
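Detecting when a configuration deviates from its intended state is often called drift detection. At its simplest, it compares a declared configuration with an observed one. The following is a minimal sketch of that comparison. The function name, the setting keys, and the values are hypothetical, and the case study does not describe how Capital One's tools work.

```python
MISSING = "<missing>"  # marker for a key present on only one side


def detect_drift(intended, actual):
    """Return {key: (intended_value, actual_value)} for every mismatch."""
    drift = {}
    for key in sorted(intended.keys() | actual.keys()):
        want = intended.get(key, MISSING)
        have = actual.get(key, MISSING)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical settings for one application.
intended = {"min_instances": 4, "multi_region": True, "timeout_seconds": 30}
actual = {"min_instances": 2, "multi_region": True, "debug": True}

drift = detect_drift(intended, actual)
for key, (want, have) in drift.items():
    print(f"{key}: intended={want!r} actual={have!r}")
```

The comparison reports changed values, settings that are missing, and settings that were added unexpectedly, since any of these can mean a system no longer matches the state it was tested in.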

“We believe that innovation should be at the speed of well-managed systems,” says Parvez Naqvi, managing vice president, resilience and reliability engineering, at Capital One. “Using near real-time data from AWS, coupled with our systems’ domain models, we can make intelligent decisions quickly and facilitate successful customer transactions. Using AWS has empowered us to solve some really fun and complex problems, and we continue to explore new ways of improving the strength and reliability of our systems.”

