Automating and Scaling Chaos Engineering using AWS Fault Injection Simulator

Regulators like the Financial Conduct Authority in the UK are increasingly focused on how financial services institutions respond to and recover from operational disruptions. They’re looking for a comprehensive approach to operational resilience, one that makes sure individual firms and the financial sector as a whole can prevent, adapt to, respond to, recover from, and learn from operational disruption. Technology continues to be a crucial enabler of strong reliability and operational resilience for contemporary financial services firms. Technology teams that just react to service disruptions and accept system downtime can’t guarantee systems reliability or meet robust operational resilience requirements. A more proactive approach is required to build robust technology platforms that can scale swiftly to meet evolving customer demands and remain competitive in the financial services market.

Chaos Engineering and its impact for Financial Service Customers

Chaos Engineering as a practice can help financial services institutions bridge the operational resilience gap by demonstrating to regulators the ability of production scale distributed software systems to withstand turbulent phases. Chaos Engineering uses experiments-based failure testing to surface systemic weaknesses in distributed software systems. It teaches how to identify, manage, and remediate system failures at an organisational level. Besides reducing downtime by improving system reliability and resilience, Chaos Engineering can help organisations build trust, expose technical debt, and foster a culture of learning and continuous improvement.

Chaos Engineering at Nationwide Building Society

Nationwide is the world’s largest building society. It is owned by its 16 million members and exists to serve their needs. The Society is one of the UK’s largest providers for mortgages, savings and current accounts, as well as being a major provider of ISAs, credit cards, personal loans, insurance and investments.

As a mutual organization, Nationwide Building Society uses its unique position to help rebuild society by making a positive difference to the lives of its members and the communities in which they live. It is why Nationwide still values a branch network, supports communities through charitable grants, and places a premium on helping people thrive financially – whether helping them into a home of their own or giving them the financial support they need. Taking a stand and making a difference is what sets Nationwide apart. None of which would be possible without the dedicated service of its 18,000 employees.

Nationwide operate a number of production services on AWS. Making sure that these services remain resilient to failures, whether caused by natural disasters or by human action, is essential for Nationwide. Chaos Engineering is one of the approaches that Nationwide have adopted to test and validate the resilience of these workloads.

Automating and scaling Chaos Engineering at Nationwide and the key benefits of adopting an “as-a-service” pattern

Nationwide have built their own Chaos Engineering framework that enables cloud development teams to probe and run structured, auditable experiments against their infrastructure. This has resulted in a greater understanding of how their systems react to failure modes, both in individual components and at scale. Site reliability engineering and architecture functions are traditionally responsible for advocating resilience measures through the conception of a platform or service. However, Nationwide wanted to go a step further and learn about possible issues before they happen, in order to build a more resilient and robust service. Implementing Chaos Engineering as a service offering has enabled wider adoption of Chaos Engineering practices for workloads and teams, even if they have minimal existing Chaos Engineering skills.

This offering is maintained by Nationwide’s Chaos Engineering Team that operates as part of their Cloud Engineering team. The Chaos Engineering Team builds and maintains a centralized repository of chaos accelerators, thereby allowing teams within Nationwide to get a head-start when introducing Chaos Engineering into their service landscape.

A centralized Chaos Team enables standardization on chaos tooling and experimentation templates, as well as SME support to set up and run Chaos experiments. Furthermore, the Chaos Engineering Team is constantly enhancing the centralized experiment library that teams within Nationwide can use and feed back into as they conduct their own experiments.

A Chaos Engineering service – powered by AWS Cloud Native Services

The Nationwide Chaos Engineering offering is built using AWS cloud native services, including AWS Fault Injection Simulator (FIS), AWS Systems Manager, Amazon CloudWatch, and AWS Lambda.

The platform architecture for the Nationwide Chaos Engineering framework, and the services and tools used to build the platform, are shown in the following figure:

Figure 1 Chaos Engineering at Nationwide powered by AWS

AWS Services used in the Chaos Engineering framework

AWS FIS is a fully-managed service for running fault injection experiments to improve an application’s performance, observability, and resiliency. Customers set up controlled fault injection experiments across components of their workload to test failure scenarios and validate that their workloads can achieve required resilience objectives. Examples of faults injected include abrupt termination and rebooting of Amazon Elastic Compute Cloud (Amazon EC2) instances to validate their systems can identify and recover from such failures. FIS enabled Nationwide to integrate Chaos Engineering into their existing cloud infrastructure and extend their capability beyond that of other toolsets in this domain. As Nationwide can pool together numerous services and link them, it in effect has created a handy multi-tool for experimentation.
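To make the idea of a controlled fault injection experiment more concrete, here is a minimal sketch of what an FIS experiment template for rebooting a single tagged EC2 instance could look like when built in Python. It is not Nationwide's template. The tag key `ChaosReady`, the target name `taggedInstances`, and the helper names are assumptions made for illustration, and the role and alarm ARNs must be supplied by the caller.

```python
import json
import uuid


def build_ec2_fault_template(role_arn, alarm_arn, tag_key="ChaosReady", tag_value="true"):
    """Return an FIS experiment template (as a dict) that reboots one tagged EC2 instance.

    The tag key/value and the names "taggedInstances" and "rebootInstance" are
    hypothetical; choose whatever fits your own accounts.
    """
    return {
        "clientToken": str(uuid.uuid4()),
        "description": "Reboot one tagged EC2 instance to check that the workload recovers",
        "roleArn": role_arn,
        # Stop the experiment automatically if this CloudWatch alarm fires.
        "stopConditions": [{"source": "aws:cloudwatch:alarm", "value": alarm_arn}],
        "targets": {
            "taggedInstances": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": "COUNT(1)",  # pick a single matching instance
            }
        },
        "actions": {
            "rebootInstance": {
                "actionId": "aws:ec2:reboot-instances",
                "targets": {"Instances": "taggedInstances"},
            }
        },
    }


def create_template(template):
    """Submit the template to FIS and return its ID (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without boto3 installed

    response = boto3.client("fis").create_experiment_template(**template)
    return response["experimentTemplate"]["id"]


if __name__ == "__main__":
    demo = build_ec2_fault_template(
        role_arn="arn:aws:iam::123456789012:role/example-fis-role",  # placeholder ARN
        alarm_arn="arn:aws:cloudwatch:eu-west-2:123456789012:alarm:example-alarm",  # placeholder ARN
    )
    print(json.dumps(demo, indent=2))
```

Keeping the template as plain data makes it easy to review and version alongside experiment documentation before anything is created in an account.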

FIS integrates with CloudWatch to provide customers with an in-depth view of the impact of their experiments on workloads. CloudWatch metrics are used to define baselines for each resource that customers are targeting, and customers can observe how these baselines are impacted during experiments to identify resilience issues. More details of how customers can use CloudWatch with FIS can be found here.
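One simple way to reason about baselines is to derive a tolerance band from metric samples collected before an experiment and then flag samples taken during the experiment that fall outside it. The sketch below assumes the datapoints have already been retrieved (for example, from CloudWatch) as plain lists of numbers. The three-standard-deviation band and the function names are illustrative choices, not something prescribed by FIS or CloudWatch.

```python
from statistics import mean, stdev


def baseline_band(samples, k=3.0):
    """Return (low, high) bounds: the baseline mean plus or minus k standard deviations."""
    mu = mean(samples)
    sigma = stdev(samples)  # sample standard deviation; needs at least two samples
    return (mu - k * sigma, mu + k * sigma)


def breaches(experiment_samples, band):
    """Return the experiment samples that fall outside the baseline band."""
    low, high = band
    return [value for value in experiment_samples if value < low or value > high]


if __name__ == "__main__":
    # Hypothetical latency samples in milliseconds.
    steady_state = [100, 102, 98, 101, 99]
    during_experiment = [101, 180, 97]

    band = baseline_band(steady_state)
    print("band:", band)
    print("breaches:", breaches(during_experiment, band))
```

The same band could also inform the threshold of a CloudWatch alarm that is used as an FIS stop condition.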

Amazon Simple Notification Service (Amazon SNS) is used to implement enhanced monitoring during experiments. FIS emits notifications for changes in the state of experiments, and these notifications are available as events through Amazon EventBridge. Customers can configure an EventBridge rule to publish these events to an SNS topic, which then sends an email notification and gives increased visibility of the experiment’s status.
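As a rough illustration of this wiring, the following sketch builds an EventBridge event pattern for FIS experiment state changes and attaches an SNS topic as the rule target. The rule name, target ID, and list of statuses are assumptions, and the exact fields under `detail` should be checked against the FIS documentation before relying on them.

```python
import json


def fis_state_change_pattern(statuses=("completed", "stopped", "failed")):
    """Build an EventBridge event pattern that matches FIS experiment state changes.

    The "detail" filter on new-state/status is an assumption to verify against
    the current FIS event format.
    """
    return {
        "source": ["aws.fis"],
        "detail-type": ["FIS Experiment State Change"],
        "detail": {"new-state": {"status": list(statuses)}},
    }


def wire_rule_to_topic(rule_name, topic_arn, events_client=None):
    """Create (or update) the rule and point it at an SNS topic."""
    if events_client is None:
        import boto3  # imported lazily; requires AWS credentials at run time

        events_client = boto3.client("events")

    events_client.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(fis_state_change_pattern()),
        State="ENABLED",
    )
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "notify-sns", "Arn": topic_arn}],  # "notify-sns" is a hypothetical ID
    )
```

For delivery to work, the SNS topic's access policy also has to allow EventBridge (`events.amazonaws.com`) to publish to it, and an email subscription has to be confirmed on the topic.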

FIS comes bundled with a set of actions which are fault injection activities that you can run on a target. Customers can enhance this list by adding custom fault types using Systems Manager agent-based integration. More details on how this integration works can be found here.
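For example, an agent-based fault can be expressed as an FIS action that runs a Systems Manager document on the targeted instances. The sketch below builds such an action using the AWS-provided `AWSFIS-Run-CPU-Stress` document. The function name, default values, and target name are illustrative assumptions, and the document's parameters should be confirmed in the Systems Manager console for your Region.

```python
import json


def cpu_stress_action(region, target_name, duration_minutes=2, load_percent=80):
    """Build an FIS action that runs a CPU stress SSM document on the given target.

    Assumes the instances run the SSM Agent and that target_name matches a
    target defined elsewhere in the same experiment template.
    """
    document_parameters = {
        "DurationSeconds": str(duration_minutes * 60),
        "LoadPercent": str(load_percent),
    }
    return {
        "actionId": "aws:ssm:send-command",
        "parameters": {
            "documentArn": f"arn:aws:ssm:{region}::document/AWSFIS-Run-CPU-Stress",
            # FIS expects the document parameters as a JSON-encoded string.
            "documentParameters": json.dumps(document_parameters),
            # ISO 8601 duration for how long FIS keeps the action running.
            "duration": f"PT{duration_minutes}M",
        },
        "targets": {"Instances": target_name},
    }


if __name__ == "__main__":
    print(json.dumps(cpu_stress_action("eu-west-2", "taggedInstances"), indent=2))
```

A custom fault type follows the same shape: the `documentArn` points at a document you own instead of an AWS-provided one.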

AWS Lambda can be used to inject custom faults using Systems Manager automation integration with FIS. For instance, one use case is to inject faults into the network configuration of a particular Availability Zone (AZ) and VPC. An example of this can be found here.
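The first step of such a Lambda function is usually to work out which resources the fault should touch. The sketch below shows only that step: it looks up the subnets of a given VPC that sit in a given AZ. The event keys `vpc_id` and `az_name` are hypothetical, the fault itself (for example, changing network ACLs) is deliberately left out, and pagination of the subnet listing is not handled.

```python
def subnets_in_az(ec2_client, vpc_id, az_name):
    """Return the IDs of the subnets in the given VPC and Availability Zone."""
    response = ec2_client.describe_subnets(
        Filters=[
            {"Name": "vpc-id", "Values": [vpc_id]},
            {"Name": "availability-zone", "Values": [az_name]},
        ]
    )
    return [subnet["SubnetId"] for subnet in response["Subnets"]]


def lambda_handler(event, context, ec2_client=None):
    """Hypothetical handler invoked from a Systems Manager automation step.

    Expects event = {"vpc_id": "...", "az_name": "..."}; these key names are
    assumptions made for this sketch.
    """
    if ec2_client is None:
        import boto3  # available in the Lambda Python runtime

        ec2_client = boto3.client("ec2")

    subnet_ids = subnets_in_az(ec2_client, event["vpc_id"], event["az_name"])

    # The actual network fault would be applied to these subnets here, and the
    # original configuration recorded so that it can be rolled back afterwards.
    return {
        "vpc_id": event["vpc_id"],
        "az_name": event["az_name"],
        "target_subnets": subnet_ids,
    }
```

Returning the affected subnets makes it straightforward for a later automation step to restore the original configuration.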

Building and maintaining the Nationwide Chaos Repo

Nationwide’s Chaos Engineering Team have built a repository of chaos experiments that model the most common failure scenarios that Nationwide cloud deployments are expected to address. The repo includes:

Experiment documentation – This describes failure scenarios, business and technical impacts of the failure, observability controls to identify and report on the failure, as well as remediations for each failure.
FIS experiment templates – These are the actual experiment templates that FIS will execute and are modelled using experiment documentation. More details on FIS experiment templates can be found here.
Chaos Service Pack – This includes the configuration for the services that make up the Chaos platform, including FIS, bundled into Terraform scripts that customers can use to start running experiments in their own accounts.

Nationwide have also introduced performance-based experiments utilizing Stress-NG to enable extended scenario coverage. Furthermore, Nationwide have incorporated Locust to put load onto systems in tandem with running FIS.

GitHub is used to operate this repo, and Nationwide’s Chaos Engineering Team populate and maintain experiment documentation and templates. This repository is constantly expanding and will be updated based upon trends, requests, and business requirements.

Automated deployment of experiments for cloud initiatives

Cloud project teams at Nationwide engage with the Chaos Engineering Team from their design phase onward as they start identifying failure scenarios for their architecture. Project teams download experiment templates from the Chaos repo for experiments that they want to conduct as well as the Chaos Service pack that configures AWS Services such as FIS to execute experiments. The service pack uses Terraform Infrastructure-as-Code (IaC) and, in conjunction with automated pipelines (Buildkite/Jenkins), projects can rapidly build out the infrastructure required to start experimenting on their deployments. The reason for automating the deployment was to make it as easy as possible for customers to accelerate implementation of Chaos Engineering for their initiatives.

Once the infrastructure is built, Nationwide project teams can run FIS experiments as an additional post-build pipeline step to validate resiliency and highlight any future areas to probe. They can also run the experiments manually through the AWS Command Line Interface (AWS CLI). However, this approach is limited to isolated experimentation.
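A pipeline step of this kind typically starts an experiment from a template, waits for it to finish, and fails the build if the experiment did not complete. The sketch below shows one way to do that with the FIS API through boto3. The polling interval, timeout, and function name are assumptions, and it is not a description of Nationwide's actual pipelines.

```python
import time
import uuid

# Experiment statuses after which FIS makes no further progress.
TERMINAL_STATUSES = {"completed", "stopped", "failed"}


def run_experiment(fis_client, template_id, poll_seconds=15, timeout_seconds=1800, sleep=time.sleep):
    """Start an FIS experiment and block until it reaches a terminal status.

    fis_client is expected to behave like boto3.client("fis"). Returns the
    final status string, or raises TimeoutError if the experiment runs too long.
    """
    experiment = fis_client.start_experiment(
        clientToken=str(uuid.uuid4()),
        experimentTemplateId=template_id,
    )["experiment"]

    status = experiment["state"]["status"]
    waited = 0
    while status not in TERMINAL_STATUSES:
        if waited >= timeout_seconds:
            raise TimeoutError(f"experiment {experiment['id']} still {status} after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds
        status = fis_client.get_experiment(id=experiment["id"])["experiment"]["state"]["status"]
    return status
```

In a pipeline, the step could call `run_experiment(boto3.client("fis"), template_id)` and exit with a non-zero code when the returned status is anything other than `completed`, so that a stopped or failed experiment is visible in the build result.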

Projects can get SME support on Chaos Engineering from the Chaos Engineering Team. Therefore, they don’t inherently require Chaos Engineering expertise. Projects are encouraged to update experiment templates if they identify enhancements, in turn fostering a collaborative approach to Chaos Engineering at Nationwide.

The future of Chaos Engineering in Nationwide

The use of Chaos Engineering within Nationwide has allowed them to run large-scale Gamedays within their lower environments to build confidence in system designs and against production-like infrastructure. Nationwide have set their sights on expanding Chaos Engineering through automating extended chained event Gamedays, third-party tooling integration, and expanding the monitoring solutions to incorporate machine learning (ML) models/Anomaly Detection to locate and probe possible future faults before they materialize. By leveraging additional AWS Services, such as AWS Resilience Hub, CloudWatch anomaly detection, and Amazon QuickSight, Nationwide want to develop a comprehensive dashboard that links to their other monitoring solutions.

Conclusion

Far from being chaotic, Chaos Engineering is a systematic, data-driven method for designing experiments that reveal system vulnerabilities and help organizations improve their operational resilience posture. Nationwide was able to learn, make improvements, and build confidence through Chaos Engineering by making it as simple as possible. This creates a level playing field across teams so that the same tooling and processes can be used rather than multiple complex methods. Moreover, this results in cost savings, a reduction in MTTR (Mean Time To Repair) of up to 90%, and teams that have experience with varied failure modes and scenarios and can proactively fix or modify configurations.

Get started on implementing Chaos Engineering using FIS for your cloud initiatives. Check out our AWS Fault Injection Simulator User Guide.

AWS for Industries