AWS Cloud Operations & Migrations Blog

Using the Fault Tolerance Analyser Tool to Identify Potential Issues

Introduction

Ensuring resilience, the ability for a system to recover from a failure induced by load, attacks, and other issues, is a shared responsibility that underpins the reliability of your workloads. While AWS provides the resilient underlying cloud infrastructure, customers are tasked with maintaining the resilience of their applications. In this landscape of joint responsibility, we introduce an open-source tool: the Fault Tolerance Analyser Tool. It simplifies the process by inspecting resources in an account and verifying that they adhere to fault tolerance best practices based on the multiple Availability Zones (multi-AZ) architecture.

The AWS Well-Architected Framework helps cloud architects build secure, high-performing, resilient, and efficient infrastructure for a variety of applications and workloads. Built around six pillars—operational excellence, security, reliability, performance efficiency, cost optimization, and sustainability—AWS Well-Architected provides a consistent approach for customers and partners to evaluate architectures and implement scalable designs.

The Reliability Pillar within the AWS Well-Architected Framework outlines two key best practices which we’d like to highlight, specifically around Failure Management:

  • The best practice, REL10-BP01: Deploy the workload to multiple locations, advises deploying workloads to multiple locations to avoid single points of failure and increase availability. This includes distributing data and resources across multiple AZs within an AWS Region.
  • The best practice, REL10-BP02: Select the appropriate locations for your multi-location deployment, advises that to maintain high availability and resilience in your system, it’s crucial to deploy your workload components across multiple AZs within a single AWS Region. This multi-AZ deployment serves as a robust line of defense against a range of disruptions, from minor technical issues to significant natural disasters. It’s worth noting that while a multi-AZ architecture can introduce some complexity, it typically provides sufficient resilience for most workloads without resorting to multi-Region deployment.

Effectively managing your part of the shared responsibility model for resilience is crucial for creating fault tolerant AWS architectures, which in turn safeguards your workloads from disruptions. The Fault Tolerance Analyser Tool, working within the guidelines of the AWS Well-Architected Framework – Reliability Pillar, empowers you to monitor the different components of your AWS architecture. Stay tuned as we explore in detail how this tool can contribute to building more robust, efficient, and fault tolerant workloads in the AWS Cloud.

How is this tool different from AWS Trusted Advisor, AWS Resilience Hub, and AWS Well-Architected Tool?

The Fault Tolerance Analyser Tool described here is a fully open-source tool, released under the MIT license, designed to generate a list of potential fault tolerance issues specific to different AWS services. Its detailed report helps customers assess how well the services they use tolerate failures in a specific location. Check out the Frequently Asked Questions (FAQ) in our GitHub repository for more information.

AWS Trusted Advisor is a service provided by AWS that helps customers optimize their AWS infrastructure, improve performance, and enhance security. It provides recommendations based on best practices and helps identify potential issues. However, it focuses on general best practices with some fault tolerance checks available.

AWS Resilience Hub offers a centralized location to define, validate, and track the resilience of your AWS applications. It helps protect your applications from disruptions, reduces recovery costs, and optimizes business continuity. You can describe your applications using AWS CloudFormation, Terraform state files, AWS Resource Groups, or choose from applications already defined in AWS Service Catalog AppRegistry.

AWS Well-Architected Tool is a service designed to help customers align their workloads with AWS best practices by leveraging the AWS Well-Architected Framework. This tool aids customers in assessing and enhancing the efficiency, scalability, and performance of their workloads. It facilitates architectural reviews, application of best practices, and enables customized reviews with unique lenses based on customers’ organizational priorities. By utilizing the AWS Well-Architected Tool, customers are adopting a trusted framework for continuous improvement and tracking of their architecture’s progress, ensuring adherence to the principles of the AWS Well-Architected Framework.

Fault Tolerance Analyser Tool

The Fault Tolerance Analyser Tool helps customers identify AWS resources that are configured with a single point of failure, thus potentially reducing the availability of workloads. It runs within a single account and runs for the services and Regions you specify as options.
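To make the idea of a single point of failure concrete, here is a minimal sketch of one such check, written for illustration only. It is not the tool's actual implementation: the function name and the sample subnet data are hypothetical, and in practice the subnet-to-AZ mapping would have to be retrieved from the AWS APIs. The sketch treats a VPC-enabled Lambda function as multi-AZ only if its subnets cover at least two distinct Availability Zones.

```python
from typing import Dict, List


def lambda_has_multi_az(subnet_ids: List[str], subnet_to_az: Dict[str, str]) -> bool:
    """Return True if the given subnets span at least two distinct AZs."""
    zones = {subnet_to_az[subnet_id] for subnet_id in subnet_ids}
    return len(zones) >= 2


# Hypothetical sample data: two subnets share an AZ, one is in another AZ.
subnet_to_az = {
    "subnet-aaa": "us-east-1a",
    "subnet-bbb": "us-east-1a",
    "subnet-ccc": "us-east-1b",
}

# One subnet only: single point of failure.
print(lambda_has_multi_az(["subnet-aaa"], subnet_to_az))
# Two subnets in the same AZ: still a single point of failure.
print(lambda_has_multi_az(["subnet-aaa", "subnet-bbb"], subnet_to_az))
# Two subnets in different AZs: multi-AZ.
print(lambda_has_multi_az(["subnet-aaa", "subnet-ccc"], subnet_to_az))
```

Counting distinct AZs rather than subnets is a choice made in this sketch, since two subnets in the same AZ would fail together.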

Features of the Fault Tolerance Analyser Tool

This tool can identify findings in the following 15 resources/services:

The details pertaining to the findings for each service can be found in the Functional Design section of the README file.

Running the tool

The README file has detailed information on how to execute the tool, but in this blog post we are going to explain briefly how this is done. The first step is to clone the repo https://github.com/aws-samples/fault-tolerance-analyser to a local folder.

Next, to run the tool directly, install the prerequisites by running the following command from the directory into which the repo was cloned.

#Install the requirements
pip install -r requirements.txt

Once done, you can run the tool with the following commands. Note: This example runs the tool for 2 Regions and 4 services.

#Change into the src folder
cd src

#Invoke the tool
python3 account_analyser.py \
    --regions us-east-1 us-east-2 \
    --services lambda opensearch docdb rds

This command generates two output .csv files in the output subfolder of the directory from which you ran the tool. In this case, that is the src directory.

  • Fault_Tolerance_Findings_2023_05_15.csv
  • Fault_Tolerance_Findings_2023_05_15_run_report.csv

The first .csv file has the findings from the tool, while the second .csv file (with “run_report” in the name) shows the time taken by the tool to analyse the resources of a given Region and service.
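As a quick way to digest the findings file, the following sketch counts potential issues per service. It assumes the .csv file has a header row with service and potential_issue columns and that flagged rows carry the value TRUE, as described in this post. The function name and the inline sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def count_issues_by_service(csv_file) -> Counter:
    """Count rows flagged as potential issues, grouped by service."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        # Compare case-insensitively in case the file uses "True" or "true".
        if row["potential_issue"].strip().upper() == "TRUE":
            counts[row["service"]] += 1
    return counts


# Made-up sample rows standing in for a real findings file.
sample = io.StringIO(
    "service,region,resource_name,potential_issue,message\n"
    "lambda,us-east-1,fn-a,TRUE,Example message\n"
    "rds,us-east-1,db-a,FALSE,Example message\n"
    "rds,us-east-2,db-b,TRUE,Example message\n"
)
print(count_issues_by_service(sample))
```

To run this against a real report, pass an open file instead of the sample, for example open("output/Fault_Tolerance_Findings_2023_05_15.csv", newline="").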

We also offer a Dockerfile and instructions for building and running the tool as a container.

Below are some other options the tool provides:

  • Write findings from multiple accounts to a single file or keep them in separate files by adding the account ID to the file names using the --filename-with-accountid flag.
  • Control the maximum number of threads that are actively making API calls at any given time. You also have the option to process all regions and services sequentially in a single thread.
  • Report only those resources that have findings and exclude the rest.

Please refer to the README file for the full set of options.

Understand the findings report

The tool provides three options to report findings:

  • Save it to a local file in the output sub folder. This folder name can be customized with the --output command line parameter.
  • Upload the file to an Amazon S3 bucket using the --bucket parameter. This will be in addition to the file being stored in the local subfolder.
  • Post the findings to an Amazon EventBridge event bus with the --event-bus-arn parameter.

Below is a screenshot of the csv file containing the findings.

Each finding has a boolean column called potential_issue. A TRUE value in this column indicates a potential fault tolerance issue that needs to be looked at. The message column explains what the potential issue is. For example, the first record identifies a VPC-enabled Lambda function in only one subnet. This is an issue because the Lambda function cannot access the resources in the VPC if the subnet’s AZ experiences an interruption.

Screenshot of the CSV findings report. It contains multiple columns such as service, region, resource_name, message and others.

Findings posted to the event bus will have the same fields as the .csv file, but will have the following format.

{
    "account_id": "123456789101",
    "account_name": "TestAccount",
    "payer_account_id": "999456789101",
    "payer_account_name": "TestParentAccount",
    "service": "rds",
    "region": "us-east-1",
    "timestamp": "2022_11_29_16_20_44+0000",
    "resource_id": "",
    "resource_name": "aurora-mysql-multiaz",
    "resource_arn": "arn:aws:rds:us-east-1:123456789101:cluster:aurora-mysql-multiaz",
    "engine": "aurora-mysql",
    "potential_issue": true,
    "message": "RDS Cluster aurora-mysql-multiaz has MultiAZ disabled"
}
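A consumer of these findings might want to turn the JSON into native types. The sketch below parses a finding shaped like the example above and converts its timestamp, whose layout is inferred from that single example, into a timezone-aware datetime. The function name is hypothetical, and how EventBridge wraps the finding inside the delivered event is not covered here.

```python
import json
from datetime import datetime

# Layout inferred from the sample value "2022_11_29_16_20_44+0000".
FINDING_TIMESTAMP_FORMAT = "%Y_%m_%d_%H_%M_%S%z"


def parse_finding(payload: str) -> dict:
    """Parse a finding JSON string and convert its timestamp to a datetime."""
    finding = json.loads(payload)
    finding["timestamp"] = datetime.strptime(
        finding["timestamp"], FINDING_TIMESTAMP_FORMAT
    )
    return finding


payload = """
{
    "service": "rds",
    "region": "us-east-1",
    "timestamp": "2022_11_29_16_20_44+0000",
    "resource_name": "aurora-mysql-multiaz",
    "potential_issue": true,
    "message": "RDS Cluster aurora-mysql-multiaz has MultiAZ disabled"
}
"""

finding = parse_finding(payload)
if finding["potential_issue"]:
    print(finding["timestamp"].isoformat(), finding["service"], finding["message"])
```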

What it is not

The Fault Tolerance Analyser Tool checks your environment against pre-defined configurations to validate if there are any potential issues related to fault tolerance. While this tool is a good starting point to identify issues, it is not exhaustive. For example, if you have a VPC enabled Lambda function in two AZs, the tool will see it as a Lambda function set up in accordance with the Multi-AZ best practice. But if the Lambda function accesses an EC2 hosted service that exists only in one AZ, the workload is still not tolerant to an AZ-wide issue. In that situation, the service hosted on EC2 will not be flagged by this tool as an issue.

This tool also does not look at cross-Region fault tolerance. For example, in the case of AWS Direct Connect, you might have redundancy by having two Direct Connect connections in two different Regions that allow your on-premises networks to access the same AWS resources redundantly. However, since this tool looks only at a single Region at a time, it might see only one connection in each Region and flag it as an issue in each of those two Regions. In this case, the issues would be false positives and can be ignored.

Flowchart of the tool

Here is a simple flow chart of how the tool operates. Note how each service and region is processed in its own thread.

Flow chart explaining how the Fault Tolerance Analyser Tool operates, noting how each service and region is processed in its own thread
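The threading model in the flow chart can be approximated with Python's standard thread pool. This is only a sketch of the general pattern, not the tool's source code: analyse is a placeholder for the real per-service checks, and run_all and max_threads are names chosen for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
from typing import List, Sequence


def analyse(service: str, region: str) -> str:
    """Placeholder for the real checks on one service in one Region."""
    return f"{service}:{region}"


def run_all(services: Sequence[str], regions: Sequence[str], max_threads: int = 4) -> List[str]:
    """Run analyse for every (service, region) pair on a bounded thread pool."""
    pairs = list(product(services, regions))
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # pool.map returns results in the same order as the input pairs.
        return list(pool.map(lambda pair: analyse(*pair), pairs))


results = run_all(["lambda", "rds"], ["us-east-1", "us-east-2"], max_threads=2)
print(results)
```

Capping the pool size limits how many API calls are in flight at once, and a pool size of 1 processes the pairs one after another.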

Conclusion

In closing, the importance of leveraging the Fault Tolerance Analyser Tool cannot be overstated. As we navigate through the ever-evolving landscape of cloud computing, ensuring the fault tolerance of your workloads running on AWS has become an absolute necessity. The tool simplifies and automates the process of reviewing your resources against fault-tolerant configurations, facilitating more robust and reliable applications in the cloud.

In conjunction with existing AWS services like the AWS Well-Architected Tool, AWS Trusted Advisor, and AWS Resilience Hub, the Fault Tolerance Analyser Tool further enriches your toolset for achieving your fault tolerance objectives. It brings added flexibility to the management of your goals.

Different teams can use this tool in a manner that suits their needs. For example, we recommend that central teams run the Fault Tolerance Analyser Tool on an automated, scheduled basis. Regular checks with this tool, potentially leveraging services like AWS CodeBuild or Amazon Elastic Container Service (Amazon ECS) to run the tool, can help proactively identify potential fault tolerance issues. Infrastructure teams can also incorporate this tool into their CI/CD pipelines to make sure that any components they build adhere to basic fault tolerance best practices. Change management teams can use this tool as part of their gatekeeping process.
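One possible shape for such a pipeline gate is sketched below. It assumes the findings have already been loaded as dictionaries with the service, region, and potential_issue fields shown earlier in this post. The function name and the ignore list for reviewed false positives are hypothetical and not features of the tool.

```python
from typing import AbstractSet, Iterable, Mapping, Tuple


def gate_exit_code(
    findings: Iterable[Mapping],
    ignored: AbstractSet[Tuple[str, str]] = frozenset(),
) -> int:
    """Return 1 if any non-ignored finding is a potential issue, else 0."""
    for finding in findings:
        key = (finding["service"], finding["region"])
        if finding["potential_issue"] and key not in ignored:
            return 1
    return 0


# Made-up findings for illustration.
findings = [
    {"service": "lambda", "region": "us-east-1", "potential_issue": False},
    {"service": "rds", "region": "us-east-1", "potential_issue": True},
]

# Without exceptions, the flagged rds finding fails the gate.
print(gate_exit_code(findings))
# After a team has reviewed and accepted that finding, the gate passes.
print(gate_exit_code(findings, ignored={("rds", "us-east-1")}))
```

In a pipeline step, the returned value could be passed to sys.exit so that a non-zero code stops the deployment.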

Since the Fault Tolerance Analyser Tool is an open-source project on GitHub, we enthusiastically invite and encourage users to contribute. Whether it’s reporting issues or creating pull requests, your active involvement can enrich the tool, making it more versatile and effective. By working collaboratively, we can continue to enhance and refine the tool, helping us all achieve our goals of improved fault tolerance in the AWS Cloud. Let’s work together in fostering more resilient architectures and shaping a more reliable digital future.

About the authors:

Samuel Baruffi

Samuel Baruffi is a seasoned technology professional with over 15 years of experience in the Information Technology industry. Currently, he works at Amazon Web Services (AWS) as a Senior Solutions Architect, providing valuable support to Global Financial Services organizations. His vast expertise in cloud-based solutions is validated by numerous industry certifications. Away from cloud architecture, Samuel enjoys soccer, tennis, and travel.

Madhav Vishnubhatta

Madhav is a Technical Account Manager at AWS.