AWS Partner Network (APN) Blog

Improving System Resilience and Observability: Chaos Engineering with AWS FIS and AWS DLT

By Jyothi Goudar, Manager, Partner Solutions Architect – AWS
By Kavin Arvind, Performance Engineering Architect – Cognizant

For modern enterprises, business continuity remains a high priority. Unforeseen events like the Covid-19 pandemic underscore the importance of critical systems that target 100% availability. Hence, it’s important to anticipate platform behavior at scale throughout the development lifecycle.

By automating performance testing and including chaos testing, organizations can identify failure scenarios in systems before they develop and cause downtime. This improves the system’s reliability, resilience, and stakeholder confidence.

The Distributed Load Testing on AWS (DLT) solution automates performance testing of your applications at scale, which can help identify bottlenecks before releasing your application to production.

DLT supports scheduled and concurrent tests and handles increased numbers of simultaneous users who can generate many requests per second. With DLT, you can simulate thousands of users connecting to your application, so you can better understand your application performance profile.

AWS Fault Injection Simulator (AWS FIS) is an AWS managed service for performing controlled chaos engineering experiments on Amazon Web Services (AWS) resources. By simulating faults such as server failures, network latency, and resource exhaustion, FIS helps teams identify weaknesses in applications and infrastructure by subjecting them to disruptive conditions in a controlled way.

FIS streamlines chaos testing, allowing teams to proactively address potential issues before they impact production environments. This proactive approach to resilience testing helps ensure applications and infrastructure can withstand real-world catastrophic failures.

Chaos engineering is a practice of intentionally injecting faults into a system to test its resilience. The goal is to build confidence in the system’s capability to withstand failures by identifying failure points and correcting them before they cause an actual outage and disrupt business.

This post discusses how to use AWS FIS and DLT on AWS to improve system resilience. Cognizant, an AWS Premier Tier Services Partner and Managed Service Provider (MSP), is deeply invested in understanding, innovating, and deploying the latest technological advancements. The Quality Engineering and Assurance (QEA) business unit at Cognizant ensures that systems and applications not only meet but exceed performance, scalability, and reliability expectations.

Solution Overview

The following diagram illustrates the architecture of this solution for a sample three-tier application being tested.

Figure 1 – AWS architecture diagram of a sample three-tier application.

  • DLT on AWS for load simulation: Set up DLT to simulate realistic traffic patterns and load on your application services using JMeter scripts. The necessary distributed load testing infrastructure is provisioned automatically after deploying the DLT solution, including an Amazon Simple Storage Service (Amazon S3) bucket, Amazon DynamoDB table, and AWS Fargate cluster. The included DLT web user interface (UI) allows for upload of load testing scripts and test scenarios.
  • AWS FIS for fault simulation: Use AWS FIS to inject controlled faults into your infrastructure while it is under the load generated by DLT. This lets you evaluate resiliency aspects such as mean time to recovery (MTTR), auto scaling and load balancing behavior, inter-service dependencies, fault tolerance, database resilience, system alerts, and monitoring health checks.
  • Amazon Managed Grafana for monitoring and visualization: Configure and set up Amazon Managed Grafana to use both InfluxDB and Amazon CloudWatch as data sources. This enables real-time tracking and visualization of key performance indicators and system health metrics throughout your FIS and DLT experiments.

Prerequisites

Load testing is usually run in a test environment so that end users in production are not affected. Load tests are typically run on smaller but comparable setups, from which you can draw conclusions about application behavior in production.

You’re also responsible for the cost of the AWS services used while running the DLT experiments. The total cost for running this solution depends on the number of load tests run, duration of those load tests, and amount of data used as a part of the tests. For more information related to cost, see the AWS documentation.

  • JMeter scripts that exercise the application should be created and validated beforehand, so DLT can run them at scale.
  • DLT should be deployed using the predefined AWS CloudFormation template.

Step 1: Simulate Traffic Patterns in the Application

Using the DLT console, upload the JMeter script that generates the load for performance testing. In the DLT framework, JMeter runs inside Docker containers that are managed by Amazon Elastic Container Service (Amazon ECS).

Specifically, DLT uses the Fargate launch type, which lets you run containers without having to manage the underlying Amazon Elastic Compute Cloud (Amazon EC2) instances. When a load test is started, ECS schedules and runs these JMeter containers as tasks on Fargate. You can configure the number of concurrent users, number of Fargate tasks, and test duration in the console.
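As a rough illustration of how these console settings relate, the sketch below assumes every Fargate task runs the same number of virtual users, so the total simulated load is the task count multiplied by the per-task concurrency. The `LoadProfile` class and `tasks_needed` helper are hypothetical names used only for this example; they are not part of DLT.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadProfile:
    """Hypothetical container for the three settings chosen in the DLT console."""
    task_count: int        # number of Fargate tasks
    users_per_task: int    # concurrent virtual users per task (assumed equal across tasks)
    duration_minutes: int  # how long the load is held

    def total_users(self) -> int:
        return self.task_count * self.users_per_task


def tasks_needed(target_users: int, users_per_task: int) -> int:
    """Smallest task count that reaches target_users (ceiling division)."""
    if target_users <= 0 or users_per_task <= 0:
        raise ValueError("target_users and users_per_task must be positive")
    return -(-target_users // users_per_task)


# Example: size a test for 10,000 simulated users at 200 users per task.
profile = LoadProfile(
    task_count=tasks_needed(10_000, 200),
    users_per_task=200,
    duration_minutes=30,
)
print(profile.task_count, profile.total_users())
```

Sizing the task count this way before a run can help keep the requested load within whatever Fargate limits apply to your account.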

Figure 2 – AWS Distributed Load Testing console.

AWS DLT incorporates CloudWatch monitoring, providing a comprehensive overview of ongoing tests. For an in-depth analysis of the system being tested, integrating InfluxDB with JMeter enables real-time reporting and examination of results.

InfluxDB, a time series database, is leveraged for the storage of performance metrics. To facilitate this, we’ll provision an EC2 instance and install InfluxDB on it.

Step 2: Introduce Faults in the Application

Create AWS Fault Injection Simulator experiment templates for the desired fault scenarios. These scenarios can include randomly terminating EC2 instances within a specific zone, inducing high CPU usage across multiple instances, introducing network latency, and many more.

In an FIS experiment template, under Actions, click Add action. Choose an action type from the Action type drop-down menu. Depending on the selected action type, you may need to fill in additional parameters. For example, if you select Terminate instances for the action type, you may need to specify the instances filter criteria and number of instances in the Targets section.

Figure 3 – AWS FIS experiment template.

Next, click Add target. Define the resources (such as EC2 instances or ECS services) the action will affect. You can select resources using resource tags or manually enter resource IDs.

Figure 4 – Experiment template target selection.
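The same action and target choices can also be expressed programmatically. The following is a minimal sketch, not the authors’ template: it builds the keyword arguments for the FIS CreateExperimentTemplate API for a terminate-instances scenario. The role ARN, tag key and value, and the target and action names (`taggedInstances`, `terminateInstances`) are placeholders chosen for illustration, and the sketch uses no stop condition, which you would likely replace with a CloudWatch alarm outside of a sandbox.

```python
import json
import uuid


def build_terminate_template(role_arn: str, tag_key: str, tag_value: str,
                             instance_count: int = 1) -> dict:
    """Build keyword arguments for the FIS CreateExperimentTemplate API."""
    return {
        "clientToken": str(uuid.uuid4()),
        "description": "Terminate tagged EC2 instances while DLT load is running",
        "roleArn": role_arn,
        # Assumption: no automatic stop condition in this sketch.
        "stopConditions": [{"source": "none"}],
        "targets": {
            "taggedInstances": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "filters": [{"path": "State.Name", "values": ["running"]}],
                "selectionMode": f"COUNT({instance_count})",
            }
        },
        "actions": {
            "terminateInstances": {
                "actionId": "aws:ec2:terminate-instances",
                # Links the action to the target defined above by name.
                "targets": {"Instances": "taggedInstances"},
            }
        },
        "tags": {"Purpose": "chaos-test"},
    }


def create_template(template: dict) -> str:
    """Submit the template with boto3 (needs boto3 and AWS credentials)."""
    import boto3
    response = boto3.client("fis").create_experiment_template(**template)
    return response["experimentTemplate"]["id"]


template = build_terminate_template(
    role_arn="arn:aws:iam::123456789012:role/example-fis-role",  # placeholder ARN
    tag_key="Environment",
    tag_value="test",
)
print(json.dumps(template["targets"], indent=2))
```

Selecting targets by tag rather than by instance ID keeps the template reusable as instances are replaced between runs.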

Step 3: Set Up EC2 Instance and Grafana for Observability

On your Amazon EC2 instance, download and install InfluxDB using its respective Linux or Windows installation packages. Be sure to follow security best practices for InfluxDB installation.

After installation, start InfluxDB and create a new database for your data. Ensure your EC2 instance’s security group rules allow inbound traffic on the necessary ports (by default, InfluxDB uses port 8086).
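One way to keep the InfluxDB port from being opened too broadly is to generate the security group rule from a known source range. The sketch below builds a single inbound permission in the shape accepted by the EC2 AuthorizeSecurityGroupIngress API. The `influxdb_ingress_rule` helper and the example CIDR range are assumptions for illustration only.

```python
import ipaddress


def influxdb_ingress_rule(source_cidr: str, port: int = 8086) -> dict:
    """Return one IpPermissions entry allowing TCP traffic to InfluxDB."""
    # IPv4Network raises ValueError for malformed input.
    network = ipaddress.IPv4Network(source_cidr)
    if network.prefixlen == 0:
        raise ValueError("refusing to open InfluxDB to every IPv4 address")
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "IpRanges": [
            {"CidrIp": str(network), "Description": "JMeter load generators to InfluxDB"}
        ],
    }


# Example: allow only a (hypothetical) VPC range used by the load generators.
rule = influxdb_ingress_rule("10.0.0.0/16")
print(rule)
```

With boto3, a rule like this could be passed as `IpPermissions=[rule]` to `authorize_security_group_ingress` together with the `GroupId` of the instance’s security group.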

In your JMeter test plan, add a Backend Listener by right-clicking on the test plan and selecting Add > Listener > Backend Listener. Set the Backend Listener implementation to ‘org.apache.jmeter.visualizers.backend.influxdb.InfluxdbBackendListenerClient’ and configure the parameters to point to your InfluxDB instance.

Figure 5 – JMeter InfluxDB Backend Listener configuration example.
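For reference, the Backend Listener parameters can be written down as a simple name-to-value map. This sketch assumes an InfluxDB 1.x style write endpoint (`/write?db=...`) and uses a placeholder host, database name, and application name; the parameter names are those exposed by JMeter’s InfluxDB backend listener, but the values are illustrative rather than the authors’ configuration.

```python
from urllib.parse import urlencode


def backend_listener_params(influx_host: str, database: str, application: str,
                            port: int = 8086) -> dict:
    """Return Backend Listener parameter names and values as strings."""
    # Assumption: InfluxDB 1.x HTTP write endpoint.
    write_url = f"http://{influx_host}:{port}/write?" + urlencode({"db": database})
    return {
        "influxdbMetricsSender":
            "org.apache.jmeter.visualizers.backend.influxdb.HttpMetricsSender",
        "influxdbUrl": write_url,
        "application": application,
        "measurement": "jmeter",
        "summaryOnly": "false",    # send per-sampler metrics, not only totals
        "samplersRegex": ".*",     # include every sampler
        "percentiles": "90;95;99",
        "testTitle": "DLT load with FIS fault injection",
    }


params = backend_listener_params("10.0.1.25", "jmeter", "three-tier-app")
for name, value in params.items():
    print(f"{name} = {value}")
```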

Next, navigate to the Amazon Managed Grafana console and choose Create workspace. Fill in the required details, and then select Create workspace to finish. You will configure InfluxDB and Amazon CloudWatch as data sources in this workspace in the following steps.

Step 4: Set Up Grafana Dashboard

In your Grafana workspace, navigate to Configuration > Data Sources > Add data source. Choose InfluxDB from the list, and in the settings fill in the URL of your InfluxDB instance and database name. Save and test the configuration to ensure Grafana can connect to InfluxDB.
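If you prefer to script this step, Grafana’s HTTP API accepts a JSON description of a data source. The sketch below only builds that JSON body. It assumes an InfluxDB 1.x (InfluxQL) setup, a placeholder URL and database name, and it sets the database in both the top-level `database` field and `jsonData.dbName` because Grafana versions differ in which one they read.

```python
import json


def influxdb_datasource_payload(name: str, influx_url: str, database: str) -> dict:
    """Build a request body for Grafana's POST /api/datasources endpoint."""
    return {
        "name": name,
        "type": "influxdb",
        "access": "proxy",     # Grafana's backend makes the requests
        "url": influx_url,
        "database": database,  # read by older Grafana versions
        "jsonData": {"dbName": database},  # read by newer Grafana versions
    }


payload = influxdb_datasource_payload(
    name="JMeter InfluxDB",
    influx_url="http://10.0.1.25:8086",  # placeholder InfluxDB address
    database="jmeter",
)
print(json.dumps(payload, indent=2))
```

The body would be sent to the workspace URL with an API key or service account token that has permission to manage data sources.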

To import a Grafana dashboard, click on the + icon on the left side of the screen and select Import. In the Import via grafana.com field, enter the dashboard ID (5496 for the JMeter dashboard), and then click Load.

From the drop-down menu, choose the InfluxDB data source you previously configured for JMeter results, and then click Import. The JMeter dashboard should now appear in your Grafana instance.

Figure 6 – Sample load testing Grafana dashboard.

Step 5: Set Up Amazon CloudWatch Dashboard

In your Grafana workspace, navigate again to Configuration > Data Sources > Add data source, and choose Amazon CloudWatch from the list. Fill in the details for your AWS credentials, default region, and any other specific settings. Then click Save & Test to ensure Grafana can connect to CloudWatch.

To view the EC2 metrics dashboard, in the Import via grafana.com field, enter the dashboard ID 11265 for the EC2 dashboard, and then click Load.

To view the Elastic Load Balancing (ELB) metrics dashboard, in the Import via grafana.com field, enter the dashboard ID 16071 for the ELB dashboard, and then click Load.

Figure 7 – Sample CloudWatch Grafana dashboard for ELB.

Step 6: Start the Load and Chaos Tests

Initiate a load test in DLT through the interface for the created scenario. Once the load test is underway, start a controlled FIS experiment (choose a time duration for your testing) to inject chaos into your system, monitoring the application closely through the Grafana dashboards.
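Once the load test is running, the experiment itself can be started and watched from a script. The following is a minimal sketch that assumes a boto3 FIS client is passed in and that the experiment template already exists. The `run_experiment` helper, the polling interval, and the poll limit are illustrative choices, not part of either service.

```python
import time
import uuid

# Experiment states after which FIS makes no further progress.
TERMINAL_STATES = {"completed", "stopped", "failed"}


def run_experiment(fis_client, template_id: str, poll_seconds: float = 15.0,
                   max_polls: int = 240) -> str:
    """Start an FIS experiment and poll until it reaches a terminal state."""
    started = fis_client.start_experiment(
        clientToken=str(uuid.uuid4()),
        experimentTemplateId=template_id,
    )
    experiment_id = started["experiment"]["id"]
    for _ in range(max_polls):
        experiment = fis_client.get_experiment(id=experiment_id)["experiment"]
        status = experiment["state"]["status"]
        if status in TERMINAL_STATES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"experiment {experiment_id} did not finish in time")


# Usage (needs boto3, AWS credentials, and a real template ID):
#   import boto3
#   final_status = run_experiment(boto3.client("fis"), "<experiment-template-id>")
```

Passing the client in as an argument keeps the polling logic easy to exercise against a stub before pointing it at a real account.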

During a chaos test, the observability dashboards set up for JMeter and AWS services serve critical roles in monitoring, visualizing, and analyzing system behavior in real time. You can monitor performance metrics such as response times, failure rate, and hits per second; system metrics such as CPU utilization, memory utilization, and network throughput; and load balancing metrics such as request count per target, HTTP response codes, and latency. Any of these may be affected during chaos testing.

Some detailed metrics to be captured during the tests include:

Resilience Metrics

  • MTTR: This is a measure of how long it takes a system to recover from a failure. The lower the mean time to recovery, the quicker your system can recover from failures, thus demonstrating a high level of resilience.
  • Auto scaling turnaround time: This metric measures the time it takes for your AWS auto scaling policy to spin up new instances/tasks/pods in response to increased demand or instance failure. The quicker your system can auto scale, the more resilient it is to sudden changes in load or failure of individual instances.
  • Performance SLA/SLO: This could refer to several metrics related to system performance, such as response time, throughput, or latency. Example: “System latency should not exceed 200 milliseconds under a CPU stress chaos experiment.”
  • Fault tolerance SLA/SLO: This defines the number of errors a system can handle before experiencing significant service degradation. Example: “The system should withstand the failure rate of 1% without significant impact on service under a chaos experiment.”
  • Capacity SLA/SLO: This defines the load the system should handle. Example: “The system should handle 10,000 simultaneous users under a zone outage.”
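To make the MTTR and latency SLO definitions above concrete, here is a small worked sketch. It assumes you have recorded, for each experiment, when the fault was injected and when the system was healthy again; the timestamps and function names are made up for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average of (healthy_again_at - fault_injected_at) over all incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)


def meets_latency_slo(observed_ms: float, limit_ms: float = 200.0) -> bool:
    """True when observed latency does not exceed the SLO limit."""
    return observed_ms <= limit_ms


# Two hypothetical experiments: recovery took 90 s and 150 s.
incidents = [
    (datetime(2024, 1, 1, 10, 0, 0), datetime(2024, 1, 1, 10, 1, 30)),
    (datetime(2024, 1, 1, 11, 0, 0), datetime(2024, 1, 1, 11, 2, 30)),
]
mttr = mean_time_to_recovery(incidents)
print(mttr.total_seconds())     # 120.0
print(meets_latency_slo(185.0)) # True
```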

Infrastructure Metrics

  • CPU utilization: Percentage of total CPU capacity utilized.
  • Memory usage: Percentage of total memory capacity utilized.
  • Disk I/O: Read and write operations performed on your disk.
  • Network I/O: Amount of data sent and received over the network.
  • Instance status checks: Status of the instances in terms of system reachability and instance reachability.
  • Load balancer metrics: Including request count, HTTP response codes (2xx, 3xx, 4xx, 5xx) and average latency.
  • Auto scaling metrics: Including number of instances, average CPU utilization of the auto scaling group, and any scaling activities.

Application Metrics

  • Response time: Time it takes for a request to be processed by your application.
  • Error rate: Percentage of requests that result in errors.
  • Throughput: Number of requests processed per unit time by your application.
  • Active threads: Number of concurrent users or requests.
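The application metrics above can be derived from raw request samples. The sketch below is one possible calculation, assuming each sample records an elapsed time and a success flag, and using the nearest-rank method for the 95th percentile. The `Sample` type and the generated data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    elapsed_ms: float  # response time of one request
    success: bool      # False when the request ended in an error


def summarize(samples: list[Sample], window_seconds: float) -> dict:
    """Compute error rate, throughput, average, and p95 response time."""
    if not samples or window_seconds <= 0:
        raise ValueError("need at least one sample and a positive window")
    count = len(samples)
    errors = sum(1 for s in samples if not s.success)
    ordered = sorted(s.elapsed_ms for s in samples)
    # Nearest-rank percentile: rank = ceil(0.95 * count), using integer math.
    rank = -(-95 * count // 100)
    return {
        "error_rate_pct": 100.0 * errors / count,
        "throughput_rps": count / window_seconds,
        "avg_ms": sum(ordered) / count,
        "p95_ms": ordered[rank - 1],
    }


# 20 made-up samples over a 10-second window; the slowest one failed.
samples = [Sample(elapsed_ms=10.0 * i, success=(i != 20)) for i in range(1, 21)]
summary = summarize(samples, window_seconds=10)
print(summary)
```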

Figure 8 – Grafana dashboard under load and with fault simulation.

Cleanup

Follow the below steps to clean up resources after the testing is complete so you don’t incur additional charges:

  • To uninstall the DLT solution, sign in to the AWS CloudFormation console and on the Stacks page, select the solution’s installation stack and choose Delete.
  • Navigate to the Amazon EC2 console, select the EC2 instance that’s running InfluxDB, and from the Instance state menu, choose Terminate instance.
  • Navigate to the Amazon Managed Grafana console and choose the workspace you created. Delete the workspace.
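The three cleanup steps can also be scripted. This is a minimal sketch that assumes boto3 clients for CloudFormation, EC2, and Amazon Managed Grafana are passed in; the stack name, instance ID, and workspace ID shown in the usage comment are placeholders you would replace with your own values.

```python
def cleanup(cloudformation, ec2, grafana, stack_name: str,
            instance_id: str, workspace_id: str) -> None:
    """Delete the DLT stack, the InfluxDB instance, and the Grafana workspace."""
    # 1. Remove the DLT solution by deleting its CloudFormation stack.
    cloudformation.delete_stack(StackName=stack_name)
    # 2. Terminate the EC2 instance that hosts InfluxDB.
    ec2.terminate_instances(InstanceIds=[instance_id])
    # 3. Delete the Amazon Managed Grafana workspace.
    grafana.delete_workspace(workspaceId=workspace_id)


# Usage (needs boto3 and AWS credentials; all identifiers are placeholders):
#   import boto3
#   cleanup(
#       boto3.client("cloudformation"),
#       boto3.client("ec2"),
#       boto3.client("grafana"),
#       stack_name="<dlt-stack-name>",
#       instance_id="<influxdb-instance-id>",
#       workspace_id="<grafana-workspace-id>",
#   )
```

These calls are destructive and not reversible, so it is worth double-checking each identifier before running them.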

Conclusion

In this post, we covered chaos engineering with AWS Fault Injection Simulator (FIS) and Distributed Load Testing (DLT) on AWS: setting up both services, simulating realistic traffic with DLT, and running controlled fault injection experiments with FIS to improve resilience and performance.

We also touched on monitoring application and infrastructure metrics with Amazon Managed Grafana by fetching data from JMeter InfluxDB and Amazon CloudWatch.

By combining the power of AWS FIS and DLT, organizations can effectively perform comprehensive resilience testing and continuously validate their systems’ robustness. With Amazon Managed Grafana integration, teams gain real-time insights into their application and infrastructure performance, enabling them to pinpoint weaknesses and optimize resource usage.

Scheduling recurring experiments helps ensure systems can withstand real-world failures by continuously validating improvements and proactively addressing potential issues. Ultimately, adopting chaos engineering with AWS FIS, DLT, and Amazon Managed Grafana empowers organizations to build and maintain more resilient, scalable, and reliable systems.

Cognizant – AWS Partner Spotlight

Cognizant is an AWS Premier Tier Services Partner and MSP that transforms customers’ business, operating, and technology models for the digital era by helping organizations envision, build, and run more innovative and efficient businesses.