
Building a fault tolerant architecture with a Bulkhead Pattern on AWS App Mesh

When packaging and deploying APIs as container services, it is common for each service to serve more than one responsibility or many downstream dependencies. In such scenarios, a failure during the execution of one responsibility can spread to the entire application and cause a systemic failure.
Let’s look at an example: imagine an e-commerce application that manages listing prices and exposes a REST API with two main endpoints, both served by the same code running in a single container:

  • GET /price/$id – reads the latest listing price from an in-memory cache – lightweight, short-running requests
  • POST /price – creates or updates a listing price – long-running requests, since users need a guarantee that the listing price was persisted and the cache was updated

The write endpoint is more resource-intensive, having to wait for a database to persist the update and also make sure the cache is purged. If a lot of traffic is sent to the write endpoint at the same time, it can consume, for example, the entire connection pool and memory of the application, clogging requests to the other endpoints as well. That affects not only customers who want to update prices but also customers who only want to get the latest listing price.

To solve this problem, it is necessary to separate specific duties across different resource pools. This article showcases how the bulkhead pattern at the service-mesh level can help achieve that and walks through an implementation on Amazon Elastic Kubernetes Service (Amazon EKS).

The Bulkhead Pattern

The bulkhead pattern gets its name from a naval engineering practice in which ships are built with internal chambers compartmentalizing the hull, so that if a rock cracks it, water cannot spread through the entire ship and sink it. See Figure 1 for a visual illustration.

Figure 1

In software development, the practice focuses on isolating resources and dependencies so that systemic failure is avoided. This makes systems not only more available but also more tolerant of noisy neighbours, since failures are easier to recover from when they are less likely to cascade.

The isolation relies on separating resources into pools, splitting actions based on CPU, memory, network, or any other burstable resource that one action might exhaust at the expense of others.

 

Overview of the solution

The following solution is an example of how the scenario described above can be implemented using AWS App Mesh in combination with Amazon EKS. It is one specific example to illustrate the practice, which can be applied anywhere from serving edge traffic all the way down to individual lines of code, isolating resources to minimize the blast radius of failures.

The focus here is to be pragmatic: rather than installing libraries and changing code in every application that could benefit from this, implementing it at the infrastructure level reduces complexity and eases the transition by avoiding undifferentiated heavy lifting.

To improve the resiliency of the solution, it is possible to use the recently released AWS App Mesh circuit breaker capabilities, which let you define how much traffic your application nodes can handle. That prevents the functionality from being overwhelmed with traffic that is known to cause issues, favoring serving some customers rather than allowing the failure to take down the entire feature.

The described architecture looks like this:

Diagram of an AWS App Mesh level bulkhead isolating resources by routes

Solution

As mentioned before, the solution uses Amazon EKS with AWS App Mesh; regardless of that, it is possible to implement it using other container orchestration tools, such as Amazon Elastic Container Service (Amazon ECS), or even directly on compute instances using Amazon EC2. The service mesh is the infrastructure layer segregating access, whereas the Kubernetes Deployments segregate compute reservations. Once AWS App Mesh and Amazon EKS are set up, follow the steps below:

  • Set up the deployment
  • Test bulkhead failure isolation
  • Configure the circuit breaker
  • Test additional resiliency

Prerequisites

For the solution, it is required to have an Amazon EKS cluster running with AWS App Mesh configured. To configure that setup, the Getting started with AWS App Mesh and Amazon EKS blog is a good place to start.

Here are the prerequisites:

  • An AWS account
  • An Amazon EKS cluster
  • AWS App Mesh configured to work with Amazon EKS
  • A terminal with git, Docker, the AWS CLI, kubectl, and HTTPie (the http command) installed

Set up deployment

Start by deploying the demo application into your cluster. Make sure to configure kubectl to point to the desired cluster and to have permissions for the cluster and the mesh assigned to your default AWS CLI user.

The deployment code will:

  • Build a Docker image for the sample application
  • Create an Amazon ECR repository
  • Push the image to the ECR repository
  • Create an EKS namespace called “bulkhead-pattern”
  • Create and configure the application with two deployments, price-read and price-write
  • Create and configure a mesh called bulkhead-pattern with a virtual gateway, a virtual service, virtual routers, and dedicated virtual nodes (see the route-level sketch after this list)
  • Create a load balancer to expose the virtual gateway through a public URL
  • Create a Vegeta load testing deployment to support failure simulation (Vegeta is the tool used for load testing)
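
For reference, the route-level split that isolates the two endpoints can be expressed with the App Mesh controller CRDs roughly as follows. This is a minimal sketch, not the repository's actual manifests; the resource names, port, and weights are assumptions:

apiVersion: appmesh.k8s.aws/v1beta2
kind: VirtualRouter
metadata:
  name: price-router            # assumed name
  namespace: bulkhead-pattern
spec:
  listeners:
    - portMapping:
        port: 8080              # assumed application port
        protocol: http
  routes:
    # Long-running writes are routed to their own virtual node (and Deployment)
    - name: price-write-route
      httpRoute:
        match:
          prefix: /price
          method: POST
        action:
          weightedTargets:
            - virtualNodeRef:
                name: price-write-node
              weight: 1
    # Lightweight reads get a separate virtual node, so they keep their own resource pool
    - name: price-read-route
      httpRoute:
        match:
          prefix: /price
          method: GET
        action:
          weightedTargets:
            - virtualNodeRef:
                name: price-read-node
              weight: 1

Each virtual node is backed by its own Kubernetes Deployment, so a failure in one route cannot exhaust the other route's resources.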

Cloning the repository and deploying the code

  1. Clone the repo code with the article code:
    git clone https://github.com/aws/aws-app-mesh-examples.git
  2. Enter the article folder:
    cd blogs/eks-bulkhead-pattern-circuit-breaker/
  3. Run the deploy command, replacing the account ID and default Region accordingly:
    AWS_ACCOUNT_ID=? AWS_DEFAULT_REGION=? ./deploy.sh
     
    For example:
     
    AWS_ACCOUNT_ID=111 AWS_DEFAULT_REGION=eu-central-1 ./deploy.sh
  4. Check the deployed resources:
    kubectl -n bulkhead-pattern get deployments
     
    NAME          READY   AVAILABLE
    ingress-gw    1/1     1       
    price-read    1/1     1       
    price-write   1/1     1       
    vegeta        1/1     1    
  5. Get the load balancer endpoint:
    PRICE_SERVICE=$(kubectl -n bulkhead-pattern get services | grep ingress-gw | sed 's/\|/ /' | awk '{print $4}')
  6. Test the API GET endpoint (note that it might take a few minutes for the DNS to take effect):
    http GET $PRICE_SERVICE/price/7
     
    HTTP/1.1 200 OK
    server: envoy
    x-envoy-upstream-service-time: 2
     
    {
        "value": "23.10"
    }
  7. Test the API POST endpoint; it takes around five seconds to respond:
    http POST $PRICE_SERVICE/price
     
    HTTP/1.1 200 OK
    server: envoy
    x-envoy-upstream-service-time: 5001
     
    {
        "status": "created"
    }

To facilitate failure simulation, the price-write deployment sets a container memory limit of 8 MB and has an environment variable called DATABASE_DELAY that simulates network latency in seconds; “5s” is the configured delay.
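
For reference, the relevant part of the price-write Deployment looks roughly like the snippet below. This is a minimal sketch, not the repository's exact manifest; the image reference, labels, and replica count are placeholders, while the memory limit and the DATABASE_DELAY variable come from the description above:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: price-write
  namespace: bulkhead-pattern
spec:
  replicas: 1
  selector:
    matchLabels:
      app: price-write
  template:
    metadata:
      labels:
        app: price-write
    spec:
      containers:
        - name: price-write
          image: <price service image>   # placeholder; built and pushed to ECR by deploy.sh
          env:
            - name: DATABASE_DELAY
              value: "5s"                # simulated database/network latency
          resources:
            limits:
              memory: 8Mi                # intentionally low to make OOM easy to trigger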

Note that the requests are being served by Envoy Proxy from the Virtual Gateway. Also, the second request takes around five seconds to finish, which is the simulated network latency.
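
For completeness, the ingress path from the virtual gateway to the application can be sketched with the App Mesh controller CRDs as shown below; the listener port, labels, and the virtual service name are assumptions, only the ingress-gw name comes from the deployment above:

apiVersion: appmesh.k8s.aws/v1beta2
kind: VirtualGateway
metadata:
  name: ingress-gw
  namespace: bulkhead-pattern
spec:
  podSelector:
    matchLabels:
      app: ingress-gw            # assumed label on the gateway Envoy pods
  listeners:
    - portMapping:
        port: 8088               # assumed gateway listener port
        protocol: http
---
apiVersion: appmesh.k8s.aws/v1beta2
kind: GatewayRoute
metadata:
  name: price-gateway-route      # assumed name
  namespace: bulkhead-pattern
spec:
  httpRoute:
    match:
      prefix: /
    action:
      target:
        virtualService:
          virtualServiceRef:
            name: price-service  # assumed virtual service, served by the virtual router above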

Failure Isolation test

Now that the application is running, it is possible to start simulating failures and checking whether they are restricted to a single endpoint instead of spreading to the whole application.

Breaking it

Let us see how sending enough requests to the endpoints can cause the service to fail. To do that, use the already deployed load testing tool to flood the pods with requests. See:

  1. Get the load testing pod:
    VEGETA_POD=$(kubectl -n bulkhead-pattern get pods | grep vegeta | cut -d' ' -f1)
  2. Start load testing the write endpoint:
    kubectl -n bulkhead-pattern exec $VEGETA_POD -- /bin/sh -c "echo POST http://$PRICE_SERVICE/price | vegeta attack -rate=50 -duration=1m | vegeta report"
  3. While the command above runs, on another tab, you can start checking:
    kubectl -n bulkhead-pattern get pods | grep price-write
    price-write-69c65dfdcc-wfcr4   2/2     Running
    kubectl -n bulkhead-pattern get pods | grep price-write
    price-write-69c65dfdcc-wfcr4   2/2     OOMKilled
    kubectl -n bulkhead-pattern get pods | grep price-write
    price-write-69c65dfdcc-wfcr4   2/2     CrashLoopBackOff

Due to the latency, requests pile up, consuming the limited memory reservation, and eventually the pod crashes with an OOM (Out of Memory) error and stops serving requests altogether. A detailed analysis will follow, but first let us check if the failure has spread.

Checking if failure is isolated
  1. Run a GET request to the price-read endpoint:
    http GET $PRICE_SERVICE/price/7
    HTTP/1.1 200 OK
    server: envoy
    x-envoy-upstream-service-time: 2
     
    {
        "value": "23.10"
    }

Notice that not only is the service still up, but its response time is not impacted by the failures of the price-write endpoint. The failure has been contained to its chamber, since there are individual reservations of both the network (through the service mesh) and the compute (through the Kubernetes Deployment).

Checking the damage
  1. Interrupt the load test tab and see the test results:

Requests      [total]        2700
Latencies     [p99, max]     5.253s, 5.365s
Status Codes  [code:count]   200:359  503:2341
Error Set:
503 Service Unavailable

From all the requests performed during the test, 359 finished successfully with a 200 (OK) status code, while the other 2341 requests returned a 503 Service Unavailable error. If this were a real case, the system would have dropped 2341 price changes and then stopped working completely; from that point on, all requests would fail and their content would be lost, impacting all clients of that API.

Here is a timeline of the load test highlighting latency on successful and error requests:

Notice that some requests are served, then the pod runs out of memory due to the high number of requests combined with the sustained network latency. Requests pile up, occupying the limited memory; Kubernetes tries to restart the pod a few times and then starts backing off (CrashLoopBackOff), leaving the endpoint unhandled. Failures are lengthy and end up making the outage worse for the application and its underlying dependencies.

Even by using a Horizontal Pod Autoscaler or a Cluster Autoscaler, the errors will continue, since they are caused by network latency. Autoscaling Pods or Nodes in this case might even make things worse, since it applies more pressure (more requests) to the already impaired or crashed downstream dependency.
The outcome is a general outage of the endpoint, even though price-read wasn't affected.

To increase the resiliency of this specific endpoint and aim for impaired performance rather than a complete outage, yet another component is necessary.

That’s when the AWS App Mesh circuit breaker enters the picture: it acts as a pressure valve, controlling how much pressure, or how many requests per second, the system is able to handle. Excess requests are rejected in order to avoid a complete breakdown of the system.
Let us see how that works in practice.

Update the code

As AWS App Mesh’s documentation shows, the circuit breaker can be configured through the Virtual Node’s listener connection pool, setting the maximum number of connections and the maximum number of pending requests.
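
Applied through the App Mesh controller, the VirtualNode manifest with that listener configuration looks roughly like the following sketch (the port, labels, and DNS hostname are assumptions; the connection pool values match the output shown in the next step):

apiVersion: appmesh.k8s.aws/v1beta2
kind: VirtualNode
metadata:
  name: price-write-node
  namespace: bulkhead-pattern
spec:
  podSelector:
    matchLabels:
      app: price-write
  listeners:
    - portMapping:
        port: 8080                       # assumed application port
        protocol: http
      # Circuit breaker: cap concurrent connections and queued requests so that
      # overflow is rejected by Envoy instead of piling up in the pod's memory.
      connectionPool:
        http:
          maxConnections: 20
          maxPendingRequests: 5
  serviceDiscovery:
    dns:
      hostname: price-write.bulkhead-pattern.svc.cluster.local   # assumed service name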

Set up the circuit breaker
  1. Run the update script in the root folder of the article:
    ./update.sh
  2. Check if the price-write-node has the correct configuration:
    kubectl -n bulkhead-pattern describe virtualnode price-write-node
  3. The output should include the listener configuration with the connection pool for HTTP:
    Name:         price-write-node
    Namespace:    bulkhead-pattern
    Kind:         VirtualNode
    Spec:
      Listeners:
        Connection Pool:
          Http:
            Max Connections:       20
            Max Pending Requests:  5

See it in action
  1. Restart the deployment to reset the CrashLoopBackOff:
    kubectl -n bulkhead-pattern rollout restart deployment price-write
  2. Wait a few seconds, then start load testing the write endpoint again:
    kubectl -n bulkhead-pattern exec $VEGETA_POD -- /bin/sh -c "echo POST http://$PRICE_SERVICE/price | vegeta attack -rate=20 -duration=35s | vegeta report"
  3. While the command above runs, on another tab, you can start checking:
    kubectl -n bulkhead-pattern get pods | grep price-write
     
    price-write-69c65dfdcc-wfcr4   2/2     Running
  4. This time the pod won’t crash. Leave it running for a few minutes, then see the test results:
     
    Requests      [total]        700
    Latencies     [p99, max]     9.259s, 9.301s
    Status Codes  [code:count]   200:128  503:572
    Error Set:
    503 Service Unavailable

In this example, during the 35-second run, 128 requests updated prices properly for the API's clients. By making performance and limits measurable, it is possible to scale accordingly and protect downstream dependencies from backpressure.

The main difference here is that even though the network latency impaired the pod's performance in serving traffic, it didn't cause a complete outage; the service was still able to continuously serve customers. See the following graph:

Since some requests are held by Envoy (in the pending request pool), the latency is a bit higher; this behavior can be configured as well.
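
For instance, a per-request timeout can be added to the same listener to bound how long a queued request may wait. This is a hedged sketch extending the VirtualNode listener shown earlier, and the 10-second value is only an example:

  listeners:
    - portMapping:
        port: 8080
        protocol: http
      connectionPool:
        http:
          maxConnections: 20
          maxPendingRequests: 5
      # Optional: bound the total time a request may take, including time spent
      # queued in the pending request pool.
      timeout:
        http:
          perRequest:
            unit: ms
            value: 10000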

Notice that failures are now faster, since the circuit breaker opens when the connection pool grows unsustainably. That allows the system to recover and to keep serving a percentage of the traffic continuously, instead of going down entirely. That is useful because the network latency issue could be temporary and resolve itself after a few minutes; the bulkhead combined with the circuit breaker would have protected the database from accumulating a massive backlog of queries, which could otherwise prolong its impairment.

The improvement above is mainly focused on further optimizing the price-write endpoint since, as mentioned before, the price-read endpoint has not failed so far; it was not impacted by the failures of the price-write endpoint.

Moving forward, since requests may now be dropped, it is important that dependent services and clients handle failures, or even take advantage of AWS App Mesh retry policies to further improve the resiliency of requests that eventually fail.
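
As a hedged example, such a retry policy could be attached to the route that targets the price-write node; the field names follow the App Mesh controller CRDs, while the retry count, timeout, and route details are illustrative only:

  routes:
    - name: price-write-route
      httpRoute:
        match:
          prefix: /price
          method: POST
        action:
          weightedTargets:
            - virtualNodeRef:
                name: price-write-node
              weight: 1
        # Retry transient failures (such as the 503s returned when the circuit
        # breaker rejects overflow) a limited number of times before giving up.
        retryPolicy:
          maxRetries: 2
          perRetryTimeout:
            unit: ms
            value: 6000
          httpRetryEvents:
            - server-error
            - gateway-error

Clients should still be prepared to handle requests that fail after all retries are exhausted.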

Cleaning up

To clean up all the resources used during the tests, run:

./cleanup.sh

That will delete the Amazon ECR repository, the service mesh, the Kubernetes namespace, and all underlying attachments.

Conclusion & Next Steps

This article has shown how to use AWS App Mesh to implement the bulkhead pattern at the network layer, avoiding redundant work in your application code and providing a simple yet powerful way of isolating a system's functions in order to avoid generalized outages.

After understanding that the endpoint couldn't handle too many requests at once, the fact that the failure was isolated made it more straightforward to act on: the circuit breaker protects the endpoint from a full outage, favoring a controlled impairment instead.

AWS App Mesh provides a breadth of options to further optimize reliability, such as timeout configuration, outlier detection, and retry-on-failure policies.

To learn more about App Mesh:

Marcelo Boeira

Marcelo Boeira is a Solutions Architect at AWS, working on scaling Startups. Before joining AWS, Marcelo worked as a Site Reliability Engineer, handling infrastructure scalability and operations for almost a decade. Marcelo is passionate about distributed systems and engineering culture.