
Autonomous ML-based detection and identification of root cause for incidents in microservices running on EKS

This blog was co-written with Gavin Cohen, VP of Product at Zebrium.

Overview

If you’ve never experienced the frustration of hunting for root cause through huge volumes of logs, then you’re one of the few lucky ones! The process typically starts by searching for errors around the time of the problem and then scanning for new or rare events that might provide a clue as to what was happening. Depending on what you find, you might then have to look in many different directions, and that’s where you will need to draw on your experience, intuition, and maybe blind luck. Unfortunately, this process is only getting more difficult as apps become more complex and distributed.

Solution Walkthrough

In this blog, we will demonstrate an alternative approach that uses the Zebrium machine learning (ML) platform to automatically find root cause in logs generated by an application deployed in Amazon EKS.

The blog will cover:

  • Installing the Sock Shop microservices demo app in an EKS cluster
  • Installing Zebrium log collector
  • Breaking the demo app (using a chaos engineering tool) and verifying that the Zebrium platform automatically finds the root cause

Prerequisites

  1. An active AWS account.
  2. AWS CLI configured with an IAM user that has administrator permissions (or all of the permissions needed to complete this setup). You can verify your credentials with the command shown after this list.
  3. A free Zebrium trial account.
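As a quick sanity check (optional, and not part of the original prerequisites), you can confirm that the AWS CLI is configured with the intended IAM identity:

# Optional: verify the AWS CLI is using the intended IAM identity

aws sts get-caller-identity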

The entire process outlined in this blog will take approximately 30–40 minutes.

Step 1. Create and configure an EKS cluster

We will assume that you already have an Amazon EKS cluster up and running. If not, you can follow this link to get started with Amazon EKS cluster setup.
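If you still need a cluster and prefer the command line, the following is a minimal sketch using eksctl; the cluster name, region, and node count are example values, so adjust them for your environment:

# Example only: create a small EKS cluster with eksctl (adjust name, region, and node count)

eksctl create cluster --name zebrium-demo --region us-east-1 --nodes 3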

Step 2. Create a Zebrium account and install the log collector

You’ll need a free Zebrium trial account (sign up here). Create a new account, set your password and then advance to the Send Logs page.

Important: Do not install the log collector yet as we’re going to modify the install command!

1. Copy the Helm command from the Zebrium Send Logs page.

Important: You can just delete this part of the line:

zebrium.timezone=KUBERNETES_HOST_TIMEZONE,zebrium.deployment=YOUR_DEPLOYMENT_NAME

See the example below (make sure to replace XXXX with your actual token).

# Install the Zebrium log collector. Copy the command from your browser and make the changes above. Make sure you use the correct token and URL

helm upgrade -i zlog-collector zlog-collector --namespace zebrium --create-namespace --repo https://raw.githubusercontent.com/zebrium/ze-kubernetes-collector/master/charts --set zebrium.collectorUrl=https://cloud-ingest.zebrium.com,zebrium.authToken=XXXX

2. The Zebrium UI should detect that logs are being sent within a minute or two. You’ll see a message like this:

We’re now done with the Zebrium installation and setup. The machine learning will begin structuring and learning the patterns in the logs from your newly created K8s environment.
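If you also want to verify the collector from the cluster side (an optional check, not part of the original instructions), the Helm command above installs it into the zebrium namespace:

# Optional: confirm the log collector pods are running

kubectl get pods -n zebrium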

Step 3. Install and fire up the Sock Shop demo app

Sock Shop is a good demo microservices application because it simulates the key user-facing components of an e-commerce website. It is built using Spring Boot, Go kit, and Node.js and is packaged in Docker containers. Visit this GitHub page to learn more about the application design.

A little bit later, we’re going to install and use the Litmus Chaos Engine to “break” the Sock Shop application. So, we are going to install Sock Shop using a YAML config file that contains annotations for the Litmus Chaos Engine.

# Install Sock Shop from yaml. This version of the yaml has Litmus chaos annotations.
 
kubectl create -f https://raw.githubusercontent.com/zebrium/zebrium-sockshop-demo/main/sock-shop-litmus-chaos.yaml

Wait until all the pods are in a running state (this can take a few minutes):

# Check if everything has started - this takes a few minutes. Keep checking and don't move on until all pods are in a running state

kubectl get pods -n sock-shop

NAME                            READY   STATUS    RESTARTS   AGE
carts-5b79655688-rvlsc          1/1     Running   0          66s
carts-db-98f655766-fd8ks        1/1     Running   0          67s
catalogue-7f567b89ff-92xnq      1/1     Running   0          81s
catalogue-db-64bbcd4bf-tjr5j    1/1     Running   0          81s
front-end-7db75779b6-wjb94      1/1     Running   0          80s
orders-7f966b7dcc-wbb4r         1/1     Running   0          80s
orders-db-7c64447b77-8j7v6      1/1     Running   0          80s
payment-845b744cdf-5wqkb        1/1     Running   0          80s
queue-master-798897f97b-svpxk   1/1     Running   0          80s
rabbitmq-66bf858977-wg8p5       1/1     Running   0          79s
shipping-77cd4d7cf5-4mmz6       1/1     Running   0          79s
user-5f858cb484-bgrnd           1/1     Running   0          79s
user-db-57d4bd5d56-68h86        1/1     Running   0          79s
user-load-59b58b868f-2nz57      1/1     Running   0          78s

When all the services are running, you can bring up the app in your browser. To do that, get the front-end pod name and set up port forwarding by running the commands below in a separate shell window.


#Get pod name of front-end

kubectl get pods -n sock-shop | grep front-end
front-end-77c5479795-f7wfk      1/1     Running   1          20h

#Run port forward in a new shell window
#Use pod name from the above command in place of XXX’s


kubectl port-forward front-end-XXXX-XXXX 8079:8079 -n sock-shop

Forwarding from 127.0.0.1:8079 -> 8079
Forwarding from [::1]:8079 -> 8079
Handling connection for 8079



#Open address:port in browser (127.0.0.1:8079 in example above) 

Now open the ip_address:port from above (in this case: 127.0.0.1:8079) in a new browser tab. You should now be able to interact with the Sock Shop app in your browser and verify that it’s working correctly.
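As a side note (not part of the original steps), kubectl can also forward to the deployment directly, which avoids looking up and substituting the pod name:

# Alternative: forward to the front-end deployment instead of a specific pod

kubectl port-forward deploy/front-end 8079:8079 -n sock-shop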

You can also go to CloudWatch in the AWS Console and visit the Resources page under Container Insights to verify that everything looks healthy. Details on how to do this can be found at the Amazon CloudWatch User Guide.
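If you prefer the CLI over the console, and assuming Container Insights is already enabled for your cluster, you can quickly confirm that metrics are flowing by listing them (the region shown is an example):

# Optional: list Container Insights metrics (assumes Container Insights is enabled on the cluster)

aws cloudwatch list-metrics --namespace ContainerInsights --region us-east-1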

Step 4. Install the Litmus Chaos Engine

We’re going to use the open-source Litmus chaos tool to run a chaos experiment that “breaks” the Sock Shop app. Install the required Litmus components using the following commands:

# Install Litmus and create an appropriate RBAC role for the pod-network-corruption test


helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/

helm upgrade -i litmus litmuschaos/litmus-core -n litmus --create-namespace

kubectl apply -f "https://hub.litmuschaos.io/api/chaos/1.13.6?file=charts/generic/experiments.yaml" -n sock-shop

# Set up a service account with the appropriate RBAC to run the network corruption experiment

kubectl apply -f https://raw.githubusercontent.com/zebrium/zebrium-sockshop-demo/main/pod-network-corruption-rbac.yaml

# Make a note of the time

date
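To confirm that the experiment definitions were installed (an optional check, not in the original walkthrough), you can list the ChaosExperiment resources; pod-network-corruption should appear in the output:

# Optional: confirm the Litmus experiment definitions exist in the sock-shop namespace

kubectl get chaosexperiments -n sock-shop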

Step 5. Do something else for two hours!

This is a new EKS cluster and a new app and Zebrium account, so it’s important to give the machine learning a bit of time to learn the normal log patterns. We recommend waiting at least two hours before proceeding (you can wait longer if you like).

You can use this time to explore the Zebrium UI.

On the REPORTING page, you should see at least one sample root cause report.

Select it and explore how to interact with Root Cause Reports! Make sure you try “peeking” and zooming into Related Events.

You might also see other real root cause reports. If you do, they are likely due to anomalous patterns in the logs that occur during the bring-up of the new environment (Zebrium sees them as being anomalous because it doesn’t have much history to go on at this stage). Again, feel free to poke around.

Step 6. Run a network corruption chaos experiment to break the Sock Shop app

Now that the ML has had a chance to get a baseline of the logs, we’re going to break the environment by running a Litmus network corruption chaos experiment. If you’re interested, you can read about the details of what the experiment does.

The following command will start the network corruption experiment:

# Run the pod-network-corruption test
kubectl apply -f https://raw.githubusercontent.com/zebrium/zebrium-sockshop-demo/main/pod-network-corruption-chaos.yaml

# Make a note of the date
date

It will take a minute or so for the experiment to start. You can tell that it’s running when the pod-network-corruption-helper goes into a Running state. Monitor its progress with kubectl (the -w option watches for changes, so hit ^C once you see that everything is running):

# Watch the output until pod-network-helper goes into Running state
kubectl get pods -n sock-shop -w
....

sock-chaos-runner                      0/1     Pending             0          0s
sock-chaos-runner                      0/1     Pending             0          0s
sock-chaos-runner                      0/1     ContainerCreating   0          0s
sock-chaos-runner                      1/1     Running             0          13s
pod-network-corruption-qggdmo-m4fpg    0/1     Pending             0          0s
pod-network-corruption-qggdmo-m4fpg    0/1     Pending             0          0s
pod-network-corruption-qggdmo-m4fpg    0/1     ContainerCreating   0          0s
pod-network-corruption-qggdmo-m4fpg    1/1     Running             0          47s
pod-network-corruption-helper-hhopwy   0/1     Pending             0          0s
pod-network-corruption-helper-hhopwy   0/1     ContainerCreating   0          0s
pod-network-corruption-helper-hhopwy   1/1     Running             0          10s
....

# Hit ^C to stop the kubectl command

Once the chaos test has started running, you can go to the Sock Shop UI. You should still be able to navigate around and refresh the app but might notice some operations will fail (for example, if you add anything into the shopping cart, the number of items does not increment correctly). The chaos test will run for two minutes. Wait for it to complete before proceeding to the next step.
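If you want to confirm that the experiment completed and see its verdict (an optional check, not in the original steps), Litmus records the outcome in a ChaosResult resource, typically named <engine>-<experiment>, so in this setup it should be sock-chaos-pod-network-corruption:

# Optional: check the chaos experiment verdict after it completes

kubectl get chaosresults -n sock-shop

kubectl describe chaosresult sock-chaos-pod-network-corruption -n sock-shop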

Step 7. The results and how to interpret them

Please give the machine learning a few minutes to detect the problem (typically 2 to 10 minutes) and then refresh your browser window until you see one or more new root cause reports (UI refresh is NOT automatic).

I tried the above procedure several times and saw really awesome results after each run. However, the actual root cause reports were different across runs. This is because of many factors, including the learning period, what events occurred while learning, the timing and order of the log lines while the experiment was running, other things happening on the system, and so on.

Important: If you don’t see a relevant report, or the report doesn’t show you root cause indicators of the problem, read the section below, “What should I do if there are no relevant root cause reports?”

Below are two examples and explanations of what I saw.

Important: We recommend that you read this section carefully as it will help you interpret the results when you try this on your own.

Results example 1:

1. Reporting page shows the first clue

The Reporting page contains a summary list of all the root cause reports found by the machine learning. After running the chaos experiment, I saw this:

There are three useful parts of the summary:

Plain language NLP summary – This is an experimental feature where we use the GPT-3 language model to construct a summary of the report. In this case, the summary is not perfect, but it gives us some useful context about the problem. We see something about a Chaos Engine and an order not being created in the allotted time.

Log type(s) and host(s) – Here you can see the host and log types (front end, events, orders, and messages) that contain the events for this incident.

One or two log “Hallmark” events – The ML picks out one or two events that it thinks will characterize the problem. In this example, we see a chaos event as well as an exception about an order timeout. Just from this, we get a good sense of the problem.

2. Viewing the report details shows the root cause and symptoms

Clicking on the summary lets you drill down into the details of the root cause report. Initially, you see only the core events. The core events represent the cluster of correlated anomalies that the ML picks out. There are many factors that go into anomaly scoring, but the most important ones are events that are either “rare” or “bad” (events with high severity).

In the example above, there are 13 core events (the count is shown on the left). On the right, you can see the events themselves. I’ve highlighted four that explain the problem.

This one represents the actual root cause: the network corruption experiment is executing:

2021-09-10T17:55:51Z INFO count = 1 type = ADDED name = ExperimentJobCreatepod-network-corruption0e2ebe6b-8bec-4509-b42e-6c8a41d352e7 kind = ChaosEngine namespace = sock-shop name = sock-chaos uid = 0e2ebe6b-8bec-4509-b42e-6c8a41d352e7 reason = ExperimentJobCreate msg = Experiment Job pod-network-corruption-vbfrrz for Chaos Experiment: pod-network-corruption

This kernel event shows a very relevant symptom of the network corruption on eth0:

Sep 10 17:55:55 ip-172-41-13-201 kernel: eth0: Caught tx_queue_len zero misconfig

Then we see a timeout in the orders log type and a front-end error that the order couldn’t be created, which show the upstream impact of the network corruption:

13:56:00.842  orders    DEBUG  java.util.concurrent.TimeoutException: null
13:56:00.843  frontend  ERROR  Order response: {"timestamp":1631296560842,"status":500,"error":"Internal Server Error","exception":"java.lang.IllegalStateException","message":"Unable to create order due to timeout from one of the services.","path":"/orders"}

It’s interesting to note the variance in the structure of these events. Fortunately, the ML automatically learns how to parse, structure, and categorize each different type of event!

Why this is really cool

The root cause report above contains just 13 events that were picked out by the ML out of several hundred thousand events that were generated while the chaos test was running. This is all the more surprising given that there are no pre-built rules in place for this kind of incident and the unsupervised machine learning had very little to go by (just two hours of learning).

Results example 2:

The report produced in example 2 is quite different but still shows us the root cause and symptoms. However, to understand what was happening, I needed an extra click! Let me explain, as this is a useful way of learning how to use the UI.

Here’s the summary, where we see an event that mentions the network corruption experiment (a root cause indicator). But in this case, the experimental NLP summary is not useful:

Clicking on the report displays four core events. These relate to the tail end of the chaos experiment. So again, we get a good sense that the chaos experiment has run:

But this is not really enough, so wouldn’t it be good to see more detail about what was going on? Related Events and Peek provide a way to see more detail around the core events.

Related Events to the rescue!

At the top of the report, you will see the Related Events zoom buttons. The shaded dot shows we are at the core:

Each zoom level tells the machine learning to bring in additional surrounding errors and anomalies that it believes are related. Here I have clicked on the first zoom level:

This took me from four core events to 204 related events (out of hundreds of thousands). Right near the top, I see this:

As before, there is a kernel event indicating an eth0 misconfig. There’s also a MongoDB timeout exception (which makes sense given the network has been impacted).

What should I do if there are no relevant root cause reports?

Although this is just a test environment, there is a lot going on with approximately 40 different log types from about 25 pods, producing over a million log events each hour! The exact timing, order of log events, and types of events produced will vary by run and environment.

If you don’t see something relevant, here are a few things that you can try:

1. Related Events – As described above, you can click into any existing root cause reports and use the Related Events zoom levels to see more detail of errors and anomalies around the core events. This can be very helpful when the core doesn’t show anything relevant.

2. Peek – If you’re inside a root cause report and see an interesting event, you can look at the log stream that generated the event. This is done by mousing over an event and clicking on the Peek button on the right-hand side of the event:

3. Scan for Root Cause – In the Reporting page, there is a green button at the top:

Clicking on this will pop up a time and date picker. You can then select an approximate time of the problem you’re looking to troubleshoot and use this to tell the ML to look for nearby errors and anomalies. The results of the scan will appear as a new root cause report a few minutes later (remember to refresh the screen to see it).

4. Contact Zebrium – There are a few other settings that can be changed that can help to create the best possible root cause reports. Please don’t hesitate to contact Zebrium at hello@zebrium.com for help with this.

Conclusion

The process of troubleshooting and tracking down the root cause of an incident in a distributed application is becoming more and more difficult and time-consuming.

In this blog post, we showed how to induce a failure in the Sock Shop microservices application by running a Litmus network corruption chaos experiment. The Zebrium machine learning technology was able to detect the failure and automatically build a report detailing its root cause.

This experiment shows how you can use machine learning to automatically find a correlated cluster of anomalies that represents a set of log lines that explains the root cause. The technology is designed to work with any kind of application and does not require manual training or large training data sets.

References

https://aws.amazon.com/eks/

https://www.zebrium.com/

Raj Seshadri

Raj Seshadri is a Senior Partner Solutions Architect with AWS and is a member of the containers and blockchain Technical Field Community. Prior to AWS, he had stints at Aqua Security, Red Hat, Docker, Dell and EMC. In his spare time, he plays tennis and enjoys traveling around the world. You can reach him on Twitter @texanraj.