Picture this: you and your team are working around the clock to finish the next major version of your software product. You’re creating new features at a good pace. The team fixes bugs as soon as QA reports them. Unit tests are all green. After the application is greenlit by a more comprehensive suite of tests, it’s time to ship it. And then—boom! As soon as it hits the production environment, the app crashes spectacularly. What went wrong?

As it turns out, the test environments weren’t nearly as close to production as you once thought. Infrastructure changes were made to the environment without any records whatsoever. The result was that the environments slowly drifted apart.

As professionals in the tech industry, a considerable portion of our time is spent on troubleshooting defects. And yes, we spend time fixing them, too—but you can’t fix something when you don’t know what’s wrong. As any software developer who’s spent hours in front of a debugger will tell you, more often than not, the really hard part is to find the bug. As soon as you know what the problem is, fixing it might even be trivial.

So, learning to troubleshoot faster is one of the best investments you can make as a software developer or IT worker in general.

Let’s talk about how we can find the problems fast and fix them more quickly.

Root cause analysis: What it is and why you should care

Root cause analysis (RCA) is a specific technique you can use to troubleshoot problems. With this technique, you analyze the issue at hand using a particular set of steps to identify the primary cause of the problem. RCA is based on the principle that it’s not useful to cater to the symptoms of a problem while ignoring its roots.

By employing RCA, you’ll be able to understand what has occurred. Often, you’re not able to get a complete picture just by observing the symptoms. But determining what happened is just the first step—you then need to go further and unveil the reason why it happened. Equipped with that knowledge, it’s time to put it into practice by formulating a plan or strategy to reduce the probability of it happening again.

With the “what” and “why” covered, here are four tips that will help you use RCA to have fewer issues.

Use the rubber duck approach
Yes, I’m serious, the rubber duck approach. I’m not making this up. It’s also called rubber-duck debugging, and it probably has even more names. It consists of explaining your problem to a rubber duck. Don’t have a rubber duck? Don’t worry! You can use any inanimate object you happen to have at hand. Or you could even talk to a person!

So, what is the rubber duck approach really about? It’s based on the observation that explaining something to someone forces you to order your thoughts. Our thought processes are often chaotic or messy. When we have to explain them to someone, we have no choice but to put them in order. Jeff Atwood, the cofounder of the popular Q&A site Stack Overflow, has described how often software developers tell him that they started writing a new question for the site, figured out the answer themselves in the process, and never actually submitted the question!

Is the rubber duck approach enough to troubleshoot any problem? Not always. Sometimes it is, but often it’s just the first step in a broader strategy.

Are you afraid that people are going to think you’re a bit odd for talking to inanimate objects? Well, the thing is, the whole rubber duck idea is somewhat of a joke. It’s a silly and memorable figure, not meant to be taken too seriously. What matters is that you force yourself to express the thoughts in your head in an orderly manner, explaining the problem at hand as clearly as possible.

You can use the following approaches:
1. Write a Stack Overflow question. Or you can pretend that you’re writing a Stack Overflow question but write it in Notepad instead.
2. File a detailed bug report. Someone probably has to do it anyway, so why not kill two birds with one stone?
3. Walk to your coworker’s cubicle/office and talk to them for a few minutes. That is, as long as they’re OK with it, of course. Don’t disturb your colleagues unnecessarily.

Collect lots of log data (and search through it efficiently)

If you’ve explained the problem clearly but still can’t get to the root issue, then you have to go further. What’s needed now is to gather data about the problem and extract insights from it.

Logging and monitoring can come in handy here—crash logs, application and server logs, and what have you. You have to gather evidence that the problem happened but also, if possible, find out how long it’s been happening and with what frequency.

You can’t stop there, though. Gathering a lot of data is important, but all of that data won’t be of much use if you can’t find the specific bits you need fast enough. Being stuck in a “needle in a haystack” situation is neither fun nor particularly productive.

That’s why you must employ tools that let you search and analyze all the log data you’ve been gathering in real time and turn it into insights that you can use to diagnose and resolve issues faster.
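To make the “how long and how often” question concrete, here’s a minimal sketch in Python. It isn’t a substitute for a real log analytics tool. The sample log lines and the “timestamp, level, message” layout are assumptions made up for illustration, as is the `summarize` helper. It reports when a given message first and last appeared and how many times it showed up per hour.

```python
from collections import Counter
from datetime import datetime

# Hypothetical log lines. The "<ISO timestamp> <LEVEL> <message>" layout
# is an assumption for this sketch, not a standard format.
SAMPLE_LOG = """\
2024-05-01T09:15:02 INFO request served in 120 ms
2024-05-01T09:47:31 ERROR intl extension not loaded
2024-05-01T10:05:10 ERROR intl extension not loaded
2024-05-01T10:20:45 WARN slow query took 2300 ms
2024-05-01T10:59:59 ERROR intl extension not loaded
"""


def summarize(log_text, needle):
    """Return first/last occurrence and per-hour counts of lines containing needle."""
    hits = []
    for line in log_text.splitlines():
        if needle not in line:
            continue
        # The timestamp is everything before the first space.
        timestamp = line.split(" ", 1)[0]
        hits.append(datetime.fromisoformat(timestamp))
    if not hits:
        return None
    per_hour = Counter(ts.strftime("%Y-%m-%d %H:00") for ts in hits)
    return {"first": min(hits), "last": max(hits), "per_hour": dict(per_hour)}


summary = summarize(SAMPLE_LOG, "intl extension")
print(summary)
```

A plain substring match like this is fine for a few megabytes of logs. Once you’re dealing with many servers or weeks of history, an indexed search tool is the more practical choice.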

Employ the five-whys technique
After you’ve gathered information, it’s time to put it to use by identifying causal factors. “Causal factor” here means the immediate cause of the problem at hand. What you shouldn’t do is identify one causal factor and then stop. You have to go further. One of the most well-known techniques for that is the five-whys technique.

The technique consists of asking the question “Why?” iteratively until you get to the root of the problem. Let’s see a quick example:

Problem: The website is showing error 500.
1. Why? Because the web framework’s routing component malfunctioned.
2. Why? Because it requires another component, which itself malfunctioned.
3. Why? Because this component of the web framework requires the intl extension, which isn’t working.
4. Why? Because it was accidentally deactivated after the server software got updated.

As you can see, the number five is just illustrative. It’s possible to get to the root problem with fewer steps. Or you may need even more.
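If you want to keep the chain of answers with your bug report, you could record it as plain data. The following is just one possible sketch, in Python. The `format_chain` helper is a hypothetical name, and the answers are copied from the example above. Note that the list can hold any number of “whys”, not exactly five.

```python
# Each entry answers "Why?" for the entry before it.
problem = "The website is showing error 500."
whys = [
    "The web framework's routing component malfunctioned.",
    "It requires another component, which itself malfunctioned.",
    "That component requires the intl extension, which isn't working.",
    "The extension was accidentally deactivated after the server software got updated.",
]


def format_chain(problem, whys):
    """Render the problem and its chain of answers as an indented report."""
    if not whys:
        raise ValueError("ask 'Why?' at least once")
    lines = [f"Problem: {problem}"]
    for depth, answer in enumerate(whys, start=1):
        # Indent one level deeper for every additional "Why?".
        lines.append(f"{'  ' * depth}Why #{depth}: {answer}")
    # The deepest answer is only the best candidate so far; keep asking if it
    # still has an obvious "Why?" behind it.
    lines.append(f"Root cause candidate: {whys[-1]}")
    return "\n".join(lines)


report = format_chain(problem, whys)
print(report)
```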

The five-whys technique is far from perfect. It’s received its share of criticisms, and it certainly has its limitations. But it can be useful in encouraging engineers to keep searching for the root cause of the issues instead of stopping at the first sign of getting close to a solution.

Get a second pair of eyes
One practice that I’ve come to appreciate in my software developer career is code review. It’s nothing short of amazing how the simple fact of having another, unbiased person take a look at your code can reveal so many issues that you weren’t able to spot before. With time, the sheer expectation of having another person look at your code makes you more conscious of it. You start devoting more attention than you otherwise would.

So, with this point, am I recommending code reviews? Well, yes, but that’s not the only way you can get a second pair of eyes. I’m suggesting you apply review-like processes to pretty much every task an engineer does. Or better yet, pair. Do pair programming, pair server configuration, pair debugging, and pair customer-facing support. In short: do pair troubleshooting.

Science or art?

Defect troubleshooting is at a stage where it’s still more art than science. But that shouldn’t stop you from employing techniques and tools to make it more efficient.

So, use these techniques when doing RCA:
1. Use the rubber duck approach
2. Collect lots of log data and use appropriate tools to search and analyze it
3. Employ the five-whys technique
4. Get a second pair of eyes

It’s time to grab your rubber duck and start analyzing the root cause of your issues.
