Troubleshooting in style

If you ask me what I like to do most in life, I would straightaway answer two things: solving problems and explaining things to others.

I think these two passions are what led me to academia and eventually a career in research. Doing research is a non-stop problem-solving job. And once you solve your problems you need to explain your ‘complicated’ ideas, sometimes to people with a very different background than yours. Of course, these aspects are not only found in academia, but what is beautiful is that this troubleshooting at the university level is really what you are supposed to be doing. It’s part of the job description and you get rewarded for doing it actively.

Troubleshooting, like almost everything else in life, is a process that can get you really frustrated if you don’t deal with it in the proper way. Countless times I have become exhausted and really angry while trying to solve problems, just because I was not working methodically and with a specific structure. As Robert Pirsig says in one of my favourite books ever, Zen and the Art of Motorcycle Maintenance:

Motorcycle maintenance gets frustrating. Angering. Infuriating. That’s what makes it interesting.

So, is there really a specific way of working when troubleshooting things? Well, you will be pleased to know that there is indeed a methodology to troubleshoot, distilled from many hits against the wall. And what is best is that it can be applied to any kind of problem you are facing. Technical or not.

Here’s a short guide to remember when trying to solve any kind of problem.

1. The distance between symptom and problem

It may sound like the most obvious first step, but finding the source of the problem is often neglected. When facing a problem we tend to focus too much on what should be happening and we stop observing what is actually happening. This blocks us from understanding, and ultimately solving, the problem. The very first step of troubleshooting is to detect what is going wrong in the first place. And it is not easy, because things are usually so inter-dependent. We often see manifestations of a problem in parts of the process that might have little to do with the actual problem. It is what I call ‘the distance between symptom and problem’. The longer this distance, the more difficult it is to troubleshoot a situation.

Let me give you an example:

One morning, you wake up, get dressed and get in your car to go to work. You turn the key but nothing happens: you hear the engine turning over a bit, but the car won’t start. You could think that this is the problem, but that’s only true when your final goal is to get to work. When your short-term goal is to get the car going, a non-responding car is just the symptom, not the actual problem.

So you open the hood and start looking at the engine. You ask your wife to turn the key while you watch the engine, and you see that it rotates for half a second but doesn’t start. You notice that the spark plugs aren’t firing, so you decide that they must be somehow broken. You call your boss to say you will be late: this will take a while to fix.

By looking closer at the problem and checking some symptoms (e.g. the engine was rotating a bit, but the spark plugs were not firing), we managed to find what looks like its source. But is that all there is to troubleshooting? Are we so sure that our spark plugs are the real issue?

2. Check your assumptions

When troubleshooting, a state of Zen is necessary for the mind. A hurried decision based on an assumption can cost you a lot of time and sometimes even money.

Let’s get back to our previous example:

After you confidently identified the spark plugs as the source of the problem, you head to the nearest car parts store in your wife’s car and buy a set of four plugs suitable for your car model. You also need to buy the appropriate wrench, but that’s okay. These spark plugs seem to break often enough, so you will need it again soon. You unscrew the four old plugs and start fitting the new ones. After 30 minutes of struggling and sweating, you manage to finish and get ready to go to work. You try to start your car, only to see the exact same behaviour as before…

So what happened? Weren’t the plugs the source of the problem? Why is the car not starting?

Even if this example might seem rather silly, this situation is more common than you would think. We hurriedly jump to a conclusion and assume that we have found the problem. But is what we found maybe just another symptom? Jumping to conclusions might occasionally solve our problems faster, but it can also cause more damage, cost us money (we had to buy new plugs and tools for no reason) and definitely cost us time. And because there is an infinite number of wrong solutions to a problem and only one (or a few) correct ones, chances are that the first solutions we come up with will not be related to the actual problem.

To avoid this situation, we must first identify all our assumptions. One by one. In the previous example, the assumption that we made was that the plugs were not working. And this is an assumption because we are not sure if what we are observing is a symptom or the problem. So once we identify the assumptions, we need to test them against the truth. Are the plugs really broken? Or maybe they just don’t receive enough electrical current? Is the weather too cold for the spark plugs to operate?

Once we have realised all the assumptions we made, we need to start pondering if there are more assumptions that we could make. Assumptions are good to make, they are the way to solve a complicated problem. But you need to recognise them as assumptions and hold action until they are verified or disputed. And your job is to verify them either as true or false. But what do you do when you have a lot of assumptions that are difficult to verify or contradict?

3. Probabilities matter, think simply

Sometimes testing your assumptions might be difficult. In the car example, for instance, you might not have the necessary tools, such as a voltmeter to see if there is electricity reaching the spark plugs, or a method to verify the integrity of the cables.

So how can we test our assumptions?

Occam’s razor: Entities must not be multiplied beyond necessity

In one of the most famous versions of Occam’s razor, it is stated that out of multiple explanations for a specific phenomenon, the simplest one is most likely to be true.

One can never be sure, but chances are that the problems are simple in nature, even if their symptoms are devastating. What I usually try to do is estimate how likely each of the assumptions is to be true. Let’s work with our example.

What could have gone wrong with the car? Here’s a list of options to consider:

  • The plugs are broken
  • The plugs are working, but they don’t receive electricity
  • The plugs and their supply are fine, the problem is somewhere else

Each of these assumptions can lead to further assumptions of things that are broken. For instance, if we assume that there is no electrical supply reaching the plugs, we can make these assumptions as well:

  • The cables are broken
  • The electrical distributor is broken
  • The battery is empty

How probable is it that each of these problems is at hand? Let’s take them one by one.

  • The chances that all four plugs broke at the same time are rather low. Also, spark plugs usually last for a long mileage, so you can fine-tune the probability of this assumption by checking how many kilometres you have driven since you last changed them.
  • The chances that there is something wrong with the electrical supply are considerable. However, we can discard some of the sub-assumptions (e.g. an empty battery is not probable, as the engine rotates when the key is turned). The more sub-assumptions we discard, the less probable this assumption is.
  • The chances that there is something else wrong with the car are high, as there are so many things that could be wrong.

Probability is one aspect to consider, but if we need to narrow down the problem to something specific, we need to dispute some of the assumptions. And what’s better than starting with the easiest assumptions to check?


That means that even though there is a higher probability that something else is wrong with the car, we should check the assumption about the electrical supply first, since it is easier to examine (there are too many options to consider for ‘something else’). Ideally you would start with the assumptions that are both the most probable and the easiest to check. But if you run out of easy and probable things to check, I suggest that you focus on the things that are easier to check, even if they are improbable. This way you are left with fewer options and you can focus better on finding the actual problem.
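The ordering described above can be sketched in a few lines of Python. This is only an illustration: the names Assumption and check_order are hypothetical, and the probabilities and effort scores are made-up guesses for the car example, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assumption:
    name: str
    probability: float  # rough guess that this is the real cause (0..1)
    effort: int         # rough cost of checking it: 1 = trivial, 5 = painful


def check_order(assumptions):
    """Easiest checks first; among equally easy checks, most probable first."""
    return sorted(assumptions, key=lambda a: (a.effort, -a.probability))


# Illustrative guesses only, loosely based on the car example.
candidates = [
    Assumption("all four plugs are broken", 0.05, 3),
    Assumption("battery is empty", 0.02, 1),
    Assumption("cable to the distributor is damaged", 0.30, 2),
    Assumption("distributor is broken", 0.15, 4),
    Assumption("something else is wrong", 0.48, 5),
]

ordered = check_order(candidates)
for a in ordered:
    print(f"effort={a.effort} p={a.probability:.2f}  {a.name}")
```

Note that ‘something else is wrong’ ends up last despite being the most probable guess, simply because it is the most expensive one to examine.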

One important factor in checking your assumptions is knowledge. Identifying and checking your assumptions is something that comes with experience and knowledge of the process you are examining. However, I think it is even more important for beginners to understand that we always start troubleshooting by making some assumptions, no matter how experienced we get. Realising that an assumption is an assumption is one of the most important steps for successful troubleshooting.

4. Try one thing at a time

After fiddling a bit under the hood of your car, you notice that the cable running from the battery to the distributor is badly damaged. You forgot that rodents sometimes chew car cables, especially since you moved into your new house outside the city. Luckily, you have a spare in your trunk, so you replace it. At last, the problem is solved, so you can finally go to work. You get in your car, the engine wobbles a bit and starts running! Except that it makes a very weird periodic noise, and after a while the engine stops again…

What went wrong? Didn’t you find the problem and fix it? Well, it seems that you are suffering from a second problem… this seems to be too much of a coincidence!

Actually, this is the situation you might end up in if you change too many things at the same time in your process. Remember what you did? You changed the spark plugs due to a false assumption that the plugs were broken. But are you sure that the new plugs are working properly? Or that you installed them all correctly? This is why it is important to change only one thing at a time when troubleshooting. The more things you change, the more problems you might be introducing. This makes the symptoms even weirder and more difficult to understand.

So, while troubleshooting, if you change something and it doesn’t work, change it back to its original state! This is rather easy to do when e.g. troubleshooting software, using version control systems, but it is equally important for any kind of process besides software.
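For software, the ‘change one thing, and change it back if it didn’t help’ rule can be sketched like this. The helper try_change and the dictionary standing in for the car are hypothetical names used only for illustration; in a real project a version control system would do the reverting for you.

```python
import copy


def try_change(state, key, new_value, works):
    """Apply a single change; keep it only if works(state) passes, else restore."""
    original = copy.deepcopy(state)
    state[key] = new_value
    if works(state):
        return True
    # The change did not help: put everything back exactly as it was.
    state.clear()
    state.update(original)
    return False


def starts(car):
    # Toy stand-in for "turn the key and see if the engine runs".
    return car["cable"] == "intact"


car = {"plugs": "old", "cable": "chewed"}

kept_plugs = try_change(car, "plugs", "new", starts)     # does not help, reverted
kept_cable = try_change(car, "cable", "intact", starts)  # helps, kept

print(kept_plugs, kept_cable, car)
```

Because the unhelpful plug change is undone before the next attempt, the only difference between the broken and the working state is the cable.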

5. Reproduce the problem

So you realise that one of the plugs was not installed correctly. You remove it and tighten it again, and now the engine is running smoothly. You look at your watch and it’s almost lunch time… Well, half a day is lost already and traffic will be lighter now. So you do what any self-respecting engineer would do: you ponder a bit more about the problem.

Can we ever be sure that we found the problem, and not just one more symptom of the actual problem? When working with complicated machinery, software or processes, the certainty we can allow ourselves that we have fixed a problem should be rather low. How can we be sure, then, that we will be able to drive all the way to work without any issues?

A pretty powerful technique to achieve this is to attempt to reproduce the problem. But what sane person would want to re-introduce the problem in their process, just for the sake of verification? Well, I would :) By reproducing the problem we are able to examine it better. It’s as if all this time we were trying to get the problem under the lens of our investigative microscope, and now that we have finally located it, we can study it in the greatest detail!

First and most important, we can verify whether the problem appears again if we revert our solution. Does the car stop working again if we put the damaged cable back? If yes, then this is a strong indication that the problem was indeed with the cable. If not, then it means that we haven’t really found the problem and that it disappeared by chance. Furthermore, if we are able to reproduce the problem, we are able to create fail-safes around it so that either the problem stops occurring, or when it does occur it is easier to detect and solve. For instance, if we determine that the cable was indeed the problem, we could install a sensor to check the cable’s integrity in the future. Or we could coat the cable with a material that smells bad and repels rodents :)

But whatever we do, being able to reproduce the problem gives us a higher level of certainty that we have located it correctly.
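The same check can be written down as a small routine: the fix counts as confirmed only if the system works with it and fails again without it. This is a minimal sketch with hypothetical names (fix_is_confirmed, starts, fit_new_cable, refit_old_cable) and a toy dictionary in place of a real car.

```python
def fix_is_confirmed(state, apply_fix, revert_fix, works):
    """Return True only if the fix makes it work AND reverting breaks it again."""
    apply_fix(state)
    works_with_fix = works(state)

    revert_fix(state)  # deliberately reproduce the problem
    fails_without_fix = not works(state)

    apply_fix(state)   # leave the system in its fixed state
    return works_with_fix and fails_without_fix


def starts(car):
    # Toy stand-in for "turn the key and see if the engine runs".
    return car["cable"] == "intact"


def fit_new_cable(car):
    car["cable"] = "intact"


def refit_old_cable(car):
    car["cable"] = "chewed"


car = {"cable": "chewed"}
confirmed = fix_is_confirmed(car, fit_new_cable, refit_old_cable, starts)
print(confirmed, car)
```

If the car had kept starting with the old cable refitted, the function would return False, which is the ‘problem disappeared by chance’ case described above.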

6. Be patient

I started this article stating that troubleshooting can be a frustrating experience. And this is definitely true, especially when you are not in the right mindset for troubleshooting. If I had to describe the right mindset for troubleshooting in one word, patience would be it. Finding the source of the problem is usually a long process with a lot of false positives, which can bring a certain feeling of grievance and irritation.

But you have two options: either to keep on trying or to give up. Considering that the second is sometimes not even an option, especially in a business environment (unless you don’t mind getting fired), then there is not much choice left. Realising that you need to keep trying until you find and fix your problem is definitely the biggest enabler in troubleshooting. Nagging and complaining doesn’t work. Trying more of your assumptions does.

7. Talk to the duck


The duck is your best friend. Tell her everything.

In software engineering, there is a term called rubber duck debugging. It is a methodology where you explain your problem to a bathtub duck as explicitly as possible, so that the duck can understand it. Before you even finish, the duck will transmit a very useful hint to your brain, using brainwaves. This hint will be something that is very obvious but that you somehow overlooked earlier. Using that hint, the problem becomes clearer and you can progress in your work. In any case, you shouldn’t forget to thank the duck for saving the day.

Even though this might seem like a silly idolatrous tradition, it actually works much better than you think. Sometimes, ideas get really mixed up in your brain and it is very difficult to make sense of anything. By structuring your thoughts as part of an explanation, you need to fill in logical gaps that might have existed, and in doing so you provide yourself with important information. So important that it often leads to the solution of the problem.

I don’t have a duck, but I do practise this very often with my wife. She is there to listen to me, and usually I find the solution midway through the explanation. You should definitely try it as well. If you feel uncomfortable talking to an inanimate object, try with a person. Whatever you choose though, never ever forget to thank your ‘duck’ for helping you out ;)


Wrapping up

These were my top seven tips or guidelines for debugging, based on my work experience. These are surely not the only ones. I am looking forward to learning: what are your favourite methods? What helps you see the light at the end of the tunnel when debugging?
