Troubleshooting
Guidelines and Techniques

I've been constantly surprised down through the years that troubleshooting techniques that seem "obvious" to me are viewed as some sort of arcane knowledge by most folks. And I've often been disappointed that what started as a simple problem that could have been fixed in a few minutes grew into a huge tangle that even an expert couldn't sort out quickly. Troubleshooting needn't be the black art it's commonly thought to be. It seems though that just the Hippocratic guideline "do no harm" isn't enough; a more detailed step by step recipe is needed. So I've broken down troubleshooting into these ten specific rules of thumb.

The guidelines below are like guard rails on a mountain road. Keep them in mind to avoid plunging off a cliff:

Guidelines

1) Shorten the recipe for recreating the problem as much as you can while still producing an unambiguous answer

Having a way to quickly and clearly determine if the problem is fixed yet is very important; don't do anything else until you do this.

One goal is to be able to quickly check how you're doing. Hopefully you can boil the problem down to something you can check in just a few seconds. For example, if the problem is that your accounting program can't open any of its files, checking by mounting a particular USB thumb drive and then trying to open a special file on it is probably overly complicated. Strip the problem to just its essence: in this example, just checking whether your accounting program can open a regular file on your main hard drive may be enough.

The other goal is to come up with a recipe that quickly determines "yes, it's fixed" or "no, it still fails" (or maybe "no, but it's different now"). If "is it fixed?" is answered "maybe", your checking procedure isn't clear enough yet.
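
Here is a minimal sketch of such a check for the accounting example, written in Python. The helper name can_open is made up for illustration and is not part of any real accounting program. It answers only "yes" or "no", never "maybe".

```python
from pathlib import Path


def can_open(path):
    """Return True if the file at `path` can be opened and read, else False."""
    try:
        with Path(path).open("rb") as handle:
            handle.read(1)  # reading one byte proves the open really worked
        return True
    except OSError:
        # Missing file, permission problem, bad disk: all count as "still fails".
        return False
```

Point it at one regular file on your main hard drive and rerun it after every guess; the whole check takes well under a second.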

2) Find something that works correctly even though it's similar to the problem, and test it again at every step

Don't focus so tightly on the problem that it blinds you to everything else. To avoid over-focusing, find something that works correctly even though it's similar to the problem. Then test every guess twice, once on the thing that doesn't work and again on the thing that does work.

For example suppose the problem is that a particular new website won't display correctly. Casting about for something similar that works, you very quickly come up with display of a familiar website that's already bookmarked. Now the failing thing is a site that won't display, and the working thing is a site that will display.

Suppose the very first guess involves clicking the browser's "refresh" button. You find the refresh doesn't change the mis-display of the failing site. You go on to test the thing that works, and you find that although the site originally displayed correctly, after a refresh it doesn't display correctly any more. You quickly realize your Internet access isn't working at all, and correct displays are coming out of the browser's cache; sure enough, your modem's power cord has fallen out. Problem solved.

If you hadn't tested the working thing as well as the failing thing at every step, it would have taken you a lot longer to zero in on the problem.
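
The habit of testing both things at every step can be sketched in a few lines of Python. The function names and the wording of the readings are assumptions made for illustration; the probe is whatever quick check you already have.

```python
def try_guess(probe, failing_case, working_case):
    """Run the same probe on both cases and return the two results."""
    return {"failing": probe(failing_case), "working": probe(working_case)}


def interpret(results):
    """Turn the pair of results into a short plain-language reading."""
    if results["failing"] and results["working"]:
        return "both work: the guess may have fixed it"
    if not results["failing"] and not results["working"]:
        return "both fail: suspect something shared, not the failing case itself"
    if results["working"]:
        return "only the working case works: the guess did not help"
    return "only the failing case works: the guess broke the working case"
```

In the website example, the refresh leaves both sites failing, so the reading is "both fail", which points at something shared such as the Internet connection.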

3) Test each guess right away and move on promptly if it doesn't help the problem

Generate and test as many guesses as you can as quickly as possible (similar to brainstorming). Don't get hung up on a guess that doesn't improve the problem. (Especially don't get hung up on your first guess.) If a particular guess has occupied you for more than five minutes, focus exclusively on making a firm decision about it.

As soon as you formulate a guess, start determining if it helps. Strive for a clear "yes" or "no" answer. If your answer is still "maybe", try harder.

You may find it helpful to write down every guess and how you tested it. Then if you later find you've misinterpreted a particular test tool, you can quickly and accurately find just the affected guesses and retest them. If you generate several guesses at the same time, you may find it helpful to write down the others while you test the first one; that way you won't forget any of your guesses.
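
If you keep your notes on the computer, even a tiny structure is enough. This Python sketch is only one possible layout; the field names and the helper to_retest are assumptions, and pencil and paper work just as well.

```python
from dataclasses import dataclass


@dataclass
class Guess:
    idea: str    # what you thought might be wrong
    tool: str    # how you tested it
    result: str  # "yes", "no", or "maybe"


def to_retest(log, suspect_tool):
    """Return every guess whose test relied on a tool you no longer trust."""
    return [guess for guess in log if guess.tool == suspect_tool]
```

If you later learn you misread one test tool, to_retest(log, "that tool") gives you exactly the guesses to check again.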

4) Change only a little at a time

Changing lots of things at the same time is like the stock fairy tale phrase "he rode off madly in all directions". If it works after you change ten things all at once, you don't know which of those ten things made the difference. (In fact, you probably can't even remember all ten of the things you changed.) And it's quite likely at least one of the other nine changes actually broke something else, which you won't discover for hours or days.

You might find it useful to write very detailed notes. Or you might find it best to only change one thing at a time.

5) If a change didn't solve the problem, put it back the way it was

If you keep making changes without putting them back, your system will almost certainly become more and more broken. Although there was probably only one problem when you started, you will soon have several problems. Now even an expert may have great difficulty untangling the mess.

To avoid this death spiral, every time you try a change, check it and if the problem is still there put the change back the way it was.
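
When the change is an edit to a single settings file, the put-it-back step can be automated. This is a minimal Python sketch under that assumption; try_change, apply_change, and is_fixed are hypothetical names, and is_fixed stands for your own quick check.

```python
import shutil
from pathlib import Path


def try_change(path, apply_change, is_fixed):
    """Apply one change to a file; keep it only if the problem is fixed."""
    path = Path(path)
    backup = path.with_name(path.name + ".bak")
    shutil.copy2(path, backup)  # save the state you started from
    try:
        apply_change(path)
        fixed = bool(is_fixed())
    except Exception:
        fixed = False  # a change that blows up counts as "not fixed"
    if not fixed:
        shutil.copy2(backup, path)  # put it back the way it was
    backup.unlink()
    return fixed
```

Either the change helped and stays, or the file ends up exactly as it began; there is no third outcome to untangle later.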

One way to think of this is to suppose your system has 100 different possible states and only one of them works. This system has 99 different ways to be broken: the one that it's in and 98 others (many of which are even worse). Any change moves you to one of the other 99 states (100 less the 1 state you're already in), and 98 of those are broken, so each change you make has 98/99 odds of merely switching to some other kind of "broken"; those are pretty lousy odds. Another way to think of this is, "better the devil you know than the devil you don't".
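
The arithmetic can be checked in a few lines of Python. It assumes, as the illustration does implicitly, that an unguided change is equally likely to land in any of the other states; real systems are rarely that tidy.

```python
from fractions import Fraction

total_states = 100
working_states = 1

# Every state except the broken one you are in right now.
other_states = total_states - 1
# Broken states other than the current one.
other_broken = total_states - working_states - 1

odds_still_broken = Fraction(other_broken, other_states)
print(f"{odds_still_broken} is about {float(odds_still_broken):.0%}")
```

So roughly 99 unguided changes out of 100 just trade one kind of "broken" for another.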

The above guidelines will avoid wasting lots of time and will keep you from making a bad situation even worse. But just steering clear of gross stupidity isn't enough; how can you actually move forward to isolate and fix the problem? Just follow one or more of these troubleshooting techniques:

Techniques

A) Always first look at what changed recently

If it worked last week but doesn't work now, and you made a change just three days ago, look long and hard at that change. In fact, until you've spent at least an hour and taken at least one break to clear your mind, don't look at anything else at all.

Undo the change you made recently and see if your problem goes away. If the problem disappears, look very hard at that change trying to see some subtle error or incompleteness.
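
If you can't remember what changed, file modification times sometimes jog the memory. This Python sketch assumes the change left its mark on files under one folder; recently_changed is a made-up helper, and it won't see changes that don't touch files there.

```python
import time
from pathlib import Path


def recently_changed(root, days, now=None):
    """List files under `root` modified within the last `days` days, newest first."""
    now = time.time() if now is None else now
    cutoff = now - days * 86400  # 86400 seconds in a day
    hits = []
    for path in Path(root).rglob("*"):
        if path.is_file():
            modified = path.stat().st_mtime
            if modified >= cutoff:
                hits.append((modified, path))
    return [path for _, path in sorted(hits, reverse=True)]
```

For the example above, recently_changed(some_folder, days=7) would include whatever was touched three days ago.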

B) Find out if others have had a similar problem

If you get an error message, type the exact words of the message (especially any error message number, even if it's gobbledygook to you) into your favorite web search. Sometimes both a description of your problem and a solution will pop out.

Academics call this "standing on the shoulders of giants". Use this shortcut whenever you can; only when it's really necessary spend the extra time to start all over from scratch.

C) Look for what's common

If several different problems all started at the same time, it's quite likely they can all be traced back to the same cause. But it won't always be immediately clear where the problem lies. You may have to think about it, write some notes, and perhaps even perform some experiments before that "aha" moment.

For example, suppose you notice three different problems at about the same time: 1) although your email program shows what happened up until yesterday just fine, it never seems to catch up with what happened today; 2) popup error messages about some "logging" issue keep appearing in the middle of your screen; and 3) the new accounting entries you make don't change this month's running totals.

Use whatever knowledge you have of how each piece of software works inside to list the steps it goes through when the problem becomes apparent. To continue the example: 1a) read new email from network, 1b) update local mail file and write it to disk, 1c) display what's in the local mail file; 2a) any significant event occurs, 2b) write information about the significant event to the local log file on the disk, 2c) if there's a problem display a popup message to the user; 3a) accept new accounting entry from the screen, 3b) vet the new entry to make sure it makes sense, 3c) update the local accounting file and write it to disk, 3d) calculate running totals from the local accounting file, 3e) display the calculated running totals. 1b), 2b), and 3c) are all about writing a file to the local disk. A problem writing to the local disk would explain all the seemingly unrelated problems; some sort of problem writing a new or modified file to the disk becomes your prime suspect.
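
The same reasoning can be done mechanically once each step is tagged with what it depends on. In this Python sketch the tags ("network", "disk write", and so on) are my own labels for the steps listed above, not anything the programs report. Displaying is also common to all three lists, but the screens and popups visibly appear, so it is set aside as known to work.

```python
steps = {
    "email": {"1a": "network", "1b": "disk write", "1c": "display"},
    "logging": {"2a": "event", "2b": "disk write", "2c": "display"},
    "accounting": {
        "3a": "screen input",
        "3b": "validation",
        "3c": "disk write",
        "3d": "calculation",
        "3e": "display",
    },
}

# Things you can see working with your own eyes.
known_good = {"display"}

# Whatever every failing program depends on...
common = set.intersection(*(set(s.values()) for s in steps.values()))
# ...minus whatever is visibly fine, leaves the suspects.
suspects = common - known_good
print(suspects)
```

What's left is the single suspect from the example: writing to the local disk.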

D) Repeatedly compare the working recipe with the failing recipe, asking again and again "what's different?"

Tweak both the working recipe and the failing recipe little by little to progressively be more like each other. Eventually there will be only one difference left. When that happens, you will have isolated the one difference that matters, the key to the problem!
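
If each recipe can be written down as a set of named settings, asking "what's different?" is a short loop. This Python sketch assumes exactly that; the function name and the sample settings in the usage note are invented for illustration.

```python
ABSENT = "<absent>"  # marker for a setting that one recipe lacks entirely


def differences(working, failing):
    """Return {setting: (working_value, failing_value)} for every mismatch."""
    result = {}
    for key in sorted(working.keys() | failing.keys()):
        w = working.get(key, ABSENT)
        f = failing.get(key, ABSENT)
        if w != f:
            result[key] = (w, f)
    return result
```

For instance, differences({"site": "bookmarked", "proxy": "off"}, {"site": "new", "proxy": "on"}) reports both settings. Change one of them to match, run it again, and repeat until a single entry remains.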

E) Break down the process and test each piece separately

For example, if the problem is that two computers can't share files, you can break the problem up into: 1) can the computers communicate at all, 2) does name-to-address translation work for both machines, 3) is the problem only in one direction, only in the other direction, or in both directions, and 4) is the problem related to permissions. Test 1) by using ping with an IP address. If even this fails, look for loose cables and power failures. Test 2) by using ping with a computer name. If there's a problem, use another diagnostic tool like nslookup to localize it. Test 3) by using ping from one computer then from the other computer and seeing if the symptom looks the same. Test 4) by looking through the system logs for any error message about file access. With any luck the example problem will shrink to only one fourth of its initial size.

This way what begins as a large amorphous problem is reduced to a much smaller and clearer problem. Usually just one or two iterations of breaking the problem into pieces will pinpoint it so effectively you can then fix it.
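
Testing the pieces in order can be sketched in Python as a list of named probes run one after another. The names first_failure and name_resolves are assumptions for illustration. The name lookup uses Python's standard socket.gethostbyname; probes for the other pieces (ping, reading the logs) are left for you to supply, since how to run them differs from system to system.

```python
import socket


def name_resolves(hostname):
    """Piece 2: does name-to-address translation work for this name?"""
    try:
        socket.gethostbyname(hostname)
        return True
    except OSError:
        return False


def first_failure(checks):
    """Run (name, probe) pairs in order; return the first failing name, or None."""
    for name, probe in checks:
        try:
            ok = bool(probe())
        except Exception:
            ok = False  # a probe that blows up counts as a failure
        if not ok:
            return name
    return None
```

Ordering the list from the most basic piece to the most specific means the first failure reported is usually the piece to look at next.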

Another Way

Of course you may be able to skip the "troubleshooting" phase entirely and move directly to resolving the problem:

Fix Most Anything With Either WD40 or Duct Tape

Email comments to Chuck Kollars

All content on this Personal Website (including text, photographs, audio files, and any other original works), unless otherwise noted on individual webpages, is available to anyone for re-use (reproduction, modification, derivation, distribution, etc.) for any non-commercial purpose under a Creative Commons License.
