2008/12/09

Some notes about troubleshooting

From Why Programs Fail: A Guide to Systematic Debugging:

Some terminology: from defects to failures

  1. The programmer creates a defect.
  2. The defect causes an infection (the program state differs from what the programmer intended).
  3. The infection propagates.
  4. The infection causes a failure (an externally observable error in the program behavior).
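
A tiny Python sketch may make the terminology concrete. The average function is hypothetical, invented only for this illustration and not taken from the book.

```python
def average(values):
    total = 0
    for v in values[1:]:         # defect: the first element is skipped
        total += v               # infection: total may now hold a wrong value
    return total / len(values)   # the infected state propagates into the result


# Failure: the wrong result becomes externally observable.
print(average([4, 4, 4]))

# Same defect, but no infection: skipping a leading 0 leaves total correct,
# so no failure is ever observed with this input.
print(average([0, 5, 5]))
```

The second call is the dangerous one: the defect is executed, yet nothing looks wrong from the outside.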

Debugging can be decomposed into seven steps:
  1. Track the problem in the database
  2. Reproduce the failure
  3. Automate and simplify the test case
  4. Find possible infection origins
  5. Focus on the most likely origins:
       • Known infections
       • Causes in state, code, and input
       • Anomalies
       • Code smells
  6. Isolate the infection chain
  7. Correct the defect

Note the TRAFFIC mnemonic.
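
The "automate and simplify the test case" step is the one that lends itself best to code. What follows is a minimal sketch of a greedy input reducer in the spirit of delta debugging; it is a simplification, not the exact algorithm from the book. The names simplify and fails are assumptions, and fails stands in for whatever automated check reproduces your failure.

```python
def simplify(failing_input, fails):
    """Shrink failing_input while fails() keeps reporting the failure."""
    if not fails(failing_input):
        raise ValueError("the starting input must reproduce the failure")
    chunk = len(failing_input) // 2
    while chunk >= 1:
        i = 0
        while i < len(failing_input):
            # Try the input with one chunk removed.
            candidate = failing_input[:i] + failing_input[i + chunk:]
            if fails(candidate):
                failing_input = candidate   # the chunk was irrelevant, drop it
            else:
                i += chunk                  # the chunk is needed, keep it
        chunk //= 2                         # retry with smaller chunks
    return failing_input


def fails(text):
    # Hypothetical failure: the program under test breaks whenever
    # the input contains both "<" and ">".
    return "<" in text and ">" in text


print(simplify("a<b c>d e", fails))  # prints "<>"
```

The smaller the reproducing input, the fewer possible infection origins are left to examine in the next steps.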

Opinion ON:

That's a nice theory. Unfortunately, the complexity of IT has taught us to treat computers as non-deterministic beasts. The first thing we do when there is a problem is restart the system, hoping that the problem goes away so that we can keep doing our job. And it often works.

But we, software developers and support technicians, need to unlearn our urge to hide the symptoms. If we suspect that there is a defect in the code (and trust this old software developer, you can be confident that there very likely is one), our goal should always be to find the defect, instead of changing things so that the defect is not executed or the infection does not end up causing a failure. Leaving that defect unfixed is very likely to cause pain in the future, à la programming by coincidence.

Some advice
  • avoid corrective actions before the issue is understood. Until the failing code is identified, you should only use temporary corrective actions that help you frame the problem. I'm guilty of having neglected this rule lots of times, asking customers to just upgrade the code to see if the issue goes away.
  • when possible, use tools that minimize the infection propagation (that is, tools that make the program crash as soon as possible). On Windows, I was recently introduced to pageheap and I'm in love with it. More on it another day.
  • defects come from source code: give support technicians access to the source code and the knowledge to read it. Support people and developers should change seats quite often.
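
The "crash as soon as possible" advice can also be applied in your own code, without any special tool. Here is a minimal Python sketch, assuming a hypothetical Inventory class: an invariant check runs after every change, so an infection is reported right where it originates instead of many calls later.

```python
class Inventory:
    def __init__(self):
        self.stock = {}

    def _check(self):
        # Invariant: no item may ever have a negative count.
        bad = {item: n for item, n in self.stock.items() if n < 0}
        if bad:
            raise AssertionError(f"negative stock: {bad}")

    def add(self, item, count):
        self.stock[item] = self.stock.get(item, 0) + count
        self._check()

    def remove(self, item, count):
        # Defect: nothing verifies that enough items are available.
        self.stock[item] = self.stock.get(item, 0) - count
        self._check()   # the infection is caught here, next to its origin


inv = Inventory()
inv.add("bolt", 2)
try:
    inv.remove("bolt", 5)
except AssertionError as err:
    print("stopped early:", err)
```

Without the check, the negative count would sit quietly in the state until some unrelated report or order tripped over it, far away from the defect.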

1 comment:

Anonymous said...

Today, when I was doing the work assignments for tomorrow, I booked one of the members of my team to use pageheap to find that damn bug, as you recommended. I will tell you about the results...