Basic Techniques

Breakpointing

Most systems provide a debugger that lets you set breakpoints at various points in the program. When the program halts at a breakpoint, you can inspect or change the values of variables, set new breakpoints, run until the next breakpoint is encountered, etc. Commonly there is support for high level languages, so that you can use high level language statement numbers and variable names: a so-called high level language 'symbolic debugger'.

This is a very efficient method for debugging small programs, but hardly effective for large programs, in particular those with complex data structures. It is definitely unsuitable for debugging real-time software and software which is supposed to run continuously: when you hit a breakpoint all timing is lost, so it makes no sense to continue.

Dumps

Crash Dump

Most systems have a (compiler or link editor) option to generate a core dump whenever the program crashes (therefore also called a post mortem dump, as opposed to a snap dump). Commonly, the dump provides hexadecimal output which can only be interpreted with effort (i.e. slowly). Even if support for high level languages is available, this option is in general of limited use, as the program has probably continued for some time after it processed the faulty part: the results are completely muddled.
However, combined with other methods that detect an inconsistency early in the process (e.g. Assert statements) and force a crash, a crash dump may be useful.

For intermittent failures of a software process, crash dumps, selective tracing and abundant use of asserts may be the only way to find the problem.

Snap Dump

A Snap Dump provides a dump of sections of the program's memory but, as opposed to the Crash Dump, processing may proceed afterwards. Therefore it is more like a trace. However, as the dump is commonly in hexadecimal format, its use is limited. For complex data structures it is much better to create a Structured Dump routine.

Stack Dump

Some systems can provide a stack dump showing the call sequence and potentially parameter values. This can be produced either as a Crash Dump or as a Snap Dump. A stack dump should provide a trace of all the functions/routines called. Such a dump can be very useful as it shows the program's execution path and important values (parameters).
If your system does not provide such a facility, you can try to create it (make it a library routine so everybody can use it).
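
As an illustration of such a library routine, here is a minimal sketch in Python, using the standard inspect module. The name stack_dump and the output layout are assumptions made for this example, not part of any existing facility.

```python
import inspect
import sys

def stack_dump(out=sys.stderr):
    """Write the current call sequence to `out`, oldest call first."""
    # inspect.stack() lists the innermost frame first; skip stack_dump
    # itself and reverse the rest so the output reads top-down.
    callers = inspect.stack()[1:]
    for depth, info in enumerate(reversed(callers)):
        args = inspect.getargvalues(info.frame)
        # Show each named parameter with its current value.
        params = ", ".join(
            f"{name}={args.locals.get(name)!r}" for name in args.args
        )
        out.write(
            f"{'  ' * depth}{info.function}({params})"
            f"  [{info.filename}:{info.lineno}]\n"
        )
```

Called without arguments it acts as a snap dump of the stack on stderr; passing a file object sends the dump to a trace file instead.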

Structured Dump

Whenever you have a non-trivial data structure, immediately write a dump routine which displays it in functional terms, i.e. as the structure it represents rather than as raw memory (sooner or later you will have to write this routine anyway). Call the Structured Dump routine whenever you would want to do a snap dump, or when you detect an inconsistency (e.g. by the Assert statement).
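
A minimal sketch of such a routine, assuming a simple binary tree as the data structure. The Node class and the dump_tree routine are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def dump_tree(node, indent=0, label="root"):
    """Return the tree as indented text, one node per line."""
    pad = "  " * indent
    if node is None:
        return f"{pad}{label}: -\n"
    text = f"{pad}{label}: {node.key}\n"
    # Only descend when there is at least one child, so leaves stay on one line.
    if node.left is not None or node.right is not None:
        text += dump_tree(node.left, indent + 1, "L")
        text += dump_tree(node.right, indent + 1, "R")
    return text
```

Returning text instead of printing it lets the same routine feed the screen, a trace file, or the message of a failed assert.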

Structure verification

Though formally not a 'dump' routine, this is a variant of the Structured Dump: for any complex data structure, create a routine which checks its consistency. During testing, call the routine every time the data structure has been modified. When an inconsistency is found (e.g. through an Assert statement), you can dump the structure.
In operational software you can run the routine less often (e.g. depending on the Trace Option), but keep the capability to run it on demand when there are any suspicions (operational databases almost always have inconsistencies).

Tracing

Tracing is outputting additional data from a program to show intermediate values of important variables and/or the flow of control. It is an effective method to verify the software and to locate bugs.

Often the trace output is sent to the control terminal (screen), but for serious tracing, output to a file is much better: the relevant part doesn't scroll off your screen, and a file allows searching and possibly post-processing.

There are, however, more possibilities, and programs which run continuously need other methods altogether.

Trace File

A nice method to get trace data only when there are problems is to always create a trace file and write to it, but to delete the file when all goes well. When there are problems, keep the file. Operating systems often provide for this:


IBM MVS JCL:  DD …,DISP=(new,delete,keep)
DEC VMS:      Open( …, History=NEW, Disposition=SAVE, … );
              Close( …, Disposition=DELETE, … );
Unix:         Open(file); …
              Close(file); Delete(file);

Experience with programs using this mechanism (at the price of some I/O overhead) is good: when the program runs successfully, nothing is left in the end; only when it crashes do you have trace results.
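
The same mechanism can be sketched in Python as a context manager. The name trace_file is hypothetical, and line buffering is an assumption made here so that little output is lost if the program dies abruptly.

```python
import os
from contextlib import contextmanager

@contextmanager
def trace_file(path):
    """Open a trace file; delete it on success, keep it on failure."""
    f = open(path, "w", buffering=1)  # line buffered
    try:
        yield f
    except BaseException:
        f.close()
        raise                # failure: the trace file is kept
    else:
        f.close()
        os.remove(path)      # success: nothing is left behind
```

If the process is killed outright, the clean-up code never runs and the file also remains, which is the desired behavior.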

Trace Option

Very good experience has been obtained with a trace option as one of the standard program options. Whenever there was a problem, the program was run again with the trace switched on (in general providing structured dumps), without the need to use a special version of the program or to recompile it. The trace option may create a trace file (or prevent deletion of the trace file which was made by default).
The trace option may take a value, so that the amount of trace detail or the data sections to be traced can be controlled.
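
A minimal sketch of a trace option with a level, using Python's standard argparse module. The option name --trace, the meaning of the levels and the Tracer class are assumptions made for this example.

```python
import argparse
import sys

def parse_args(argv):
    parser = argparse.ArgumentParser()
    # "--trace" alone selects level 1; "--trace 3" selects level 3.
    parser.add_argument("--trace", type=int, nargs="?", const=1, default=0,
                        metavar="LEVEL", help="trace detail level (0 = off)")
    return parser.parse_args(argv)

class Tracer:
    """Write a message only if its level is within the selected level."""

    def __init__(self, level, out=sys.stderr):
        self.level = level
        self.out = out

    def __call__(self, level, message):
        if level <= self.level:
            self.out.write(f"TRACE{level}: {message}\n")
```

A program would typically create one tracer at start-up, e.g. `trace = Tracer(parse_args(sys.argv[1:]).trace)`, and then call `trace(1, "...")` for coarse and `trace(3, "...")` for detailed output.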

Assert Trace

A particular implementation of the Assert statement can be used as a (program flow) trace. Insert Assert statements at all important points to demonstrate the flow of control (routine entries, if- and case-statements). If necessary use dummy asserts (e.g. 'ASSERT( 273, true );'). Adapt the ASSERT-routine to dump the buffer contents when the ASSERT_INDEX wraps around.
You may even trace in 'critical sections' and supervisory software if you ensure that no output occurs there (i.e. no index wrapping; a sufficiently large buffer which is cleared before entering). This method has been applied successfully in device drivers.
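
The idea can be sketched as follows. This is an illustration in Python under stated assumptions, not the original ASSERT-routine: the class AssertTrace, its buffer size and its output format are all hypothetical.

```python
import sys

class AssertTrace:
    """Record the identifiers of passed asserts in a fixed-size buffer."""

    def __init__(self, size=64, out=sys.stderr):
        self.buffer = [0] * size
        self.index = 0
        self.out = out

    def dump(self):
        # Output the identifiers recorded so far, then start again at 0.
        passed = " ".join(str(i) for i in self.buffer[:self.index])
        self.out.write(f"assert trace: {passed}\n")
        self.index = 0

    def check(self, ident, condition):
        self.buffer[self.index] = ident
        self.index += 1
        if not condition:
            self.dump()  # show the path that led to the failure
            raise AssertionError(f"assert {ident} failed")
        if self.index == len(self.buffer):
            self.dump()  # buffer full: output before the index wraps around
```

Recording an assert only stores one number, so no output occurs until the buffer is full or an assert fails. Calling dump() just before a critical section empties the buffer, so that no output happens inside it as long as the buffer is large enough.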

Advanced Techniques

Continuous Processing

When you are confronted with software which should run continuously, tracing will result in huge amounts of data, so unrestricted tracing is out of the question. Still, an application which displays erroneous behavior or crashes after many days of operation needs some kind of tracing (and breakpointing is not realistic either). The following sections discuss some tracing techniques which could be applied under such circumstances.

Tracing

For continuously running programs, consider closing and reopening trace files from time to time (e.g. every hour, or every 100 cycles or messages), and/or use multiple files. For example, close the active trace file after an hour, and start a new file. Use a limited number of files to avoid flooding the disk, but try to cover at least 24 hours of tracing.
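
In Python, hourly files with a limited history can be sketched with TimedRotatingFileHandler from the standard logging.handlers module. The function name make_trace_logger, the logger name and the message format are assumptions made for this example.

```python
import logging
import logging.handlers

def make_trace_logger(path, hours_kept=24):
    """Return a logger that starts a new trace file every hour.

    At most `hours_kept` rotated files are kept; older ones are deleted.
    """
    handler = logging.handlers.TimedRotatingFileHandler(
        path, when="h", interval=1, backupCount=hours_kept)
    handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
    logger = logging.getLogger("trace")  # logger name is an arbitrary choice
    logger.setLevel(logging.DEBUG)
    logger.propagate = False             # keep trace output out of the root logger
    logger.addHandler(handler)
    return logger
```

With the default of 24 rotated files plus the active file, roughly the last day of operation remains available while disk usage stays bounded.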

Message Trace

If the system consists of multiple processes which exchange information via messages or transactions, add a 'trace flag' to the message/transaction header. Upon reception of a message with this trace flag set, each process should start producing trace output and subsequently set the trace flag on all outgoing messages caused by the message received. When the message has been processed, tracing should be turned off.

The nice thing about this mechanism is that you can selectively trace the effects of any event throughout the system. Suspected message sources or suspected conditions can be modified to set the trace flag. The experience with this mechanism is extremely good. Unfortunately, the method is a design decision and very hard to introduce at a later stage.
An alternative method, which can always be introduced but is less elegant, is to adapt the message exchange (interface) routine to dump messages, potentially filtering on message source and/or destination and/or message values.
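
The trace flag mechanism can be sketched as follows. This is a simplified, single-threaded illustration: the classes Message and Process, and the chain of processes calling each other directly, are assumptions made for this example and stand in for real inter-process messaging.

```python
from dataclasses import dataclass

@dataclass
class Message:
    payload: str
    trace: bool = False  # the trace flag in the message header

class Process:
    def __init__(self, name, trace_log, downstream=None):
        self.name = name
        self.trace_log = trace_log    # list collecting the trace output
        self.downstream = downstream  # next process in the chain, if any

    def receive(self, msg):
        # Tracing is on only while a flagged message is being processed.
        tracing = msg.trace
        if tracing:
            self.trace_log.append(f"{self.name}: got {msg.payload}")
        # Every outgoing message caused by this one inherits the flag.
        out = Message(payload=f"{msg.payload}>{self.name}", trace=tracing)
        if self.downstream is not None:
            self.downstream.receive(out)
```

Setting the flag on a single incoming message is enough to obtain trace output from every process that the message touches, while all other traffic stays silent.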

Test Messages

Similarly to Message Trace, transactions can be extended with a Test option: the transaction is handled like any other, except that it is not completely followed through (similar to a roll-back in a database transaction).
This is an extremely valuable option, in particular on complex transaction systems. In this way new transactions, or transactions from a new program or a new source, can be verified on a live system without any consequences.
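
A minimal sketch of a Test option, using Python's standard sqlite3 module and relying on a database roll-back. The account table, the routine names and the returned strings are hypothetical and chosen for illustration only.

```python
import sqlite3

def make_db():
    """Create a small in-memory database with two accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
    conn.commit()
    return conn

def handle_transfer(conn, src, dst, amount, test=False):
    """Process a transfer; with test=True do all the work, then roll back."""
    try:
        conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?",
                     (amount, src))
        conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?",
                     (amount, dst))
        (low,) = conn.execute("SELECT MIN(balance) FROM account").fetchone()
        if low < 0:
            raise ValueError("insufficient funds")
    except Exception:
        conn.rollback()
        raise
    if test:
        conn.rollback()  # test transaction: fully checked, never applied
        return "verified"
    conn.commit()
    return "committed"
```

The test transaction follows exactly the same path as a real one, including all validation, so a result of "verified" shows that the real transaction would have been accepted.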

=O=
