Last update PvD

Design Recommendations

This section provides some general recommendations on the design of a software system, additional to the problem-specific design (i.e. after Architecture Design). It is about 'what is not in the requirements specification', but should be seriously taken into account.

Note that there is a significant overlap in techniques between debugging (locating and removing errors from a software module during development), tracing (solving integration problems), operational test and error messaging.

Design for Testability

Testing a large and/or complex system is far from trivial, yet it is essential for delivering and maintaining a good product.

What will not work

As can be seen from Testing, it is impossible to exhaustively test any non-trivial system. This is already true for fairly simple units. Commonly programs are considered as black boxes, that is, only testable on their external interfaces according to (functional) specification.
To avoid the black box, effort has to be spent on exposing the inner workings, intermediate results and internal states of the program, so that parts can be tested separately. It must also be possible to modify or set intermediate results in the program, so that the subsequent parts of the program can be tested.
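One possible way to avoid the black box is sketched below in Python, assuming the program can be split into named stages. The function `run_pipeline` and its `start_at` and `probe` parameters are hypothetical names introduced for illustration: `probe` exposes every intermediate result, and `start_at` lets a test inject an intermediate result and run only the subsequent stages.

```python
def run_pipeline(stages, value, start_at=0, probe=None):
    """Run stages[start_at:] on value, reporting each intermediate result.

    stages   -- list of (name, function) pairs, applied in order
    start_at -- index of the first stage to run (lets a test inject a value)
    probe    -- optional callback(name, result) to inspect intermediate results
    """
    for index in range(start_at, len(stages)):
        name, func = stages[index]
        value = func(value)
        if probe is not None:
            probe(name, value)
    return value


# Hypothetical three-stage program: parse, square, total.
STAGES = [
    ("parse", lambda text: [int(x) for x in text.split(",")]),
    ("square", lambda numbers: [n * n for n in numbers]),
    ("total", sum),
]
```

A test can then check the stages together, inspect each intermediate result through `probe`, or start at a later stage with a hand-made intermediate value.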

Most development platforms provide simple facilities to do just that: debuggers. Commonly there is even support for high-level languages. But such a tool has several major drawbacks:

It is too simple for large systems. It is very hard, if not nearly impossible, to verify complex data structures.
It provides insufficient facilities for automation (very desirable for regression testing).
It is completely unsuitable for (semi) real-time systems. Any breakpoint –required to inspect or set values– destroys the real-time behaviour, and makes it unclear whether the subsequent behaviour is caused by an error or by the breakpoint. Continuation from a breakpoint makes no sense.

This makes a debugger suitable for Unit testing at most, but not for any other test in a large system (such as an Integration test). Of course, a debugger remains useful for locating a problem.

Tracing generates huge amounts of data which have to be checked (manually?). Tracing in (semi-)continuously operating systems is a good way to fill your disks, but not useful for much else. So conventional tracing is hardly a test method (i.e. to verify correctness), but rather a debugging method (to locate the problem). Tracing can nevertheless be useful when applied very selectively and restrictively.
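As an illustration of selective tracing, the sketch below uses Python's standard `logging` module to enable detailed tracing for a few named modules only, while everything else stays at warning level. The function `configure_tracing` and the logger names `app.billing` and `app.stock` are assumptions made for this example.

```python
import logging


def configure_tracing(traced_modules, default_level=logging.WARNING):
    """Enable DEBUG tracing only for the named loggers; keep the rest quiet."""
    # Loggers without an explicit level inherit the root logger's level.
    logging.getLogger().setLevel(default_level)
    for name in traced_modules:
        logging.getLogger(name).setLevel(logging.DEBUG)


# Trace only the (hypothetical) billing module.
configure_tracing(["app.billing"])
logging.getLogger("app.billing").debug("invoice total recalculated")  # traced
logging.getLogger("app.stock").debug("stock level read")              # suppressed
```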

What about operational tests, i.e. when the system has been in operation for some time and problems appear: do you expect operators to use development tools on your software? Will you provide the source code to them? Not likely!

Tests during a System's Life Cycle

Consider the applicable Tests during a system's life cycle.
Obviously all these tests must be developed for your specific system, so why not do that early in the development, so that you can profit from them during the development?

Recommendations

regarding testing: Design for Testing

Consider testing and test facilities during the early design.
Make a Test plan describing how you are going to do all tests, and what facilities/tools and mechanisms you will use (e.g. set-up). Define test modes and test processes (e.g. consider automation).
Pay special attention to avoiding the 'black box': how are you going to inspect and set values (e.g. for the Integration test)?
Consider delivery & operational tests.
Define the test modules next to the functional modules to be developed.

This will also provide a good indication of the required effort for development (planning).

Design for Survivability

The system to be developed will be tested to assure conformance to the specification. As it is impossible to fully test any complex system, some errors will remain, and one is forced to use operational measures to handle these errors in order to obtain the desired reliability in operational systems. There are also problems for which the system was not designed, in general out-of-spec conditions (e.g. exceptional operational conditions, hardware errors, input errors, overload). The distinction between consequences of development errors and consequences of operational problems should be considered academic: the means for detecting and handling such problem conditions overlap. The main difference is that during development one may stop the system to locate the error, whereas during operational life this is undesirable and the system should continue as best it can. The system must respond to such problems in an intelligent manner in order to survive.
So the system is not only subjected to tests during development, but also 'tested' during its operational life. Therefore a system should also contain tests and error handling (problem management) to assure satisfactory operation during its lifetime. Design for Survivability should make the system 'robust' or resilient.

Defensive programming

Robust or resilient software has much in common with defensive programming; one may say that robust systems start with defensive programming, and that defensive programming is good programming carried to the extreme.
Robust software is supposed to keep running, even under 'adverse conditions'. It depends on the purpose of the system what should be considered: inconsistent data must be taken into account, but maybe not (minor) hardware failures. Of course, when data or conditions are nonsense, a system cannot be expected to carry out a good transaction. But a robust system should not crash; it should cancel the transaction, report the problem and continue.

Check

As a general rule: be very suspicious –almost paranoid– about everything, and data in particular. That may sound normal for data from some input form, but that should be extended to all information whether it is an intermediate result or data from a database. Distrust the reliability of (intermediate) results and the 'normal' operations of functions: check data consistency and the return codes of all functions. Here is a strong parallel with the assert statement (see Assert statement); you may not want to stop on an inconsistency, but at least flag it (analyze it later and resolve the issue).
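The 'flag it, but do not stop' variant of an assert could look like the Python sketch below. The names `soft_assert`, `flagged` and `average` are hypothetical, and returning 0.0 for an empty list is an assumption made only for this example.

```python
import logging

logger = logging.getLogger("consistency")

# Hypothetical in-memory record of flagged inconsistencies, for later analysis.
flagged = []


def soft_assert(condition, message):
    """Like assert, but record and log the failure instead of stopping."""
    if not condition:
        flagged.append(message)
        logger.warning("inconsistency: %s", message)
    return bool(condition)


def average(values):
    # Distrust the input: an empty list would cause a division by zero.
    if not soft_assert(len(values) > 0, "average() called with no values"):
        return 0.0
    return sum(values) / len(values)
```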

However, if the system can make a good guess or adapt the data to become acceptable, it may still carry out the transaction (while flagging an error). A simple example is a value which should be in a particular range, say 0..100 (e.g. a percentage). If the input value is below the range, adapt it to the lowest value in the range (i.e. 0), and if it is above the range, adapt it to the highest value in the range (i.e. 100). Maybe not the optimal result, but still acceptable and useful.
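The percentage example can be sketched in Python as follows. The function name `clamp_percentage` and the logger name are assumptions; the flagging is done through the standard `logging` module.

```python
import logging

logger = logging.getLogger("robust")


def clamp_percentage(value, low=0, high=100):
    """Return value forced into [low, high]; flag every adjustment."""
    if value < low:
        logger.warning("value %r below range, using %r", value, low)
        return low
    if value > high:
        logger.warning("value %r above range, using %r", value, high)
        return high
    return value
```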

Handling erroneous data is something different from covering up software errors (bugs); it is a way to make the system robust to (minor) errors while still producing reasonable output. Example: when calculating a square root, the input value should be non-negative (assuming we are not dealing with complex numbers). If the input value is negative, flag an error and return 0 (this is the lowest acceptable non-negative value; inverting the sign of a negative input parameter would suggest that a software error is being corrected).

During development such conditions should be investigated to see whether they are software errors (therefore this mechanism may have to be switched off during test), and special test cases should be generated to test the 'error correcting mechanism' itself.
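A minimal Python sketch of the square-root example, including a switch to turn the correction off during test. The function name `robust_sqrt` and its `strict` parameter are hypothetical.

```python
import logging
import math

logger = logging.getLogger("robust")


def robust_sqrt(x, strict=False):
    """Square root that tolerates negative input unless strict is set."""
    if x < 0:
        if strict:
            # Development/test: stop, so the cause can be investigated.
            raise ValueError(f"robust_sqrt: negative input {x!r}")
        # Operation: flag the problem and return the lowest acceptable value.
        logger.error("robust_sqrt: negative input %r, returning 0.0", x)
        return 0.0
    return math.sqrt(x)
```

With `strict=True` a negative input stops the program at the point of detection; without it the system logs the error and continues with 0.0. Both paths deserve their own test case.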

Robust systems require some kind of local system management so that they can respond 'intelligently' to abnormal conditions. The management function must become aware of failing components (e.g. crashed, stuck or looping processes) and respond adequately to that. It may even stop low-priority processes or block inputs to avoid flooding. But simple measures like an orderly start-up and shut-down of a series of tasks (often required in a specific order) already make a difference.
It is virtually impossible to create a management function which can respond effectively to all kinds of exceptional conditions, but it does not take exceptional effort to counter the most likely vital threats. For example, consider common hardware failures (e.g. a broken communication link, a crashed hard disk). Manual intervention is a very good option, provided that an operator can see what goes on in the system (i.e. monitoring capabilities have to be developed) and there are no severe real-time restrictions.
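The orderly start-up and shut-down mentioned above could be sketched in Python as follows. The classes `Task` and `TaskManager` are hypothetical; the sketch assumes that tasks must be stopped in the reverse of their start order, and that a failed start should shut down whatever was already running.

```python
import logging

logger = logging.getLogger("manager")


class Task:
    """Hypothetical task that records its start and stop in a shared list."""

    def __init__(self, name, events):
        self.name = name
        self.events = events

    def start(self):
        self.events.append(f"start {self.name}")

    def stop(self):
        self.events.append(f"stop {self.name}")


class TaskManager:
    """Start tasks in the given order; stop them in reverse order."""

    def __init__(self, tasks):
        self.tasks = list(tasks)
        self.started = []

    def start_all(self):
        for task in self.tasks:
            try:
                task.start()
            except Exception:
                # A failed start: stop what is already running, then report.
                logger.exception("start of %s failed; shutting down", task.name)
                self.stop_all()
                raise
            self.started.append(task)

    def stop_all(self):
        while self.started:
            task = self.started.pop()
            try:
                task.stop()
            except Exception:
                # Keep going: the remaining tasks must still be stopped.
                logger.exception("stop of %s failed; continuing", task.name)
```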

Proper maintenance is also vital for a robust system: you can't expect a poorly maintained system to be robust, so the management function should support maintenance for vital parts as well. Special attention should go to databases; experience has shown that databases get polluted after some time with inconsistent data.

Consistency

Whenever you have a large and/or complex data structure (such as a database, but certainly also a data structure in a program), create:

a structured dump routine, which shows what is actually in the data structure (for problem analysis).
a consistency or sanity check routine. Check for:
- structural consistency: all references or keys to other data records are valid (i.e. that the referenced record exists), and that there are no orphan records (records that are no longer referenced);
- data consistency: data values within a record –and potentially related records– have mutually consistent values.

Run such consistency checks regularly (as maintenance) and whenever a problem in the data structure is detected (e.g. an Assert-fail), and report any inconsistency (e.g. through a Structured Dump). For very large data structures one may restrict each run of the regular consistency check (and the structured dump) to part of the key range (e.g. key numbers 1…1000, or keys starting with 'A') to limit resource usage and the number of reported problems, but make sure all records are ultimately checked.
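A structured dump and a consistency check could look like the Python sketch below. The data model is an assumption made for illustration: `customers` maps a key to a record with the hypothetical fields `address`, `since` and `until`, and `addresses` maps a key to an address string.

```python
def structured_dump(customers, addresses):
    """Return a readable, sorted listing of everything in the structure."""
    lines = [f"customer {key}: {customers[key]!r}" for key in sorted(customers)]
    lines += [f"address {key}: {addresses[key]!r}" for key in sorted(addresses)]
    return "\n".join(lines)


def check_consistency(customers, addresses):
    """Return a list of problem descriptions; an empty list means consistent."""
    problems = []
    referenced = set()
    for key in sorted(customers):
        record = customers[key]
        # Structural consistency: the referenced record must exist.
        if record["address"] in addresses:
            referenced.add(record["address"])
        else:
            problems.append(f"customer {key}: unknown address {record['address']}")
        # Data consistency: values within a record must agree with each other.
        if record["since"] > record["until"]:
            problems.append(
                f"customer {key}: since {record['since']} is after until {record['until']}"
            )
    # Structural consistency: no orphan records.
    for key in sorted(addresses):
        if key not in referenced:
            problems.append(f"address {key}: orphan (not referenced)")
    return problems
```

Returning the problems as a list, rather than stopping at the first one, keeps the routine usable both for regular maintenance runs and for reporting after an Assert-fail.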

The action to be taken when an inconsistency is detected varies; initially it may require manual intervention to repair the problem. After some experience with such inconsistencies, some automatic repairs can be made reliably. But ultimately it is better to remove the inconsistent data (records) as inconsistent data generates more problems (and threatens system reliability) than missing data.

The experience with such a consistency check is very good: once we added it on a system which had already run for over a year, and it showed a great many problems (the original developers were surprised the system was still running with such data). And most problems were rather easy to repair.

Rerunnable steps

The next objective is never to make irreversible changes unless you are completely sure. Try to split transactions into steps such that you can either reverse a step or redo it with corrected input. This implies that each step must be basic, and that running a step again (intentionally or not) should not cause problems. For example, do not 'create a directory and fill it with files', but 'create the directory if it does not exist' and, as the next step, 'refill it with fresh files'.
A trivial example of reversibility is to not delete anything but place it in a bin or mark it to delete for later (automatic) deletion, or for recovery.
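The directory example and the 'bin' can be sketched in Python with the standard `pathlib` and `shutil` modules. The function names `ensure_directory`, `refill` and `soft_delete` are hypothetical, and the sketch assumes the directory holds plain files only.

```python
import shutil
from pathlib import Path


def ensure_directory(path):
    """Step 1: create the directory only if it does not exist yet."""
    Path(path).mkdir(parents=True, exist_ok=True)


def refill(path, files):
    """Step 2: replace the files in the directory with fresh ones.

    files maps a file name to its text content.
    """
    directory = Path(path)
    for old in directory.iterdir():
        if old.is_file():
            old.unlink()
    for name, text in files.items():
        (directory / name).write_text(text)


def soft_delete(path, bin_dir):
    """Move a file to a bin directory instead of deleting it."""
    bin_path = Path(bin_dir)
    bin_path.mkdir(parents=True, exist_ok=True)
    target = bin_path / Path(path).name
    shutil.move(str(path), str(target))
    return target
```

Both steps can be run any number of times with the same end result, and a file moved by `soft_delete` can still be recovered from the bin until it is purged.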

Graceful Degradation

The ultimate point is to design a 'graceful degradation' scheme. It is like a stack of exception routines: if the exception can't be resolved, the next level exception routine should be called.
There is a great difference between critical and non-critical systems: in non-critical systems you stop processing and allow human intervention to correct the problems and redo the operation; in critical systems you have to continue. For robust systems an 'assert fail' should not lead to the abortion of processing but only to logging for later investigation, or to a restart (but a restart can loop too).
For critical systems, make a distinction between vital processes and less vital processes; less vital processes can be disabled or stopped –at least for some time– without jeopardising the system's survival. It is on the vital processes that one should focus.
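The 'stack of exception routines' can be sketched in Python as a list of levels that are tried in order. The function `run_with_fallbacks` and the price lookup levels are hypothetical, chosen only to show one level handing over to the next.

```python
import logging

logger = logging.getLogger("degradation")


def run_with_fallbacks(levels, request):
    """Try each level in order; hand the request to the next level on failure."""
    for level in levels:
        try:
            return level(request)
        except Exception as exc:
            logger.warning("%s could not handle %r: %s", level.__name__, request, exc)
    raise RuntimeError(f"no level could handle {request!r}")


# Hypothetical levels for a price lookup, from best to most degraded.
def live_price(item):
    raise ConnectionError("price service unreachable")


CACHE = {"apple": 1.20}


def cached_price(item):
    return CACHE[item]  # raises KeyError when the item is not cached


def default_price(item):
    return 0.0  # assumed placeholder value of last resort
```

Only when every level has failed does the problem surface to the caller, which may then cancel the transaction or call for manual intervention.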

Continuing to process after an inconsistency has been detected implies that reasonable assumptions have to be made regarding the data. Typically, if a value should be above some threshold but it is lower, assume the threshold value. If a value should be within limits, use the limit values for 'out-of-range' data. The results are probably not accurate, but not nonsense either. You should log the exception, but continue with an acceptable value for further processing.

Summarising

Check data consistency and function return codes;
Investigate vital processes & data for exceptional conditions;
Try to make the processes 'robust' (by adapting data to in-range values, or fall-back for processes), or decide it makes no sense to continue processing (i.e. use an Assert) and cancel the transaction;
Focus on vital processes in critical systems, and define a graceful degradation for those;
Do consider common failures (in particular of hardware);
Have some monitoring process(es) to manage the overall system (sanity).

Design for Maintainability

When admitting that a developed system is not perfect, a strategy for corrective actions on the system has to be defined. This is more than planning sustainment engineering; it includes upgrading an operational system with corrected functionality: corrective maintenance ('patching').

From the customer's point of view there is a similar problem for 'maintenance'. The system's environment will change and/or the customer will require adapted functionality: modificative maintenance. It is unlikely that any system remains unadapted for longer than two years. In fact there is a seeming contradiction: the more successful a system is (the more it is used, and the longer it lives), the more maintenance it needs. So a similar strategy as for testing has to be applied: how to upgrade the system for new customer requirements. This brings a series of issues:

how to appreciate new requirements (potentially conflicting; e.g. by developing a vision on the system's future);
how to handle change requests (priority, taking development effort into account; e.g. by a change review board and a roadmap);
how to provide a smooth upgrade (e.g. by including conversion & data migration).

Design for Usability

Actually a Requirements issue. Users will apply this system for their own purposes. A user in this context is a user of the system: not necessarily the customer, but possibly an employee or a customer of your customer, so consider that your customer may not be the system's user. Users will also make mistakes. The system should provide facilities that allow users to test their applications. It typically includes:

user error detection and to-the-point error messages (it wouldn't be the first time that the system is blamed for user errors);
special user test facilities: allow users to test their application gradually (i.e. set loop tests, application tracing, etc);
monitor external connections;
in particular for remote connections: provide an option to make a transparent connection to a test site.

Error Messaging

During a system's operational life, errors will occur, caused by programming errors, incorrect commands, incorrect input data or unforeseen conditions. The problem is that the location of error detection and the location of the cause can be far apart. One should discern:

the cause of the error (and its severity);
the phenomena it produces; and
which of those phenomena is detected (what is flagged as the error/inconsistency).

An error message should be clear on:

what is detected (e.g. a syntax error in a file name, or file does not exist, or can not open file);
the severity of the message (success/informational, warning, error, severe/fatal error);
which functional module/routine has detected the error. A stack trace is also very valuable.
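An error message carrying these three items could be modelled as in the Python sketch below. The names `Severity`, `ErrorMessage` and `open_config`, as well as the module name `config.open_config`, are hypothetical; the stack trace comes from the standard `traceback` module.

```python
import traceback
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    INFO = 0
    WARNING = 1
    ERROR = 2
    FATAL = 3


@dataclass
class ErrorMessage:
    severity: Severity  # how serious the message is
    module: str         # which module/routine detected the error
    detected: str       # what exactly was detected
    stack: str = ""     # optional stack trace

    def format(self):
        return f"{self.severity.name} [{self.module}] {self.detected}"


def open_config(path):
    """Return (text, None) on success or (None, ErrorMessage) on failure."""
    try:
        with open(path) as handle:
            return handle.read(), None
    except FileNotFoundError:
        return None, ErrorMessage(
            Severity.ERROR,
            "config.open_config",
            f"file does not exist: {path}",
            traceback.format_exc(),
        )
```

The message states what was detected (the file does not exist), not a guess at the cause, which may lie far from the point of detection.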

=O=