
Redundant Systems

Basic Characteristics

Redundant systems are designed to keep running even when a major failure occurs, by including extra components in the system, that is, components which would otherwise be superfluous (functionally redundant) were it not for fault tolerance.  There are several ways to achieve this, providing fault tolerance in varying degrees.  Major characteristics of redundant configurations are:


Replacement Strategies

Methods to replace a failing unit or function are:

Note that


Coupling

The main ways to couple functionally identical units –the processing relationship– are:

You also have to take the complexity of (management) software into account:  through redundancy the hardware may have become more reliable, but –due to the complexity– the software may have become less reliable (it will take longer to find all bugs before software reliability gets better than hardware reliability).  The (software) complexity increases with the coupling:  development/acquisition costs will be higher for tightly coupled components than for loosely coupled components.

From the point of view of a computer, there are only 3 important numbers:  0 (no other system), 1 (a single other system, no need to decide which one), and n (there are multiple other systems).  From the architectural point of view this means 1 unit, 2 units, or n units.


Common Configurations

Popular configurations with respect to redundancy are:

Note that one can apply a mix of the above configurations.  It is also possible to apply distinct replacement strategies at various levels.  Typically, in a load-sharing strategy, one can have an additional (hot) spare (dimension 2n+1), and even a (cold) spare for various other units (dimension n+1).  However, limit yourself to one or two redundancy mechanisms to restrict software complexity.
Example:  the space shuttle contained 5 identical computers:  4 ran identical software and were operated in a majority-vote configuration;  the fifth ran independently developed software which could take over when a common software fault disabled the primary set (software fall-back).
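As an illustration of majority voting, here is a minimal sketch in Python (not the shuttle's actual implementation;  the replica outputs are made up):

    from collections import Counter

    def majority_vote(results):
        """Return the answer produced by a majority of the replicas.

        results: outputs from functionally identical units.
        Raises RuntimeError when no majority exists.
        """
        answer, count = Counter(results).most_common(1)[0]
        if count <= len(results) // 2:
            raise RuntimeError("no majority: multiple units failed or diverged")
        return answer

    # Three replicas compute the same function;  one delivers a faulty result.
    print(majority_vote([42, 42, 41]))   # -> 42, the faulty unit is outvoted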

Also note that master/slave or front-end/back-end configurations are not redundant configurations (on the contrary:  the units perform different functions, so each of them is a single point of failure for its own function).


Overview

Replacement \ Coupling | Uncoupled                   | Loosely                              | Tightly
None                   | Cold stand-by               |                                      |
Hard swap              | Cold stand-by               |                                      |
Stand-by               | Cold stand-by               | Warm stand-by, Buddies               | Hot stand-by, Buddies, Majority voting
Redistribution         | Cold stand-by, Load sharing | Warm stand-by, Buddies, Load sharing | Hot stand-by, Buddies
Degrade                | Fall-back                   | Fall-back                            |

Performance

The cost/performance ratio for the various configurations can only be indicated in a relative way, as the number of units and the cost per unit are unknown.  So it is more or less an indicator of efficiency:  the effective processing power of the configuration, expressed in the power of a single unit, divided by (the costs for) the total number of units in the configuration.  The costs for other modules (e.g. management) are ignored, and so is software development (which may well be the major cost factor).
Configuration   | Efficiency | Redundancy
Fall-back       | 1 / 1      | minor
Cold standby    | N / (N+1)  | reasonable
Hot standby     | N / 2N     | good
Majority voting | N / 3N     | extremely good
Load sharing    | N / (N+1)  | very good
Buddies         | 2N / 2N    | very good

The performance efficiency is not simply inversely proportional to the redundancy.
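For concreteness, the efficiency column can be evaluated for a given cluster size;  a small Python sketch for N = 2 working units (costs are assumed proportional to the unit count, all other costs ignored as above):

    N = 2
    configs = {
        "Fall-back":       (1, 1),         # one unit, degraded service on failure
        "Cold standby":    (N, N + 1),     # one idle spare for N working units
        "Hot standby":     (N, 2 * N),     # every unit mirrored
        "Majority voting": (N, 3 * N),     # every unit triplicated
        "Load sharing":    (N, N + 1),     # the spare shares the load
        "Buddies":         (2 * N, 2 * N), # all units productive, mutual backup
    }
    for name, (power, units) in configs.items():
        print(f"{name:16s} efficiency = {power / units:.2f}")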


Load Distribution & Context

The appropriate configuration depends heavily on the characteristics of the service the system is supposed to deliver (and of course on the required reliability).  When a service request is context-free (i.e. any unit may handle the request), load can be distributed in a straightforward way (e.g. an n+1 cluster configuration).  Requests are preferably distributed in a cyclic way;  when a previously failed request is re-issued, it will automatically be assigned to some other unit.  The failing unit has to be flagged for test/repair.
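A minimal sketch of such cyclic distribution in Python (the Unit class and its handle() call are hypothetical stand-ins for real units):

    class Unit:
        """A hypothetical processing unit;  fails when 'up' is False."""
        def __init__(self, name, up=True):
            self.name, self.up = name, up
        def handle(self, request):
            if not self.up:
                raise ConnectionError(self.name + " is down")
            return self.name + " handled " + request

    class Dispatcher:
        """Cyclic (round-robin) distribution of context-free requests."""
        def __init__(self, units):
            self.units = list(units)    # any unit may handle any request
            self.next = 0
        def dispatch(self, request):
            while self.units:
                unit = self.units[self.next % len(self.units)]
                self.next += 1
                try:
                    return unit.handle(request)
                except ConnectionError:
                    self.units.remove(unit)   # flag failing unit for test/repair
            raise RuntimeError("all units failed")

    d = Dispatcher([Unit("a"), Unit("b", up=False), Unit("c")])
    print(d.dispatch("req-1"))   # "a handled req-1"
    print(d.dispatch("req-2"))   # "b" fails and is removed;  "a handled req-2"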

However, when there is a context (e.g. relevant data is not available everywhere), the picture is more complex.  In the extreme case where the context is divided into n subcontexts (e.g. transactions on a partitioned database), a request can only be handled by a single unit (1 out of n).  When an immediate backup must be available (hot standby), this leads to a 2n configuration.
When requests have a context, you may split the context into many small subcontexts, with multiple subcontexts per unit and each subcontext on multiple units.  This allows a request to be handled by m units (say 2 or 3) out of n.  However, transaction assignment will be elaborate, and distributing the subcontexts to ensure a good dynamic load balance is not simple.
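A sketch of this m-out-of-n placement, assuming a simple hash-based assignment of subcontexts to units (names hypothetical;  a real system needs smarter placement for load balance):

    def replicas(subcontext, units, m=2):
        """Return the m (out of n) units holding this subcontext."""
        n = len(units)
        first = hash(subcontext) % n
        return [units[(first + i) % n] for i in range(m)]

    units = ["unit0", "unit1", "unit2", "unit3"]
    for sub in range(8):                 # many small subcontexts, m = 2 of n = 4
        print("subcontext", sub, "->", replicas(sub, units))

A request carrying context key k may then be dispatched to any live unit in replicas(k, units).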
See Centralisation versus Distribution below.


System Management

A redundant system does require some kind of management ('system defense', Fault Management) over all components.  The general strategy to survive is:
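Whatever the exact steps, such fault management typically amounts to a monitoring loop:  detect a failure, isolate it, fail over, and reintegrate after repair.  A hedged sketch in Python (the probe/failover/reintegrate callbacks are hypothetical):

    import time

    def fault_manager(units, probe, failover, reintegrate, interval=1.0):
        """Generic fault-management loop over all components.

        probe(unit)       -> True when the unit responds (heartbeat)
        failover(unit)    -> move the unit's tasks to a redundant peer
        reintegrate(unit) -> bring a repaired unit back into service
        """
        failed = set()
        while True:
            for unit in units:
                healthy = probe(unit)
                if not healthy and unit not in failed:
                    failed.add(unit)
                    failover(unit)        # isolate the fault, keep the service up
                elif healthy and unit in failed:
                    failed.remove(unit)
                    reintegrate(unit)     # repaired unit rejoins the pool
            time.sleep(interval)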


Pitfalls

The main issue for a redundant system is a so-called Single Point of Failure (SPoF):  some component or service which is not redundant, so that when it fails the whole system fails.  Often it is a minor component or service which has been overlooked as being critical.  In particular low-level infrastructure (utilities:  power, cooling, data bus, …) is prone to such mistakes.  Be sure to investigate the redundancy of all services/facilities required for the system (or accept the risk if it is too expensive to solve).
In organisations, a single specialist can be the vital vulnerability (he may quit, get ill, …).

If you are relying on physical separation between two parts of a redundant system to avoid common causes of failure, be sure that there is a considerable geographic distance between the two sites.  Otherwise both sites share common risks like power failures, flooding, storms, riots, etc.  And don't think that you won't get any flooding on a top floor;  a leaky roof or a burst water pipe is sufficient.  See also Risk Management.


Considerations

Redundancy doesn't come cheap;  it requires extra hardware and much more complex software.  So if you don't really need it, avoid it.

Therefore the first question concerns the availability requirements for the system in the client's application:  what are the consequences of failure ?  Usually there are strict availability requirements for some essential functions only, so redundancy is needed for just a small number of vital components.  Is redundancy useful in this system, or are there more vulnerable common parts outside the system (typically power, environmental) ?  It is extremely difficult to avoid all single points of failure !  There is a lot of reliability to be gained by other methods than replication.

The next question is how to achieve sufficient availability in your design.  What is the estimated availability of a non-redundant solution (system availability should be part of any design and carefully controlled;  careless design modifications may have significant impact) ?
Perform a simple system availability calculation.  Carefully assess component failure rates (estimates) and failure interdependencies, and calculate the availability.  Apply limits to (sub)system availability.  What if the component with the poorest availability figures is significantly improved ?
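Such a back-of-the-envelope calculation uses the standard series/parallel formulas;  a small Python sketch (the component figures are made-up examples):

    def series(*avail):
        """Components that are all needed:  availabilities multiply."""
        a = 1.0
        for x in avail:
            a *= x
        return a

    def parallel(*avail):
        """Redundant components:  the system fails only when all fail."""
        unavail = 1.0
        for x in avail:
            unavail *= (1.0 - x)
        return 1.0 - unavail

    power, server = 0.999, 0.99                      # made-up availabilities
    print(series(power, server))                     # single server:  ~0.989
    print(series(power, parallel(server, server)))   # duplicated pair:  ~0.9989

Note how the non-redundant power feed now dominates the result:  duplicating the servers helps little beyond the availability of the supply, exactly the single-point-of-failure pitfall discussed above.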
Note that

Are redundant units sufficiently separated (physical/geographical diversity of power, cables, equipment) ?  Duplicating a system at a single site won't provide protection against flooding or fire destroying both copies.

Software which has not been in use for at least 2 years has a worse error rate than hardware, so do not introduce hardware redundancy at the expense of software complexity if you are in a hurry.  The trade-off is loss-of-business risk versus acquisition & operational costs (hardware, software development).  Assuming that the system is well designed and built, robustness against a single point of failure will substantially improve reliability.

Why redundancy fails (likely order):


Centralisation versus Distribution

Redundancy is implicitly also a distribution issue (there is a striking parallel with Centralised versus Distributed organisations).  This section compares distributed systems to a centralised one on some general characteristics/aspects:
Aspect                | Central | Distributed
Acquisition costs (1) | normal  | more expensive
Acquisition time      | normal  | longer due to complexity
Operational costs (2) | normal  | (slightly) more expensive
Performance (3)       | average | usually less
Utilisation (4)       | normal  | less
Availability (5)      | average | very good
Scalability           | limited | much better
Notes:

  1. More hardware is required, but individual components (development or buy-in) should be cheaper.  Software –in particular system management– will be more complex and therefore likely more expensive and initially error-prone.
  2. Operations & maintenance is probably more expensive (more hardware, potentially geographically dispersed;  however, due to redundancy, maintenance/replacement can be delayed).
    Data management may be more complex (concurrent updates, consistency).
  3. The price/performance ratio is in general less favourable for a distributed solution.  Grosch's law states that a performance increase by a factor n can be achieved at √n times the cost (e.g. 4× the performance at roughly 2× the cost).  This holds for most systems in the mid performance range, but not for high-end and low-end systems (e.g. those based on high-volume & cheap microprocessors).
    Overhead in a distributed solution may seem higher than in a centralised system, but that is theoretical;  in practice it is often the reverse (in particular with organisations).
  4. Utilisation will be less in a distributed solution:  idle resources (e.g. processing power) cannot be redistributed.
    In a centralised solution, specialisation (of equipment or functions) may pay off;  in a distributed solution such specialised resources would be underutilised (i.e. not cost-effective).
  5. Overall system availability in a distributed system should be much better (given a proper design), but the equipment failure rate will increase (more components).  Initial software reliability will be poor.

Note that the above list is generally true, but your specific case may differ on some points.

Note that we talk about 'replication':  multiple identical units.  'Duplication' is the common case, but replication with a high number of small units (e.g. a load-sharing cluster) may be more effective.

Note that there are perfectly good reasons for (physically) distributing a system apart from redundancy;  usually it is (total) costs.  Example:  a city does not have a single huge telephone exchange but multiple medium-sized exchanges:  this reduces subscriber line length (which represents a major cost).  Scalability can be a reason as well.

See also Central versus Distributed organisations/architectures.

A good alternative to a redundant system is often a 'robust' system:  a system which can survive adverse conditions but is (essentially) not redundant.  Use a reliable platform and spend the effort to make the software more robust (see Design for Survivability).


=O=