3 Management Functional Areas

3.2 FM
Fault Management

Overview

  1. Standard
  2. Alarm surveillance
  3. Fault localization
  4. Fault correction
  5. Testing
  6. Trouble Administration
  7. Alarm Handling

Standard

[M.3400]:  Fault (or maintenance) Management (FM) is a set of functions which enables the detection, isolation and correction of abnormal operations of the telecommunication network and its environment.  It provides facilities for the performance of the maintenance phases from recommendation M.20.

{Though Fault Management is commonly the first thing an operator demands when starting Network Management, it is not the first thing to implement.  For proper Fault Management one must know what equipment is in the network, i.e. basic Configuration Management (Inventory).}


Alarm surveillance

The interpretation of event reports or collected data to see whether there are alarm conditions in the network.  Such data are not simply 'on' or 'off';  they can be more subtle than that, like slowly deteriorating conditions (e.g. more than 'normal' bit errors in transmission).  Note also that alarm conditions may be raised by other management areas, typically Threshold Crossing Alarms (TCA) in Performance Management.  Most current FM applications perform only Alarm surveillance.
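
As a minimal sketch (in Python, with invented counter names and threshold values), such surveillance boils down to comparing collected data against a threshold and raising a Threshold Crossing Alarm:

from dataclasses import dataclass
from typing import Optional

@dataclass
class Alarm:
    resource: str       # the monitored entity, e.g. a transmission section
    condition: str      # e.g. "TCA: excessive errored seconds"
    severity: str       # e.g. "warning", "minor", "major"

def check_errored_seconds(resource: str, errored_seconds: int,
                          threshold: int) -> Optional[Alarm]:
    """Raise a Threshold Crossing Alarm when the counter exceeds its threshold."""
    if errored_seconds > threshold:
        return Alarm(resource,
                     f"TCA: {errored_seconds} errored seconds (threshold {threshold})",
                     "warning")
    return None   # no alarm condition

# A slowly deteriorating line eventually crosses the threshold.
print(check_errored_seconds("section A-B", errored_seconds=42, threshold=30))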

There are basically three purposes (potential actions) for Fault Reports:

  1. to take corrective actions towards the cause (i.e. repair:  mainly EML, WorkOrder);
  2. to take countermeasures regarding the consequences (work-arounds, rerouting, activating a backup, service restoration by recreation, etc.;  mainly NML);  and
  3. for information only (mainly SML:  inform customers).

Note that commonly all three purposes apply in parallel, but that they are not independent;  one may consider them as three aspects of the same problem.
Note also that the information content differs with the purpose:  for repair actions (1) one needs to know the failing component;  for countermeasures (2) one usually needs to know the effect of the problem;  and a customer (3) may want an estimate of how long it will take before his service resumes.
However, it is quite possible that you detect a problem (and are aware of the consequences, so you can take countermeasures and inform the customer) without yet knowing the exact cause (which makes repair impossible for now), or that the cause is known but repair will take time (e.g. it requires an engineer to travel to some remote site), or that the repair lies outside the jurisdiction of this network management (e.g. a switched service fails due to a transmission problem;  the transmission network management should initiate the repair).  In other words, the effective response to the problem depends on the speed of repair (1), the nature of the consequences (2, are countermeasures possible) and the status of the service contract (3, whether it still allows some slack).  So it makes sense to take countermeasures (2) and inform the customer (3) together, and to handle repair (1) separately if it cannot be done sufficiently fast.

Examples:  when the problem occurs during the night and the customer has only a 'working hours' contract, all actions can be delayed.  When you can activate countermeasures for this problem faster than the notification interval, there is no obligation to inform the customer (and repair can be done in due time).  When you can correct the problem sufficiently fast (e.g. a software restart/reload), there is no need to inform the customer or to take countermeasures.  If the customer has a 'cheap' contract, you may skip countermeasures and do the repair during normal working hours, but you may have to inform the customer.  If it concerns a multi-domain (international) service and the network in a foreign domain fails, you may have to inform the customer and cooperate with other providers to establish a work-around, but you take no repair actions yourself.
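
The examples above can be summarised in a small decision sketch;  the parameters (contract type, speed of workaround and repair) are simplifications invented for illustration:

def plan_response(office_hours_contract: bool, during_office_hours: bool,
                  fast_workaround: bool, fast_repair: bool) -> set:
    """Return the set of actions to take for a detected fault (simplified)."""
    if office_hours_contract and not during_office_hours:
        return {"defer all actions until working hours"}
    if fast_repair:
        # Corrected before the notification interval expires: nothing else needed.
        return {"repair immediately"}
    actions = {"schedule repair"}
    if fast_workaround:
        actions.add("activate countermeasures")    # no obligation to inform the customer
    else:
        actions.add("activate countermeasures if possible")
        actions.add("inform the customer")
    return actions

print(plan_response(office_hours_contract=False, during_office_hours=True,
                    fast_workaround=False, fast_repair=False))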

This effectively results in two levels of response:

  1. an immediate response:  take countermeasures (2) and inform the customer (3);  and
  2. a deferred response:  the repair (1), handled separately when it cannot be done sufficiently fast.

Intelligent handling of fault reports is extremely important to avoid overloading the OSS and/or the operators.  It is, however, quite complex.  For countermeasures and repair one needs the root cause (and only that);  to inform customers one needs the list of all impacted users.
One needs different fault detection mechanisms for these purposes, or it must be possible to map a fault to the impacted services/users (e.g. using CM facilities).  The list of impacted services/users should not be generated unsolicited (i.e. as a series of Event Reports), but presented on request.  Otherwise Network Management will be flooded by large numbers of Event Reports when a heavily used service fails (e.g. a fiber is cut).
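
A minimal sketch of such an on-request impact query, assuming a hypothetical CM inventory that maps resources to services and services to customers:

# Hypothetical CM data:  which services a resource carries, and whose they are.
CM_INVENTORY = {
    "fiber-17": ["leased-line-A", "vpn-B"],
}
SERVICE_CUSTOMERS = {
    "leased-line-A": ["customer-1"],
    "vpn-B": ["customer-2", "customer-3"],
}

def impacted_customers(failed_resource: str) -> list:
    """Answer 'who is impacted?' only when asked, instead of emitting one
    Event Report per affected service when e.g. a fiber is cut."""
    customers = []
    for service in CM_INVENTORY.get(failed_resource, []):
        customers.extend(SERVICE_CUSTOMERS.get(service, []))
    return customers

print(impacted_customers("fiber-17"))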


Fault localization

First, one needs to distinguish between cause and effect:

  1. the cause itself (i.e. the problem {failing component, error, bug};  strictly speaking, the 'root cause' may be something else, like a typing mistake in a software program or a glitch in the power supply);  and
  2. the consequential effect as displayed through various phenomena.

Commonly one of the phenomena is observed, and Fault localization should determine the cause.  In many cases this is not simple at all.  'Fault Correlation' might help:  aggregate multiple fault reports in a particular area to conclude that they are all caused by a single failure;  only this 'cause' should then be reported.  Otherwise one needs diagnostic tests.
A typical example is the failure of a transmission cable;  a cable will never report a fault itself, but the receiving nodes will report a 'loss of signal' alarm.  The OSS may correlate all these alarms and conclude that a cable failed, and maybe even which section.
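
A sketch of this kind of correlation, assuming a hypothetical topology table that records which cable carries which node pair:

from collections import defaultdict

# Hypothetical topology:  which cable carries which pair of nodes.
TOPOLOGY = {("node-A", "node-B"): "cable-7"}

def correlate_los(alarms) -> list:
    """alarms:  list of (reporting node, far end it lost the signal from).
    Report one suspected cable failure instead of the individual LOS alarms."""
    hits = defaultdict(set)
    for node, far_end in alarms:
        cable = TOPOLOGY.get((node, far_end)) or TOPOLOGY.get((far_end, node))
        if cable:
            hits[cable].add(node)
    # A cable with LOS reported from both ends is the likely root cause.
    return [f"suspected failure of {cable}"
            for cable, nodes in hits.items() if len(nodes) >= 2]

print(correlate_los([("node-A", "node-B"), ("node-B", "node-A")]))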


Fault correction

Fault correction comprises:

Preferably the service remains operational;  it may be degraded, but ultimately after repair the basic service should be restored to full operational level (automatically).


Testing

In an operational environment testing typically includes:

Confidence test:
a quick check whether a function or component seems to operate correctly (it does not prove full functionality;  the function may remain operational during the test);
Thorough test:
a full function test (exhaustive/in-depth;  the function may be temporarily disabled);  and
Diagnostic test:
typically used by operational/maintenance people to locate a malfunction and identify the failing component (e.g. a board or cable) or the condition causing it to fail.

Such tests can be activated in one of the following ways (test types and activation modes are combined in the sketch after this list):

Trigger (automatic)
on an event such as start-up, a confidence test, or a failure report (e.g. by a thorough test);
Schedule (commonly regular:  routine test)
for preventive maintenance (confidence and/or thorough tests);  or
Manual
typically a diagnostic test run by maintenance people for corrective maintenance.
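
The test types and activation modes above could be modelled as follows;  the class and field names are invented for illustration:

from dataclasses import dataclass
from enum import Enum, auto

class TestType(Enum):
    CONFIDENCE = auto()   # quick check, function stays in service
    THOROUGH = auto()     # exhaustive check, function may be taken out of service
    DIAGNOSTIC = auto()   # locate the failing board/cable or the failure condition

class Activation(Enum):
    TRIGGER = auto()      # automatic, e.g. at start-up or on a failure report
    SCHEDULE = auto()     # routine test for preventive maintenance
    MANUAL = auto()       # requested by maintenance people

@dataclass
class TestRequest:
    target: str           # component or function under test
    test_type: TestType
    activation: Activation

# Example:  a scheduled routine confidence test on a line card.
print(TestRequest("line-card-3", TestType.CONFIDENCE, Activation.SCHEDULE))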

Trouble Administration

Informing customers about problems with the service, or receiving complaints from customers.  It is elaborated in the Systems Management Function X.790 (Trouble Management Function).
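
A minimal sketch of a trouble report record;  the field names are illustrative and not the actual X.790 attributes:

from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class TroubleReport:
    ticket_id: str
    customer: str
    affected_service: str
    description: str
    state: str = "open"        # e.g. open -> in progress -> cleared -> closed
    opened: datetime = field(default_factory=datetime.now)

print(TroubleReport("TR-0001", "customer-1", "leased-line-A",
                    "no traffic since 09:30"))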

For local problem resolution and the relationship with Trouble Administration and Testing, see Fault Cycle.

See also (System Management Function) Alarm Reporting X.733.


Alarm Handling

For some specific areas and types of alarms, alarms can be handled fully automatically;  in practice, however, alarms are presented to human operators so they can take the appropriate action to resolve the problem.  The software supporting this is called an Alarm Handler, and most of them provide additional functionality to ease the operator's task, such as the features described below.

Commonly, an alarm has to be 'accepted' by an operator (i.e. he acknowledges that he has noticed the alarm and that he takes responsibility for resolving the problem).  Alarms which are not accepted within a certain interval should themselves raise an alarm at a supervisory position.
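
A sketch of acceptance with escalation of unaccepted alarms;  the interval and identifiers are invented for the example:

import time

class AlarmDesk:
    def __init__(self, accept_deadline_s: float = 300.0):
        self.accept_deadline_s = accept_deadline_s
        self.pending = {}                    # alarm id -> time raised

    def raise_alarm(self, alarm_id: str) -> None:
        self.pending[alarm_id] = time.time()

    def accept(self, alarm_id: str, operator: str) -> None:
        # The operator acknowledges the alarm and takes responsibility for it.
        self.pending.pop(alarm_id, None)

    def escalate_overdue(self) -> list:
        # Alarms not accepted in time become an alarm at the supervisory position.
        now = time.time()
        return [a for a, raised in self.pending.items()
                if now - raised > self.accept_deadline_s]

desk = AlarmDesk(accept_deadline_s=300.0)
desk.raise_alarm("LOS node-A")
desk.accept("LOS node-A", operator="op-1")
print(desk.escalate_overdue())               # [] because the alarm was accepted in time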

Some handlers allow you to attach a script to a specific alarm (object type and severity), which is then run automatically when the alarm occurs.
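
A sketch of such a script registry, keyed on (object type, severity) as described above;  the example script and names are hypothetical:

from typing import Callable

SCRIPTS = {}                                 # (object type, severity) -> script

def attach(object_type: str, severity: str, script: Callable) -> None:
    SCRIPTS[(object_type, severity)] = script

def on_alarm(object_type: str, severity: str, resource: str) -> None:
    script = SCRIPTS.get((object_type, severity))
    if script:
        script(resource)                     # run the attached script automatically

attach("softwareProcess", "critical", lambda res: print(f"restarting {res}"))
on_alarm("softwareProcess", "critical", "billing-daemon")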

A feature that is very desirable but almost always absent is alarm correlation.  For example, instead of an alarm from node A that it lost the signal from node B, and a complementary alarm from node B that it lost the signal from node A, the correlated alarm that the connection A-B failed should be presented.  It reduces the number of alarms and gives a better indication of what has failed.
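
A sketch of merging two such complementary 'lost signal' alarms into a single connection alarm;  node names and alarm formats are illustrative:

def correlate_pairs(alarms) -> list:
    """alarms:  list of (reporting node, far end it lost the signal from)."""
    seen = set(alarms)
    merged, reported = [], set()
    for a, b in alarms:
        link = frozenset((a, b))
        if link in reported:
            continue
        if (b, a) in seen:                   # both ends complain about each other
            merged.append(f"connection {a}-{b} failed")
            reported.add(link)
        else:
            merged.append(f"{a}: lost signal from {b}")   # nothing to correlate
    return merged

print(correlate_pairs([("A", "B"), ("B", "A")]))          # ['connection A-B failed']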

