Fault management


In network management, fault management is the set of functions that detect, isolate, and correct malfunctions in a telecommunications network and compensate for environmental changes. These functions include maintaining and examining error logs, accepting and acting on error detection notifications, tracing and identifying faults, carrying out sequences of diagnostic tests, correcting faults, reporting error conditions, and localizing faults by examining and manipulating database information.

When a fault or event occurs, a network component will often send a notification to the network operator using a protocol such as SNMP. An alarm is a persistent indication of a fault that clears only when the triggering condition has been resolved. The problems currently occurring on a network component are often kept in an active alarm list, such as the one defined in RFC 3877, the Alarm MIB. Most network management systems also maintain a list of cleared faults.
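The raise-and-clear lifecycle described above can be sketched in a few lines of Python. This is a minimal illustration, not the data model of RFC 3877: the class `AlarmTable`, its method names, and the choice of `(source, condition)` as the alarm key are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class AlarmTable:
    """Tracks active alarms and keeps a history of cleared ones (hypothetical)."""
    active: dict = field(default_factory=dict)   # (source, condition) -> detail text
    cleared: list = field(default_factory=list)  # history of resolved alarms

    def raise_alarm(self, source, condition, detail=""):
        # A repeated notification for the same condition updates the
        # existing entry instead of creating a duplicate alarm.
        self.active[(source, condition)] = detail

    def clear_alarm(self, source, condition):
        # An alarm leaves the active list only when its condition is resolved.
        key = (source, condition)
        if key in self.active:
            self.cleared.append((key, self.active.pop(key)))


table = AlarmTable()
table.raise_alarm("router-1", "linkDown", "interface 3")
table.raise_alarm("router-1", "linkDown", "interface 3")  # duplicate notification
print(len(table.active))   # 1
table.clear_alarm("router-1", "linkDown")
print(len(table.active), len(table.cleared))  # 0 1
```

Keying the table on the source and condition is what makes the alarm persistent: repeated notifications collapse into one entry, and only an explicit clear moves it to the history.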

Fault management systems may use complex filtering systems to assign alarms to severity levels. These can range from debug to emergency, as in the syslog protocol (RFC 3164). Alternatively, they can use the perceived severity field of the ITU X.733 Alarm Reporting Function, which takes one of the values cleared, indeterminate, critical, major, minor, or warning. Note that the latest version of the syslog protocol draft under development within the IETF includes a mapping between these two sets of severities. It is considered good practice to send a notification not only when a problem has occurred but also when it has been resolved; the latter notification carries a severity of cleared.
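As a rough sketch of how the two severity scales might be related in software, the Python below lists the eight syslog severity levels and translates X.733 perceived severity names into them. The syslog names and numbers are the standard ones; the mapping table `X733_TO_SYSLOG` and the function `to_syslog` are illustrative assumptions and should be checked against the relevant specification before being relied on.

```python
from enum import IntEnum


class SyslogSeverity(IntEnum):
    """Syslog severity levels; lower numbers are more severe."""
    EMERGENCY = 0
    ALERT = 1
    CRITICAL = 2
    ERROR = 3
    WARNING = 4
    NOTICE = 5
    INFORMATIONAL = 6
    DEBUG = 7


# Illustrative mapping only (an assumption, not quoted from a specification).
X733_TO_SYSLOG = {
    "critical": SyslogSeverity.ALERT,
    "major": SyslogSeverity.CRITICAL,
    "minor": SyslogSeverity.ERROR,
    "warning": SyslogSeverity.WARNING,
    "indeterminate": SyslogSeverity.NOTICE,
    "cleared": SyslogSeverity.NOTICE,
}


def to_syslog(perceived_severity: str) -> SyslogSeverity:
    """Translate an X.733 perceived severity name into a syslog severity."""
    return X733_TO_SYSLOG[perceived_severity.lower()]


print(to_syslog("major").name)    # CRITICAL
print(to_syslog("cleared").name)  # NOTICE
```

Note that the two scales do not line up name for name: in this sketch an X.733 "critical" alarm maps to syslog "alert", which is why an explicit table is safer than matching on the words.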

A fault management console allows a network administrator or system operator to monitor events from multiple systems and act on that information. Ideally, a fault management system should correctly identify events and respond automatically, either by launching a program or script that takes corrective action or by activating notification software that lets a human intervene (e.g. by sending an e-mail or an SMS text message to a mobile phone). Some notification systems also have escalation rules that notify a chain of individuals based on their availability and the severity of the alarm.
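An escalation rule of the kind mentioned above could look roughly like the following Python sketch. Everything here is hypothetical: the `Contact` record, the `SEVERITY_RANK` ordering, and the rule that each person has a minimum severity below which they are not notified. Real notification systems typically add acknowledgement timeouts and schedules on top of this.

```python
from dataclasses import dataclass


@dataclass
class Contact:
    name: str
    available: bool
    min_severity: int  # lowest severity rank this person is notified for


# Hypothetical ranking: a higher number means a more severe alarm.
SEVERITY_RANK = {"warning": 1, "minor": 2, "major": 3, "critical": 4}


def pick_recipients(chain, severity):
    """Return the names of contacts to notify, in escalation order."""
    rank = SEVERITY_RANK[severity]
    return [c.name for c in chain if c.available and rank >= c.min_severity]


chain = [
    Contact("on-call engineer", available=True, min_severity=1),
    Contact("team lead", available=False, min_severity=3),
    Contact("operations manager", available=True, min_severity=4),
]
print(pick_recipients(chain, "major"))     # ['on-call engineer']
print(pick_recipients(chain, "critical"))  # ['on-call engineer', 'operations manager']
```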

There are two primary ways to perform fault management: active and passive. Passive fault management collects alarms from devices (normally via SNMP) when something happens in them. In this mode, the fault management system learns of a problem only if the monitored device is smart enough to raise an error and report it to the management tool. If the device fails completely or locks up, it cannot raise an alarm and the problem goes undetected. Active fault management addresses this issue by actively monitoring devices with tools such as ping to determine whether each device is active and responding. If a device stops responding, active monitoring raises an alarm showing the device as unavailable, which allows the problem to be corrected proactively.
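The active approach can be illustrated with a small polling loop in Python. Because sending ICMP echo requests usually needs elevated privileges, this sketch swaps in a TCP connection attempt as the reachability check; that substitution, the default port 22, and the function names `tcp_probe` and `poll` are assumptions for the example, not part of any particular product.

```python
import socket


def tcp_probe(host, port=22, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def poll(devices, probe=tcp_probe):
    """Probe every device and return the ones that did not respond."""
    return [d for d in devices if not probe(d)]


# Usage with a stub probe, so the example runs without touching the network.
def pretend_switch2_is_down(host):
    return host != "switch-2"


print(poll(["switch-1", "switch-2", "router-1"], probe=pretend_switch2_is_down))
# ['switch-2']
```

A monitoring system would run `poll` on a timer and raise an "unavailable" alarm for each returned device. Passing the probe in as a parameter keeps the loop independent of how reachability is actually tested.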

References


* Federal Standard 1037C
* MIL-STD-188

Wikimedia Foundation. 2010.

