Fault-tolerant system


Fault-tolerant system
This article contains specific implementations of fault tolerant systems. For general theory, see fault-tolerant design.

Fault-tolerance or graceful degradation is the property that enables a system (often computer-based) to continue operating properly in the event of the failure of (or one or more faults within) some of its components. A newer approach is progressive enhancement. If its operating quality decreases at all, the decrease is proportional to the severity of the failure, as compared to a naïvely-designed system in which even a small failure can cause total breakdown. Fault-tolerance is particularly sought-after in high-availability or life-critical systems.

Fault-tolerance is not just a property of individual machines; it may also characterise the rules by which they interact. For example, the Transmission Control Protocol (TCP) is designed to allow reliable two-way communication in a packet-switched network, even in the presence of communications links which are imperfect or overloaded. It does this by requiring the endpoints of the communication to expect packet loss, duplication, reordering and corruption, so that these conditions do not damage data integrity, and only reduce throughput by a proportional amount.

An example of graceful degradation by design in an image with transparency. The top two images are each the result of viewing the composite image in a viewer that recognises transparency. The bottom two images are the result in a viewer with no support for transparency. Because the transparency mask (centre bottom) is discarded, only the overlay (centre top) remains; the image on the left has been designed to degrade gracefully, hence is still meaningful without its transparency information.

Data formats may also be designed to degrade gracefully. HTML for example, is designed to be forward compatible, allowing new HTML entities to be ignored by Web browsers which do not understand them without causing the document to be unusable.

Recovery from errors in fault-tolerant systems can be characterised as either roll-forward or roll-back. When the system detects that it has made an error, roll-forward recovery takes the system state at that time and corrects it, to be able to move forward. Roll-back recovery reverts the system state back to some earlier, correct version, for example using checkpointing, and moves forward from there. Roll-back recovery requires that the operations between the checkpoint and the detected erroneous state can be made idempotent. Some systems make use of both roll-forward and roll-back recovery for different errors or different parts of one error.

Within the scope of an individual system, fault-tolerance can be achieved by anticipating exceptional conditions and building the system to cope with them, and, in general, aiming for self-stabilization so that the system converges towards an error-free state. However, if the consequences of a system failure are catastrophic, or the cost of making it sufficiently reliable is very high, a better solution may be to use some form of duplication. In any case, if the consequence of a system failure is so catastrophic, the system must be able to use reversion to fall back to a safe mode. This is similar to roll-back recovery but can be a human action if humans are present in the loop.

Contents

Fault tolerance requirements

The basic characteristics of fault tolerance require:

  1. No single point of failure
  2. Fault isolation to the failing component
  3. Fault containment to prevent propagation of the failure
  4. Availability of reversion modes

In addition, fault tolerant systems are characterized in terms of both planned service outages and unplanned service outages. These are usually measured at the application level and not just at a hardware level. The figure of merit is called availability and is expressed as a percentage. For example, a five nines system would statistically provide 99.999% availability.

Fault-tolerant systems are typically based on the concept of redundancy.

Fault-tolerance by replication

Spare components addresses the first fundamental characteristic of fault-tolerance in three ways:

  • Replication: Providing multiple identical instances of the same system or subsystem, directing tasks or requests to all of them in parallel, and choosing the correct result on the basis of a quorum;
  • Redundancy: Providing multiple identical instances of the same system and switching to one of the remaining instances in case of a failure (failover);
  • Diversity: Providing multiple different implementations of the same specification, and using them like replicated systems to cope with errors in a specific implementation.

All implementations of RAID, redundant array of independent disks, except RAID 0 are examples of a fault-tolerant storage device that uses data redundancy.

A lockstep fault-tolerant machine uses replicated elements operating in parallel. At any time, all the replications of each element should be in the same state. The same inputs are provided to each replication, and the same outputs are expected. The outputs of the replications are compared using a voting circuit. A machine with two replications of each element is termed Dual Modular Redundant (DMR). The voting circuit can then only detect a mismatch and recovery relies on other methods. A machine with three replications of each element is termed Triple Modular Redundancy (TMR). The voting circuit can determine which replication is in error when a two-to-one vote is observed. In this case, the voting circuit canghghj output the correct result, and discard the erroneous version. After this, the internal state of the erroneous replication is assumed to be different from that of the other two, and the voting circuit can switch to a DMR mode. This model can be applied to any larger number of replications.

Lockstep fault tolerant machines are most easily made fully synchronous, with each gate of each replication making the same state transition on the same edge of the clock, and the clocks to the replications being exactly in phase. However, it is possible to build lockstep systems without this requirement.

Bringing the replications into synchrony requires making their internal stored states the same. They can be started from a fixed initial state, such as the reset state. Alternatively, the internal state of one replica can be copied to another replica.

One variant of DMR is pair-and-spare. Two replicated elements operate in lockstep as a pair, with a voting circuit that detects any mismatch between their operations and outputs a signal indicating that there is an error. Another pair operates exactly the same way. A final circuit selects the output of the pair that does not proclaim that it is in error. Pair-and-spare requires four replicas rather than the three of TMR, but has been used commercially.

No single point of repair

If a system experiences a failure, it must continue to operate without interruption during the repair process.

Fault isolation to the failing component

When a failure occurs, the system must be able to isolate the failure to the offending component. This requires the addition of dedicated failure detection mechanisms that exist only for the purpose of fault isolation.

Recovery from a fault condition requires classifying the fault or failing component. The National Institute of Standards and Technology (NIST) categorizes faults based on Locality, Cause, Duration and Effect.

Fault containment

Some failure mechanisms can cause a system to fail by propagating the failure to the rest of the system. An example of this kind of failure is the "Rogue transmitter" which can swamp legitimate communication in a system and cause overall system failure. Mechanisms that isolate a rogue transmitter or failing component to protect the system are required.

See also

Bibliography

External links


Wikimedia Foundation. 2010.

Look at other dictionaries:

  • fault-tolerant system — gedimams atspari sistema statusas T sritis automatika atitikmenys: angl. fail safe system; failure tolerant system; fault tolerant system vok. ausfallunempfindliches System rus. нечувствительная к отказам система, f pranc. système tolérant aux… …   Automatikos terminų žodynas

  • Fault-tolerant computer systems — are systems designed around the concepts of fault tolerance. In essence, they have to be able to keep working to a level of satisfaction in the presence of faults. Types of fault tolerance Most fault tolerant computer systems are designed to be… …   Wikipedia

  • Fault-tolerant design — In engineering, Fault tolerant design, also known as fail safe design, is a design that enables a system to continue operation, possibly at a reduced level (also known as graceful degradation), rather than failing completely, when some part of… …   Wikipedia

  • fault tolerant — UK US (also fault tolerant) adjective ► relating to a computer system that continues working even when there is something wrong with the hardware or software: » a fault tolerant processing system fault tolerance noun [U] …   Financial and business terms

  • fault-tolerant — adjective Date: 1975 relating to or being a computer or program with a self contained backup system that allows continued operation when major components fail • fault tolerance noun …   New Collegiate Dictionary

  • failure-tolerant system — gedimams atspari sistema statusas T sritis automatika atitikmenys: angl. fail safe system; failure tolerant system; fault tolerant system vok. ausfallunempfindliches System rus. нечувствительная к отказам система, f pranc. système tolérant aux… …   Automatikos terminų žodynas

  • Configurable Fault Tolerant Processor — The Configurable Fault Tolerant Processor (CFTP), developed by the Space Systems Academic Group at the Naval Postgraduate School, is an experimental payload on board the United States Naval Academy s (USNA) MidSTAR 1 satellite. Midstar 1 was… …   Wikipedia

  • System Fault Tolerance — (SFT) is a fault tolerant system built into Netware operating systems. It is a disk mirroring system similar to RAID 1. SFT level III is a server duplexing system where if a server fails a mirrored server is ready to take its place …   Wikipedia

  • fault tolerance — fault tolerant ˈfault ˌtolerant adjective COMPUTING fault tolerant computer/​machine a computer that continues working even if it has a fault or when there is a fault in a program fault tolerance noun [uncountable] * * * fault tolerant UK US… …   Financial and business terms

  • System Prevalence — is a simple software architectural pattern that combines system images (snapshots) and transaction journaling to provide speed, performance scalability, transparent persistence and transparent live mirroring of computer system state. In a… …   Wikipedia