Fault-tolerant design

In engineering, fault-tolerant design, also known as fail-safe design, is a design that enables a system to continue operating, possibly at a reduced level (known as graceful degradation), rather than failing completely when some part of the system fails. The term is most commonly used for computer-based systems designed to remain more or less fully operational, perhaps with reduced throughput or increased response time, in the event of a partial failure. That is, the system as a whole is not stopped by problems in either the hardware or the software. An example from another field is a motor vehicle designed to remain drivable if one of its tires is punctured. Similarly, a fault-tolerant structure retains its integrity in the presence of damage from causes such as fatigue, corrosion, manufacturing flaws, or impact.
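Graceful degradation can be illustrated with a minimal sketch. The `Worker` and `Pool` classes and the capacity figures below are hypothetical, chosen only to show a system that keeps operating at reduced throughput when one part fails.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Worker:
    """A hypothetical processing unit with a fixed capacity."""
    name: str
    capacity: int          # requests per second (illustrative unit)
    healthy: bool = True


@dataclass
class Pool:
    """A system made of several workers that share the load."""
    workers: List[Worker] = field(default_factory=list)

    def throughput(self) -> int:
        # Only healthy workers contribute capacity.
        return sum(w.capacity for w in self.workers if w.healthy)

    def is_operational(self) -> bool:
        # The system is up as long as at least some capacity remains.
        return self.throughput() > 0


pool = Pool([Worker("a", 100), Worker("b", 100), Worker("c", 100)])
full = pool.throughput()

pool.workers[0].healthy = False      # one part of the system fails
degraded = pool.throughput()

print(full, degraded, pool.is_operational())
```

After the failure the pool still reports itself as operational, but with less capacity than before, which is the reduced level of service described above.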


If each component can, in turn, continue to function when one of its subcomponents fails, the system as a whole can continue to operate as well. In a passenger vehicle, for example, "run-flat" tires each contain a solid rubber core that allows them to be used even when punctured. A punctured run-flat tire may be used for a limited time at reduced speed.


Redundancy means having backup components that automatically "kick in" should one component fail. For example, large cargo trucks can lose a tire without major consequences: they have so many tires that no single tire is critical (with the exception of the front tires, which are used to steer).
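In software, backup components that take over automatically are often arranged as a failover chain. The sketch below is one possible arrangement, not a standard API: `call_with_failover`, `primary` and `backup` are hypothetical names, and the simulated `ConnectionError` stands in for any component failure.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(components: Sequence[Callable[[], T]]) -> T:
    """Try each component in order and return the first successful result."""
    errors: List[Exception] = []
    for component in components:
        try:
            return component()
        except Exception as exc:   # a real system would catch narrower error types
            errors.append(exc)     # remember the failure so it can be reported
    raise RuntimeError(f"all {len(components)} components failed: {errors}")


def primary() -> str:
    # Simulated fault in the primary component.
    raise ConnectionError("primary is down")


def backup() -> str:
    return "served by backup"


result = call_with_failover([primary, backup])
print(result)
```

The caller never sees the primary's failure, only the backup's answer. If every component fails, the function raises an error instead of returning a result.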

When to use

Providing fault-tolerant design for every component is normally not an option. In such cases the following criteria may be used to determine which components should be fault-tolerant:

* How critical is the component? In a car, the radio is not critical, so this component has less need for fault-tolerance.

* How likely is the component to fail? Some components, like the drive shaft in a car, are not likely to fail, so no fault-tolerance is needed.

* How expensive is it to make the component fault-tolerant? Requiring a redundant car engine, for example, would likely be too expensive, both economically and in terms of weight and space, to be considered.

An example of a component that passes all three tests is a car's occupant restraint system. Although we do not normally think of it this way, the "primary" occupant restraint system is gravity. If the vehicle rolls over or undergoes severe g-forces, this primary method of occupant restraint may fail. Restraining the occupants during such an accident is absolutely critical to safety, so the system passes the first test. Accidents causing occupant ejection were quite common before seat belts, so it passes the second test. The cost of a redundant restraint method like seat belts is quite low, both economically and in terms of weight and space, so it passes the third test. Therefore, adding seat belts to all vehicles is an excellent idea. Other "supplemental restraint systems", such as airbags, are more expensive and so pass the third test by a smaller margin. This is why inexpensive vehicles typically have fewer airbags than expensive vehicles.
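One way to combine the three criteria is to compare the expected loss from a failure with the cost of adding redundancy. The sketch below is an illustrative simplification, not an established method: the function name, the assumption that redundancy removes the whole loss, and every number in the examples are invented for demonstration.

```python
def worth_making_fault_tolerant(
    failure_probability: float,
    failure_cost: float,
    redundancy_cost: float,
) -> bool:
    """Return True when the expected loss avoided exceeds the cost of redundancy.

    failure_probability: how likely the component is to fail (0 to 1)
    failure_cost:        how critical the component is, expressed as a cost
    redundancy_cost:     how expensive it is to make the component fault-tolerant
    Simplifying assumption: the redundant design removes the whole loss.
    """
    if not 0.0 <= failure_probability <= 1.0:
        raise ValueError("failure_probability must be between 0 and 1")
    expected_loss = failure_probability * failure_cost
    return expected_loss > redundancy_cost


# Invented, order-of-magnitude figures for the examples in the text.
radio = worth_making_fault_tolerant(0.05, 100, 50)              # not critical
drive_shaft = worth_making_fault_tolerant(0.0001, 10_000, 500)  # unlikely to fail
seat_belt = worth_making_fault_tolerant(0.01, 1_000_000, 100)   # critical and cheap

print(radio, drive_shaft, seat_belt)
```

With these made-up inputs only the seat belt case justifies redundancy, which mirrors the reasoning in the text. Real decisions would need measured failure rates and costs.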


Hardware fault-tolerance sometimes requires that broken parts be swapped out for new ones while the system is still operational (known in computing as "hot swapping"). A system implemented with a single backup is known as single point tolerant, and such systems represent the vast majority of fault-tolerant systems. In these systems the mean time between failures should be long enough for the operators to fix the broken device before the backup also fails. It helps if the time between failures is as long as possible, but this is not specifically required in a fault-tolerant system.
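The relationship between repair time and mean time between failures can be made concrete with a small calculation. The sketch below assumes failures follow an exponential distribution with rate 1/MTBF, a common modelling simplification that the text itself does not state. The function name and the figures are hypothetical.

```python
import math


def backup_failure_probability(repair_hours: float, mtbf_hours: float) -> float:
    """Probability that the backup fails before the broken part is repaired.

    Assumes exponentially distributed failure times with rate 1 / mtbf_hours.
    """
    if repair_hours < 0 or mtbf_hours <= 0:
        raise ValueError("repair_hours must be >= 0 and mtbf_hours must be > 0")
    return 1.0 - math.exp(-repair_hours / mtbf_hours)


# Illustrative figures: a 24-hour repair window and a 10,000-hour MTBF.
quick_repair = backup_failure_probability(repair_hours=24, mtbf_hours=10_000)
slow_repair = backup_failure_probability(repair_hours=720, mtbf_hours=10_000)

print(f"{quick_repair:.4%}", f"{slow_repair:.4%}")
```

Under these assumptions a one-day repair leaves roughly a 0.24% chance of losing the backup as well, and a slower repair raises that risk. This is why the repair window matters as much as the failure rate.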

Fault-tolerance is notably successful in computer applications. Tandem Computers built their entire business on such machines, which used single point tolerance to create their NonStop systems with uptimes measured in years.

Fail-safe architectures may also encompass the computer software, for example through process replication.


The advantages of fault-tolerant design are obvious, while many of its disadvantages are not:

* Interference with fault detection in the same component. To continue the above passenger vehicle example, it may not be obvious to the driver when a tire has been punctured, with either of the fault-tolerant systems. This is usually handled with a separate "automated fault detection system". In the case of the tire, an air pressure monitor detects the loss of pressure and notifies the driver. The alternative is a "manual fault detection system", such as manually inspecting all tires at each stop.

* Interference with fault detection in another component. Another variation of this problem is when fault-tolerance in one component prevents fault detection in a different component. For example, if component B performs some operation based on the output from component A, then fault-tolerance in B can hide a problem with A. If component B is later changed (to a less fault-tolerant design) the system may fail suddenly, making it appear that the new component B is the problem. Only after the system has been carefully scrutinized will it become clear that the root problem is actually with component A.

* Reduction of priority of fault correction. Even if the operator is aware of the fault, having a fault-tolerant system is likely to reduce the importance of repairing the fault. If the faults are not corrected, this will eventually lead to system failure, when the fault-tolerant component fails completely or when all redundant components have also failed.

* Test difficulty. For certain critical fault-tolerant systems, such as a nuclear reactor, there is no easy way to verify that the backup components are functional. The most infamous example is Chernobyl, where operators were testing a backup power arrangement for the cooling pumps with the emergency core cooling system disabled. The test ended in the destruction of the reactor and a massive release of radiation.

* Cost. Both fault-tolerant components and redundant components tend to increase cost. This can be a purely economic cost or can include other measures, such as weight. Manned spaceships, for example, have so many redundant and fault-tolerant components that their weight is increased dramatically over unmanned systems, which don't require the same level of safety.

* Inferior components. A fault-tolerant design may allow for the use of inferior components, which would have otherwise made the system inoperable. While this practice has the potential to mitigate the cost increase, use of multiple inferior components may lower the reliability of the system to a level equal to, or even worse than, a comparable non-fault-tolerant system.
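The point about inferior components can be checked with basic reliability arithmetic. The sketch below assumes that redundant components fail independently and that the system works as long as at least one of them works. The reliability figures are invented for illustration.

```python
def parallel_reliability(component_reliability: float, count: int) -> float:
    """Reliability of `count` redundant components, any one of which suffices.

    Assumes independent failures: the system fails only if every component fails.
    """
    if not 0.0 <= component_reliability <= 1.0:
        raise ValueError("component_reliability must be between 0 and 1")
    if count < 1:
        raise ValueError("count must be at least 1")
    return 1.0 - (1.0 - component_reliability) ** count


# Illustrative figures: one high-quality part versus two cheaper parts.
single_good = parallel_reliability(0.995, 1)
pair_of_inferior = parallel_reliability(0.90, 2)

print(f"single good part: {single_good:.4f}")
print(f"two inferior parts: {pair_of_inferior:.4f}")
```

With these numbers the redundant pair of inferior parts is less reliable than the single better part, so the redundancy does not make up for the drop in component quality.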

Related terms

There is a difference between fault-tolerance and systems that rarely have problems. For instance, the Western Electric crossbar systems had failure rates of two hours per forty years, and therefore were highly "fault resistant". But when a fault did occur they still stopped operating completely, and therefore were not "fault-tolerant".

See also

* Fault-tolerant system
* Safe-life design
* Capillary routing
* Separation of protection and security

External links

* [http://portal.acm.org/citation.cfm?id=779417 Implementation and evaluation of failsafe computer-controlled systems]
* [http://www.cs.rutgers.edu/~iftode/seminar04_papers.htm Seminar on Self-Healing Systems]
* Interview with Robert Hanmer about his book "Patterns for Fault Tolerant Software" ( [http://se-radio.net/podcast/2007-11/episode-77-fault-tolerance-bob-hanmer-pt-1 Part One] , [http://se-radio.net/podcast/2007-11/episode-78-fault-tolerance-bob-hanmer-pt-2 Part Two] ) (Podcast)

Wikimedia Foundation. 2010.
