Fault-tolerant computer systems
Fault-tolerant computer systems are systems designed around the concepts of fault tolerance. In essence, they must be able to keep working at a satisfactory level in the presence of faults.
Types of fault tolerance
Most fault-tolerant computer systems are designed to handle several possible failures, including hardware-related faults such as hard disk failures, input or output device failures, or other temporary or permanent failures; software bugs and errors; interface errors between the hardware and software, including driver failures; operator errors, such as erroneous keystrokes, bad command sequences, or installing unexpected software; and physical damage or other flaws introduced to the system from an outside source [Dhiraj K. Pradhan, Fault-Tolerant Computer System Design, pp. 135-138. 1996. ISBN 0-13-057887-8].
Hardware fault tolerance is the most common application of these systems, designed to prevent failures due to hardware components. Typically, components have multiple backups and are separated into smaller "segments" that act to contain a fault, and extra redundancy is built into all physical connectors, power supplies, fans, etc. [Jan Vytopil, Formal Techniques in Real-Time and Fault-Tolerant Systems: Second International Symposium, Nijmegen, the Netherlands, January 8-10, 1992, Proceedings. Springer, 1991. ISBN 3540550925, 9783540550921]. There are special software and instrumentation packages designed to detect failures. One technique is fault masking, which hides a fault by keeping a backup component ready to execute each instruction as soon as it is sent and by using a voting protocol: if the main component and its backups do not give the same results, the flawed output is ignored.
Software fault tolerance is based more around nullifying programming errors using real-time redundancy, or static "emergency" subprograms that fill in for programs that crash. There are many ways to conduct such fault regulation, depending on the application and the available hardware [Dhiraj K. Pradhan, Fault-Tolerant Computer System Design, pp. 221-235. 1996. ISBN 0-13-057887-8].
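The idea of a static "emergency" subprogram standing in for a routine that crashes can be illustrated with a minimal Python sketch. The function names below (run_with_fallback, precise_average, safe_default) are hypothetical and chosen only for illustration; real systems use considerably more elaborate recovery schemes.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def run_with_fallback(primary: Callable[[], T], emergency: Callable[[], T]) -> T:
    """Run `primary`; if it raises, return the result of `emergency` instead."""
    try:
        return primary()
    except Exception:
        # The primary routine "crashed": the static emergency routine fills in.
        return emergency()


def precise_average(values: Sequence[float]) -> float:
    # Hypothetical primary routine; raises ZeroDivisionError on empty input.
    return sum(values) / len(values)


def safe_default() -> float:
    # Hypothetical emergency routine returning a fixed, conservative answer.
    return 0.0


print(run_with_fallback(lambda: precise_average([2.0, 4.0]), safe_default))  # 3.0
print(run_with_fallback(lambda: precise_average([]), safe_default))          # 0.0
```

The emergency routine is deliberately simpler than the primary one, so that it is less likely to contain the same programming error.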
History
The first known fault-tolerant computer was SAPO, built in 1951 in Czechoslovakia by Antonin Svoboda [Daniel P. Siewiorek, C. Gordon Bell, Allen Newell, Computer Structures: Principles and Examples, p. 155. McGraw-Hill, 1982. ISBN 0070573026, 9780070573024]. Its basic design was magnetic drums connected via relays, with a voting method of memory error detection. Several other machines were developed along this line, mostly for military use. Eventually, they separated into three distinct categories: machines that would last a long time without any maintenance, such as the ones used on NASA space probes and satellites; computers that were very dependable but required constant monitoring, such as those used to monitor and control nuclear power plants or supercollider experiments; and finally, computers with a high amount of runtime that would be under heavy use, such as many of the supercomputers used by insurance companies for their probability monitoring.
Most of the development in so-called LLNM (Long Life, No Maintenance) computing was done by NASA during the 1960s [Daniel P. Siewiorek, C. Gordon Bell, Allen Newell, Computer Structures: Principles and Examples, p. 189. McGraw-Hill, 1982. ISBN 0070573026, 9780070573024], in preparation for Project Apollo and other research aspects. NASA's first machine went into a space observatory, and their second attempt, the JSTAR computer, was used in Voyager. This computer had a backup of memory arrays to use memory recovery methods, and thus it was called the JPL Self-Testing-And-Repairing computer. It could detect its own errors and fix them or bring up redundant modules as needed. The computer is still working today.
Hyper-dependable computers were pioneered mostly by aircraft manufacturers [Daniel P. Siewiorek, C. Gordon Bell, Allen Newell, Computer Structures: Principles and Examples, p. 210. McGraw-Hill, 1982. ISBN 0070573026, 9780070573024], nuclear power companies, and the railroad industry in the USA. These needed computers with massive amounts of uptime that would fail gracefully enough with a fault to allow continued operation, while relying on the fact that the computer output would be constantly monitored by humans to detect faults. Again, IBM developed the first computer of this kind for NASA, for guidance of Saturn V rockets, but later on BNSF, Unisys, and General Electric built their own [Daniel P. Siewiorek, C. Gordon Bell, Allen Newell, Computer Structures: Principles and Examples, p. 223. McGraw-Hill, 1982. ISBN 0070573026, 9780070573024].
In general, the early efforts at fault-tolerant designs were focused mainly on internal diagnosis, where a fault would indicate something was failing and a worker could replace it. SAPO, for instance, had a method by which faulty memory drums would emit a noise before failure [M. R. Neilforoshan, Fault tolerant computing in computer design, Journal of Computing Sciences in Colleges, Volume 18, Issue 4 (April 2003), pp. 213-220. ISSN 1937-4771]. Later efforts showed that, to be fully effective, the system had to be self-repairing and self-diagnosing: isolating a fault and then implementing a redundant backup while alerting a need for repair. This is known as N-model redundancy, where faults cause automatic fail-safes and a warning to the operator, and it is still the most common form of level one fault-tolerant design in use today.
Voting was another initial method, as discussed above, with multiple redundant backups operating constantly and checking each other's results. If, for example, four components reported an answer of 5 and one component reported an answer of 6, the other four would "vote" that the fifth component was faulty and have it taken out of service. This is called M out of N majority voting.
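The voting scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name majority_vote and the threshold parameter m are hypothetical, and a real voter would usually be implemented in hardware or in a carefully verified software layer.

```python
from collections import Counter
from typing import Hashable, List, Optional, Sequence, Tuple


def majority_vote(
    outputs: Sequence[Hashable], m: int
) -> Tuple[Optional[Hashable], List[int]]:
    """One round of M-out-of-N voting over the outputs of N components.

    Returns (agreed_value, faulty_indices). agreed_value is None when no
    single value is reported by at least m components.
    """
    if not outputs:
        return None, []
    value, count = Counter(outputs).most_common(1)[0]
    if count < m:
        # No quorum: the voter cannot tell which components are faulty.
        return None, []
    # Components that disagree with the agreed value are flagged as faulty.
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty


# Four components report 5 and the fifth reports 6, as in the text.
value, faulty = majority_vote([5, 5, 5, 5, 6], m=3)
print(value, faulty)  # 5 [4]
```

Here the component at index 4 is outvoted, and a supervising layer could then take it out of service.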
Historically, the trend has been to move away from N-model redundancy and toward M out of N voting, because as systems grew more complex it became harder to ensure that the transition from a fault-free to a faulty state did not disrupt operations.
Fault tolerance verification and validation
The most important design requirement for a fault-tolerant computer system is verifying that it actually meets its reliability requirements. This is done by using failure models to simulate various failures and analyzing how well the system reacts. These statistical models are very complex, involving probability curves and specific fault rates, latency curves, error rates, and the like. The most commonly used models are HARP, SAVE, and SHARPE in the USA, and SURF or LASS in Europe.
Fault tolerance research
Research into the kinds of tolerances needed for critical systems involves a large amount of interdisciplinary work. The more complex the system, the more carefully all possible interactions have to be considered and prepared for. Considering the importance of high-value systems in transport, utilities and the military, the range of topics that touch on this research is very wide: it extends from obvious subjects such as software modeling and reliability, or hardware design, to arcane elements such as stochastic models, graph theory, formal or exclusionary logic, parallel processing, remote data transmission, and more [Shunji Osaki, Toshihiko Nishio, Reliability Evaluation of Some Fault-Tolerant Computer Architectures. Springer, 1980. ISBN 3540102744, 9783540102748].
See also
* Fault Tolerant System
External links
* [http://64.233.169.104/search?q=cache:uBL7iMOpV9UJ:www.cs.ucla.edu/~rennels/article98.pdf+Fault-tolerant+computer+systems&hl=en&ct=clnk&cd=13&gl=us&client=firefox-a Primer on Fault-Tolerant Computer Systems from UCLA]
* [http://www.freepatentsonline.com/5099485.html A fault-tolerance patent with basic information on specific ways to detect faults]
Wikimedia Foundation. 2010.