Recovery-oriented computing

Recovery-oriented computing (sometimes abbreviated to ROC) is an approach developed at Stanford University and the University of California, Berkeley for building reliable Internet services. Its proponents accept that computer bugs are inevitable and concentrate on reducing their harmful effects. The National Science Foundation funds the project.

Several characteristics set recovery-oriented computing apart from other failure-handling techniques.

Isolation and redundancy

Isolation in these systems depends on redundancy: should one part of the system fail, a redundant part must take its place. Isolation must hold for every type of failure, whether caused by software or by human error. One way to isolate parts of a system is to use a virtual machine monitor such as Xen. A virtual machine monitor allows many virtual machines to run on one physical machine. If one virtual machine develops a problem, it can be restarted without restarting the physical machine, or it can be stopped and replaced by another.
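The failover idea can be illustrated with a minimal Python sketch. The `Replica` and `RedundantService` classes are hypothetical names invented for this illustration, standing in for isolated units such as virtual machines; they are not part of Xen or of any ROC software.

```python
class Replica:
    """A hypothetical isolated unit, for example one virtual machine."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def handle(self, request):
        if not self.healthy:
            raise RuntimeError(f"{self.name} has failed")
        return f"{self.name} served {request}"


class RedundantService:
    """Routes each request to the first healthy replica, failing over as needed."""

    def __init__(self, replicas):
        self.replicas = list(replicas)

    def handle(self, request):
        for replica in self.replicas:
            try:
                return replica.handle(request)
            except RuntimeError:
                # The failure stays contained in this replica; try the next one.
                continue
        raise RuntimeError("all replicas failed")


# The first replica is down, so the redundant one takes its place.
service = RedundantService([Replica("vm-a", healthy=False), Replica("vm-b")])
result = service.handle("GET /")
```

Here the caller never sees the failure of `vm-a`, because the request is served by `vm-b`.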

System-wide undo support

The ability to undo across different programs and time frames is necessary in this type of system because human error is a major cause of system failures. People naturally work by trial and error. Without undo support, testing on a production system is also limited, because it does not allow for trial and error.

System-wide undo support should cover all aspects of the system, including hardware and software upgrades, configuration, and application management. There are limits to what can be undone, and these limits are being explored, tested, and rated according to their tradeoffs.
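A very small Python sketch of the undo idea follows, assuming every change can be paired with an action that reverses it. The `UndoLog` class, the `set_value` helper, and the `max_connections` setting are hypothetical and exist only for this illustration; a real system-wide undo facility would also have to deal with effects that have already been seen outside the system.

```python
class UndoLog:
    """Records an inverse action for each change so it can be reversed later."""

    def __init__(self):
        self._undo_stack = []

    def apply(self, do, undo):
        do()
        self._undo_stack.append(undo)

    def undo_last(self):
        """Reverse the most recent change; return False if there is none."""
        if not self._undo_stack:
            return False
        self._undo_stack.pop()()
        return True


config = {"max_connections": 100}   # hypothetical configuration setting
log = UndoLog()


def set_value(key, new_value):
    old_value = config[key]         # captured before the change is applied
    log.apply(
        do=lambda: config.__setitem__(key, new_value),
        undo=lambda: config.__setitem__(key, old_value),
    )


set_value("max_connections", 5)     # an operator mistake
log.undo_last()                     # restores the previous value
```

After the call to `undo_last`, the configuration holds its original value again.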

Integrated diagnostic support

Integrated diagnostic support is another characteristic a recovery-oriented computer should have. The system should be able to identify the root cause of a failure. Once it has done so, it should either contain the failure so that it cannot affect other parts of the system, or repair it. Every system component or module should be self-testing, meaning that it can detect when something is wrong with itself. Besides detecting their own problems, modules should be able to verify the behavior of the other modules they depend on. The system must also track module, resource, and user request dependencies throughout the system, which allows failures to be contained.
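Dependency tracking for containment can be sketched in Python as a graph search. The module names and the `depends_on` map below are hypothetical, and the function only answers one question: which modules could be affected if a given module fails.

```python
from collections import deque

# Hypothetical dependency map: module -> modules it depends on.
depends_on = {
    "frontend": ["auth", "catalog"],
    "auth": ["database"],
    "catalog": ["database", "cache"],
    "database": [],
    "cache": [],
}


def affected_by(failed, depends_on):
    """Return every module that directly or transitively depends on `failed`."""
    # Invert the map so each module lists the modules that depend on it.
    dependents = {module: set() for module in depends_on}
    for module, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(module)

    # Breadth-first search outward from the failed module.
    affected, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for module in dependents[current]:
            if module not in affected:
                affected.add(module)
                queue.append(module)
    return affected


cache_impact = affected_by("cache", depends_on)
```

With this map, a failure in `cache` can reach `catalog` and `frontend` but not `auth`, so containment only needs to consider those two modules.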

Online verification and recovery mechanisms

Recovery mechanisms are the means by which a system recovers from failures. They should be well designed, meaning reliable, effective, and efficient. The system should proactively test and verify its recovery mechanisms, so that when a real failure occurs there is confidence that they will do what they were designed to do and help the system recover. These verifications should be performed even on production equipment, since that is the equipment whose uptime matters most. There are two methods for performing these tests, and both should be used. The first is directed testing, in which specific tests are set up and executed deliberately. The second is random testing, in which tests occur without warning.
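The two testing methods can be contrasted in a short Python sketch. The `Component` class and the `recover` and `verify_recovery` functions are hypothetical stand-ins; in a real system the fault would be injected into live equipment and recovery would be far more involved than flipping a flag.

```python
import random


class Component:
    """A hypothetical component whose only state is whether it is running."""

    def __init__(self, name):
        self.name = name
        self.running = True

    def inject_fault(self):
        self.running = False


def recover(component):
    """The recovery mechanism under test: restart the failed component."""
    component.running = True


def verify_recovery(component):
    """Inject a fault, run recovery, and report whether the component came back."""
    component.inject_fault()
    recover(component)
    return component.running


components = [Component("web"), Component("db"), Component("cache")]

# Directed test: a chosen component is failed deliberately.
directed_ok = verify_recovery(components[1])

# Random test: the target is picked unpredictably.
rng = random.Random(0)  # seeded only so this sketch is reproducible
random_ok = verify_recovery(rng.choice(components))
```

Both checks exercise the same recovery path; the difference is only in how the failing component is selected.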

Modularity, measurability and restartability

Software aging problems are best resolved by restarting the affected component, which requires both modularity and restartability. Components should be restarted before they fail, and should be designed so that a restart can be triggered on demand or, better yet, happens automatically. Applications should also be designed for restartability.
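Restarting a component before it fails can be sketched as follows. The `AgingComponent` class simulates software aging as a counter that grows with every request, and the threshold values are arbitrary assumptions chosen for this illustration.

```python
class AgingComponent:
    """Simulates software aging: each request leaks a little until failure."""

    FAILURE_LIMIT = 100  # hypothetical point at which the component fails

    def __init__(self):
        self.leaked = 0
        self.restarts = 0

    def handle_request(self):
        if self.leaked >= self.FAILURE_LIMIT:
            raise MemoryError("component exhausted its resources")
        self.leaked += 10

    def restart(self):
        # A restart returns the component to a clean state.
        self.leaked = 0
        self.restarts += 1


def serve(component, requests, restart_threshold=80):
    """Serve requests, restarting the component before it reaches its limit."""
    served = 0
    for _ in range(requests):
        if component.leaked >= restart_threshold:
            component.restart()
        component.handle_request()
        served += 1
    return served


component = AgingComponent()
served = serve(component, requests=50)
```

Because the restart happens at the threshold rather than at the failure limit, all 50 requests are served and `handle_request` never raises.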


These systems should be benchmarked frequently for dependability and availability, both to justify their use and to track their progress. The benchmarks should be reproducible and should give an impartial measure of system dependability, reliability, and availability.
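One common way to express availability is as the fraction of time a system is up, which equals MTTF / (MTTF + MTTR), where MTTF is the mean time to failure and MTTR is the mean time to repair. The Python sketch below computes it from an outage log; the figures are invented for illustration and are not measurements from any real system.

```python
def availability(uptime_hours, downtime_hours):
    """Fraction of the observation period during which the system was up."""
    return uptime_hours / (uptime_hours + downtime_hours)


# Hypothetical outage log for a 1000-hour observation period.
period_hours = 1000.0
outages_hours = [0.5, 1.5, 2.0]

downtime = sum(outages_hours)
uptime = period_hours - downtime

mttf = uptime / len(outages_hours)    # mean time to failure
mttr = downtime / len(outages_hours)  # mean time to repair

measured = availability(uptime, downtime)
from_means = mttf / (mttf + mttr)     # the same quantity, via MTTF and MTTR
```

The formula shows why shortening recovery time matters: halving MTTR raises availability without any change in how often failures occur.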

See also

* Reliable system design
* Computer glitch

External links

* The Berkeley/Stanford Recovery-Oriented Computing (ROC) Project, the official web site, which includes information on research, people, publications, talks, retreats, and projects

Wikimedia Foundation. 2010.
