Downtime

The term downtime refers to periods when a system is unavailable, that is, when it fails to provide or perform its primary function; the length of such a period is also called the outage duration. Reliability, availability, recovery, and unavailability are related concepts. Unavailability is the proportion of a timespan during which a system is unavailable or offline. It is usually the result of the system failing to function because of an unplanned event or because of routine maintenance.

The term is commonly applied to networks and servers. Common reasons for unplanned outages are system failures (such as a crash) and communications failures (commonly known as network outages).

The term is also commonly applied in industrial environments in relation to failures in industrial production equipment. Some facilities measure the downtime incurred during a work shift, or during a 12- or 24-hour period. Another common practice is to identify each downtime event as having an operational, electrical or mechanical origin.
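The per-shift bookkeeping described above can be sketched in a few lines of Python. This is only an illustration: the `DowntimeEvent` record, its field names, and the sample figures are assumptions made for the example, not a standard format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class DowntimeEvent:
    origin: str     # hypothetical labels: "operational", "electrical", "mechanical"
    minutes: float  # duration of the stoppage in minutes


def downtime_by_origin(events):
    """Sum the downtime minutes recorded for each origin category."""
    totals = defaultdict(float)
    for event in events:
        totals[event.origin] += event.minutes
    return dict(totals)


# Invented events for one work shift.
shift_events = [
    DowntimeEvent("mechanical", 25),
    DowntimeEvent("electrical", 10),
    DowntimeEvent("mechanical", 15),
]
totals = downtime_by_origin(shift_events)
shift_total = sum(totals.values())
```

Grouping by origin in this way makes it easy to see which category contributes most to the downtime of a shift.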

The opposite of downtime is uptime.

Characteristics

Unplanned downtime may be the result of a software bug, human error, equipment failure, malfunction, high bit error rate, power failure, overload due to exceeding the channel capacity, a cascading failure, etc.

Telecommunication outage classifications

Downtime can be caused by failure in hardware (physical equipment), software (logic controlling equipment), interconnecting equipment (such as cables, facilities, and routers), wireless transmission (wireless, microwave, satellite), and/or capacity (system limits).

The failures can be attributed to damage, failure, design, procedure (improper use by humans), engineering (how the system is used and deployed), overload (traffic or system resources stressed beyond designed limits), environment (support systems like power and HVAC), scheduled downtime (outages designed into the system for a purpose such as software upgrades and equipment growth), other (none of the above but known), or unknown causes.

Responsibility for a failure can lie with the customer/service provider, vendor/supplier, utility, government, contractor, end customer, public individual, an act of nature, other (none of the above but known), or unknown.[1]

Impact

Outages caused by system failures can have a serious impact on the users of computer/network systems, in particular in those industries that rely on a nearly 24-hour service.

Users of an ISP and other customers of a telecommunication network can also be affected.

Corporations can lose business due to a network outage, or they may default on a contract, resulting in financial losses.

Those people or organizations that are affected by downtime can be more sensitive to particular aspects:

  • some are more affected by the length of an outage - it matters to them how much time it takes to recover from a problem
  • others are sensitive to the timing of an outage - outages during peak hours affect them the most

The most demanding users are those that require high availability.

Famous outages

On Mother's Day, Sunday, May 8, 1988, a fire broke out in the main switching room of the Hinsdale Central Office of the Illinois Bell telephone company. One of the largest switching systems in the state, the facility processed more than 3.5 million calls each day while serving 38,000 customers, including numerous businesses, hospitals, and Chicago’s O’Hare and Midway Airports.[2]

Virtually the entire AT&T network of 4ESS toll tandem switches went in and out of service over and over again on January 15, 1990, disrupting long-distance service for the entire nation. The problem dissipated by itself when traffic slowed down. A software bug was later identified as the cause.

AT&T lost its frame relay network for 26 hours on April 13, 1998.[3] This affected many thousands of customers, and bank transactions were one casualty. AT&T failed to meet the service level agreements in its contracts with customers and had to refund[4] 6,600 customer accounts, costing millions of dollars.

Xbox Live had intermittent downtime lasting thirteen days during the 2007-2008 holiday season.[5] Increased demand from Xbox 360 purchasers (the largest number of new user sign-ups in the history of Xbox Live) was given as the reason for the downtime; to make amends for the service issues, Microsoft offered its users the opportunity to receive a free game.[6]

Sony's PlayStation Network outage began on April 20, 2011, and service was gradually restored from May 14, 2011, starting in the United States. This outage was the longest period the PSN had been offline since its inception in 2006. Sony stated that the problem was caused by an external intrusion which resulted in the theft of personal information.[7] Sony reported on April 26, 2011, that a large amount of user data had been obtained in the same attack that caused the downtime.

Service levels

In service level agreements, it is common to state a percentage value (per month or per year) that is calculated by dividing the sum of all downtime timespans by the total length of a reference timespan (e.g. a month). A downtime of 0% means that the server was available all the time.
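As a minimal sketch of this calculation, the following Python function sums the outage durations that fall inside a reference period and divides by the length of that period. The function name, the assumption that outages do not overlap, and the sample dates are all invented for illustration.

```python
from datetime import datetime, timedelta


def downtime_percent(outages, period_start, period_end):
    """Return downtime as a percentage of the reference period.

    `outages` is a list of (start, end) datetime pairs, assumed not to overlap.
    """
    period = (period_end - period_start).total_seconds()
    down = 0.0
    for start, end in outages:
        # Clip each outage to the reference period before summing.
        clipped_start = max(start, period_start)
        clipped_end = min(end, period_end)
        if clipped_end > clipped_start:
            down += (clipped_end - clipped_start).total_seconds()
    return 100.0 * down / period


# Hypothetical 30-day month (720 hours) with two outages totalling 7.2 hours.
month_start = datetime(2024, 4, 1)
month_end = month_start + timedelta(days=30)
outages = [
    (datetime(2024, 4, 3, 2, 0), datetime(2024, 4, 3, 6, 0)),       # 4 hours
    (datetime(2024, 4, 20, 12, 0), datetime(2024, 4, 20, 15, 12)),  # 3.2 hours
]
pct = downtime_percent(outages, month_start, month_end)
```

With these sample figures, 7.2 hours out of 720 hours corresponds to a downtime of 1%.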

For Internet servers, downtime above 1% per year can be regarded as unacceptable, as this means more than 3 days of downtime per year. For e-commerce and other industrial use, any value above 0.1% is usually considered unacceptable.
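The arithmetic behind these thresholds can be checked with a short Python sketch. It assumes a 365-day year and ignores leap years; the function name is made up for the example.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(percent):
    """Convert a yearly downtime percentage into hours of downtime."""
    return HOURS_PER_YEAR * percent / 100.0


one_percent = downtime_hours_per_year(1.0)     # about 87.6 hours
one_percent_days = one_percent / 24            # about 3.65 days
tenth_percent = downtime_hours_per_year(0.1)   # about 8.76 hours
```

So 1% corresponds to roughly three and a half days per year, while 0.1% corresponds to a little under nine hours.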

Response and reduction of impact

It is the duty of the network designer to minimise the likelihood of a network outage. When an outage does happen, a well-designed system limits its effects by keeping the outage localized, so that it can be detected and fixed as soon as possible.

A process needs to be in place to detect a malfunction (network monitoring) and to restore the network to a working condition. Restoration generally involves a help desk team of trained engineers that can troubleshoot the problem. A separate help desk team is usually necessary to field user input, which can be particularly demanding during a downtime.

A network management system can be used to detect faulty or degrading components prior to customer complaints, with proactive fault rectification.

Risk management techniques can be used to determine the impact of network outages on an organisation and what actions may be required to minimise risk. Risk may be minimised by using reliable components, by performing maintenance such as upgrades, by using redundant systems, or by having a contingency plan or business continuity plan. Technical means such as error-correcting codes, retransmission, checksums, or diversity schemes can reduce errors.
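Two of the technical means mentioned above, checksums and retransmission, can be combined as in the following Python sketch. It uses the CRC-32 function from the standard `zlib` module; the frame layout, the function names, and the simulated channel are assumptions made for illustration and do not describe any particular protocol.

```python
import zlib


def make_frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 checksum to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_frame(frame: bytes):
    """Return the payload if the checksum matches, otherwise None."""
    payload, received = frame[:-4], int.from_bytes(frame[-4:], "big")
    return payload if zlib.crc32(payload) == received else None


def send_with_retry(payload: bytes, channel, max_attempts: int = 3):
    """Resend the frame until it arrives intact or the attempts run out."""
    for _ in range(max_attempts):
        received = check_frame(channel(make_frame(payload)))
        if received is not None:
            return received
    return None


# Hypothetical channel that corrupts only the first transmission.
attempts = []


def flaky_channel(frame: bytes) -> bytes:
    attempts.append(frame)
    if len(attempts) == 1:
        return bytes([frame[0] ^ 0x01]) + frame[1:]  # flip one bit
    return frame


result = send_with_retry(b"hello", flaky_channel)
```

In this simulated run the first frame fails its checksum and the second transmission succeeds, so a transient error is absorbed without the receiver ever seeing corrupted data.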

Planning

A planned outage is the result of a planned activity by the system owner and/or by a service provider. These outages, often scheduled during the maintenance window, can be used to perform tasks including the following:

  • Deferred maintenance, e.g., a deferred hardware repair or a deferred restart to clean up garbled memory
  • Diagnostics to isolate a detected fault
  • Hardware fault repair
  • Fixing an error or omission in a configuration database or in a recent configuration database change
  • Fixing an error in an application database or in a recent application database change
  • Software patching/software updates to fix a software fault.

Outages can also be planned as a result of a predictable natural event, such as a Sun outage.

Maintenance downtimes have to be carefully scheduled in industries that rely on computer systems. In many cases, system-wide downtime can be averted using what is called a "rolling upgrade": the process of incrementally taking down parts of the system for upgrade without affecting the overall functionality.
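The idea of a rolling upgrade can be sketched as follows in Python. The cluster representation (a dictionary of node states with an `in_service` flag), the `upgrade` callback, and the batch size are assumptions chosen for the example; a real system would also wait for health checks before moving to the next batch.

```python
def rolling_upgrade(nodes, upgrade, batch_size=1):
    """Upgrade nodes in small batches so that the rest keep serving.

    `nodes` maps a node name to a state dict with an "in_service" flag.
    `upgrade` is a caller-supplied function applied to each node's state.
    Returns the lowest number of in-service nodes observed during the run.
    """
    names = list(nodes)
    min_in_service = sum(n["in_service"] for n in nodes.values())
    for i in range(0, len(names), batch_size):
        batch = names[i:i + batch_size]
        for name in batch:
            nodes[name]["in_service"] = False  # take the node out of service
        in_service = sum(n["in_service"] for n in nodes.values())
        min_in_service = min(min_in_service, in_service)
        for name in batch:
            upgrade(nodes[name])
            nodes[name]["in_service"] = True   # return it to service
    return min_in_service


# Hypothetical four-node cluster, all nodes initially on version 1.
cluster = {name: {"version": 1, "in_service": True} for name in ("a", "b", "c", "d")}


def bump_version(state):
    state["version"] = 2


lowest = rolling_upgrade(cluster, bump_version, batch_size=1)
```

With a batch size of one, at least three of the four nodes remain in service at every point, which is what allows the system as a whole to stay available during the upgrade.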

Avoidance

Website monitoring is available for most websites. Such monitoring, whether synthetic or passive, is a service that tracks downtime and user activity on the site.
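A single synthetic check can be sketched in Python with the standard `urllib` module. The function names, the rule that any status below 400 counts as "up", and the stand-in fetchers used for the offline demonstration are assumptions for illustration; commercial monitoring services are considerably more elaborate.

```python
import time
import urllib.request


def http_fetch(url, timeout=5.0):
    """Return the HTTP status code for a GET request to `url`."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def probe(url, fetch=http_fetch):
    """Run one synthetic check and report whether the site looks up."""
    started = time.monotonic()
    try:
        status = fetch(url)
        up = 200 <= status < 400
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        status, up = None, False
    elapsed = time.monotonic() - started
    return {"url": url, "up": up, "status": status, "seconds": elapsed}


# Offline demonstration with stand-in fetchers instead of real requests.
def fake_ok(url):
    return 200


def fake_down(url):
    raise OSError("connection refused")


ok_result = probe("https://example.com/", fetch=fake_ok)
down_result = probe("https://example.com/", fetch=fake_down)
```

Running `probe` at a fixed interval and recording the results would give the raw data from which downtime figures can be derived.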

Other usage

Downtime can also refer to time when human capital or other assets are unproductive. For instance, if employees are in meetings or unable to perform their work due to another constraint, they are down. This can be equally expensive, and can be the result of another asset (e.g. computers or other systems) being down. This is also commonly known as "dead time".

The term is also used in factories and other industrial settings. See total productive maintenance (TPM).

Measuring downtime

There are many external services which can be used to monitor the uptime, downtime, and availability of a service or a host. Some examples:

  • Pingdom
  • Watchmouse[8]

References


Wikimedia Foundation. 2010.
