Voice activity detection



Voice activity detection (VAD), also known as speech activity detection or, more simply, speech detection, is a technique used in speech processing in which the presence or absence of human speech is detected in regions of audio. The main uses of VAD are in speech coding and speech recognition. A VAD may indicate not only the presence or absence of speech but also other properties of the speech, for example whether it is voiced, unvoiced or sustained. Voice activity detection is usually language independent.

During a conversation there are silent periods in which a speaker is not talking. A VAD feature in VoIP can suppress the packets that would otherwise carry silence and use the silent periods to transmit traffic other than voice.

Voice Activity Detection (VAD)

The process of separating conversational speech from silence, music, noise or other non-speech signals is called voice activity detection (VAD). The primary function of a voice activity detector is to indicate the presence of speech in order to facilitate speech processing, and possibly to provide delimiters for the beginning and end of a speech segment. VAD was first investigated for use in time-assignment speech interpolation (TASI) systems. It is an important enabling technology for a variety of speech-based applications, and various VAD algorithms have been proposed that trade off delay, sensitivity, accuracy and computational cost.

VAD Applications

* VAD is an integral part of different speech communication systems such as audio conferencing, echo cancellation, speech recognition, speech encoding, and hands-free telephony.
* In the field of multimedia applications, VAD allows simultaneous voice and data applications.
* Similarly, in Universal Mobile Telecommunications Systems (UMTS), it controls and reduces the average bit rate and enhances overall coding quality of speech.
* In cellular radio systems (for instance GSM and CDMA systems) based on Discontinuous Transmission (DTX) mode, VAD is essential for enhancing system capacity by reducing co-channel interference and power consumption in portable digital devices.

For a wide range of applications such as digital mobile radio, Digital Simultaneous Voice and Data (DSVD) or speech storage, it is desirable to transmit speech-coding parameters discontinuously. Advantages can include lower average power consumption in mobile handsets, a higher average bit rate for simultaneous services such as data transmission, or a higher capacity on storage chips. However, the improvement depends mainly on the percentage of pauses during speech and on the reliability of the VAD used to detect these intervals. On the one hand, it is advantageous to have a low percentage of speech activity. On the other hand, clipping (the loss of milliseconds of active speech) should be minimized to preserve quality. This is the crucial problem for a VAD algorithm under heavy noise conditions.
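The dependence of the gain on speech activity can be illustrated with a little arithmetic. The following Python sketch estimates the average transmitted bit rate when speech frames are sent at the full codec rate and pauses are replaced by low-rate updates. The function name and all rates used with it are illustrative assumptions, not figures from any standard.

```python
def average_bit_rate(activity, active_rate_kbps, pause_rate_kbps=0.0):
    """Average bit rate under discontinuous transmission.

    activity          fraction of frames the VAD marks as speech (0..1)
    active_rate_kbps  rate used for speech frames
    pause_rate_kbps   rate used during pauses (0.0 means nothing is sent)
    """
    if not 0.0 <= activity <= 1.0:
        raise ValueError("activity must be between 0 and 1")
    return activity * active_rate_kbps + (1.0 - activity) * pause_rate_kbps


if __name__ == "__main__":
    # Hypothetical codec: 12 kbit/s for speech, 1 kbit/s during pauses.
    for activity in (1.0, 0.5, 0.25):
        rate = average_bit_rate(activity, 12.0, 1.0)
        print(f"activity {activity:.2f}: {rate:.2f} kbit/s on average")
```

A VAD that errs towards "speech" raises the measured activity and therefore the average rate, which is the price paid for avoiding clipping.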

One controversial application of VAD is in conjunction with predictive dialers used by telemarketing firms. To maximize agent productivity, telemarketing firms set up predictive dialers to call more numbers than they have agents available, knowing that most calls will end in either "Ring - No Answer" or an answering machine. When a person answers, they typically speak briefly ("Hello", "Good evening", etc.) and then there is a brief period of silence. Answering machine messages usually contain 3 to 15 seconds of continuous speech. With correctly set VAD parameters, a dialer can determine whether a person or a machine answered the call. If it is a person, the dialer transfers the call to an available agent; if it detects an answering machine, the dialer hangs up. Often the system correctly detects a person answering the call but no agent is available. This leaves the called party frustratedly repeating "Hello? Hello?" into the phone and, combined with the volume of calls in which agents did get through, gave impetus to the development of "Do Not Call" lists across the US.
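The person-versus-machine decision described above can be sketched as a rule on the length of the first speech burst. The Python below is a minimal illustration that assumes per-frame VAD flags are already available. The function name, the 20 ms frame length and the 500 ms pause that ends a greeting are hypothetical choices; only the 3-second lower bound comes from the text.

```python
def classify_answer(flags, frame_ms=20, machine_min_ms=3000, gap_ms=500):
    """Guess who answered a call from per-frame VAD flags (True = speech).

    The first speech burst runs from the first speech frame until a pause
    of at least gap_ms. A long burst suggests an answering machine.
    """
    gap_frames = max(1, gap_ms // frame_ms)
    start = None        # index of the first speech frame
    last_speech = None  # index of the most recent speech frame
    silence_run = 0     # consecutive non-speech frames after speech began
    for i, is_speech in enumerate(flags):
        if is_speech:
            if start is None:
                start = i
            last_speech = i
            silence_run = 0
        elif start is not None:
            silence_run += 1
            if silence_run >= gap_frames:
                break  # the greeting has ended
    if start is None:
        return "unknown"  # nobody spoke
    burst_ms = (last_speech - start + 1) * frame_ms
    return "machine" if burst_ms >= machine_min_ms else "person"


if __name__ == "__main__":
    hello = [False] * 5 + [True] * 30 + [False] * 50   # 0.6 s of speech
    greeting = [True] * 200 + [False] * 50             # 4 s of speech
    print(classify_answer(hello), classify_answer(greeting))
```

Real dialers presumably combine more cues than burst length, so this should be read only as an illustration of the timing rule.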

Robust VAD

Robust VAD algorithms have been suggested to solve some of the problems that ordinary VAD cannot solve. Voice activity detection remains an outstanding problem for speech transmission, enhancement and recognition, and the varying nature of speech and background noise makes it especially challenging. Earlier algorithms are based on the Itakura LPC distance measure, energy levels, timing, pitch, zero-crossing rates, cepstral features, adaptive noise modeling of voice signals and the periodicity measure. Unfortunately, these algorithms perform poorly at low signal-to-noise ratios (SNRs), especially when the noise is non-stationary. Consistent accuracy cannot be achieved because most algorithms rely on a threshold level for comparison, and this threshold is often assumed to be fixed or is calculated in the silence (voice-inactive) intervals. During the last decade numerous researchers have studied different strategies for detecting speech in noise and the influence of the VAD decision on speech processing systems.

Technical Overview of VAD

The basic function of a VAD algorithm is to extract measured features or quantities from the input signal and to compare these values with thresholds, usually derived from the characteristics of the noise and speech signals. A voice-active decision is made if the measured values exceed the thresholds. VAD in non-stationary noise requires a time-varying threshold, which is usually calculated in the voice-inactive segments.
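As a minimal sketch of this scheme, the Python below uses frame energy as the only feature and compares it with a threshold that tracks the noise level. It is not any particular published algorithm. The names `energy_vad`, `margin_db`, `alpha` and `init_frames` are assumptions made for illustration, as is the premise that the first few frames contain only noise.

```python
import math


def frame_energy_db(frame):
    """Mean-square energy of one frame in dB, floored to avoid log(0)."""
    mean_sq = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(max(mean_sq, 1e-12))


def energy_vad(frames, margin_db=6.0, alpha=0.9, init_frames=5):
    """Return one bool per frame: True where speech is declared.

    A frame is voice-active when its energy exceeds the running noise
    estimate by more than margin_db.
    """
    energies = [frame_energy_db(f) for f in frames]
    if not energies:
        return []
    # Assumption: the first init_frames frames contain noise only.
    head = energies[:init_frames]
    noise_db = sum(head) / len(head)
    flags = []
    for e in energies:
        is_speech = e > noise_db + margin_db
        if not is_speech:
            # Time-varying threshold: update the noise estimate only in
            # voice-inactive frames, as described in the text.
            noise_db = alpha * noise_db + (1.0 - alpha) * e
        flags.append(is_speech)
    return flags


if __name__ == "__main__":
    quiet = [0.01, -0.01] * 80   # synthetic low-level background
    loud = [0.5, -0.5] * 80      # synthetic high-level burst
    print(energy_vad([quiet] * 10 + [loud] * 5 + [quiet] * 5))
```

Because the noise estimate is frozen while speech is declared, a sudden rise in background noise can be mistaken for speech indefinitely, which is one reason energy alone struggles in non-stationary noise.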

VAD is more critical in non-stationary noise environments, because the constantly varying noise statistics must be updated and a misclassification error strongly affects system performance. A representative set of recently published VAD methods formulates the decision rule on a frame-by-frame basis using instantaneous measures of the divergence distance between speech and noise. The measures used in VAD methods include spectral slope, correlation coefficients, log likelihood ratio, cepstral, weighted cepstral, and modified distance measures.

VAD can be decomposed into two steps:
* the computation of metrics
* the application of a classification rule.
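The two steps can be kept separate in code, so that either the metrics or the rule can be replaced independently. In the Python sketch below, the metrics are short-time energy and zero-crossing rate, and the rule is a deliberately simple pair of thresholds. The names `FrameMetrics`, `compute_metrics` and `classify`, and both threshold values, are illustrative assumptions.

```python
import math
from typing import NamedTuple, Sequence


class FrameMetrics(NamedTuple):
    energy: float  # mean-square amplitude of the frame
    zcr: float     # fraction of adjacent sample pairs that change sign


def compute_metrics(frame: Sequence[float]) -> FrameMetrics:
    """Step 1: compute the metrics for one frame."""
    energy = sum(x * x for x in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    zcr = crossings / max(len(frame) - 1, 1)
    return FrameMetrics(energy, zcr)


def classify(m: FrameMetrics, energy_thresh: float = 1e-3,
             zcr_max: float = 0.35) -> bool:
    """Step 2: apply a classification rule to the metrics.

    Toy rule: declare speech when the frame is energetic and has the low
    zero-crossing rate typical of voiced sounds. The thresholds are
    placeholders, and this rule will miss unvoiced speech.
    """
    return m.energy > energy_thresh and m.zcr < zcr_max


if __name__ == "__main__":
    voiced_like = [0.5 * math.sin(2 * math.pi * 5 * n / 160) for n in range(160)]
    noise_like = [0.2, -0.2] * 80
    for name, frame in (("voiced-like", voiced_like), ("noise-like", noise_like)):
        m = compute_metrics(frame)
        print(name, m, classify(m))
```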

Whatever the VAD method, a compromise must be made between having voice detected as noise and having noise detected as voice. A VAD operating in a mobile environment must be able to detect speech in the presence of a range of very diverse types of acoustic background noise. In these difficult detection conditions it is vital that a VAD should "fail-safe", indicating "speech detected" when the decision is in doubt, so that no clipping is introduced. The biggest difficulty in detecting speech in this environment is the very low signal-to-noise ratios (SNRs) that are encountered. It is impossible to distinguish between speech and noise using simple level detection techniques when parts of the speech utterance are buried below the noise.

Evaluation of VAD Performance

The performance of a VAD can be measured in terms of activity and of the degree and severity of clipping. To evaluate the amount of clipping and how often noise is detected as speech, the VAD output is compared with that of an ideal VAD. The performance of a VAD is evaluated on the basis of the following four traditional parameters:

* FEC (Front End Clipping): clipping introduced in passing from noise to speech activity;
* MSC (Mid Speech Clipping): clipping due to speech misclassified as noise;
* OVER: noise interpreted as speech due to the VAD flag remaining active in passing from speech activity to noise;
* NDS (Noise Detected as Speech): noise interpreted as speech within a silence period.
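The four parameters above can be counted frame by frame once the output of the VAD under test is aligned with the ideal (reference) decision. The Python sketch below is one possible frame-level reading of the definitions rather than a standardized procedure. The function name is an assumption, and a speech segment that the VAD misses entirely is counted wholly as FEC.

```python
def clipping_metrics(reference, detected):
    """Count FEC, MSC, OVER and NDS frames.

    reference  ideal per-frame decisions (truthy = speech)
    detected   per-frame decisions of the VAD under test
    """
    if len(reference) != len(detected):
        raise ValueError("sequences must have the same length")
    counts = {"FEC": 0, "MSC": 0, "OVER": 0, "NDS": 0}
    seen_onset = False   # VAD has fired within the current speech segment
    in_hangover = False  # VAD flag has stayed active since the segment ended
    prev_ref = False
    prev_det = False
    for ref, det in zip(reference, detected):
        if ref:
            if not prev_ref:
                seen_onset = False      # a new reference speech segment starts
            if det:
                seen_onset = True
            elif seen_onset:
                counts["MSC"] += 1      # speech lost after the VAD had fired
            else:
                counts["FEC"] += 1      # speech lost before the VAD fired
        else:
            if prev_ref:
                # Overhang can begin only if the VAD was active on the
                # last frame of the speech segment.
                in_hangover = bool(prev_det)
            if det and in_hangover:
                counts["OVER"] += 1     # flag still active after speech ended
            elif det:
                counts["NDS"] += 1      # false alarm inside a silence period
            else:
                in_hangover = False
        prev_ref, prev_det = ref, det
    return counts


if __name__ == "__main__":
    ideal = [0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
    vad = [0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0]
    print(clipping_metrics(ideal, vad))
```

Dividing each count by the number of reference speech frames (for FEC and MSC) or reference noise frames (for OVER and NDS) gives rates that are easier to compare between recordings of different length.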

Although the method described above provides useful objective information about the performance of a VAD, it gives only an initial estimate of the subjective effect. For example, the effects of speech signal clipping can at times be hidden by the presence of background noise, depending on the model chosen for the comfort noise synthesis, so some of the clipping measured with objective tests is in reality not audible. It is therefore important to carry out subjective tests on VADs, the main aim of which is to ensure that the perceived clipping is acceptable. This kind of test requires a certain number of listeners to judge recordings containing the processing results of the VADs being tested. The listeners have to give marks on the following features:

* Quality;
* Comprehension difficulty;
* Audibility of clipping.

These marks, obtained by listening to several speech sequences, are then used to calculate average results for each of the features listed above, thus providing a global estimate of the behavior of the VAD being tested. To conclude, whereas objective methods are very useful in an initial stage to evaluate the quality of a VAD, subjective methods are more significant. However, because they are more expensive (they require the participation of a certain number of people for a few days), they are generally used only when a proposal is about to be standardized.



Wikimedia Foundation. 2010.
