Katz's back-off model

Katz back-off is a generative n-gram language model that estimates the conditional probability of a word given its history in the n-gram. It accomplishes this estimation by "backing off" to models with smaller histories under certain conditions. In this way, the model with the most reliable information about a given history is used to provide the estimate, yielding better results.

The method

The equation for Katz's back-off model is deceptively simple: [Katz, S. M. (1987). "Estimation of probabilities from sparse data for the language model component of a speech recognizer". IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-35(3), 400–401.]

:P_{bo}(w_i \mid w_{i-n+1} \cdots w_{i-1}) = \begin{cases} d_{w_{i-n+1} \cdots w_i} \dfrac{C(w_{i-n+1} \cdots w_i)}{C(w_{i-n+1} \cdots w_{i-1})} & \text{if } C(w_{i-n+1} \cdots w_i) > k \\ \alpha_{w_{i-n+1} \cdots w_{i-1}} \, P_{bo}(w_i \mid w_{i-n+2} \cdots w_{i-1}) & \text{otherwise} \end{cases}

where
:C(x) = the number of times x appears in training
:w_i = the i-th word in the given context

Essentially, this means that if the n-gram has been seen more than k times in training, the conditional probability of a word given its history is proportional to the maximum likelihood estimate of that n-gram. Otherwise, the conditional probability is equal to the back-off conditional probability of the (n-1)-gram.
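As a rough illustration, the recursion can be sketched in Python over tuple-keyed counts. Note that `d` and `alpha` are fixed placeholder constants here purely for readability; in the actual model they are computed per n-gram, as described in the next section.

```python
from collections import Counter

def katz_backoff(word, history, counts, d=0.5, alpha=0.4, k=0):
    """Sketch of the Katz back-off recursion (d and alpha are
    illustrative constants, not the per-n-gram values)."""
    if not history:
        # Base case: unigram relative frequency.
        total = sum(c for g, c in counts.items() if len(g) == 1)
        return counts.get((word,), 0) / total
    ngram = history + (word,)
    if counts.get(ngram, 0) > k:
        # Seen more than k times: discounted maximum-likelihood estimate.
        return d * counts[ngram] / counts[history]
    # Otherwise back off to the shorter history, scaled by alpha.
    return alpha * katz_backoff(word, history[1:], counts, d, alpha, k)

# Unigram and bigram counts from the toy training text "a b a b c".
tokens = ["a", "b", "a", "b", "c"]
counts = Counter()
for n in (1, 2):
    for i in range(len(tokens) - n + 1):
        counts[tuple(tokens[i : i + n])] += 1

print(katz_backoff("b", ("a",), counts))  # "a b" was seen: 0.5 * 2/2
print(katz_backoff("a", ("c",), counts))  # unseen: backs off to 0.4 * 2/5
```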

The more difficult part is determining the values for k, d and α.

Computing the parameters

k is the least important of the three parameters and is usually chosen to be 0. However, empirical testing may find better values for k.

d is typically the amount of discounting found by Good–Turing estimation. In other words, if Good–Turing estimation replaces the count C with the adjusted count C^*, then
:d = \frac{C^*}{C}
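For concreteness, here is a minimal sketch of that ratio, using the standard Good–Turing adjusted count C^* = (C+1) N_{C+1} / N_C, where N_r is the number of distinct n-grams seen exactly r times. The frequency-of-frequencies table below is hypothetical.

```python
def good_turing_discount(c, freq_of_freq):
    """Good-Turing discount ratio d = C*/C, where
    C* = (C+1) * N_{C+1} / N_C and freq_of_freq maps a count r
    to N_r, the number of distinct n-grams seen exactly r times."""
    c_star = (c + 1) * freq_of_freq.get(c + 1, 0) / freq_of_freq[c]
    return c_star / c

# Hypothetical table: 100 n-grams seen once, 30 twice, 10 three times.
n_r = {1: 100, 2: 30, 3: 10}
print(good_turing_discount(1, n_r))  # C* = 2*30/100 = 0.6, so d = 0.6
print(good_turing_discount(2, n_r))  # C* = 3*10/30 = 1.0, so d = 0.5
```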

To compute α, it is useful to first define a quantity β, which is the left-over probability mass for the (n-1)-gram:

:\beta_{w_{i-n+1} \cdots w_{i-1}} = 1 - \sum_{\{w_i : C(w_{i-n+1} \cdots w_i) > k\}} d_{w_{i-n+1} \cdots w_i} \frac{C(w_{i-n+1} \cdots w_i)}{C(w_{i-n+1} \cdots w_{i-1})}

Then the back-off weight, α, is computed as follows:
:\alpha_{w_{i-n+1} \cdots w_{i-1}} = \frac{\beta_{w_{i-n+1} \cdots w_{i-1}}}{\sum_{\{w_i : C(w_{i-n+1} \cdots w_i) \le k\}} P_{bo}(w_i \mid w_{i-n+2} \cdots w_{i-1})}
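A sketch of both quantities over tuple-keyed counts, assuming a bigram model backing off to unigrams. The discount value and the toy counts are made up for illustration.

```python
def leftover_mass(history, counts, discounts, k=0):
    """beta: probability mass left after discounting every n-gram
    with this history that was seen more than k times."""
    seen = [g for g in counts if g[:-1] == history and counts[g] > k]
    return 1.0 - sum(discounts[g] * counts[g] / counts[history] for g in seen)

def backoff_weight(history, vocab, counts, discounts, lower_prob, k=0):
    """alpha: beta divided by the total lower-order probability of
    the words whose full n-gram was seen at most k times."""
    beta = leftover_mass(history, counts, discounts, k)
    unseen = [w for w in vocab if counts.get(history + (w,), 0) <= k]
    return beta / sum(lower_prob(w) for w in unseen)

# Toy model: unigram counts a=2, b=2, c=1 (5 tokens total), one
# observed bigram "a b" with a made-up discount of 0.6.
counts = {("a",): 2, ("b",): 2, ("c",): 1, ("a", "b"): 2}
discounts = {("a", "b"): 0.6}
unigram = lambda w: counts[(w,)] / 5  # lower-order (unigram) model

beta = leftover_mass(("a",), counts, discounts)  # 1 - 0.6*2/2 = 0.4
alpha = backoff_weight(("a",), {"a", "b", "c"}, counts, discounts, unigram)
print(beta, alpha)  # alpha = 0.4 / (P(a) + P(c)) = 0.4 / 0.6
```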

Discussion

This model generally works well in practice, but fails in some circumstances. For example, suppose that the bigram "a b" and the unigram "c" are very common, but the trigram "a b c" is seen at most k times. Instead of assigning a more appropriate value of 0, the method will back off to the bigram and estimate P(c | b). [Manning and Schütze, Foundations of Statistical Natural Language Processing, MIT Press (1999), ISBN 978-0262133609.]
