Good–Turing frequency estimation

Good–Turing frequency estimation

Good–Turing frequency estimation is a statistical technique for predicting the probability of occurrence of objects belonging to an unknown number of species, given past observations of such objects and their species. (In drawing balls from an urn, the 'objects' would be balls and the 'species' would be the distinct colors of the balls (finite but unknown in number). After drawing R_{red} red balls, R_{black} black balls and R_{green} green balls, we would ask what is the probability of drawing a red ball, a black ball, a green ball or one of a previously unseen color.)

Historical background

Good–Turing frequency estimation was developed by Alan Turing and his assistant I.J. Good during World War II. (It was part of their efforts at Bletchley Park to crack German ciphers for the Enigma machine during the war.) Turing at first modeled the frequencies as a binomial distribution, but found it inaccurate. Good developed smoothing algorithms to improve the estimator's accuracy.

The discovery was recognized as significant when published by Good in 1953, but the calculations were difficult so it was not used as widely as it might have been. [ [http://www.newswise.com/articles/view/501440/ Newsise: Scientists Explain and Improve Upon 'Enigmatic' Probability Formula] ]

The method even gained some literary fame due to the Robert Harris novel "Enigma".

In the 1990s, Geoffrey Sampson worked with William A. Gale of AT&T, to create and implement a simplified and easier-to-use variant of the Good–Turing method [ [http://www.grsampson.net/RGoodTur.html Geoffrey Sampson: Good–Turing Frequency Estimation] ] described below.

The method

Let us define some data structures and notation.

Assume we have observed "X" distinct species, numbered "x" = 1, ... , "X".

The frequency vector R_x gives the number of objects we have observed for species "x".

The frequency of frequencies vector N_r shows how many times the frequency "r" occurs in the vector R_x. For example N_1 is the number of species for which only 1 object was observed.

Note that the total number of objects observed, "N", can be found from N = sum r N_r.

The first step in the calculation is to find an estimate of the total probability of unseen objects.This estimate is p_0 = N_1 / N .

The next step is to find an estimate of probability for objects which were seen "r" times. This estimate is p_r = frac{(r+1) S(N_{r+1})}{N S(N_r)} (see also empirical Bayes method).

The notation S( ) means the smoothed or adjusted value of the frequency shown in parenthesis. An overview of how to perform this smoothing will now be given.

We would like to make a plot of log N_r versus log r but this is problematic because for large "r" many N_r will be zero. Instead we plot log Z_r versus log r where "Z" is defined as

:Z_r = frac{N_r}{0.5(t-q)}

where "q", "r" and "t" are consecutive subscripts having N_q, N_r, N_t non-zero.

A linear regression is then fit to the log-log plot. For small values of r we take S(N_r) = N_r(that is, no smoothing is performed), while for large values on "r", S(N_r) is read off theregression line. An automatic procedure (not described here) specifies at what point the switch from no smoothing to linear smoothing should take place.

Code for the method is available in the public domain.

References

See also

* Ewens sampling formula
* Pseudocount


Wikimedia Foundation. 2010.

Игры ⚽ Нужен реферат?

Look at other dictionaries:

  • Alan Turing — Turing redirects here. For other uses, see Turing (disambiguation). Alan Turing Turing at the time of his election to Fellowship of the Royal Society …   Wikipedia

  • I. J. Good — Infobox Person image size = 150px name = I. J. Good birth date = birth date and age|1916|12|9|df=y birth place = flagicon|UK London, England death date = death place = occupation = Statistician and cryptographerIrving John (Jack) Good (born 9… …   Wikipedia

  • Cryptanalysis of the Enigma — enabled the western Allies in World War II to read substantial amounts of secret Morse coded radio communications of the Axis powers that had been enciphered using Enigma machines. This yielded military intelligence which, along with that from… …   Wikipedia

  • List of statistics topics — Please add any Wikipedia articles related to statistics that are not already on this list.The Related changes link in the margin of this page (below search) leads to a list of the most recent changes to the articles listed below. To see the most… …   Wikipedia

  • Empirical Bayes method — In statistics, empirical Bayes methods are a class of methods which use empirical data to evaluate / approximate the conditional probability distributions that arise from Bayes theorem. These methods allow one to estimate quantities… …   Wikipedia

  • List of mathematics articles (G) — NOTOC G G₂ G delta space G networks Gδ set G structure G test G127 G2 manifold G2 structure Gabor atom Gabor filter Gabor transform Gabor Wigner transform Gabow s algorithm Gabriel graph Gabriel s Horn Gain graph Gain group Galerkin method… …   Wikipedia

  • Cellular neural network — Cellular neural networks (CNN) are a parallel computing paradigm similar to neural networks, with the difference that communication is allowed between neighbouring units only. Typical applications include image processing, analyzing 3D surfaces,… …   Wikipedia

  • Minimum description length — The minimum description length (MDL) principle is a formalization of Occam s Razor in which the best hypothesis for a given set of data is the one that leads to the best compression of the data. MDL was introduced by Jorma Rissanen in 1978. It is …   Wikipedia

  • Theoreme de Cox-Jaynes — Théorème de Cox Jaynes Le théorème de Cox Jaynes (1946) est une codification des processus d apprentissage à partir d un certain ensemble de postulats. Cette codification se trouve coïncider au terme de ces considérations avec celle… …   Wikipédia en Français

  • Théorème de Cox-Jaynes — Le théorème de Cox Jaynes (1946) est une codification des processus d apprentissage à partir d un certain ensemble de postulats. Cette codification se trouve coïncider au terme de ces considérations avec celle historiquement d origine toute… …   Wikipédia en Français

Share the article and excerpts

Direct link
Do a right-click on the link above
and select “Copy Link”