Hypergeometric distribution


Hypergeometric
parameters: \begin{align}N&\in \left\{1,2,\dots\right\} \\
                                 m&\in \left\{0,1,2,\dots,N\right\} \\
                                 n&\in \left\{1,2,\dots,N\right\}\end{align}\,
support: \scriptstyle{k\, \in\, \left\{\max{(0,\, n+m-N)},\, \dots,\, \min{(m,\, n )}\right\}}\,
pmf: {{{m \choose k} {{N-m} \choose {n-k}}}\over {N \choose n}}
cdf: {\sum_{i=0}^k}{{{m \choose i} {{N-m} \choose {n-i}}}\over {N \choose n}}
mean: n {m\over N}
mode: \left \lfloor \frac{(n+1)(m+1)}{N+2} \right \rfloor
variance: n{m\over N}{(N-m)\over N}{N-n\over N-1}
skewness: \frac{(N-2m)(N-1)^\frac{1}{2}(N-2n)}{[nm(N-m)(N-n)]^\frac{1}{2}(N-2)}
ex.kurtosis: \frac{1}{n m(N-m)(N-n)(N-2)(N-3)}\cdot\Big[(N-1)N^{2}\Big(N(N+1)-6m(N-m)-6n(N-n)\Big)+ 6 n m (N-m)(N-n)(5N-6)\Big]

mgf: \frac{{N-m \choose n} \,{}_2F_1(-n, -m; N - m - n + 1; e^{t})}{{N \choose n}}
cf: \frac{{N-m \choose n} \,{}_2F_1(-n, -m; N - m - n + 1; e^{it})}{{N \choose n}}

In probability theory and statistics, the hypergeometric distribution is a discrete probability distribution that describes the probability of k successes in n draws from a finite population without replacement. By contrast, the binomial distribution describes the probability of k successes in n draws with replacement.

Definition

A random variable X follows the hypergeometric distribution if its probability mass function is given by:[1]

 P(X=k) = {{{m \choose k} {{N-m} \choose {n-k}}}\over {N \choose n}}

where

  • N is the population size
  • m is the number of success states in the population, so that m \over N is the initial probability of success
  • n is the number of draws
  • k is the number of observed successes
  • \textstyle {a \choose b} is a binomial coefficient

It is positive when \max(0, n+m-N) \leq k \leq \min(m,n).
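
The definition can be sketched in a few lines of Python. This is only an illustration: the function name `hypergeom_pmf` and the example values N = 50, m = 5, n = 10 are assumptions chosen here, and exact rational arithmetic is used so that no rounding enters.

```python
from fractions import Fraction
from math import comb


def hypergeom_pmf(k, N, m, n):
    """P(X = k) for n draws without replacement from N items, m of which are successes."""
    if k < max(0, n + m - N) or k > min(m, n):
        return Fraction(0)  # k lies outside the support
    return Fraction(comb(m, k) * comb(N - m, n - k), comb(N, n))


# Illustrative values (an assumption, not prescribed by the definition).
N, m, n = 50, 5, 10
support = range(max(0, n + m - N), min(m, n) + 1)
total = sum(hypergeom_pmf(k, N, m, n) for k in support)
print(total)  # the probabilities over the support add up to 1
```

Returning `Fraction(0)` outside the support mirrors the statement that the mass function is positive only for max(0, n+m-N) ≤ k ≤ min(m, n).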

Combinatorial identities

As one would expect, the probabilities sum to 1:

 \sum_{0\leq k\leq m}    { {m \choose k} { N-m \choose n-k} \over {N \choose n} }  = 1

This is essentially Vandermonde's identity from combinatorics.

Also note that the following identity holds:

 {{{m \choose k} {{N-m} \choose {n-k}}}\over {N \choose n}} = {{{n \choose k} {{N-n} \choose {m-k}}}\over {N \choose m}}.

This follows clearly from the symmetry of the problem, but it can also be shown easily by expressing the binomial coefficients in terms of factorials, and rearranging the latter.
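
Both identities are easy to check by brute force for small parameters. The sketch below is a numerical sanity check rather than a proof; the helper names and the bound N ≤ 12 are arbitrary choices made for illustration.

```python
from math import comb


def vandermonde_holds(N, m, n):
    # sum_k C(m, k) * C(N - m, n - k) == C(N, n); k is limited to 0..min(m, n)
    # so that both lower arguments passed to comb stay non-negative.
    lhs = sum(comb(m, k) * comb(N - m, n - k) for k in range(min(m, n) + 1))
    return lhs == comb(N, n)


def swap_identity_holds(k, N, m, n):
    # Cross-multiplied form of
    # C(m,k) C(N-m,n-k) / C(N,n) == C(n,k) C(N-n,m-k) / C(N,m)
    left = comb(m, k) * comb(N - m, n - k) * comb(N, m)
    right = comb(n, k) * comb(N - n, m - k) * comb(N, n)
    return left == right


all_ok = all(
    vandermonde_holds(N, m, n)
    and all(swap_identity_holds(k, N, m, n) for k in range(min(m, n) + 1))
    for N in range(1, 13)
    for m in range(N + 1)
    for n in range(1, N + 1)
)
print(all_ok)
```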

Application and example

The classical application of the hypergeometric distribution is sampling without replacement. Think of an urn with two types of marbles, black ones and white ones. Define drawing a white marble as a success and drawing a black marble as a failure (analogous to the binomial distribution). If the variable N describes the number of all marbles in the urn (see contingency table below) and m describes the number of white marbles, then N − m corresponds to the number of black marbles. In this example X is the random variable whose outcome is k, the number of white marbles actually drawn in the experiment. This situation is illustrated by the following contingency table:

                 drawn     not drawn          total
white marbles    k         m − k              m
black marbles    n − k     N + k − n − m      N − m
total            n         N − n              N

Now, assume (for example) that there are 5 white and 45 black marbles in the urn. Standing next to the urn, you close your eyes and draw 10 marbles without replacement. What is the probability that exactly 4 of the 10 are white? Note that although we are looking at success/failure, the data are not accurately modeled by the binomial distribution, because the probability of success on each trial is not the same, as the size of the remaining population changes as we remove each marble.

This problem is summarized by the following contingency table:

                 drawn         not drawn               total
white marbles    k = 4         m − k = 1               m = 5
black marbles    n − k = 6     N + k − n − m = 39      N − m = 45
total            n = 10        N − n = 40              N = 50

The probability of drawing exactly k white marbles can be calculated by the formula

 P(X=k) = f(k;N,m,n) = {{{m \choose k} {{N-m} \choose {n-k}}}\over {N \choose n}}.

Hence, in this example we calculate

 P(X=4) = f(4;50,5,10) = {{{5 \choose 4} {{45} \choose {6}}}\over {50 \choose 10}} = {5\cdot 8145060\over 10272278170} = 0.003964583\dots.

Intuitively we would expect it to be even more unlikely for all 5 marbles to be white.

 P(X=5) = f(5;50,5,10) = {{{5 \choose 5} {{45} \choose {5}}}\over {50 \choose 10}} = {1\cdot 1221759\over 10272278170} = 0.0001189375\dots.

As expected, drawing 5 white marbles is roughly 33 times less likely than drawing 4.
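
The arithmetic of this example can be reproduced directly with binomial coefficients. The variable names below are illustrative; the numbers are those of the urn with 5 white and 45 black marbles and 10 draws.

```python
from math import comb

N, m, n = 50, 5, 10  # marbles in the urn, white marbles, marbles drawn

# P(X = 4) and P(X = 5) straight from the formula for f(k; N, m, n)
p4 = comb(m, 4) * comb(N - m, n - 4) / comb(N, n)
p5 = comb(m, 5) * comb(N - m, n - 5) / comb(N, n)

# How many times more likely 4 white marbles are than 5
ratio = p4 / p5

print(p4, p5, ratio)
```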

Symmetries

Swapping the roles of black and white marbles:

f(k;N,m,n) = f(n − k; N, N − m, n)

Swapping the roles of drawn and not drawn marbles:

f(k;N,m,n) = f(m − k; N, m, N − n)

Swapping the roles of white and drawn marbles:

f(k;N,m,n) = f(k;N,n,m)
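
The three symmetries, read as f(k;N,m,n) = f(n − k; N, N − m, n), f(k;N,m,n) = f(m − k; N, m, N − n) and f(k;N,m,n) = f(k;N,n,m), can be checked exactly for a concrete case. The sketch below uses the urn example's parameters; the function `f` is a hypothetical helper written for this check.

```python
from fractions import Fraction
from math import comb


def f(k, N, m, n):
    """Hypergeometric probability f(k; N, m, n), zero outside the support."""
    if k < max(0, n + m - N) or k > min(m, n):
        return Fraction(0)
    return Fraction(comb(m, k) * comb(N - m, n - k), comb(N, n))


N, m, n = 50, 5, 10
checks = []
for k in range(0, min(m, n) + 1):
    checks.append(f(k, N, m, n) == f(n - k, N, N - m, n))  # swap black and white
    checks.append(f(k, N, m, n) == f(m - k, N, m, N - n))  # swap drawn and not drawn
    checks.append(f(k, N, m, n) == f(k, N, n, m))          # swap white and drawn

symmetries_hold = all(checks)
print(symmetries_hold)
```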

Symmetry application

The metaphor of defective and drawn objects depicts an application of the hypergeometric distribution in which the interchange symmetry between n and m is not of foremost concern. Here is an alternative metaphor which brings this symmetry into sharper focus, as there are also applications where it serves no purpose to distinguish n from m.

Suppose you have a set of N children who have been identified with an unusual bone marrow antigen. The doctor wishes to conduct a heredity study to determine the inheritance pattern of this antigen. For the purposes of this study, the doctor wishes to draw tissue from the bone marrow from the biological mother and biological father of each child. This is an uncomfortable procedure, and not all the mothers and fathers will agree to participate. Of the mothers, m participate and N-m decline. Of the fathers, n participate and N-n decline.

We assume here that the decisions made by the mothers are independent of the decisions made by the fathers. Under this assumption, the doctor, who is given n and m, wishes to estimate k, the number of children for whom both parents have agreed to participate. The hypergeometric distribution gives the distribution of k. It is not obvious why the doctor would know n and m but not k; perhaps n and m are dictated by the experimental design, while the experimenter is left blind to the true value of k.

It is important to recognize that for given N, n and m a single degree of freedom partitions N into four sub-populations:

  1. Children where both parents participate
  2. Children where only the mother participates
  3. Children where only the father participates and
  4. Children where neither parent participates.

Knowing any one of these four values determines the other three by simple arithmetic relations. For this reason, each of these quadrants is governed by an equivalent hypergeometric distribution. The mean, mode, and values of k contained within the support differ from one quadrant to another, but the size of the support, the variance, and the magnitudes of the higher-order central moments do not.

For the purpose of this study, it might make no difference to the doctor whether the mother participates or the father participates. If this happens to be true, the doctor will view the result as a three-way partition: children where both parents participate, children where one parent participates, children where neither parent participates. Under this view, the last remaining distinction between n and m has been eliminated. The distribution where one parent participates is the sum of the distributions where either parent alone participates.
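
The claim that the four quadrants share a variance while their means differ can be illustrated numerically. The sketch below assumes illustrative values N = 20, m = 8, n = 5 and expresses each quadrant count as a function of k (both: k; mother only: m − k; father only: n − k; neither: N − m − n + k); all names are hypothetical.

```python
from fractions import Fraction
from math import comb

N, m, n = 20, 8, 5  # children, participating mothers, participating fathers


def pmf(k):
    """Probability that exactly k children have both parents participating."""
    return Fraction(comb(m, k) * comb(N - m, n - k), comb(N, n))


ks = range(max(0, n + m - N), min(m, n) + 1)

# Each quadrant count is an affine function of k.
quadrants = {
    "both": lambda k: k,
    "mother only": lambda k: m - k,
    "father only": lambda k: n - k,
    "neither": lambda k: N - m - n + k,
}

stats = {}
for name, count in quadrants.items():
    mean = sum(count(k) * pmf(k) for k in ks)
    var = sum((count(k) - mean) ** 2 * pmf(k) for k in ks)
    stats[name] = (mean, var)
    print(name, mean, var)
```

Because every quadrant count is k shifted or reflected, the spread is unchanged and only the location moves.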

Symmetry and sampling

To express how the symmetry of the clinical metaphor degenerates to the asymmetry of the sampling language used in the drawn/defective metaphor, we will restate the clinical metaphor in the abstract language of decks and cards. We begin with a dealer who holds two prepared decks of N cards. The decks are labelled left and right. The left deck was prepared to hold n red cards, and N-n black cards; the right deck was prepared to hold m red cards, and N-m black cards.

These two decks are dealt out face down to form N hands. Each hand contains one card from the left deck and one card from the right deck. If we determine the number of hands that contain two red cards, by symmetry relations we will necessarily also know the hypergeometric distributions governing the other three quadrants: hand counts for red/black, black/red, and black/black. How many cards must be turned over to learn the total number of red/red hands? Which cards do we need to turn over to accomplish this? These are questions about possible sampling methods.

One approach is to begin by turning over the left card of each hand. For each hand showing a red card on the left, we then also turn over the right card in that hand. For any hand showing a black card on the left, we do not need to reveal the right card, as we already know this hand does not count toward the total of red/red hands. Our treatment of the left and right decks no longer appears symmetric: one deck was fully revealed while the other deck was partially revealed. However, we could just as easily have begun by revealing all cards dealt from the right deck, and partially revealed cards from the left deck.

In fact, the sampling procedure need not prioritize one deck over the other in the first place. Instead, we could flip a coin for each hand, turning over the left card on heads, and the right card on tails, leaving each hand with one card exposed. For every hand with a red card exposed, we reveal the companion card. This will suffice to allow us to count the red/red hands, even though under this sampling procedure neither the left nor right deck is fully revealed.

By another symmetry, we could also have elected to determine the number of black/black hands rather than the number of red/red hands, and discovered the same distributions by that method.

The symmetries of the hypergeometric distribution provide many options in how to conduct the sampling procedure to isolate the degree of freedom governed by the hypergeometric distribution. Even if the sampling procedure appears to treat the left deck differently from the right deck, or governs choices by red cards rather than black cards, it is important to recognize that the end result is essentially the same.
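
The coin-flip procedure can be sketched as a small simulation. Everything here is illustrative: the deck sizes, the seed, and the function names are assumptions, and the point is only that the procedure recovers the red/red count without fully revealing either deck.

```python
import random


def deal(N, n_red_left, m_red_right, rng):
    """Shuffle a left and a right deck and pair them into N two-card hands."""
    left = ["red"] * n_red_left + ["black"] * (N - n_red_left)
    right = ["red"] * m_red_right + ["black"] * (N - m_red_right)
    rng.shuffle(left)
    rng.shuffle(right)
    return list(zip(left, right))


def count_red_red_by_coin_flip(hands, rng):
    """Reveal one card per hand by coin flip; reveal its companion only if it is red."""
    red_red = 0
    revealed = 0
    for left, right in hands:
        first, second = (left, right) if rng.random() < 0.5 else (right, left)
        revealed += 1
        if first == "red":
            revealed += 1
            if second == "red":
                red_red += 1
    return red_red, revealed


rng = random.Random(0)  # fixed seed so the run is repeatable
hands = deal(20, 5, 8, rng)
true_count = sum(1 for left, right in hands if left == "red" and right == "red")
count, revealed = count_red_red_by_coin_flip(hands, rng)
print(count, true_count, revealed)
```

A hand whose first exposed card is black cannot be red/red, which is why its companion card never needs to be turned over.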

Relationship to Fisher's exact test

The test based on the hypergeometric distribution (the hypergeometric test) is identical to the corresponding one-tailed version of Fisher's exact test. Reciprocally, the p-value of a two-sided Fisher's exact test can be calculated as the sum of two appropriate hypergeometric tests (for more information see [2]).
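
A minimal sketch of a one-tailed hypergeometric test, assuming the upper tail P(X ≥ k) is the tail of interest and reusing the urn example's table (N = 50, m = 5, n = 10, observed k = 4). The function name is hypothetical.

```python
from fractions import Fraction
from math import comb


def hypergeom_upper_tail(k, N, m, n):
    """P(X >= k): probability of at least k successes in n draws without replacement."""
    lo = max(k, 0, n + m - N)
    hi = min(m, n)
    favourable = sum(comb(m, i) * comb(N - m, n - i) for i in range(lo, hi + 1))
    return Fraction(favourable, comb(N, n))


p_value = hypergeom_upper_tail(4, 50, 5, 10)
print(float(p_value))
```

If SciPy is available, `scipy.stats.fisher_exact` with `alternative="greater"` applied to the 2 × 2 table [[4, 1], [6, 39]] should give a comparable one-sided p-value; that comparison is a suggestion and is not run here.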

Order of draws

The probability of drawing any sequence of white and black marbles (the hypergeometric distribution) depends only on the number of white and black marbles, not on the order in which they appear; i.e., it is an exchangeable distribution. As a result, the probability of drawing a white marble in the ith draw is

 P(W_i) = {\frac{m}{N}}

This can be shown by induction. First, it is certainly true for the first draw that:

 P(W_1) = {\frac{m}{N}}.

Also, we can show that  P(W_{n+1})= {\frac{m}{N}}  by writing:


\begin{align}
 P(W_{n+1}) & = {\sum_{k=0}^n}P(W_{n+1}|k)f(k;N,m,n)\\
 & = {\sum_{k=0}^n}\frac{m-k}{N-n}f(k;N,m,n) \\
 & = {\sum_{k=0}^n}\frac{m-k}{N-n}\frac{\binom mk \binom {N-m} {n-k}}{\binom Nn} \\
 & = \frac{1}{(N-n)\binom Nn} \left \{ m\sum_{k=0}^n \binom mk \binom {N-m} {n-k} - \sum_{k=0}^n k\binom mk \binom {N-m} {n-k}\right \} \\
 & = \frac{1}{(N-n)\binom Nn}\left\{ m\binom Nn - \sum_{k=1}^n k\frac{m}{k} \binom {m-1}{k-1} \binom {N-m} {n-k}\right \} \\
 & = \frac{m}{(N-n)\binom Nn}\left\{ \binom Nn - \sum_{k=1}^n \binom {m-1}{k-1} \binom {N-1-(m-1)} {n-1-(k-1)}\right \} \\
 & = \frac{m}{(N-n)\binom Nn}\left\{ \binom Nn - \binom {N-1}{n-1}\right \} \\
 & = \frac{m}{(N-n)\binom Nn}\left\{ \binom Nn - \frac{n}{N}\binom Nn\right \} \\
 & = \frac{m}{(N-n)}\left\{ 1 - \frac{n}{N} \right\} = \frac{m}{N},
\end{align}

which, since n is arbitrary, establishes the result for every draw i.


A simpler proof than the one above is the following:

By symmetry, each of the N marbles has the same chance of being drawn in the i-th draw. In addition, by the sum rule, the chance of drawing a white marble in the i-th draw can be calculated by summing the chances of each individual white marble being drawn in the i-th draw. These two observations imply that if, for example, the number of white marbles at the outset is 3 times the number of black marbles, then the chance of a white marble being drawn in the i-th draw is also 3 times that of a black marble being drawn in the i-th draw. In the general case we have m white marbles and N − m black marbles at the outset. So

 P(W_i) = {\frac{m}{N-m}}P(B_i).

Since in the i-th draw either a white or a black marble needs to be drawn, we also know that

 P(W_i) + P(B_i) = 1.

Combining these two equations immediately yields

 P(W_i) = {\frac{m}{N}}.
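
For a small urn the result can be confirmed by listing every possible drawing order. The sketch below assumes N = 5 marbles of which m = 2 are white (an arbitrary small case) and treats each marble as distinct, so that all N! orders are equally likely.

```python
from fractions import Fraction
from itertools import permutations


def white_probability_by_position(N, m):
    """Exact P(W_i) for i = 1..N, by enumerating all orders of N distinct marbles."""
    colours = ["W"] * m + ["B"] * (N - m)
    orders = list(permutations(range(N)))  # every drawing order is equally likely
    probs = []
    for i in range(N):
        whites = sum(1 for order in orders if colours[order[i]] == "W")
        probs.append(Fraction(whites, len(orders)))
    return probs


probs = white_probability_by_position(5, 2)
print(probs)  # every position has probability m/N of showing a white marble
```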

Related distributions

Let X ~ Hypergeometric(m, N, n) and p = m / N.

  • Let Y have a binomial distribution with parameters n and p; this models the number of successes in the analogous sampling problem with replacement. If N and m are large compared to n and p is not close to 0 or 1, then X and Y have similar distributions, i.e., P(X \le k) \approx P(Y \le k).
  • If n is large, N and m are large compared to n, and p is not close to 0 or 1, then

P(X \le k) \approx \Phi \left( \frac{k-n p}{\sqrt{n p (1-p)}} \right)

where Φ is the standard normal distribution function.

  • If the probabilities of drawing a white or a black marble are not equal (e.g. because their sizes differ), then X has a noncentral hypergeometric distribution.
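
The binomial and normal approximations can be compared with the exact distribution function numerically. The sketch below assumes illustrative values N = 100000, m = 40000, n = 200 and k = 80; the helper names are hypothetical, and the normal formula is applied exactly as written above, without a continuity correction.

```python
from math import comb, erf, sqrt


def hypergeom_cdf(k, N, m, n):
    """Exact P(X <= k) for the hypergeometric distribution."""
    lo = max(0, n + m - N)
    hi = min(k, m, n)
    return sum(comb(m, i) * comb(N - m, n - i) for i in range(lo, hi + 1)) / comb(N, n)


def binom_cdf(k, n, p):
    """P(Y <= k) for a binomial distribution with parameters n and p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(0, min(k, n) + 1))


def normal_cdf(x):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))


N, m, n, k = 100_000, 40_000, 200, 80
p = m / N
exact = hypergeom_cdf(k, N, m, n)
binomial = binom_cdf(k, n, p)
normal = normal_cdf((k - n * p) / sqrt(n * p * (1 - p)))
print(exact, binomial, normal)
```

With these values the binomial figure tracks the exact one closely, while the uncorrected normal figure is coarser because it ignores the discreteness of X.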

Multivariate hypergeometric distribution

Multivariate Hypergeometric Distribution
parameters: c \in \mathbb{N}
(m_1,\ldots,m_c) \in \mathbb{N}^c
N = \sum_{i=1}^c m_i
n \in \{0,1,\dots,N\}
support: \left\{ \mathbf{k} \in \mathbb{Z}_{0+}^c \, : \, \sum_{i=1}^{c} k_i = n \right\}
pmf: \frac{\prod_{i=1}^{c} \binom{m_i}{k_i}}{\binom{N}{n}}
mean: E(X_i) = \frac{n m_i}{N}
variance: \text{Var}(X_i) = \frac{m_i}{N} \left(1-\frac{m_i}{N}\right) n \frac{N-n}{N-1}
\text{Cov}(X_i,X_j) = -\frac{n m_i m_j}{N^2} \frac{N-n}{N-1}

The model of an urn with black and white marbles can be extended to the case where there are more than two colors of marbles. If there are m_i marbles of color i in the urn and you take n marbles at random without replacement, then the number of marbles of each color in the sample (k_1, k_2, ..., k_c) has the multivariate hypergeometric distribution. This has the same relationship to the multinomial distribution that the hypergeometric distribution has to the binomial distribution: the multinomial distribution is the "with-replacement" distribution and the multivariate hypergeometric is the "without-replacement" distribution.

The properties of this distribution are given in the adjacent table, where c is the number of different colors and N=\sum_{i=1}^{c} m_i is the total number of marbles.

Example

Suppose there are 5 black, 10 white, and 15 red marbles in an urn. You reach in and randomly select six marbles without replacement. What is the probability that you pick exactly two of each color?

 P(2\text{ black}, 2\text{ white}, 2\text{ red}) = {{{5 \choose 2}{10 \choose 2} {15 \choose 2}}\over {30 \choose 6}} \approx 0.079575596816976

Note: When picking the six marbles without replacement, the expected number of black marbles is 6*(5/30) = 1, the expected number of white marbles is 6*(10/30) = 2, and the expected number of red marbles is 6*(15/30) = 3.
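
The multivariate mass function and the expected counts in this example can be reproduced as follows. The function name and argument order are assumptions made for this sketch.

```python
from fractions import Fraction
from math import comb, prod


def multivariate_hypergeom_pmf(ks, ms):
    """P(k_1, ..., k_c) when sum(ks) marbles are drawn from an urn with ms[i] marbles of color i."""
    if any(k < 0 or k > m for k, m in zip(ks, ms)):
        return Fraction(0)  # cannot draw more marbles of a color than the urn holds
    N, n = sum(ms), sum(ks)
    return Fraction(prod(comb(m, k) for m, k in zip(ms, ks)), comb(N, n))


urn = (5, 10, 15)  # black, white, red
p = multivariate_hypergeom_pmf((2, 2, 2), urn)
means = [Fraction(6 * m, sum(urn)) for m in urn]  # E(X_i) = n m_i / N with n = 6
print(float(p), means)
```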


References

  1. ^ Rice, John A. (2007). Mathematical Statistics and Data Analysis (3rd ed.). Duxbury Press. p. 42.
  2. ^ K. Preacher and N. Briggs. "Calculation for Fisher's Exact Test: An interactive calculation tool for Fisher's exact probability test for 2 x 2 tables (interactive page)". http://quantpsy.org/fisher/fisher.htm. 

Wikimedia Foundation. 2010.
