# Information theory and measure theory

﻿
Information theory and measure theory

= Measures in information theory =

Many of the formulas in information theory have separate versions for continuous and discrete cases, i.e. integrals for the continuous case and sums for the discrete case. These versions can often be generalized using measure theory. For discrete random variables, probability mass functions can be considered density functions with respect to the counting measure, thus requiring only basic discrete mathematics for what can be considered, in a measure theory context, integration. Because the same integration expression is used for the continuous case, which uses basic calculus, the same concepts and expressions can be used for both discrete and continuous cases. Consider the formula for the differential entropy of a continuous random variable with probability density function $f\left(x\right)$::$h\left(X\right) = -int_X f\left(x\right) log f\left(x\right) ,dx,$This can usually be taken to be:$h\left(X\right) = -int_X f\left(x\right) log f\left(x\right) ,dmu\left(x\right),$where μ is the Lebesgue measure. But if instead, "X" is discrete, "f" is a probability mass function, and ν is the counting measure, we can write::$H\left(X\right) = -int_X f\left(x\right) log f\left(x\right) ,d u\left(x\right) = -sum_\left\{xin X\right\} f\left(x\right) log f\left(x\right).$The integral expression and the general concept is identical to the continuous case; the only difference is the measure used. In both cases the probability density function "f" is the Radon–Nikodym derivative of the probability measure with respect to the measure against which the integral is taken.

If $mathbb P$ is the probability measure on "X", then the integral can also be taken directly with respect to $mathbb P$::$h\left(X\right) = -int_X log frac\left\{mathrm d mathbb P\right\}\left\{mathrm dmu\right\} ,dmathbb P,$

If instead of the underlying measure μ we take another probability measure $mathbb Q$, we are led to the Kullback–Leibler divergence: let $mathbb P$ and $mathbb Q$ be probability measures over the same space. Then if $mathbb P$ is absolutely continuous with respect to $mathbb Q$, written $mathbb P << mathbb Q,$ the Radon–Nikodym derivative $frac\left\{mathrm dmathbb P\right\}\left\{mathrm dmathbb Q\right\}$ exists and the Kullback–Leibler divergence can be expressed in its full generality:

:$D_mathrm\left\{KL\right\}\left(mathbb P | mathbb Q\right) = int_\left\{mathrm\left\{supp\right\}mathbb P\right\} frac\left\{mathrm dmathbb P\right\}\left\{mathrm dmathbb Q\right\} log frac\left\{mathrm dmathbb P\right\}\left\{mathrm dmathbb Q\right\} ,d mathbb Q = int_\left\{mathrm\left\{supp\right\}mathbb P\right\} log frac\left\{mathrm dmathbb P\right\}\left\{mathrm dmathbb Q\right\} ,d mathbb P,$where the integral runs over the support of $mathbb P.$ Note that we have dropped the negative sign: the Kullback–Leibler divergence is always non-negative due to Gibbs' inequality.

Entropy as a "measure"

There is an analogy between Shannon's basic "measures" of the information content of random variables and a measure over sets. Namely the joint entropy, conditional entropy, and mutual information can be considered as the measure of a set union, set difference, and set intersection, respectively (Reza pp. 106-108).

If we associate the existence of abstract sets $ilde X$ and $ilde Y$ to arbitrary discrete random variables "X" and "Y", somehow representing the information borne by "X" and "Y", respectively, such that:
* $mu\left( ilde X cap ilde Y\right) = 0$ whenever "X" and "Y" are unconditionally independent, and
* $ilde X = ilde Y$ whenever "X" and "Y" are such that either one is completely determined by the other (i.e. by a bijection);

where $mu$ is a signed measure over these sets, and we set:

: $H\left(X\right) = mu\left( ilde X\right),$: $H\left(Y\right) = mu\left( ilde Y\right),$: $H\left(X,Y\right) = mu\left( ilde X cup ilde Y\right),$: : $I\left(X;Y\right) = mu\left( ilde X cap ilde Y\right);$

we find that Shannon's "measure" of information content satisfies all the postulates and basic properties of a formal measure over sets, as commonly illustrated in an information diagram. This can be a handy mnemonic device in some situations, e.g.

:

Because the entropy, joint entropy, conditional entropy, and bivariate mutual information of discrete random variables are all nonnegative, many basic inequalities in information theory (among no more than two random variables) can be derived from this formulation by considering the measure μ to be nonnegative.

Multivariate mutual information

Certain extensions to the definitions of Shannon's basic measures of information are necessary to deal with the σ-algebra generated by the sets that would be associated to three or more arbitrary random variables. (See Reza pp. 106-108 for an informal but rather complete discussion.) Namely $H\left(X,Y,Z,cdots\right)$ needs to be defined in the obvious way as the entropy of a joint distribution, and a multivariate mutual information $I\left(X;Y;Z;cdots\right)$ defined in a suitable manner so that we can set:

: $H\left(X,Y,Z,cdots\right) = mu\left( ilde X cup ilde Y cup ilde Z cup cdots\right),$: $I\left(X;Y;Z;cdots\right) = mu\left( ilde X cap ilde Y cap ilde Z cap cdots\right);$

in order to define the (signed) measure over the whole σ-algebra. There is no single universally accepted definition for the mutivariate mutual information, but the one that corresponds here to the measure of a set intersection is due to Fano (Srinivasa). The definition is recursive. As a base case the mutual information of a single random variable is defined to be its entropy: $I\left(X\right)=H\left(X\right)$. Then for $ngeq 2$ we set: $I\left(X_1; cdots; X_n\right) = I\left(X_1; cdots; X_\left\{n-1\right\}\right) - I\left(X_1; cdots; X_\left\{n-1\right\}|X_n\right),$where the conditional mutual information is defined as: The first step in the recursion yields Shannon's definition $I\left(X_1;X_2\right)=H\left(X_1\right)-H\left(X_1|X_2\right).$ It is interesting to note that the mutual information (same as interaction information but for a change in sign) of three or more random variables can be negative as well as positive: Let "X" and "Y" be two independent fair coin flips, and let "Z" be their exclusive or. Then $I\left(X;Y;Z\right) = - 1$ bit.

Many other variations are possible for three or more random variables: for example, $I\left(X,Y;Z\right)$ is the mutual information of the joint distribution of "X" and "Y" relative to "Z", and can be interpreted as $mu\left(\left( ilde X cup ilde Y\right) cap ilde Z\right).$ Many more complicated expressions can be built this way, and still have meaning, e.g. $I\left(X,Y;Z|W\right),$ or $H\left(X,Z|W,Y\right).$

References

* Fazlollah M. Reza. "An Introduction to Information Theory". New York: McGraw-Hill 1961. New York: Dover 1994. ISBN 0-486-68210-2
* Sunil Srinivasa. "A Review on Multivariate Mutual Information". Notre Dame EE-80653 Information Theory Tutorials, Fall 2005. [http://www.nd.edu/~jnl/ee80653/tutorials/sunil.pdf PDF] .
* R. W. Yeung, "On entropy, information inequalities, and Groups." [http://user-www.ie.cuhk.edu.hk/~ITIP/online/tutorial.ps PS]

Wikimedia Foundation. 2010.

### Look at other dictionaries:

• Information theory — Not to be confused with Information science. Information theory is a branch of applied mathematics and electrical engineering involving the quantification of information. Information theory was developed by Claude E. Shannon to find fundamental… …   Wikipedia

• information theory — the mathematical theory concerned with the content, transmission, storage, and retrieval of information, usually in the form of messages or data, and esp. by means of computers. [1945 50] * * * ▪ mathematics Introduction       a mathematical… …   Universalium

• Information flow (information theory) — Information flow in an information theoretical context is the transfer of information from a variable h to a variable l in a given process. The measure of information flow, p , is defined as the uncertainty before the process started minus the… …   Wikipedia

• Entropy (information theory) — In information theory, entropy is a measure of the uncertainty associated with a random variable. The term by itself in this context usually refers to the Shannon entropy, which quantifies, in the sense of an expected value, the information… …   Wikipedia

• Algorithmic information theory — is a subfield of information theory and computer science that concerns itself with the relationship between computation and information. According to Gregory Chaitin, it is the result of putting Shannon s information theory and Turing s… …   Wikipedia

• Semiotic information theory — considers the information content of signs and expressions as it is conceived within the semiotic or sign relational framework developed by Charles Sanders Peirce.Information and uncertaintyThe good of information is its use in reducing our… …   Wikipedia

• History of information theory — The decisive event which established the discipline of information theory, and brought it to immediate worldwide attention, was the publication of Claude E. Shannon s classic paper A Mathematical Theory of Communication in the Bell System… …   Wikipedia

• Entropy in thermodynamics and information theory — There are close parallels between the mathematical expressions for the thermodynamic entropy, usually denoted by S , of a physical system in the statistical thermodynamics established by Ludwig Boltzmann and J. Willard Gibbs in the 1870s; and the …   Wikipedia

• Gambling and information theory — Statistical inference might be thought of as gambling theory applied to the world around. The myriad applications for logarithmic information measures tell us precisely how to take the best guess in the face of partial information [Jaynes, E.T.… …   Wikipedia

• Redundancy (information theory) — Redundancy in information theory is the number of bits used to transmit a message minus the number of bits of actual information in the message. Informally, it is the amount of wasted space used to transmit certain data. Data compression is a way …   Wikipedia