Expectation-maximization algorithm

Expectation-maximization algorithm

An expectation-maximization (EM) algorithm is used in statistics for finding maximum likelihood estimates of parameters in probabilistic models, where the model depends on unobserved latent variables. EM alternates between performing an expectation (E) step, which computes an expectation of the likelihood by including the latent variables as if they were observed, and a maximization (M) step, which computes the maximum likelihood estimates of the parameters by maximizing the expected likelihood found on the E step. The parameters found on the M step are then used to begin another E step, and the process is repeated.


The EM algorithm was explained and given its name in a classic 1977 paper by Arthur Dempster, Nan Laird, and Donald Rubin in the Journal of the Royal Statistical Society [Arthur Dempster, Nan Laird, and Donald Rubin. " [http://links.jstor.org/sici?sici=0035-9246%281977%2939%3A1%3C1%3AMLFIDV%3E2.0.CO%3B2-ZMaximum likelihood from incomplete data via the EM algorithm] ". "Journal of the Royal Statistical Society", Series B, 39(1):1–38, 1977] . They pointed out that the method had been "proposed many times in special circumstances" by other authors, but the 1977 paper generalized the method and developed the theory behind it.


EM is frequently used for data clustering in machine learning and computer vision. In natural language processing, two prominent instances of the algorithm are the Baum-Welch algorithm (also known as "forward-backward") and the inside-outside algorithm for unsupervised induction of probabilistic context-free grammars.

In psychometrics, EM is almost indispensable for estimating item parameters and latent abilities of item response theory models.

With the ability to deal with missing data and observe unidentified variables, EM is becoming a useful tool to price and manage risk of a portfolio.

The EM algorithm (and its faster variant OS-EM) is also widely used in medical image reconstruction, especially in Positron Emission Tomography and Single Photon Emission Computed Tomography. See below for other faster variants of EM.

Demonstrations and activities

Various 1D, 2D and 3D [http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_Activities_2D_PointSegmentation_EM_Mixture demonstrations of EM together with Mixture Modeling] are provided as part of the paired SOCR activities and applets. These applets and activities show empirically the properties of the EM algorithm for parameter estimation in diverse settings.

A number of methods have been proposed to accelerate the sometimes slow convergence of the EM algorithm, such as those utilising conjugate gradient and modified Newton-Raphson techniques. Additionally EM can be utilised with constrained estimation techniques.

EM in a nutshell

EM is an iterative technique for estimating the value of some unknown quantity, given the values of some correlated, known quantity.

The approach is to first assume that the quantity is represented as a value in some parameterized probability distribution (the popular application is a mixture of Gaussians, hence the example below). The EM procedure, then, is:

#Initialize the distribution parameters

#Repeat until "convergence:"
##E-Step: estimate the [E] xpected value of the unknown variables, given the current parameter estimate
##M-Step: re-estimate the distribution parameters to [M] aximize the likelihood of the data, given the expected estimates of the unknown variables

Steps 1 and 2 are loaded, and depend on what distribution you choose, how many parameters there are, and how complicated the missing value is. Two other caveats: First, the choice of initial parameters technically does not matter, but in practice a poor choice can lead to a bad estimate. Second, convergence, though guaranteed, may take a long time to get to. In practice, if the values of the missing variables and parameters do not change significantly between two iterations, then the algorithm terminates.

A simple application is filling missing values in a column of a database. Assume you know about 50% of the values in a column, but the remaining values are corrupt or missing. For simplicity, assume that the data is distributed normally with a unit variance. Then, the only parameter that must be computed is the mean value. What's more convenient, is that the E-step is simple: the expected value of each missing value is the mean. But the E-step changes the overall mean of the data, and so the estimate can be improved. This will continue for a few iterations. An example:

Initialization Step:Data: [4, 10, ?, ?] Initial mean value: 0
New Data: [4, 10, 0, 0]

Step 1:New Mean: 3.5
New Data: [4, 10, 3.5, 3.5]

Step 2:New Mean: 5.25
New Data: [4, 10, 5.25, 5.25]

Step 3:New Mean: 6.125
New Data: [4, 10, 6.125, 6.125]

Step 4:New Mean: 6.5625
New Data: [4, 10, 6.5625, 6.5625]

Step 5:New Mean: 6.7825
New Data: [4, 10, 6.7825, 6.7825]

Result:New Mean: 6.890625

From here you can see the value is slowly converging toward 7. For this simple model (a univariate normal distribution with unit variance), it can be seen that substituting the average of the known values is the best answer. For more complex models, there is no easy way to find the best answer, and the EM algorithm is a very popular approach for estimating the answer.

In the mixture of Gaussians demonstration below, we have a collection of multi-dimensional objects. We'll assume that each individual data object was generated from one of K Gaussians. If we knew "which" Gaussian each object came from, then estimating the parameters is easy (using Maximum likelihood Estimation techniques). Instead, we must first compute the Expected value that the object came from each Gaussian (E-step) and then estimate the parameters, given these expected assignments (M-step).

Specification of the EM procedure

Let extbf{y} denote incomplete data consisting of values of observable variables, and let extbf{z} denote the missing data. Together, extbf{y} and extbf{z} form the complete data. extbf{z} can either be actual missing measurements or a hidden variable that would make the problem easier if its value were known. For instance, in mixture models, the likelihood formula would be much more convenient if mixture components that "generated" the samples were known (see example below).

Let p, be the joint probability density function (continuous case) or probability mass function (discrete case) of the complete data with parameters given by the vector heta: p( mathbf y, mathbf z | heta). This function can also be considered as the complete data likelihood, that is, it can be thought of as a function of heta.Further, note that the conditional distribution of the missing data given the observed can be expressed as:

:p(mathbf z |mathbf y, heta) = frac{p(mathbf y, mathbf z | heta)}{p(mathbf y | heta)} = frac{p(mathbf y|mathbf z, heta) p(mathbf z | heta) }{int p(mathbf y|mathbf hat{z}, heta) p(mathbf hat{z} | heta) dmathbf hat{z

by using the Bayes rule and the law of total probability. (This formulation only requires knowledge of the observation likelihood given the unobservable data p(mathbf y|mathbf z, heta) and the probability of the unobservable data p(mathbf z | heta).)

An EM algorithm iteratively improves an initial estimate heta_0 by constructing new estimates heta_1, heta_2, and so on.An individual re-estimation step that derives heta_{n+1}, from heta_n, takes the following form:

: heta_{n+1}=argmax_{ heta}Q( heta)

where Q( heta) is the expected value of the log-likelihood. In other words, we do not know the complete data, so we cannot say what is the exact value of the likelihood, but given the data that we do know (the y's), we can find "a posteriori" estimates of the probabilities for the various values of the unknown z's. For each set of z's there is a likelihood value for heta, and we can thus calculate an expected value of the likelihood with the given values of y's (and which depends on the previously assumed value of heta because this influenced the probabilities of the "z"'s).

"Q" is given by

:Q( heta)= sum_z p left(z ,|, y, heta_n ight) log p left(y, z ,|, heta ight)

or more generally

:Q( heta)= E_{mathbf z} ! ! left [ log p left(mathbf y, mathbf z ,|, heta ight) Big| mathbf y ight] where it is understood that this denotes the conditional expectation of log p left( mathbf y, mathbf z ,|, heta ight) being taken with the heta used in the conditional distribution of extbf{z} fixed at heta_n . (The log of the likelihood is often used instead of true likelihood because it leads to easier formulas, but still attains its maximum at the same point as the likelihood.)

In other words, heta_{n+1} is the value that maximizes (M) the conditional expectation (E) of the complete data log-likelihood given the observed variables under the previous parameter value.The expectation Q( heta) in the continuous case would be given by

:Q( heta)=E_{mathbf z} ! ! left [ log p left(mathbf y, mathbf z ,|, heta ight) Big| mathbf y ight] =int^infty _{- infty} p left(mathbf z ,|, mathbf y, heta_n ight) log p left(mathbf y, mathbf z ,|, heta ight) dmathbf z

Speaking of an expectation (E) step is a bit of a misnomer.What is calculated in the first step are the fixed, data-dependent parameters of the function "Q".Once the parameters of "Q" are known, it is fully determined and is maximized in the second (M) step of an EM algorithm.

The origin of the name comes from the fact that in the paper by Dempster, Laird and Rubin, they first discuss a less general problem in which the probability distribution is of the exponential family, and in that case the so-called E step consists of finding the expected values of certain sufficient statistics of the complete data.


Part of the reason for the popularity of EM algorithms is that,as can be shown, an EM iteration does not decrease the observed data likelihood function. However, there is no guarantee that the sequence converges to a maximum likelihood estimator. For multimodal distributions, this means that an EM algorithm will converge to a local maximum (or saddle point) of the observed data likelihood function, depending on starting values. There are a variety of heuristic approaches for escaping a local maximum such as using several different random initial estimates, heta_0, or applying simulated annealing.

EM is particularly useful when maximum likelihood estimation of a complete data model is easy.If closed-form estimators exist, the M step is often trivial.A classic example is maximum likelihood estimation of a finite mixture of Gaussians, where each component of the mixture can be estimated trivially if the mixing distribution is known.

"Expectation-maximization" is a description of a class of related algorithms, not a specific algorithm; EM is a recipe or meta-algorithm which is used to devise particular algorithms.The Baum-Welch algorithm is an example of an EM algorithm applied to hidden Markov models.Another example is the EM algorithm for fitting a mixture density model.

An EM algorithm can also find maximum a posteriori (MAP) estimates, by performing MAP estimation in the M step, rather than maximum likelihood.

There are other methods for finding maximum likelihood estimates, such as gradient descent, conjugate gradient or variations of the Gauss-Newton method. Unlike EM, such methods typically require the evaluation of first and/or second derivatives of the likelihood function.

Incremental versions

The classic EM procedure is to replace both "Q" and "θ" with their optimal possible (argmax) values at each iteration. However it can be shown (see Neal & Hinton, 1999) that simply finding "Q" and "θ" to give "some" improvement over their current value will also ensure successful convergence.

For example, to improve "Q", we could restrict the space of possible functions to a computationally simple distribution such as a factorial distribution,

:Q=prod_i Q_i. !

Thus at each E step we compute the variational approximation of "Q".

To improve "θ", we could use any hill-climbing method, and not worry about finding the optimal "θ", just some improvement. This method is also known as "Generalized EM" (GEM).

Relation to variational Bayes methods

EM is a partially non-Bayesian, maximum likelihood method. Its final result gives a probability distribution over the latent variables (in the Bayesian style) together with a point estimate for "θ" (either a maximum likelihood estimate or a posterior mode). We may want a fully Bayesian version of this, giving a probability distribution over "θ" as well as the latent variables. In fact the Bayesian approach to inference is simply to treat "θ" as another latent variable. In this paradigm, the distinction between the E and M steps disappears. If we use the factorized Q approximation as described above (variational Bayes), we may iterate over each latent variable (now including "θ") and optimize them one at a time. There are now "k" steps per iteration, where "k" is the number of latent variables. For graphical models this is easy to do as each variable's new "Q" depends only on its Markov blanket, so local message passing can be used for efficient inference.

Example: Gaussian Mixture

Assume that a sample of "m" vectors (or scalars) mathbf y_1, dots, extbf{y}_m, where mathbf y_j in mathbb{R}^l, is drawn from one of "n" Gaussian distributions. Let z_j in {1,2,ldots,n} denote which Gaussian mathbf y_j is from. The probability that a particular mathbf y comes from the i^{mathrm{th D-dimensional Gaussian is

P(mathbf y | z=i, heta) = mathcal{N}(mu_i,sigma_i) = (2pi)^{-D/2} {left| sigma_i ight^{-1/2} expleft(-frac{1}{2}(mathbf y - mathbf mu_i)^T sigma_i^{-1} (mathbf y - mathbf mu_i) ight)

Our task is to estimate the unknown parameters heta = left{ mu_1, dots, mu_n, sigma_1, dots, sigma_n, P(z=1), dots, P(z=n) ight}, that is, the mean and standard deviation of each Gaussian and the probability for each Gaussian being drawn for any given point. (Actually it is not clear that we should allow the standard deviations to take any value because then the maximum likelihood may be unbounded as one centers a particular Gaussian on a particular data point and decreases the standard deviation toward zero!)


Estimation of the unobserved "z"'s (which Gaussian is used), conditioned on the observation, using the values from the last maximization step:

:p(z_j=i|mathbf y_j, heta_t) = frac{p(z_j=i, mathbf y_j | heta_t)}{p(mathbf y_j| heta_t)} = frac{p(mathbf y_j|z_j=i, heta_t) p(z=i| heta_t)}{sum_{k=1}^n p(mathbf y_j | z_j=k, heta_t) p(z=k| heta_t)}


We now want to maximize the expected log-likelihood of the joint event:

:egin{align}Q( heta) & = E_{z} left [ ln prod_{j=1}^m p left(mathbf y_j, mathbf z | heta ight) Big| mathbf y_j ight] \ & = E_{z} left [ sum_{j=1}^m ln p left(mathbf y_j, mathbf z | heta ight) Big| mathbf y_j ight] \ & = sum_{j=1}^m E_{z} left [ ln p left(mathbf y_j, mathbf z | heta ight) Big| mathbf y_j ight] \ & = sum_{j=1}^m sum_{i=1}^n p left(z_j=i | mathbf y_j, heta_t ight) ln pleft(z_j=i, mathbf y_j | heta ight) \end{align}

If we expand the probability of the joint event, we get

:Q( heta) = sum_{j=1}^m sum_{i=1}^n p(z_j=i | mathbf y_j, heta_t) ln left( p(mathbf y_j | z_j=i, heta) p(z_j=i | heta) ight)

We have the constraint

:sum_{i=1}^{n} p(z_j=i| heta) = 1

If we add a Lagrange multiplier, and expand the pdf, we get

:egin{align}mathcal{L}( heta)= & left( sum_{j=1}^m sum_{i=1}^n p(z_j=i | mathbf y_j, heta_t) left( - frac{D}{2} ln (2pi) - frac{1}{2} ln left| sigma_i ight| - frac{1}{2}(mathbf y_j - mathbf mu_i)^T sigma_i^{-1} (mathbf y_j - mathbf mu_i) + ln p(z=i | heta) ight) ight) \& - lambda left( sum_{i=1}^{n} p(z=i | heta) - 1 ight)end{align}

To find the new estimate heta_{t+1}, we find a maximum where frac{partial mathcal{L}( heta)}{partial heta} = 0.

New estimate for mean (using some differentiation rules from matrix calculus):

:egin{align}frac{partial mathcal{L}( heta)}{partial mu_i} & = sum_{j=1}^m p(z_j=i | mathbf y_j, heta_t) left( - frac{partial}{partial mu_i} frac{1}{2}(mathbf y_j - mathbf mu_i)^T sigma_i^{-1} (mathbf y_j - mathbf mu_i) ight) \ & = sum_{j=1}^m p(z_j=i | mathbf y_j, heta_t) left( - frac{1}{2}(sigma_i^{-1} +sigma_i^{-T})(mathbf y_j - mathbf mu_i)(-1) ight) \ & = sum_{j=1}^m p(z_j=i | mathbf y_j, heta_t) left( sigma_i^{-1}(mathbf y_j - mathbf mu_i) ight) \ & = 0 \ & Downarrow \sum_{j=1}^m p(z_j=i | mathbf y_j, heta_t) sigma_i^{-1} mathbf mu_i & = sum_{j=1}^m p(z_j=i | mathbf y_j, heta_t) sigma_i^{-1} mathbf y_j \ & Downarrow \mu_i sum_{j=1}^m p(z_j=i | mathbf y_j, heta_t) & = sum_{j=1}^m p(z_j=i | mathbf y_j, heta_t) mathbf y_j \ & Downarrow \mu_i & = frac{sum_{j=1}^m p(z_j=i | mathbf y_j, heta_t) mathbf y_j}{sum_{j=1}^m p(z_j=i | mathbf y_j, heta_t)} \end{align}

New estimate for covariance:

:egin{align}frac{partial mathcal{L}( heta)}{partial sigma_i} & = sum_{j=1}^m p(z_j=i | mathbf y_j, heta_t) left( - frac{partial}{partial sigma_i} frac{1}{2} ln left| sigma_i ight| - frac{partial}{partial sigma_i} frac{1}{2}(mathbf y_j - mathbf mu_i)^T sigma_i^{-1} (mathbf y_j - mathbf mu_i) ight) \ & = sum_{j=1}^m p(z_j=i | mathbf y_j, heta_t) left( - frac{1}{2} sigma_i^{-T} + frac{1}{2} sigma_i^{-T}(mathbf y_j - mathbf mu_i) (mathbf y_j - mathbf mu_i)^T sigma_i^{-T} ight) \ & = 0 \ & Downarrow \sum_{j=1}^m p(z_j=i | mathbf y_j, heta_t) sigma_i^{-1} & = sum_{j=1}^m p(z_j=i | mathbf y_j, heta_t) sigma_i^{-1} (mathbf y_j - mathbf mu_i) (mathbf y_j - mathbf mu_i)^T sigma_i^{-1} \ & Downarrow \sum_{j=1}^m p(z_j=i | mathbf y_j, heta_t) & = sum_{j=1}^m p(z_j=i | mathbf y_j, heta_t) sigma_i^{-1} (mathbf y_j - mathbf mu_i) (mathbf y_j - mathbf mu_i)^T \ & Downarrow \sigma_i & = frac{sum_{j=1}^m p(z_j=i | mathbf y_j, heta_t) (mathbf y_j - mathbf mu_i) (mathbf y_j - mathbf mu_i)^T}{sum_{j=1}^m p(z_j=i | mathbf y_j, heta_t)} \end{align}

New estimate for class probability:

:egin{align}frac{partial mathcal{L}( heta)}{partial p(z=i| heta)} & = left( sum_{j=1}^m p(z_j=i | mathbf y_j, heta_t) frac{partial ln p(z=i| heta)}{partial p(z=i| heta)} ight) - lambda left( frac{partial p(z=i| heta)}{partial p(z=i| heta)} ight) \ & = left( sum_{j=1}^m p(z_j=i | mathbf y_j, heta_t) frac{1}{p(z=i| heta)} ight) - lambda \ & = 0 \ & Downarrow \sum_{j=1}^m p(z_j=i | mathbf y_j, heta_t) frac{1}{p(z=i| heta)} & = lambda \ & Downarrow \p(z=i| heta) & = frac{1}{lambda} sum_{j=1}^m p(z_j=i | mathbf y_j, heta_t) \end{align}

Inserting into the constraint:

:egin{align}sum_{i=1}^{n} p(z=i| heta) & = sum_{i=1}^{n} frac{1}{lambda} sum_{j=1}^m p(z_j=i | mathbf y_j, heta_t) \ & = 1 \ & Downarrow \lambda & = sum_{i=1}^{n} sum_{j=1}^m p(z_j=i | mathbf y_j, heta_t) \end{align}

Inserting lambda into our estimate:

:egin{align}p(z=i| heta) & = frac{1}{lambda} sum_{j=1}^m p(z_j=i | mathbf y_j, heta_t) \ & = {displaystylefrac{sum_{j=1}^m p(z_j=i | mathbf y_j, heta_t)}{sum_{k=1}^{n} sum_{j=1}^m p(z_j=k | mathbf y_j, heta_t) \ & = frac{1}{m}sum_{j=1}^m p(z_j=i | mathbf y_j, heta_t)end{align}

These estimates now become our heta_{t+1}, to be used in the next estimation step.


* Robert Hogg, Joseph McKean and Allen Craig. "Introduction to Mathematical Statistics". pp. 359-364. Upper Saddle River, NJ: Pearson Prentice Hall, 2005.
* Radford Neal, Geoffrey Hinton. "A view of the EM algorithm that justifies incremental, sparse, and other variants". In Michael I. Jordan (editor), "Learning in Graphical Models" pp 355-368. Cambridge, MA: MIT Press, 1999.
* [http://www.inference.phy.cam.ac.uk/mackay/itila/ The on-line textbook: Information Theory, Inference, and Learning Algorithms] , by David J.C. MacKay includes simple examples of the E-M algorithm such as clustering using the soft K-means algorithm, and emphasizes the variational view of the E-M algorithm.
* [http://citeseer.ist.psu.edu/bilmes98gentle.html A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models] , by [http://ssli.ee.washington.edu/people/bilmes/ Jeff Bilmes] includes a simplified derivation of the EM equations for Gaussian Mixtures and Gaussian Mixture Hidden Markov Models.
* [http://www.cse.buffalo.edu/faculty/mbeal/thesis/index.html Variational Algorithms for Approximate Bayesian Inference] , by M. J. Beal includes comparisons of EM to Variational Bayesian EM and derivations of several models including Variational Bayesian HMMs.
* [http://www.cc.gatech.edu/~dellaert/em-paper.pdf The Expectation Maximization Algorithm] , by Frank Dellaert, gives an easier explanation of EM algorithm in terms of lowerbound maximization.
* [http://www.seanborman.com/publications/EM_algorithm.pdf The Expectation Maximization Algorithm: A short tutorial] , A self contained derivation of the EM Algorithm by Sean Borman.
*http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_Activities_2D_PointSegmentation_EM_Mixture SOCR demonstrations of EM and Mixture Modeling]
* Jamshidian, Mortaza and Jennrich, Robert I. (1997). "Acceleration of the EM Algorithm by Using Quasi-Newton Methods,"Journal of the Royal Statistical Society, Ser. B , 59, 569--587.

External links

* [http://lcn.epfl.ch/tutorial/english/mixtureModel/index.html Java code example]
* [http://www.neurosci.aist.go.jp/~akaho/MixtureEM.html Another Java code example]

ee also

*Estimation theory
*Constrained clustering
*Data clustering
*K-means algorithm
*Imputation (statistics)
*Ordered subset expectation maximization
*Baum-Welch algorithm, a particular case
*Latent variable model
*Hidden Markov Model
*Structural equation model
*Rubin causal model

Wikimedia Foundation. 2010.

Look at other dictionaries:

  • Expectation-Maximization-Algorithmus — Der Expectation Maximization Algorithmus (kurz EM Algorithmus, selten auch Estimation Maximization Algorithmus) ist ein Algorithmus der mathematischen Statistik. Der EM Algorithmus wird vorrangig zur Ballungsanalyse verwendet (Siehe hierzu den… …   Deutsch Wikipedia

  • Expectation-Maximization — Algorithme espérance maximisation L algorithme espérance maximisation (en anglais Expectation maximisation algorithm, souvent abrégé EM), proposé par Dempster et al. (1977), est une classe d algorithmes qui permettent de trouver le maximum de… …   Wikipédia en Français

  • Ordered subset expectation maximization — This article is about an algorithm. For Israeli food corporation, see Osem (company). In mathematical optimization, the ordered subset expectation maximization (OSEM) method is an iterative method that is used in computed tomography. In… …   Wikipedia

  • Maximization — or maximisation can refer to: Maximization in the sense of exaggeration Entropy maximization Maximization (economics) Profit maximization Utility maximization problem Budget maximizing model Shareholder value maximization Optimization… …   Wikipedia

  • K-means algorithm — The k means algorithm is an algorithm to cluster n objects based on attributes into k partitions, k < n. It is similar to the expectation maximization algorithm for mixtures of Gaussians in that they both attempt to find the centers of natural… …   Wikipedia

  • Inside-outside algorithm — The inside outside algorithm is a way of re estimating production probabilities in a probabilistic context free grammar. It was introduced by James K. Baker in 1979 as a generalization of the forward backward algorithm for parameter estimation on …   Wikipedia

  • Baum-Welch algorithm — In computer science, statistical computing and bioinformatics, the Baum Welch algorithm is used to find the unknown parameters of a hidden Markov model (HMM). It makes use of the forward backward algorithm and is named for Leonard E. Baum and… …   Wikipedia

  • Mixture model — See also: Mixture distribution In statistics, a mixture model is a probabilistic model for representing the presence of sub populations within an overall population, without requiring that an observed data set should identify the sub population… …   Wikipedia

  • k-means clustering — In statistics and data mining, k means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. This results into a partitioning of… …   Wikipedia

  • Multiple sequence alignment — A multiple sequence alignment (MSA) is a sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. In many cases, the input set of query sequences are assumed to have an evolutionary relationship by which they… …   Wikipedia