Expectation Maximization Algorithm and Gaussian Mixture Model

Sat 12 Mar 2016 by Tianlong Song Tags Machine Learning Data Mining

In statistical modeling, it is possible that some observations are just missing. For example, when flipping two biased coins with unknown biases, we only have a sequence of observations on heads and tails, but forgot to record which coin each observation comes from. In this case, the conventional maximum likelihood (ML) or maximum a posteriori (MAP) algorithm would no longer be able to work, and it is time for the expectation maximization (EM) algorithm to come into play.

A Motivating Example

Although the two-biased-coin example above works as a valid example, another example will be discussed here, as it is more relevant to practical needs. Let us assume that we have a collection of float numbers, which come from two different Gaussian distributions. Unfortunately, we do not know which distribution each number comes from. Now we are supposed to learn the two Gaussian distributions (i,e, their means and variances) from the given data. This is the well-known Gaussian mixture model (GMM). What makes things difficult is that we have missing observations, i.e., membership of each number towards the two distributions. Though conventional ML or MAP would not work here, this is a perfect problem that EM can handle.

Expectation Maximization (EM) Algorithm1

Let us consider a statistical model with a vector of unknown parameters \(\boldsymbol\theta\), which generates a set of observed data \(\textbf{X}\) and a set of missing observations \(\textbf{Z}\). The likelihood function, \(p(\textbf{X},\textbf{Z}|\boldsymbol\theta)\), characterizes the probability that \(\textbf{X}\) and \(\textbf{Z}\) appear given the model with parameters \(\boldsymbol\theta\). An intuitive idea to estimate \(\boldsymbol\theta\) would be trying to perform the maximum likelihood estimation (MLE) considering all possible \(\textbf{Z}\), i.e.,

\begin{equation} \max_{\boldsymbol\theta}{\ln{p(\textbf{X}|\boldsymbol\theta)}}=\max_{\boldsymbol\theta}{\ln{\sum_{\textbf{Z}}{p(\textbf{X},\textbf{Z}|\boldsymbol\theta)}}}=\max_{\boldsymbol\theta}{\ln{\sum_{\textbf{Z}}{p(\textbf{X}|\textbf{Z},\boldsymbol\theta)p(\textbf{Z}|\boldsymbol\theta)}}}. \end{equation}

Unfortunately, the problem above is not directly tractable, since we do not have any prior knowledge on the missing observations \(\textbf{Z}\).

The EM algorithm aims to solve the problem above by starting with a guess on \(\boldsymbol\theta=\boldsymbol\theta_{0}\) and then iteratively applying the two steps as indicated below:

  • Expectation Step (E Step): Calculate the log likelihood with respect to \(\boldsymbol\theta\) given \(\boldsymbol\theta_{t}\) by
    \begin{equation} \mathcal{L}(\boldsymbol\theta|\boldsymbol\theta_{t})=\ln{\sum_{\textbf{Z}}{p(\textbf{X}|\textbf{Z},\boldsymbol\theta_{t})p(\textbf{Z}|\boldsymbol\theta_{t})}}; \end{equation}
  • Maximization Step (M Step): Find the parameter vector that maximizes the log likelihood above and then update it as
    \begin{equation} \theta_{t+1}={\arg \, \max}_{\theta}{\mathcal{L}(\boldsymbol\theta|\boldsymbol\theta_{t})}. \end{equation}

There are two things that should be noted here:

  • There are two categories of EM: hard EM and soft EM. The algorithm illustrated above is soft EM, because the log likelihood in the E step is weighted upon all possible \(\textbf{Z}\) with their probabilities. While in hard EM, instead of using weighted average, we simply select the most probable \(\textbf{Z}\) and then move forward. The k-means algorithm is a good example of hard EM algorithm.
  • The EM algorithm typically converges to a local optimum, and cannot guarantee global optimum. With this being said, the solution might differ with different initialization, and it is possibly helpful to try more than one initialization when applying EM practically.

Gaussian Mixture Model (GMM)

In the motivating example, a GMM with two Gaussian distributions was introduced. Here we are going to extend it to a general case with \(K\) Gaussian distributions, and the data points will be generalized to be multidimensional. At the same time, we will discuss how it can be used for clustering.

Given a data set containing \(N\) data points, \(\mathcal{D}=\{\textbf{x}_1,\textbf{x}_2,...,\textbf{x}_N\}\), in which each data point is a \(M\)-dimensional column vector and comes from one of \(K\) Gaussian distributions. Here we will introduce \(\mathcal{Z}=\{z_1,z_2,...,z_N\}\) with \(z_i\in\{1,2,...,K\}\) as latent (hidden) variables to represent the cluster membership of the data points in \(\mathcal{D}\). The \(K\) Gaussian distributions are characterized by \(\mathcal{N}(\boldsymbol\mu_j,\boldsymbol\Sigma_j)\) for \(j=1,2,...,K\), and the \(j\)th distribution has a weight of \(\pi_j\) accounted in the overall distribution. Let us first try to map this GMM model to the EM algorithm component by component:

  • \(\mathcal{D}\) is the observed data;
  • \(\mathcal{Z}\) is the missing observations;
  • \(\boldsymbol\mu_j\), \(\boldsymbol\Sigma_j\) and \(\pi_j\) are the unknown model parameters.

Following the EM algorithm, we will start with a guess on the unknown parameters, and then iteratively applying E step and M step until convergence. In the E step, we calculate the log likelihood based on given model parameters by

\begin{align} \begin{aligned} LL&=\ln{p(\textbf{x}_1,\textbf{x}_2,...,\textbf{x}_N)}\\ &=\ln{\prod_{i=1}^{N}{p(\textbf{x}_i)}}\\ &=\ln{\prod_{i=1}^{N}{\sum_{j=1}^{K}{p(z_i=j)p(\textbf{x}_i|z_i=j)}}}\\ &=\sum_{i=1}^{N}{\ln\left(\sum_{j=1}^{K}{\pi_j\mathcal{N}(\textbf{x}_i|\boldsymbol\mu_j,\boldsymbol\Sigma_j)}\right)}. \end{aligned} \end{align}

In the M step, we maximize the log likelihood by solving the optimization problem below:

\begin{align} \begin{aligned} \max_{\boldsymbol\mu_j,\boldsymbol\Sigma_j,\pi_j}~~&{LL}\\ &s.t.~\sum_{j=1}^{K}{\pi_j}=1. \end{aligned} \end{align}

We can apply Lagrange multiplier to solve the problem above. Let

\begin{equation} L=\sum_{i=1}^{N}{\ln\left(\sum_{j=1}^{K}{\pi_j\mathcal{N}(\textbf{x}_i|\boldsymbol\mu_j,\boldsymbol\Sigma_j)}\right)}-\lambda\left(\sum_{j=1}^{K}{\pi_j}-1\right), \end{equation}

where \(\lambda\) is the Lagrange multiplier. Taking partial derivatives and setting them to zero, we can obtain the optimal parameters as below:

\begin{equation} \label{Eqn:Maximization_1} \boldsymbol\mu_j=\frac{\sum_{i=1}^{N}{\gamma_{ij}}\textbf{x}_i}{\sum_{i=1}^{N}{\gamma_{ij}}}, \end{equation}
\begin{equation} \label{Eqn:Maximization_2} \boldsymbol\Sigma_j=\frac{\sum_{i=1}^{N}{\gamma_{ij}}(\textbf{x}_i-\boldsymbol\mu_j)(\textbf{x}_i-\boldsymbol\mu_j)^T}{\sum_{i=1}^{N}{\gamma_{ij}}}, \end{equation}
\begin{equation} \label{Eqn:Maximization_3} \pi_j=\frac{1}{N}{\sum_{i=1}^{N}{\gamma_{ij}}}, \end{equation}

where \(\gamma_{ij}=p(z_i=j|\textbf{x}_i)\) is the cluster membership, which can be calculated using Bayes theorem,

\begin{equation} \begin{split} \gamma_{ij}&=p(z_i=j|\textbf{x}_i)\\ &=\frac{p(z_i=j)p(\textbf{x}_i|z_i=j)}{\sum_{j=1}^{K}{p(z_i=j)p(\textbf{x}_i|z_i=j)}}\\ &=\frac{\pi_j\mathcal{N}(\textbf{x}_i|\boldsymbol\mu_j,\boldsymbol\Sigma_j)}{\sum_{j=1}^{K}{\pi_j\mathcal{N}(\textbf{x}_i|\boldsymbol\mu_j,\boldsymbol\Sigma_j)}}. \end{split} \end{equation}

To summarize, the GMM model can be learned using EM algorithm as in the following steps:

  1. Initialize \(\boldsymbol\mu_j\), \(\boldsymbol\Sigma_j\) and \(\pi_j\) for \(j=1,2,...,K\);
  2. Repeat the following two steps until the log likelihood converges:
    • E Step: Estimate cluster membership \(\gamma_{ij}\) by the equation right above for all data point \(x_i\) and cluster \(z_i=j\);
    • M Step: Maximize the log likelihood and update the model parameters by (\ref{Eqn:Maximization_1})-(\ref{Eqn:Maximization_3}) based on cluster membership \(\gamma_{ij}\).


  1. Wikipedia, Expectation–maximization algorithm, accessed on Mar 12, 2016.