Are you studying machine learning and want to know more about Gaussian Mixture Models? You’ve come to the right place. I have found other online resources to be difficult to approach and/or lacking crucial details. Here I will try to explain GMMs in plain language.


Overview

Gaussian Mixture Models are used to group data points into different clusters. In this way they are similar to K-means clustering, where each cluster is specified by a single point at its center. In fact, GMMs are a generalization of K-means in which we relax the assumption that the clusters are circular (or spherical, or hyperspherical, depending on how many dimensions your features have). Each gaussian in the mixture is specified by three parameters: a mixing weight, a mean, and a covariance. Just like how the K-means algorithm seeks the points which best “fit” your data points, the GMM algorithm will seek the gaussians which best fit them.

Any GMM implementation you write will probably have a fit and a predict method. The goal of the fit method is to find a number of gaussians to cluster your data points. The predict method takes a new point (or a list of new points) and predicts which gaussian most likely generated it. So you may have a skeleton class like so:

class GMM():
  def __init__(self, n_clusters, covariance_type):
    pass

  def fit(self, features):
    pass

  def predict(self, features):
    pass

The fit() method

This method has two major sub-parts: the expectation (E) step and the maximization (M) step. These two steps run inside of an update loop, which typically terminates after a set number of iterations or once the algorithm has stopped improving.
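The update loop can be sketched as a standalone function. The callback signatures and the log-likelihood convergence check here are my assumptions for illustration, not something prescribed by the article:

```python
import numpy as np

def run_em(features, e_step, m_step, max_iterations=100, tolerance=1e-6):
    # alternate E and M steps until the log-likelihood stops improving
    # or we hit the iteration cap (hypothetical callback signatures)
    params = None
    previous_log_likelihood = -np.inf
    for _ in range(max_iterations):
        # e_step returns the responsibilities and the current log-likelihood
        expectations, log_likelihood = e_step(features, params)
        # m_step turns responsibilities back into updated parameters
        params = m_step(features, expectations)
        if abs(log_likelihood - previous_log_likelihood) < tolerance:
            break
        previous_log_likelihood = log_likelihood
    return params
```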

I personally find that examples are the best way to understand many concepts, so in this article I will use a dataset with 4 features (data points), each feature having 3 dimensions:

feature d0 d1 d2
x0 d00 d01 d02
x1 d10 d11 d12
x2 d20 d21 d22
x3 d30 d31 d32
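In numpy this dataset is just a (4, 3) array, with one row per feature and one column per dimension. The numbers below are made up purely for illustration:

```python
import numpy as np

# rows are features x0..x3, columns are dimensions d0..d2 (made-up values)
features = np.array([
    [1.0, 2.0, 0.5],   # x0
    [1.2, 1.8, 0.4],   # x1
    [5.0, 5.5, 4.8],   # x2
    [4.9, 6.0, 5.1],   # x3
])
```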

Expectation Step

The purpose of this step is to calculate, for each of our features, the probability of that feature belonging to cluster k. In this example we have 4 clusters and 4 data points, so we need to fill in the following table:

feature k0 k1 k2 k3
x0 γ00 γ01 γ02 γ03
x1 γ10 γ11 γ12 γ13
x2 γ20 γ21 γ22 γ23
x3 γ30 γ31 γ32 γ33
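Each row of this table is a probability distribution over the clusters, so every row must sum to 1. As a numpy array it looks like this (the values are made up for illustration):

```python
import numpy as np

# γ table: rows are features, columns are clusters; each row sums to 1
expectations = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.6, 0.2, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
])
```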

Maximization Step

In this step we need to update three variables for each of the clusters: the weight, the mean, and the diagonal covariance. In this article I will use wj to represent the weight of the jth gaussian, μj to represent the jth mean, and σj to represent the jth diagonal covariance (a vector with one variance per dimension).

Calculating wj

First we compute an intermediate quantity rj, which is the sum of the gammas corresponding to a given cluster kj. In other words,

$ r_{j} = \sum_{i=0}^3 γ_{ij} $

Basically, rj is the sum of the gammas in the jth column. We then use rj to calculate wj, the “mixing weight” of the jth gaussian, by dividing by the number of features (in this case 4):

$ w_{j} = {r_{j} \over 4} $
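As a quick numeric check, r and w can be computed for all clusters at once. The responsibility values here are made up for illustration:

```python
import numpy as np

# made-up responsibility table: rows are features, columns are clusters
expectations = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.6, 0.2, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
])

r = expectations.sum(axis=0)  # one r_j per cluster (column sums)
w = r / len(expectations)     # divide by the number of features (4)
```

Because every row of the responsibility table sums to 1, the mixing weights always sum to 1 as well.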

Calculating μj

Next we calculate the means μj of the gaussians i.e. the point in n-dimensional space over which the (n+1)-dimensional bell curve is centered. Because in our example each feature xi contains 3 dimensions (x0 = [d00, d01, d02]), the means will also need to have 3 dimensions each.

μj is calculated as follows:

$$ μ_{j} = {1 \over r_{j} } \sum_{i=0}^3 γ_{ij} x_{i} $$

Let’s work through the first mean, μ0. For simplicity we will multiply both sides by r0 for now:

$$ r_0 μ_0 = γ_{00} x_{0} + γ_{10} x_{1} + γ_{20} x_{2} + γ_{30} x_{3} $$

$$ = γ_{00} \begin{bmatrix} d_{00} \\ d_{01} \\ d_{02} \end{bmatrix} + γ_{10} \begin{bmatrix} d_{10} \\ d_{11} \\ d_{12} \end{bmatrix} + γ_{20} \begin{bmatrix} d_{20} \\ d_{21} \\ d_{22} \end{bmatrix} + γ_{30} \begin{bmatrix} d_{30} \\ d_{31} \\ d_{32} \end{bmatrix}$$

So we are essentially taking a weighted sum of all of our features, with the weights being the gammas associated with this gaussian. This sum of scalar-vector products can be rolled up into a single matrix multiplication:

$$ = \begin{bmatrix} d_{00} & d_{10} & d_{20} & d_{30} \\ d_{01} & d_{11} & d_{21} & d_{31} \\ d_{02} & d_{12} & d_{22} & d_{32} \end{bmatrix} \begin{bmatrix} γ_{00} \\ γ_{10} \\ γ_{20} \\ γ_{30} \\ \end{bmatrix} = \begin{bmatrix} μ_{00} \\ μ_{01} \\ μ_{02} \end{bmatrix} $$

Don’t forget to divide by r0 again! It makes sense that each μ should have 3 dimensions, since our features also each have 3 dimensions.
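Here is that calculation in numpy, using made-up numbers for both the features and the gammas (all the values are illustrative assumptions):

```python
import numpy as np

# made-up data: 4 features, 3 dimensions each
features = np.array([
    [1.0, 2.0, 0.5],
    [1.2, 1.8, 0.4],
    [5.0, 5.5, 4.8],
    [4.9, 6.0, 5.1],
])
# made-up gammas for cluster 0, one per feature
gammas = np.array([0.7, 0.6, 0.1, 0.1])

r0 = gammas.sum()
# the matrix product from the equation, then the divide by r0
mu0 = (features.T @ gammas) / r0
```

Note that numpy also lets you write the same product as gammas @ features, which is the form used in the maximization code later in this article.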

Calculating σj

Technically this is σ2 if you’re here from statistics, but since we never take the square root it doesn’t matter for our purposes. Because we are using a diagonal covariance, σj is a vector with one variance per dimension: the γ-weighted average of the squared deviations from the mean, with the square taken elementwise:

$$ σ_{j} = {1 \over r_{j}} \sum_{i=0}^3 γ_{ij} (x_{i} - μ_{j})^2 $$
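In code, the diagonal covariance for a cluster is the γ-weighted average of squared deviations from that cluster’s mean, one value per dimension (again using made-up values):

```python
import numpy as np

# made-up data: 4 features, 3 dimensions each
features = np.array([
    [1.0, 2.0, 0.5],
    [1.2, 1.8, 0.4],
    [5.0, 5.5, 4.8],
    [4.9, 6.0, 5.1],
])
gammas = np.array([0.7, 0.6, 0.1, 0.1])  # made-up gammas for cluster 0

r0 = gammas.sum()
mu0 = (gammas @ features) / r0
# gamma-weighted average of squared deviations, one variance per dimension
sigma0 = (gammas @ (features - mu0) ** 2) / r0
```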

All together now. E step equation, M step code.

The full equation for the expectation step is

$$ \gamma_{ij} = \frac{ w_j \, \mathcal{N}\!\left(x_i \mid \mu_j, \Sigma_j\right) }{ \sum_{k=0}^{K-1} w_k \, \mathcal{N}\!\left(x_i \mid \mu_k, \Sigma_k\right) } $$

Alright, enough math. As fun as it is for you to read, it’s even more fun to write in markdown.

All of the math above for calculating w, μ and σ can be written succinctly in python. Here I am using numpy to take advantage of some pretty slick matrix operations it has available.

"""
Fit GMM to the given data using self.n_clusters number of gaussians.
Features may be multi-dimensional.

Args:
  features: numpy array containing inputs of size
    (n_samples, n_dimensions)
  expectations: numpy array containing gammas from E step. Size is
    (n_features, n_clusters)
Returns:
  means: updated means
  covariances: updated covariances
  mixing_weights: updated mixing weights
"""

def maximization_step(self, features, expectations):
  # initialize lists to hold our parameters
  mixing_weights, means, covariances = [], [], []

  for cluster in range(self.n_clusters):
    # gammas will hold a column from the expectations matrix
    gammas = expectations[:,cluster]

    # r is just the sum of a column
    r = np.sum(gammas)

    # calculate mixing weight
    w = r / len(features)
    mixing_weights.append(w)

    # calculate means
    # the @ symbol is a slick shortcut for matrix multiplication
    mu = (gammas @ features) / r
    means.append(mu)

    # calculate covariances
    diff_sq = (features-mu) ** 2
    cov = (gammas @ diff_sq) / r
    covariances.append(cov)

  return means, covariances, mixing_weights

The predict() method

Once the model is trained, we have learned:

  • Mixing weights wj
  • Means μj
  • Covariances Σj

To classify a new data point x we compute its responsibilities using the same formula from the E-step:

$$ \gamma_j(x) = \frac{ w_j \, \mathcal{N}\!\left(x \mid \mu_j, \Sigma_j\right) }{ \sum_{k=0}^{K-1} w_k \, \mathcal{N}\!\left(x \mid \mu_k, \Sigma_k\right) } $$

This yields a probability distribution over all clusters. If we only want the most likely cluster,

$$ \text{cluster}(x) = \arg\max_{j} \, \gamma_j(x) $$

def predict_probs(self, features):
    gamma, _ = self._e_step(features)
    return gamma

def predict(self, features):
    return np.argmax(self.predict_probs(features), axis=1)