Are you studying machine learning and want to know more about Gaussian Mixture Models? You’ve come to the right place. I have found other online resources to be difficult to approach and/or lacking crucial details. Here I will try to explain GMMs in plain language.


Overview

Gaussian Mixture Models are used to group data points into different clusters. In this way they are similar to K-means clustering, where each cluster is specified by a single point at its center. In fact, GMMs are a generalization of K-means in which we relax the assumption that the clusters are circular (or spherical, or hyperspherical, depending on how many dimensions your features have). Each gaussian in the mixture is specified by three parameters: a mixing weight, a mean, and a covariance. Just like how the K-means algorithm seeks the points which best “fit” your data points, the GMM algorithm will seek the gaussians which best fit them.

Any GMM implementation you write will probably have a fit and a predict method. The goal of the fit method is to find a number of gaussians to cluster your data points. The predict method takes a new point (or a list of new points) and predicts which gaussian most likely generated it. So you may have a skeleton class like so:

class GMM():
  def __init__(self, n_clusters, covariance_type):
    pass

  def fit(self, features):
    pass

  def predict(self, features):
    pass

The fit() method

This method has two major sub-parts: the expectation (E) step and the maximization (M) step. These two steps run inside of an update loop, which typically terminates after a set number of iterations or once the algorithm has stopped improving.
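The update loop can be sketched as a standalone function. The callback signatures and the log-likelihood convergence check here are my assumptions for illustration, not something prescribed by the article:

```python
import numpy as np

def run_em(features, e_step, m_step, max_iterations=100, tolerance=1e-6):
    # alternate E and M steps until the log-likelihood stops improving
    # or we hit the iteration cap (hypothetical callback signatures)
    params = None
    previous_log_likelihood = -np.inf
    for _ in range(max_iterations):
        # e_step returns the responsibilities and the current log-likelihood
        expectations, log_likelihood = e_step(features, params)
        # m_step turns responsibilities back into updated parameters
        params = m_step(features, expectations)
        if abs(log_likelihood - previous_log_likelihood) < tolerance:
            break
        previous_log_likelihood = log_likelihood
    return params
```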

I personally find that examples are the best way to understand many concepts, so in this article I will use a dataset with 4 features (data points), each feature having 3 dimensions:

feature d0 d1 d2
x0 d00 d01 d02
x1 d10 d11 d12
x2 d20 d21 d22
x3 d30 d31 d32
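In numpy this dataset is just a (4, 3) array, with one row per feature and one column per dimension. The numbers below are made up purely for illustration:

```python
import numpy as np

# rows are features x0..x3, columns are dimensions d0..d2 (made-up values)
features = np.array([
    [1.0, 2.0, 0.5],   # x0
    [1.2, 1.8, 0.4],   # x1
    [5.0, 5.5, 4.8],   # x2
    [4.9, 6.0, 5.1],   # x3
])
```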

Expectation Step

The purpose of this step is to calculate, for each of our features, the probability of that feature belonging to cluster k. In this example we have 4 clusters and 4 data points, so we need to fill in the following table:

feature k0 k1 k2 k3
x0 γ00 γ01 γ02 γ03
x1 γ10 γ11 γ12 γ13
x2 γ20 γ21 γ22 γ23
x3 γ30 γ31 γ32 γ33
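Each row of this table is a probability distribution over the clusters, so every row must sum to 1. As a numpy array it looks like this (the values are made up for illustration):

```python
import numpy as np

# γ table: rows are features, columns are clusters; each row sums to 1
expectations = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.6, 0.2, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
])
```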

Maximization Step

In this step we need to update three variables for each of the clusters: the weight, the mean, and the diagonal covariance. In this article I will use wj to represent the weight of the jth gaussian, μj to represent the jth mean, and σj to represent the jth diagonal covariance (a vector with one variance per dimension).

Calculating wj

First we compute an intermediate quantity rj, which is the sum of the gammas corresponding to a given cluster kj. In other words,

$ r_{j} = \sum_{i=0}^3 γ_{ij} $

Basically, rj is the sum of the gammas in the jth column. We then use rj to calculate wj, the “mixing weight” of the jth gaussian, by dividing by the number of features (in this case 4):

$ w_{j} = {r_{j} \over 4} $
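As a quick numeric check, r and w can be computed for all clusters at once. The responsibility values here are made up for illustration:

```python
import numpy as np

# made-up responsibility table: rows are features, columns are clusters
expectations = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.6, 0.2, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
])

r = expectations.sum(axis=0)  # one r_j per cluster (column sums)
w = r / len(expectations)     # divide by the number of features (4)
```

Because every row of the responsibility table sums to 1, the mixing weights always sum to 1 as well.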

Calculating μj

Next we calculate the means μj of the gaussians i.e. the point in n-dimensional space over which the (n+1)-dimensional bell curve is centered. Because in our example each feature xi contains 3 dimensions (x0 = [d00, d01, d02]), the means will also need to have 3 dimensions each.

μj is calculated as follows:

$$ μ_{j} = {1 \over r_{j} } \sum_{i=0}^3 γ_{ij} x_{i} $$

Let’s work through the first mean, μ0. For simplicity we will multiply both sides by r0 for now:

$$ r_0 μ_0 = γ_{00} x_{0} + γ_{10} x_{1} + γ_{20} x_{2} + γ_{30} x_{3} $$

$$ = γ_{00} \begin{bmatrix} d_{00} \\ d_{01} \\ d_{02} \end{bmatrix} + γ_{10} \begin{bmatrix} d_{10} \\ d_{11} \\ d_{12} \end{bmatrix} + γ_{20} \begin{bmatrix} d_{20} \\ d_{21} \\ d_{22} \end{bmatrix} + γ_{30} \begin{bmatrix} d_{30} \\ d_{31} \\ d_{32} \end{bmatrix}$$

So we are essentially taking a weighted sum of all of our features, with the weights being the gammas associated with this gaussian. This sum of scalar-vector products can be rolled up into a single matrix multiplication:

$$ = \begin{bmatrix} d_{00} & d_{10} & d_{20} & d_{30} \\ d_{01} & d_{11} & d_{21} & d_{31} \\ d_{02} & d_{12} & d_{22} & d_{32} \end{bmatrix} \begin{bmatrix} γ_{00} \\ γ_{10} \\ γ_{20} \\ γ_{30} \\ \end{bmatrix} = \begin{bmatrix} μ_{00} \\ μ_{01} \\ μ_{02} \end{bmatrix} $$

Don’t forget to divide by r0 again! It makes sense that each μ should have 3 dimensions, since our features also each have 3 dimensions.
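Here is that calculation in numpy, using made-up numbers for both the features and the gammas (all the values are illustrative assumptions):

```python
import numpy as np

# made-up data: 4 features, 3 dimensions each
features = np.array([
    [1.0, 2.0, 0.5],
    [1.2, 1.8, 0.4],
    [5.0, 5.5, 4.8],
    [4.9, 6.0, 5.1],
])
# made-up gammas for cluster 0, one per feature
gammas = np.array([0.7, 0.6, 0.1, 0.1])

r0 = gammas.sum()
# the matrix product from the equation, then the divide by r0
mu0 = (features.T @ gammas) / r0
```

Note that numpy also lets you write the same product as gammas @ features, which is the form used in the maximization code later in this article.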

Calculating σj

Technically this is σ2 if you’re here from statistics, but since we never take the square root it doesn’t matter for our purposes. Because we are using a diagonal covariance, σj is a vector with one variance per dimension: the γ-weighted average of the squared deviations from the mean, with the square taken elementwise:

$$ σ_{j} = {1 \over r_{j}} \sum_{i=0}^3 γ_{ij} (x_{i} - μ_{j})^2 $$
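In code, the diagonal covariance for a cluster is the γ-weighted average of squared deviations from that cluster’s mean, one value per dimension (again using made-up values):

```python
import numpy as np

# made-up data: 4 features, 3 dimensions each
features = np.array([
    [1.0, 2.0, 0.5],
    [1.2, 1.8, 0.4],
    [5.0, 5.5, 4.8],
    [4.9, 6.0, 5.1],
])
gammas = np.array([0.7, 0.6, 0.1, 0.1])  # made-up gammas for cluster 0

r0 = gammas.sum()
mu0 = (gammas @ features) / r0
# gamma-weighted average of squared deviations, one variance per dimension
sigma0 = (gammas @ (features - mu0) ** 2) / r0
```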

All together now. E step equation, M step code.

The full equation for the expectation step is

$$ \gamma_{ij} = \frac{ w_j \, \mathcal{N}\!\left(x_i \mid \mu_j, \Sigma_j\right) }{ \sum_{k=0}^{K-1} w_k \, \mathcal{N}\!\left(x_i \mid \mu_k, \Sigma_k\right) } $$

Alright, enough math. As fun as it is for you to read, it’s even more fun to write in markdown.

All of the math above for calculating w, μ and σ can be written succinctly in python. Here I am using numpy to take advantage of some pretty slick matrix operations it has available.

"""
Fit GMM to the given data using self.n_clusters number of gaussians.
Features may be multi-dimensional.

Args:
  features: numpy array containing inputs of size
    (n_samples, n_dimensions)
  expectations: numpy array containing gammas from E step. Size is
    (n_features, n_clusters)
Returns:
  means: updated means
  covariances: updated covariances
  mixing_weights: updated mixing weights
"""

def maximization_step(self, features, expectations):
  # initialize lists to hold our parameters
  mixing_weights, means, covariances = [], [], []

  for cluster in range(self.n_clusters):
    # gammas will hold a column from the expectations matrix
    gammas = expectations[:,cluster]

    # r is just the sum of a column
    r = np.sum(gammas)

    # calculate mixing weight
    w = r / len(features)
    mixing_weights.append(w)

    # calculate means
    # the @ symbol is a slick shortcut for matrix multiplication
    mu = (gammas @ features) / r
    means.append(mu)

    # calculate covariances
    diff_sq = (features-mu) ** 2
    cov = (gammas @ diff_sq) / r
    covariances.append(cov)

  return means, covariances, mixing_weights

The predict() method

Once the model is trained, we have learned:

  • Mixing weights wj
  • Means μj
  • Covariances Σj

To classify a new data point x we compute its responsibilities using the same formula from the E-step:

$$ \gamma_j(x) = \frac{ w_j \, \mathcal{N}\!\left(x \mid \mu_j, \Sigma_j\right) }{ \sum_{k=0}^{K-1} w_k \, \mathcal{N}\!\left(x \mid \mu_k, \Sigma_k\right) } $$

This yields a probability distribution over all clusters. If we only want the most likely cluster,

$$ \text{cluster}(x) = \arg\max_{j} \, \gamma_j(x) $$

def predict_probs(self, features):
    gamma, _ = self._e_step(features)
    return gamma

def predict(self, features):
    return np.argmax(self.predict_probs(features), axis=1)