One-line definition
A Gaussian mixture model assumes each data point is generated by sampling a component $k$ from a categorical distribution with weights $\pi_1, \dots, \pi_K$ and then sampling $x \sim \mathcal{N}(\mu_k, \Sigma_k)$:

$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1$$
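A minimal NumPy sketch of this two-step generative process (the weights, means, and covariances are made-up illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-component mixture in 2-d (made-up parameters).
weights = np.array([0.3, 0.7])                            # pi_k, sums to 1
means = np.array([[0.0, 0.0], [4.0, 4.0]])                # mu_k
covs = np.array([np.eye(2), [[2.0, 0.6], [0.6, 1.0]]])    # Sigma_k

def sample(n):
    # Step 1: draw a component index per point from Categorical(pi).
    z = rng.choice(len(weights), size=n, p=weights)
    # Step 2: draw from the chosen component's Gaussian.
    return np.stack([rng.multivariate_normal(means[k], covs[k]) for k in z])

X = sample(500)  # (500, 2) array of mixture samples
```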
Why it matters
GMMs are the canonical:
- Soft clustering algorithm (each point has a posterior responsibility distribution over components, not a hard assignment).
- Continuous density estimator with adjustable expressiveness via the number of components $K$.
- Building block for HMMs (state emissions), VAEs (mixture priors), and many speech / vision systems.
- EM teaching example. The simplest non-trivial latent-variable model with closed-form EM updates.
The latent variable view
Define a one-hot latent $z \in \{0,1\}^K$ with $p(z_k = 1) = \pi_k$. Then $p(x \mid z_k = 1) = \mathcal{N}(x \mid \mu_k, \Sigma_k)$. Marginalize: $p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$.
The latent is what makes the GMM a soft-clustering algorithm: the posterior $\gamma_{ik} = p(z_k = 1 \mid x_i)$ tells you how much “responsibility” component $k$ takes for point $x_i$.
Training: EM
Initialize $\{\pi_k, \mu_k, \Sigma_k\}$ (k-means++ for the means, identity for the covariances). Iterate until the log-likelihood stabilizes.
E-step: for each point $x_i$ and component $k$, compute the responsibility

$$\gamma_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$$
M-step: with $N_k = \sum_i \gamma_{ik}$,

$$\pi_k = \frac{N_k}{N}, \qquad \mu_k = \frac{1}{N_k} \sum_i \gamma_{ik} \, x_i, \qquad \Sigma_k = \frac{1}{N_k} \sum_i \gamma_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top$$
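A compact NumPy/SciPy sketch of one EM iteration for full covariances (function and variable names are mine, not from any particular library; responsibilities are computed in log space for stability):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pi, mu, Sigma, eps=1e-6):
    """One EM iteration for a full-covariance GMM.
    Shapes: X (N, d), pi (K,), mu (K, d), Sigma (K, d, d)."""
    N, d = X.shape
    K = len(pi)

    # E-step: responsibilities gamma[i, k] = p(z_k = 1 | x_i).
    log_p = np.stack([np.log(pi[k]) + multivariate_normal.logpdf(X, mu[k], Sigma[k])
                      for k in range(K)], axis=1)      # (N, K)
    log_p -= log_p.max(axis=1, keepdims=True)          # shift by row max (stability)
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)

    # M-step: responsibility-weighted maximum-likelihood updates.
    Nk = gamma.sum(axis=0)                             # effective cluster sizes
    pi = Nk / N
    mu = (gamma.T @ X) / Nk[:, None]
    Sigma = np.empty((K, d, d))
    for k in range(K):
        Xc = X - mu[k]
        # Weighted scatter, plus eps*I jitter to avoid singular covariances.
        Sigma[k] = (gamma[:, k, None] * Xc).T @ Xc / Nk[k] + eps * np.eye(d)
    return pi, mu, Sigma
```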
Covariance structure
Trade expressiveness for parameter count by restricting $\Sigma_k$ (data dimension $d$):
| Form | Parameters per component | Use case |
|---|---|---|
| Full | $d(d+1)/2$ | Default; expressive |
| Diagonal | $d$ | High-d data, fewer parameters |
| Spherical | $1$ | Approximates k-means with soft assignments |
| Tied (shared across components) | $d(d+1)/2$ once | Used in linear discriminant analysis |
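These restrictions map onto scikit-learn's covariance_type argument ('full', 'diag', 'spherical', 'tied'); a quick comparison sketch, assuming some data matrix X:

```python
from sklearn.mixture import GaussianMixture

# Fit the same data under each covariance restriction and compare BIC,
# which penalizes the extra parameters of the richer forms.
for cov_type in ["full", "diag", "spherical", "tied"]:
    gmm = GaussianMixture(n_components=3, covariance_type=cov_type,
                          random_state=0).fit(X)
    print(cov_type, gmm.bic(X))
```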
Choosing $K$
Same problem as k-means. Standard approaches:
- Information criteria: BIC ($-2\log\hat{L} + p\log N$) or AIC ($-2\log\hat{L} + 2p$), with $p$ the number of free parameters. Pick the $K$ that minimizes the criterion (sketch after this list).
- Cross-validation log-likelihood.
- Dirichlet process mixture (infinite GMM): non-parametric Bayesian alternative that lets $K$ grow with the data.
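A BIC-selection sketch with scikit-learn (the candidate range 1..10 is an arbitrary choice; X is assumed given):

```python
from sklearn.mixture import GaussianMixture

# Fit K = 1..10 and keep the model with the lowest BIC.
fits = [GaussianMixture(n_components=k, random_state=0).fit(X)
        for k in range(1, 11)]
best = min(fits, key=lambda m: m.bic(X))
print("chosen K:", best.n_components)
```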
GMM vs. k-means
| Property | k-means | GMM |
|---|---|---|
| Cluster shape | Spherical, equal radius | Arbitrary ellipsoid (full $\Sigma_k$) |
| Assignment | Hard (0/1) | Soft (probability over components) |
| Parameters | $K$ centroids | $K$ means + covariances + weights |
| Compute per iter | $O(NKd)$ | $O(NKd^2)$ for full $\Sigma$ |
| Soft probabilities | No | Yes |
| Density estimation | No (Voronoi cells) | Yes |
GMM generalizes k-means: k-means is a GMM with shared spherical covariance $\sigma^2 I$, equal weights, and hard assignments (the $\sigma \to 0$ limit of EM).
Use cases
- Soft clustering with ellipsoidal clusters.
- Density estimation for low-d continuous data.
- Anomaly detection: low-likelihood points under the fitted mixture are flagged as anomalies (sketch after this list).
- Speaker diarization: GMMs over MFCC features (legacy speech).
- Background subtraction in video: GMM per pixel.
- Mixture of experts gating in neural networks (related but distinct).
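For the anomaly-detection use above, a minimal sketch: scikit-learn's score_samples returns per-point log-likelihood; the 1st-percentile threshold and the X_train/X_test names are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=3, random_state=0).fit(X_train)

# Per-point log p(x); low values = poorly explained by the mixture.
scores = gmm.score_samples(X_test)
threshold = np.percentile(gmm.score_samples(X_train), 1)  # bottom 1% of train
anomalies = X_test[scores < threshold]
```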
For high-dimensional data: pure GMMs struggle (full-covariance estimation requires $d(d+1)/2$ parameters per component). Use diagonal covariance, dimensionality reduction, or normalizing-flow alternatives.
Common pitfalls
- Covariance singularities. A component collapsing onto a single point has $\det \Sigma_k \to 0$ and the log-likelihood diverges. Add jitter $\epsilon I$ to each covariance (see the sketch after this list).
- Initialization. Random init often gives degenerate clusters; use k-means++ centroids as initial means.
- Reading mixture weights as cluster sizes. $\pi_k$ is the prior; the effective cluster size is $N_k = \sum_i \gamma_{ik}$.
- Comparing GMM and k-means on cluster counts. GMM components and k-means clusters are not directly comparable; soft mass distributes differently.
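Both fixes are exposed directly in scikit-learn, as a sketch (reg_covar adds the jitter to each covariance diagonal; init_params="kmeans" is the default initialization, and recent versions also accept "k-means++"):

```python
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(
    n_components=3,
    reg_covar=1e-6,        # jitter added to each covariance diagonal
    init_params="kmeans",  # k-means initialization for the means
    n_init=5,              # best of 5 restarts guards against bad local optima
    random_state=0,
)
```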
Related
- Expectation-Maximization. The training algorithm.
- k-means clustering. Special case of GMM.