Gaussian mixture models

Model data as a weighted sum of K Gaussians. Soft clustering, density estimation, and the canonical EM example.

Reviewed · 3 min read

One-line definition

A Gaussian mixture model assumes each data point is generated by sampling a component $z \sim \mathrm{Categorical}(\pi_1, \dots, \pi_K)$ and then sampling $x \mid z = k \sim \mathcal{N}(\mu_k, \Sigma_k)$:

$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \sum_{k} \pi_k = 1.$$
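
A minimal sketch of this generative process in NumPy; the weights, means, and covariances below are made-up illustration values:

```python
import numpy as np

rng = np.random.default_rng(0)

pi = np.array([0.5, 0.3, 0.2])                          # mixture weights, sum to 1
mu = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 2.0]])    # component means (K=3, d=2)
cov = np.array([np.eye(2), 0.5 * np.eye(2), np.diag([1.0, 0.2])])  # covariances

def sample_gmm(n):
    z = rng.choice(len(pi), size=n, p=pi)                # latent component indices
    x = np.stack([rng.multivariate_normal(mu[k], cov[k]) for k in z])
    return x, z

X, z = sample_gmm(500)
print(X.shape, np.bincount(z) / len(z))                  # empirical weights ~ pi
```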

Why it matters

GMMs are the canonical:

  • Soft clustering algorithm (each point has a posterior responsibility distribution over components, not a hard assignment).
  • Continuous density estimator with adjustable expressiveness via the number of components $K$.
  • Building block for HMMs (state emissions), VAEs (mixture priors), and many speech / vision systems.
  • EM teaching example. The simplest non-trivial latent-variable model with closed-form EM updates.

The latent variable view

Define a one-hot latent $z \in \{0, 1\}^K$ with $p(z_k = 1) = \pi_k$. Then $p(x \mid z_k = 1) = \mathcal{N}(x \mid \mu_k, \Sigma_k)$. Marginalize: $p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$.

The latent is what makes the GMM a soft-clustering algorithm: the posterior $\gamma_{ik} = p(z_k = 1 \mid x_i)$ tells you how much “responsibility” component $k$ has for point $x_i$.

Training: EM

Initialize $\{\pi_k, \mu_k, \Sigma_k\}$ (k-means++ for means, identity for covariances). Iterate until the log-likelihood stabilizes.

E-step: for each point $x_i$ and component $k$, compute the responsibility

$$\gamma_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}.$$

M-step: re-estimate the parameters with the responsibilities as weights:

$$N_k = \sum_{i=1}^{N} \gamma_{ik}, \qquad \pi_k = \frac{N_k}{N}, \qquad \mu_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik} \, x_i, \qquad \Sigma_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik} \, (x_i - \mu_k)(x_i - \mu_k)^\top.$$
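
The whole loop fits in a short NumPy sketch. This is an illustrative implementation of the updates above, not a reference routine; it uses random-point initialization instead of k-means++ and adds a small covariance jitter for stability:

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gmm_em(X, K, n_iter=100, tol=1e-6, seed=0):
    """Minimal EM for a full-covariance GMM. X: (N, d) array."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    mu = X[rng.choice(N, K, replace=False)]          # random points as initial means
    Sigma = np.array([np.eye(d) for _ in range(K)])  # identity covariances
    pi = np.full(K, 1.0 / K)                         # uniform weights
    prev_ll = -np.inf

    for _ in range(n_iter):
        # E-step: responsibilities gamma[i, k] = p(z = k | x_i).
        joint = np.column_stack([
            pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k]) for k in range(K)
        ])                                           # (N, K) values of p(x_i, z = k)
        ll = np.log(joint.sum(axis=1)).sum()         # data log-likelihood
        gamma = joint / joint.sum(axis=1, keepdims=True)

        # M-step: weighted maximum-likelihood updates.
        Nk = gamma.sum(axis=0)                       # effective cluster sizes
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
            Sigma[k] += 1e-6 * np.eye(d)             # jitter against singular covariances

        if ll - prev_ll < tol:                       # stop when log-likelihood stabilizes
            break
        prev_ll = ll
    return pi, mu, Sigma, gamma
```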

Covariance structure

Trade expressiveness for parameter count by restricting $\Sigma_k$:

| Form | Parameters per component | Use case |
|---|---|---|
| Full | $d(d+1)/2$ | Default; expressive |
| Diagonal | $d$ | High-d data, fewer parameters |
| Spherical | 1 | Approximates k-means with soft assignments |
| Tied (shared across components) | $d(d+1)/2$ once | Used in linear discriminant analysis |
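
In scikit-learn's GaussianMixture these forms map onto the covariance_type argument; a quick sketch with placeholder random data, showing the shape of the fitted covariance array for each choice:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(500, 4))   # placeholder data, d = 4

for cov_type in ["full", "diag", "spherical", "tied"]:
    gm = GaussianMixture(n_components=3, covariance_type=cov_type, random_state=0).fit(X)
    # Shapes: full -> (K, d, d), diag -> (K, d), spherical -> (K,), tied -> (d, d)
    print(cov_type, gm.covariances_.shape)
```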

Choosing K

Same problem as k-means. Standard approaches:

  • Information criteria: BIC ($-2 \log \hat{L} + p \log N$) or AIC ($-2 \log \hat{L} + 2p$), where $p$ is the number of free parameters. Pick the $K$ that minimizes the criterion.
  • Cross-validation log-likelihood.
  • Dirichlet process mixture (infinite GMM): non-parametric Bayesian alternative that lets the effective $K$ grow with the data.
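
A sketch of BIC-based selection with scikit-learn; the data here is a random placeholder, and the candidate range 1–10 is arbitrary:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(500, 2))   # placeholder data

bics = []
for k in range(1, 11):
    gm = GaussianMixture(n_components=k, covariance_type="full", random_state=0).fit(X)
    bics.append(gm.bic(X))                            # lower is better

best_k = int(np.argmin(bics)) + 1
print("BIC-selected K:", best_k)
```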

GMM vs. k-means

| Property | k-means | GMM |
|---|---|---|
| Cluster shape | Spherical, equal radius | Arbitrary ellipsoid (full $\Sigma_k$) |
| Assignment | Hard (0/1) | Soft (probability over components) |
| Parameters | $K$ centroids | $K$ centroids + covariances + weights |
| Compute per iter | $O(NKd)$ | $O(NKd^2)$ for full $\Sigma_k$ |
| Soft probabilities | No | Yes |
| Density estimation | No (Voronoi cells) | Yes |

GMM is k-means generalized: k-means = GMM with shared identity covariance, equal weights, and hard assignments.
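
A small sketch of the practical difference using scikit-learn on placeholder data: k-means returns one label per point, a GMM returns a distribution over components:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(300, 2))   # placeholder data

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
gm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)

hard = km.labels_                  # each point gets exactly one cluster
soft = gm.predict_proba(X)         # each point gets a distribution over components
print(hard[:5])
print(soft[:5].round(2))           # rows sum to 1
```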

Use cases

  • Soft clustering with ellipsoidal clusters.
  • Density estimation for low-d continuous data.
  • Anomaly detection: low-likelihood points are anomalies (a sketch follows below).
  • Speaker diarization: GMMs over MFCC features (legacy speech).
  • Background subtraction in video: GMM per pixel.
  • Mixture of experts gating in neural networks (related but distinct).

For high-dimensional data: pure GMMs struggle (full-covariance estimation requires $O(d^2)$ parameters per component). Use diagonal covariances, dimensionality reduction, or normalizing-flow alternatives.
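
For the anomaly-detection use case, the usual recipe is to threshold the per-point log-likelihood; a sketch with placeholder data and an arbitrary bottom-1% threshold:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 2))                          # "normal" data
X_test = np.vstack([rng.normal(size=(5, 2)), [[8.0, 8.0]]])   # last point is an outlier

gm = GaussianMixture(n_components=3, random_state=0).fit(X_train)
log_lik = gm.score_samples(X_test)                            # per-point log p(x)
threshold = np.quantile(gm.score_samples(X_train), 0.01)      # bottom 1% of training likelihood
print(log_lik < threshold)                                    # True -> flagged as anomaly
```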

Common pitfalls

  • Covariance singularities. A component that collapses onto a single point has $\det \Sigma_k \to 0$ and the log-likelihood diverges. Add jitter $\epsilon I$ to each covariance (see the sketch after this list).
  • Initialization. Random init often gives degenerate clusters; use k-means++ centroids as initial means.
  • Reading mixture weights as cluster sizes. $\pi_k$ is the prior; the effective cluster size is $N_k = \sum_i \gamma_{ik}$.
  • Comparing GMM and k-means on cluster counts. GMM components and k-means clusters are not directly comparable; soft mass distributes differently.
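
For the singularity pitfall, scikit-learn exposes the jitter as the reg_covar argument (a constant added to the diagonal of every estimated covariance); a minimal sketch with placeholder data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(200, 2))   # placeholder data

# reg_covar keeps every covariance positive definite even if a component
# tries to collapse onto a single point.
gm = GaussianMixture(n_components=3, reg_covar=1e-5, random_state=0).fit(X)
print(np.linalg.eigvalsh(gm.covariances_).min())      # all eigenvalues stay > 0
```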