One-line definition
A Gaussian mixture model assumes each data point is generated by sampling a component $k$ from a categorical distribution with weights $\pi_1, \dots, \pi_K$ and then sampling $x \sim \mathcal{N}(\mu_k, \Sigma_k)$:

$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1$$
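A minimal NumPy sketch of this two-step generative process (the weights, means, and covariances are made-up illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-component mixture in 2-d (made-up parameters).
weights = np.array([0.3, 0.7])                            # pi_k, sums to 1
means = np.array([[0.0, 0.0], [4.0, 4.0]])                # mu_k
covs = np.array([np.eye(2), [[2.0, 0.6], [0.6, 1.0]]])    # Sigma_k

def sample(n):
    # Step 1: draw a component index per point from Categorical(pi).
    z = rng.choice(len(weights), size=n, p=weights)
    # Step 2: draw from the chosen component's Gaussian.
    return np.stack([rng.multivariate_normal(means[k], covs[k]) for k in z])

X = sample(500)  # (500, 2) array of mixture samples
```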
Why it matters
GMMs are the canonical:
- Soft clustering algorithm (each point has a posterior responsibility distribution over components, not a hard assignment).
- Continuous density estimator with adjustable expressiveness via the number of components $K$.
- Building block for HMMs (state emissions), VAEs (mixture priors), and many speech / vision systems.
- EM teaching example. The simplest non-trivial latent-variable model with closed-form EM updates.
The latent variable view
Define a one-hot latent $z \in \{0,1\}^K$ with $p(z_k = 1) = \pi_k$. Then $p(x \mid z_k = 1) = \mathcal{N}(x \mid \mu_k, \Sigma_k)$. Marginalize: $p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$.
The latent is what makes the GMM a soft-clustering algorithm: the posterior $\gamma_{ik} = p(z_k = 1 \mid x_i)$ tells you how much “responsibility” component $k$ takes for point $x_i$.
Training: EM
Initialize $\{\pi_k, \mu_k, \Sigma_k\}$ (k-means++ for the means, identity for the covariances). Iterate until the log-likelihood stabilizes.
E-step: for each point $x_i$ and component $k$, compute the responsibility

$$\gamma_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$$
M-step: with $N_k = \sum_i \gamma_{ik}$,

$$\pi_k = \frac{N_k}{N}, \qquad \mu_k = \frac{1}{N_k} \sum_i \gamma_{ik} \, x_i, \qquad \Sigma_k = \frac{1}{N_k} \sum_i \gamma_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top$$
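A compact NumPy/SciPy sketch of one EM iteration for full covariances (function and variable names are mine, not from any particular library; responsibilities are computed in log space for stability):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pi, mu, Sigma, eps=1e-6):
    """One EM iteration for a full-covariance GMM.
    Shapes: X (N, d), pi (K,), mu (K, d), Sigma (K, d, d)."""
    N, d = X.shape
    K = len(pi)

    # E-step: responsibilities gamma[i, k] = p(z_k = 1 | x_i).
    log_p = np.stack([np.log(pi[k]) + multivariate_normal.logpdf(X, mu[k], Sigma[k])
                      for k in range(K)], axis=1)      # (N, K)
    log_p -= log_p.max(axis=1, keepdims=True)          # shift by row max (stability)
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)

    # M-step: responsibility-weighted maximum-likelihood updates.
    Nk = gamma.sum(axis=0)                             # effective cluster sizes
    pi = Nk / N
    mu = (gamma.T @ X) / Nk[:, None]
    Sigma = np.empty((K, d, d))
    for k in range(K):
        Xc = X - mu[k]
        # Weighted scatter, plus eps*I jitter to avoid singular covariances.
        Sigma[k] = (gamma[:, k, None] * Xc).T @ Xc / Nk[k] + eps * np.eye(d)
    return pi, mu, Sigma
```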
Covariance structure
Trade expressiveness for parameter count by restricting $\Sigma_k$ (data dimension $d$):
| Form | Parameters per component | Use case |
|---|---|---|
| Full | $d(d+1)/2$ | Default; expressive |
| Diagonal | $d$ | High-d data, fewer parameters |
| Spherical | $1$ | Approximates k-means with soft assignments |
| Tied (shared across components) | $d(d+1)/2$ once | Used in linear discriminant analysis |
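These restrictions map onto scikit-learn's covariance_type argument ('full', 'diag', 'spherical', 'tied'); a quick comparison sketch, assuming some data matrix X:

```python
from sklearn.mixture import GaussianMixture

# Fit the same data under each covariance restriction and compare BIC,
# which penalizes the extra parameters of the richer forms.
for cov_type in ["full", "diag", "spherical", "tied"]:
    gmm = GaussianMixture(n_components=3, covariance_type=cov_type,
                          random_state=0).fit(X)
    print(cov_type, gmm.bic(X))
```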
Choosing $K$
Same problem as k-means. Standard approaches:
- Information criteria: BIC ($-2\log\hat{L} + p\log N$) or AIC ($-2\log\hat{L} + 2p$), with $p$ the number of free parameters. Pick the $K$ that minimizes the criterion (sketch after this list).
- Cross-validation log-likelihood.
- Dirichlet process mixture (infinite GMM): non-parametric Bayesian alternative that lets $K$ grow with the data.
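A BIC-selection sketch with scikit-learn (the candidate range 1..10 is an arbitrary choice; X is assumed given):

```python
from sklearn.mixture import GaussianMixture

# Fit K = 1..10 and keep the model with the lowest BIC.
fits = [GaussianMixture(n_components=k, random_state=0).fit(X)
        for k in range(1, 11)]
best = min(fits, key=lambda m: m.bic(X))
print("chosen K:", best.n_components)
```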
GMM vs. k-means
| Property | k-means | GMM |
|---|---|---|
| Cluster shape | Spherical, equal radius | Arbitrary ellipsoid (full $\Sigma_k$) |
| Assignment | Hard (0/1) | Soft (probability over components) |
| Parameters | $K$ centroids | $K$ means + covariances + weights |
| Compute per iter | $O(NKd)$ | $O(NKd^2)$ for full $\Sigma$ |
| Soft probabilities | No | Yes |
| Density estimation | No (Voronoi cells) | Yes |
GMM generalizes k-means: k-means is a GMM with shared spherical covariance $\sigma^2 I$, equal weights, and hard assignments (the $\sigma \to 0$ limit of EM).
Use cases
- Soft clustering with ellipsoidal clusters.
- Density estimation for low-d continuous data.
- Anomaly detection: low-likelihood points under the fitted mixture are flagged as anomalies (sketch after this list).
- Speaker diarization: GMMs over MFCC features (legacy speech).
- Background subtraction in video: GMM per pixel.
- Mixture of experts gating in neural networks (related but distinct).
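For the anomaly-detection use above, a minimal sketch: scikit-learn's score_samples returns per-point log-likelihood; the 1st-percentile threshold and the X_train/X_test names are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=3, random_state=0).fit(X_train)

# Per-point log p(x); low values = poorly explained by the mixture.
scores = gmm.score_samples(X_test)
threshold = np.percentile(gmm.score_samples(X_train), 1)  # bottom 1% of train
anomalies = X_test[scores < threshold]
```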
For high-dimensional data: pure GMMs struggle (full-covariance estimation requires $d(d+1)/2$ parameters per component). Use diagonal covariance, dimensionality reduction, or normalizing-flow alternatives.
Common pitfalls
- Covariance singularities. A component collapsing onto a single point has $\det \Sigma_k \to 0$ and the log-likelihood diverges. Add jitter $\epsilon I$ to each covariance (see the sketch after this list).
- Initialization. Random init often gives degenerate clusters; use k-means++ centroids as initial means.
- Reading mixture weights as cluster sizes. $\pi_k$ is the prior; the effective cluster size is $N_k = \sum_i \gamma_{ik}$.
- Comparing GMM and k-means on cluster counts. GMM components and k-means clusters are not directly comparable; soft mass distributes differently.
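Both fixes are exposed directly in scikit-learn, as a sketch (reg_covar adds the jitter to each covariance diagonal; init_params="kmeans" is the default initialization, and recent versions also accept "k-means++"):

```python
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(
    n_components=3,
    reg_covar=1e-6,        # jitter added to each covariance diagonal
    init_params="kmeans",  # k-means initialization for the means
    n_init=5,              # best of 5 restarts guards against bad local optima
    random_state=0,
)
```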
Related
- Expectation-Maximization. The training algorithm.
- k-means clustering. Special case of GMM.