One-line definition
Expectation-Maximization (Dempster, Laird & Rubin, 1977) is an iterative algorithm for finding MLE / MAP parameter estimates in latent-variable models. Each iteration alternates:
- E-step: compute the posterior over the latents given the current parameters.
- M-step: update the parameters $\theta$ to maximize the expected complete-data log-likelihood under that posterior.
EM monotonically increases the log-likelihood until convergence to a local optimum.
Why it matters
When a model has latent variables $z$ that are unobserved, direct MLE involves a marginalization over $z$ that is usually intractable to optimize. EM avoids this by alternately filling in expected values of $z$ and optimizing $\theta$ on the completed data.
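To see the difficulty concretely, the objective puts a sum over latents inside the log (an integral in the continuous case):

```latex
% Marginal log-likelihood: the sum over latents sits inside the log.
\log p(x \mid \theta) = \log \sum_{z} p(x, z \mid \theta)
% The log of a sum does not decompose term by term, so the closed-form
% updates available for the complete-data likelihood \log p(x, z \mid \theta)
% are lost once z is marginalized out.
```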
EM underlies:
- Gaussian mixture model (GMM) fitting.
- Hidden Markov model (HMM) parameter estimation (Baum-Welch is EM).
- Probabilistic PCA, factor analysis, ICA.
- LDA topic models.
- Many missing-data imputation methods.
The two steps
For a model with observed data $x$, latents $z$, and parameters $\theta$:
E-step
Given current parameters $\theta^{(t)}$, compute the posterior over latents:

$$q^{(t)}(z) = p\big(z \mid x, \theta^{(t)}\big)$$
For models with conjugate or finite latent structure, this is closed-form (GMM: posterior responsibilities; HMM: forward-backward).
M-step
Update $\theta$ to maximize the expected complete-data log-likelihood:

$$\theta^{(t+1)} = \arg\max_{\theta} \; Q\big(\theta, \theta^{(t)}\big), \qquad Q\big(\theta, \theta^{(t)}\big) = \mathbb{E}_{z \sim q^{(t)}}\big[\log p(x, z \mid \theta)\big]$$
Often this expectation reduces to weighted MLE over the data with the latents replaced by their expected values.
Why it works
EM maximizes a lower bound on $\log p(x \mid \theta)$. For any distribution $q(z)$:

$$\log p(x \mid \theta) = \mathbb{E}_q\big[\log p(x, z \mid \theta)\big] - \mathbb{E}_q\big[\log q(z)\big] + \mathrm{KL}\big(q(z) \,\|\, p(z \mid x, \theta)\big)$$
The first two terms are the ELBO (evidence lower bound). The KL is non-negative. EM:
- E-step sets $q(z) = p(z \mid x, \theta^{(t)})$ → KL = 0 → ELBO $= \log p(x \mid \theta^{(t)})$.
- M-step maximizes the ELBO over $\theta$ holding $q$ fixed → never decreases $\log p(x \mid \theta)$.
So $\log p(x \mid \theta^{(t+1)}) \ge \log p(x \mid \theta^{(t)})$. Convergence to a stationary point, in practice a local optimum, is guaranteed.
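Chaining the two steps gives the monotonicity guarantee in one line, with $\mathrm{ELBO}(q, \theta)$ denoting the first two terms of the decomposition above:

```latex
% M-step improves the bound; the E-step had made the bound tight at \theta^{(t)}.
\log p\big(x \mid \theta^{(t+1)}\big)
  \;\ge\; \mathrm{ELBO}\big(q^{(t)}, \theta^{(t+1)}\big)
  \;\ge\; \mathrm{ELBO}\big(q^{(t)}, \theta^{(t)}\big)
  \;=\; \log p\big(x \mid \theta^{(t)}\big)
```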
Canonical example: GMM
A GMM has $K$ Gaussian components with mixture weights $\pi_k$, means $\mu_k$, and covariances $\Sigma_k$. A latent indicator $z_i \in \{1, \dots, K\}$ assigns each point $x_i$ to a component.
E-step: posterior responsibility of component $k$ for point $x_i$:

$$\gamma_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$$

M-step: weighted MLE updates, with effective counts $N_k = \sum_i \gamma_{ik}$:

$$\pi_k = \frac{N_k}{N}, \qquad \mu_k = \frac{1}{N_k} \sum_i \gamma_{ik}\, x_i, \qquad \Sigma_k = \frac{1}{N_k} \sum_i \gamma_{ik}\, (x_i - \mu_k)(x_i - \mu_k)^{\top}$$
Iterate until log-likelihood stabilizes.
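A minimal NumPy sketch of this loop (the function name `gmm_em`, the initialization scheme, and the `reg` jitter are illustrative choices, not part of the algorithm's definition):

```python
import numpy as np

def gmm_em(X, K, n_iter=100, tol=1e-6, reg=1e-6, seed=0):
    """Fit a K-component GMM by EM. Minimal sketch, not production code."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Init: means at distinct random data points, equal weights,
    # shared data covariance (see the pitfalls section on initialization).
    mu = X[rng.choice(N, size=K, replace=False)]
    Sigma = np.stack([np.cov(X.T) + reg * np.eye(D)] * K)
    pi = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: log responsibilities, computed via Cholesky + log-sum-exp
        # so that small probabilities do not underflow.
        log_r = np.empty((N, K))
        for k in range(K):
            diff = X - mu[k]
            L = np.linalg.cholesky(Sigma[k])
            maha = np.sum(np.linalg.solve(L, diff.T) ** 2, axis=0)
            log_det = 2.0 * np.sum(np.log(np.diag(L)))
            log_r[:, k] = np.log(pi[k]) - 0.5 * (D * np.log(2 * np.pi) + log_det + maha)
        point_ll = np.logaddexp.reduce(log_r, axis=1)   # per-point log-likelihood
        r = np.exp(log_r - point_ll[:, None])           # responsibilities, rows sum to 1
        # M-step: weighted MLE updates.
        Nk = r.sum(axis=0)                              # effective counts N_k
        pi = Nk / N
        mu = (r.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            # Weighted scatter plus eps*I jitter against singular covariances.
            Sigma[k] = (r[:, k] * diff.T) @ diff / Nk[k] + reg * np.eye(D)
        total_ll = point_ll.sum()
        if total_ll - prev_ll < tol:                    # log-likelihood stabilized
            break
        prev_ll = total_ll
    return pi, mu, Sigma, total_ll
```

The log-sum-exp in the E-step and the $\epsilon I$ jitter on each covariance guard against the underflow and singular-covariance pitfalls listed below.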
Variants
- Hard-assignment EM (k-means): replace soft responsibilities with hard 0/1 assignments. k-means is EM for a GMM with equal weights $\pi_k = 1/K$ and shared covariance $\sigma^2 I$ in the limit $\sigma^2 \to 0$ (a minimal sketch follows this list).
- Stochastic EM: sample $z$ from the posterior instead of computing expected values; useful for large $N$.
- Variational EM: replace exact posterior with a variational approximation; modern incarnation is the VAE.
- MAP-EM: include a prior $p(\theta)$; the M-step maximizes $Q(\theta, \theta^{(t)}) + \log p(\theta)$.
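To make the first variant concrete, a hard-assignment sketch under the same NumPy conventions (the helper name `kmeans_hard_em` is hypothetical):

```python
import numpy as np

def kmeans_hard_em(X, K, n_iter=100, seed=0):
    """k-means viewed as hard-assignment EM. Illustrative sketch."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # "E-step": hard 0/1 responsibility = nearest centroid.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
        z = d2.argmin(axis=1)
        # "M-step": each mean is the unweighted average of its assigned points;
        # keep the old centroid if a cluster ends up empty.
        new_mu = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, z
```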
Limitations
- Local optima: EM converges to a local maximum; results depend on initialization. Multiple random restarts are standard.
- Slow near the optimum: convergence is linear and becomes sluggish in flat regions of the likelihood (e.g. strongly overlapping components).
- Requires known model structure: number of components, latent dimensions, etc.
Common pitfalls
- Initializing GMM means at the same point. All components receive identical responsibilities and collapse to the same Gaussian; initialize with k-means++ or distinct random data points.
- Singular covariances. A component that collapses onto a single point gets $\Sigma_k \to 0$ and the log-likelihood diverges. Add regularization $\Sigma_k \leftarrow \Sigma_k + \epsilon I$.
- Comparing log-likelihood across runs without checking convergence. EM monotonically increases the log-likelihood within a run, but different inits reach different local optima (see the restart sketch after this list).
- Confusing EM with k-means. k-means is hard-assignment EM on a restricted GMM; full EM gives soft assignments and arbitrary covariances.
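A sketch of the standard remedy for the initialization and comparison pitfalls, reusing the hypothetical `gmm_em` above and assuming `X` is the `(N, D)` data array:

```python
# Run EM from several random initializations and keep the best run.
# Comparing final log-likelihoods is only fair if every run has converged.
fits = [gmm_em(X, K=3, n_iter=500, seed=s) for s in range(10)]
pi, mu, Sigma, log_lik = max(fits, key=lambda fit: fit[3])  # fit[3] = final log-likelihood
```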
Related
- Gaussian mixture models. The canonical EM application.
- Hidden Markov models. Sequential model trained by EM (Baum-Welch).