Expectation-Maximization (EM)

Alternate between estimating the latent variables given the parameters (E-step) and updating the parameters given the latents (M-step). The standard tool for maximum-likelihood estimation in latent-variable models.

Reviewed · 3 min read

One-line definition

Expectation-Maximization (Dempster, Laird & Rubin, 1977) is an iterative algorithm for finding MLE / MAP parameter estimates in latent-variable models. Each iteration alternates:

  1. E-step: compute the posterior over latents, $p(z \mid x, \theta^{(t)})$.
  2. M-step: update $\theta^{(t+1)} = \arg\max_\theta \, \mathbb{E}_{z \sim p(z \mid x,\, \theta^{(t)})}\!\left[\log p(x, z \mid \theta)\right]$.

EM monotonically increases the log-likelihood until convergence to a local optimum.
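
In code, the alternation looks like the following minimal sketch (a generic loop; `e_step` and `m_step` are hypothetical callables you would supply for a specific model):

```python
import numpy as np

def fit_em(x, theta0, e_step, m_step, tol=1e-6, max_iter=200):
    """Generic EM loop. Hypothetical interface:
      e_step(x, theta) -> (posterior_over_latents, log_likelihood)
      m_step(x, posterior_over_latents) -> updated theta
    """
    theta, prev_ll = theta0, -np.inf
    for _ in range(max_iter):
        posterior, ll = e_step(x, theta)   # E-step: p(z | x, theta), plus current log p(x | theta)
        theta = m_step(x, posterior)       # M-step: maximize expected complete-data log-likelihood
        if ll - prev_ll < tol:             # monotone increase, so a tiny improvement means convergence
            break
        prev_ll = ll
    return theta
```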

Why it matters

When you have a latent-variable model, direct MLE requires maximizing the marginal likelihood $\log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta)$ (an integral for continuous $z$), which is usually intractable. EM sidesteps this by alternately filling in expected values of the latents $z$ and optimizing $\theta$ on the completed data.

EM underlies:

  • Gaussian mixture model (GMM) fitting.
  • Hidden Markov model (HMM) parameter estimation (Baum-Welch is EM).
  • Probabilistic PCA, factor analysis, ICA.
  • LDA topic models.
  • Many missing-data imputation methods.

The two steps

For a model with observed data $x$, latent variables $z$, and parameters $\theta$:

E-step

Given current parameters $\theta^{(t)}$, compute the posterior over latents:

$$q^{(t)}(z) = p(z \mid x, \theta^{(t)}).$$

For models with conjugate or finite latent structure, this is closed-form (GMM: posterior responsibilities; HMM: forward-backward).

M-step

Update $\theta$ to maximize the expected complete-data log-likelihood:

$$\theta^{(t+1)} = \arg\max_\theta \, Q(\theta \mid \theta^{(t)}), \qquad Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{z \sim p(z \mid x,\, \theta^{(t)})}\!\left[\log p(x, z \mid \theta)\right].$$

Often this expectation reduces to weighted MLE over the data with the latents replaced by their expected values.
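
For a mixture model, for instance, the expectation is a responsibility-weighted sum (a sketch, writing $\gamma_{ik} = p(z_i = k \mid x_i, \theta^{(t)})$):

$$Q(\theta \mid \theta^{(t)}) = \sum_{i=1}^{N} \sum_{k=1}^{K} \gamma_{ik} \left[ \log \pi_k + \log \mathcal{N}(x_i \mid \mu_k, \Sigma_k) \right],$$

and maximizing this over $\pi_k$, $\mu_k$, $\Sigma_k$ gives exactly the weighted-MLE updates used for the GMM below.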

Why it works

EM maximizes a lower bound on $\log p(x \mid \theta)$. For any distribution $q(z)$ over the latents,

$$\log p(x \mid \theta) = \mathbb{E}_{q(z)}\!\left[\log p(x, z \mid \theta)\right] - \mathbb{E}_{q(z)}\!\left[\log q(z)\right] + \mathrm{KL}\!\left(q(z) \,\|\, p(z \mid x, \theta)\right).$$

The first two terms are the ELBO (evidence lower bound). The KL is non-negative. EM:

  • E-step sets $q(z) = p(z \mid x, \theta^{(t)})$ → KL $= 0$ → ELBO $= \log p(x \mid \theta^{(t)})$.
  • M-step maximizes the ELBO over $\theta$ holding $q$ fixed → never decreases $\log p(x \mid \theta)$.

So $\log p(x \mid \theta^{(t+1)}) \ge \log p(x \mid \theta^{(t)})$ at every iteration. The log-likelihood increases monotonically, and EM converges to a local optimum (a stationary point in general) rather than the global one.
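
The same bound can also be read off Jensen's inequality (a one-line sketch for discrete $z$):

$$\log p(x \mid \theta) = \log \sum_{z} q(z)\, \frac{p(x, z \mid \theta)}{q(z)} \;\ge\; \sum_{z} q(z) \log \frac{p(x, z \mid \theta)}{q(z)} = \mathrm{ELBO}(q, \theta),$$

with equality exactly when $q(z) = p(z \mid x, \theta)$, which is what the E-step chooses.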

Canonical example: GMM

A GMM has $K$ Gaussian components with mixture weights $\pi_k$, means $\mu_k$, and covariances $\Sigma_k$. A latent indicator $z_i \in \{1, \dots, K\}$ assigns each point $x_i$ to a component.

E-step: posterior responsibility of component $k$ for point $x_i$:

$$\gamma_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}.$$

M-step: weighted MLE updates:

$$N_k = \sum_{i=1}^{N} \gamma_{ik}, \qquad \pi_k = \frac{N_k}{N}, \qquad \mu_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik}\, x_i, \qquad \Sigma_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik}\, (x_i - \mu_k)(x_i - \mu_k)^{\top}.$$

Iterate until log-likelihood stabilizes.
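
A compact NumPy sketch of this loop (illustrative only; assumes `X` is an `(N, D)` array, uses SciPy's `multivariate_normal` for the component densities, and includes a small covariance regularizer, anticipating the singularity pitfall discussed below):

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gmm_em(X, K, n_iter=100, reg=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # init: K random data points as means, shared data covariance, uniform weights
    mu = X[rng.choice(N, size=K, replace=False)]
    Sigma = np.stack([np.cov(X.T) + reg * np.eye(D)] * K)
    pi = np.full(K, 1.0 / K)

    for _ in range(n_iter):
        # E-step: responsibilities gamma[i, k] = p(z_i = k | x_i, theta)
        dens = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                         for k in range(K)], axis=1)           # (N, K)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # (in practice, also track np.log(dens.sum(axis=1)).sum() and stop when it stabilizes)

        # M-step: weighted MLE updates
        Nk = gamma.sum(axis=0)                                 # effective counts per component
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + reg * np.eye(D)
    return pi, mu, Sigma
```

Usage would look like `pi, mu, Sigma = fit_gmm_em(X, K=3)`.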

Variants

  • Hard-assignment EM (k-means): replace soft responsibilities with hard 0/1 assignments. k-means is hard-assignment EM for a GMM with shared identity covariance and uniform weights $\pi_k = 1/K$ (see the sketch after this list).
  • Stochastic EM: sample $z$ from the posterior instead of computing exact expectations; useful for large $N$.
  • Variational EM: replace the exact posterior with a variational approximation; the modern incarnation is the VAE.
  • MAP-EM: include a prior $p(\theta)$; the M-step maximizes $Q(\theta \mid \theta^{(t)}) + \log p(\theta)$.
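
A minimal sketch of the hard-assignment variant, showing how it reduces to the k-means updates under the restricted GMM above:

```python
import numpy as np

def hard_em_step(X, mu):
    """One hard-EM iteration for a GMM with shared identity covariance and
    uniform weights: hard responsibilities make this exactly one k-means step."""
    # hard E-step: assign each point to its nearest mean (the argmax responsibility)
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (N, K) squared distances
    z = d2.argmin(axis=1)
    # hard M-step: each mean becomes the average of the points assigned to it
    return np.stack([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                     for k in range(len(mu))])
```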

Limitations

  • Local optima: EM converges to a local maximum; results depend on initialization. Multiple random restarts are standard.
  • Slow near the optimum: convergence is linear and becomes sluggish in flat regions of the likelihood.
  • Requires known model structure: number of components, latent dimensions, etc.

Common pitfalls

  • Initializing GMM means at the same point. All components collapse to the same Gaussian; initialize by k-means++ or random data points.
  • Singular covariances. A component that collapses onto a single point gets $\Sigma_k \to 0$; the log-likelihood diverges. Add regularization $\Sigma_k \leftarrow \Sigma_k + \epsilon I$.
  • Comparing log-likelihood across runs without checking convergence. EM monotonically increases it within a run, but different inits give different local optima.
  • Confusing EM with k-means. k-means is hard-assignment EM on a restricted GMM; full EM gives soft assignments and arbitrary covariances.
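
A sketch of the multiple-restart recipe, reusing the `fit_gmm_em` function from the GMM section and scoring each converged run by its final log-likelihood:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pi, mu, Sigma):
    """Marginal log-likelihood of X under a fitted GMM."""
    dens = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                     for k in range(len(pi))], axis=1)
    return np.log(dens.sum(axis=1)).sum()

def fit_with_restarts(X, K, n_restarts=10):
    # different seeds land in different local optima; keep the best converged run
    runs = [fit_gmm_em(X, K, seed=s) for s in range(n_restarts)]
    return max(runs, key=lambda run: gmm_log_likelihood(X, *run))
```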