One-line definition
Expectation-Maximization (Dempster, Laird & Rubin, 1977) is an iterative algorithm for finding MLE / MAP parameter estimates in latent-variable models. Each iteration alternates:
- E-step: compute the posterior over the latents given the current parameters.
- M-step: update the parameters $\theta$ to maximize the expected complete-data log-likelihood under that posterior.
EM monotonically increases the log-likelihood until convergence to a local optimum.
Why it matters
When a model has latent variables $z$ that are unobserved, direct MLE involves a marginalization over $z$ that is usually intractable to optimize. EM avoids this by alternately filling in expected values of $z$ and optimizing $\theta$ on the completed data.
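To see the difficulty concretely, the objective puts a sum over latents inside the log (an integral in the continuous case):

```latex
% Marginal log-likelihood: the sum over latents sits inside the log.
\log p(x \mid \theta) = \log \sum_{z} p(x, z \mid \theta)
% The log of a sum does not decompose term by term, so the closed-form
% updates available for the complete-data likelihood \log p(x, z \mid \theta)
% are lost once z is marginalized out.
```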
EM underlies:
- Gaussian mixture model (GMM) fitting.
- Hidden Markov model (HMM) parameter estimation (Baum-Welch is EM).
- Probabilistic PCA, factor analysis, ICA.
- LDA topic models.
- Many missing-data imputation methods.
The two steps
For a model with observed data $x$, latents $z$, and parameters $\theta$:
E-step
Given current parameters $\theta^{(t)}$, compute the posterior over latents:

$$q^{(t)}(z) = p\big(z \mid x, \theta^{(t)}\big)$$
For models with conjugate or finite latent structure, this is closed-form (GMM: posterior responsibilities; HMM: forward-backward).
M-step
Update $\theta$ to maximize the expected complete-data log-likelihood:

$$\theta^{(t+1)} = \arg\max_{\theta} \; Q\big(\theta, \theta^{(t)}\big), \qquad Q\big(\theta, \theta^{(t)}\big) = \mathbb{E}_{z \sim q^{(t)}}\big[\log p(x, z \mid \theta)\big]$$
Often this expectation reduces to weighted MLE over the data with the latents replaced by their expected values.
Why it works
EM maximizes a lower bound on $\log p(x \mid \theta)$. For any distribution $q(z)$:

$$\log p(x \mid \theta) = \mathbb{E}_q\big[\log p(x, z \mid \theta)\big] - \mathbb{E}_q\big[\log q(z)\big] + \mathrm{KL}\big(q(z) \,\|\, p(z \mid x, \theta)\big)$$
The first two terms are the ELBO (evidence lower bound). The KL is non-negative. EM:
- E-step sets $q(z) = p(z \mid x, \theta^{(t)})$ → KL = 0 → ELBO $= \log p(x \mid \theta^{(t)})$.
- M-step maximizes the ELBO over $\theta$ holding $q$ fixed → never decreases $\log p(x \mid \theta)$.
So $\log p(x \mid \theta^{(t+1)}) \ge \log p(x \mid \theta^{(t)})$. Convergence to a stationary point, in practice a local optimum, is guaranteed.
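Chaining the two steps gives the monotonicity guarantee in one line, with $\mathrm{ELBO}(q, \theta)$ denoting the first two terms of the decomposition above:

```latex
% M-step improves the bound; the E-step had made the bound tight at \theta^{(t)}.
\log p\big(x \mid \theta^{(t+1)}\big)
  \;\ge\; \mathrm{ELBO}\big(q^{(t)}, \theta^{(t+1)}\big)
  \;\ge\; \mathrm{ELBO}\big(q^{(t)}, \theta^{(t)}\big)
  \;=\; \log p\big(x \mid \theta^{(t)}\big)
```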
Canonical example: GMM
A GMM has $K$ Gaussian components with mixture weights $\pi_k$, means $\mu_k$, and covariances $\Sigma_k$. A latent indicator $z_i \in \{1, \dots, K\}$ assigns each point $x_i$ to a component.
E-step: posterior responsibility of component $k$ for point $x_i$:

$$\gamma_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$$

M-step: weighted MLE updates, with effective counts $N_k = \sum_i \gamma_{ik}$:

$$\pi_k = \frac{N_k}{N}, \qquad \mu_k = \frac{1}{N_k} \sum_i \gamma_{ik}\, x_i, \qquad \Sigma_k = \frac{1}{N_k} \sum_i \gamma_{ik}\, (x_i - \mu_k)(x_i - \mu_k)^{\top}$$
Iterate until log-likelihood stabilizes.
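A minimal NumPy sketch of this loop (the function name `gmm_em`, the initialization scheme, and the `reg` jitter are illustrative choices, not part of the algorithm's definition):

```python
import numpy as np

def gmm_em(X, K, n_iter=100, tol=1e-6, reg=1e-6, seed=0):
    """Fit a K-component GMM by EM. Minimal sketch, not production code."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Init: means at distinct random data points, equal weights,
    # shared data covariance (see the pitfalls section on initialization).
    mu = X[rng.choice(N, size=K, replace=False)]
    Sigma = np.stack([np.cov(X.T) + reg * np.eye(D)] * K)
    pi = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: log responsibilities, computed via Cholesky + log-sum-exp
        # so that small probabilities do not underflow.
        log_r = np.empty((N, K))
        for k in range(K):
            diff = X - mu[k]
            L = np.linalg.cholesky(Sigma[k])
            maha = np.sum(np.linalg.solve(L, diff.T) ** 2, axis=0)
            log_det = 2.0 * np.sum(np.log(np.diag(L)))
            log_r[:, k] = np.log(pi[k]) - 0.5 * (D * np.log(2 * np.pi) + log_det + maha)
        point_ll = np.logaddexp.reduce(log_r, axis=1)   # per-point log-likelihood
        r = np.exp(log_r - point_ll[:, None])           # responsibilities, rows sum to 1
        # M-step: weighted MLE updates.
        Nk = r.sum(axis=0)                              # effective counts N_k
        pi = Nk / N
        mu = (r.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            # Weighted scatter plus eps*I jitter against singular covariances.
            Sigma[k] = (r[:, k] * diff.T) @ diff / Nk[k] + reg * np.eye(D)
        total_ll = point_ll.sum()
        if total_ll - prev_ll < tol:                    # log-likelihood stabilized
            break
        prev_ll = total_ll
    return pi, mu, Sigma, total_ll
```

The log-sum-exp in the E-step and the $\epsilon I$ jitter on each covariance guard against the underflow and singular-covariance pitfalls listed below.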
Variants
- Hard-assignment EM (k-means): replace soft responsibilities with hard 0/1 assignments. k-means is EM for a GMM with equal weights $\pi_k = 1/K$ and shared covariance $\sigma^2 I$ in the limit $\sigma^2 \to 0$ (a minimal sketch follows this list).
- Stochastic EM: sample $z$ from the posterior instead of computing expected values; useful for large $N$.
- Variational EM: replace exact posterior with a variational approximation; modern incarnation is the VAE.
- MAP-EM: include a prior $p(\theta)$; the M-step maximizes $Q(\theta, \theta^{(t)}) + \log p(\theta)$.
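To make the first variant concrete, a hard-assignment sketch under the same NumPy conventions (the helper name `kmeans_hard_em` is hypothetical):

```python
import numpy as np

def kmeans_hard_em(X, K, n_iter=100, seed=0):
    """k-means viewed as hard-assignment EM. Illustrative sketch."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # "E-step": hard 0/1 responsibility = nearest centroid.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
        z = d2.argmin(axis=1)
        # "M-step": each mean is the unweighted average of its assigned points;
        # keep the old centroid if a cluster ends up empty.
        new_mu = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, z
```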
Limitations
- Local optima: EM converges to a local maximum; results depend on initialization. Multiple random restarts are standard.
- Slow near the optimum: convergence is linear and becomes sluggish in flat regions of the likelihood (e.g. strongly overlapping components).
- Requires known model structure: number of components, latent dimensions, etc.
Common pitfalls
- Initializing GMM means at the same point. All components receive identical responsibilities and collapse to the same Gaussian; initialize with k-means++ or distinct random data points.
- Singular covariances. A component that collapses onto a single point gets $\Sigma_k \to 0$ and the log-likelihood diverges. Add regularization $\Sigma_k \leftarrow \Sigma_k + \epsilon I$.
- Comparing log-likelihood across runs without checking convergence. EM monotonically increases the log-likelihood within a run, but different inits reach different local optima (see the restart sketch after this list).
- Confusing EM with k-means. k-means is hard-assignment EM on a restricted GMM; full EM gives soft assignments and arbitrary covariances.
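A sketch of the standard remedy for the initialization and comparison pitfalls, reusing the hypothetical `gmm_em` above and assuming `X` is the `(N, D)` data array:

```python
# Run EM from several random initializations and keep the best run.
# Comparing final log-likelihoods is only fair if every run has converged.
fits = [gmm_em(X, K=3, n_iter=500, seed=s) for s in range(10)]
pi, mu, Sigma, log_lik = max(fits, key=lambda fit: fit[3])  # fit[3] = final log-likelihood
```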
Related
- Gaussian mixture models. The canonical EM application.
- Hidden Markov models. Sequential model trained by EM (Baum-Welch).