One-line definition
A variational autoencoder (Kingma & Welling, 2013) is a generative model with a latent variable $z$, a learned encoder $q_\phi(z \mid x)$, and a decoder $p_\theta(x \mid z)$, trained to maximize the evidence lower bound (ELBO):

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$$
Why it matters
VAEs introduced amortized variational inference to deep learning: a neural network learns to predict the posterior of a latent given the input, enabling end-to-end training of latent variable models with backprop. This idea now powers:
- Diffusion models (often built on top of a VAE's latent space, as in Stable Diffusion).
- Disentanglement research (β-VAE, factor-VAE).
- Generative pretraining for tabular and molecular data.
- Probabilistic recsys and time series.
The VAE is also the canonical example of the reparameterization trick, a tool used everywhere in modern probabilistic deep learning.
The model
- Prior: $p(z) = \mathcal{N}(0, I)$.
- Encoder: $q_\phi(z \mid x)$. A neural net outputs the mean and diagonal covariance of a Gaussian over $z$.
- Decoder: $p_\theta(x \mid z)$. A neural net mapping $z$ back to a distribution over $x$ (Gaussian for continuous data, Bernoulli for binary, categorical for discrete).
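A minimal sketch of this setup, assuming PyTorch and a flattened binary input (the 784/400/20 layer sizes and the Bernoulli output are illustrative choices, not prescribed by the model):

```python
import torch.nn as nn

class Encoder(nn.Module):
    """Amortized Gaussian encoder q_phi(z|x) with diagonal covariance."""
    def __init__(self, x_dim=784, h_dim=400, z_dim=20):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)       # mean of q(z|x)
        self.log_var = nn.Linear(h_dim, z_dim)  # log of the diagonal variance of q(z|x)

    def forward(self, x):
        h = self.hidden(x)
        return self.mu(h), self.log_var(h)

class Decoder(nn.Module):
    """Bernoulli decoder p_theta(x|z): outputs per-pixel logits."""
    def __init__(self, x_dim=784, h_dim=400, z_dim=20):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, z):
        return self.net(z)
```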
The ELBO
The ELBO has two terms:
- Reconstruction: $\mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)]$ rewards the decoder for assigning high probability to $x$ given samples $z$ from the encoder.
- KL: $\mathrm{KL}(q_\phi(z \mid x) \,\|\, p(z))$ penalizes the encoder for diverging from the prior. Keeps the latent space compact and continuous.
The ELBO lower-bounds the true objective, the log-likelihood: $\log p_\theta(x) \ge \mathrm{ELBO}$. The gap is $\mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big)$, the mismatch between the approximate and true posterior.
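A sketch of the negative ELBO as a training loss, assuming the Gaussian-encoder / Bernoulli-decoder setup sketched above (the function name and the single-sample Monte Carlo estimate are illustrative):

```python
import torch
import torch.nn.functional as F

def negative_elbo(x, x_logits, mu, log_var):
    # Reconstruction term: E_q[log p(x|z)] estimated with a single z sample,
    # here the Bernoulli log-likelihood (binary cross-entropy on decoder logits).
    recon = F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")
    # KL(q(z|x) || N(0, I)) in closed form for a diagonal-Gaussian encoder.
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    # Minimizing this quantity is the same as maximizing the ELBO.
    return (recon + kl) / x.size(0)
```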
The reparameterization trick
To backprop through sampling, write $z = \mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. The randomness is now external; the gradient flows through $\mu$ and $\sigma$ deterministically. Without this, the gradient of an expectation over a parameter-dependent distribution would require REINFORCE (high variance).
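In code (PyTorch assumed), the trick is one line:

```python
import torch

def reparameterize(mu, log_var):
    # z = mu + sigma * epsilon with epsilon ~ N(0, I): the sample stays
    # differentiable with respect to mu and sigma.
    std = torch.exp(0.5 * log_var)
    eps = torch.randn_like(std)
    return mu + std * eps
    # Equivalent: torch.distributions.Normal(mu, std).rsample()
```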
What VAEs are good and bad at
Good:
- Smooth, continuous latent space useful for interpolation and editing.
- Stable training (unlike GANs).
- Good likelihood estimation (after IWAE-style correction).
- Excellent as compressors / latent encoders for downstream models (Stable Diffusion’s first stage).
Bad:
- Image samples are blurry compared to GANs and diffusion: the Gaussian decoder's per-pixel MSE objective averages over plausible outputs, washing out high-frequency detail.
- The KL penalty causes posterior collapse in some configurations (the decoder ignores $z$ and the approximate posterior collapses to the prior, so the latent carries almost no information).
- Lower-quality samples than diffusion at the same parameter count.
Common pitfalls
- Posterior collapse. Happens when the decoder is too powerful relative to the encoder, and the latent becomes useless. Mitigations: KL annealing, free bits, reducing decoder capacity, or β-VAE with β < 1 early in training (see the sketch after this list).
- Forgetting the reparameterization trick. Sampling inside the network and trying to backprop through `z = sample(N(mu, sigma))` doesn't work; use `z = mu + sigma * epsilon`.
- Treating ELBO as the model's likelihood. ELBO is a lower bound; for likelihood comparison use IWAE estimates.
- Using VAEs as competitive standalone image generators. They aren’t anymore; use them as latent compressors with diffusion / AR on top.
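One way the collapse mitigations can be combined, as a sketch (the 10k-step warmup schedule and the 0.25-nat free-bits floor are illustrative choices, not from the original):

```python
import torch

def annealed_kl(mu, log_var, step, warmup_steps=10_000, free_bits=0.25):
    # Per-dimension KL(q(z|x) || N(0, I)), averaged over the batch.
    kl_per_dim = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp()).mean(dim=0)
    # Free bits: dimensions below the floor are not pushed further toward the prior,
    # so the decoder keeps an incentive to use z.
    kl = torch.clamp(kl_per_dim, min=free_bits).sum()
    # KL annealing: ramp the KL weight from 0 to 1 over the warmup period.
    beta = min(1.0, step / warmup_steps)
    return beta * kl
```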
Variants
- β-VAE: scale the KL term by a factor β. Higher β encourages disentanglement; lower β improves reconstruction.
- VQ-VAE: discrete (categorical) latent via vector quantization. Used in language-image models, audio.
- IWAE (importance-weighted autoencoder): a tighter bound via $K$-sample importance weighting (see the sketch after this list).
- NVAE, Hierarchical VAEs: deep hierarchical latents for higher-fidelity generation.
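A sketch of the $K$-sample IWAE bound, reusing the encoder/decoder sketches above and computing the weight average stably in log space (PyTorch assumed; `K=50` is an arbitrary choice):

```python
import math
import torch
import torch.nn.functional as F

def iwae_bound(x, encoder, decoder, K=50):
    # Estimates log (1/K) * sum_k p(x, z_k) / q(z_k | x) with z_k ~ q(z | x).
    mu, log_var = encoder(x)
    std = torch.exp(0.5 * log_var)
    q = torch.distributions.Normal(mu, std)
    prior = torch.distributions.Normal(torch.zeros_like(mu), torch.ones_like(std))
    z = q.rsample((K,))                      # [K, batch, z_dim]
    x_logits = decoder(z)                    # decoder applied to each of the K samples
    log_px_z = -F.binary_cross_entropy_with_logits(
        x_logits, x.expand(K, *x.shape), reduction="none").sum(-1)
    log_w = log_px_z + prior.log_prob(z).sum(-1) - q.log_prob(z).sum(-1)
    # Average the importance weights in log space, then average over the batch.
    return (torch.logsumexp(log_w, dim=0) - math.log(K)).mean()
```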
Related
- Reparameterization trick. The gradient enabler.
- Autoregressive vs. diffusion. Alternative generative paradigms.