One-line definition
For probability distributions $P$ and $Q$ over the same space:

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)} = \mathbb{E}_{x \sim P}\!\left[\log \frac{P(x)}{Q(x)}\right]$$
It’s the expected log-ratio of $P$ to $Q$ under $P$: it measures how much information is lost when $Q$ is used to encode samples drawn from $P$.
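A minimal sketch of this sum in NumPy; the distributions `p` and `q` below are made-up examples, not from the text:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) in nats for discrete distributions on the same support."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p == 0 contribute 0 by convention
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.5, 0.4, 0.1])
q = np.array([0.3, 0.3, 0.4])
print(kl_divergence(p, q))  # expected extra nats per sample when coding P with Q's code
```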
Why it matters
KL divergence is the fundamental object of statistical learning. It connects:
- Maximum likelihood (minimizing $D_{\mathrm{KL}}(\hat{P}_{\text{data}} \,\|\, P_\theta)$).
- Variational inference (minimizing $D_{\mathrm{KL}}(q \,\|\, p)$).
- Cross-entropy loss = entropy of data + KL.
- Information bottleneck and mutual information.
- Policy gradient methods in RL (TRPO, PPO use KL constraints).
- Knowledge distillation (student matches teacher distribution via KL).
Properties
- Non-negative: $D_{\mathrm{KL}}(P \,\|\, Q) \ge 0$, with equality iff $P = Q$ (Gibbs’ inequality).
- Asymmetric: $D_{\mathrm{KL}}(P \,\|\, Q) \ne D_{\mathrm{KL}}(Q \,\|\, P)$ in general. Choose the direction based on whether you are fitting $Q$ to $P$ or vice versa.
- Not a metric: no triangle inequality, not symmetric.
- Infinite if $Q(x) = 0$ where $P(x) > 0$: $Q$ must cover the support of $P$.
- Information-theoretic: equals expected extra bits (or nats) per sample needed to encode using a code optimized for .
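A small numerical check of the asymmetry and the support condition using SciPy; the specific distributions are illustrative:

```python
import numpy as np
from scipy.stats import entropy  # entropy(p, q) returns D_KL(p || q) in nats

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])
print(entropy(p, q), entropy(q, p))  # two different numbers: KL is not symmetric

r = np.array([0.5, 0.5, 0.0])  # r puts no mass on the third outcome
print(entropy(p, r))           # inf: p > 0 where r == 0
```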
Forward vs. reverse KL
The asymmetry matters in practice. For approximating $P$ with $Q$:
- Forward KL, $D_{\mathrm{KL}}(P \,\|\, Q)$: penalizes $Q$ for missing modes of $P$ (“mean-seeking”: $Q$ tries to cover all of $P$). Used in standard MLE.
- Reverse KL, $D_{\mathrm{KL}}(Q \,\|\, P)$: penalizes $Q$ for placing mass where $P$ has none (“mode-seeking”: $Q$ collapses to one mode). Used in variational inference.
For a multimodal $P$, forward KL gives a broad average; reverse KL picks one mode. Visualize this on a two-Gaussian mixture: forward KL yields one averaged ellipse spanning both components; reverse KL fits one of the two.
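A rough numerical sketch of this behavior, fitting a single Gaussian to a two-component mixture on a 1-D grid; the mixture parameters, grid, and optimizer choice are assumptions for illustration, and the reverse fit may land on either mode:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Bimodal target: two unit-variance Gaussians at -3 and +3 (illustrative choice).
x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]
p = 0.5 * norm.pdf(x, -3, 1) + 0.5 * norm.pdf(x, 3, 1)

def kl(a, b):
    """Grid approximation of KL(a || b) for densities tabulated on x."""
    mask = a > 1e-12  # terms where a is ~0 contribute ~0
    return np.sum(a[mask] * np.log(a[mask] / b[mask])) * dx

def fit(direction):
    """Fit a single Gaussian q(mu, sigma) to p by minimizing the chosen KL."""
    def loss(params):
        mu, log_sigma = params
        q = norm.pdf(x, mu, np.exp(log_sigma)) + 1e-300  # avoid log(0)
        return kl(p, q) if direction == "forward" else kl(q, p)
    return minimize(loss, x0=[0.5, 0.0], method="Nelder-Mead").x

for direction in ("forward", "reverse"):
    mu, log_sigma = fit(direction)
    print(direction, round(float(mu), 2), round(float(np.exp(log_sigma)), 2))
# forward: mu near 0 with a wide sigma (covers both modes)
# reverse: mu near one of +/-3 with sigma near 1 (locks onto a single mode)
```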
Connection to cross-entropy
For the empirical distribution $\hat{P}$ over a finite dataset and a model $Q_\theta$:

$$H(\hat{P}, Q_\theta) = H(\hat{P}) + D_{\mathrm{KL}}(\hat{P} \,\|\, Q_\theta)$$
Cross-entropy = entropy + KL. Since the entropy of the data doesn’t depend on $\theta$, minimizing cross-entropy = minimizing KL = MLE. This is why “the loss is cross-entropy” and “we’re minimizing KL to the data” are the same statement.
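A quick numerical check of the decomposition with toy class probabilities (the numbers are arbitrary):

```python
import numpy as np

labels = np.array([0.6, 0.3, 0.1])    # "data" distribution over 3 classes
model  = np.array([0.5, 0.25, 0.25])  # model's predicted probabilities

entropy_data  = -np.sum(labels * np.log(labels))
cross_entropy = -np.sum(labels * np.log(model))
kl            =  np.sum(labels * np.log(labels / model))

print(np.isclose(cross_entropy, entropy_data + kl))  # True
```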
Common usage in ML
| Use case | Direction |
|---|---|
| Classification cross-entropy loss | Forward |
| Variational inference (ELBO) | Reverse |
| RLHF / PPO penalty | Reverse; keeps the new policy close to the reference policy |
| Knowledge distillation | Forward with temperature |
| t-SNE | Forward, $D_{\mathrm{KL}}(P \,\|\, Q)$ on pairwise similarities ($P$ fixed by the data, $Q$ by the embedding) |
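As one concrete instance of the table, a hedged sketch of the distillation row in PyTorch; the temperature, the random logits, and the helper name `distillation_loss` are placeholders, not from the text:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # Forward KL(teacher || student) on temperature-softened distributions;
    # the T**2 factor keeps gradient magnitudes comparable across temperatures.
    log_q = F.log_softmax(student_logits / T, dim=-1)  # student log-probs
    p     = F.softmax(teacher_logits / T, dim=-1)      # teacher probs
    return F.kl_div(log_q, p, reduction="batchmean") * (T ** 2)

student = torch.randn(4, 10)  # batch of 4, 10 classes
teacher = torch.randn(4, 10)
print(distillation_loss(student, teacher))
```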
Common pitfalls
- Computing KL between distributions with different supports. If $P(x) > 0$ and $Q(x) = 0$ for some $x$, the KL is infinite (see the sketch after this list).
- Confusing JS divergence (symmetric) with KL. GANs originally used JS; modern variants (Wasserstein) avoid both.
- Forgetting the asymmetry direction. Forward and reverse KL produce qualitatively different optimizers.
- Using KL on samples without density estimates. KL is between distributions, not between sample sets; sample-based estimators are noisy and biased.
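As referenced above, a sketch of the support-mismatch pitfall when estimating KL from counts, plus one common workaround (additive smoothing); the counts and the smoothing constant are illustrative:

```python
import numpy as np
from scipy.special import rel_entr  # rel_entr(p, q) = p * log(p / q) elementwise

counts_p = np.array([10, 5, 0, 1], dtype=float)
counts_q = np.array([4, 0, 8, 4], dtype=float)

# Naive normalization: q has zero mass where p > 0, so KL(p || q) is infinite.
p = counts_p / counts_p.sum()
q = counts_q / counts_q.sum()
print(rel_entr(p, q).sum())  # inf

# One common workaround: additive (Laplace) smoothing before normalizing.
eps = 1.0
p_s = (counts_p + eps) / (counts_p + eps).sum()
q_s = (counts_q + eps) / (counts_q + eps).sum()
print(rel_entr(p_s, q_s).sum())  # finite, but sensitive to the choice of eps
```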