KL divergence

Asymmetric distance between probability distributions. Cross-entropy minus entropy. The mathematical glue holding most of probabilistic ML together.

One-line definition

For probability distributions $P$ and $Q$ over the same space:

$$D_{\mathrm{KL}}(P \,\|\, Q) = \mathbb{E}_{x \sim P}\!\left[\log \frac{P(x)}{Q(x)}\right] = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$

It's the expected log-ratio of $P$ to $Q$ under $P$: a measure of how much information is lost when $Q$ is used to encode samples from $P$.
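
A minimal sketch of the discrete case in NumPy (the function name is my own):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) in nats for discrete distributions given as probability vectors.

    Terms with p == 0 contribute nothing (0 log 0 := 0); any point with
    p > 0 but q == 0 makes the divergence infinite.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    support = p > 0
    if np.any(q[support] == 0):
        return np.inf
    return float(np.sum(p[support] * np.log(p[support] / q[support])))

print(kl_divergence([0.5, 0.4, 0.1], [0.3, 0.3, 0.4]))  # ≈ 0.232
```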

Why it matters

KL divergence is the fundamental object of statistical learning. It connects:

  • Maximum likelihood (minimizing $D_{\mathrm{KL}}(\hat{p}_{\text{data}} \,\|\, p_\theta)$).
  • Variational inference (minimizing $D_{\mathrm{KL}}(q \,\|\, p)$).
  • Cross-entropy loss = entropy of data + KL.
  • Information bottleneck and mutual information.
  • Policy gradient methods in RL (TRPO, PPO use KL constraints).
  • Knowledge distillation (student matches teacher distribution via KL).

Properties

  • Non-negative: $D_{\mathrm{KL}}(P \,\|\, Q) \ge 0$, with equality iff $P = Q$ (Gibbs’ inequality).
  • Asymmetric: $D_{\mathrm{KL}}(P \,\|\, Q) \ne D_{\mathrm{KL}}(Q \,\|\, P)$ in general. Choose the direction based on whether you are fitting $Q$ to $P$ or vice versa.
  • Not a metric: no triangle inequality, not symmetric.
  • Infinite if $Q(x) = 0$ anywhere $P(x) > 0$: $Q$ must cover the support of $P$ (checked numerically in the sketch after this list).
  • Information-theoretic: equals the expected extra bits (or nats) per sample needed to encode samples from $P$ using a code optimized for $Q$.
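
Continuing the NumPy sketch from above, the non-negativity, asymmetry, and support properties are easy to verify:

```python
p = [0.5, 0.4, 0.1]
q = [0.3, 0.3, 0.4]

print(kl_divergence(p, p))  # 0.0: equality iff P == Q
print(kl_divergence(p, q))  # ≈ 0.232
print(kl_divergence(q, p))  # ≈ 0.315: a different number, KL is asymmetric
print(kl_divergence([0.5, 0.5, 0.0],
                    [0.5, 0.0, 0.5]))  # inf: Q misses part of P's support
```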

Forward vs. reverse KL

The asymmetry matters in practice. For approximating a target $P$ with a model $Q$:

  • Forward KL, $D_{\mathrm{KL}}(P \,\|\, Q)$: penalizes $Q$ for missing modes of $P$ (“mean-seeking”: $Q$ tries to cover all of $P$). Used in standard MLE.
  • Reverse KL, $D_{\mathrm{KL}}(Q \,\|\, P)$: penalizes $Q$ for placing mass where $P$ has none (“mode-seeking”: $Q$ collapses onto one mode). Used in variational inference.

For a multimodal $P$, forward KL gives a broad average; reverse KL picks one mode. Visualize it on a two-Gaussian mixture: forward KL yields one wide ellipse averaged over both components; reverse KL locks onto one of the two.
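
The same effect shows up in one dimension. A toy sketch (brute-force search over a parameter grid, purely illustrative) that fits a single Gaussian to a discretized two-Gaussian mixture under each direction:

```python
import numpy as np

x = np.linspace(-8, 8, 2001)
dx = x[1] - x[0]

def gauss(mu, sigma):
    d = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return d / (d.sum() * dx)  # normalize as a density on the grid

p = 0.5 * gauss(-3, 1) + 0.5 * gauss(3, 1)  # bimodal target

def kl(a, b):
    m = a > 0
    return np.sum(a[m] * np.log(a[m] / np.maximum(b[m], 1e-300))) * dx

candidates = [(mu, s) for mu in np.linspace(-4, 4, 81)
                      for s in np.linspace(0.3, 5, 48)]
forward = min(candidates, key=lambda t: kl(p, gauss(*t)))  # D_KL(P || Q)
reverse = min(candidates, key=lambda t: kl(gauss(*t), p))  # D_KL(Q || P)

print("forward:", forward)  # mu ≈ 0, large sigma: covers both modes
print("reverse:", reverse)  # mu ≈ ±3, sigma ≈ 1: locks onto one mode
```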

Connection to cross-entropy

For the empirical distribution $\hat{p}$ over a finite dataset and a model $p_\theta$:

$$H(\hat{p}, p_\theta) = H(\hat{p}) + D_{\mathrm{KL}}(\hat{p} \,\|\, p_\theta)$$

Cross-entropy = entropy + KL. Since the entropy of the data doesn’t depend on $\theta$, minimizing cross-entropy = minimizing KL = MLE. This is why “the loss is cross-entropy” and “we’re minimizing KL to the data” are the same statement.
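
A quick numerical check of the identity, reusing kl_divergence from the first sketch (the distributions here are made up):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    m = p > 0
    return -np.sum(p[m] * np.log(p[m]))

def cross_entropy(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = p > 0
    return -np.sum(p[m] * np.log(q[m]))

p_data = [0.7, 0.2, 0.1]   # empirical class frequencies
q_model = [0.6, 0.3, 0.1]  # model's predicted distribution

lhs = cross_entropy(p_data, q_model)
rhs = entropy(p_data) + kl_divergence(p_data, q_model)
print(np.isclose(lhs, rhs))  # True: H(P, Q) = H(P) + D_KL(P || Q)
```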

Common usage in ML

Use case                            Direction
Classification cross-entropy loss   Forward
Variational inference (ELBO)        Reverse
RLHF / PPO penalty                  Reverse (keeps the new policy close to the reference)
Knowledge distillation              Forward, with temperature
t-SNE                               Reverse, on pairwise similarities
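
One practical direction gotcha: in PyTorch, torch.nn.functional.kl_div takes log-probabilities as its first argument and averages over the *second* argument, so the target plays the role of $P$ (the tensors below are made up):

```python
import torch
import torch.nn.functional as F

p = torch.tensor([0.5, 0.4, 0.1])  # target / teacher distribution
q = torch.tensor([0.3, 0.3, 0.4])  # model distribution

# F.kl_div(input, target) expects input = log Q and computes
# sum target * (log target - input), i.e. D_KL(target || Q),
# which here is the *forward* KL D_KL(P || Q):
loss = F.kl_div(q.log(), p, reduction="sum")

manual = torch.sum(p * (p.log() - q.log()))
print(torch.isclose(loss, manual))  # True
```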

Common pitfalls

  • Computing KL between distributions with different supports. If $P(x) > 0$ and $Q(x) = 0$ for some $x$, the KL is infinite.
  • Confusing JS divergence (symmetric) with KL. GANs originally used JS; modern variants (Wasserstein) avoid both.
  • Forgetting the asymmetry direction. Forward and reverse KL produce qualitatively different optimizers.
  • Using KL on samples without density estimates. KL is defined between distributions, not between sample sets; naive plug-in estimators are noisy and biased (see the sketch below).
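
The last pitfall is easy to see against a case with a known answer: two 1-D Gaussians, whose KL has a closed form, versus a naive histogram plug-in estimate from samples (the bin count and smoothing constant are arbitrary choices here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Closed form: D_KL(N(m1, s1^2) || N(m2, s2^2))
#   = log(s2/s1) + (s1^2 + (m1 - m2)^2) / (2 s2^2) - 1/2
m1, s1, m2, s2 = 0.0, 1.0, 1.0, 1.5
exact = np.log(s2 / s1) + (s1**2 + (m1 - m2) ** 2) / (2 * s2**2) - 0.5

def plugin_estimate(n, bins=50):
    """Histogram plug-in KL estimate from n samples of each distribution."""
    xs = rng.normal(m1, s1, n)
    ys = rng.normal(m2, s2, n)
    lo, hi = min(xs.min(), ys.min()), max(xs.max(), ys.max())
    p, edges = np.histogram(xs, bins=bins, range=(lo, hi))
    q, _ = np.histogram(ys, bins=edges)
    p = p + 1e-10   # smoothing avoids log(0) and division by zero...
    q = q + 1e-10   # ...but introduces its own bias
    p, q = p / p.sum(), q / q.sum()
    return np.sum(p * np.log(p / q))

print("exact:", round(exact, 3))  # ≈ 0.350
for n in (100, 1000, 10000):
    print(n, [round(plugin_estimate(n), 3) for _ in range(3)])
    # estimates scatter around, and are systematically off, the exact value
```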