One-line definition
For probability distributions $P$ and $Q$ over the same space:

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)} = \mathbb{E}_{x \sim P}\!\left[\log \frac{P(x)}{Q(x)}\right]$$
It’s the expected log-ratio of $P$ to $Q$ under $P$: it measures how much information is lost when $Q$ is used to encode samples drawn from $P$.
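A minimal sketch of this sum in NumPy; the distributions `p` and `q` below are made-up examples, not from the text:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) in nats for discrete distributions on the same support."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p == 0 contribute 0 by convention
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.5, 0.4, 0.1])
q = np.array([0.3, 0.3, 0.4])
print(kl_divergence(p, q))  # expected extra nats per sample when coding P with Q's code
```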
Why it matters
KL divergence is the fundamental object of statistical learning. It connects:
- Maximum likelihood (minimizing $D_{\mathrm{KL}}(\hat{P}_{\text{data}} \,\|\, P_\theta)$).
- Variational inference (minimizing $D_{\mathrm{KL}}(q \,\|\, p)$).
- Cross-entropy loss = entropy of data + KL.
- Information bottleneck and mutual information.
- Policy gradient methods in RL (TRPO, PPO use KL constraints).
- Knowledge distillation (student matches teacher distribution via KL).
Properties
- Non-negative: $D_{\mathrm{KL}}(P \,\|\, Q) \ge 0$, with equality iff $P = Q$ (Gibbs’ inequality).
- Asymmetric: $D_{\mathrm{KL}}(P \,\|\, Q) \ne D_{\mathrm{KL}}(Q \,\|\, P)$ in general. Choose the direction based on whether you are fitting $Q$ to $P$ or vice versa.
- Not a metric: no triangle inequality, not symmetric.
- Infinite if $Q(x) = 0$ where $P(x) > 0$: $Q$ must cover the support of $P$.
- Information-theoretic: equals expected extra bits (or nats) per sample needed to encode using a code optimized for .
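A small numerical check of the asymmetry and the support condition using SciPy; the specific distributions are illustrative:

```python
import numpy as np
from scipy.stats import entropy  # entropy(p, q) returns D_KL(p || q) in nats

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])
print(entropy(p, q), entropy(q, p))  # two different numbers: KL is not symmetric

r = np.array([0.5, 0.5, 0.0])  # r puts no mass on the third outcome
print(entropy(p, r))           # inf: p > 0 where r == 0
```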
Forward vs. reverse KL
The asymmetry matters in practice. For approximating $P$ with $Q$:
- Forward KL, $D_{\mathrm{KL}}(P \,\|\, Q)$: penalizes $Q$ for missing modes of $P$ (“mean-seeking”: $Q$ tries to cover all of $P$). Used in standard MLE.
- Reverse KL, $D_{\mathrm{KL}}(Q \,\|\, P)$: penalizes $Q$ for placing mass where $P$ has none (“mode-seeking”: $Q$ collapses to one mode). Used in variational inference.
For a multimodal $P$, forward KL gives a broad average; reverse KL picks one mode. Visualize this on a two-Gaussian mixture: forward KL yields one averaged ellipse spanning both components; reverse KL fits one of the two.
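A rough numerical sketch of this behavior, fitting a single Gaussian to a two-component mixture on a 1-D grid; the mixture parameters, grid, and optimizer choice are assumptions for illustration, and the reverse fit may land on either mode:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Bimodal target: two unit-variance Gaussians at -3 and +3 (illustrative choice).
x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]
p = 0.5 * norm.pdf(x, -3, 1) + 0.5 * norm.pdf(x, 3, 1)

def kl(a, b):
    """Grid approximation of KL(a || b) for densities tabulated on x."""
    mask = a > 1e-12  # terms where a is ~0 contribute ~0
    return np.sum(a[mask] * np.log(a[mask] / b[mask])) * dx

def fit(direction):
    """Fit a single Gaussian q(mu, sigma) to p by minimizing the chosen KL."""
    def loss(params):
        mu, log_sigma = params
        q = norm.pdf(x, mu, np.exp(log_sigma)) + 1e-300  # avoid log(0)
        return kl(p, q) if direction == "forward" else kl(q, p)
    return minimize(loss, x0=[0.5, 0.0], method="Nelder-Mead").x

for direction in ("forward", "reverse"):
    mu, log_sigma = fit(direction)
    print(direction, round(float(mu), 2), round(float(np.exp(log_sigma)), 2))
# forward: mu near 0 with a wide sigma (covers both modes)
# reverse: mu near one of +/-3 with sigma near 1 (locks onto a single mode)
```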
Connection to cross-entropy
For the empirical distribution $\hat{P}$ over a finite dataset and a model $Q_\theta$:

$$H(\hat{P}, Q_\theta) = H(\hat{P}) + D_{\mathrm{KL}}(\hat{P} \,\|\, Q_\theta)$$
Cross-entropy = entropy + KL. Since the entropy of the data doesn’t depend on $\theta$, minimizing cross-entropy = minimizing KL = MLE. This is why “the loss is cross-entropy” and “we’re minimizing KL to the data” are the same statement.
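A quick numerical check of the decomposition with toy class probabilities (the numbers are arbitrary):

```python
import numpy as np

labels = np.array([0.6, 0.3, 0.1])    # "data" distribution over 3 classes
model  = np.array([0.5, 0.25, 0.25])  # model's predicted probabilities

entropy_data  = -np.sum(labels * np.log(labels))
cross_entropy = -np.sum(labels * np.log(model))
kl            =  np.sum(labels * np.log(labels / model))

print(np.isclose(cross_entropy, entropy_data + kl))  # True
```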
Common usage in ML
| Use case | Direction |
|---|---|
| Classification cross-entropy loss | Forward |
| Variational inference (ELBO) | Reverse |
| RLHF / PPO penalty | Reverse; keeps the new policy close to the reference policy |
| Knowledge distillation | Forward with temperature |
| t-SNE | Forward, $D_{\mathrm{KL}}(P \,\|\, Q)$ on pairwise similarities ($P$ fixed by the data, $Q$ by the embedding) |
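As one concrete instance of the table, a hedged sketch of the distillation row in PyTorch; the temperature, the random logits, and the helper name `distillation_loss` are placeholders, not from the text:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # Forward KL(teacher || student) on temperature-softened distributions;
    # the T**2 factor keeps gradient magnitudes comparable across temperatures.
    log_q = F.log_softmax(student_logits / T, dim=-1)  # student log-probs
    p     = F.softmax(teacher_logits / T, dim=-1)      # teacher probs
    return F.kl_div(log_q, p, reduction="batchmean") * (T ** 2)

student = torch.randn(4, 10)  # batch of 4, 10 classes
teacher = torch.randn(4, 10)
print(distillation_loss(student, teacher))
```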
Common pitfalls
- Computing KL between distributions with different supports. If $P(x) > 0$ and $Q(x) = 0$ for some $x$, the KL is infinite (see the sketch after this list).
- Confusing JS divergence (symmetric) with KL. GANs originally used JS; modern variants (Wasserstein) avoid both.
- Forgetting the asymmetry direction. Forward and reverse KL produce qualitatively different optimizers.
- Using KL on samples without density estimates. KL is between distributions, not between sample sets; sample-based estimators are noisy and biased.
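As referenced above, a sketch of the support-mismatch pitfall when estimating KL from counts, plus one common workaround (additive smoothing); the counts and the smoothing constant are illustrative:

```python
import numpy as np
from scipy.special import rel_entr  # rel_entr(p, q) = p * log(p / q) elementwise

counts_p = np.array([10, 5, 0, 1], dtype=float)
counts_q = np.array([4, 0, 8, 4], dtype=float)

# Naive normalization: q has zero mass where p > 0, so KL(p || q) is infinite.
p = counts_p / counts_p.sum()
q = counts_q / counts_q.sum()
print(rel_entr(p, q).sum())  # inf

# One common workaround: additive (Laplace) smoothing before normalizing.
eps = 1.0
p_s = (counts_p + eps) / (counts_p + eps).sum()
q_s = (counts_q + eps) / (counts_q + eps).sum()
print(rel_entr(p_s, q_s).sum())  # finite, but sensitive to the choice of eps
```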