Knowledge distillation

Train a small student to match a large teacher's outputs. The student gets richer signal than from hard labels because the teacher's soft probabilities encode similarity structure.


One-line definition

Knowledge distillation trains a student model with a loss against a teacher’s soft predictions rather than hard labels alone. The student learns the teacher’s full output distribution, which carries information about how classes relate (Hinton et al., 2015).

Why it matters

Hard labels say “this is a 7.” Teacher logits say “94 percent 7, 4 percent 1, 1 percent 9, a fraction of a percent on everything else.” That extra structure tells the student that a 7 looks more like a 1 than like a 9. A small model trained against this signal usually beats the same model trained from scratch on hard labels at matched compute.
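
To make that concrete, here is a toy example with made-up logits (the numbers are illustrative, not from any real model):

```python
import torch
import torch.nn.functional as F

# Hypothetical teacher logits for one image of a "7" (class order: digits 0-9).
logits = torch.tensor([0.1, 3.0, 0.2, 0.1, 0.1, 0.1, 0.1, 8.0, 0.1, 2.0])

print(F.softmax(logits, dim=-1))        # near one-hot: ~0.99 on class 7
print(F.softmax(logits / 4.0, dim=-1))  # softened (T=4): classes 1 and 9 now
                                        # visibly outrank the other non-targets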

Distillation is the dominant technique for shrinking large models in production. DistilBERT, TinyBERT, and MobileBERT are distilled variants of BERT, and most production LLM families ship distilled sizes. Distillation is often combined with pruning and quantization.

The mechanism

Given teacher logits $z_t$, student logits $z_s$, hard label $y$, and temperature $T$:

$$\mathcal{L} = \alpha \,\mathrm{CE}\big(y,\ \sigma(z_s)\big) + (1-\alpha)\, T^2 \,\mathrm{KL}\big(\sigma(z_t/T) \,\|\, \sigma(z_s/T)\big)$$

where $\sigma$ denotes the softmax.

  • Temperature $T$ softens both distributions. Higher $T$ exposes more of the teacher’s “dark knowledge” about non-target classes. $T = 2$ to $5$ is typical.
  • The $T^2$ scaling is needed because softening shrinks the gradient magnitude by a factor of $1/T^2$.
  • $\alpha$ weights the hard-label loss. $\alpha = 0$ gives pure distillation; a small value such as $\alpha = 0.1$ is common.
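
A minimal PyTorch sketch of this loss, assuming classification logits of shape (batch, classes); the default $\alpha$ and $T$ here are illustrative, not prescribed:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.1):
    """alpha * CE(hard labels) + (1 - alpha) * T^2 * KL(soft targets)."""
    # Hard-label cross-entropy on the student's raw (unsoftened) logits.
    hard = F.cross_entropy(student_logits, labels)
    # KL divergence between temperature-softened teacher and student outputs.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    )
    # T^2 compensates for the 1/T^2 gradient shrinkage from softening.
    return alpha * hard + (1 - alpha) * (T * T) * soft
```

In a training loop this would be called as `loss = distillation_loss(student(x), teacher(x).detach(), y)`; detaching the teacher keeps gradients from flowing into it.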

Variants

| Variant | What it matches |
| --- | --- |
| Logit distillation (above) | Teacher output logits |
| Feature distillation (FitNets) | Intermediate hidden states |
| Attention distillation (TinyBERT) | Teacher attention maps |
| Sequence-level distillation (Kim & Rush, 2016) | Teacher’s most likely outputs (for autoregressive models) |
| Self-distillation | Teacher and student are the same architecture; sometimes the teacher is a previous training checkpoint |
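
As a sketch of the feature-distillation row, a FitNets-style hint loss matches an intermediate student layer to a teacher layer through a learned projection (the dimensions below are assumptions for illustration):

```python
import torch.nn as nn
import torch.nn.functional as F

class HintLoss(nn.Module):
    """FitNets-style hint: match an intermediate student layer to a teacher layer."""
    def __init__(self, student_dim=256, teacher_dim=768):  # dims are assumptions
        super().__init__()
        # Learned regressor lifts student features to the teacher's width.
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_hidden, teacher_hidden):
        # Teacher features are targets only; no gradient flows into the teacher.
        return F.mse_loss(self.proj(student_hidden), teacher_hidden.detach())
```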

For LLMs, sequence-level distillation against teacher samples (or rejection-sampled teacher outputs) is the dominant recipe. Logit distillation is impractical at vocabulary sizes of 100k+, where every soft target is a 100k-dimensional distribution that must be computed or stored per token.
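
A bare-bones sketch of that recipe with Hugging Face-style APIs; the checkpoint names are placeholders, and a real run would batch prompts and mask the prompt tokens out of the loss:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint names; substitute a real teacher/student pair.
tok = AutoTokenizer.from_pretrained("teacher-checkpoint")
teacher = AutoModelForCausalLM.from_pretrained("teacher-checkpoint").eval()
student = AutoModelForCausalLM.from_pretrained("student-checkpoint")

batch = tok(["Explain photosynthesis."], return_tensors="pt")

# 1. Decode the teacher's output for each prompt (greedy here; sampling or
#    rejection sampling are common too).
with torch.no_grad():
    teacher_ids = teacher.generate(**batch, max_new_tokens=128)

# 2. Train the student with plain cross-entropy on the teacher's tokens,
#    exactly as if they were ground-truth data.
out = student(input_ids=teacher_ids, labels=teacher_ids)
out.loss.backward()
```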

When it works and when it doesn’t

Works well when:

  • Teacher is significantly better than what the student could reach alone.
  • Student capacity is at least 10 to 20 percent of the teacher’s.
  • Training data overlaps the teacher’s training distribution.

Fails when:

  • Student is too small. Capacity gap is the dominant ceiling.
  • Teacher is already small. The “dark knowledge” margin is thin.
  • Distribution shift. Teacher predictions are unreliable on the student’s deployment data.

Common pitfalls

  • Forgetting the $T^2$ scaling. Without it, the KL term has tiny gradients and the hard-label term dominates; the sketch after this list shows the effect.
  • Distilling only logits when feature distillation would help. For very small students, intermediate matching is often required.
  • Skipping the temperature. $T = 1$ collapses the teacher’s distribution to nearly one-hot for confident predictions; you lose most of the signal.
  • Training student on teacher-correct examples only. The interesting signal is on examples where the teacher is uncertain. Use the full training set.
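
The sketch referenced in the first pitfall: it measures how the KL gradient shrinks as the temperature rises and how multiplying by $T^2$ restores its magnitude (random logits, purely illustrative):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
z_t = torch.randn(8, 10)                      # stand-in teacher logits
z_s = torch.randn(8, 10, requires_grad=True)  # stand-in student logits

for T in (1.0, 2.0, 4.0):
    kl = F.kl_div(F.log_softmax(z_s / T, dim=-1),
                  F.softmax(z_t / T, dim=-1),
                  reduction="batchmean")
    (grad,) = torch.autograd.grad(kl, z_s)
    # Raw gradient norm shrinks roughly as 1/T^2; T^2 scaling undoes that.
    print(f"T={T:.0f}  raw grad norm {grad.norm():.4f}  "
          f"after T^2 scaling {(T * T * grad.norm()):.4f}")
```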