One-line definition
Z-loss adds a term proportional to $\log^2 Z$ to the cross-entropy objective, where $Z = \sum_i e^{z_i}$ is the softmax partition function over the vocabulary and $z_i$ are the pre-softmax logits.
A typical coefficient is $\alpha = 10^{-4}$. The term encourages $\log Z$ to stay near zero, which is equivalent to keeping the unnormalized logits bounded.
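A minimal sketch of the combined objective in PyTorch (the function name, signature, and the 1e-4 default are illustrative, not taken from PaLM's or Marin's codebases):

```python
import torch
import torch.nn.functional as F

def cross_entropy_with_z_loss(logits, targets, z_loss_coef=1e-4):
    """Cross-entropy plus z-loss; 1e-4 follows the common default coefficient."""
    # logits: (batch, vocab_size), targets: (batch,) of class indices
    ce = F.cross_entropy(logits, targets)
    # log Z per example: log-sum-exp over the vocabulary dimension
    log_z = torch.logsumexp(logits, dim=-1)
    z_loss = z_loss_coef * (log_z ** 2).mean()
    return ce + z_loss
```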
Why it matters
Two reasons, in order of how often they come up:
- Stability. Without z-loss, large LLM training runs occasionally see the lm_head logits drift to very large magnitudes, which can cause numerical issues in the softmax (especially in BF16 or FP8). PaLM and DeepSeek both used z-loss for this.
- Regularization on logit scale during long cooldowns. This is the use case Marin made explicit. Layer norms are typically excluded from weight decay, so during very low-LR cooldowns there is no remaining regularization pressure on the final layer norm or lm_head. Z-loss is the only thing that bounds them. Without it, training loss can slowly creep upward at very low LR even though nothing is technically diverging. (See the Marin 8B retrospective for the failure mode.)
The mechanism
The cross-entropy loss for a single token with target index $y$ is:

$$\mathcal{L}_{\mathrm{CE}} = -\log \frac{e^{z_y}}{Z} = -z_y + \log Z$$

Note that $\mathcal{L}_{\mathrm{CE}}$ is invariant to adding a constant to all logits (because $\log Z$ shifts by the same amount that $z_y$ does). That invariance means cross-entropy alone provides no gradient pressure on the absolute scale of the logits. The model can drift the entire logit vector up or down without changing the loss.
Adding $\alpha \log^2 Z$ breaks the invariance. The gradient with respect to a logit $z_i$ becomes:

$$\frac{\partial \mathcal{L}}{\partial z_i} = \left(p_i - \mathbb{1}[i = y]\right) + 2\alpha \,(\log Z)\, p_i, \qquad p_i = \frac{e^{z_i}}{Z}$$

The second term pulls $\log Z$ toward zero, which keeps the logits centered around a reasonable scale. The pressure is small per step but persistent.
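A quick numeric check of both claims, again a PyTorch sketch with toy values: shifting every logit by a constant leaves cross-entropy unchanged, but it moves $\log Z$, which is exactly what the penalty sees.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(1, 8)              # one token, toy vocabulary of size 8
target = torch.tensor([3])

# Cross-entropy is invariant to adding a constant to all logits.
ce = F.cross_entropy(logits, target)
ce_shifted = F.cross_entropy(logits + 5.0, target)
print(torch.allclose(ce, ce_shifted))   # True

# log Z is not invariant: it grows by exactly the shift, so the z-loss
# term produces a gradient that opposes the drift.
print(torch.logsumexp(logits, dim=-1).item())
print(torch.logsumexp(logits + 5.0, dim=-1).item())
```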
When to use it
Use z-loss as a default if any of:
- You’re doing pretraining at scale (>1B parameters, >100B tokens). The cost is negligible and it prevents a class of late-training failures that are hard to debug.
- You’re doing a long or deep cooldown (LR decayed by 100x or more from peak). Marin observed slow loss creep at LR=1.7e-5 that was fixed by adding z-loss.
- You’re training in low precision (FP8 forward, in particular) where logit blowup is more likely to cause numerical issues.
You probably don’t need it for short fine-tuning runs at moderate LR.
How to set the coefficient
$\alpha = 10^{-4}$ is the de facto default (PaLM, DeepSeek, Marin). The loss is dimensionless, so the added penalty is a tiny fraction of typical cross-entropy values (which are O(1) to O(10) bits per token), but enough to apply persistent pressure.
If you set $\alpha$ much larger, the model spends capacity driving $\log Z$ toward zero instead of fitting the data, and downstream perplexity worsens. If you set it much smaller, you lose the regularization effect.
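As a back-of-the-envelope illustration (the numbers are hypothetical, not measured from any run): with $\alpha = 10^{-4}$ and $\log Z \approx 5$, the penalty adds only $10^{-4} \times 25 = 0.0025$ to a cross-entropy of a few nats, yet it still supplies a small, nonzero gradient on logit scale at every step.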
Subtle effects worth knowing
Subsequent Marin experiments showed that z-loss doesn’t just prevent logit blowup. In the steady state, z-loss increases the weight norm of the lm_head and decreases the scale of the final layer norm. This makes sense in retrospect: the final layer norm has a disproportionate impact on logit scale (its gain multiplies the hidden state that feeds every logit), and it’s the only parameter group not constrained by weight decay. Z-loss redistributes the pressure.
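If you want to watch this effect yourself, here is a sketch of the statistics worth logging, assuming a Hugging-Face-style module layout where the output head is model.lm_head and the final norm is model.model.norm (attribute names vary by codebase):

```python
import torch

def logit_scale_stats(model):
    # Quantities worth logging when studying z-loss's steady-state effect:
    # the lm_head weight norm tends to rise, the final layer-norm gain to shrink.
    return {
        "lm_head_weight_norm": model.lm_head.weight.norm().item(),
        "final_layer_norm_mean_gain": model.model.norm.weight.abs().mean().item(),
    }
```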
What an interviewer expects you to say
If asked about z-loss:
- State that it’s a penalty on $\log^2 Z$ added to cross-entropy.
- Explain that cross-entropy is invariant to adding a constant to all logits, so it provides no gradient on absolute logit scale. Z-loss breaks that invariance.
- Mention both use cases: stability against logit blowup and regularization on logit scale during deep cooldowns.
- Bonus: note that layer norms are not weight-decayed, so z-loss is the only regularization pressure on the final layer norm.
Further reading
- Marin 8B retrospective (Spoonbill section, where z-loss was diagnosed).
- Marin z-loss ablation experiment GH#935 and the WandB report.
- PaLM paper, Section 5.1 (the original published use of z-loss in a frontier LLM).