
Weight decay vs. L2 regularization

L2 adds ½λ‖θ‖² to the loss; weight decay shrinks θ multiplicatively at each step. They are equivalent under SGD but not under Adam, which is why AdamW exists.


One-line definition

L2 regularization adds a penalty ½λ‖θ‖² to the loss; weight decay multiplies parameters by (1 − ηλ) at each step. Under vanilla SGD they are mathematically equivalent; under adaptive optimizers like Adam they are not, and the difference is large enough that AdamW (Loshchilov & Hutter, 2019) is now the default for transformer training.

The two formulations

L2 (penalty added to loss)

Loss: L(θ) + ½λ‖θ‖². Gradient: ∇L(θ) + λθ. Update under SGD: θ ← θ − η(∇L(θ) + λθ).

Weight decay (multiplicative shrink)

Update: θ ← (1 − ηλ)θ − η∇L(θ), which expands to θ − η(∇L(θ) + λθ): the same expression. Under vanilla SGD, the two are identical.
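
A minimal NumPy check of that equivalence (names here are illustrative, not from any library):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=5)
grad = rng.normal(size=5)    # gradient of the data loss at theta
lr, lam = 0.1, 0.01          # learning rate eta, decay strength lambda

# L2 as a penalty: the lam * theta term rides along inside the gradient.
theta_l2 = theta - lr * (grad + lam * theta)

# Weight decay: shrink theta multiplicatively, then take the plain step.
theta_wd = (1 - lr * lam) * theta - lr * grad

assert np.allclose(theta_l2, theta_wd)   # identical under vanilla SGD
```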

Why they diverge under Adam

Adam scales the gradient per parameter by 1/(√v̂ₜ + ε) before applying the update. The L2 contribution λθ is part of the gradient, so it gets divided by √v̂ₜ + ε too (momentum omitted for clarity):

θ ← θ − η (∇L(θ) + λθ) / (√v̂ₜ + ε)

The effective decay on each parameter is now scaled by 1/(√v̂ₜ + ε): large for parameters with small gradient variance, small for parameters with large gradient variance. This couples regularization to the gradient history in an unintended way.
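
A one-step NumPy sketch makes the rescaling concrete. Assuming the very first Adam step, the bias-corrected second moment is just g², so it serves as a stand-in for v̂ₜ:

```python
import numpy as np

theta = np.array([1.0, 1.0])
grad = np.array([0.01, 10.0])   # small-gradient vs. large-gradient parameter
lam, lr, eps = 0.1, 1e-3, 1e-8

g = grad + lam * theta          # L2 term rides inside the gradient
v_hat = g**2                    # first-step stand-in for Adam's second moment
decay_part = lr * lam * theta / (np.sqrt(v_hat) + eps)
print(decay_part)               # ~[9.1e-4, 9.9e-6]: ~90x gap in effective decay
```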

AdamW decouples them: apply Adam to the data loss only, and then shrink the parameters multiplicatively as a separate step:

θ ← θ − η ∇L(θ) / (√v̂ₜ + ε) − ηλθ

The shrink term ηλθ has no 1/(√v̂ₜ + ε) scaling. This recovers the SGD-equivalent behavior.
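
The same toy step with the decay decoupled (again a first-step sketch with momentum omitted):

```python
import numpy as np

theta = np.array([1.0, 1.0])
grad = np.array([0.01, 10.0])
lam, lr, eps = 0.1, 1e-3, 1e-8

g = grad                        # data-loss gradient only; no lam * theta inside
v_hat = g**2                    # first-step stand-in for the second moment
theta = theta - lr * g / (np.sqrt(v_hat) + eps)  # Adam step on the data loss
theta = theta - lr * lam * theta                 # decoupled shrink: same eta * lambda for every parameter
```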

Empirical impact

Loshchilov & Hutter (2019) and many follow-up benchmarks show AdamW generalizes meaningfully better than Adam-with-L2 across vision and NLP. The exact gain depends on the task; on transformer LLM training the gap is large enough that essentially all modern training uses AdamW.

What to skip

Common practice: do not decay biases, LayerNorm parameters, or embeddings. Biases and norm parameters are 1D tensors with a different statistical role, and decaying them (or the embedding table) often hurts. Standard implementations construct two parameter groups: {decay: linear weights, conv kernels} and {no decay: biases, norms, embeddings}.
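
One common way to build the two groups in PyTorch. This is a sketch, not a canonical implementation; the name-based embedding filter is a heuristic you would adapt to your model:

```python
import torch
import torch.nn as nn

def param_groups(model: nn.Module, weight_decay: float = 0.1):
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # 1D tensors cover biases and norm scales; catch embeddings by name.
        if p.ndim < 2 or "embed" in name:
            no_decay.append(p)
        else:
            decay.append(p)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

model = nn.Sequential(nn.Linear(16, 16), nn.LayerNorm(16), nn.Linear(16, 4))
optimizer = torch.optim.AdamW(param_groups(model), lr=3e-4)
```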

Common pitfalls

  • Using Adam with weight_decay > 0 in PyTorch. This applies L2-as-gradient, not AdamW. Use AdamW explicitly.
  • Decaying bias and LayerNorm parameters. Hurts performance; exclude them via parameter groups.
  • Picking λ from a CNN recipe. λ ≈ 1e-4 is typical for SGD-trained ResNets; λ ≈ 0.01–0.1 for AdamW transformer pretraining (with the no-decay carve-out). Different scale, different rule of thumb.
  • Forgetting that decay scales with LR. Effective shrink per step is ηλ. Halving LR halves effective decay; you may need to compensate (see the sketch after this list).
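
A two-line worked example of that LR/decay coupling (numbers are illustrative):

```python
lr, lam = 3e-4, 0.1
shrink = lr * lam            # effective multiplicative shrink per step: 3e-5
lr_half = lr / 2
lam_new = shrink / lr_half   # 0.2: doubling lambda compensates the halved LR
```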