
Weight decay vs. L2 regularization

L2 adds ½λ‖θ‖² to the loss; weight decay shrinks θ multiplicatively at each step. They are equivalent under SGD but not under Adam, which is why AdamW exists.


One-line definition

L2 regularization adds a penalty ½λ‖θ‖² to the loss; weight decay multiplies parameters by (1 − ηλ) at each step. Under vanilla SGD they are mathematically equivalent; under adaptive optimizers like Adam they are not, and the difference is large enough that AdamW (Loshchilov & Hutter, 2019) is now the default for transformer training.

The two formulations

L2 (penalty added to loss)

Loss: L(θ) + ½λ‖θ‖². Gradient: ∇L(θ) + λθ. Update under SGD: θ ← θ − η(∇L(θ) + λθ).

Weight decay (multiplicative shrink)

Update: θ ← (1 − ηλ)θ − η∇L(θ), which expands to θ − η(∇L(θ) + λθ): the same expression. Under vanilla SGD, the two are identical.
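
A minimal NumPy check of that equivalence (names here are illustrative, not from any library):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=5)
grad = rng.normal(size=5)    # gradient of the data loss at theta
lr, lam = 0.1, 0.01          # learning rate eta, decay strength lambda

# L2 as a penalty: the lam * theta term rides along inside the gradient.
theta_l2 = theta - lr * (grad + lam * theta)

# Weight decay: shrink theta multiplicatively, then take the plain step.
theta_wd = (1 - lr * lam) * theta - lr * grad

assert np.allclose(theta_l2, theta_wd)   # identical under vanilla SGD
```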

Why they diverge under Adam

Adam scales the gradient per parameter by 1/(√v̂ₜ + ε) before applying the update. The L2 contribution λθ is part of the gradient, so it gets divided by √v̂ₜ + ε too (momentum omitted for clarity):

θ ← θ − η (∇L(θ) + λθ) / (√v̂ₜ + ε)

The effective decay on each parameter is now scaled by 1/(√v̂ₜ + ε): large for parameters with small gradient variance, small for parameters with large gradient variance. This couples regularization to the gradient history in an unintended way.
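
A one-step NumPy sketch makes the rescaling concrete. Assuming the very first Adam step, the bias-corrected second moment is just g², so it serves as a stand-in for v̂ₜ:

```python
import numpy as np

theta = np.array([1.0, 1.0])
grad = np.array([0.01, 10.0])   # small-gradient vs. large-gradient parameter
lam, lr, eps = 0.1, 1e-3, 1e-8

g = grad + lam * theta          # L2 term rides inside the gradient
v_hat = g**2                    # first-step stand-in for Adam's second moment
decay_part = lr * lam * theta / (np.sqrt(v_hat) + eps)
print(decay_part)               # ~[9.1e-4, 9.9e-6]: ~90x gap in effective decay
```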

AdamW decouples them: apply Adam to the data loss only, and then shrink the parameters multiplicatively as a separate step:

θ ← θ − η ∇L(θ) / (√v̂ₜ + ε) − ηλθ

The shrink term ηλθ has no 1/(√v̂ₜ + ε) scaling. This recovers the SGD-equivalent behavior.
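
The same toy step with the decay decoupled (again a first-step sketch with momentum omitted):

```python
import numpy as np

theta = np.array([1.0, 1.0])
grad = np.array([0.01, 10.0])
lam, lr, eps = 0.1, 1e-3, 1e-8

g = grad                        # data-loss gradient only; no lam * theta inside
v_hat = g**2                    # first-step stand-in for the second moment
theta = theta - lr * g / (np.sqrt(v_hat) + eps)  # Adam step on the data loss
theta = theta - lr * lam * theta                 # decoupled shrink: same eta * lambda for every parameter
```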

Empirical impact

Loshchilov & Hutter (2019) and many follow-up benchmarks show AdamW generalizes meaningfully better than Adam-with-L2 across vision and NLP. The exact gain depends on the task; on transformer LLM training the gap is large enough that essentially all modern training uses AdamW.

What to skip

Common practice: do not decay biases, LayerNorm parameters, or embeddings. Biases and norm parameters are 1D tensors with a different statistical role, and decaying them (or the embedding table) often hurts. Standard implementations construct two parameter groups: {decay: linear weights, conv kernels} and {no decay: biases, norms, embeddings}.
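
One common way to build the two groups in PyTorch. This is a sketch, not a canonical implementation; the name-based embedding filter is a heuristic you would adapt to your model:

```python
import torch
import torch.nn as nn

def param_groups(model: nn.Module, weight_decay: float = 0.1):
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # 1D tensors cover biases and norm scales; catch embeddings by name.
        if p.ndim < 2 or "embed" in name:
            no_decay.append(p)
        else:
            decay.append(p)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

model = nn.Sequential(nn.Linear(16, 16), nn.LayerNorm(16), nn.Linear(16, 4))
optimizer = torch.optim.AdamW(param_groups(model), lr=3e-4)
```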

Common pitfalls

  • Using Adam with weight_decay > 0 in PyTorch. This applies L2-as-gradient, not AdamW. Use AdamW explicitly.
  • Decaying bias and LayerNorm parameters. Hurts performance; exclude them via parameter groups.
  • Picking λ from a CNN recipe. λ ≈ 1e-4 is typical for SGD-trained ResNets; λ ≈ 0.01–0.1 for AdamW transformer pretraining (with the no-decay carve-out). Different scale, different rule of thumb.
  • Forgetting that decay scales with LR. Effective shrink per step is ηλ. Halving LR halves effective decay; you may need to compensate (see the sketch after this list).
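
A two-line worked example of that LR/decay coupling (numbers are illustrative):

```python
lr, lam = 3e-4, 0.1
shrink = lr * lam            # effective multiplicative shrink per step: 3e-5
lr_half = lr / 2
lam_new = shrink / lr_half   # 0.2: doubling lambda compensates the halved LR
```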