Learning rate schedules: warmup and cosine decay

Why almost every modern training run linearly warms up the LR over a few hundred steps and then decays it on a cosine to near zero.


One-line definition

A learning-rate schedule is a function that varies the optimizer’s step size over training. The dominant 2026 default for LLMs is linear warmup for a few hundred to a few thousand steps, followed by cosine decay down to ~10% of the peak LR.

Why warmup

Adam-family optimizers track a running estimate of the gradient’s second moment ($v_t$ in Adam). At step 1 that estimate is built from a single gradient and is extremely noisy, and the bias-corrected update divides by $\sqrt{\hat v_t} + \epsilon$, so a too-small denominator produces huge updates. With a small LR for the first steps, the optimizer accumulates reliable second-moment statistics before the LR ramps up. Without warmup, transformer training routinely diverges in the first few hundred steps.
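
For reference, the standard Adam update with bias correction, where $g_t$ is the gradient and $\eta_t$ the scheduled LR:

    \begin{aligned}
    m_t &= \beta_1 m_{t-1} + (1-\beta_1)\,g_t, & \hat m_t &= m_t / (1-\beta_1^t),\\
    v_t &= \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2, & \hat v_t &= v_t / (1-\beta_2^t),\\
    \theta_{t+1} &= \theta_t - \eta_t\,\hat m_t / (\sqrt{\hat v_t} + \epsilon).
    \end{aligned}

At $t = 1$, $\hat v_1 = g_1^2$, so $\hat m_1 / (\sqrt{\hat v_1} + \epsilon) \approx \operatorname{sign}(g_1)$: every coordinate moves by roughly the full LR regardless of gradient scale, which is exactly the situation warmup protects against.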

Typical warmup: a few hundred to a few thousand steps for pretraining; ~100 steps is usually enough for fine-tuning.

Why cosine

After warmup, hold near the peak LR for most of training to make progress; near the end, decay so the optimizer settles into a flatter minimum. Cosine decay,

    \eta_t = \eta_{\min} + \tfrac{1}{2}\,(\eta_{\max} - \eta_{\min})\Bigl(1 + \cos\bigl(\pi \cdot \tfrac{t - t_{\text{warmup}}}{T - t_{\text{warmup}}}\bigr)\Bigr),

spends most of its LR budget near the peak and flattens out again near the end, which empirically beats linear or step decay across most workloads (Loshchilov & Hutter, 2017).

Common practice: cosine down to $\eta_{\min} \approx 0.1\,\eta_{\max}$ over the full training horizon $T$.
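
A minimal sketch of the warmup-plus-cosine schedule as a pure function of the step index; the names (warmup_cosine_lr, peak_lr, warmup_steps, total_steps, min_lr_ratio) are illustrative, not from any particular library:

    import math

    def warmup_cosine_lr(step, peak_lr, warmup_steps, total_steps, min_lr_ratio=0.1):
        """Linear warmup to peak_lr, then cosine decay to min_lr_ratio * peak_lr."""
        if step < warmup_steps:
            # Linear warmup: LR grows with the fraction of warmup completed.
            return peak_lr * (step + 1) / warmup_steps
        # Fraction of the decay phase completed, clipped to [0, 1].
        progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
        min_lr = min_lr_ratio * peak_lr
        # Cosine factor goes from 1 at the start of decay to 0 at the end.
        return min_lr + (peak_lr - min_lr) * 0.5 * (1.0 + math.cos(math.pi * progress))

In PyTorch, a function like this is typically wrapped in torch.optim.lr_scheduler.LambdaLR by passing a lambda that returns the schedule value divided by the optimizer's base LR.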

How to set the peak LR

For Adam/AdamW on transformer training, the default starting point is 3e-4 (Karpathy’s “magic constant”) for moderate batch sizes. Larger batches scale the peak LR up roughly linearly with batch size, until the LR-batch tradeoff breaks down at very large batches.

For SFT or task-specific fine-tuning of a pretrained model: 10–100× lower than the pretraining LR (roughly 3e-6 to 3e-5).

An LR range test (Smith, 2017): sweep the LR exponentially from a very small value up to one well past the expected peak over a few hundred steps, and plot loss vs. LR. Pick the LR where the loss is descending most steeply (typically ~10× below the point where the loss diverges).
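
A minimal sketch of such a range test, assuming a train_step(batch) callable that runs forward, backward, and optimizer.step() and returns the loss; the endpoints and the divergence heuristic are illustrative choices:

    import math

    def lr_range_test(optimizer, data_iter, train_step,
                      lr_min=1e-7, lr_max=1.0, num_steps=300):
        """Sweep the LR exponentially and record (lr, loss) pairs to plot afterwards."""
        history = []
        for step in range(num_steps):
            # Exponential interpolation from lr_min to lr_max.
            lr = lr_min * (lr_max / lr_min) ** (step / (num_steps - 1))
            for group in optimizer.param_groups:
                group["lr"] = lr
            loss = float(train_step(next(data_iter)))
            history.append((lr, loss))
            # Stop once the loss clearly blows up past its starting value.
            if math.isnan(loss) or loss > 4 * history[0][1]:
                break
        return history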

Other schedules

  • Constant: useful for online learning / RL where the data distribution shifts.
  • Inverse square root (original transformer paper): $\eta_t \propto 1/\sqrt{t}$ after warmup. Largely superseded by cosine.
  • One-cycle (Smith, 2018): warmup to a high peak, then decay aggressively. Used in some vision training; uncommon in LLMs.
  • WSD (Warmup-Stable-Decay): warmup, hold the peak LR constant for most of training, then decay sharply at the end; see the sketch after this list. Used in some recent LLM training (Hu et al., 2024) for easier checkpoint resumption.
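
A minimal sketch of WSD, assuming a linear ramp-down over the final decay_steps steps; the argument names and the exact decay shape are illustrative, and implementations differ:

    def wsd_lr(step, peak_lr, warmup_steps, total_steps, decay_steps, min_lr_ratio=0.0):
        """Warmup-Stable-Decay: linear warmup, long flat plateau, short final decay."""
        if step < warmup_steps:
            return peak_lr * (step + 1) / warmup_steps
        decay_start = total_steps - decay_steps
        if step < decay_start:
            # Stable phase: constant peak LR, so any checkpoint here is easy to resume from.
            return peak_lr
        # Final decay phase: ramp down to min_lr_ratio * peak_lr.
        frac = min(1.0, (step - decay_start) / max(1, decay_steps))
        return peak_lr - (peak_lr - min_lr_ratio * peak_lr) * frac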

Common pitfalls

  • Skipping warmup. Adam without warmup on a transformer routinely diverges.
  • Decaying too fast. Aggressive decay limits how far the model can move; cosine-to-10% is a reliable default.
  • Forgetting LR scales with batch. Doubling the batch usually requires roughly doubling the LR.
  • Resuming a cosine schedule from a checkpoint. If the cosine is parameterized over total steps, resuming with a different total breaks the schedule. Save the schedule state with the checkpoint, as in the sketch below.
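
A minimal sketch of the fix in PyTorch, assuming the schedule lives in a torch.optim.lr_scheduler object (e.g. LambdaLR) whose state_dict() carries its internal step counter; the checkpoint keys are illustrative:

    import torch

    def save_checkpoint(path, step, model, optimizer, scheduler):
        # Persist the schedule state alongside the weights and optimizer moments.
        torch.save({
            "step": step,
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "scheduler": scheduler.state_dict(),
        }, path)

    def load_checkpoint(path, model, optimizer, scheduler):
        # Restore all three, and keep total_steps identical to the original run.
        ckpt = torch.load(path)
        model.load_state_dict(ckpt["model"])
        optimizer.load_state_dict(ckpt["optimizer"])
        scheduler.load_state_dict(ckpt["scheduler"])
        return ckpt["step"]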