SGD with momentum

Add a moving average of past gradients to the update. Smoother trajectories, faster convergence in narrow valleys, and the foundation of Adam's first moment.

One-line definition

SGD with momentum maintains a velocity v (an exponential moving average of past gradients) and updates parameters with v instead of with the raw gradient:

v_t = β v_{t-1} + g_t
θ_t = θ_{t-1} - η v_t

Typical β = 0.9.
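
As a minimal sketch of that update (NumPy, with illustrative names and hyperparameters, not taken from any particular framework):

```python
import numpy as np

def sgd_momentum_step(theta, grad, velocity, lr=0.01, beta=0.9):
    """One SGD+momentum update: v <- beta*v + g, then theta <- theta - lr*v."""
    velocity = beta * velocity + grad   # accumulate past gradients exponentially
    theta = theta - lr * velocity       # step along the velocity, not the raw gradient
    return theta, velocity

# Toy usage: minimize f(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
theta, velocity = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(100):
    theta, velocity = sgd_momentum_step(theta, grad=theta, velocity=velocity)
print(theta)   # close to the minimum at the origin
```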

Why it matters

Vanilla SGD bounces around in narrow loss valleys: gradient components perpendicular to the valley axis flip sign from step to step and cancel only slowly, while components along the axis are small. Momentum accumulates the consistent along-axis component while the perpendicular components average toward zero.
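
To see the valley effect numerically, here is a toy quadratic valley (shallow along x, steep along y); the function and hyperparameters are illustrative:

```python
import numpy as np

def grad(p):
    x, y = p
    return np.array([0.02 * x, 2.0 * y])   # gradient of 0.01*x^2 + y^2

def run(beta, lr=0.5, steps=200):
    p, v = np.array([10.0, 1.0]), np.zeros(2)
    for _ in range(steps):
        v = beta * v + grad(p)
        p = p - lr * v
    return p

print(run(beta=0.0))   # ~[1.3, 0]: y collapses quickly, but progress along the valley (x) is slow
print(run(beta=0.9))   # ~[0, 0]: accumulated velocity carries the step along the valley
```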

Empirically, momentum is essential for SGD to be competitive with adaptive optimizers on most problems. SGD without momentum is rarely used in modern training. Adam’s first-moment estimate is essentially momentum, which is why Adam inherits this benefit.

Two formulations

Classical momentum (Polyak, 1964)

Accumulate the gradient into a velocity and step along it:

v_t = β v_{t-1} + g_t
θ_t = θ_{t-1} - η v_t

Effective LR for a constant gradient g: the velocity converges to g / (1 - β), so the effective step size is η / (1 - β). With β = 0.9, that's 10η, so reducing β effectively reduces the step size.
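
A tiny numerical check of the effective-LR claim (constant gradient of 1.0, illustrative numbers):

```python
beta, lr, g = 0.9, 0.1, 1.0
v = 0.0
for _ in range(100):
    v = beta * v + g   # velocity under a constant gradient
print(v)        # ~10.0, i.e. g / (1 - beta)
print(lr * v)   # ~1.0, an effective step of lr / (1 - beta) per iteration
```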

Nesterov momentum (Nesterov, 1983)

Compute the gradient at the look-ahead position θ_{t-1} - η β v_{t-1} instead of at θ_{t-1}. Updates:

v_t = β v_{t-1} + ∇L(θ_{t-1} - η β v_{t-1})
θ_t = θ_{t-1} - η v_t

In practice, it is only marginally better than classical momentum on most workloads. It appears in some vision training recipes (e.g., ResNet-style training).
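
A minimal look-ahead sketch, written to match the classical form above (a hand-rolled illustration; PyTorch exposes the same idea via the nesterov=True flag on torch.optim.SGD):

```python
import numpy as np

def nesterov_step(theta, grad_fn, velocity, lr=0.01, beta=0.9):
    """One Nesterov step: evaluate the gradient at the look-ahead point."""
    lookahead = theta - lr * beta * velocity        # where momentum alone would carry us
    velocity = beta * velocity + grad_fn(lookahead)
    theta = theta - lr * velocity
    return theta, velocity

# Toy usage on f(theta) = 0.5 * ||theta||^2 (gradient is theta itself).
theta, velocity = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(100):
    theta, velocity = nesterov_step(theta, lambda p: p, velocity)
```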

Picking β

Workload and suggested β:

  • ResNet / CNN training: 0.9
  • Reinforcement learning policy nets: 0.9
  • Very noisy gradients (RL, contrastive): 0.95 or 0.99
  • Small batch with high noise: lower (0.5–0.8) to track recent gradients

β controls the effective averaging window: roughly 1 / (1 - β) steps (the weight on a gradient from k steps ago is β^k, so about 63% of the total weight sits in the most recent 1 / (1 - β) steps). β = 0.9 averages over ~10 steps; β = 0.99 over ~100.

When to use SGD+momentum vs. Adam

Situation and default choice:

  • Vision (CNN, ViT) classification with strong regularization: SGD + momentum + cosine LR
  • Transformers (NLP, LLM training): Adam / AdamW
  • Small datasets, fine-tuning: Adam (less hyperparameter tuning needed)
  • Sparse gradients (recsys embeddings): Adam (per-parameter adaptive LR)
  • Reinforcement learning: Adam (default in PPO / DQN implementations)

SGD+momentum often generalizes slightly better than Adam at convergence (the sharp-vs-flat-minima discussion), while Adam converges faster early in training. For LLMs, Adam wins because the gradient distribution across parameters is highly non-uniform.
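
As a concrete sketch of the two default setups in PyTorch (the model, epoch count, and hyperparameters below are illustrative placeholders):

```python
import torch

model = torch.nn.Linear(128, 10)   # placeholder model
epochs = 90

# Vision-style default: SGD + momentum + weight decay + cosine LR schedule.
opt_sgd = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt_sgd, T_max=epochs)

# Transformer-style default: AdamW with decoupled weight decay.
opt_adamw = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.01)
```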

Common pitfalls

  • Forgetting that momentum scales the effective LR. Switching from β = 0 to β = 0.9 effectively multiplies the LR by 10.
  • Initializing v_0 = 0 without bias correction. Early steps have a velocity biased toward zero; for SGD this is usually fine, but Adam explicitly bias-corrects.
  • Mixing momentum across LR changes. When the LR jumps, momentum carries old-LR-scaled velocities. Some implementations zero momentum at LR transitions (see the sketch after this list).
  • Using SGD without momentum. Almost always strictly worse; pick momentum or use Adam.
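
For the LR-transition pitfall, a hedged sketch of zeroing the momentum buffers in PyTorch (this resets torch.optim.SGD's internal momentum_buffer state; whether you want this depends on the schedule):

```python
import torch

def reset_momentum(optimizer):
    """Zero SGD momentum buffers, e.g. right after an abrupt LR change."""
    for group in optimizer.param_groups:
        for p in group["params"]:
            buf = optimizer.state.get(p, {}).get("momentum_buffer")
            if buf is not None:
                buf.zero_()

model = torch.nn.Linear(16, 4)   # placeholder model
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# ... training steps ...
for group in opt.param_groups:
    group["lr"] = 0.01           # abrupt LR change
reset_momentum(opt)              # drop velocities built under the old LR
```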