Backpropagation

Reverse-mode automatic differentiation applied to a computation graph. The algorithm that computes gradients for every parameter in one backward pass.

Reviewed · 3 min read

One-line definition

Backpropagation computes gradients of a scalar loss with respect to every parameter in a neural network in one backward pass through the computation graph, by applying the chain rule from the output back to the inputs and reusing intermediate computations.

Why it matters

Without backprop, training a deep network would require either:

  • Numerical differentiation: $O(P)$ extra forward passes for $P$ parameters. Infeasible.
  • Forward-mode autodiff: $O(P)$ passes as well; works for small parameter counts but not for neural nets.

Backprop computes all gradients in $O(C)$ time, where $C$ is the cost of a forward pass; in practice the backward pass costs about 2–3× the forward pass. This is the algorithmic enabler of all modern deep learning.
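To make the cost comparison concrete, here is a minimal sketch (PyTorch; the tiny model and names are illustrative, not from the original) contrasting finite differences, which need two extra forward passes per parameter, with a single backward pass that yields every gradient at once:

```python
import torch

# Tiny model: one linear map followed by a scalar squared-error loss.
torch.manual_seed(0)
w = torch.randn(5, requires_grad=True)   # 5 parameters
x = torch.randn(5)
target = torch.tensor(2.0)

def loss_fn(weights):
    return ((weights @ x) - target) ** 2

# Backprop: one forward + one backward pass gives all 5 gradients.
loss = loss_fn(w)
loss.backward()
grad_backprop = w.grad.clone()

# Numerical differentiation: 2 extra forward passes *per parameter*.
eps = 1e-4
grad_numeric = torch.zeros_like(w)
for i in range(w.numel()):
    e = torch.zeros_like(w)
    e[i] = eps
    with torch.no_grad():
        grad_numeric[i] = (loss_fn(w + e) - loss_fn(w - e)) / (2 * eps)

print(torch.allclose(grad_backprop, grad_numeric, atol=1e-3))  # True
```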

The algorithm

Compute the loss $L$ as a function of inputs $x$ and parameters $\theta$ by composing simple operations: $L = f_n(f_{n-1}(\cdots f_1(x; \theta) \cdots))$. Each $f_i$ has known local derivatives.

Forward pass: compute $a_i = f_i(a_{i-1})$ (with $a_0 = x$) and store the $a_i$ along the way. The intermediates (activations) are needed for the backward pass.

Backward pass: starting from $\partial L / \partial L = 1$, recursively apply:

$$\frac{\partial L}{\partial a_{i-1}} = \left(\frac{\partial f_i}{\partial a_{i-1}}\right)^{\!\top} \frac{\partial L}{\partial a_i}, \qquad \frac{\partial L}{\partial \theta_i} = \left(\frac{\partial f_i}{\partial \theta_i}\right)^{\!\top} \frac{\partial L}{\partial a_i}$$

The “gradient w.r.t. $a_i$” is the upstream gradient; the local Jacobian $\partial f_i / \partial a_{i-1}$ is multiplied in (as a vector-Jacobian product, never materialized as a full matrix).
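A minimal sketch of the two passes for a two-layer network, written against plain PyTorch tensors so the chain rule is explicit, then checked against autograd. The shapes and names are illustrative, not from the original:

```python
import torch

torch.manual_seed(0)
x  = torch.randn(4)                       # input
W1 = torch.randn(3, 4, requires_grad=True)
W2 = torch.randn(1, 3, requires_grad=True)

# Forward pass: compute and *store* the intermediates (activations).
z1 = W1 @ x                               # pre-activation
a1 = torch.relu(z1)                       # activation, needed for backward
y  = W2 @ a1
loss = 0.5 * (y ** 2).sum()

# Backward pass by hand: chain rule from the loss back to the parameters.
dL_dy  = y.detach()                                  # d(0.5*y^2)/dy = y
dL_dW2 = dL_dy[:, None] * a1.detach()[None, :]       # outer product
dL_da1 = W2.detach().T @ dL_dy                       # upstream gradient for a1
dL_dz1 = dL_da1 * (z1.detach() > 0).float()          # ReLU local derivative
dL_dW1 = dL_dz1[:, None] * x[None, :]                # outer product

# Autograd performs the same recursion automatically.
loss.backward()
print(torch.allclose(dL_dW1, W1.grad), torch.allclose(dL_dW2, W2.grad))  # True True
```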

Vector-Jacobian products (VJPs)

For an op $y = f(x)$ where both $x$ and $y$ are vectors, the Jacobian $\partial y / \partial x$ would be enormous. Backprop computes only the VJP: $\bar{x} = \left(\frac{\partial y}{\partial x}\right)^{\!\top} \bar{y}$, where $\bar{y} = \partial L / \partial y$.

Each elementary op has a hand-coded VJP rule. Frameworks (PyTorch, JAX, TensorFlow) compose them automatically.
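A hedged sketch of what “VJP, not Jacobian” means in practice, using torch.autograd.grad with an explicit grad_outputs vector playing the role of the upstream gradient $\bar{y}$; the op and sizes are made up for illustration:

```python
import torch

x = torch.randn(1000, requires_grad=True)
W = torch.randn(1000, 1000)

y = torch.tanh(W @ x)          # y is 1000-dim; the full Jacobian would be 1000x1000

# Pretend this is the upstream gradient dL/dy arriving from later layers.
y_bar = torch.randn(1000)

# VJP: one product y_bar^T (dy/dx); the 1000x1000 Jacobian is never built.
(x_bar,) = torch.autograd.grad(outputs=y, inputs=x, grad_outputs=y_bar)
print(x_bar.shape)             # torch.Size([1000])
```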

Memory cost

Backprop must store all forward activations until the backward pass uses them. Memory is proportional to the depth of the network times batch size times activation size; this often dominates GPU memory in deep transformer training.

Mitigations:

  • Activation checkpointing: recompute selected activations during backward instead of storing them (sketched after this list).
  • Mixed precision: store activations in BF16 instead of FP32.
  • Sequence packing + smaller batch.
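A minimal sketch of activation checkpointing with torch.utils.checkpoint; the block structure and sizes are illustrative, not a recommendation:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# A deep stack of blocks whose activations would normally all be stored.
blocks = nn.ModuleList(
    [nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()) for _ in range(24)]
)

def forward(x):
    for block in blocks:
        # Inside checkpoint(), activations are NOT stored; they are
        # recomputed during the backward pass (extra compute, less memory).
        x = checkpoint(block, x, use_reentrant=False)
    return x

x = torch.randn(8, 1024, requires_grad=True)
loss = forward(x).sum()
loss.backward()   # re-runs each block's forward internally
```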

Connection to reverse-mode autodiff

Backprop is reverse-mode autodiff applied to scalar-output, vector-input functions ($f: \mathbb{R}^n \to \mathbb{R}$). Reverse mode is efficient when the output dimension is much smaller than the input dimension; for the opposite case ($n$ inputs, $m$ outputs with $m \gg n$), forward mode is preferred. Neural network gradients always have a single scalar output (the loss) and a huge number of inputs (the parameters), so reverse mode is the right choice.
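To see the asymmetry concretely, here is a hedged sketch using torch.autograd.functional: for a scalar-valued function of $n$ inputs, one VJP call recovers the full gradient, while a JVP call yields only a directional derivative, so assembling the gradient forward-mode style would take $n$ calls. The function and sizes are illustrative:

```python
import torch
from torch.autograd.functional import vjp, jvp

def f(x):                       # f: R^n -> R, like a loss
    return (x ** 2).sum()

x = torch.randn(10_000)

# Reverse mode: ONE call gives the whole gradient (a length-n vector).
_, grad = vjp(f, x)

# Forward mode: ONE call gives a directional derivative along v; recovering
# the full gradient this way would need n calls (one per basis vector).
v = torch.zeros_like(x)
v[0] = 1.0
_, dfdx0 = jvp(f, x, v)

print(torch.allclose(grad[0], dfdx0))   # True
```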

What backprop does NOT do

  • It is not learning. Backprop computes gradients; SGD / Adam uses them to update parameters.
  • It is not specific to neural networks. Any composition of differentiable ops with a scalar output can be backpropagated through.
  • It does not guarantee convergence. The gradient may point downhill, but optimization may still get stuck.

Common pitfalls

  • Calling loss.backward() twice without retain_graph=True. Backward frees the graph by default; second call fails.
  • Forgetting optimizer.zero_grad(). Gradients accumulate by default; not zeroing means each step uses the sum of all past gradients (unintended, breaks convergence).
  • detach() errors. Tensors .detach()’d from the graph receive no gradient; using them where you wanted gradients to flow silently blocks backprop and leads to subtly wrong training.
  • Memory leaks from holding loss tensors. Keeping references to loss objects keeps the entire computation graph alive; use loss.item() for logging.
  • Confusing requires_grad with is_leaf. Parameters are typically leaves with requires_grad=True; intermediate tensors are non-leaves with requires_grad=True because they depend on params.
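Putting the pitfalls together, a minimal sketch of a training step that avoids them (the model, data, and optimizer are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                                  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for step in range(100):
    x, y = torch.randn(32, 10), torch.randn(32, 1)        # placeholder batch

    optimizer.zero_grad()       # clear gradients accumulated by the last step
    loss = loss_fn(model(x), y)
    loss.backward()             # one backward pass; the graph is freed afterwards
    optimizer.step()            # the optimizer, not backprop, does the learning

    running_loss = loss.item()  # .item() detaches: no graph kept alive for logging
    if step % 20 == 0:
        print(step, running_loss)
```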