Matrix calculus for ML

Gradients, Jacobians, and Hessians for vector- and matrix-valued functions. The minimum needed to derive backprop and second-order methods.

Reviewed · 3 min read

One-line definition

Matrix calculus extends the ordinary derivative to functions whose inputs and/or outputs are vectors or matrices. The two organizing objects are the gradient (for scalar-valued functions) and the Jacobian (for vector-valued functions); the Hessian is the matrix of second partial derivatives.

Why it matters

Every learning algorithm computes derivatives of a scalar loss with respect to vector- or tensor-valued parameters. Knowing the right shapes and conventions saves hours of debugging. Backpropagation is matrix calculus applied recursively through a computation graph.

The four shape combinations

For $f$ mapping an input $x$ to an output $y$:

  • Scalar in, scalar out ($f:\mathbb{R}\to\mathbb{R}$): ordinary derivative $f'(x)$.
  • Vector in, scalar out ($f:\mathbb{R}^n\to\mathbb{R}$): gradient $\nabla f(x)\in\mathbb{R}^n$.
  • Scalar in, vector out ($f:\mathbb{R}\to\mathbb{R}^m$): elementwise derivative, a vector in $\mathbb{R}^m$.
  • Vector in, vector out ($f:\mathbb{R}^n\to\mathbb{R}^m$): Jacobian $J\in\mathbb{R}^{m\times n}$ with $J_{ij}=\partial f_i/\partial x_j$.

Second derivative of a scalar-valued $f$: Hessian $H\in\mathbb{R}^{n\times n}$ with $H_{ij}=\partial^2 f/\partial x_i\,\partial x_j$; for vector-valued $f$ the second derivative is tensor-valued (rarely used).

Two competing conventions exist:

  • Numerator layout (common in physics and calculus textbooks): $\partial y/\partial x$ is laid out to match the numerator, so for a scalar $y$ and a column vector $x\in\mathbb{R}^n$ it is a $1\times n$ row vector.
  • Denominator layout (common in statistics and ML): $\partial y/\partial x$ is laid out to match the denominator, so it is a column vector with the same shape as $x$. Modern ML universally uses gradient = column vector with the same shape as the parameter, i.e. denominator layout.

For a matrix parameter $W\in\mathbb{R}^{m\times n}$, the gradient $\partial L/\partial W$ likewise has the same shape as $W$.
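
As a quick illustration, here is a minimal PyTorch sketch (the loss and the shapes are made up for the example) showing that autograd returns gradients with the parameter's shape:

```python
import torch

# Toy scalar loss L(W, b) = sum((x @ W + b)^2) with a matrix and a vector parameter.
x = torch.randn(4, 3)                      # batch of 4 inputs, 3 features
W = torch.randn(3, 2, requires_grad=True)  # matrix parameter, shape (3, 2)
b = torch.randn(2, requires_grad=True)     # vector parameter, shape (2,)

loss = ((x @ W + b) ** 2).sum()            # scalar loss
loss.backward()

print(W.grad.shape)  # torch.Size([3, 2]) -- same shape as W, not as the scalar loss
print(b.grad.shape)  # torch.Size([2])    -- same shape as b
```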

Identities you actually use

  • $\nabla_x\,(a^\top x) = a$
  • $\nabla_x\,(x^\top A x) = (A + A^\top)\,x$, which equals $2Ax$ when $A$ is symmetric
  • $\nabla_A\,\log\det A = A^{-1}$ (for $A$ symmetric positive definite)
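
These can be checked numerically; the following is a minimal PyTorch sketch (the dimension and the random matrices are arbitrary):

```python
import torch

torch.manual_seed(0)
n = 4
a = torch.randn(n)
A = torch.randn(n, n)
A = A @ A.T + n * torch.eye(n)    # symmetric positive definite

# grad of a^T x is a
x = torch.randn(n, requires_grad=True)
(a @ x).backward()
print(torch.allclose(x.grad, a))                                      # True

# grad of x^T A x is (A + A^T) x = 2 A x for symmetric A
x = torch.randn(n, requires_grad=True)
(x @ A @ x).backward()
print(torch.allclose(x.grad, 2 * A @ x.detach()))                     # True

# grad of log det A is A^{-1} for symmetric PD A
A.requires_grad_(True)
torch.logdet(A).backward()
print(torch.allclose(A.grad, torch.inverse(A.detach()), atol=1e-5))   # True
```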

Chain rule

For a composition $L = f(g(x))$:

  • Scalar → scalar → scalar: $\frac{dL}{dx} = f'(g(x))\,g'(x)$.
  • Vector → vector → scalar: $\nabla_x L = J_g(x)^\top\,\nabla_u f(u)$, where $u = g(x)$ and $J_g\in\mathbb{R}^{m\times n}$ is the Jacobian of $g$.

Backprop is just this chain rule applied recursively through a computation graph, layer by layer, with each vector-Jacobian product computed implicitly so the full Jacobian is never materialized.
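
A minimal PyTorch sketch of the vector → vector → scalar case (the two-stage functions g and f below are made up for illustration), checking the vector-Jacobian form of the chain rule against autograd:

```python
import torch

torch.manual_seed(0)
n, m = 3, 5
W = torch.randn(m, n)

def g(x):                 # vector -> vector
    return torch.tanh(W @ x)

def f(u):                 # vector -> scalar
    return (u ** 2).sum()

x = torch.randn(n, requires_grad=True)
u = g(x)
L = f(u)
L.backward()              # autograd's answer, stored in x.grad

# Chain rule by hand: grad_x L = J_g(x)^T grad_u f(u)
grad_u = 2 * u.detach()                                   # gradient of f at u
J_g = torch.autograd.functional.jacobian(g, x.detach())   # (m, n) Jacobian of g
print(torch.allclose(x.grad, J_g.T @ grad_u))             # True
```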

Hessian and second-order methods

The Hessian $H = \nabla^2 L(\theta)$ describes local curvature. Second-order methods (Newton’s method, K-FAC, Shampoo) use it to scale the gradient by inverse curvature: $\theta \leftarrow \theta - H^{-1}\,\nabla L(\theta)$.
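
For intuition, here is a minimal sketch of a single Newton step on a tiny made-up loss (PyTorch; the problem is small enough that the Hessian can be materialized):

```python
import torch

def loss(theta):
    # toy loss: a convex quadratic plus a small quartic term
    return (theta ** 2).sum() + 0.1 * (theta ** 4).sum()

theta = torch.tensor([1.0, -2.0, 0.5])

grad = torch.autograd.functional.jacobian(loss, theta)   # gradient of the scalar loss, shape (3,)
H = torch.autograd.functional.hessian(loss, theta)       # Hessian, shape (3, 3)

# Newton step: theta <- theta - H^{-1} grad (solve the linear system instead of inverting H)
theta_new = theta - torch.linalg.solve(H, grad)
print(theta_new)
```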

In modern deep learning, the Hessian is too large to materialize ($d \times d$ entries for $d$ parameters). Common approximations:

  • Diagonal: keep only the diagonal entries (RMSProp and Adam’s second-moment estimate approximate this).
  • Block-diagonal: per-layer Fisher information (K-FAC).
  • Hessian-vector products: $Hv$ can be computed via two backward passes without ever forming $H$ (see the sketch after this list). Used in conjugate gradient and influence functions.
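
A minimal PyTorch sketch of a Hessian-vector product via double backward (the toy loss and the direction v are made up):

```python
import torch

def loss(theta):
    return (theta ** 2).sum() + (theta[0] * theta[1]) ** 2   # toy non-quadratic loss

theta = torch.randn(3, requires_grad=True)
v = torch.randn(3)                                           # direction to multiply by

# First backward pass: gradient with create_graph=True so it can be differentiated again.
(grad,) = torch.autograd.grad(loss(theta), theta, create_graph=True)

# Second backward pass: differentiating grad . v gives H v.
(hvp,) = torch.autograd.grad(grad @ v, theta)

# Check against the explicit Hessian (only feasible because theta is tiny).
H = torch.autograd.functional.hessian(loss, theta.detach())
print(torch.allclose(hvp, H @ v, atol=1e-5))                 # True
```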

Common pitfalls

  • Mismatched layout conventions. Always check whether your reference uses numerator or denominator layout; the difference is a transpose.
  • Treating gradients as having the same shape as the loss. They have the shape of the parameter, not the loss.
  • Computing Jacobians explicitly. For a vector-to-vector function $f:\mathbb{R}^n\to\mathbb{R}^m$ with both $n$ and $m$ large, the Jacobian has $m\times n$ entries. Use vector-Jacobian products via torch.autograd.grad or jax.vjp instead.
  • Forgetting bias dimensions. For a bias $b$ broadcast across a batch, $\partial L/\partial b$ has the shape of $b$: the per-example contributions are summed over the batch dimension (see the sketch below).
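
A minimal PyTorch check of the last point (the linear layer is made up): the bias gradient keeps the shape of $b$ and sums per-example contributions over the batch dimension.

```python
import torch

torch.manual_seed(0)
x = torch.randn(8, 3)                       # batch of 8 examples
W = torch.randn(3, 2, requires_grad=True)
b = torch.zeros(2, requires_grad=True)

y = x @ W + b                               # b is broadcast over the batch dimension
loss = (y ** 2).sum()
loss.backward()

# dL/dy has shape (8, 2); the bias gradient sums it over the batch dimension.
dL_dy = 2 * (x @ W.detach() + b.detach())
print(b.grad.shape)                                      # torch.Size([2]) -- shape of b, not of y
print(torch.allclose(b.grad, dL_dy.sum(dim=0)))          # True
```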